EDN Admin
Well-known member
I was tracking down a problem this morning that was pegging the CPUs on several of our 12-core production servers. It came down to a Regex expression. I've seen the posts here about exponential Regex expressions, but this seems a little different.
The string being searched is quite short (short enough that even most exponential algorithms would finish pretty quickly), and if we tweak the expression to look for the end of the string as well as the end of the sequence we're replacing, the same operation finishes nearly instantly. That seems like a bug to me. Shouldn't Regex ALWAYS stop looking at the end of the string?
Here is the code:
string html = "<!--BEGIN QUALTRICS POLL--> <script type=text/javascript> var q_poll_f = function(){var s=document";
string embeddedScriptComments = @"(/\*.*?\*/|//.*?[\n\r])";
string scriptPattern = String.Format(@"(?<script><[ \n\r]*script[^>]*>(.*?{0}?)*<[ \n\r]*/script[^>]*>)", embeddedScriptComments);
//string scriptPattern = String.Format(@"(?<script><[ \n\r]*script[^>]*>(.*?{0}?)*(<[ \n\r]*/script[^>]*>|$))", embeddedScriptComments);
// the pattern includes the comment and script sub-patterns
string pattern = String.Format(@"(?s)({0})", scriptPattern);
Regex re = new Regex(pattern, RegexOptions.IgnoreCase);
// remove all comments and scripts from the page...
html = re.Replace(html, "");
Debug.WriteLine(html);
The commented-out line is our fix, which makes it run instantly. It just adds end-of-string as an alternative condition for ending the matched sequence.
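For anyone who wants to reproduce the difference without pegging a core, here is a minimal sketch that runs both variants under a match timeout. It assumes .NET 4.5 or later (for the Regex constructor overload that takes a TimeSpan) and assumes the backslash escapes in the patterns are as written here (\*, \n, \r), since the forum formatting may have dropped them. The class name Repro and the 2-second limit are arbitrary choices for illustration. One plausible reading of the behavior is that the nested (.*?{0}?)* group can split the remaining text in a very large number of ways, and the engine tries them all when no closing script tag exists, whereas the "|$" alternative lets the first attempt that reaches the end succeed.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class Repro
{
    static void Main()
    {
        // Same input as above: a script tag that is opened but never closed.
        string html = "<!--BEGIN QUALTRICS POLL--> <script type=text/javascript> var q_poll_f = function(){var s=document";
        string comments = @"(/\*.*?\*/|//.*?[\n\r])";

        // Variant that requires a closing script tag.
        string strict = String.Format(@"(?s)((?<script><[ \n\r]*script[^>]*>(.*?{0}?)*<[ \n\r]*/script[^>]*>))", comments);
        // Variant that also accepts end-of-string as a terminator.
        string relaxed = String.Format(@"(?s)((?<script><[ \n\r]*script[^>]*>(.*?{0}?)*(<[ \n\r]*/script[^>]*>|$)))", comments);

        foreach (string pattern in new[] { relaxed, strict })
        {
            // The timeout turns a runaway match into an exception instead of a pegged CPU.
            var re = new Regex(pattern, RegexOptions.IgnoreCase, TimeSpan.FromSeconds(2));
            var sw = Stopwatch.StartNew();
            try
            {
                string result = re.Replace(html, "");
                Console.WriteLine("finished in {0} ms: {1}", sw.ElapsedMilliseconds, result);
            }
            catch (RegexMatchTimeoutException)
            {
                Console.WriteLine("timed out after {0} ms", sw.ElapsedMilliseconds);
            }
        }
    }
}
```

If the strict variant times out while the relaxed one finishes, that would match what we saw in production. Setting a match timeout on any Regex that runs against untrusted page content seems like a reasonable safeguard either way.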
Thanks,
-James