Web Images Videos Maps News Groups Gmail more »
Recently Visited Groups | Help | Sign in
Google Groups Home
regex high cpu utilization
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  2 messages - Collapse all  -  Translate all to Translated (View all originals)
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
rh  
View profile  
 More options May 19 2006, 6:32 am
Newsgroups: microsoft.public.dotnet.framework, microsoft.public.dotnet.languages.csharp, microsoft.public.dotnet.general
From: "rh" <rhashem...@hotmail.com>
Date: 18 May 2006 13:32:24 -0700
Local: Fri, May 19 2006 6:32 am
Subject: regex high cpu utilization
hi all,
take the following 2 c# lines:
1) str = Regex.Replace(str, ".*AAA", "");
2) str = Regex.Replace(str, "^.*AAA", "");

notice that the only difference is that the pattern in line 2 has a
starter
marker (^). if str is large and does not contain the pattern, line 1
takes much much longer to complete than line 2 and cpu goes nearly
full throttle.

i have reproduced this behavior in .net 1.1 and 2.0. but jscript does
not exhibit this.

is this expected behavior? both lines do the exact same operation,
expect for line 2 we're explicitly instruction regex to start at the
beginning.
i'm baffled.

thanks,
rh


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Barry Kelly  
View profile  
 More options May 19 2006, 7:05 am
Newsgroups: microsoft.public.dotnet.framework, microsoft.public.dotnet.languages.csharp, microsoft.public.dotnet.general
From: Barry Kelly <barry.j.ke...@gmail.com>
Date: Thu, 18 May 2006 22:05:21 +0100
Local: Fri, May 19 2006 7:05 am
Subject: Re: regex high cpu utilization

This is one possible expected behaviour under the current
implementation, which does some kinds of optimizations (such as
compiling to a dynamic assembly in memory if you wish) but not others
(as you have seen). Here's the way each one works in a string of length
n, assuming no optimization of any kind:

1)
Start at first character,
     match next character * n times
     end of string found, so back-track 1
     try to match A, then end of string found | backtrack to .*
     backtrack 1, try to match AA, then end of string | backtrack to .*
     backtrack 1, try to match AAA, if found then success
     not found, so backtrack the .* matching
Start at second character,
     match next character * n-1 times
     end of string found, so back-track 1
     try to match A, then end of string found
     backtrack 1, try to match AA, then end of string
     backtrack 1, try to match AA, then end of string
     not found, so backtrack the .* matching
Start at third character, ...

As you can see, this will take time proportional to n*n, or quadratic
time.

2)
Start at first character,
     match next character * n times
     end of string found, so back-track 1
     try to match A, then end of string found
     backtrack 1, try to match AA, then end of string
     backtrack 1, try to match AA, then end of string
     not found, so *NO MATCH*

This one can exit early because the '^' forces the expression to only
match the start of the string. It takes time proportional to n, or
linear time.

Now, semantically, one can see that '.*AAA' is simply the start of the
string up to AAA, and the whole string could be found quickly by just
searching for AAA. Since the '*' operator in .NET Regex is greedy, the
'^' is implied.

I don't know if the JScript test you ran has the same semantics, but it
appears it certainly has a different implementation with some more
optimizations. To make it quicker, you can use '^' yourself.

(I think this is a .net framework issue, not a general issue or a C#
issue, so I suppose it should have been posted to that newsgroup only.)

-- Barry

--
http://barrkel.blogspot.com/


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages
« Back to Discussions « Newer topic     Older topic »

Create a group - Google Groups - Google Home - Terms of Service - Privacy Policy
©2009 Google