Why does this Pattern Fail

Let me throw this one out to the group, for this one has me stumped. With the Multiline option, the pattern

^((?:\s*)(?<Token>[^\s]+))+

will return data in a speedy fashion.

But if I add the $

^((?:\s*)(?<Token>[^\s]+))+$

with the Multiline option, it runs and runs in Expresso. Why? I would expect the $ to anchor it and safely return...

Also if the first (?:\s*) is changed to (?:\s+) the pattern immediately returns but with nothing found. Hmmmm

Test Data:

#606 20:00:00
UPDT000
#607 20:00:00
UPDT200 ALL
#608 20:00:00
UPDT020 HD0
#609 20:00:01

OPM034 MADN PCVY WED 12/08/09 20:00:00 HRHR
MADN ORIG TERM STRM PANS SANS
UPDT011 HD0
#610 20:00:52
UPDT700 CHECKING PATCHES ON HD0 20:00:52
UPDT701 HD0 FORMAT PATCH LEVEL 0025 20:01:13
UPDT020 HD1
#611 20:01:13
CLI PCVY WED 20:01:58 12/08/09
CLED DN 277 5493
CLNG DN 636 222 1011 TG 40 PCVY CE 1 1 04 01 06
#612
UPDT011 HD1
#613 20:02:08
UPDT700 CHECKING PATCHES ON HD1 20:02:08
UPDT701 HD1 FORMAT PATCH LEVEL 0025 20:02:27
UPDT020 MO0
#614 20:02:29
UPDT011 MO0
#615 20:03:37
UPDT700 CHECKING PATCHES ON MO0 20:03:37
UPDT701 MO0 FORMAT PATCH LEVEL 0025 20:03:57
UPDT710 CHECKING PATCHES IN MEM 20:03:57
UPDT711 MEM AT PATCH LEVEL 0025 20:04:01
UPDT720 CHECKING RFILES ON DISK 20:04:01
UPDT721 RFILES ARE SYNCHRONIZED 20:04:30
UPDT001
#616 20:04:30
UPDT007 20:04:30




William Wegerson (www.OmegaCoder.Com)
OmegaMan
A couple of observations. When I plug this into C#, the behavior differs from what you describe. Yes, the first pattern does as you say, and adding the $ in the second does appear to hang up (run for a very long time). But, changing (?:\s*) to (?:\s+) does not return nothing for me. It returns a match that instead of starting at character index 0, begins at character index 16. This is a clue.

I normally think of ^ and $ as anchoring the start and end of a string. But in multiline mode, ^ anchors the beginning of a line not the string. And $ anchors the end of a line, not of the entire string.
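
A minimal sketch of the anchor behavior described above, written with top-level statements (C# 9 or later). The sample strings are made up for illustration and are not the thread's test data. It also shows one detail that may matter here: in .NET, $ in Multiline mode matches before \n but not before \r, so text with Windows line endings behaves differently.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical two-line sample joined by a bare "\n".
string text = "first\nsecond";

// Without Multiline, ^ and $ anchor the whole string, so a one-word pattern fails.
int wholeString = Regex.Matches(text, @"^\w+$").Count;

// With Multiline, ^ and $ anchor each line, so each word matches on its own.
int perLine = Regex.Matches(text, @"^\w+$", RegexOptions.Multiline).Count;

// With "\r\n" line endings, $ does not match before the "\r",
// so only the last line (which ends at the end of the string) matches.
int crlf = Regex.Matches("first\r\nsecond", @"^\w+$", RegexOptions.Multiline).Count;

Console.WriteLine($"default: {wholeString}, multiline: {perLine}, crlf: {crlf}");
// prints "default: 0, multiline: 2, crlf: 1"
```

If the test data was pasted with \r\n line endings, a pattern ending in $ may need \r?$ to match at the end of each line.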

So for each character match after the first line, it backtracks to the beginning and expands the match. It will just take exponentially longer with each additional line you add.

The greediness eats up the lines, but if you went to non-greedy, if it solved the hang problem, it would do so at the cost of creating many many small matches.
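
One way to see where the extra work can come from: because (?:\s*) may match nothing, the repeated group is free to split a single token across several iterations. The sketch below (top-level statements, hypothetical input "abc") uses a lazy variant of the token, which is my own change for illustration, to surface a split that the greedy original would only reach while backtracking.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// The thread's pattern, and a lazy variant ([^\s]+?) added here for illustration.
string greedyPattern = @"^((?:\s*)(?<Token>[^\s]+))+$";
string lazyPattern = @"^((?:\s*)(?<Token>[^\s]+?))+$";

// Each capture of Token is one pass through the repeated group.
Match greedy = Regex.Match("abc", greedyPattern, RegexOptions.Multiline);
Match lazy = Regex.Match("abc", lazyPattern, RegexOptions.Multiline);

int greedyPieces = greedy.Groups["Token"].Captures.Count;
string lazyPieces = string.Join("|",
    lazy.Groups["Token"].Captures.Cast<Capture>().Select(c => c.Value));

Console.WriteLine(greedyPieces); // prints "1" (the whole token in one pass)
Console.WriteLine(lazyPieces);   // prints "a|b|c" (the same token in three passes)
```

A token of n characters can be cut up in 2^(n-1) ways like this. When the overall match cannot succeed (for example, if no position after a token satisfies $), a backtracking engine may try all of them before giving up, which would fit the long run time. Whether that is exactly what happened in Expresso depends on the line endings and trailing whitespace in the data, which I am only guessing at.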

If you would like each line to be its own match, try this pattern...

string pattern = @"^((?:[\s-[\n\r]]*)(?<Token>[^\s]+))+";
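
A small sketch of how this pattern behaves, using two lines in the style of the test data joined by a bare "\n" (my assumption, the real data may use "\r\n"). It is written with top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the post: the whitespace between tokens may not include \n or \r.
string pattern = @"^((?:[\s-[\n\r]]*)(?<Token>[^\s]+))+";

// Two sample lines in the style of the thread's test data.
string text = "#606 20:00:00\nUPDT200 ALL";

MatchCollection matches = Regex.Matches(text, pattern, RegexOptions.Multiline);

foreach (Match m in matches)
{
    // Captures holds every token the repeated group picked up on this line.
    Console.WriteLine($"{m.Value} ({m.Groups["Token"].Captures.Count} tokens)");
}
// prints:
// #606 20:00:00 (2 tokens)
// UPDT200 ALL (2 tokens)
```

Because the separator can no longer cross a line break, each match stops at the end of its own line.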


Les Potter, Xalnix Corporation, Yet Another C# Blog
xalnix
Ok....\s includes newline characters....that is one clue.
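
A quick check of that clue, as a sketch with top-level statements and a shortened two-line sample (joined by "\n", my assumption): since \s matches the line break, the first pattern runs straight across it and returns both lines as one match.

```csharp
using System;
using System.Text.RegularExpressions;

// The original pattern without the trailing $.
string pattern = @"^((?:\s*)(?<Token>[^\s]+))+";

// Two sample lines from the test data, joined by "\n".
string text = "#606 20:00:00\nUPDT000";

Match m = Regex.Match(text, pattern, RegexOptions.Multiline);

// \s* consumes the "\n", so the match does not stop at the end of line one.
bool spansLines = m.Value.Contains("\n");
int tokenCount = m.Groups["Token"].Captures.Count;

Console.WriteLine($"spans lines: {spansLines}, tokens: {tokenCount}");
// prints "spans lines: True, tokens: 3"
```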
William Wegerson (www.OmegaCoder.Com)
OmegaMan
Very interesting, try this

^((?:\s*)(?<Token>[^\s]+))+\b

Also, remove the multiline option, highlight just the $ or all the pattern and then click on Validate.
John Grove - TFD Group, Senior Software Engineer, EI Division, http://www.tfdg.com
  • Edited by JohnGrove Saturday, August 15, 2009 3:38 PM
JohnGrove
When I validate the pattern, all is fine; when I merely run it, it hangs.
John Grove - TFD Group, Senior Software Engineer, EI Division, http://www.tfdg.com
JohnGrove
Clever Les,
You are employing your newfound knowledge

[\s-[\n\r]]
John Grove - TFD Group, Senior Software Engineer, EI Division, http://www.tfdg.com
JohnGrove
Thanks guys...been too busy to dedicate time to the forums. As always thanks for the insights.
William Wegerson (www.OmegaCoder.Com)
OmegaMan
So for each character match after the first line, it backtracks to the beginning and expands the match. It will just take exponentially longer with each additional line you add.
backtracks to the beginning... of the current line or the very first match?
William Wegerson (www.OmegaCoder.Com)
OmegaMan
Thanks for the follow-up, William. I would guess the very beginning, hence the time it takes.
John Grove - TFD Group, Senior Software Engineer, EI Division, http://www.tfdg.com
JohnGrove
Sorry, I have something to understand.

OmegaMan,
1 What is this whole pattern meant to catch in the sample data? Each line with text?
2 What's the use of (?:\s*)? Why not just \s*? I can't see a reference to it in the pattern.

Xalnix,
3 What's the meaning of [\s-[\n\r]]? All \s except \r and \n? Is this a new feature of C# regex?
I have never seen a subtraction operation on sets in other regex flavors; that's wonderful.
But if so, why not just write [ \t]? When I test with just ^( +[^\s]+)+$, each line is caught.
www.wonderstudio.cn
Eping Wang
(?:\s*) means match the expression, but do not capture it.

[\s-[\n\r]] (all \s except \r and \n) is not new, but it certainly shows the power of regex. See here and see Les's example. It is known as character class subtraction.

//Les example
[\d-[57]] matches all digits except for 5 and 7.
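
A short sketch of both subtraction classes mentioned in this thread, with made-up inputs and top-level statements:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// [\d-[57]]: any digit except 5 and 7.
string keptDigits = string.Concat(
    Regex.Matches("0123456789", @"[\d-[57]]").Cast<Match>().Select(m => m.Value));

// [\s-[\n\r]]: any whitespace except the two line-break characters.
string noBreaks = @"[\s-[\n\r]]";
bool matchesSpace = Regex.IsMatch(" ", noBreaks);
bool matchesTab = Regex.IsMatch("\t", noBreaks);
bool matchesNewline = Regex.IsMatch("\n", noBreaks);

Console.WriteLine(keptDigits); // prints "01234689"
Console.WriteLine($"{matchesSpace} {matchesTab} {matchesNewline}"); // prints "True True False"
```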




John Grove - TFD Group, Senior Software Engineer, EI Division, http://www.tfdg.com
JohnGrove

John, thanks.
But
1 What's the difference between \s* and (?:\s*) in this pattern? I know it's a non-capturing group, but why group here?

2 Is -[ ] a feature of .NET Regex? This operation is not supported in several other regex engines I've tested,
and I can't find a description of it in several regex books.

So can all patterns like [a-zA-Z0-9] be written as [\w-[_]]?


www.wonderstudio.cn
Eping Wang
Hi Eping,
The reason William probably did what he did on #1 is that he didn't want to "capture" the \s. I usually do this for subpatterns that are part of the search for my match but that I don't want grouped as a capture. It merely assists in finding the match. Actually, Eping, as simple as that sounds, I had difficulty with that concept before I understood it.

#2.
Not a lot of documentation exists that I could find, maybe Les can chime in since he introduced me to it. But, at the very least, you understand it.
John Grove - TFD Group, Senior Software Engineer, EI Division, http://www.tfdg.com
JohnGrove
(?: ) is the same as ( ) but it will not be a part of the final capture. Meaning if I do this on a 10-digit number with dashes (shaped like a US phone number) such as

123-456-7890

I only care about the 10 digits and not the dashes so

(?<First>\d\d\d)(?:-)(?<Second>\d\d\d)(?:-)(?<Third>\d\d\d\d)

so my match at [0] is

123-456-7890

but my match captures are

"First" or [1] = 123
"Second" or [2] = 456
"Third" or [3] = 7890

Therefore I get the whole item as a match, but when I need to drill down into the match to extract individual components, I can. Think of (?: ) as an anchor to your match that has nothing to do with the final data.
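
To make the difference concrete, here is a sketch (top-level statements) that runs three variants of the pattern against the same input. The third variant, with a capturing (-), is added only for comparison and does not appear elsewhere in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "123-456-7890";

// The dashes in a non-capturing group, as a plain literal, and in a capturing group.
string nonCapturing = @"(?<First>\d\d\d)(?:-)(?<Second>\d\d\d)(?:-)(?<Third>\d\d\d\d)";
string plainDash = @"(?<First>\d\d\d)-(?<Second>\d\d\d)-(?<Third>\d\d\d\d)";
string capturingDash = @"(?<First>\d\d\d)(-)(?<Second>\d\d\d)(-)(?<Third>\d\d\d\d)";

Match a = Regex.Match(input, nonCapturing);
Match b = Regex.Match(input, plainDash);
Match c = Regex.Match(input, capturingDash);

// Groups.Count includes group 0 (the whole match).
Console.WriteLine($"{a.Groups.Count} {b.Groups.Count} {c.Groups.Count}"); // prints "4 4 6"
Console.WriteLine($"{a.Value} {a.Groups["Third"].Value}");                // prints "123-456-7890 7890"
```

For a single literal like the dash, (?:-) and a bare - give the same groups. The non-capturing form earns its keep when the group is needed for alternation or a quantifier, or when a plain ( ) would otherwise add unwanted numbered groups.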
William Wegerson (www.OmegaCoder.Com)
OmegaMan

Thanks.

But, OmegaMan, for your new example,
if I write the pattern as

(?<First>\d\d\d)-(?<Second>\d\d\d)-(?<Third>\d\d\d\d)

what's the difference? I think the result will be the same as yours.
I just can't understand why the - is grouped here.


www.wonderstudio.cn
Eping Wang
So for each character match after the first line, it backtracks to the beginning and expands the match. It will just take exponentially longer with each additional line you add.
backtracks to the beginning... of the current line or the very first match?
William Wegerson (www.OmegaCoder.Com)

backtracks to the beginning of the very first match.
Les Potter, Xalnix Corporation, Yet Another C# Blog
xalnix
Xalnix,
3 What's the meaning of [\s-[\n\r]]? All \s except \r and \n? Is this a new feature of C# regex?
I have never seen a subtraction operation on sets in other regex flavors; that's wonderful.
But if so, why not just write [ \t]? When I test with just ^( +[^\s]+)+$, each line is caught.
www.wonderstudio.cn

Eping,
I wrote [\s-[\n\r]] more as an example. I think it's most useful with \w as in [\w-[_]]. When working with International character sets, this means a lot more than [a-zA-Z0-9]. Similarly, when you turn on the IgnorePatternWhitespace option, [ \t] is not the same as [\s-[\n\r]].
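
A sketch of the international-character point, using an accented letter as a made-up example (top-level statements):

```csharp
using System;
using System.Text.RegularExpressions;

// U+00E9 is a lowercase e with an acute accent, written as an escape to avoid encoding issues.
string accented = "\u00e9";

// The ASCII ranges do not cover it, but \w (Unicode-aware by default in .NET) does.
bool asciiClass = Regex.IsMatch(accented, @"^[a-zA-Z0-9]$");
bool wordMinusUnderscore = Regex.IsMatch(accented, @"^[\w-[_]]$");

// The subtraction removes the underscore that \w would otherwise accept.
bool underscore = Regex.IsMatch("_", @"^[\w-[_]]$");

Console.WriteLine($"{asciiClass} {wordMinusUnderscore} {underscore}");
// prints "False True False"
```

So [\w-[_]] is broader than [a-zA-Z0-9] rather than a drop-in replacement. In .NET, \w also covers connector punctuation other than the underscore, so it is not strictly "letters and digits" either.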

I don't know if this construction is valid in any other Regex besides the .NET Regex. I seldom do any Regex work outside of .NET or ASP.NET.
Les Potter, Xalnix Corporation, Yet Another C# Blog
xalnix
Thank you very much, Xalnix. You've given me a precise answer.

In other regex flavors, this construction is usually read as [\w\[\]_-].
I also have an idea: it would be simpler if the pattern could be written as [\w^_].
If we want just the symbol ^, we could write [\w\^_] to avoid confusion.
www.wonderstudio.cn
Eping Wang
Similarly, when you turn on the IgnorePatternWhitespace option, [ \t] is not the same as [\s-[\n\r]].

Les, the IgnorePatternWhitespace option only applies to the parser reading the pattern, not to how it processes whitespace within the target text. To quote (RegexOptions Enumeration) (bolding is mine):

Eliminates unescaped white space from the pattern and enables comments marked with #. However, the IgnorePatternWhitespace value does not affect or eliminate white space in character classes.

Am I missing something?


William Wegerson (www.OmegaCoder.Com)
OmegaMan
Yep! I stand corrected. The space in the pattern [ \t] would NOT be ignored when IgnorePatternWhitespace is used.
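
For anyone who wants to confirm this, a small sketch with made-up inputs (top-level statements):

```csharp
using System;
using System.Text.RegularExpressions;

RegexOptions opts = RegexOptions.IgnorePatternWhitespace;

// Inside a character class the space is kept, so [ \t] still matches a space.
bool spaceInClass = Regex.IsMatch(" ", @"[ \t]", opts);

// Outside a class the unescaped space is dropped, so "a b" is read as "ab".
bool matchesWithSpace = Regex.IsMatch("a b", @"a b", opts);
bool matchesWithoutSpace = Regex.IsMatch("ab", @"a b", opts);

Console.WriteLine($"{spaceInClass} {matchesWithSpace} {matchesWithoutSpace}");
// prints "True False True"
```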

(so many rules, so little brain capacity, sorry for the bad info)
Les Potter, Xalnix Corporation, Yet Another C# Blog
xalnix
