I need to parse a string of data that will appear in the following format: <container id> (/<item number> <quantity>)*
The string will always begin with a 5-digit container id followed by repeating pairs of item number (5 characters) and quantity (3 digits), each pair preceded by a forward slash delimiter. I have tried the following regex pattern:
^(?<container>\d{5})(?:.*/(?<itemNumber>\w{5})(?<quantity>\d{3}))+
When I run it through RegEx, I only get the last item/qty pair in the string. However, if I change the trailing plus sign to {3}, thereby limiting the repetition to only three pairs, it works fine and I get three captures for the "itemNumber" and "quantity" groups.
What is the difference? Why will it work with an explicit limiter but not when I use a plus sign? It is not possible for me to hard-code the limiter since the number of item/qty pairs will never be fixed.
| | SonOfPirate |
Please provide samples of your data so we can investigate a solution.
John Grove - TFD Group, Senior Software Engineer, EI Division, http://www.tfdg.com
| | JohnGrove |
Here is a simple example:
12345 /ABCDE123 /FGHIJ456 /KLMNO789
I would expect match.Groups to return 3 items:
- The first would be named "container" and contain a single Capture with a value of "12345".
- The second would be named "itemNumber" and contain three (3) items in its Captures collection: "ABCDE", "FGHIJ" and "KLMNO"
- The third would be named "quantity" and also contain three (3) items in its Captures collection: "123", "456" and "789"
This works when I use the explicit limiter, {3}, but only returns a single Capture for each group ("12345", "KLMNO" and "789") when I use a plus sign or asterisk for the repetition.
One other thing I noticed after testing further: it appears that everything works perfectly when the ".*/" (for the delimiter) is REMOVED!
| | SonOfPirate |
Perhaps something like this:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            String example = "12345 /ABCDE123 /FGHIJ456 /KLMNO789";
            // Line breaks and indentation inside the pattern are ignored (IgnorePatternWhitespace).
            String pattern = @"(?<Container>\d{5})\s\/
                               (?<ItemNumber1>\w{5})(?<Quantity1>\d{3})\s\/
                               (?<ItemNumber2>\w{5})(?<Quantity2>\d{3})\s\/
                               (?<ItemNumber3>\w{5})(?<Quantity3>\d{3})";
            Regex rx = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
            Match m = rx.Match(example);
            if (m.Success)
            {
                Console.WriteLine("Container: {0}", m.Groups["Container"].Value);
                Console.WriteLine("");
                Console.WriteLine("ItemNumber1: {0}", m.Groups["ItemNumber1"].Value);
                Console.WriteLine("Quantity1: {0}", m.Groups["Quantity1"].Value);
                Console.WriteLine("");
                Console.WriteLine("ItemNumber2: {0}", m.Groups["ItemNumber2"].Value);
                Console.WriteLine("Quantity2: {0}", m.Groups["Quantity2"].Value);
                Console.WriteLine("");
                Console.WriteLine("ItemNumber3: {0}", m.Groups["ItemNumber3"].Value);
                Console.WriteLine("Quantity3: {0}", m.Groups["Quantity3"].Value);
            }
            Console.ReadLine();
        }
    }
}
| | JohnGrove |
//Output
Container: 12345

ItemNumber1: ABCDE
Quantity1: 123

ItemNumber2: FGHIJ
Quantity2: 456

ItemNumber3: KLMNO
Quantity3: 789
| | JohnGrove |
Again, the pair (itemNumber/quantity) can repeat 1-n times. I cannot use an explicit limiter whether it is in the form "{3}" or coded as you have in the example. I could have 50 occurrences in the data string I am matching and then 22 in the next string.
| | SonOfPirate |
Then change this like so:
Match m = rx.Match(example);
while (m.Success)
{
    Console.WriteLine("Container: {0}", m.Groups["Container"].Value);
    Console.WriteLine("");
    Console.WriteLine("ItemNumber1: {0}", m.Groups["ItemNumber1"].Value);
    Console.WriteLine("Quantity1: {0}", m.Groups["Quantity1"].Value);
    Console.WriteLine("");
    Console.WriteLine("ItemNumber2: {0}", m.Groups["ItemNumber2"].Value);
    Console.WriteLine("Quantity2: {0}", m.Groups["Quantity2"].Value);
    Console.WriteLine("");
    Console.WriteLine("ItemNumber3: {0}", m.Groups["ItemNumber3"].Value);
    Console.WriteLine("Quantity3: {0}", m.Groups["Quantity3"].Value);
    Console.WriteLine("");
    m = m.NextMatch();
}
Console.ReadLine();
If I misunderstood you, provide about 4 or 5 samples.
| | JohnGrove |
Yea, obviously there is a misunderstanding. I am not concerned about the Groups - I am trying to work with multiple Captures for the groups.
Let's go with a simpler example. I have a data pattern such as "product, serial, serial, serial, serial, serial, ..." where the string will always begin with a 6 digit product number followed by 1 or more 8-character serial numbers and is comma-delimited (whitespace before the delimiter is ignored). I can have just one serial number or I can have 100 serial numbers in the string. I have no idea.
I want to capture the product number and each serial number that appears in the string. I've tried this regular expression pattern:
"^(?<product>\d{6})(?:\s*,(?<serial>\w{8}))+"
Using the following code:
Match m = regex.Match(input);
if (m.Success)
{
    string[] groupNames = regex.GetGroupNames();
    Capture capture;
    Group group;
    StringBuilder sb = new StringBuilder();
    foreach (string groupName in groupNames)
    {
        // There is always a group "0" at index 0 that contains the full string - ignore!
        if (groupName != "0")
        {
            group = m.Groups[groupName];
            // Walk every capture the group recorded, not just the last one.
            for (int i = 0; i < group.Captures.Count; i++)
            {
                capture = group.Captures[i];
                sb.AppendFormat("'{0}': {1}", groupName, capture.Value);
                sb.AppendLine();
            }
        }
    }
    Console.Write(sb.ToString());
}
Given the data:
123456 ,A1B2C3D4 ,E5F6G7H8 ,I9J0K1L2 ,M3N4O5P6
I would expect the output to be:
'product': 123456
'serial': A1B2C3D4
'serial': E5F6G7H8
'serial': I9J0K1L2
'serial': M3N4O5P6
This is not the case when I use the plus sign as the repeater. Instead I get:
'product': 123456
'serial': M3N4O5P6
However, if I change the pattern to:
"^(?<product>\d{6})(?:\s*,(?<serial>\w{8})){4}"
I get the expected/desired output. Except I can't use this pattern because the number of repetitions is 1-n. Oh, and using {1,} behaves the same as the plus sign - which was expected.
So, my question is how to set up the pattern so I can support 1-n repetitions of the serial number group, with the preceding comma delimiter, so that the named group will return multiple captures when they occur in the input string?
| | SonOfPirate |
I tested your code and didn't have any trouble.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            String test = "123456 ,A1B2C3D4 ,E5F6G7H8 ,I9J0K1L2 ,M3N4O5P6";
            Regex rx = new Regex(@"^(?<product>\d{6})(?:\s*,(?<serial>\w{8}))+");
            Match m = rx.Match(test);
            if (m.Success)
            {
                Capture capture;
                Group group;
                StringBuilder sb = new StringBuilder();
                string[] groupNames = rx.GetGroupNames();
                foreach (string groupName in groupNames)
                {
                    // There is always a group "0" at index 0 that contains the full string - ignore!
                    if (groupName != "0")
                    {
                        group = m.Groups[groupName];
                        for (int i = 0; i < group.Captures.Count; i++)
                        {
                            capture = group.Captures[i];
                            sb.AppendFormat("'{0}': {1}", groupName, capture.Value);
                            sb.AppendLine();
                        }
                    }
                }
                // Print the collected captures.
                Console.Write(sb.ToString());
            }
        }
    }
}
//Output
'product': 123456
'serial': A1B2C3D4
'serial': E5F6G7H8
'serial': I9J0K1L2
'serial': M3N4O5P6
| | JohnGrove |
Ah, I figured out the difference! In my testing, instead of "\s*" before the comma delimiter, I was using ".*". Apparently this is what's causing the problem. If I take the exact code you have above and use the period, I get only the last serial - just as I was describing.
Any idea why the period in this place would make such a difference?
| | SonOfPirate |
SonOfPirate, getting at embedded groups within captures is not intuitively obvious. But here is an example that should handle everything you throw at it. Notice how the individually named groups are accessed to get to their captured values...
string[] tests = {
    "12345 /abcde123 /fghij456 /klmno789",
    "22222 /bbbbb222 /ccccc333",
    "44444 /ddddd444 /eeeee555 /fffff666 /ggggg777",
};
string pattern = @"(?<container>\d{5})(?<serial>[^/]*/(?<itemNumber>\w{5})(?<quantity>\d{3}))+";
Console.WriteLine(" Cont Item Qnty");
foreach (string test in tests)
{
    foreach (Match mx in Regex.Matches(test, pattern))
    {
        // "serial" captures once per repetition, so its count drives the loop.
        for (int idx = 0; idx < mx.Groups["serial"].Captures.Count; ++idx)
        {
            Console.WriteLine("{0}: {1}: {2}",
                mx.Groups["container"].Value,
                mx.Groups["itemNumber"].Captures[idx].Value,
                mx.Groups["quantity"].Captures[idx].Value);
        }
    }
}
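As a possible follow-on to the loop above, here is a minimal sketch of collecting the aligned captures into a list of pairs. The names `ContainerParser` and `ParseItems` are hypothetical and not from this thread. It assumes, as in the pattern above, that both named groups are mandatory in every repetition, so the two Captures collections have the same length and line up by index.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ContainerParser
{
    // Hypothetical helper: returns (itemNumber, quantity) pairs from one input string.
    public static List<KeyValuePair<string, int>> ParseItems(string input)
    {
        var items = new List<KeyValuePair<string, int>>();
        Match m = Regex.Match(input,
            @"^(?<container>\d{5})(?:[^/]*/(?<itemNumber>\w{5})(?<quantity>\d{3}))+");
        if (!m.Success)
        {
            return items; // no match: return an empty list
        }

        // Both groups capture exactly once per repetition, so index i in one
        // collection corresponds to index i in the other.
        CaptureCollection numbers = m.Groups["itemNumber"].Captures;
        CaptureCollection quantities = m.Groups["quantity"].Captures;
        for (int i = 0; i < numbers.Count; i++)
        {
            items.Add(new KeyValuePair<string, int>(
                numbers[i].Value, int.Parse(quantities[i].Value)));
        }
        return items;
    }

    public static void Main()
    {
        foreach (var item in ParseItems("12345 /ABCDE123 /FGHIJ456 /KLMNO789"))
        {
            Console.WriteLine("{0} x {1}", item.Key, item.Value);
        }
    }
}
```

If either group were optional inside the repetition, the two collections could differ in length and this index pairing would no longer be safe.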
Les Potter, Xalnix Corporation, Yet Another C# Blog
| | xalnix |
Ah, I like the negation preceding the delimiter. That will allow us to ignore any characters rather than just whitespace which is what I was after using the period.
What is so different, though, about the period that it causes the problems I was experiencing yet these other constructs work just fine?
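| | SonOfPirate |
The .* in .*/ is greedy. It matches every character (except \n), including /, and then backtracks only as far as it must. So if you have x/xxxxxxx/xxxxxx/ where x is any character except /, then .*/ will grab everything up to and including the last slash. With +, that first repetition already yields a valid match, so the earlier pairs are never captured; with {3}, the engine is forced to backtrack until three repetitions fit.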
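The effect can be checked by counting captures. The following is a small sketch, not code from the thread: the class `GreedyDelimiterDemo` and helper `CountSerials` are made-up names, and the input is the serial-number sample posted earlier. It compares the greedy delimiter with `+` and with `{4}`, then a lazy `.*?` and a negated character class.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDelimiterDemo
{
    // Hypothetical helper: number of captures recorded for the "serial" group.
    public static int CountSerials(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups["serial"].Captures.Count : 0;
    }

    public static void Main()
    {
        string input = "123456 ,A1B2C3D4 ,E5F6G7H8 ,I9J0K1L2 ,M3N4O5P6";

        // Greedy .* with +: the first repetition runs to the last comma.
        Console.WriteLine(CountSerials(
            @"^(?<product>\d{6})(?:.*,(?<serial>\w{8}))+", input));    // 1

        // Greedy .* with {4}: the engine must backtrack to fit four repetitions.
        Console.WriteLine(CountSerials(
            @"^(?<product>\d{6})(?:.*,(?<serial>\w{8})){4}", input));  // 4

        // Lazy .*?: each repetition stops at the nearest comma.
        Console.WriteLine(CountSerials(
            @"^(?<product>\d{6})(?:.*?,(?<serial>\w{8}))+", input));   // 4

        // Negated class: the skipped text can never contain the delimiter.
        Console.WriteLine(CountSerials(
            @"^(?<product>\d{6})(?:[^,]*,(?<serial>\w{8}))+", input)); // 4
    }
}
```

The lazy form and the negated class give the same count on this input. The negated class is arguably the stricter choice, since it cannot skip over a delimiter to find a later pair.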
| | xalnix |
If RegexOptions.Singleline is set, then .* will also match \n.
| | JohnGrove |
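To illustrate the Singleline point, here is a short sketch with hypothetical names (`SinglelineDemo`, `DotStar`) and a made-up two-line input. It shows how far `.*` reaches with and without the option.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Hypothetical helper: the text matched by ".*" from the start of the input.
    public static string DotStar(string input, RegexOptions options)
    {
        return Regex.Match(input, ".*", options).Value;
    }

    public static void Main()
    {
        string input = "12345 /ABCDE123\n/FGHIJ456";

        // Default: "." stops at the newline, so only the first line is matched.
        Console.WriteLine(DotStar(input, RegexOptions.None).Length);       // 15

        // Singleline: "." also matches \n, so the whole input is matched.
        Console.WriteLine(DotStar(input, RegexOptions.Singleline).Length); // 25
    }
}
```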