
Multiple Captures with Nested Groups

I need to parse a string of data that will appear in the following format: <container id> (/<item number> <quantity>)*

The string will always begin with a 5-digit container id followed by repeating pairs of item number (5 characters) and quantity (3 digits), each pair preceded by a forward slash delimiter. I have tried the following regex pattern:

^(?<container>\d{5})(?:.*/(?<itemNumber>\w{5})(?<quantity>\d{3}))+

When I run it through RegEx, I only get the last item/qty pair in the string. However, if I change the trailing plus sign to {3}, thereby limiting the repetition to only three pairs, it works fine and I get three captures for the "itemNumber" and "quantity" groups.

What is the difference? Why does it work with an explicit limiter but not when I use a plus sign? It is not possible for me to hard-code the limiter since the number of item/qty pairs will never be fixed.
SonOfPirate
Please provide samples of your data so we can investigate a solution.
John Grove - TFD Group, Senior Software Engineer, EI Division, http://www.tfdg.com
JohnGrove

Here is a simple example:

12345 /ABCDE123 /FGHIJ456 /KLMNO789

I would expect match.Groups to return 3 items:

  • The first would be named "container" and contain a single Capture with a value of "12345".
  • The second would be named "itemNumber" and contain three (3) items in its Captures collection: "ABCDE", "FGHIJ" and "KLMNO"
  • The third would be named "quantity" and also contain three (3) items in its Captures collection: "123", "456" and "789"

This works when I use the explicit limiter, {3}, but only returns a single Capture for each group ("12345", "KLMNO" and "789") when I use a plus sign or asterisk for the repetition.

One other thing I noticed after testing further, it appears that everything works perfectly when the ".*/" - for the delimiter - is REMOVED!

SonOfPirate
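
For readers following along, here is a minimal sketch of the Groups versus Captures distinction described above. It assumes only whitespace precedes each slash (so it uses \s* rather than .*), and the class and helper names (CapturesDemo, CaptureValues) are made up for illustration:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class CapturesDemo
{
    // Hypothetical helper: returns every capture recorded for a named group, in match order.
    public static string[] CaptureValues(Match m, string groupName)
    {
        return m.Groups[groupName].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
    }

    static void Main()
    {
        // Assumption: only whitespace sits between a pair and the next slash.
        Match m = Regex.Match(
            "12345 /ABCDE123 /FGHIJ456 /KLMNO789",
            @"^(?<container>\d{5})(?:\s*/(?<itemNumber>\w{5})(?<quantity>\d{3}))+");

        // Group.Value only exposes the LAST capture of a repeated group.
        Console.WriteLine(m.Groups["itemNumber"].Value);                      // KLMNO

        // Group.Captures keeps one entry per repetition.
        Console.WriteLine(string.Join(",", CaptureValues(m, "itemNumber")));  // ABCDE,FGHIJ,KLMNO
        Console.WriteLine(string.Join(",", CaptureValues(m, "quantity")));    // 123,456,789
    }
}
```

The point is that a repeated group still shows up as one entry in match.Groups; the individual repetitions live in that group's Captures collection.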

Perhaps something like this:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            String example = "12345 /ABCDE123 /FGHIJ456 /KLMNO789";
            String pattern = @"(?<Container>\d{5})\s\/
                            (?<ItemNumber1>\w{5})(?<Quantity1>\d{3})\s\/
                            (?<ItemNumber2>\w{5})(?<Quantity2>\d{3})\s\/
                            (?<ItemNumber3>\w{5})(?<Quantity3>\d{3})";                            
            Regex rx = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
            Match m = rx.Match(example);
            if (m.Success)
            {
                Console.WriteLine("Container:    {0}", m.Groups["Container"].Value);
                Console.WriteLine("");
                Console.WriteLine("ItemNumber1:  {0}", m.Groups["ItemNumber1"].Value);                
                Console.WriteLine("Quantity1:    {0}", m.Groups["Quantity1"].Value);
                Console.WriteLine("");
                Console.WriteLine("ItemNumber2:  {0}", m.Groups["ItemNumber2"].Value);
                Console.WriteLine("Quantity2:    {0}", m.Groups["Quantity2"].Value);
                Console.WriteLine("");
                Console.WriteLine("ItemNumber3:  {0}", m.Groups["ItemNumber3"].Value);
                Console.WriteLine("Quantity3:    {0}", m.Groups["Quantity3"].Value);
            }
            Console.ReadLine();
        }
    }
}


JohnGrove

//Output

Container: 12345

ItemNumber1: ABCDE
Quantity1: 123

ItemNumber2: FGHIJ
Quantity2: 456

ItemNumber3: KLMNO
Quantity3: 789


JohnGrove
Again, the pair (itemNumber/quantity) can repeat 1-n times. I cannot use an explicit limiter, whether it is in the form "{3}" or coded as you have in the example. I could have 50 occurrences in the data string I am matching and then 22 in the next string.
SonOfPirate

Then change this like so:

Match m = rx.Match(example);
while (m.Success)
{
    Console.WriteLine("Container: {0}", m.Groups["Container"].Value);
    Console.WriteLine("");
    Console.WriteLine("ItemNumber1: {0}", m.Groups["ItemNumber1"].Value);
    Console.WriteLine("Quantity1: {0}", m.Groups["Quantity1"].Value);
    Console.WriteLine("");
    Console.WriteLine("ItemNumber2: {0}", m.Groups["ItemNumber2"].Value);
    Console.WriteLine("Quantity2: {0}", m.Groups["Quantity2"].Value);
    Console.WriteLine("");
    Console.WriteLine("ItemNumber3: {0}", m.Groups["ItemNumber3"].Value);
    Console.WriteLine("Quantity3: {0}", m.Groups["Quantity3"].Value);
    Console.WriteLine("");
    m = m.NextMatch();
}
Console.ReadLine();

If I misunderstood you, provide about 4 or 5 samples.


JohnGrove

Yea, obviously there is a misunderstanding. I am not concerned about the Groups - I am trying to work with multiple Captures for the groups.

Let's go with a simpler example. I have a data pattern such as "product, serial, serial, serial, serial, serial, ..." where the string will always begin with a 6 digit product number followed by 1 or more 8-character serial numbers and is comma-delimited (whitespace before the delimiter is ignored). I can have just one serial number or I can have 100 serial numbers in the string. I have no idea.

I want to capture the product number and each serial number that appears in the string. I've tried this regular expression pattern:

"^(?<product>\d{6})(?:\s*,(?<serial>\w{8}))+"

Using the following code:

Match m = regex.Match(input);

if (m.Success)
{
    string[] groupNames = regex.GetGroupNames();
    StringBuilder sb = new StringBuilder();

    foreach (string groupName in groupNames)
    {
        // Group "0" always holds the entire match - ignore it.
        if (groupName != "0")
        {
            Group group = m.Groups[groupName];

            // Captures holds every substring the group matched, in order.
            for (int i = 0; i < group.Captures.Count; i++)
            {
                Capture capture = group.Captures[i];

                sb.AppendFormat("'{0}': {1}", groupName, capture.Value);
                sb.AppendLine();
            }
        }
    }

    Console.Write(sb.ToString());
}

Given the data:

123456 ,A1B2C3D4 ,E5F6G7H8 ,I9J0K1L2 ,M3N4O5P6

I would expect the output to be:

'product': 123456
'serial': A1B2C3D4
'serial': E5F6G7H8
'serial': I9J0K1L2
'serial': M3N4O5P6

This is not the case when I use the plus sign as the repeater. Instead I get:

'product': 123456
'serial': M3N4O5P6

However, if I change the pattern to:

"^(?<product>\d{6})(?:\s*,(?<serial>\w{8})){4}"

I get the expected/desired output. Except I can't use this pattern because the number of repetitions is 1-n. Oh, and using {1,} behaves the same as the plus sign - which was expected.

So, my question is: how do I set up the pattern to support 1-n repetitions of the serial number group, with the preceding comma delimiter, so that the named group returns multiple captures when they occur in the input string?

SonOfPirate

I tested your code and didn't have any trouble.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            String test = "123456 ,A1B2C3D4 ,E5F6G7H8 ,I9J0K1L2 ,M3N4O5P6";
            Regex rx = new Regex(@"^(?<product>\d{6})(?:\s*,(?<serial>\w{8}))+");
            Match m = rx.Match(test);
            if (m.Success)
            {
                Capture capture;
                Group group;
                StringBuilder sb = new StringBuilder();

                string[] groupNames = rx.GetGroupNames();
                foreach (string groupName in groupNames)
                {
                    // There is always a group "0" at index 0 that contains the full string - ignore!
                    if (groupName != "0")
                    {
                        group = m.Groups[groupName];

                        for (int i = 0; i < group.Captures.Count; i++)
                        {
                            capture = group.Captures[i];
                            sb.AppendFormat("'{0}': {1}", groupName, capture.Value);
                            sb.AppendLine();
                        }
                    }
                }

                Console.Write(sb.ToString());
            }
        }
    }
}

//Output

'product': 123456
'serial': A1B2C3D4
'serial': E5F6G7H8
'serial': I9J0K1L2
'serial': M3N4O5P6
JohnGrove

Ah, I figured out the difference! In my testing, instead of "\s*" before the comma delimiter, I was using ".*". Apparently this is what's causing the problem. If I take the exact code you have above and use the period instead, I get only the last serial, just as I was describing.

Any idea why the period in this place would make such a difference?

SonOfPirate
SonOfPirate,
Getting at nested groups within a repeated group is not intuitively obvious, but here is an example that should handle everything you throw at it. Notice how the Captures collections of the individually named groups are indexed in parallel to get at each captured value...
            string[] tests = {
                 "12345 /abcde123 /fghij456 /klmno789",
                 "22222 /bbbbb222 /ccccc333",
                 "44444 /ddddd444 /eeeee555 /fffff666 /ggggg777",
                              };
            string pattern = @"(?<container>\d{5})(?<serial>[^/]*/(?<itemNumber>\w{5})(?<quantity>\d{3}))+";
            Console.WriteLine(" Cont  Item   Qnty");
            foreach (string test in tests)
            {
                foreach (Match mx in Regex.Matches(test, pattern))
                {
                    for(int idx = 0; idx < mx.Groups["serial"].Captures.Count; ++idx)
                    {
                        Console.WriteLine("{0}: {1}: {2}", mx.Groups["container"].Value, mx.Groups["itemNumber"].Captures[idx].Value,mx.Groups["quantity"].Captures[idx].Value);
                    }
                }
            }


Les Potter, Xalnix Corporation, Yet Another C# Blog
xalnix
Ah, I like the negation preceding the delimiter. That will allow us to ignore any characters rather than just whitespace which is what I was after using the period.

What is so different, though, about the period that it causes the problems I was experiencing yet these other constructs work just fine?
SonOfPirate
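.*/ is greedy. The .* matches every character (except \n), including /. So if you have x/xxxxxxx/xxxxxx/ where x is any character except /, then .*/ will grab everything up to and including the last slash. That means the first pass through your repeated group already swallows all of the earlier pairs, leaving only the last pair for the named groups to capture, and the + is satisfied with that single pass. With {3} the engine is forced to backtrack until the group can repeat three times, which is why the explicit count works.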
xalnix
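
To make the greedy behavior concrete, here is a small sketch that counts the captures of the "serial" group for three variants of the prefix before the comma: greedy .*, lazy .*?, and the negated class [^,]*. The class and method names (GreedyDemo, CountSerials) are invented for this illustration, and the input is the sample string from earlier in the thread:

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyDemo
{
    // Hypothetical helper: how many captures did the "serial" group record?
    public static int CountSerials(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups["serial"].Captures.Count : 0;
    }

    static void Main()
    {
        const string input = "123456 ,A1B2C3D4 ,E5F6G7H8 ,I9J0K1L2 ,M3N4O5P6";
        const string head = @"^(?<product>\d{6})(?:";
        const string tail = @",(?<serial>\w{8}))+";

        // Greedy: the first repetition runs to the last comma, so only one capture.
        Console.WriteLine(CountSerials(head + @".*" + tail, input));    // 1

        // Lazy: each repetition stops at the nearest comma.
        Console.WriteLine(CountSerials(head + @".*?" + tail, input));   // 4

        // Negated class: cannot cross a comma at all.
        Console.WriteLine(CountSerials(head + @"[^,]*" + tail, input)); // 4
    }
}
```

Both the lazy quantifier and the negated class give one capture per serial here. The negated class is arguably the safer choice because it can never skip past a delimiter, even when the engine backtracks.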
If RegexOptions.Singleline is set, then .* will also match \n
JohnGrove
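
A quick sketch of the Singleline point, assuming a made-up input that contains a line break (the class and method names are illustrative only):

```csharp
using System;
using System.Text.RegularExpressions;

static class SinglelineDemo
{
    // Hypothetical helper: returns whatever ".*" matches first under the given options.
    public static string MatchAll(string input, RegexOptions options)
    {
        return Regex.Match(input, ".*", options).Value;
    }

    static void Main()
    {
        // Assumed sample: the second pair sits on a new line.
        const string input = "12345 /ABCDE123\n/FGHIJ456";

        // By default "." stops at the newline, so only the first line is matched.
        Console.WriteLine(MatchAll(input, RegexOptions.None).Length);       // 15

        // With Singleline, "." also matches \n and the match spans the whole string.
        Console.WriteLine(MatchAll(input, RegexOptions.Singleline).Length); // 25
    }
}
```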
