
Regex uppercase words

I have the following strings, for example:
"Microsoft's Products are the best products in the market."
"Microsoft Developer Network has many helpful links including support991, downloads, forums in the market."


I want to capture the uppercase words that start each sentence in group1 (in this case: "Microsoft's Products", "Microsoft Developer Network"), and the rest of the string up to "in the market." in group2 (in this case: "are the best products", "has many helpful links including support991, downloads, forums").

I have something like this so far: (?<group1>([A-Z]\w+\s)+), but since I am new to regex I have no idea how to continue.

Thanks in advance
klevis
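
Editorial note: a minimal sketch of what the pattern from the question does on the two sample strings. The class and method names (FirstAttemptDemo, LeadingGroup) are hypothetical and not from the thread. Because an apostrophe is not a \w character, the match cannot begin at "Microsoft's", so the engine moves on and starts at "Products" instead.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstAttemptDemo
{
    // Applies the pattern from the question, unchanged, and returns group1
    // (or an empty string when nothing matches).
    public static string LeadingGroup(string sentence)
    {
        Match m = Regex.Match(sentence, @"(?<group1>([A-Z]\w+\s)+)");
        return m.Success ? m.Groups["group1"].Value : string.Empty;
    }

    public static void Main()
    {
        // "Microsoft's" is skipped: \w+ stops at the apostrophe and \s cannot match it.
        Console.WriteLine("[" + LeadingGroup(
            "Microsoft's Products are the best products in the market.") + "]");
        // prints "[Products ]"

        // Without an apostrophe all three leading words are captured (with a trailing space).
        Console.WriteLine("[" + LeadingGroup(
            "Microsoft Developer Network has many helpful links including support991, downloads, forums in the market.") + "]");
        // prints "[Microsoft Developer Network ]"
    }
}
```

The pattern is also unanchored, which is why it is free to start in the middle of the sentence.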
First a couple of questions:
What defines a sentence?
Is it a period at the end of a string of characters?
Is each sentence on its own line?

Assuming that each sentence is on its own line, you can use some nesting to achieve what you're looking for. See the following:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

namespace TestingRegex
{
    class Program
    {
        static void Main(string[] args)
        {
            string test = @"Microsoft's Products are the best products in the market.
Microsoft Developer Network has many helpful links including support991, downloads, forums in the market.";

            Regex regex = new Regex(@"^(?<group1>(?:[A-Z][\w']+ )+)(?<group2>.+)$", RegexOptions.Multiline);

            MatchCollection matches = regex.Matches(test);

            foreach (Match match in matches)
            {
                Console.Write("group1: ");
                Console.WriteLine(match.Groups["group1"]);

                Console.Write("group2: ");
                Console.WriteLine(match.Groups["group2"]);

                Console.WriteLine();
            }
        }
    }
}



Jerry Schulist

Please remember to mark replies which answer your question as answers and vote for replies which are helpful.
  • Marked As Answer by klevis Thursday, September 03, 2009 6:08 PM
inetscan
The regex I provided worked just fine for me:

            System.Text.RegularExpressions.Regex regextest =
                new System.Text.RegularExpressions.Regex(@"(?<group1>(([A-Z].*?(?=\x20[^A-Z]))+?))(?<group2>(.*?(?=\x20(in\x20the\x20market\.))))");
            string test1 = @"Microsoft's Products are the best products in the market.";
            string test2 = @"Microsoft Developer Network has many helpful links including support991, downloads, forums in the market.";

            System.Text.StringBuilder outVal = new System.Text.StringBuilder();
            foreach (System.Text.RegularExpressions.Match hold in regextest.Matches(test1))
                outVal.AppendLine("group1 \"" + hold.Groups["group1"].Value + "\" / group2 \"" + hold.Groups["group2"].Value + '"');

            outVal.AppendLine("--------------End of Test 1------------------");

            foreach (System.Text.RegularExpressions.Match hold in regextest.Matches(test2))
                outVal.AppendLine("group1 \"" + hold.Groups["group1"].Value + "\" / group2 \"" + hold.Groups["group2"].Value + '"');

            MessageBox.Show(outVal.ToString());
  • Marked As Answer by klevis Thursday, September 03, 2009 6:37 PM
syntaxeater
You would need to use word boundaries:

\b(?<Caps>[A-Z]\w+'?\w)\b
John Grove - TFD Group, Senior Software Engineer, EI Division, http://www.tfdg.com
JohnGrove
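
Editorial note: a small sketch of what the word-boundary pattern above returns, assuming it is run with Regex.Matches over one of the sample sentences. WordBoundaryDemo and CapitalizedWords are hypothetical names. Each capitalized word comes back as its own match, wherever it occurs in the text, rather than as one group of leading words.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Collects every match of the word-boundary pattern from the post above.
    // Note: [A-Z]\w+'?\w needs at least three word characters, so a short
    // capitalized word such as "It" would not match.
    public static List<string> CapitalizedWords(string text)
    {
        var words = new List<string>();
        foreach (Match m in Regex.Matches(text, @"\b(?<Caps>[A-Z]\w+'?\w)\b"))
        {
            words.Add(m.Groups["Caps"].Value);
        }
        return words;
    }

    public static void Main()
    {
        List<string> words = CapitalizedWords(
            "Microsoft's Products are the best products in the market.");
        Console.WriteLine(string.Join(" | ", words));
        // prints "Microsoft's | Products"
    }
}
```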
Improve on this with the aforementioned word boundaries (instead of the Lookahead Assertions ?=), but here ya go:

(?<group1>(([A-Z].*?(?=\x20[^A-Z]))+?))(?<group2>(.*?(?=\x20(in\x20the\x20market\.))))

syntaxeater
Actually, I didn't read closely enough, you wanted to catch the rest as well.

\b(?<Begin>[A-Z]\w+'?\w)(?<Rest>.+)?
John Grove - TFD Group, Senior Software Engineer, EI Division, http://www.tfdg.com
  • Proposed As Answer by syntaxeater Thursday, September 03, 2009 4:55 PM
JohnGrove
Actually, I didn't read closely enough, you wanted to catch the rest as well.

\b(?<Begin>[A-Z]\w+'?\w)(?<Rest>.+)?
John Grove - TFD Group, Senior Software Engineer, EI Division, http://www.tfdg.com

This will only capture the first uppercase word in the Begin group, not all adjacent uppercase words at the beginning of a sentence.

Jerry Schulist
  • Edited by inetscan Thursday, September 03, 2009 5:30 PM (added quote)
inetscan
Yes, that is true. After I saw your solution [which seemed to be a more honed response], I was hoping the user would have enough initiative to try them all and figure that out for himself.
John Grove - TFD Group, Senior Software Engineer, EI Division, http://www.tfdg.com
JohnGrove
Actually, come to think of it - word boundaries might need more than just apostrophe awareness. Take for instance:

Microsoft WPF: Rich Internet Applications need a conversion wizard for it to be accepted in the market.

I can also think of scenarios for ( ) ; - , " ` / & among other misc characters.

Is this beyond the scope of what you need?
  • Edited by syntaxeater Thursday, September 03, 2009 5:33 PM (botched the RIA acronym)
syntaxeater
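
Editorial note: a sketch of the punctuation concern raised above, run against the accepted pattern from the marked answer. The Widened pattern is a hypothetical variant that tolerates a single trailing ':', ';' or ',' after a capitalized word; it is only one possible reading of what the desired capture would be, and the names PunctuationDemo and LeadingGroup are not from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PunctuationDemo
{
    // Pattern from the marked answer.
    public const string Accepted = @"^(?<group1>(?:[A-Z][\w']+ )+)(?<group2>.+)$";

    // Hypothetical variant: allow one trailing ':', ';' or ',' after each capitalized word.
    public const string Widened = @"^(?<group1>(?:[A-Z][\w']+[:;,]? )+)(?<group2>.+)$";

    public static string LeadingGroup(string pattern, string sentence)
    {
        Match m = Regex.Match(sentence, pattern, RegexOptions.Multiline);
        return m.Success ? m.Groups["group1"].Value : string.Empty;
    }

    public static void Main()
    {
        const string sentence =
            "Microsoft WPF: Rich Internet Applications need a conversion wizard for it to be accepted in the market.";

        // The colon after "WPF" ends the run of capitalized words early.
        Console.WriteLine("[" + LeadingGroup(Accepted, sentence) + "]");
        // prints "[Microsoft ]"

        Console.WriteLine("[" + LeadingGroup(Widened, sentence) + "]");
        // prints "[Microsoft WPF: Rich Internet Applications ]"
    }
}
```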
It is all a matter of "knowing" your data. As inetscan said previously:

"First a couple of questions:
What defines a sentence?
Is it a period at the end of a string of characters?
Is each sentence on its own line?"

These are questions the poster needs to answer, since we are not privy to that.

John Grove - TFD Group, Senior Software Engineer, EI Division, http://www.tfdg.com
JohnGrove

Thanks for the replies, but none of the regexes above worked. A sentence ends with a period and each sentence does have its own line. Your regex was capturing "Microsoft's Products are the best products in the " and "Microsoft Developer Network has many helpful links including " in the first group.
I wanted to get only the words that start with an uppercase letter in group1, meaning "Microsoft's Products" or "Microsoft Developer Network", and in group2 get "are the best products" or "has many helpful links including support991, downloads, forums" without including "in the market". I am trying to play around with your regexes but no luck so far.

Thank you for your help.

klevis
List your data. If it is too big, send a sample.
John Grove - TFD Group, Senior Software Engineer, EI Division, http://www.tfdg.com
JohnGrove
This regex will only catch the first uppercase word (in this case: Microsoft's or Microsoft) and then the rest. I want to capture all the uppercase words that start the sentence, and then capture the rest.
klevis
Try the full sample I provided above:

The output is the following:

group1: Microsoft's Products 
group2: are the best products in the market.

group1: Microsoft Developer Network 
group2: has many helpful links including support991, downloads, forums in the market.
Is that not what you're looking for?
Jerry Schulist
inetscan
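
Editorial note: the output above keeps a trailing space in group1 and leaves "in the market." inside group2, whereas the question asked for the text up to "in the market.". A minimal sketch of one way to tighten that, assuming every sentence sits on its own line and ends with the literal " in the market."; MarketSentenceSplitter and Split are hypothetical names, and this is not the pattern from the thread. The optional \r is there because, in .NET, $ with RegexOptions.Multiline matches before \n only, so Windows line endings would otherwise leave a carriage return in the way.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MarketSentenceSplitter
{
    // group1: one or more capitalized words separated by single spaces (no trailing space).
    // group2: everything after them, up to but excluding " in the market."
    // Assumption: each line ends with " in the market." (optionally followed by '\r').
    private static readonly Regex Pattern = new Regex(
        @"^(?<group1>[A-Z][\w']*(?: [A-Z][\w']*)*) (?<group2>.+?) in the market\.\r?$",
        RegexOptions.Multiline);

    public static List<(string Lead, string Rest)> Split(string text)
    {
        var result = new List<(string Lead, string Rest)>();
        foreach (Match m in Pattern.Matches(text))
        {
            result.Add((m.Groups["group1"].Value, m.Groups["group2"].Value));
        }
        return result;
    }

    public static void Main()
    {
        string test =
            "Microsoft's Products are the best products in the market.\r\n" +
            "Microsoft Developer Network has many helpful links including support991, downloads, forums in the market.";

        foreach (var (lead, rest) in Split(test))
        {
            Console.WriteLine("group1: \"" + lead + "\" / group2: \"" + rest + "\"");
        }
        // prints:
        // group1: "Microsoft's Products" / group2: "are the best products"
        // group1: "Microsoft Developer Network" / group2: "has many helpful links including support991, downloads, forums"
    }
}
```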

Thanks for your answer. I had some stupid problems with Visual Studio. I just tried your code and it is exactly what I needed. Most of the sentences have apostrophes in them and not other characters like ( ) ; - , " ` / &.
By the way what is the best site to learn regexes?

  • Unmarked As Answer by klevis Thursday, September 03, 2009 6:07 PM
  • Marked As Answer by klevis Thursday, September 03, 2009 6:01 PM
klevis

By the way what is the best site to learn regexes?


I like: http://www.regular-expressions.info/
and for creating/testing regular expressions, RegexBuddy is your friend.

Jerry Schulist
inetscan
I just checked the site and it looks great. I also tried RegexBuddy to test your expression and surprisingly it didn't work. Probably I haven't checked the right options. But anyway, thanks again for the solution. Keep up the good work.

-Klevis
klevis
Hey Klevis, try Expresso. And it is free! You will have to register it, but it is free.
John Grove - TFD Group, Senior Software Engineer, EI Division, http://www.tfdg.com
JohnGrove
The regex I provided worked just fine for me:


You are right. It actually works, now that I double-checked. I don't know why it didn't before. It's a bit more complicated for me as a rookie, but it's an interesting way of doing it.
klevis
Hey Klevis, try Expresso. And it is free! You will have to register it, but it is free.
John Grove - TFD Group, Senior Software Engineer, EI Division, http://www.tfdg.com

Great Tool. Thanks
klevis
I just checked the site and it looks great. I also tried RegexBuddy to test your expression and surprisingly it didn't work. Probably I haven't checked the right options. But anyway, thanks again for the solution. Keep up the good work.

-Klevis
Not sure why that would happen; see this screenshot.

Jerry Schulist
inetscan

I compared your screenshot with mine and I saw that I had "Free-spacing" checked. Well, it's my first time using it, so I have to get used to the commands.
klevis
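
Editorial note: a sketch of why a free-spacing option can change the result. It assumes that the "Free-spacing" setting corresponds to RegexOptions.IgnorePatternWhitespace in .NET, under which unescaped whitespace in the pattern is ignored, so the literal space in the accepted pattern no longer matches a space. Writing the space as \x20 (as the lookahead-based answer in this thread does) keeps it significant. FreeSpacingDemo and LeadingGroup are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FreeSpacingDemo
{
    public const string Sentence = "Microsoft's Products are the best products in the market.";

    // Accepted pattern, with a literal space after each capitalized word.
    public const string LiteralSpace = @"^(?<group1>(?:[A-Z][\w']+ )+)(?<group2>.+)$";

    // Same pattern with the space written as \x20, which survives free-spacing mode.
    public const string EscapedSpace = @"^(?<group1>(?:[A-Z][\w']+\x20)+)(?<group2>.+)$";

    public static string LeadingGroup(string pattern, RegexOptions options)
    {
        Match m = Regex.Match(Sentence, pattern, options);
        return m.Success ? m.Groups["group1"].Value : string.Empty;
    }

    public static void Main()
    {
        RegexOptions freeSpacing = RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace;

        Console.WriteLine("[" + LeadingGroup(LiteralSpace, RegexOptions.Multiline) + "]");
        // prints "[Microsoft's Products ]"

        // The literal space is dropped from the pattern, so only the first word is captured.
        Console.WriteLine("[" + LeadingGroup(LiteralSpace, freeSpacing) + "]");
        // prints "[Microsoft's]"

        Console.WriteLine("[" + LeadingGroup(EscapedSpace, freeSpacing) + "]");
        // prints "[Microsoft's Products ]"
    }
}
```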
