Filtering Special Characters Quickly From A String

I need to allow only the characters you can type on a keyboard (basically every key that you can see) and drop the rest.

I would probably use a regular expression to do this, but I'm not that great with them yet.

Does anyone have an example of how to do this?
LearningVisualC2005
What is the set of allowable characters? "Drop the rest" is vague. Also, different cultures have different keys on the keyboard.
William Wegerson (www.OmegaCoder.Com)
OmegaMan
OmegaMan is right on the money. However, here's something to get you started with a standard US keyboard. You can follow the same repeating pattern to add whatever characters are "legal".

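// Note \||\; (pipe OR semicolon) and the added \- and \= alternatives.
string pattern = @"^([A-Za-z0-9]|\~|\`|\!|\@|\#|\$|\%|\^|\&|\*|\(|\)|\x5F|\-|\+|\=|\{|\}|\:|\x22|\<|\>|\?|\\|\s|\[|\]|\||\;|\'|\,|\.|\/)*$";
            // No RegexOptions.Multiline: ^ and $ must anchor the whole input,
            // otherwise a single valid line is enough to produce a match.
            Regex r = new Regex(pattern);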

            string input = Console.ReadLine();
            if (r.IsMatch(input))
                Console.WriteLine("MATCH!");//DO WORK
            else
                Console.WriteLine("NO MATCH!");//DO WORK
Basically, to allow another character you add an alternative such as "|\+" to the end of the pattern (before the final )*$). Translated to English, | means OR and \+ means a literal plus sign (\ is the escape character).
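
The same kind of allow-list can also be written as a single character class, which avoids the long chain of alternatives. This is only a minimal sketch, assuming "legal" means printable ASCII (space through ~) plus tab; the class name KeyboardFilter and the method IsLegal are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeyboardFilter
{
    // Assumption: "legal" = 0x20 (space) through 0x7E (~), plus tab.
    // \z anchors at the very end of the input, so a trailing newline is rejected too.
    private static readonly Regex AllLegal = new Regex(@"^[\x20-\x7E\t]*\z");

    // True when every character of the input is in the allowed set.
    public static bool IsLegal(string input)
    {
        return AllLegal.IsMatch(input);
    }

    public static void Main()
    {
        Console.WriteLine(IsLegal("Hello, World! [1+2=3]_~"));
        Console.WriteLine(IsLegal("Test \u00DF Data")); // contains ß
    }
}
```

Adding or removing a character then means editing the class between the square brackets rather than appending another alternative.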
おろ?
P.Brian.Mackey
I want to keep anything that is ASCII 33 through 126 and drop anything that is not.


So if it came across these lines
"Test ß Data"
"Testing Ag§ain"

it would return
"Test Data"
"Testing Again"
LearningVisualC2005
That should be pretty much exactly what I posted.
おろ?
P.Brian.Mackey
I'm looking to strip everything that is not ASCII 33-126.

That function only identifies whether there are invalid characters.
LearningVisualC2005
You just need to check characters instead of strings and implement a little object-oriented programming at the point where I said //DO WORK...

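        // Requires: using System.Text; using System.Text.RegularExpressions;
        // The pattern is built once instead of once per character.
        // Note \||\; (pipe OR semicolon) and the added \- and \= alternatives.
        private static readonly Regex AllowedChar = new Regex(
            @"([A-Za-z0-9]|\~|\`|\!|\@|\#|\$|\%|\^|\&|\*|\(|\)|\x5F|\-|\+|\=|\{|\}|\:|\x22|\<|\>|\?|\\|\s|\[|\]|\||\;|\'|\,|\.|\/)");

        public static string CheckString(string input)
        {
            StringBuilder ret = new StringBuilder(input.Length);
            foreach (char aChar in input)
            {
                // Keep the character only if it is in the allowed set.
                if (AllowedChar.IsMatch(aChar.ToString()))
                    ret.Append(aChar);
            }
            return ret.ToString();
        }

        static void Main(string[] args)
        {
            string input = "I am a test string ü with 2 invalid characters ü";
            Console.WriteLine(CheckString(input));
        }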
Let us know if you need more clarification
おろ?
P.Brian.Mackey
How do I add an underscore to that filter?

Adding |\_
tells me
Unrecognized escape sequence \_.
LearningVisualC2005
I already added the underscore; it should be recognized, and it was when I tested it. Have you tried it? I used the hex escape code \x5F instead of "\_".
おろ?
P.Brian.Mackey
I want to keep anything that is ASCII 33 through 126 and drop anything that is not.


So if it came across these lines
"Test ß Data"
"Testing Ag§ain"

it would return
"Test Data"
"Testing Again"
I believe 33 to 126 is the wrong range, because space is at hex 20 (decimal 32); hence your example above would result in "TestData", not "Test Data". I have a pattern that works for that range plus whitespace such as space or tab. Here is the code:

string data = "Test ß Data Testing Ag§ain! ^zebra^";

// \x21-\x7E is decimal 33-126 (\x7A would stop at 'z' and drop { | } ~); \s also keeps whitespace.
Console.WriteLine( Regex.Replace( data, @"[^\x21-\x7E\s]", string.Empty ) );

// Outputs: Test  Data Testing Again! ^zebra^
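
If only the plain space should survive, rather than everything \s covers (in .NET, \s is Unicode-aware and matches more than space and tab), the range can start at \x20 instead. A minimal sketch under that assumption; AsciiStripper and Strip are hypothetical names, not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AsciiStripper
{
    // Matches anything outside space (0x20) through tilde (0x7E).
    private static readonly Regex NotPrintableAscii = new Regex(@"[^\x20-\x7E]");

    // Returns the input with every non-matching-range character removed.
    public static string Strip(string input)
    {
        return NotPrintableAscii.Replace(input, string.Empty);
    }

    public static void Main()
    {
        // \u00A7 is the section sign from the example in the thread.
        Console.WriteLine(Strip("Testing Ag\u00A7ain")); // Testing Again
    }
}
```

Removing a character that sits between two spaces leaves both spaces behind, so "Test ß Data" comes back with a double space; collapsing those would need a second pass.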

Note that I moved this thread to the Regular Expressions forum, which is a great place to learn and ask questions about regex. Check out its informative top-level post, .Net Regex Resources Reference, which is a useful reference for beginners and experts alike.

Also, one can use the free tool Expresso to test and learn about regex patterns outside of one's .Net code.
William Wegerson (www.OmegaCoder.Com)
OmegaMan
How do I add an underscore to that filter?

Adding |\_
tells me
Unrecognized escape sequence \_.

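You missed his @, which tells the C# compiler to treat the string as a verbatim literal and not process the escapes: "\_" is not the same as @"\_". Note also that the underscore has no special meaning in a regex, so it needs no escaping at all. A plain _ or the hex escape \x5F works, whereas the .NET regex parser itself rejects \_ as an unrecognized escape because _ counts as a word character. HTH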
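
To make the two layers of escaping concrete, here is a small sketch of three equivalent ways to put an underscore into a pattern. UnderscoreDemo and its field names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnderscoreDemo
{
    // The underscore is not a regex metacharacter, so no escape is needed.
    public static readonly Regex Plain = new Regex("_");

    // Verbatim string: the backslash reaches the regex engine, which reads \x5F as '_'.
    public static readonly Regex HexVerbatim = new Regex(@"\x5F");

    // Regular string: the backslash must be doubled so the C# compiler passes it through.
    public static readonly Regex HexRegular = new Regex("\\x5F");

    public static void Main()
    {
        Console.WriteLine(Plain.IsMatch("a_b"));
        Console.WriteLine(HexVerbatim.IsMatch("a_b"));
        Console.WriteLine(HexRegular.IsMatch("a_b"));
    }
}
```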

William Wegerson (www.OmegaCoder.Com)
OmegaMan
Thanks for the reply.


Which scenario would be faster?
1. Test with IsMatch first (I don't expect too many invalid characters to come through)
and then remove them only if a match is found,

or 2. just execute a replace on each and every field?


The only reason I ask about speed is that I have to do this on about 30+ million fields.
LearningVisualC2005
Good question. Maybe I should write a blog article on that. ;-)

It doesn't say in the docs, but if IsMatch stops at the very first match, then yes, conceivably it would be faster to do #1. If IsMatch is poorly written and internally gets all the matches and just verifies that the total is > 0, then it could be slower.

If it's a target-rich environment where you know 4 out of 5 fields have problems, just do the replace on all of them. If it's the opposite, do the check first and then replace. That is my advice.
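
The two options can be sketched side by side as below. This is an illustration, not a benchmark: FieldCleaner and its method names are hypothetical, the pattern assumes the 33-126 range plus whitespace discussed above, and which option wins on real data would have to be measured (for example with System.Diagnostics.Stopwatch).

```csharp
using System;
using System.Text.RegularExpressions;

public static class FieldCleaner
{
    // One shared instance, so the pattern is parsed once for all fields.
    private static readonly Regex Invalid =
        new Regex(@"[^\x21-\x7E\s]", RegexOptions.Compiled);

    // Option 1: test first, replace only when something invalid is present.
    public static string CheckThenReplace(string field)
    {
        return Invalid.IsMatch(field) ? Invalid.Replace(field, string.Empty) : field;
    }

    // Option 2: always run the replace.
    public static string AlwaysReplace(string field)
    {
        return Invalid.Replace(field, string.Empty);
    }

    public static void Main()
    {
        Console.WriteLine(CheckThenReplace("Testing Ag\u00A7ain"));
        Console.WriteLine(AlwaysReplace("Testing Ag\u00A7ain"));
    }
}
```

Both methods return the same result for any input; the only difference is whether clean fields pay for one scan or for a scan inside Replace.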
William Wegerson (www.OmegaCoder.Com)
OmegaMan
Thanks for the reply.

If you do write an article, please update the thread with the link. Thanks.
LearningVisualC2005
I think using a traditional string method is faster and better than using Regex.
I don't know how to write the code in C#, and it's also a little complex in VB
(VB6 has to convert the string into an array of bytes),
so I will write a piece of pseudocode here.

Let's assume you have an input string called str.


For i = 1 to str.length { // assume the string's character array is 1-based
if ((ord(str[i]) < 33) or (ord(str[i]) > 126)) then
str[i] = chr(32); // mark the invalid character with a space
}
str = Replace(str,chr(32),''); // delete all marked characters (and any real spaces, which are outside 33-126)

ord() returns the ASCII value of a character
chr() returns the character for a number

I always use this kind of code in Delphi,
and I think it would be easy to convert it to C#.
It's very, very fast and simple.
Actually, this is essentially what Regex does in the background when it executes Omega's pattern.
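
For reference, one possible C# rendering of the loop above, offered as a sketch rather than a tested drop-in. It appends the valid characters to a StringBuilder instead of marking invalid ones with a space and deleting them afterwards; LoopFilter and KeepAscii33To126 are made-up names.

```csharp
using System;
using System.Text;

public static class LoopFilter
{
    // Keeps only characters whose code is in the range 33..126.
    // Spaces (32) fall outside that range and are dropped as well.
    public static string KeepAscii33To126(string input)
    {
        StringBuilder sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (c >= 33 && c <= 126)
                sb.Append(c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // \u00DF is the ß from the example in the thread.
        Console.WriteLine(KeepAscii33To126("Test \u00DF Data")); // TestData
    }
}
```

Changing the lower bound to 32 would keep ordinary spaces, which matches the "Test Data" output asked for earlier in the thread.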



www.wonderstudio.cn
Eping Wang
