SubExpression between html tags

Hello.

I have the following input text:

<DIV class=content>
<H2>A-Z</H2>
<UL class=listcol>
<LI><A href="http://www.djmick.co.uk/photos01/anson_mount_pictures.htm">Anson Mount</A>
<LI><A href="http://www.djmick.co.uk/photos01/troy_montero_pictures.htm">Troy Montero</A>
<LI><A href="http://www.djmick.co.uk/gallery16/tyson_beckford_pictures.htm">Tyson Beckfor</A>
<LI><A href="http://www.djmick.co.uk/photos01/zen_gesner_pictures.htm">Zen Gesner</A> </LI></UL>
<DIV class=clear></DIV></DIV>
<DIV class=clear></DIV><!-- ######## 20 PIXELS SPACE _ IMPORTANT FIX ########### -->
<DIV class=space></DIV><!-- ######## 20 PIXELS SPACE _ IMPORTANT FIX ########### -->
<DIV class=content>

I would like to match the actors' names in <LI> ... </A>.

I need to match only between the <DIV class=content> tags, because I have similar matches in other tags, like:
<LI><A href="http://www.djmick.co.uk/mobile.htm">Sexy Babes</A> - I don't want this

I've built the following regex, but it doesn't work:

<DIV class=content>(<li><a href=.*>(?<Title>.+)</a>)<DIV class=content>

Thanks

Lirons

If you mean that there are also other links in your string,

I think using Regex twice is better than one long Regex.

str = your whole string

reg.pattern = "<DIV class=content>.+?<DIV class=content>"

(Turn on the single-line option so that . also matches line breaks.)

Or use the Split() function to put each content block into an array of strings.

Now str = each content block,

and you can use John's pattern on it.
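
The two-step idea above could look roughly like this in C#. It is only a sketch under a few assumptions: the names `ContentLinks` and `ExtractActors` are made up for illustration, `RegexOptions.Singleline` is added so that `.` also matches line breaks, and the closing `<DIV class=content>` is placed in a lookahead so that two adjacent blocks are both found. The second step is a shortened form of John's pattern without the hard-coded host name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ContentLinks
{
    // Step 1: everything between one <DIV class=content> and the next one.
    // Singleline lets '.' cross line breaks; the lookahead leaves the closing
    // marker unconsumed so it can open the following block.
    private static readonly Regex Block = new Regex(
        @"<DIV class=content>(?<Body>.+?)(?=<DIV class=content>)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Step 2: the link pattern, applied to one block only.
    private static readonly Regex Link = new Regex(
        @"<A\s+href=""(?<Url>[^""]*)"">(?<Actor>[^<]*)</A>",
        RegexOptions.IgnoreCase);

    public static List<string> ExtractActors(string html)
    {
        var actors = new List<string>();
        foreach (Match block in Block.Matches(html))
        {
            foreach (Match link in Link.Matches(block.Groups["Body"].Value))
            {
                actors.Add(link.Groups["Actor"].Value);
            }
        }
        return actors;
    }

    public static void Main()
    {
        string html =
            "<LI><A href=\"http://www.djmick.co.uk/mobile.htm\">Sexy Babes</A>\n" +
            "<DIV class=content>\n" +
            "<UL class=listcol>\n" +
            "<LI><A href=\"http://www.djmick.co.uk/photos01/anson_mount_pictures.htm\">Anson Mount</A>\n" +
            "<LI><A href=\"http://www.djmick.co.uk/photos01/zen_gesner_pictures.htm\">Zen Gesner</A> </LI></UL>\n" +
            "<DIV class=clear></DIV></DIV>\n" +
            "<DIV class=content>";

        // Prints "Anson Mount" and "Zen Gesner"; the "Sexy Babes" link lies
        // outside the content block and is never seen by the second regex.
        foreach (string actor in ExtractActors(html))
        {
            Console.WriteLine(actor);
        }
    }
}
```

Keeping the two patterns in separate fields means each one can be tested and changed on its own.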

If the links can have other URLs, you can use this (case-insensitive):

reg.pattern = "<li><a[^>]+>([^<]+)</a>"
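
For the Split() route combined with this more general pattern, a hedged sketch might be the following. The names `SplitLinks` and `NamesBetweenMarkers` are invented for this example, and it assumes that only pieces with a `<DIV class=content>` on both sides are wanted, so the text before the first marker and after the last one is skipped. A `\s*` is added between `<li>` and `<a` to tolerate whitespace there.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SplitLinks
{
    private const string Marker = "<DIV class=content>";

    // The general pattern from the post; group 1 holds the link text.
    private static readonly Regex Item = new Regex(
        @"<li>\s*<a[^>]+>([^<]+)</a>", RegexOptions.IgnoreCase);

    public static List<string> NamesBetweenMarkers(string html)
    {
        var names = new List<string>();

        // Cut the page at every marker, matching the marker case-insensitively.
        string[] pieces = Regex.Split(html, Regex.Escape(Marker), RegexOptions.IgnoreCase);

        // pieces[0] precedes the first marker and the last piece follows the
        // last marker, so only the pieces in between are scanned.
        for (int i = 1; i < pieces.Length - 1; i++)
        {
            foreach (Match m in Item.Matches(pieces[i]))
            {
                names.Add(m.Groups[1].Value.Trim());
            }
        }
        return names;
    }

    public static void Main()
    {
        string html =
            "<LI><A href=\"mobile.htm\">Sexy Babes</A>\n" +
            "<DIV class=content>\n" +
            "<LI><A href=\"anson.htm\">Anson Mount</A>\n" +
            "<LI><A href=\"troy.htm\">Troy Montero</A>\n" +
            "<DIV class=content>\n" +
            "<LI><A href=\"footer.htm\">Footer link</A>";

        // Prints "Anson Mount" and "Troy Montero".
        foreach (string name in NamesBetweenMarkers(html))
        {
            Console.WriteLine(name);
        }
    }
}
```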


If there could be an additional HTML tag inside the link, such as <b> or <i>,

the pattern should be altered to <li><a[^>]+>(.+?)</a>
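
To see the difference between the two capture groups, here is a small sketch (the class name `NestedTagDemo`, the helper `Capture` and the sample markup are made up). With `[^<]+` a link that contains `<b>` is not matched at all, because the group cannot cross a `<`. With `.+?` it is matched, but the captured value still contains the inner tags.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedTagDemo
{
    // Returns group 1 of the first match, or "(no match)" when nothing matches.
    public static string Capture(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : "(no match)";
    }

    public static void Main()
    {
        string item = "<LI><A href=\"anson.htm\"><B>Anson Mount</B></A>";

        // [^<]+ needs at least one character that is not '<', but the link
        // text starts with <B>, so this prints: (no match)
        Console.WriteLine(Capture(@"<li><a[^>]+>([^<]+)</a>", item));

        // .+? grows lazily until </a> is reached, inner tags included,
        // so this prints: <B>Anson Mount</B>
        Console.WriteLine(Capture(@"<li><a[^>]+>(.+?)</a>", item));
    }
}
```

If only the plain name is wanted, the inner tags would still have to be stripped from the captured value in a further step.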

If you mean you want only the text between the link tags, John's pattern is also correct;

you must read the group value, not the whole match, as his code shows.


www.wonderstudio.cn
  • Marked As Answer by Lirons Friday, September 04, 2009 2:45 PM
Eping Wang
Thanks for the reply,

but this pattern returns all similar matches.

I need to match only between the tags I mentioned (<DIV class=content>).

Thanks
Lirons
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            String html = @"<DIV class=content> 
                              <H2>A-Z</H2>
                              <UL class=listcol> 
                               <LI>
                                 <A href=""http://www.djmick.co.uk/photos01/anson_mount_pictures.htm"">Anson Mount</A>
                               <LI>
                                 <A href=""http://www.djmick.co.uk/photos01/troy_montero_pictures.htm"">Troy Montero</A> 
                               <LI>     
                                 <A href=""http://www.djmick.co.uk/gallery16/tyson_beckford_pictures.htm"">Tyson Beckfor</A>
                               <LI>
                                 <A href=""http://www.djmick.co.uk/photos01/zen_gesner_pictures.htm"">Zen Gesner</A> 
                               </LI>
                              </UL>
                            <DIV class=clear></DIV>
                            </DIV>
                            <DIV class=clear></DIV>
                                <!-- ######## 20 PIXELS SPACE _ IMPORTANT FIX ########### -->
                            <DIV class=space></DIV>
                                <!-- ######## 20 PIXELS SPACE _ IMPORTANT FIX ########### -->
                            <DIV class=content>"; 
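            // Named groups: Url captures the href value, Actor the link text up to the next '<'.
            String pattern = @"<A\shref=""(?<Url>http:\/\/www\.djmick\.co\.uk\/[^""]*)"">(?<Actor>[^<]*)";
            Regex rx = new Regex(pattern, RegexOptions.IgnoreCase);
            // Walk through the matches one at a time.
            Match m = rx.Match(html);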
            while (m.Success)
            {
                Console.WriteLine("Url: {0}", m.Groups["Url"].Value);
                Console.WriteLine("Actor: {0}", m.Groups["Actor"].Value);
                Console.WriteLine("");
                m = m.NextMatch();
            }
            Console.ReadLine();

        }
    }
}

//Output

Url: http://www.djmick.co.uk/photos01/anson_mount_pictures.htm
Actor: Anson Mount

Url: http://www.djmick.co.uk/photos01/troy_montero_pictures.htm
Actor: Troy Montero

Url: http://www.djmick.co.uk/gallery16/tyson_beckford_pictures.htm
Actor: Tyson Beckfor

Url: http://www.djmick.co.uk/photos01/zen_gesner_pictures.htm
Actor: Zen Gesner


John Grove - TFD Group, Senior Software Engineer, EI Division, http://www.tfdg.com

  • Edited by JohnGrove Friday, September 04, 2009 1:59 PM
  • Edited by JohnGrove Friday, September 04, 2009 5:07 PM
JohnGrove

I was just about to post the appropriate HTML string here, but Eping was ahead of me.

Eping, you got it right. I've done that already and made two matches, as you wrote.

I'm a newbie; are you sure it is preferred over the one-line solution?
Lirons

John,

I just followed your steps. I'm very happy to help others here together with you and the other experts.

About my software: sorry, I just found big bugs in my software, Easy GREP, and some are not easy to solve.
(But the HTML Cleaner series is rather stable.)

Liron,

In my experience, splitting complex work into small pieces is always a good choice.
Here, each pattern is easier to read, less error-prone, and easier to debug.
And for the Regex engine, a big complex pattern can take more time to analyze and run.
I think the only bad thing is that the Regex looks not so "grand".
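
For comparison, one way the "one line" version might be written in .NET is with a lookbehind and a lookahead. This is a sketch, not something posted in this thread: the names `OneLineDemo` and `Names` are hypothetical, and it relies on .NET accepting variable-length lookbehind, which many other regex engines do not. For every candidate link the engine has to look backwards and forwards for the marker, which illustrates the readability and cost concerns above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OneLineDemo
{
    // A link counts only if a <DIV class=content> occurs somewhere before it
    // (lookbehind) and somewhere after it (lookahead). Singleline lets '.'
    // cross line breaks inside both lookarounds.
    private static readonly Regex OneLine = new Regex(
        @"(?<=<DIV class=content>.*)<li>\s*<a[^>]+>(?<Title>[^<]+)</a>(?=.*<DIV class=content>)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static List<string> Names(string html)
    {
        var names = new List<string>();
        foreach (Match m in OneLine.Matches(html))
        {
            names.Add(m.Groups["Title"].Value);
        }
        return names;
    }

    public static void Main()
    {
        string html =
            "<LI><A href=\"mobile.htm\">Sexy Babes</A>\n" +
            "<DIV class=content>\n" +
            "<LI><A href=\"anson.htm\">Anson Mount</A>\n" +
            "<LI><A href=\"zen.htm\">Zen Gesner</A>\n" +
            "<DIV class=content>";

        // Prints "Anson Mount" and "Zen Gesner".
        foreach (string name in Names(html))
        {
            Console.WriteLine(name);
        }
    }
}
```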


www.wonderstudio.cn
Eping Wang
OK then, thank you very much!
Lirons
