
Visual Basic 2008 MSHTML To Parse Text, Visual Basic 2008 MSHTML To Extract Links Text etc..

I have a few questions about using MSHTML to parse and extract text or links:

1- Is this more reliable than using a WebBrowser control when parsing or extracting text from thousands of URL requests?

2- Could someone show a code sample that uses MSHTML and takes the URL for the HTTP request from a TextBox1.Text control?


Oh, I forgot to mention the code that I'm currently using:


Public Class Form1
    Private Sub Form1_Load(ByVal sender As System.Object, _
    ByVal e As System.EventArgs) Handles MyBase.Load
        Me.TextBox1.Multiline = True
        Me.TextBox1.ScrollBars = ScrollBars.Both
        'above only for showing the sample
        Dim Doc As mshtml.IHTMLDocument2
        Doc = New mshtml.HTMLDocumentClass
        Dim wbReq As Net.HttpWebRequest = _
            DirectCast(Net.WebRequest.Create("http://msdn.microsoft.com/"), _
            Net.HttpWebRequest)
        'Using closes the response and the reader even if the read fails
        Using wbResp As Net.HttpWebResponse = _
            DirectCast(wbReq.GetResponse(), Net.HttpWebResponse)
            Using myreader As New IO.StreamReader(wbResp.GetResponseStream())
                Doc.write(myreader.ReadToEnd())
            End Using
        End Using
        Doc.close()

        'the part below is not completely done for all tags.
        'it can (will) be necessary to tailor that to your needs.
        'note: a parent's innerText includes its children's text,
        'so nested text is appended more than once.
        Dim sb As New System.Text.StringBuilder
        For i As Integer = 0 To Doc.all.length - 1
            Dim hElm As mshtml.IHTMLElement = _
            DirectCast(Doc.all.item(i), mshtml.IHTMLElement)
            Select Case hElm.tagName.ToLower
                Case "body", "html", "head", "form"
                Case Else
                    If hElm.innerText <> "" Then
                        sb.Append(hElm.innerText & vbCrLf)
                    End If
            End Select
        Next
        TextBox1.Text = sb.ToString
    End Sub
End Class 


It would be greatly appreciated.

Kind regards,

Valarez P.
VBNETMAN

1. MSHTML is the COM object that is the engine of past versions of Internet Explorer, although I'm not sure if it still is in IE8.
2. The native code WebBrowser control is a wrapper around MSHTML and a very cursory check of MSDN suggests the .NET WebBrowser control is a wrapper around it as well.

In general, using the WebBrowser control in .NET is likely to be easier for you than going directly against MSHTML.

As an unsolicited editorial based on a fair amount of experience using MSHTML via the native code WebBrowser control in C++, I believe that many questions that appear in this forum on extracting web page elements would be made simpler and less error-prone by using the .NET WebBrowser control. Of course, if the actual find requirement is simple then regular expressions may have a time advantage over loading a page into the WebBrowser control and having it parsed.

How to proceed with your problem, especially the navigation part, depends on whether you're fielding page requests on a server or creating a browser window and gaining access to it.

In either case, to extract the value of a text control the function you need is IHTMLDocument3::getElementById, assuming the input field containing the target URL has an id value. If not, you will need IHTMLDocument3::getElementsByTagName and parade through the collection until you find the one you want based on some other criteria.

This link contains a good overview example of creating and manipulating a browser window, and illustrates both IHTMLDocument3::getElementById and navigation.

http://weblogs.asp.net/joberg/articles/405283.aspx

The example is C#. If you have difficulty converting C# to VB, perhaps you can get some help from folks fluent in both.

Ed McElroy

  • Edited by E McElroy, Tuesday, September 22, 2009 12:55 AM
E McElroy




Thank you, Ed, for your quick reply.

Yes, I have been using WebBrowser1, and even added WebBrowser2 and WebBrowser3 to try to speed things up a bit, but it still seems buggy and takes quite a bit of time.

I became curious and decided to ask this question when I noticed that there are quite a few programs out there that parse many URLs (thousands) extremely fast, return the targeted data (the data to extract) into a specified field, and later save or export it to a .txt or .csv file.

I then tried to do the same thing, parsing and extracting from many web pages, by using WebBrowser1, 2 and 3, but ran into the problem that the program is not as fast or accurate. It's very buggy and at times freezes up. I think it's because the WebBrowser controls can't handle so many requests at a time.

So I began to think that using the WebBrowser control isn't the best way and that there is a better, more effective way of accomplishing this. At least it seemed to me that these other programmers are using "a better way". :)

Anyway, I didn't mention that I'm a noob at this and couldn't quite understand the link that you directed me to. So I was wondering if you could show an example of the regular expressions method or MSHTML.

Could you or someone show an example for extracting the text Hiphopkitshop@gmail.com from the following page:
http://www.google.com/search?hl=en&rlz=1T4DKUS_enUS288US288&q=hiphopkitshop%40gmail.com&start=0&sa=N

If you could accomplish these 2 features as well:

1- Navigate via TextBox1.Text to the above URL.
2- Extract HIPHOPKITSHOP@GMAIL.COM (one of my emails) and send it to TextBox2.Text, with multiple lines.

I would really appreciate it if you or someone could show a code sample for this. It should only take a few minutes to do this.

Thanks to whoever can help me with this. I will study any code sample that anyone can provide.







VBNETMAN
The link you gave results in a google search results page. The example target string, HIPHOPKITSHOP@GMAIL.COM, appears both in text and as part of an href value. When it appears in text, there are formatting tags around the '@' character.

Knowing ahead of time what the target string looks like on a specific page, it's pretty easy to construct a regular expression that will locate the string on that page. The problem is, this is not going to be useful. Presumably, you need to locate email addresses or urls when you don't know ahead of time what the string is or what its exact setting is. There are a great many threads in this forum on locating urls in specific circumstances. A general case regular expression to do this strikes me as being rather complicated and quite possibly error prone. I would guess that some combination of programming and regular expressions would be required for a reliable solution. However, this problem is one that is likely to have already been solved, in fact solved many times over, so my advice is to check the programming web sites looking for existing code that will do this.
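As an illustration of the "known page layout" case described above, here is a minimal sketch, not a general solution. It assumes the address may have inline tags wrapped around the '@' and may also appear inside an href value, as on the results page. The module name EmailSketch, the function ExtractEmails, and the sample markup are made up for this example, and the address pattern is deliberately simple and will miss or mis-match unusual addresses.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module EmailSketch
    ' Hypothetical helper: finds things that look like email addresses in raw HTML,
    ' tolerating tags placed directly around the "@" character.
    Function ExtractEmails(ByVal html As String) As List(Of String)
        Dim found As New List(Of String)
        ' local part, optional tags, "@", optional tags, domain, dot, top-level name
        Dim pattern As String = _
            "[A-Z0-9._%+-]+(?:<[^>]+>)*@(?:<[^>]+>)*[A-Z0-9.-]+\.[A-Z]{2,}"
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            ' Remove any tags captured inside the match, then normalize the case
            Dim address As String = Regex.Replace(m.Value, "<[^>]+>", "")
            address = address.ToLowerInvariant()
            If Not found.Contains(address) Then found.Add(address)
        Next
        Return found
    End Function

    Sub Main()
        ' Assumed sample markup, loosely modeled on a search results page
        Dim sample As String = "<p>Contact: hiphopkitshop<em>@</em>gmail.com</p>" & _
            "<a href=""mailto:HIPHOPKITSHOP@GMAIL.COM"">mail</a>"
        For Each address As String In ExtractEmails(sample)
            Console.WriteLine(address)
        Next
    End Sub
End Module
```

In a form, the returned list could be joined with vbCrLf and assigned to a multiline TextBox2.Text. Duplicates are dropped after lowercasing, so the text and href occurrences count as one address.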

As for the .NET WebBrowser, my experience is not with it but with MSHTML and the native-code WebBrowser in C++. If you had some difficulty with the .NET WebBrowser control, I suspect that the problems may have been from inexperience using it. Here is a link to a much simpler example, a walkthrough which takes a URL in a textbox and uses the .NET WebBrowser to navigate to that page:

http://msdn.microsoft.com/en-us/library/aa290341(VS.71).aspx

Regretfully, I don't think I can add much additional help on this so I'm going to defer to others in the forum who, hopefully, will have some specific recommendations on general case url extraction code and developer web sites which might be good places to check.

Ed McElroy
E McElroy
"I really became quite curious & decided to ask this question when I noticed that there are quite a few programs out there that are extremely fast with parsing many urls (1000's) in such a short period of time and returning the targeted data"

To my knowledge, these tools use neither MSHTML nor the WebBrowser control.
They just use a connection component to get the data as a string or stream.
For example, in VB6 you can use Inet in MSInet.ocx;
in Delphi you can use Indy. I don't know if .NET has such components.
Eping Wang
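On the open question of whether .NET has such a component: the .NET Framework ships System.Net.WebClient, and the HttpWebRequest class already used in the first post belongs to the same namespace. The following is a minimal sketch of retrieving a page as a string with no browser control involved. The module name FetchSketch and the function FetchPage are made up for this example, and error handling (timeouts, HTTP errors) is left out.

```vbnet
Imports System
Imports System.Net

Module FetchSketch
    ' Hypothetical helper: downloads the resource at the given URL as a string.
    Function FetchPage(ByVal url As String) As String
        Using client As New WebClient()
            Return client.DownloadString(url)
        End Using
    End Function

    Sub Main(ByVal args As String())
        ' The URL comes from the command line here; in a form it could be TextBox1.Text.
        If args.Length = 0 Then
            Console.WriteLine("usage: FetchSketch <url>")
            Return
        End If
        Dim html As String = FetchPage(args(0))
        Console.WriteLine("Downloaded " & html.Length.ToString() & " characters")
    End Sub
End Module
```

The returned string is the raw markup, so scripts are not run and nothing is rendered. That is likely a large part of why this route is faster than loading each page into a WebBrowser control.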
"I really became quite curious & decided to ask this question when I noticed that there are quite a few programs out there that are extremely fast with parsing many urls (1000's) in such a short period of time and returning the targeted data"

To my knowledge, these tools use neither MSHTML nor the WebBrowser control.
They just use a connection component to get the data as a string or stream.
For example, in VB6 you can use Inet in MSInet.ocx;
in Delphi you can use Indy. I don't know if .NET has such components.

To be exact, here are 2 videos of the type of applications that I'm interested in creating:


1- http://www.youtube.com/watch?v=CqDjbBT-gHo&feature=fvw

2- http://www.youtube.com/watch?v=bXPTDKkQcR4

Something similar to these programs. I want to be able to extract particular text from hundreds or even thousands of web pages simultaneously and I noticed that using webbrowser control causes errors.
VBNETMAN
Sorry, YouTube is blocked in China, so I can't see them.

1. Use Inet to retrieve the web pages as strings.
2. Use regex and string functions to parse the strings.
3. For this kind of work I recommend Win32 instead of .NET, for better performance.
4. Never use heavyweight components like MSHTML or WebBrowser (except for a single page).

Achieving the goal is one thing; making it practical in the real world is another.
Obtaining high performance is not easy.
www.wonderstudio.cn
Eping Wang
Just to add an addendum to Eping's comments:

1. As someone who's spent way too many years in assembler, C, and C++ (when I should have been in investment banking), I'd be the last to argue against the potential speed of a native code application. However, I think most people concluded some time ago that the increased development speed and reduced exposure to error offered by the newer managed frameworks and their languages is more important in all but advanced scientific applications.

2. A few years ago I tested the .NET regular expressions engine by looking for the letter 'e' in a 1,200+ page document and came back with a MatchCollection with over 180,000 Match objects in just a couple of seconds, and that was on a machine slower than those available today. Of course there are better and quicker ways to look for the letter 'e'; this was just a test of how fast .NET can put together all the supporting structures involved in a Match object. I was quite impressed.

3. I have read in the C++ forum but haven't personally verified that TR1 regular expressions are not as full-featured as those in .NET or Perl or Java. If true, the significance is that it would make an error-prone task, general-case URL extraction, even more difficult.

4. In addition to Eping's suggestion of INET, MSXML may have the capability that you need and MSXML is usable in script languages.

5. To me, the major difficulty of this problem lies not in how to get the web pages but how to extract URLs in the general case. As I mentioned, I'm sure this problem has been satisfactorily solved many times so if this were my problem, I would be checking as many developer web sites as needed to get some already developed code to handle this. Once you have that code, the rest is not particularly difficult but will require you to spend some time in MSDN with either MSXML, INET, or something else that can pull in web pages as a stream.
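To give a feel for what point 5 involves, here is a deliberately naive sketch of pulling link targets out of a page that has already been retrieved as a string. It only handles quoted href attributes inside <a> tags, which is an assumption about the markup and not the general case discussed above. The module name LinkSketch, the function ExtractLinks, and the sample markup are made up for this example.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LinkSketch
    ' Hypothetical helper: returns the href values of <a> tags,
    ' resolved against the address the page was loaded from.
    Function ExtractLinks(ByVal html As String, ByVal baseUrl As String) As List(Of String)
        Dim links As New List(Of String)
        Dim baseUri As New Uri(baseUrl)
        ' Naive pattern: single- or double-quoted href attributes in <a ...> tags only
        Dim pattern As String = "<a\b[^>]*?\bhref\s*=\s*[""']([^""']*)[""']"
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            Dim resolved As Uri = Nothing
            ' Uri.TryCreate turns relative links into absolute ones and skips malformed values
            If Uri.TryCreate(baseUri, m.Groups(1).Value, resolved) Then
                links.Add(resolved.AbsoluteUri)
            End If
        Next
        Return links
    End Function

    Sub Main()
        ' Assumed sample markup with one relative and one absolute link
        Dim sample As String = "<a href=""/library/"">Library</a> " & _
            "<A class='x' HREF='http://example.com/page?a=1'>Page</A>"
        For Each link As String In ExtractLinks(sample, "http://msdn.microsoft.com/")
            Console.WriteLine(link)
        Next
    End Sub
End Module
```

Unquoted attributes, links built by script, HTML entities such as &amp; inside the value, and URLs that appear only in plain text are all outside what this pattern covers. Those cases are where the error-prone part of the general problem lies.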

Looking at your postings, my hunch is that your experience is somewhat limited and we can conclude that you're not likely to be working on the missile defense system. In that context, I would stick with .NET or some kind of script language - Perl would be a very good possibility because there's likely to be URL extraction code already written in Perl available on Perl developer sites. I would let the choice of programming language and technology be determined by the best code you can get from developer sites.

I certainly agree with Eping's observation that using MSHTML or its wrapper, the WebBrowser control, will be much slower than analyzing a web page as a text stream. I also agree with Eping's aphorism at the end, but I will add to it that reliability is the hardest of all to achieve, and all of it is useless if you're late to market with the product.

Ed McElroy
E McElroy
Just to piggyback off Ed, read William's article titled:

Are .NET Regular Expressions Fast Enough For You
John Grove - TFD Group, Senior Software Engineer, EI Division, http://www.tfdg.com
JohnGrove

Thanks, Ed, elaborate and comprehensive.


www.wonderstudio.cn
Eping Wang
