Getting HTML DOM from URL…

January 5, 2007

I came across many sites and forums with topics along the lines of “How can I get an HtmlDocument from a URL?”. Many of them suggested good ideas; one was to use the AxWebBrowser (MS WebBrowser ActiveX) component. I got curious about the problem too, googled a lot, and found that there is an API, createDocumentFromUrl(), in the IHTMLDocument4 interface of mshtml 4.0.

After more googling on this API, I learned that there are some issues with using it. Here are some standard issues that developers might have come across.

1. VB .NET implementation runs ok, but not the C# one.  :)

2. AccessViolation : Attempted to read or write protected memory.

3. ReadyState never changes from “loading”.

Here’s a solution to all of the above.

// code starts here
// Project needs a reference to mshtml 4
using System;
using System.Runtime.InteropServices;
using System.Runtime.InteropServices.ComTypes; // IStream
using System.Windows.Forms;                    // Application.DoEvents()

[ComImport, Guid("0000010c-0000-0000-C000-000000000046"), InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
interface IPersist
{
    void GetClassID(out Guid pClassId);
}

[ComImport, Guid("7FD52380-4E07-101B-AE2D-08002B2EC713"), InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
interface IPersistStreamInit : IPersist
{
    // Redeclared with "new" so the method order matches the native vtable.
    new void GetClassID(out Guid pClassId);

    // With PreserveSig the raw HRESULT is returned instead of being turned into an exception.
    [PreserveSig]
    int IsDirty();

    [PreserveSig]
    int Load(IStream pStm);

    [PreserveSig]
    int Save(IStream pStm, [In, MarshalAs(UnmanagedType.Bool)] bool fClearDirty);

    void GetMaxSize(out long pCbSize);

    [PreserveSig]
    int InitNew();
}
// These are the COM interop declarations that help with the AccessViolation issue :)
// Here's a function which takes a URL as the input parameter, prepares an HTML document object from it and returns its HTML.
private string GetHTML(string url)
{
    mshtml.HTMLDocumentClass htmldoc = new mshtml.HTMLDocumentClass();

    // This is a very important part of the code.
    // If it is not done, an exception is raised:
    // [AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.]
    IPersistStreamInit ips = (IPersistStreamInit)htmldoc;
    ips.InitNew();

    mshtml.IHTMLDocument2 htmldoc2 = (mshtml.IHTMLDocument2)htmldoc.createDocumentFromUrl(url, null);
    while (htmldoc2.readyState != "complete")
    {
        // This is also an important part: without DoEvents() the app hangs on "loading".
        Application.DoEvents();
    }

    mshtml.IHTMLDocument3 htmldoc3 = (mshtml.IHTMLDocument3)htmldoc2;
    return htmldoc3.documentElement.innerHTML;
}
// Code ends here
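
The wait loop above spins until readyState becomes “complete”, so a page that never finishes loading would keep it spinning forever. Below is a minimal sketch of a bounded wait, not part of the original code: ReadyStateWaiter, WaitUntil and the 30 second timeout in the usage comment are hypothetical names and values chosen for illustration. The helper itself has no dependency on mshtml; the caller passes in the condition and the message pump.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper (not from the original post): a polling loop with a timeout.
static class ReadyStateWaiter
{
    // Polls isReady() until it returns true or the timeout elapses.
    // pump() runs between polls; in a WinForms app this would be Application.DoEvents.
    // Returns true if the condition was met, false if the wait timed out.
    public static bool WaitUntil(Func<bool> isReady, TimeSpan timeout, Action pump)
    {
        Stopwatch clock = Stopwatch.StartNew();
        while (!isReady())
        {
            if (clock.Elapsed > timeout)
            {
                return false;
            }
            pump();
        }
        return true;
    }
}

// Possible usage inside GetHTML (assumes the htmldoc2 and url variables from the function above):
//
// bool loaded = ReadyStateWaiter.WaitUntil(
//     () => htmldoc2.readyState == "complete",
//     TimeSpan.FromSeconds(30),
//     Application.DoEvents);
// if (!loaded) throw new TimeoutException("Document did not finish loading: " + url);
```

Passing the pump in as a delegate keeps the helper free of Windows Forms, so the timeout logic can be exercised on its own.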

Depending on your needs, you can modify the function. This topic helped me a lot in understanding the MSHTML library and a bit of COM; as a .NET developer I’m not very familiar with COM technology. If anyone finds bugs in the above code, just post them in the comments section.

-Bugs! 


7 Responses to “Getting HTML DOM from URL…”

  1. Poonam Sheth Says:

    Hie, ur codes help us to gain our programming language, thanks. Poonam from jondhale here.

  2. Jeff Says:

    Thank you!!! After days of struggling to find a solution, you have solved the problem.

  3. Khayralla Says:

    Hi
    I use your code but always I got an error at the line
    IPersistStreamInit ips = (IPersistStreamInit)htmldoc;
    the error is:
    Unable to cast COM object of type ‘mshtml.HTMLDocumentClass’ to interface type ‘IPersistStreamInit’. This operation failed because the QueryInterface call on the COM component for the interface with IID ‘{3E64EFD9-EF55-4C1E-9FFE-4BD1251A9A6F}’ failed due to the following error: No such interface supported (Exception from HRESULT: 0x80004002 (E_NOINTERFACE)).

  4. Vaibhav Says:

    Hi Khayralla,
    Pls. send me the class code at :
    vaibhav[dot]gaikwad[at]gmail[dot]com

    I’ll try to dig that out for you.
    -Bugs!

  5. Bugs! Says:

    Hi,
    Your COM GUIDs are wrong; please verify them against the blog article or MSDN.
    The code will work fine if the GUIDs are right.

    -Bugs!
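
One way to check the GUID advice above is to compare the IID that .NET reports for the interface type with the one from the article. This is a minimal sketch under stated assumptions: IPersistStreamInitCheck and GuidCheck are hypothetical stand-in names, and the stand-in interface carries only the Guid and InterfaceType attributes, since the Guid attribute is what the check reads. In the error message quoted in comment 3, the IID is {3E64EFD9-…}, not 7FD52380-…, which suggests the attribute in that project differed from the article's.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical stand-in carrying the same Guid attribute as the article's IPersistStreamInit.
[Guid("7FD52380-4E07-101B-AE2D-08002B2EC713"), InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
interface IPersistStreamInitCheck { }

static class GuidCheck
{
    // Returns true when the GUID associated with the type equals the expected IID.
    // Type.GUID reflects the Guid attribute; without one, the runtime generates a GUID.
    public static bool HasExpectedIid(Type interfaceType, string expectedIid)
    {
        return interfaceType.GUID == new Guid(expectedIid);
    }
}

// Possible usage, with your own interface type in place of the stand-in:
//
// bool ok = GuidCheck.HasExpectedIid(
//     typeof(IPersistStreamInit), "7FD52380-4E07-101B-AE2D-08002B2EC713");
```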

  6. nadeeraynd Says:

    Thank you very much!!!! It’s really working..
    I need to fill in a web form using this, so I came up with this code, but it does not seem to work: no such account exists on the site after I run it, even though it identifies all the text fields and the button correctly…

    can you help with that please… !!!

  7. nadeeraynd Says:

    Thank you very much!!!! It’s really working..
    I need to fill in a web form using this, so I came up with this code, but it does not seem to work: no such account exists on the site after I run it, even though it identifies all the text fields and the button correctly…

    ////
    using System;
    using System.Collections.Generic;
    using System.ComponentModel;
    using System.Data;
    using System.Drawing;
    using System.Linq;
    using System.Text;
    using System.Windows.Forms;
    using System.Runtime.InteropServices;
    using System.Runtime.InteropServices.ComTypes;
    using mshtml;
    using SHDocVw;
    using System.Threading;

    namespace getHtmlFrom_url_worldpress.comWindowsForms1
    {
        class Class1
        {
            //public String url = "http://www.bookmarkplace.com/new_user";
            [ComImport(), Guid("0000010c-0000-0000-C000-000000000046"), InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
            interface IPersist
            {
                void GetClassID(out Guid pClassId);
            }

            [ComImport(), Guid("7FD52380-4E07-101B-AE2D-08002B2EC713"), InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
            interface IPersistStreamInit : IPersist
            {
                new void GetClassID(out Guid pClassId);
                [PreserveSig()]
                int IsDirty();
                [PreserveSig()]
                int Load(System.Runtime.InteropServices.ComTypes.IStream pStm);
                [PreserveSig()]
                int Save(System.Runtime.InteropServices.ComTypes.IStream pStm, [In, MarshalAs(UnmanagedType.Bool)] bool fClearDirty);
                void GetMaxSize(out long pCbSize);
                [PreserveSig()]
                int InitNew();
            }

            //
            public void GetHTML(string url)
            {
                mshtml.HTMLDocumentClass htmldoc = new mshtml.HTMLDocumentClass();
                // This is a very important part of the code.
                // If it is not done, an exception is raised:
                // [AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.]
                IPersistStreamInit ips = (IPersistStreamInit)htmldoc;
                ips.InitNew();
                mshtml.IHTMLDocument2 htmldoc2 = (mshtml.IHTMLDocument2)htmldoc.createDocumentFromUrl(url, null);
                while (htmldoc2.readyState != "complete")
                {
                    // This is also an important part: without DoEvents() the app hangs on "loading".
                    Application.DoEvents();
                }
                mshtml.IHTMLDocument3 htmldoc3 = (mshtml.IHTMLDocument3)htmldoc2;
                //return htmldoc3.documentElement.innerHTML;

                mshtml.IHTMLDocument3 document = htmldoc3;

                //I'm getting elements by tag name "input"
                mshtml.IHTMLElementCollection colHTML = document.getElementsByTagName("input");

                //Loop over them to find the ones you want.
                //This is not pretty here because I just did this
                //so it will keep the code together and simple.
                //Ideally, you want to get your filter criteria from
                //a config or database etc… You might also use
                //an interface that defines what you need.
                foreach (mshtml.HTMLInputElement el in colHTML)
                {
                    //Example: fills the input element with id=user_username
                    if (el.id == "user_username")
                    {
                        el.value = "nade123nade123nad";
                    }
                    if (el.id == "user_password1")
                    {
                        el.value = "nade123123nade";
                    }
                    if (el.id == "user_confirm_password")
                    {
                        el.value = "nade123123nade";
                    }
                    if (el.id == "user_first_name")
                    {
                        el.value = "nadenadedamme";
                    }
                    if (el.id == "user_last_name")
                    {
                        el.value = "dammenade";
                    }
                    if (el.id == "user_email")
                    {
                        el.value = "nadsam89@yahoo.com";
                    }
                    if (el.id == "user_year")
                    {
                        el.value = "1984";
                    }
                    if (el.id == "user_month")
                    {
                        el.value = "11";
                    }
                    if (el.id == "user_day")
                    {
                        el.value = "20";
                    }
                    //Find the submit button and click it to submit
                    if (el.type == "submit" && el.name == "commit")
                    {
                        mshtml.HTMLInputElement btnSubmit = el;
                        btnSubmit.click();
                        // Note: 10,000,000 ms is roughly 2.8 hours, and Sleep blocks this thread for that long.
                        Thread.Sleep(10000000);
                    }
                    //
                }
                //Code ends here
                //
            }
        }
    }

    ////

    can you help with that please… !!!

