Getting HTML DOM from URL…
January 5, 2007
I came across many sites and forum threads with titles along the lines of “How can I get an HtmlDocument from a URL?”. Several of them suggested good ideas; one was to use the AxWebBrowser (MS WebBrowser ActiveX) component. I got curious about the problem myself, googled a lot, and found that mshtml 4.0 exposes an API for this, createDocumentFromUrl(), on the IHTMLDocument4 interface.
After more googling on this API, I learned that there are some issues with using it. Here are some standard issues that developers might have come across.
1. VB .NET implementation runs ok, but not the C# one.
2. AccessViolation : Attempted to read or write protected memory.
3. ReadyState never changes from “loading”.
Here’s a solution to all of the above.
// code starts here
// Project needs a reference to mshtml 4
using System;
using System.Runtime.InteropServices;
using System.Runtime.InteropServices.ComTypes;
using System.Windows.Forms;

[ComImport, Guid("0000010c-0000-0000-C000-000000000046"), InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
interface IPersist
{
    void GetClassID(out Guid pClassId);
}

[ComImport, Guid("7FD52380-4E07-101B-AE2D-08002B2EC713"), InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
interface IPersistStreamInit : IPersist
{
    // Redeclared so the vtable slot order matches the native interface
    new void GetClassID(out Guid pClassId);
    [PreserveSig]
    int IsDirty();
    void Load(IStream pStm); // System.Runtime.InteropServices.ComTypes.IStream
    void Save(IStream pStm, [In, MarshalAs(UnmanagedType.Bool)] bool fClearDirty);
    void GetMaxSize(out long pCbSize);
    void InitNew();
}
// These COM interop declarations are what help with the AccessViolation issue.
// Here's a function that takes a URL as the input parameter and prepares an HTMLDocument object from it.
private string GetHTML(string url)
{
    mshtml.HTMLDocumentClass htmldoc = new mshtml.HTMLDocumentClass();
    mshtml.IHTMLDocument2 htmldoc2;
    mshtml.IHTMLDocument3 htmldoc3;

    // This is a very important part of the code.
    // If it is not done, an exception is raised:
    // [AccessViolationOccured: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.]
    IPersistStreamInit ips = (IPersistStreamInit)htmldoc;
    ips.InitNew();

    htmldoc2 = (mshtml.IHTMLDocument2)htmldoc.createDocumentFromUrl(url, null);
    while (htmldoc2.readyState != "complete")
    {
        // This is also an important part: without DoEvents() the app hangs at "loading"
        Application.DoEvents();
    }

    htmldoc3 = (mshtml.IHTMLDocument3)htmldoc2;
    return htmldoc3.documentElement.innerHTML;
}
//Code ends here
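The readyState loop above spins forever if the document never reaches “complete” (issue 3 in the list). A minimal sketch of a bounded wait follows. The names ReadyStateWait and WaitUntil are made up for illustration and are not part of mshtml or the .NET Framework; the helper only assumes that the caller can supply a readiness check and a message-pump callback.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper (not part of mshtml or the framework):
// polls a condition, pumping messages between polls, until a timeout.
static class ReadyStateWait
{
    // Returns true if isReady() became true before the timeout elapsed,
    // false if the wait was abandoned.
    public static bool WaitUntil(Func<bool> isReady, Action pump, TimeSpan timeout)
    {
        Stopwatch clock = Stopwatch.StartNew();
        while (!isReady())
        {
            if (clock.Elapsed > timeout)
            {
                return false; // gave up: the condition was never met
            }
            pump(); // let pending window messages be processed
        }
        return true;
    }
}
```

In GetHTML the loop could then read something like ReadyStateWait.WaitUntil(() => htmldoc2.readyState == "complete", Application.DoEvents, TimeSpan.FromSeconds(30)), with the caller deciding what to do when it returns false. The 30 second figure is only an example value.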
You can modify the function depending on your needs. Working on this topic helped me a lot in understanding the MSHTML library and a bit of COM; as a .NET developer I’m not very familiar with COM technology. If anyone finds bugs in the above code, just post them in the comments section.
-Bugs!
January 25, 2007 at 4:39 am
Hi, your code helps us learn the programming language, thanks. Poonam from Jondhale here.
November 8, 2007 at 5:14 pm
Thank you!!! After days of struggling to find a solution, you have solved the problem.
March 3, 2008 at 3:13 pm
Hi
I used your code, but I always get an error at the line
IPersistStreamInit ips = (IPersistStreamInit)htmldoc;
the error is:
Unable to cast COM object of type ‘mshtml.HTMLDocumentClass’ to interface type ‘IPersistStreamInit’. This operation failed because the QueryInterface call on the COM component for the interface with IID ‘{3E64EFD9-EF55-4C1E-9FFE-4BD1251A9A6F}’ failed due to the following error: No such interface supported (Exception from HRESULT: 0x80004002 (E_NOINTERFACE)).
March 4, 2008 at 10:34 am
Hi Khayralla,
Please send me the class code at:
vaibhav[dot]gaikwad[at]gmail[dot]com
I’ll try to dig into it for you.
-Bugs!
March 11, 2008 at 7:03 pm
Hi,
Your COM GUIDs are wrong; please verify them against the blog article or MSDN.
The code will work fine if the GUIDs are right.
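The IID in the error message above ({3E64EFD9-…}) is not the one declared for IPersistStreamInit in the post (7FD52380-4E07-101B-AE2D-08002B2EC713). A quick way to check what a declaration actually carries is to read Type.GUID, sketched below. The IidCheck class and HasIid method are hypothetical names used only for this illustration.

```csharp
using System;

// Hypothetical helper: compares the GUID attached to an interface
// declaration with the IID it is expected to carry.
static class IidCheck
{
    public static bool HasIid(Type interfaceType, string expectedIid)
    {
        // Type.GUID reflects the [Guid(...)] attribute when one is present;
        // without it the runtime reports a generated value instead.
        return interfaceType.GUID == new Guid(expectedIid);
    }
}
```

For the declarations in the post, IidCheck.HasIid(typeof(IPersistStreamInit), "7FD52380-4E07-101B-AE2D-08002B2EC713") should be true. If it is false, the [Guid] attribute on the interface is missing or mistyped.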
-Bugs!
February 11, 2011 at 3:16 am
Thank you very much! It’s really working.
I need to fill in a web form using this, so I came up with the code below. It doesn’t seem to work, though: the account I tried to create doesn’t exist on the site, even though the code identifies all the text fields and the button correctly…
////
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Runtime.InteropServices;
using System.Runtime.InteropServices.ComTypes;
using mshtml;
using SHDocVw;
using System.Threading;
namespace getHtmlFrom_url_worldpress.comWindowsForms1
{
    class Class1
    {
        //public String url = "http://www.bookmarkplace.com/new_user";
        [ComImport(), Guid("0000010c-0000-0000-C000-000000000046"), InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
        interface IPersist
        {
            void GetClassID(Guid pClassId);
        }
        [ComImport(), Guid("7FD52380-4E07-101B-AE2D-08002B2EC713"), InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
        interface IPersistStreamInit : IPersist
        {
            void GetClassID([In, Out] ref Guid pClassId);
            [return: MarshalAs(UnmanagedType.I4)]
            [PreserveSig()]
            int IsDirty();
            [return: MarshalAs(UnmanagedType.I4)]
            [PreserveSig()]
            void Load(UCOMIStream pStm); //System.Runtime.InteropServices.ComTypes.IStream
            [return: MarshalAs(UnmanagedType.I4)]
            [PreserveSig()]
            void Save(UCOMIStream pStm, [In, MarshalAs(UnmanagedType.Bool)] bool fClearDirty);//System.Runtime.InteropServices.ComTypes.IStream
            void GetMaxSize([Out]long pCbSize);
            [return: MarshalAs(UnmanagedType.I4)]
            [PreserveSig()]
            void InitNew();
        }
        //
        public void GetHTML(string url)
        {
            mshtml.HTMLDocumentClass htmldoc;
            htmldoc = new mshtml.HTMLDocumentClass();
            mshtml.IHTMLDocument2 htmldoc2;
            mshtml.IHTMLDocument3 htmldoc3;
            HTMLDocument doc2 = new HTMLDocument();
            // This is ver important part of the code
            //If not done it raises exception.
            //[AccessViolationOccured: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.]
            IPersistStreamInit ips = (IPersistStreamInit)htmldoc;
            ips.InitNew();
            htmldoc2 = (mshtml.IHTMLDocument2)htmldoc.createDocumentFromUrl(url, null);
            while (htmldoc2.readyState != "complete")
            {
                //This is also a important part, without this DoEvents() appz hangs on to the "loading"
                Application.DoEvents();
            }
            htmldoc3 = (mshtml.IHTMLDocument3)htmldoc2;
            //return htmldoc3.documentElement.innerHTML;
            mshtml.IHTMLDocument3 document = null;
            document = htmldoc3 as mshtml.IHTMLDocument3;
            //I’m getting elements by tagname input
            mshtml.IHTMLElementCollection colHTML = document.getElementsByTagName("input");
            //Loop over them to find the ones you want
            //This is not pretty here because I just did this
            //so it will keep the code together and simple
            //Ideally, you want to get you filter criteria from
            //a config or database etc…You might also use
            //an interface that defines what you need
            foreach (mshtml.HTMLInputElement el in colHTML)
            {
                //Example gets an input element with name=username
                if (el.id == "user_username")
                {
                    el.value = "nade123nade123nad";
                }
                if (el.id == "user_password1")
                {
                    el.value = "nade123123nade";
                }
                if (el.id == "user_confirm_password")
                {
                    el.value = "nade123123nade";
                }
                if (el.id == "user_first_name")
                {
                    el.value = "nadenadedamme";
                }
                if (el.id == "user_last_name")
                {
                    el.value = "dammenade";
                }
                if (el.id == "user_email")
                {
                    el.value = "nadsam89@yahoo.com";
                }
                if (el.id == "user_year")
                {
                    el.value = "1984";
                }
                if (el.id == "user_month")
                {
                    el.value = "11";
                }
                if (el.id == "user_day")
                {
                    el.value = "20";
                }
                //Create button and click to submit
                if (el.type == "submit" && el.name == "commit")
                {
                    mshtml.HTMLInputElement btnSubmit = el;
                    btnSubmit.click();
                    Thread.Sleep(10000000);
                }
                //
            }
            //Code ends here
            //
        }
    }
}
////
Can you help with that, please?