Friday, October 17, 2008

How to load MSHTML with data

Download the sample code for this article.

This week I needed to write code to parse HTML pages and find the IMG tags to extract the images. I saw that the IHTMLDocument2 interface includes the get_images() function, so I figured that this would be a trivial problem to solve, especially since MSHTML supports running outside of a browser window.

This "trivial problem" has now taken me the better part of a week to solve. Most of the solutions I found are buggy, wrong or incomplete, and some of them come from Microsoft technical articles.

For this project, I was starting with a raw buffer that contained HTML data. The buffer was multibyte, not Unicode, but the encoding of the buffer was not known in advance, so I couldn't convert the data to Unicode without parsing the page to extract the charset tag. Since parsing is what MSHTML was supposed to do for me in the first place, this made the Unicode calls useless.

1. IMarkupServices

The first article I found was from CodeProject, titled "How to identify the different elements in the selection of web pages?" Although the article didn't solve my exact problem, it did lead me to IMarkupServices and ParseString(), with supporting information from CodeGuru in "Lightweight HTML Parsing Using MSHTML."

Unfortunately, I was never able to make this work. As it turned out, this solution wouldn't have worked anyway because ParseString() requires a Unicode string and I couldn't provide one. [Note: I now believe that I needed a message loop to make this work, as in #3 below, but I didn't verify this.]

Update 12/21/2010: I was finally able to make IMarkupServices work properly using sample code from MSDN. IMarkupServices does not require a message loop, and you can read HTML with any character set by using ParseGlobal() instead of ParseString(). However, IMarkupServices does have some strange drawbacks. The IHTMLDocument2 object it returns is not fully functional: you cannot QI for IPersistStreamInit to serialize it, and IHTMLDocument3::get_documentElement() simply fails. I've also found comments that IMarkupServices fails outright on many web pages.

2. IHTMLDocument2::write()

My next try was to use IHTMLDocument2::write(). I found this in another article on CodeProject titled "Loading and parsing HTML using MSHTML. 3rd way." This was the easiest of the various solutions to use because it worked the first time and required no message loop (more on this in #3). However, the write() function also requires the HTML to be passed as Unicode, which meant that this solution had the same problem as #1.

The only challenge to using IHTMLDocument2::write() is properly setting up the SAFEARRAY. Although the sample code in MSDN shows how to do this, it's complicated and easy to break.

3. IPersistStreamInit::Load()

My third try was to use IPersistStreamInit::Load(). This is the solution recommended by Microsoft in the article "Loading HTML content from a Stream." This function caused me no end of aggravation. No matter what I tried, I couldn't get it to work. My call to Load() would return success, but my data wouldn't appear in the document.

I found other people with the same problem. It turns out that a few important details were omitted in the MSDN article. The first hint I found was in a Usenet post on microsoft.public.windows.inetexplorer.ie5.programming.components.webbrowser_ctl, where Mikko Noromaa recommended using CoInitializeEx(NULL,COINIT_MULTITHREADED) instead of CoInitialize(NULL).

My initial results with this change were perfect; for the first time my data could actually be retrieved from IHTMLDocument2. However, when I tried to change the data, I started seeing really weird crashes. Examining the call stack showed that my calls were being marshaled across COM apartments, which shouldn't have been necessary. Eventually, I determined that MSHTML wants to live in an STA (single-threaded apartment). When I declared my thread to be an MTA, COM automatically created a new thread to host MSHTML and marshaled my calls to that thread. Something was going wrong in those cross-thread calls and I didn't have the inclination to debug it.

I finally found a blog post by Scott Hanselman that described how to make Load() work properly. Recent versions of MSHTML require a message loop. Apparently, older versions of MSHTML did not. Without a message loop, Load() returns success but the actual work of loading the HTML is performed asynchronously. I had to add this code after Load() in order to make it work:
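// Pump messages until MSHTML reports that it is no longer loading.
for (;;)
{
    CComBSTR bstrReady;
    hr = pDoc->get_readyState(&bstrReady);
    if (FAILED(hr) || bstrReady != L"loading")
        break;
    // Stricter alternative: loop while readyState != "complete"

    MSG msg;
    while (::PeekMessage(&msg, NULL, 0, 0, PM_NOREMOVE))
    {
        // PumpMessage() removes and dispatches one message;
        // it returns FALSE when WM_QUIT is received.
        if (!AfxGetApp()->PumpMessage())
            return;
    }
}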

Unfortunately, after going through all this time and effort to make Load() function, I discovered a small fact that made all of this effort useless: there's a bug in the MIME sniffing code used by IPersistStreamInit::Load(). The bug is documented in the MSDN KB article "BUG: IPersistStreamInit::Load() Displays HTML Files as Text." You can read the discussion in Scott Hanselman's blog entry to understand why this is an issue. Update 12/21/2010: This bug was fixed in IE7 and the KB article is now marked as "retired content."

4. IPersistFile

My next try was to use IPersistFile, like this:

CComQIPtr<IPersistFile> pPersist(pDoc);
hr = pPersist->Load(L"C:\\abc.html", STGM_READ);
pPersist.Release();

As in #3, IPersistFile::Load() requires a message loop to function properly.

Unfortunately, using this function was not optimal for my situation because I already had the HTML document in memory. I didn't want to write it out to disk again and slow things down.

5. IPersistMoniker

I finally found the correct solution in a post by Craig Monro in microsoft.public.inetsdk.programming.html_objmodel.

The solution is to use IPersistMoniker to feed the stream to MSHTML. By using IPersistMoniker, you avoid the MIME sniffing bug, you don't need a Unicode buffer, and you can use in-memory data.

There is one problem with the solution posted by Mr. Monro. The SetHTML function in his example takes a Unicode string for the HTML data, but this isn't a requirement to use IPersistMoniker with MSHTML. I changed the function to use a byte buffer instead of a Unicode buffer and it worked fine. I also used SHCreateMemStream() to avoid having to make a second copy of the data buffer.

Preventing Execution [Added 12/21/2010]

According to the documentation in Microsoft's WalkAll sample, "If the loaded document is an HTML page which contains scripts, Java Applets and/or ActiveX controls, and those scripts are coded for immediate execution, MSHTML will execute them by default." This is very important to understand if the HTML code you are loading is not trusted because you begin executing the page as soon as it's loaded. This applies to all forms of loading described above except IMarkupServices.

The solution to this problem is not shown in my sample code, but it is shown in the WalkAll sample. Look in the comments at the beginning of the WalkAll sample for "IOleClientSite and IDispatch".

Updating the Document [Added 12/30/2010]

Once you've loaded the HTML document, you often want to update it and save the result. I found it necessary to set designMode to "On" in order to make changes "stick." Otherwise the change would appear to work, but would be discarded when I tried to save the document. You must set designMode to "On" after the document is loaded, because loading a document resets the designMode value.

If you want to save the document, QI for IPersistStreamInit and call Save(). (This doesn't work for documents loaded with IMarkupServices.) However, be aware that MSHTML cannot faithfully recreate the document that you loaded. In other words, if you do a load/save cycle, you can't diff the new file with the original and get a reasonable result. MSHTML normalizes tags, removes newlines, and makes many other changes. There does not appear to be any way to force MSHTML to save a byte-perfect form of the document.

As an alternative to IPersistStreamInit, you can get a pointer to the document element with IHTMLDocument3::get_documentElement(), and then call get_outerHTML(), but this strategy is imperfect for the following reasons:
  • Any DOCTYPE or xml declaration at the beginning of the document is discarded.
  • Any attributes on the BODY tag are discarded.
  • Character encoding is lost because you always end up with wide character Unicode. Even worse, any CHARSET declaration in the HEAD is preserved, so non-English documents can display as garbage.


Performance [Added 12/30/2010]

One of my concerns was the performance that would be offered by the MSHTML control, which is hardly a lightweight control. To better understand the behavior, I ran a series of performance tests on a 3GHz Core i7 processor. What I learned is that the time required to parse HTML is dwarfed by the time required to create an MSHTML object. In my tests, if the MSHTML object is created once and then reused, a thread can load a 2K file about 1000 times in one second. If the MSHTML object is created and destroyed on each iteration, performance drops to 25 loads per second, a 40x performance hit. The lesson is that MSHTML should be created once and reused. The reuse strategy is the fundamental reason IPersistStreamInit exists as a separate interface from IPersistStream.


Conclusion

This is one of the more difficult problems I've worked on lately. Between bad examples, bad documentation, bad 3rd party advice, and a plethora of different strategies, finding this solution was far more difficult than I expected. I hope this article saves some others from the same frustration.

Download the sample code for this article.

12 comments:

  1. Thank you sir, you have indeed saved me some time.

    I am converting a function that used some really bad string::find code and I too have found the documentation on this stuff to be very short of adequate.

    Anyway, I substituted the enclosing program's "keep the UI alive while we think" call (which pumps messages) into your loop, passed the filename (I'm reading from a file named by a c++ string) to the IPersistFile interface using IPersistFile::Load(CComBSTR(filename.c_str()) and all is working for me.

  2. Does the IPersistMoniker solution work with IE 8.0? I can load normal text using the method but it will not load bmp.

    The method documented by microsoft works great with IE 6.0 and 7.0 but fails with IE 8.0.

  3. Loading a bitmap *may* work if you properly set the filename ahead of time. This isn't shown in the article, but I know I ran across that technique while researching the article. Unfortunately, I don't remember the details.

    You didn't give me enough information to comment on "the method documented by Microsoft." Text? Bitmap? Unicode?

  4. I have a jpg / bmp / txt files that are a part of a structured storage document. It would be great if I don't have to create a copy of files.

    Microsoft documentation for loading html content from stream (There are known bugs in it):

    http://msdn.microsoft.com/en-us/library/aa752047(VS.85).aspx

    I have also tried the IPersistMoniker approach with the same results:

    1. Works great with IE 6.0 and 7.0 and 8.0 for text files.

    2. Works great with IE 6.0 and 7.0 for bmp files but does not work for 8.0. Something looks broken.

    My post on google groups:
    http://groups.google.com/group/comp.os.ms-windows.programmer.win32/browse_thread/thread/41322df64a2a827c#

  5. Any clues?

    I did not get what you meant by setting the file name early.

  6. There's a way to set the filename for the page you are loading. I think it requires you to implement another COM interface that's a callback. Sorry I can't be more specific, but that's all I remember.

  7. Have you observed a hang using the "2. IHtmlDocument2::write()" method? I have seen it on a few websites (e.g., https://usaa.com). There are few references on google where someone else has mentioned this and this may have caused due to the use of "frameset" in html.

  8. i run into your article ,i dont understand wht u didnt use createdocumentfromurl i think
    it is working fine and u dont need to use webbrowser

  9. createDocumentFromUrl wants a URL. Presumably a file URL is okay, but my data is already in memory and writing the data to disk would be a significant performance hit for my app, which has to process tens of thousands of documents at a time. In any case, I was already successful using IPersistFile to read disk-based documents.

  10. I'm working on an application where I need to find specific text in a web page. After finding the text, I need to append a logo and keep a hyperlink for the identified text. Can you help me with which approach is best, and is it possible to provide a code sample?

  11. Hi Josh,

    For questions like this that are specific to your application, I can provide help on a consulting basis. If you are interested, you can contact me by clicking on the Email link in my profile.

  12. Nice article.
    I want to share my experience with methods I ve tried.
    2. IHtmlDocument2::write() - failed to load jquery script;
    3. IPersistStreamInit::Load() - steals focus from containing application
