This week I needed to write code to parse HTML pages and find the IMG tags to extract the images. I saw that the IHTMLDocument2 interface includes the get_images() function, so I figured that this would be a trivial problem to solve, especially since MSHTML supports running outside of a browser window.
This "trivial problem" has now taken me the better part of a week to solve. Most of the solutions I found are buggy, wrong, or incomplete, including some from Microsoft technical articles.
For this project, I was starting with a raw buffer that contained HTML data. The buffer was multibyte, not Unicode, but the encoding of the buffer was not known in advance, so I couldn't convert the data to Unicode without parsing the page to extract the charset tag. Since parsing is what MSHTML was supposed to do for me in the first place, this made the Unicode calls useless.
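To make the chicken-and-egg problem concrete, here is a rough sketch of the kind of byte-level prescan I would have needed just to find the charset before converting to Unicode. SniffCharset is a hypothetical helper, not part of MSHTML or of the sample code. It assumes an ASCII-compatible encoding, so it fails on UTF-16 input, and it ignores byte-order marks, HTTP headers, and charset strings that appear inside comments or scripts.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: return the value of the first "charset=" token found
// in the raw bytes, or an empty string if there is none.
std::string SniffCharset(const char* data, std::size_t len)
{
    static const char kKey[] = "charset";
    const std::size_t keyLen = sizeof(kKey) - 1;

    for (std::size_t i = 0; i + keyLen <= len; ++i) {
        // Case-insensitive match of "charset" at position i.
        std::size_t k = 0;
        while (k < keyLen &&
               std::tolower(static_cast<unsigned char>(data[i + k])) == kKey[k])
            ++k;
        if (k != keyLen)
            continue;

        // Expect optional whitespace, '=', optional whitespace, optional quote.
        std::size_t p = i + keyLen;
        while (p < len && std::isspace(static_cast<unsigned char>(data[p]))) ++p;
        if (p >= len || data[p] != '=')
            continue;
        ++p;
        while (p < len && std::isspace(static_cast<unsigned char>(data[p]))) ++p;
        if (p < len && (data[p] == '"' || data[p] == '\'')) ++p;

        // Collect the charset name up to the first delimiter.
        std::string value;
        while (p < len) {
            const unsigned char c = static_cast<unsigned char>(data[p]);
            if (c == '"' || c == '\'' || c == ';' || c == '>' || c == '/' ||
                std::isspace(c))
                break;
            value += static_cast<char>(c);
            ++p;
        }
        if (!value.empty())
            return value;
    }
    return std::string();
}
```

Even this simplified version is a small, fragile HTML parser, which is exactly the work I wanted MSHTML to do for me.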
1. IMarkupServices
The first article I found was from CodeProject, titled "How to identify the different elements in the selection of web pages?" Although the article didn't solve my exact problem, it did lead me to IMarkupServices and ParseString(), with supporting information from the CodeGuru article "Lightweight HTML Parsing Using MSHTML." However, ParseString() takes the HTML as a Unicode string, so it ran straight into the encoding problem described above.
2. IHTMLDocument2::write()
My next try was to use IHTMLDocument2::write(). I found this in another CodeProject article titled "Loading and parsing HTML using MSHTML. 3rd way." This was the easiest of the various solutions to use because it worked the first time and required no message loop (more on this in #3). However, the write() function also requires the HTML to be passed as Unicode, which meant that this solution had the same problem as #1.
The only challenge to using IHTMLDocument2::write() is properly setting up the SAFEARRAY. Although the sample code in MSDN shows how to do this, it's complicated and easy to break.
3. IPersistStreamInit::Load()
My third try was to use IPersistStreamInit::Load(). This is the solution recommended by Microsoft in the article "Loading HTML content from a Stream." This function caused me no end of aggravation. No matter what I tried, I couldn't get it to work. My call to Load() would return success, but my data wouldn't appear in the document.
I found other people with the same problem. It turns out that a few important details were omitted in the MSDN article. The first hint I found was in a Usenet post on microsoft.public.windows.inetexplorer.ie5.programming.components.webbrowser_ctl, where Mikko Noromaa recommended using CoInitializeEx(NULL,COINIT_MULTITHREADED) instead of CoInitialize(NULL).
My initial results with this change were perfect; for the first time my data could actually be retrieved from IHTMLDocument2. However, when I tried to change the data, I started seeing really weird crashes. Examining the call stack showed that my calls were being marshaled across COM apartments, which shouldn't have been necessary. Eventually, I determined that MSHTML wants to live in an STA (single-threaded apartment). When I declared my thread to be an MTA, COM automatically created a new thread to host MSHTML and marshaled my calls to that thread. Something was going wrong in those cross-thread calls and I didn't have the inclination to debug it.
I finally found a blog post by Scott Hanselman that described how to make Load() work properly. Recent versions of MSHTML require a message loop. Apparently, older versions of MSHTML did not. Without a message loop, Load() returns success but the actual work of loading the HTML is performed asynchronously. I had to add this code after Load() in order to make it work:
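    MSG msg;
    bool quit = false;
    while (!quit) {
        CComBSTR bstrReady;   // CComBSTR compares by value; a raw BSTR would compare pointers
        hr = pDoc->get_readyState(&bstrReady);
        if (FAILED(hr) || bstrReady != L"loading")
            break;            // stricter alternative: wait until readyState is "complete"
        while (!quit && ::PeekMessage(&msg, NULL, 0, 0, PM_NOREMOVE))
            quit = !AfxGetApp()->PumpMessage();   // FALSE means WM_QUIT was received
    }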
Unfortunately, after spending all this time and effort to make Load() work, I discovered a small fact that made it all useless: there's a bug in the MIME sniffing code used by IPersistStreamInit::Load(). The bug is documented in this MSDN KB article:
"BUG: IPersistStreamInit::Load() Displays HTML Files as Text." You can read the discussion in Scott Hanselman's blog entry to understand why this is an issue. Update 12/21/2010: This bug was fixed in IE7 and the KB article is now marked as "retired content."
4. IPersistFile
My next try was to use IPersistFile, like this:
    CComQIPtr<IPersistFile> pPersist(pDoc);
    hr = pPersist->Load(L"C:\\abc.html", STGM_READ);
    pPersist.Release();
As in #3, IPersistFile::Load() requires a message loop to function properly.
Unfortunately, using this function was not optimal for my situation because I already had the HTML document in memory. I didn't want to write it out to disk again and slow things down.
5. IPersistMoniker
I finally found the correct solution in a post by Craig Monro in microsoft.public.inetsdk.programming.html_objmodel.
The solution is to use IPersistMoniker to feed the stream to MSHTML. By using IPersistMoniker, you avoid the MIME sniffing bug, you don't need a Unicode buffer, and you can use in-memory data.
There is one problem with the solution posted by Mr. Monro. The SetHTML function in his example takes a Unicode string for the HTML data, but this isn't a requirement to use IPersistMoniker with MSHTML. I changed the function to use a byte buffer instead of a Unicode buffer and it worked fine. I also used SHCreateMemStream() to avoid having to make a second copy of the data buffer.
Preventing Execution [Added 12/21/2010]
According to the documentation in Microsoft's WalkAll sample, "If the loaded document is an HTML page which contains scripts, Java Applets and/or ActiveX controls, and those scripts are coded for immediate execution, MSHTML will execute them by default." This is very important to understand if the HTML you are loading is not trusted, because MSHTML begins executing the page as soon as it's loaded. This applies to all forms of loading described above except IMarkupServices.
The solution to this problem is not shown in my sample code, but it is shown in the WalkAll sample. Look in the comments at the beginning of the WalkAll sample for "IOleClientSite and IDispatch".
Updating the Document [Added 12/30/2010]
Once you've loaded the HTML document, you often want to update it and save the result. I found it necessary to set designMode to "On" in order to make changes "stick." Otherwise the change would appear to work, but would be discarded when I tried to save the document. You must set designMode to "On" after the document is loaded, because loading a document resets the designMode value.
If you want to save the document, QI for IPersistStreamInit and call Save(). (This doesn't work for documents loaded with IMarkupServices.) However, be aware that MSHTML cannot faithfully recreate the document that you loaded. In other words, if you do a load/save cycle, you can't diff the new file with the original and get a reasonable result. MSHTML normalizes tags, removes newlines, and makes many other changes. There does not appear to be any way to force MSHTML to save a byte-perfect form of the document.
As an alternative to IPersistStreamInit, you can get a pointer to the document element with IHTMLDocument3::get_documentElement(), and then call get_outerHTML(), but this strategy is imperfect for the following reasons:
- Any DOCTYPE or xml declaration at the beginning of the document is discarded.
- Any attributes on the BODY tag are discarded.
- Character encoding is lost because you always end up with wide character Unicode. Even worse, any CHARSET declaration in the HEAD is preserved, so non-English documents can display as garbage.
Performance [Added 12/30/2010]
One of my concerns was the performance that would be offered by the MSHTML control, which is hardly a lightweight control. To better understand the behavior, I ran a series of performance tests on a 3GHz Core i7 processor. What I learned is that the time required to parse HTML is dwarfed by the time required to create an MSHTML object. In my tests, if the MSHTML object is created once and then reused, a thread can load a 2K file about 1000 times in one second. If the MSHTML object is created and destroyed on each iteration, performance drops to 25 loads per second, a 40x performance hit. The lesson is that MSHTML should be created once and reused. The reuse strategy is the fundamental reason IPersistStreamInit exists as a separate interface from IPersistStream.
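To put those rates in per-load terms, the arithmetic can be sketched as below. LoadCost and EstimateCost are hypothetical names used only for this illustration. The inputs are the two measured rates quoted above, and the derived overhead is an estimate that treats everything outside the load itself as creation and destruction cost.

```cpp
// Hypothetical illustration: convert loads-per-second into per-load costs.
struct LoadCost {
    double perLoadReusedMs;      // time per load when the object is reused
    double perLoadRecreatedMs;   // time per load when the object is recreated
    double creationOverheadMs;   // estimated cost of one create/destroy cycle
    double slowdown;             // how many times slower recreation is
};

LoadCost EstimateCost(double reusedPerSec, double recreatedPerSec)
{
    LoadCost c;
    c.perLoadReusedMs    = 1000.0 / reusedPerSec;
    c.perLoadRecreatedMs = 1000.0 / recreatedPerSec;
    c.creationOverheadMs = c.perLoadRecreatedMs - c.perLoadReusedMs;
    c.slowdown           = reusedPerSec / recreatedPerSec;
    return c;
}

// EstimateCost(1000.0, 25.0) gives 1 ms per reused load, 40 ms per recreated
// load, roughly 39 ms of create/destroy overhead, and a 40x slowdown.
```

In other words, for a 2K file about 39 of every 40 milliseconds go to setting up and tearing down MSHTML rather than to parsing.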
Conclusion
This is one of the more difficult problems I've worked on lately. Between bad examples, bad documentation, bad third-party advice, and a plethora of different strategies, finding this solution was far more difficult than I expected. I hope this article saves some others from the same frustration.
Download the sample code for this article.