Technical Blog for Jim Beveridge: October 2008

Thursday, October 23, 2008

Dreamweaver CS4 - First Look

I finally broke down and bought Dreamweaver CS4 today. This is the first time I've used Dreamweaver since 1997. It works a lot better on a 4GB Core 2 Duo machine than it did on a Pentium II 350 :-)

Anyway, here are my initial impressions.

The Good

It doesn't crash! In the past eight years I don't think I ever used GoLive for more than 40 minutes straight without crashing. DW hasn't crashed yet.
Works under Windows Vista.
The upgrade registration was painless. I entered my version of GoLive and my GoLive serial #. No problems.
DW supports sftp.
I/O to the remote sftp server is fast, in direct contrast to every other web dev tool I've tried.
Template page updates are fast. GoLive was always very slow.
Live View allows you to see the browser view without going to the browser. Pretty slick.
Dual Monitor mode.

The Bad

It took me a long time to figure out how to activate the Import Golive Site command. It's not in any of the menus. Looking for "import golive" in Dreamweaver Help brings up numerous old articles about DW CS3, which worked completely differently. I think I finally found the answer by searching on "golive to dreamweaver cs4" (without the quotes) but the first several search results were still for CS3. The answer can be found at Dreamweaver for GoLive users, http://www.adobe.com/devnet/dreamweaver/golive_migration/
After going through the effort to find and install it, the GoLive import tool didn't work at all. It refused to open the site file. My workaround was to use the GL2DW tool from CS3, which is installed in GoLive instead of DW. GL2DW crashed halfway through, but it did enough that I was able to determine what to do myself.
I got stuck on how to open an existing site. Turns out you use "New Site..." even for an existing site. From what I've heard, this will just be the first in a long list of Dreamweaver-isms that I'll run into.
Where's the "remove whitespace" command? Every other tool I've ever used, even Microsoft Expression, does this automatically when the site is uploaded.
Only Subversion is supported for source control, which is a little strange since SourceSafe is supported as a test server. Not a big deal, but SourceSafe integration would have been handy.

Friday, October 17, 2008

How to load MSHTML with data

Download the sample code for this article.

This week I needed to write code to parse HTML pages and find the IMG tags to extract the images. I saw that the IHTMLDocument2 interface includes the get_images() function, so I figured that this would be a trivial problem to solve, especially since MSHTML supports running outside of a browser window.

This "trivial problem" has now taken me the better part of a week to solve. Most of the solutions I found are buggy, wrong or incomplete, some of which are from Microsoft technical articles.

For this project, I was starting with a raw buffer that contained HTML data. The buffer was multibyte, not Unicode, but the encoding of the buffer was not known in advance, so I couldn't convert the data to Unicode without parsing the page to extract the charset tag. Since parsing is what MSHTML was supposed to do for me in the first place, this made the Unicode calls useless.
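To make the chicken-and-egg problem concrete, here is a minimal sketch of what sniffing the charset by hand might look like. It is not the approach this article ends up using. The function name FindDeclaredCharset is hypothetical, and the sketch assumes an ASCII-compatible encoding with a "charset=" token near the top of the page, so it would miss UTF-16 pages and anything declared only in HTTP headers.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the article's sample code).
// Scans the first `limit` bytes of a raw HTML buffer for "charset=" and
// returns the declared name in lower case, or "" if none is found.
// Assumes an ASCII-compatible encoding, so it cannot detect UTF-16 pages.
std::string FindDeclaredCharset(const std::string& html, std::size_t limit = 1024)
{
    // Lower-case a copy of the prefix so the search is case-insensitive.
    std::string head = html.substr(0, std::min(limit, html.size()));
    std::transform(head.begin(), head.end(), head.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::string key = "charset=";
    std::size_t pos = head.find(key);
    if (pos == std::string::npos)
        return "";
    pos += key.size();

    // Skip an optional opening quote, as in <meta charset="...">.
    if (pos < head.size() && (head[pos] == '"' || head[pos] == '\''))
        ++pos;

    // The name is a run of letters, digits, and the characters - _ . :
    std::size_t end = pos;
    while (end < head.size() &&
           (std::isalnum(static_cast<unsigned char>(head[end])) ||
            head[end] == '-' || head[end] == '_' ||
            head[end] == '.' || head[end] == ':'))
        ++end;
    return head.substr(pos, end - pos);
}
```

Even this small helper is already a partial HTML parser, which is exactly the work MSHTML was supposed to take over.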

1. IMarkupServices

The first article I found was from CodeProject, titled How to identify the different elements in the selection of web pages? Although the article didn't solve my exact problem, it did lead me to IMarkupServices and ParseString(), with supporting information from CodeGuru in Lightweight HTML Parsing Using MSHTML.

~~Unfortunately, I was never able to make this work. As it turned out, this solution wouldn't have worked anyway because ParseString requires a Unicode string and I couldn't provide one.~~ ~~[Note: I now believe that I needed a message loop to make this work, as in #3 below, but I didn't verify this.]~~ Update 12/21/2010: I was finally able to make IMarkupServices work properly using sample code from MSDN. IMarkupServices does not require a message loop, and you can read HTML with any character set using ParseGlobal instead of ParseString. However, IMarkupServices does have some strange drawbacks. The IHTMLDocument2 object it returns is not fully functional: you cannot QI for IPersistStreamInit to serialize it, and IHTMLDocument3::get_documentElement simply fails. I've also found comments that IMarkupServices fails outright on many web pages.

2. IHTMLDocument2::write()

My next try was to use IHTMLDocument2::write(). I found this in another article on CodeProject titled, Loading and parsing HTML using MSHTML. 3rd way. This was the easiest of the various solutions to use because it worked the first time and required no message loop (more on this in #3). However, the write() function also requires the HTML to be passed as Unicode, which meant that this solution had the same problem as #1.

The only challenge to using IHTMLDocument2::write() is properly setting up the SAFEARRAY. Although the sample code in MSDN shows how to do this, it's complicated and easy to break.

3. IPersistStreamInit::Load()

My third try was to use IPersistStreamInit::Load(). This is the solution recommended by Microsoft in the article Loading HTML content from a Stream. This function caused me no end of aggravation. No matter what I tried, I couldn't get it to work. My call to Load() would return success, but my data wouldn't appear in the document.

I found other people with the same problem. It turns out that a few important details were omitted in the MSDN article. The first hint I found was in a Usenet post on microsoft.public.windows.inetexplorer.ie5.programming.components.webbrowser_ctl, where Mikko Noromaa recommended using CoInitializeEx(NULL,COINIT_MULTITHREADED) instead of CoInitialize(NULL).

My initial results with this change were perfect; for the first time my data could actually be retrieved from IHTMLDocument2. However, when I tried to change the data, I started seeing really weird crashes. Examining the call stack showed that my calls were being marshaled across COM apartments, which shouldn't have been necessary. Eventually, I determined that MSHTML wants to live in an STA (single-threaded apartment). When I declared my thread to be an MTA, COM automatically created a new thread to host MSHTML and marshaled my calls to that thread. Something was going wrong in those cross-thread calls and I didn't have the inclination to debug it.

I finally found a blog post by Scott Hanselman that described how to make Load() work properly. Recent versions of MSHTML require a message loop. Apparently, older versions of MSHTML did not. Without a message loop, Load() returns success but the actual work of loading the HTML is performed asynchronously. I had to add this code after Load() in order to make it work:

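// Keep pumping messages until MSHTML has finished loading the document.
for (;;)
{
    CComBSTR bstrReady;
    hr = pDoc->get_readyState(&bstrReady);
    if (bstrReady != L"loading")
        break;
    //while( doc.readyState != "complete" )

    // Dispatch any pending messages so the asynchronous load can progress.
    MSG msg;
    while (::PeekMessage( &msg, NULL, 0, 0, PM_NOREMOVE))
    {
        // PumpMessage() returns FALSE when WM_QUIT is received.
        if ( !AfxGetApp()->PumpMessage( ) )
        {
            return;
        }
    }
}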
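The loop above keeps spinning for as long as readyState stays "loading", so a page that never finishes would hang the caller. One possible refinement is to give the wait a deadline. The sketch below shows only the control flow, with the MSHTML and MFC calls replaced by caller-supplied callbacks so that it compiles on its own. PumpUntilLoaded and both callback parameters are hypothetical names, not part of the sample code.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <string>

// Hypothetical helper. `getState` stands in for IHTMLDocument2::get_readyState
// and `pumpOnce` for one PeekMessage/PumpMessage pass; `pumpOnce` returns
// false when the application is quitting (WM_QUIT).
// Returns true once the state is no longer "loading", false on quit or timeout.
bool PumpUntilLoaded(const std::function<std::string()>& getState,
                     const std::function<bool()>& pumpOnce,
                     std::chrono::milliseconds timeout)
{
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (getState() == "loading")
    {
        if (std::chrono::steady_clock::now() >= deadline)
            return false;   // give up instead of spinning forever
        if (!pumpOnce())
            return false;   // message loop was asked to exit
    }
    return true;
}
```

In a real build, getState would wrap the get_readyState() call and pumpOnce would wrap the inner PeekMessage loop. A blocking wait such as MsgWaitForMultipleObjects between passes would also avoid burning CPU while the document loads, though I have not tested that variation here.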

Unfortunately, after going through all this time and effort to make Load() work, I discovered a small fact that made all of this effort useless: there's a bug in the MIME sniffing code used by IPersistStreamInit::Load(). The bug is documented in this MSDN KB article: BUG: IPersistStreamInit::Load() Displays HTML Files as Text. You can read the discussion in Scott Hanselman's blog entry to understand why this is an issue. Update 12/21/2010: This bug was fixed in IE7 and the KB article is now marked as "retired content."

4. IPersistFile

My next try was to use IPersistFile, like this:

CComQIPtr<IPersistFile> pPersist(pDoc);
hr = pPersist->Load(L"C:\\abc.html", STGM_READ);
pPersist.Release();

As in #3, IPersistFile::Load() requires a message loop to function properly.

Unfortunately, using this function was not optimal for my situation because I already had the HTML document in memory. I didn't want to write it out to disk again and slow things down.

5. IPersistMoniker

I finally found the correct solution in a post by Craig Monro in microsoft.public.inetsdk.programming.html_objmodel.

The solution is to use IPersistMoniker to feed the stream to MSHTML. By using IPersistMoniker, you avoid the MIME sniffing bug, you don't need a Unicode buffer, and you can use in-memory data.

There is one problem with the solution posted by Mr. Monro. The SetHTML function in his example takes a Unicode string for the HTML data, but this isn't a requirement to use IPersistMoniker with MSHTML. I changed the function to use a byte buffer instead of a Unicode buffer and it worked fine. I also used SHCreateMemStream() to avoid having to make a second copy of the data buffer.

Preventing Execution [Added 12/21/2010]

According to the documentation in Microsoft's WalkAll sample, "If the loaded document is an HTML page which contains scripts, Java Applets and/or ActiveX controls, and those scripts are coded for immediate execution, MSHTML will execute them by default." This is very important to understand if the HTML code you are loading is not trusted because you begin executing the page as soon as it's loaded. This applies to all forms of loading described above except IMarkupServices.

The solution to this problem is not shown in my sample code, but it is shown in the WalkAll sample. Look in the comments at the beginning of the WalkAll sample for "IOleClientSite and IDispatch".

Updating the Document [Added 12/30/2010]

Once you've loaded the HTML document, you often want to update it and save the result. I found it necessary to set designMode to "On" in order to make changes "stick." Otherwise the change would appear to work, but would be discarded when I tried to save the document. You must set designMode to "On" after the document is loaded, because loading a document resets the designMode value.

If you want to save the document, QI for IPersistStreamInit and call Save(). (This doesn't work for documents loaded with IMarkupServices.) However, be aware that MSHTML cannot faithfully recreate the document that you loaded. In other words, if you do a load/save cycle, you can't diff the new file with the original and get a reasonable result. MSHTML normalizes tags, removes newlines, and makes many other changes. There does not appear to be any way to force MSHTML to save a byte-perfect form of the document.

As an alternative to IPersistStreamInit, you can get a pointer to the document element with IHTMLDocument3::get_documentElement(), and then call get_outerHTML(), but this strategy is imperfect for the following reasons:

Any DOCTYPE or xml declaration at the beginning of the document is discarded.
Any attributes on the BODY tag are discarded.
Character encoding is lost because you always end up with wide character Unicode. Even worse, any CHARSET declaration in the HEAD is preserved, so non-English documents can display as garbage.

Performance [Added 12/30/2010]

One of my concerns was the performance that would be offered by the MSHTML control, which is hardly a lightweight control. To better understand the behavior, I ran a series of performance tests on a 3GHz Core i7 processor. What I learned is that the time required to parse HTML is dwarfed by the time required to create an MSHTML object. In my tests, if the MSHTML object is created once and then reused, a thread can load a 2K file about 1000 times in one second. If the MSHTML object is created and destroyed on each iteration, performance drops to 25 loads per second, a 40x performance hit. The lesson is that MSHTML should be created once and reused. The reuse strategy is the fundamental reason IPersistStreamInit exists as a separate interface from IPersistStream.
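Numbers like these can be gathered with a very small timing harness. The sketch below is a hypothetical illustration, not the test code behind the figures above: LoadsPerSecond is an invented name, and the callback stands in for whatever "load one document" means in a given test.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>

// Hypothetical timing helper: runs `loadOnce` `iterations` times and
// returns the observed rate in loads per second.
double LoadsPerSecond(std::size_t iterations, const std::function<void()>& loadOnce)
{
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
        loadOnce();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    // Guard against a zero reading on very fast runs.
    return elapsed.count() > 0.0 ? iterations / elapsed.count() : 0.0;
}
```

To compare the two strategies, the callback for the reuse case would call Load() on a document object created outside the loop, while the callback for the other case would create the object, load it, and release it each time.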

Conclusion

This is one of the more difficult problems I've worked on lately. Between bad examples, bad documentation, bad 3rd party advice, and a plethora of different strategies, finding this solution was far more difficult than I expected. I hope this article saves some others from the same frustration.

Download the sample code for this article.

Sunday, October 12, 2008

GigE File Sharing Performance - 96MB/sec!

I wrote numerous blog entries last year about my difficulties getting GigE (Gigabit Ethernet) to work properly. Yesterday I upgraded both my Vista client and my Windows 2003 Server to 4GB RAM (see the end of this article for hardware configurations.) Suddenly my resource-constrained systems had lots of room to play. I reran my performance tests over my Gigabit Ethernet and came up with some very unexpected results.

First I used the DOS "copy" command on the Vista client to copy a 256MB file from the server to the client. The file was not cached on the server, and it transferred about 15MB/sec. This was the same performance I was getting before the RAM upgrade.

Next I repeated the same copy. The file was completely cached on the server and the file transferred at about 25MB/sec.

Finally, I again used the DOS "copy" command on the Vista client to copy the file from the client back to the server. The transfer peaked at 96MB/sec!! (A 256MB file copies VERY quickly at that speed.)

On the one hand, this is a contrived test - in the real world we almost never have the luxury of copying a file that's already cached in RAM. However, the tests lead me to several useful conclusions:

The tests show the peak performance of Windows file sharing when you take the disks out of the equation. At 96MB/sec, that's 85% of the practical maximum of 112MB/sec. Considering that the filesharing protocol itself has overhead, we are actually running at somewhere between 90% and 95% of the theoretical max. That's fantastic.
The file sharing protocol's latency is negligible. If it were significant, I wouldn't have reached the above performance numbers. Instead, the transfer rate of the hard drives is the primary performance constraint. Both test systems have SATA 7200RPM drives. I expect that if I had 10,000 RPM RAID 5 drives, my maximum performance when copying non-cached files would improve dramatically.
Something very strange is happening when copying from the server back down to the client. Why this operation is peaking at 25MB/sec is not clear.
Jumbo frames are completely unnecessary for peak performance. They might cut down on the CPU load, but even that's debatable when interrupt coalescing is enabled on the Ethernet card.
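The 112MB/sec figure above is quoted without a derivation. One plausible reading, and it is only an assumption on my part, is that it is the TCP payload rate of a 1 Gbit/sec link with standard 1500-byte frames. The sketch below works through that arithmetic; the function name is hypothetical and the per-frame overhead values are the standard Ethernet ones.

```cpp
#include <cassert>

// Bytes per second of TCP payload on an Ethernet link, given the line rate
// in bits per second, the MTU, and the IP+TCP header bytes per packet.
// Assumes every frame is full-sized and the link carries no other traffic.
double TcpGoodputBytesPerSec(double lineRateBitsPerSec, int mtu, int ipTcpHeaderBytes)
{
    // Per-frame cost on the wire beyond the MTU: preamble and start
    // delimiter (8), Ethernet header (14), frame check sequence (4),
    // inter-frame gap (12).
    const int kEthernetOverhead = 8 + 14 + 4 + 12;
    const double payload = mtu - ipTcpHeaderBytes;
    const double onWire = mtu + kEthernetOverhead;
    return (lineRateBitsPerSec / 8.0) * (payload / onWire);
}
```

With 40 bytes of IP and TCP headers this gives roughly 118.7 million bytes per second, or about 113 when divided by 1024 x 1024. Adding the 12-byte TCP timestamp option brings it to about 112. Against either figure, 96MB/sec is a little over 85% of what the wire can carry.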

One set of data points still needs to be measured: the same copies rerun from the command prompt on the server. In prior tests, it mattered which system initiated the file copy.

Finally, I accidentally performed the above tests with Virtual PC 2007 running on the Vista client. Virtual PC cut the transfer rates by 30 to 60%. The final copy peaked at 40MB/sec instead of 96MB/sec. Oddly enough, VMware Server was running on the Windows 2003 Server for all of the tests, and it had two virtual machines active. So while Virtual PC had a significant impact on network performance, it appears that VMware Server had no impact.

Client Configuration
Vista SP1
P5B Deluxe Wifi
Core 2 Duo 2.4 GHz
On-board Marvell Yukon Ethernet Card
  - Interrupt coalescing enabled
  - Jumbo frames disabled
4GB RAM
GigE registry update applied

Windows 2003 x64 Server Configuration
P5B Deluxe Wifi
Core 2 Duo 2.13 GHz
On-board Marvell Yukon Ethernet Card
  - Interrupt coalescing enabled
  - Jumbo frames disabled
4GB RAM
Ethernet adapters on the VMware virtual machines were running in Bridged mode.

Saturday, October 11, 2008

Passive (PASV) ftp in Windows 7

I was controlling a customer's computer and trying to upload a file to our corporate ftp server. I could connect and log in, but any other command would cause the ftp client to hang. I was using the ftp command built in to XP because I didn't have privileges to install a 3rd party ftp client. It was frustrating, but I didn't have time to debug the problem.

Today I had the same problem on a system in the office when trying to connect to a new server. However, this time I was running a different ftp server (vsftpd) that gave me a helpful error message:

200 PORT command successful. Consider using PASV.

I tried the obvious, which was to restart ftp from the command prompt and then type PASV, but this gave the error "Invalid command." I found several posts that said that the ftp client built into Windows doesn't support passive mode. The good news is that ~~they are wrong~~ this is fixed in newer versions of Windows.

With Windows 7, enter this command to enter passive mode:

QUOTE PASV

With vsftpd, I was rewarded with this response, and everything started working:

227 Entering Passive Mode

As others have noted in the comments, this does not work on Windows XP.
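For anyone curious about what the 227 reply carries: per RFC 959, the full reply ends with six comma-separated numbers in parentheses, four for the server's IP address and two for the data port (port = p1 * 256 + p2). The sketch below parses such a reply. ParsePasvReply is a hypothetical helper, and the address in the usage note is made up for illustration rather than taken from my server.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: extracts the data-connection endpoint from an FTP
// "227" reply of the form "227 ... (h1,h2,h3,h4,p1,p2)" (RFC 959).
// Returns true and fills `host` and `port` on success.
bool ParsePasvReply(const std::string& reply, std::string& host, int& port)
{
    if (reply.compare(0, 3, "227") != 0)
        return false;
    const std::size_t open = reply.find('(');
    if (open == std::string::npos)
        return false;

    int v[6];
    if (std::sscanf(reply.c_str() + open, "(%d,%d,%d,%d,%d,%d",
                    &v[0], &v[1], &v[2], &v[3], &v[4], &v[5]) != 6)
        return false;

    // Each field is a single byte, so reject anything outside 0..255.
    for (int i = 0; i < 6; ++i)
        if (v[i] < 0 || v[i] > 255)
            return false;

    host = std::to_string(v[0]) + "." + std::to_string(v[1]) + "." +
           std::to_string(v[2]) + "." + std::to_string(v[3]);
    port = v[4] * 256 + v[5];   // high byte first, then low byte
    return true;
}
```

Given the illustrative reply "227 Entering Passive Mode (192,168,1,10,195,80).", this yields host 192.168.1.10 and port 50000, which is where a passive-mode client would open its data connection.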
