Tuesday, November 13, 2007

Corrupt \Boot\BCD in Windows Vista

Two nights ago, I started installing a video driver upgrade and went home for the night. The next morning I came in to discover that something had failed in the upgrade and the system was hung on a black screen. Even Remote Desktop access wasn't working.

I hit the Reset button. The system started to reboot. As per usual when I use the Reset button, the RAID array was incensed that I had interrupted its normal operation and took its revenge by taking four hours to validate the mirror. 89% of the way through validating the mirror, I received my first Vista Blue Screen of Death (BSOD). Something about a power driver not handling an event.

No problem. I hit the Reset button again. What's another four hours between friends?

Except this time, the system didn't reboot. It went through the BIOS check, started to boot Vista, and presented me with this error:

File: \Boot\BCD
Status: 0xc0000034
Info: The Windows Boot Configuration Data file is missing required information

The Microsoft Knowledge Base has an article on this subject. Its repair steps should have worked, but didn't, and the error messages during the failed repair didn't clarify the cause of the problem.

I tried booting Vista from DVD and running recovery. Rebuilding the BCD file is a standard operation, except that it didn't work. The Recovery Manager said that it couldn't save the file. Curiouser and curiouser.

I restored the BCD file from backup. (Acronis True Image pays for itself... again.)

I rebooted the system and got the exact same error. I went back in with the Recovery Manager and discovered that the file \Boot\BCD was gone, even though I had just restored it.

The final solution was to use Acronis to restore track zero and the Master Boot Record (MBR) from my last image backup. There never was a problem with the BCD file; that was just a red herring. I'm not sure how you'd solve this problem if you didn't have an image backup. It just goes to show the usefulness of image-level backups.

Update (11-15-2007): I discovered that the Vista repair utility created a file named C:\Temp\SrtTrail.txt. This file contained the list of tests that were performed. The last test showed this:

Root cause found: 
---------------------------
No OS files found on disk.

Repair action: Partition table repair
Result: Failed. Error code = 0x3bc3
Time taken = 154722 ms

If I'd seen this file earlier, I would have had a clear indicator to repair track zero instead of stumbling on the solution accidentally.

Legacy Code from ... CP/M?!?

Remember CP/M? Unless you are over 40, my guess is probably not. This week I ran into a bug in our software that traces its roots all the way back to CP/M in the 1970s. Over thirty years later, the code still exists in the Visual Studio 2005 C Runtime Library, waiting for the next innocent victim to run afoul of it.

Here's what happened. We had reverse engineered a data file format and had shipped the first pass to customers. Our only sample of the data file was quite small, about 10 records, but it was enough to determine the file format and make it work. The ten records in the file converted cleanly.

After the product shipped, several customers reported that only 26 records were being loaded. This was obviously incorrect, as most of their files had hundreds of records. Our end-of-file handling code was shared with numerous other modules and worked fine there. The 26 records that did convert did so correctly.

Even when we instrumented our software to give more insight into what was happening at customer sites, we found nothing. Our software converted 26 records, detected end-of-file, and exited. WTF?

When we finally found a customer willing to share his data file, we ran our software in the debugger and got exactly the same results. 26 records were converted and the software exited cleanly, with no errors.

The lightbulb didn't go on over my head until I was looking at the data file with the cygwin "cat -v" command, which displays nonprinting control characters (ASCII codes 0 through 31, other than tab and newline) in caret notation. This particular data file had two bytes for each record ID and record numbering started at 0. The 27th record contained ID 26 (0x1a), which showed up as ^Z. Does that ring any bells? If you ever developed for CP/M (or for MS-DOS 1.0) it should.

Thirty years ago, CP/M only tracked the number of blocks in each file, not the number of bytes. By convention, Ctrl-Z was used to denote "end of file" for text files. MS-DOS 1.0, which bore a striking resemblance to CP/M internally, followed this same convention. At the time, the C Runtime Library understood this convention and automatically generated an "end of file" condition when ^Z was encountered. Today, the VS2005 C Runtime Library still contains that code and, for files opened in text mode, generates the end-of-file condition at the first ^Z even when the actual length of the file extends well beyond that point.
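To make the convention concrete, here is a minimal sketch in C that simulates it on an in-memory buffer. This is not the runtime library's actual code, and the function name text_mode_visible_length is made up for illustration. It also ignores the carriage return/line feed translation that real text mode performs.

```c
#include <assert.h>
#include <stddef.h>

#define CTRL_Z 0x1A  /* ASCII 26, the CP/M-era end-of-file marker for text files */

/* Hypothetical helper (not part of any runtime library): returns how many
   bytes a reader that honors the Ctrl-Z convention would deliver from a
   buffer of `len` bytes. */
size_t text_mode_visible_length(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);

    for (i = 0; i < len; i++) {
        if (buf[i] == CTRL_Z)
            return i;  /* everything from the marker onward is "past end-of-file" */
    }
    return len;        /* no marker found: the whole buffer is visible */
}
```

A buffer whose third byte happens to be 0x1A yields a visible length of 2, no matter how much data follows. That is the same effect we saw in the field: everything before the first ^Z reads correctly, and then the file simply "ends".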

The bug in our software was that the file had been opened in text mode instead of binary mode. Normally this is easy to detect because the records become out of sync as they are read, but by some coincidence of the data layout, the data in this file was read in perfectly up until that ^Z.
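For what it's worth, the fix comes down to one character in the fopen mode string. The sketch below is not our actual conversion code: the function name count_records and the fixed record size are assumptions for illustration, since the real record layout isn't described here. It only shows a file being opened with "rb" so that every byte, including 0x1A, arrives as data.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical example: counts the complete fixed-size records in a file.
   Returns -1 if the file cannot be opened. The record size is an assumption;
   the real file format is not shown in this post. */
long count_records(const char *path, size_t record_size)
{
    unsigned char buf[256];
    long count = 0;
    FILE *fp;

    assert(record_size > 0 && record_size <= sizeof buf);

    /* "rb", not "r": binary mode turns off text-mode translation, so a
       0x1A byte inside a record is not treated as end-of-file. */
    fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    /* fread returns the number of complete items read (0 or 1 here). */
    while (fread(buf, record_size, 1, fp) == 1)
        count++;

    fclose(fp);
    return count;
}
```

On platforms where text and binary modes behave identically, the "b" is harmless, so there is little reason ever to leave it off when reading binary data.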

So now I have one more reason to dislike CP/M, although (admittedly) in this day and age it seems somewhat pointless to carry a grudge against a dead operating system that was designed to run off of 8" floppy disks. Old habits are hard to break.