Technical Blog for Jim Beveridge: Unicode BOM Handling in the C Run-time Library

Monday, January 12, 2009

Unicode BOM Handling in the C Run-time Library

Visual Studio 2005 and later include Unicode BOM (Byte-Order Mark) support, but I found the documentation somewhat lacking. Here are a few hints.

One of the primary points of confusion for me was what you are actually defining when you set the encoding. The answer is that you are providing a hint as to the encoding of the file, and only a hint: if the file has a BOM, then that BOM takes precedence.

All calls you make to read or write the file must use the Unicode (wide-character) APIs. If you try to use an ANSI API, the C run-time library (CRT) will assert. In other words, the CRT will do character set conversion between Unicode and the file's encoding, but it won't do character set conversion between the local code page and the file's encoding. For example, you'll get an assertion if you open the file with ccs=utf-8 and then try to use fgets to read the data.
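As a minimal sketch of the pattern described above, the following opens a file with a ccs= hint and reads it with a wide-character call rather than fgets. The helper name read_first_line and its return convention are made up for illustration. The ccs= mode flag is a Microsoft CRT feature, so this is only meaningful when built with Visual Studio 2005 or later.

```c
#define _CRT_SECURE_NO_WARNINGS  /* lets MSVC accept plain fopen without warning C4996 */
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: read the first line of a text file as wide characters.
   Returns 0 on success and -1 on failure. */
static int read_first_line(const char *path, wchar_t *line, int capacity)
{
    /* "ccs=UTF-8" is only the encoding hint; per the post, a BOM found in
       the file takes precedence over it. */
    FILE *fp = fopen(path, "rt, ccs=UTF-8");
    int ok;

    if (fp == NULL)
        return -1;

    /* Wide-character read. Per the post, calling fgets on this stream
       would trip a CRT assertion instead. */
    ok = fgetws(line, capacity, fp) != NULL;
    fclose(fp);
    return ok ? 0 : -1;
}
```

A caller would pass a wchar_t buffer, for example read_first_line("notes.txt", line, 256), and then work with the result using the wcs* family of string functions.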

Other points:

The CRT will not perform any BOM handling if you do not specify a ccs= encoding. This means that backwards compatibility is retained because the CRT does not perform any behind-the-scenes processing if you don't ask it to.
Most BOM formats are not supported. For example, UTF-7, UTF-32 and especially UTF-16 big-endian are not handled.
If you specify a ccs= encoding, then the BOM will be automatically removed from the data stream. However, you need to be careful with file positioning calls such as fseek and rewind, because the BOM is only skipped when the file is first opened. For example, if you do fopen, fread, rewind, fread, then the first fread will not see the BOM but the second fread will.
The file encoding is respected when writing, so the number of bytes actually written to the file may be smaller or larger than the size of the buffer you passed in.
If you open the file in binary mode, then any ccs= specification will be ignored and no BOM handling will be performed.
Apparently the CRT does not provide a documented way to determine the encoding of the file.
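
Two of the points above lend themselves to small workarounds, sketched here in portable C. The first helper classifies a BOM from the leading bytes of a file read in binary mode, which is one way to find the encoding yourself and to spot the formats the CRT does not handle. The second drops a leading U+FEFF from wide data, for the first read after a rewind, assuming the BOM shows up in the stream as that single wide character. The names bom_kind, classify_bom and drop_leading_bom are hypothetical and are not part of the CRT.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical names; none of these come from the CRT. */
typedef enum {
    BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE, BOM_UTF32_LE, BOM_UTF32_BE
} bom_kind;

/* Classify the first n bytes of a file that was read in binary mode. */
static bom_kind classify_bom(const unsigned char *p, size_t n)
{
    static const unsigned char utf32le[] = { 0xFF, 0xFE, 0x00, 0x00 };
    static const unsigned char utf32be[] = { 0x00, 0x00, 0xFE, 0xFF };
    static const unsigned char utf8[]    = { 0xEF, 0xBB, 0xBF };
    static const unsigned char utf16le[] = { 0xFF, 0xFE };
    static const unsigned char utf16be[] = { 0xFE, 0xFF };

    /* UTF-32 LE is tested before UTF-16 LE because both start with FF FE.
       A UTF-16 LE file whose first character is U+0000 would be
       misclassified, which should be rare for text files. */
    if (n >= 4 && memcmp(p, utf32le, 4) == 0) return BOM_UTF32_LE;
    if (n >= 4 && memcmp(p, utf32be, 4) == 0) return BOM_UTF32_BE;
    if (n >= 3 && memcmp(p, utf8, 3) == 0)    return BOM_UTF8;
    if (n >= 2 && memcmp(p, utf16le, 2) == 0) return BOM_UTF16_LE;
    if (n >= 2 && memcmp(p, utf16be, 2) == 0) return BOM_UTF16_BE;
    return BOM_NONE;
}

/* Remove a leading U+FEFF from a buffer of count wide characters and
   return the new count. Intended for the first read after rewind() or
   after fseek() back to the start of the file. */
static size_t drop_leading_bom(wchar_t *buf, size_t count)
{
    if (count > 0 && buf[0] == 0xFEFF) {
        memmove(buf, buf + 1, (count - 1) * sizeof *buf);
        return count - 1;
    }
    return count;
}
```

To use classify_bom, open the file with "rb", fread up to four bytes, and pass the buffer together with the number of bytes actually read. Because binary mode ignores any ccs= specification, the bytes arrive untouched.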
