Byte Order Marker

The Byte Order Marker (BOM) is a series of byte values placed at the beginning of an encoded text stream (or file). This data allows the reader to decide correctly which character encoding to use when decoding the stream back into a sequence of characters. The use of byte order markers within files and streams is not specific to XML, but it is typical to see them in use when XML data is streamed or stored in files.

The first thing to appreciate is that there are a number of ways in which a sequence of characters can be represented in binary form. The process of turning characters into binary data is termed character encoding.
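
To make the idea of "a number of ways" concrete, here is a minimal sketch that encodes one character with three of the encodings built into .NET. The variable names are illustrative only, and the character chosen (U+2122, the trademark sign) is just a convenient non-ASCII example.

```csharp
using System;
using System.Text;

// One character, U+2122 (decimal 8482), written as an escape so the
// sample does not depend on how this source file itself is encoded.
string text = "\u2122";

// GetBytes returns only the encoded character data; it does not add a BOM.
byte[] utf8 = Encoding.UTF8.GetBytes(text);
byte[] utf16Le = Encoding.Unicode.GetBytes(text);
byte[] utf16Be = Encoding.BigEndianUnicode.GetBytes(text);

Console.WriteLine(BitConverter.ToString(utf8));    // E2-84-A2
Console.WriteLine(BitConverter.ToString(utf16Le)); // 22-21
Console.WriteLine(BitConverter.ToString(utf16Be)); // 21-22
```

The same character produces three different byte sequences, which is why a reader needs to know (or detect) the encoding before it can decode the data.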

The following table lists the standard BOMs in use for common encodings.

Encoding | Representation (hexadecimal) | Representation (decimal) | Representation (ISO-8859-1)
--- | --- | --- | ---
UTF-8 | EF BB BF | 239 187 191 | ï»¿
UTF-16 (BE) | FE FF | 254 255 | þÿ
UTF-16 (LE) | FF FE | 255 254 | ÿþ
UTF-32 (BE) | 00 00 FE FF | 0 0 254 255 | □□þÿ (□ represents the ASCII null character)
UTF-32 (LE) | FF FE 00 00 | 255 254 0 0 | ÿþ□□ (□ represents the ASCII null character)
UTF-7 | 2B 2F 76 38 | 43 47 118 56 | +/v8
 | 2B 2F 76 39 | 43 47 118 57 | +/v9
 | 2B 2F 76 2B | 43 47 118 43 | +/v+
 | 2B 2F 76 2F | 43 47 118 47 | +/v/
 | 2B 2F 76 38 2D | 43 47 118 56 45 | +/v8-
UTF-1 | F7 64 4C | 247 100 76 | ÷dL
UTF-EBCDIC | DD 73 66 73 | 221 115 102 115 | Ýsfs
SCSU | 0E FE FF | 14 254 255 | □þÿ (□ represents the ASCII "shift out" character)
BOCU-1 | FB EE 28 | 251 238 40 | ûî(
GB-18030 | 84 31 95 33 | 132 49 149 51 | □1■3 (□ and ■ represent unmapped ISO-8859-1 characters)
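
The lookup described by the table can be sketched in code. The helper below is hypothetical (`DetectBom` is not a .NET API) and covers only the UTF-8, UTF-16 and UTF-32 rows. One detail worth noting: the UTF-32 (LE) BOM begins with the same two bytes as the UTF-16 (LE) BOM, so the longer pattern has to be tested first.

```csharp
using System;
using System.Linq;

byte[] sample = { 0xEF, 0xBB, 0xBF, 0x58, 0x4D, 0x4C };
Console.WriteLine(DetectBom(sample)); // UTF-8

// Hypothetical helper: returns the table name of the encoding whose BOM
// starts the data, or "unknown" when none of the listed BOMs matches.
static string DetectBom(byte[] data)
{
    // Order matters: UTF-32 (LE) (FF FE 00 00) must be checked
    // before UTF-16 (LE) (FF FE), which is a prefix of it.
    (string Name, byte[] Bom)[] candidates =
    {
        ("UTF-32 (BE)", new byte[] { 0x00, 0x00, 0xFE, 0xFF }),
        ("UTF-32 (LE)", new byte[] { 0xFF, 0xFE, 0x00, 0x00 }),
        ("UTF-8",       new byte[] { 0xEF, 0xBB, 0xBF }),
        ("UTF-16 (BE)", new byte[] { 0xFE, 0xFF }),
        ("UTF-16 (LE)", new byte[] { 0xFF, 0xFE }),
    };

    foreach (var (name, bom) in candidates)
    {
        // Compare the first bom.Length bytes of the data with the BOM.
        if (data.Length >= bom.Length && data.Take(bom.Length).SequenceEqual(bom))
        {
            return name;
        }
    }
    return "unknown";
}
```

A result of "unknown" only means that no BOM was found; the data may still be in any encoding, since a BOM is optional.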

Example

The first 3 bytes of the example file are 0xEF, 0xBB, 0xBF. Looking these up in the table above, we can see that the encoding is UTF-8.

The rest of the file can therefore be read and decoded using this encoding.

Using the UTF-8 character encoding, the last 3 bytes in the file (0xE2, 0x84, 0xA2) represent the character value 8482, which is the character '™'.

The resulting string is therefore:

XML the smart way™
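
The decoding step can be reproduced with a short sketch. The byte array below is an assumption: it is reconstructed from the resulting string rather than copied from the original file, and it consists of the UTF-8 BOM followed by the UTF-8 encoding of the text.

```csharp
using System;
using System.Text;

// Assumed file content: UTF-8 BOM, the ASCII text, then U+2122.
byte[] fileBytes =
{
    0xEF, 0xBB, 0xBF,                                     // UTF-8 BOM
    0x58, 0x4D, 0x4C, 0x20, 0x74, 0x68, 0x65, 0x20,       // "XML the "
    0x73, 0x6D, 0x61, 0x72, 0x74, 0x20, 0x77, 0x61, 0x79, // "smart way"
    0xE2, 0x84, 0xA2                                      // U+2122
};

// Skip the 3-byte BOM and decode the remaining bytes as UTF-8.
string decoded = Encoding.UTF8.GetString(fileBytes, 3, fileBytes.Length - 3);

Console.WriteLine(decoded);
Console.WriteLine((int)decoded[decoded.Length - 1]); // 8482
```

The BOM is skipped explicitly here because `Encoding.GetString` decodes every byte it is given; it does not treat a leading BOM specially.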

Common Problems

Most platforms and languages have built-in support for decoding character streams, so most of the time you are unaware that BOMs are even in use.

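Correctly Reading using .NET
using System.IO;

// Open the file as a raw byte stream.
using (Stream s = new FileStream("XmlTheSmartWay.txt", FileMode.Open, FileAccess.Read))
{
    // StreamReader checks for a BOM and decodes accordingly
    // (UTF-8 is assumed when no BOM is present).
    using (TextReader tr = new StreamReader(s))
    {
        string txt = tr.ReadToEnd();
    }
}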

The above sample reads and decodes the values correctly. However, if this support is missing or bypassed, problems can occur. In the following example the reader assumes 1 byte per character, so if the file contains UTF-8 data, any byte values above 127 (including the BOM itself) will be interpreted incorrectly.

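Incorrectly Reading using .NET
using System.IO;
using System.Text;

StringBuilder sb = new StringBuilder();

using (Stream s = new FileStream("XmlTheSmartWay.txt", FileMode.Open, FileAccess.Read))
{
    int value;
    // Wrong: each byte is cast straight to a char, so the BOM and any
    // multi-byte UTF-8 sequences end up as separate, incorrect characters.
    while ((value = s.ReadByte()) != -1)
    {
        sb.Append((char)value);
    }
}
string txt = sb.ToString();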
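
To see the difference without needing a file on disk, the following sketch runs both approaches over the same in-memory bytes. The data is an assumed minimal example: a UTF-8 BOM followed by the encoding of the single character U+2122.

```csharp
using System;
using System.IO;
using System.Text;

// UTF-8 BOM followed by the 3-byte encoding of U+2122.
byte[] data = { 0xEF, 0xBB, 0xBF, 0xE2, 0x84, 0xA2 };

// Byte-per-char reading: every byte becomes its own character.
var sb = new StringBuilder();
foreach (byte b in data)
{
    sb.Append((char)b);
}
string wrong = sb.ToString();

// StreamReader detects the BOM, decodes the rest as UTF-8,
// and does not include the BOM in the returned text.
string right;
using (var reader = new StreamReader(new MemoryStream(data)))
{
    right = reader.ReadToEnd();
}

Console.WriteLine(wrong.Length); // 6
Console.WriteLine(right.Length); // 1
```

The byte-per-char version yields six characters, none of which is the intended one, while the `StreamReader` version yields the single character that was encoded.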
