The Byte Order Marker (BOM) is a short sequence of bytes placed at the beginning of an encoded text stream (or file). It allows the reader to determine which character encoding to use when decoding the stream back into a sequence of characters. Byte order markers are not specific to XML, but it's typical to see them when XML data is streamed or stored in files.
The first thing to appreciate is that there are a number of ways in which a sequence of characters can be represented in binary form. The process of turning characters into binary data is termed character encoding.
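To make this concrete, the following minimal sketch encodes the same text with different encodings and returns the resulting bytes as hexadecimal. The `EncodingDemo` class and its `ToHex` helper are hypothetical names introduced only for illustration; `Encoding.GetBytes` and `BitConverter.ToString` are standard .NET calls.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Encodes `text` with the given encoding and returns the bytes
    // as space-separated hexadecimal, for example "E2 84 A2".
    // Note: GetBytes does not emit a BOM; only the character data is returned.
    public static string ToHex(string text, Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes(text);
        return BitConverter.ToString(bytes).Replace("-", " ");
    }
}
```

For the single character '™' (`"\u2122"`), `ToHex` returns `E2 84 A2` with `Encoding.UTF8`, `22 21` with `Encoding.Unicode` (UTF-16 LE) and `21 22` with `Encoding.BigEndianUnicode` (UTF-16 BE): three different byte sequences for the same character.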
The following table lists the standard BOMs in use for common encodings.
Encoding | Representation (hexadecimal) | Representation (decimal) | Representation (ISO-8859-1) |
---|---|---|---|
UTF-8[t 1] | EF BB BF | 239 187 191 | ï»¿ |
UTF-16 (BE) | FE FF | 254 255 | þÿ |
UTF-16 (LE) | FF FE | 255 254 | ÿþ |
UTF-32 (BE) | 00 00 FE FF | 0 0 254 255 | □□þÿ (□ represents the ASCII null character) |
UTF-32 (LE) | FF FE 00 00 | 255 254 0 0 | ÿþ□□ (□ represents the ASCII null character) |
UTF-7[t 1] | 2B 2F 76 38 [t 2], 2B 2F 76 38 2D [t 3] | 43 47 118 56 | +/v8 |
UTF-1[t 1] | F7 64 4C | 247 100 76 | ÷dL |
UTF-EBCDIC[t 1] | DD 73 66 73 | 221 115 102 115 | Ýsfs |
SCSU[t 1] | 0E FE FF [t 4] | 14 254 255 | □þÿ (□ represents the ASCII "shift out" character) |
BOCU-1[t 1] | FB EE 28 | 251 238 40 | ûî( |
GB-18030[t 1] | 84 31 95 33 | 132 49 149 51 | □1■3 (□ and ■ represent unmapped ISO-8859-1 characters) |
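A lookup against this table can be sketched in a few lines. The example below is an illustration under stated assumptions, not a complete detector: the `BomSniffer` class and `Detect` method are hypothetical names, only the five UTF-8/16/32 rows are included, and tuple syntax assumes C# 7 or later. The longer signatures are tested first, because the UTF-32 (LE) BOM begins with the same two bytes as the UTF-16 (LE) BOM.

```csharp
using System;

public static class BomSniffer
{
    // Signatures taken from the table above. Longer ones come first so that
    // FF FE 00 00 (UTF-32 LE) is not mistaken for FF FE (UTF-16 LE).
    private static readonly (string Name, byte[] Bom)[] KnownBoms =
    {
        ("UTF-32 (BE)", new byte[] { 0x00, 0x00, 0xFE, 0xFF }),
        ("UTF-32 (LE)", new byte[] { 0xFF, 0xFE, 0x00, 0x00 }),
        ("UTF-8",       new byte[] { 0xEF, 0xBB, 0xBF }),
        ("UTF-16 (BE)", new byte[] { 0xFE, 0xFF }),
        ("UTF-16 (LE)", new byte[] { 0xFF, 0xFE }),
    };

    // Returns the encoding name for the BOM at the start of `data`,
    // or null when none of the known signatures matches.
    public static string Detect(byte[] data)
    {
        foreach (var entry in KnownBoms)
        {
            if (StartsWith(data, entry.Bom))
            {
                return entry.Name;
            }
        }
        return null;
    }

    // True when `data` begins with every byte of `prefix`.
    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length)
        {
            return false;
        }
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i])
            {
                return false;
            }
        }
        return true;
    }
}
```

Note that a UTF-16 (LE) file whose first character is U+0000 would be reported as UTF-32 (LE) by this ordering; that ambiguity is inherent in the signatures themselves, and the sketch simply prefers the longer match.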
As you can see, the first 3 bytes of the file are 0xEF, 0xBB, 0xBF; looking these up in the table above shows that the encoding is UTF-8.
The rest of the file can therefore be read and decoded using this encoding.
Using the UTF-8 character encoding, the last 3 bytes in the file (0xE2 0x84 0xA2) represent the character value 8482 (U+2122), which is the character '™'.
The resulting string is therefore:
XML the smart way™
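The decoding steps above can be reproduced with a short sketch. It assumes the file consists of the UTF-8 BOM, the ASCII text "XML the smart way" and the three bytes E2 84 A2; the `Utf8Walkthrough` class and both method names are hypothetical. `DecodeThreeByteSequence` shows the bit arithmetic that turns a three-byte UTF-8 sequence into a character value, and `DecodeUtf8File` skips the BOM before handing the remaining bytes to the standard `Encoding.UTF8.GetString`.

```csharp
using System;
using System.Text;

public static class Utf8Walkthrough
{
    // Combines the payload bits of a 3-byte UTF-8 sequence
    // (1110xxxx 10xxxxxx 10xxxxxx) into a single character value.
    public static int DecodeThreeByteSequence(byte b0, byte b1, byte b2)
    {
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }

    // Skips a leading UTF-8 BOM (EF BB BF), if present, and decodes the
    // remaining bytes as UTF-8.
    public static string DecodeUtf8File(byte[] fileBytes)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();

        bool hasBom = fileBytes.Length >= bom.Length;
        for (int i = 0; hasBom && i < bom.Length; i++)
        {
            hasBom = fileBytes[i] == bom[i];
        }

        int offset = hasBom ? bom.Length : 0;
        return Encoding.UTF8.GetString(fileBytes, offset, fileBytes.Length - offset);
    }
}
```

`DecodeThreeByteSequence(0xE2, 0x84, 0xA2)` yields 0x2000 + 0x0100 + 0x22 = 0x2122, which is 8482 in decimal, matching the character value given above.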
Most platforms and languages have built-in support for decoding character streams, so most of the time you are unaware that BOMs are even in use.
Correctly reading using .NET

    using System.IO;

    // StreamReader checks for a BOM by default and selects the matching encoding.
    using (Stream s = new FileStream("XmlTheSmartWay.txt", FileMode.Open, FileAccess.Read))
    using (TextReader tr = new StreamReader(s))
    {
        string txt = tr.ReadToEnd();
    }
The above sample will read and decode the values correctly. However, if this support is missing or bypassed, problems can occur. In the following example the reader assumes 1 byte per char, so if the file contains UTF-8 data, any byte values above 127 will be interpreted incorrectly, and the BOM bytes themselves end up in the string as characters.
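Incorrectly reading using .NET

    using System.IO;
    using System.Text;

    StringBuilder sb = new StringBuilder();
    using (Stream s = new FileStream("XmlTheSmartWay.txt", FileMode.Open, FileAccess.Read))
    {
        int value;
        // Wrong: every byte is cast to one char, so the BOM and any
        // multi-byte UTF-8 sequence turn into unrelated characters.
        while ((value = s.ReadByte()) != -1)
        {
            sb.Append((char)value);
        }
    }
    string txt = sb.ToString();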
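To see the difference without a file on disk, the two approaches can be run against the same in-memory bytes. This is a sketch under assumptions: `NaiveReadDemo` and its method names are hypothetical, and the input is assumed to be a UTF-8 BOM followed by UTF-8 text. `ReadBytePerChar` mirrors the byte-per-char loop above, while `ReadWithStreamReader` uses the standard `StreamReader(Stream, Encoding, bool)` constructor with BOM detection switched on.

```csharp
using System.IO;
using System.Text;

public static class NaiveReadDemo
{
    // Mirrors the incorrect loop: one byte becomes one char.
    public static string ReadBytePerChar(Stream s)
    {
        var sb = new StringBuilder();
        int value;
        while ((value = s.ReadByte()) != -1)
        {
            sb.Append((char)value);
        }
        return sb.ToString();
    }

    // Lets StreamReader consume the BOM and decode with the detected encoding
    // (UTF-8 is used as the fallback when no BOM is present).
    public static string ReadWithStreamReader(Stream s)
    {
        using (var reader = new StreamReader(s, Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

For the bytes EF BB BF followed by "XML the smart way" and E2 84 A2, `ReadWithStreamReader` returns the 18-character string `XML the smart way™`, whereas `ReadBytePerChar` returns 23 characters: the text is prefixed with `ï»¿` and '™' is replaced by `â`, an invisible control character (U+0084) and `¢`.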