In most modern programming languages you don't typically need to be too concerned with the details of XML encoding, as they are taken care of automatically by the framework you are working with (.NET, Java etc). Occasionally, however, you run into problems, and then it's a good idea to be familiar with the concepts in order to solve them.
Character data is encoded in a number of different ways by different systems. The Unicode Consortium was formed to standardize this. Unicode provides a standard mapping that says character value 50 is a '2', character value 169 is a '©' etc.
There are many thousands of these mappings. They cover the standard Latin alphabet, but also many others including Greek, Arabic and Chinese.
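As a small illustration of this mapping, the sketch below uses Python's built-in `ord` and `chr` to convert between a character and its Unicode value. Python is an assumption made for these examples; the values are the ones quoted above.

```python
# Unicode assigns every character a number.
two = ord("2")             # value of the digit two
copyright_sign = ord("\u00a9")  # value of the copyright sign
print(two, copyright_sign)

# The mapping works in both directions.
assert chr(50) == "2"

# It also covers non-Latin scripts, for example Greek small alpha.
alpha = "\u03b1"
print(hex(ord(alpha)))
```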
There are fundamentally 2 ways in which character data is encoded in order to store it on disk or send it over a network: fixed-length encoding and variable-length (multibyte) encoding.
Fixed-length encoding says n bytes will be used to encode a single character. This makes strings easy for computer programs to manipulate, but it places a limit on the number of possible characters that can be encoded.
Bytes per character | Number of possible characters |
---|---|
1 | 0xFF (255) |
2 | 0xFFFF (65,535) |
4 | 0xFFFFFFFF (4,294,967,295) |
The value 0 has a special meaning: it marks the end of a string and is not used to encode a character.
1 byte per character is what all applications used to use (and some still do). It's clearly not enough, so inventive ways were devised to work around this.
2 bytes per character was once considered enough, i.e. UCS-2 (2-byte Universal Character Set). This was the size of a character within the Windows operating system (later editions moved to UTF-16). However, Unicode has since exceeded this range by defining values larger than 0xFFFF.
4 bytes per character is future proof. This size is used within Linux and is unlikely ever to be exceeded (Unicode values currently stop at 0x10FFFF).
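To make the size limits concrete, the sketch below uses Python's `utf-32-le` codec as a stand-in for a fixed 4-bytes-per-character encoding and checks by hand whether a value would fit into 2 bytes. The helper name `fits_in_two_bytes` is made up for this illustration.

```python
def fits_in_two_bytes(ch: str) -> bool:
    """True if the character's Unicode value can be stored in 2 bytes."""
    return ord(ch) <= 0xFFFF

c_acute = "\u0106"      # Latin capital C with acute, value 0x0106
emoji = "\U0001F600"    # a character defined above 0xFFFF

# With 4 bytes per character every character takes exactly 4 bytes,
# and the stored number is simply the Unicode value.
for ch in (c_acute, emoji):
    raw = ch.encode("utf-32-le")
    assert len(raw) == 4
    assert int.from_bytes(raw, "little") == ord(ch)

print(fits_in_two_bytes(c_acute))
print(fits_in_two_bytes(emoji))
```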
When fixed-length data is read or written, a raw character value is read or written (be it 1, 2 or 4 bytes long), and the value is mapped to or from its Unicode value.
With 4 bytes per character, no mapping needs to be done (as 4 bytes covers all the possible Unicode values).
With 2 bytes per character, typically no mapping is done: the values 0xFFFF and below are read as Unicode values, and it's not possible to encode values larger than 0xFFFF.
With 1 byte per character, the byte value (0-255) is translated through a mapping table (referred to as a code page). This works fine when decoding (reading), as the mapping table contains entries for all 256 possible byte values, but it still limits the number of possible characters to at most 256.
When encoding (writing), Unicode values are looked up within the code page mapping and the resulting value 0-255 is written. If a Unicode value is used that is not defined in the code page, no mapping is possible and a default value is written (typically a ?).
The key point is that by picking a code page you decide which characters (256 at most) from the Unicode set will be used.
It also means that when reading a document you need to use the code page it was encoded with; otherwise the mappings are wrong and the resulting data is garbled.
[Figure: character tables for Code Page 874 (Thai) and Code Page 1256 (Arabic)]
In most code pages the first 128 values (0-127) are common (known as the ASCII set), while the values 128-255 are defined by the code page. The tables above show the 256 mapping values for 2 code pages; note that the second half of each table is quite different.
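The effect of picking the wrong code page can be sketched with Python's built-in `cp874`, `cp1256` and `cp1252` codecs. The byte value 0xA1 is an arbitrary choice from the upper half, and `errors="replace"` is Python's way of writing a default character for unmapped values.

```python
raw = b"\xa1"  # one byte from the upper half (128-255)

# The same byte means different things under different code pages.
thai = raw.decode("cp874")      # Code Page 874 (Thai)
arabic = raw.decode("cp1256")   # Code Page 1256 (Arabic)
print(hex(ord(thai)), hex(ord(arabic)))

# The lower half (ASCII) is shared, so it decodes identically.
assert b"XML".decode("cp874") == b"XML".decode("cp1256") == "XML"

# Encoding a character the code page does not define: with
# errors="replace" Python writes "?" in its place.
lost = "\u0106".encode("cp1252", errors="replace")
print(lost)
```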
The second approach is variable-length (multibyte) encoding. Multibyte encodings were a solution that gave applications written to work with 1 byte per character access to the full range of Unicode characters.
The idea is that simple or common characters like the 0-127 values still use 1 byte per character. If you want to use a more exotic character, you write a control value (say 0xC4) which tells the interpreter that another byte is required to fully describe this character. The interpreter then reads the next byte (let's say 0x86) and, after some bit twiddling, comes up with the Unicode value 0x0106 'Ć'. Other start values can lead to a single character being encoded in 1, 2, 3 or 4 bytes, allowing the whole Unicode range of characters to be encoded.
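The bit twiddling for the two-byte case can be sketched as follows. This is a minimal illustration of the UTF-8 rules for that case only (it does not handle 3- or 4-byte sequences or reject every malformed input), and the function name `decode_two_byte_utf8` is made up for the example.

```python
def decode_two_byte_utf8(lead: int, cont: int) -> int:
    """Combine a 2-byte UTF-8 sequence into one Unicode value."""
    # A lead byte of the form 110xxxxx announces one continuation byte.
    if lead & 0b1110_0000 != 0b1100_0000:
        raise ValueError("not a 2-byte lead byte")
    # Continuation bytes always have the form 10xxxxxx.
    if cont & 0b1100_0000 != 0b1000_0000:
        raise ValueError("not a continuation byte")
    # Keep the 5 payload bits of the lead and the 6 of the continuation.
    return ((lead & 0b0001_1111) << 6) | (cont & 0b0011_1111)

value = decode_two_byte_utf8(0xC4, 0x86)
print(hex(value))  # 0x106
```

The result agrees with what Python's own decoder produces for the bytes C4 86.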
This approach is not without its drawbacks, as the application can no longer assume that a string 10 characters long takes up 10 bytes (this can lead to some rather nasty bugs). Applications now typically store their data internally as Unicode (as mentioned above, using either 2 or 4 bytes per character), and this approach to manipulating data within an application has largely been dropped.
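A short sketch of the "10 characters is not 10 bytes" problem, using a sample string chosen for this illustration:

```python
text = "na\u00efve caf\u00e9"   # 10 characters, two of them non-ASCII
encoded = text.encode("utf-8")

print(len(text))      # number of characters
print(len(encoded))   # number of bytes

# Cutting the byte string at an arbitrary offset can split a character.
try:
    encoded[:3].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
print(split_ok)
```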
This encoding approach, which once looked like a stopgap solution, was repurposed as the internet grew. Many internet protocols (HTTP, SMTP, NNTP etc) are built on 7- or 8-bit-per-character, text-based communication channels. This left no room for the full Unicode range, so multibyte encoding schemes (notably UTF-8) have been piggybacked on top of them in order to expand the range of characters that can be represented.
There are many multibyte encoding schemes, the most common being UTF-8. The UTF-8 Wikipedia entry describes the nuts and bolts of performing the encoding in all its gory detail.
When the XML parser reads an XML file or stream, it's reading binary data: bytes, not characters. The first thing it needs to figure out is how many bytes make up a character. If the data is encoded using 2 or 4 bytes per character, then the stream/file must start with what is referred to as a Byte Order Mark (BOM); if UTF-8 is used, the BOM is optional. The BOM potentially tells the reader how to decode the rest of the file/stream.
Encoding | Representation (hexadecimal) |
---|---|
UTF-8 | EF BB BF |
UTF-16 (BE) | FE FF |
UTF-16 (LE) | FF FE |
UTF-32 (BE) | 00 00 FE FF |
UTF-32 (LE) | FF FE 00 00 |
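A reader's BOM check could look roughly like the sketch below. It relies on the BOM constants in Python's standard `codecs` module; the function name `detect_bom` and the labels it returns are choices made for this example, not part of any XML library.

```python
import codecs

# Longest marks first: the UTF-32 (LE) mark begins with the UTF-16 (LE) one.
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 (BE)"),
    (codecs.BOM_UTF32_LE, "UTF-32 (LE)"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 (BE)"),
    (codecs.BOM_UTF16_LE, "UTF-16 (LE)"),
]

def detect_bom(data: bytes):
    """Return (encoding label, BOM length), or (None, 0) if no BOM."""
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label, len(bom)
    return None, 0

print(detect_bom(b"\xef\xbb\xbf<a/>"))
```

The order of the checks matters: testing for FF FE before FF FE 00 00 would misreport a UTF-32 (LE) file as UTF-16 (LE).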
A reader starts out with a default encoding (the XML specification treats a document with neither a BOM nor an encoding declaration as UTF-8, although some readers fall back to a local code page specific to the geographic region). The presence of a BOM at the beginning of the file/stream changes the decoder. If there is no BOM, the reader continues with the default until it has read the XML Declaration. This is the reason the XML Declaration must be the first thing in the file.
<?xml version="1.0" encoding="UTF-8" ?>
The XML Declaration can name the encoding used to encode the rest of the file.
After the XML reader has read the XML Declaration, it changes its decoder to the one specified, allowing it to correctly interpret the rest of the data in the file.
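The step of reading the encoding attribute can be approximated as below. This is a simplified sketch, not how a conforming parser does it: it assumes the declaration is stored in an ASCII-compatible encoding, and the regular expression and the name `declared_encoding` are inventions for this example.

```python
import re

# Simplified pattern for the encoding attribute of an XML Declaration.
_DECL = re.compile(
    rb"<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)

def declared_encoding(data: bytes):
    """Return the encoding named in the XML Declaration, or None."""
    match = _DECL.match(data)
    return match.group(1).decode("ascii") if match else None

doc = b'<?xml version="1.0" encoding="UTF-8" ?><MyElm/>'
print(declared_encoding(doc))
```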
Take a file that starts with 0xFF 0xFE. This indicates it's a UTF-16 file (little endian), so the rest of the file can be decoded accordingly.
The XML Declaration attribute encoding="utf-16" is technically redundant, as this information is already carried by the BOM.
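The sketch below builds such a file in memory and decodes it with Python's generic `utf-16` codec, which uses the BOM to pick the byte order. The document content is a made-up sample.

```python
import codecs

xml_text = '<?xml version="1.0" encoding="utf-16"?><MyElm/>'

# Build the file contents by hand: BOM first, then little-endian UTF-16.
data = codecs.BOM_UTF16_LE + xml_text.encode("utf-16-le")
assert data[:2] == b"\xff\xfe"

# The generic "utf-16" codec reads the BOM, picks the byte order and
# drops the mark from the decoded text.
decoded = data.decode("utf-16")
print(decoded == xml_text)
```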
Take a file that starts with 0xEF 0xBB 0xBF. This indicates it's a UTF-8 file, so the rest of the file can be decoded accordingly.
If the file contains a UTF-8 BOM, then the XML Declaration attribute encoding="utf-8" is technically redundant. However, the BOM is not always present, and in those cases the encoding attribute is required.
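Because the UTF-8 BOM is optional, code that decodes the bytes itself has to cope with both variants. One way to do that in Python, shown here as a sketch with a made-up sample document, is the `utf-8-sig` codec:

```python
import codecs

body = '<?xml version="1.0" encoding="utf-8"?><MyElm/>'
with_bom = codecs.BOM_UTF8 + body.encode("utf-8")
without_bom = body.encode("utf-8")

# "utf-8-sig" strips a leading BOM if there is one and otherwise
# behaves like plain UTF-8, so it handles both variants.
assert with_bom.decode("utf-8-sig") == body
assert without_bom.decode("utf-8-sig") == body

# Plain "utf-8" keeps the BOM as an invisible first character (U+FEFF).
first_char = with_bom.decode("utf-8")[0]
print(ascii(first_char))
```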
Take a file that contains no BOM. The start of the file is decoded using the default encoding, which is enough to read the XML Declaration because it contains only ASCII characters.
When the XML Declaration attribute encoding="windows-1252" is read, the code page 1252 mapping is loaded and used to decode the rest of the document.
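As an illustration, Python's standard `xml.etree.ElementTree` parser follows the declaration when it is handed raw bytes. The sample document is made up; it contains the single byte 0xE9, which code page 1252 maps to 'é'.

```python
import xml.etree.ElementTree as ET

# No BOM; the byte 0xE9 is "e with acute" in code page 1252.
data = b'<?xml version="1.0" encoding="windows-1252"?><MyElm>caf\xe9</MyElm>'

# Given bytes, the parser reads the declaration and decodes accordingly.
root = ET.fromstring(data)
print(ascii(root.text))

# The same bytes are not valid UTF-8, so ignoring the declaration fails.
try:
    data.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)
```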
When you have XML data contained within a string, care must be taken when the string is serialized to encode the data in a way that matches the encoding attribute.
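The following would result in the file being written as UTF-16 (which .NET calls Unicode), while the declaration inside it claims UTF-8:
using System.IO;
using System.Text;

string txt = "<?xml version='1.0' encoding='utf-8'?><MyElm>Complex Chars </MyElm>";
// Encoding.Unicode is UTF-16 (little endian), which contradicts encoding='utf-8' above.
File.WriteAllText("myfilename.xml", txt, Encoding.Unicode);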
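The same mismatch can be reproduced in a few lines of Python, which also shows the straightforward remedy of serializing with the encoding the declaration names. The 'Ć' appended to the text is a sample character added for this sketch.

```python
txt = "<?xml version='1.0' encoding='utf-8'?><MyElm>Complex Chars \u0106</MyElm>"

# Mismatch: the declaration says utf-8 but the bytes are UTF-16.
mismatched = txt.encode("utf-16")
try:
    mismatched.decode("utf-8")
    readable_as_declared = True
except UnicodeDecodeError:
    readable_as_declared = False
print(readable_as_declared)

# Consistent: serialize with the same encoding the declaration names.
consistent = txt.encode("utf-8")
assert consistent.decode("utf-8") == txt
```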
UTF-16 is actually a variable-length encoding (in exactly the same way UTF-8 is). Most programming libraries (e.g. Java, .NET) include the ability to decode this stream correctly. However, it's not uncommon to see it treated as a fixed 2-bytes-per-character encoding, which is technically wrong.
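The variable-length part of UTF-16 applies to values above 0xFFFF, which are stored as a pair of 16-bit units (a surrogate pair). A minimal sketch of the arithmetic, with a hypothetical helper `to_surrogates`:

```python
def to_surrogates(code_point: int):
    """Split a value above 0xFFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only values above 0xFFFF need a pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

ch = "\U0001F600"
high, low = to_surrogates(ord(ch))
print(hex(high), hex(low))

# One character, but two 16-bit units (4 bytes) in UTF-16.
assert len(ch) == 1
assert len(ch.encode("utf-16-le")) == 4
```

Code that assumes 2 bytes per character would count this single character as two.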