Multi-Language Support

Before we can look at multi-language issues, we must first understand how characters are encoded. There are various ways to encode characters; we will look at the main ones: Unicode, code pages and multibyte encoding.

UNICODE

On the Windows platform, Unicode text is encoded using 2 bytes per character (UTF-16). Two bytes give 0x10000 (65,536) possible values, 0x0000 to 0xFFFF. The Unicode standard dictates the code used to represent each character (see http://www.unicode.org/charts/).
Typically, when a Unicode document is written to a file, the file is prefixed with the 2 bytes 0xFF 0xFE (the byte order mark). When the file is read, these bytes indicate that the file is Unicode, and they are not treated as part of the text.
Windows NT, 2000 & XP all use Unicode internally to represent characters.
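
As a rough illustration of the 2-byte encoding and the file prefix described above, the following Python sketch builds the bytes by hand. It assumes little-endian UTF-16, which is the byte order that the 0xFF 0xFE prefix indicates. The function names are made up for this example.

```python
import codecs


def to_unicode_file_bytes(text: str) -> bytes:
    """Encode text as 2-byte little-endian units, prefixed with 0xFF 0xFE."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def read_unicode_bytes(data: bytes) -> str:
    """Decode file bytes; the 'utf-16' codec uses the prefix to pick the
    byte order and does not return it as part of the text."""
    return data.decode("utf-16")


file_bytes = to_unicode_file_bytes("A")
print(" ".join(f"{b:02x}" for b in file_bytes))  # ff fe 41 00
print(read_unicode_bytes(file_bytes))            # A
```

Note that the letter A occupies two bytes (0x41 0x00), one of which is a null byte.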

  Table 1.1 - Unicode Characters 0x00 - 0xff
00 01 . 02 . 03 . 04 . 05 . 06 . 07 . 08 . 09 0A 0B 0C 0D 0E . 0F .
10 . 11 . 12 . 13 . 14 . 15 . 16 . 17 . 18 . 19 . 1A . 1B . 1C . 1D . 1E . 1F .
20 21 ! 22 " 23 # 24 $ 25 % 26 & 27 ' 28 ( 29 ) 2A * 2B + 2C , 2D - 2E . 2F /
30 0 31 1 32 2 33 3 34 4 35 5 36 6 37 7 38 8 39 9 3A : 3B ; 3C < 3D = 3E > 3F ?
40 @ 41 A 42 B 43 C 44 D 45 E 46 F 47 G 48 H 49 I 4A J 4B K 4C L 4D M 4E N 4F O
50 P 51 Q 52 R 53 S 54 T 55 U 56 V 57 W 58 X 59 Y 5A Z 5B [ 5C \ 5D ] 5E ^ 5F _
60 ` 61 a 62 b 63 c 64 d 65 e 66 f 67 g 68 h 69 i 6A j 6B k 6C l 6D m 6E n 6F o
70 p 71 q 72 r 73 s 74 t 75 u 76 v 77 w 78 x 79 y 7A z 7B { 7C | 7D } 7E ~ 7F 
80 € 81  82 ‚ 83 ƒ 84 „ 85 … 86 † 87 ‡ 88 ˆ 89 ‰ 8A Š 8B ‹ 8C Œ 8D  8E Ž 8F 
90  91 ‘ 92 ’ 93 “ 94 ” 95 • 96 – 97 — 98 ˜ 99 ™ 9A š 9B › 9C œ 9D  9E ž 9F Ÿ
A0   A1 ¡ A2 ¢ A3 £ A4 ¤ A5 ¥ A6 ¦ A7 § A8 ¨ A9 © AA ª AB « AC ¬ AD ­ AE ® AF ¯
B0 ° B1 ± B2 ² B3 ³ B4 ´ B5 µ B6 ¶ B7 · B8 ¸ B9 ¹ BA º BB » BC ¼ BD ½ BE ¾ BF ¿
C0 À C1 Á C2 Â C3 Ã C4 Ä C5 Å C6 Æ C7 Ç C8 È C9 É CA Ê CB Ë CC Ì CD Í CE Î CF Ï
D0 Ð D1 Ñ D2 Ò D3 Ó D4 Ô D5 Õ D6 Ö D7 × D8 Ø D9 Ù DA Ú DB Û DC Ü DD Ý DE Þ DF ß
E0 à E1 á E2 â E3 ã E4 ä E5 å E6 æ E7 ç E8 è E9 é EA ê EB ë EC ì ED í EE î EF ï
F0 ð F1 ñ F2 ò F3 ó F4 ô F5 õ F6 ö F7 ÷ F8 ø F9 ù FA ú FB û FC ü FD ý FE þ FF ÿ

  Table 1.2 - Unicode characters 0x7f00 - 0x7fff (CJK ideographs)
7F3F 缿   7F7F 罿   7FBF 羿   7FFF 翿

Code Pages

On older systems, each character is encoded using a single byte, which provides 256 possible values (0x00 to 0xFF). This proved limiting, as there were not enough values for all the characters that needed to be represented. The solution was to use a code page. The code page specifies which characters the 256 values map to. An application sets its code page, thus determining which 256 characters it has available to use.
Problems arise with this system when converting data between different code pages. It is normal for many of the characters in the source code page to be missing from the destination code page (and vice versa). This causes the data, or the interpretation of the data, to be corrupted. To avoid this, a system will typically try to use the same code page across all of its applications.
Tables 2.1 and 2.2 each show a sample code page; for a full list see http://www.microsoft.com/globaldev/reference/cphome.mspx

  Table 2.1 - Microsoft Windows Codepage : 932 (Japanese Shift-JIS)
  Table 2.2 - Microsoft Windows Codepage : 1252 (Latin I)
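
The effect of the code page on a single byte, and the loss that occurs when converting between code pages, can be sketched in Python. This is only an illustration: it relies on Python's built-in "cp1252" and "cp932" codecs, and the helper name convert is an assumption of this example.

```python
raw = bytes([0xB1])  # one byte, two possible meanings

as_latin = raw.decode("cp1252")    # plus-minus sign under code page 1252
as_japanese = raw.decode("cp932")  # halfwidth katakana A under code page 932


def convert(text: str, target_code_page: str) -> bytes:
    """Re-encode text for another code page.

    Characters missing from the target code page are replaced with '?',
    which is how information gets lost in the conversion.
    """
    return text.encode(target_code_page, errors="replace")


print(as_latin, as_japanese)
print(convert("\u3042", "cp1252"))  # hiragana A has no slot in 1252: b'?'
```

The same byte value is displayed as two unrelated characters, and a character that exists only in the Japanese code page cannot survive a conversion to Latin I.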

MULTIBYTE

Now this is where things get interesting. There are a number of multibyte formats, but we are going to look primarily at UTF-8 (Unicode Transformation Format, 8-bit). Multibyte means that a variable number of bytes is used to encode a single character (in UTF-8, between 1 and 4).

So why would anyone have developed such an encoding?
Well, let's look at an example. You develop a new application that uses Unicode internally. Everything is going fine until you need to pass some information via an older system (say, email). This old system does not understand Unicode, and when you attempt to pass your Unicode strings to it, it comes across a null byte in the string (the character A, for example, is stored as the two bytes 0x41 0x00) and stops reading, thinking it has hit the end of the string.
Multibyte encoding promises a solution. Before passing your string to the older system, you encode your Unicode string as UTF-8. The legacy system can quite happily use the encoded string, even if it can't display all of it correctly on screen: extended characters will not appear correctly, but all regular characters (0x00-0x7f) are fine. On the other side, the data can be extracted and decoded back into Unicode.
The advantage of UTF-8 over other encodings such as BASE64, UUE and HEX is that the message is still (mostly) readable, with only the extended characters appearing corrupted.
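
The scenario above can be sketched in a few lines of Python. The helper c_string_view is a made-up stand-in for a legacy reader that stops at the first null byte, and code page 1252 is assumed to be what the legacy system uses for display.

```python
text = "Caf\u00e9"  # "Cafe" with an acute accent on the final e

utf16 = text.encode("utf-16-le")  # 2 bytes per character
utf8 = text.encode("utf-8")       # 1 byte for ASCII, 2 for the accented e


def c_string_view(data: bytes) -> bytes:
    """Return what a null-terminated reader keeps: the bytes before the first 0x00."""
    return data.split(b"\x00", 1)[0]


print(c_string_view(utf16))   # b'C' (truncated at the first null byte)
print(c_string_view(utf8))    # b'Caf\xc3\xa9' (nothing lost)
print(utf8.decode("cp1252"))  # legacy display: ASCII intact, accented e garbled
print(utf8.decode("utf-8"))   # decoded properly on the other side
```

The UTF-8 bytes contain no null byte, the ASCII characters keep their usual values, and the original text is fully recoverable by the receiving application.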

  Table 3.1 - Characters 0x80 - 0xff UTF-8 encodings
Char Decimal char code Hex char code  Hex Unicode char code Decimal Unicode char code Decimal UTF-8 char code   Hex UTF-8 char code  UTF-8 encoded string view using the Windows code page 1252
128 80 00C4 196 195.132 C384 Ä
 129 81 00C5 197 195.133 C385 Ã…
130 82 00C7 199 195.135 C387 Ç
ƒ 131 83 00C9 201 195.137 C389 É
132 84 00D1 209 195.145 C391 Ñ
133 85 00D6 214 195.150 C396 Ö
134 86 00DC 220 195.156 C39C Ü
135 87 00E1 225 195.161 C3A1 á
ˆ 136 88 00E0 224 195.160 C3A0 à
137 89 00E2 226 195.162 C3A2 â
Š 138 8A 00E4 228 195.164 C3A4 ä
139 8B 00E3 227 195.163 C3A3 ã
Œ 140 8C 00E5 229 195.165 C3A5 Ã¥
 141 8D 00E7 231 195.167 C3A7 ç
Ž 142 8E 00E9 233 195.169 C3A9 é
 143 8F 00E8 232 195.168 C3A8 è
 144 90 00EA 234 195.170 C3AA ê
145 91 00EB 235 195.171 C3AB ë
146 92 00ED 237 195.173 C3AD í
147 93 00EC 236 195.172 C3AC ì
148 94 00EE 238 195.174 C3AE î
149 95 00EF 239 195.175 C3AF ï
150 96 00F1 241 195.177 C3B1 ñ
151 97 00F3 243 195.179 C3B3 ó
˜ 152 98 00F2 242 195.178 C3B2 ò
153 99 00F4 244 195.180 C3B4 ô
š 154 9A 00F6 246 195.182 C3B6 ö
155 9B 00F5 245 195.181 C3B5 õ
œ 156 9C 00FA 250 195.186 C3BA ú
 157 9D 00F9 249 195.185 C3B9 ù
ž 158 9E 00FB 251 195.187 C3BB û
Ÿ 159 9F 00FC 252 195.188 C3BC ü
160 A0 2020 8224 226.128.160 E280A0 †
¡ 161 A1 00B0 176 194.176 C2B0 °
¢ 162 A2 00A2 162 194.162 C2A2 ¢
£ 163 A3 00A3 163 194.163 C2A3 £
¤ 164 A4 00A7 167 194.167 C2A7 §
¥ 165 A5 2022 8226 226.128.162 E280A2 •
¦ 166 A6 00B6 182 194.182 C2B6 ¶
§ 167 A7 00DF 223 195.159 C39F ß
¨ 168 A8 00AE 174 194.174 C2AE ®
© 169 A9 00A9 169 194.169 C2A9 ©
ª 170 AA 2122 8482 226.132.162 E284A2 â„¢
« 171 AB 00B4 180 194.180 C2B4 ´
¬ 172 AC 00A8 168 194.168 C2A8 ¨
­ 173 AD 2260 8800 226.137.160 E289A0 ≠
® 174 AE 00C6 198 195.134 C386 Æ
¯ 175 AF 00D8 216 195.152 C398 Ø
° 176 B0 221E 8734 226.136.158 E2889E ∞
± 177 B1 00B1 177 194.177 C2B1 ±
² 178 B2 2264 8804 226.137.164 E289A4 ≤
³ 179 B3 2265 8805 226.137.165 E289A5 ≥
´ 180 B4 00A5 165 194.165 C2A5 Â¥
µ 181 B5 00B5 181 194.181 C2B5 µ
182 B6 2202 8706 226.136.130 E28882 ∂
· 183 B7 2211 8721 226.136.145 E28891 ∑
¸ 184 B8 220F 8719 226.136.143 E2888F ∏
¹ 185 B9 03C0 960 207.128 CF80 Ï€
º 186 BA 222B 8747 226.136.171 E288AB ∫
» 187 BB 00AA 170 194.170 C2AA ª
¼ 188 BC 00BA 186 194.186 C2BA º
½ 189 BD 03A9 937 206.169 CEA9 Ω
¾ 190 BE 00E6 230 195.166 C3A6 æ
¿ 191 BF 00F8 248 195.184 C3B8 ø
À 192 C0 00BF 191 194.191 C2BF ¿
Á 193 C1 00A1 161 194.161 C2A1 ¡
 194 C2 00AC 172 194.172 C2AC ¬
à 195 C3 221A 8730 226.136.154 E2889A √
Ä 196 C4 0192 402 198.146 C692 Æ’
Å 197 C5 2248 8776 226.137.136 E28988 ≈
Æ 198 C6 2206 8710 226.136.134 E28886 ∆
Ç 199 C7 00AB 171 194.171 C2AB «
È 200 C8 00BB 187 194.187 C2BB »
É 201 C9 2026 8230 226.128.166 E280A6 …
Ê 202 CA 00A0 160 194.160 C2A0  
Ë 203 CB 00C0 192 195.128 C380 À
Ì 204 CC 00C3 195 195.131 C383 Ã
Í 205 CD 00D5 213 195.149 C395 Õ
Î 206 CE 0152 338 197.146 C592 Å’
Ï 207 CF 0153 339 197.147 C593 Å“
Ð 208 D0 2013 8211 226.128.147 E28093 –
Ñ 209 D1 2014 8212 226.128.148 E28094 —
Ò 210 D2 201C 8220 226.128.156 E2809C “
Ó 211 D3 201D 8221 226.128.157 E2809D ”
Ô 212 D4 2018 8216 226.128.152 E28098 ‘
Õ 213 D5 2019 8217 226.128.153 E28099 ’
Ö 214 D6 00F7 247 195.183 C3B7 ÷
× 215 D7 25CA 9674 226.151.138 E2978A â—Š
Ø 216 D8 00FF 255 195.191 C3BF ÿ
Ù 217 D9 0178 376 197.184 C5B8 Ÿ
Ú 218 DA 2044 8260 226.129.132 E28184 ⁄
Û 219 DB 20AC 8364 226.130.172 E282AC €
Ü 220 DC 2039 8249 226.128.185 E280B9 ‹
Ý 221 DD 203A 8250 226.128.186 E280BA ›
Þ 222 DE FB01 64257 239.172.129 EFAC81 fi
ß 223 DF FB02 64258 239.172.130 EFAC82 fl
à 224 E0 2021 8225 226.128.161 E280A1 ‡
á 225 E1 00B7 183 194.183 C2B7 ·
â 226 E2 201A 8218 226.128.154 E2809A ‚
ã 227 E3 201E 8222 226.128.158 E2809E „
ä 228 E4 2030 8240 226.128.176 E280B0 ‰
å 229 E5 00C2 194 195.130 C382 Â
æ 230 E6 00CA 202 195.138 C38A Ê
ç 231 E7 00C1 193 195.129 C381 Á
è 232 E8 00CB 203 195.139 C38B Ë
é 233 E9 00C8 200 195.136 C388 È
ê 234 EA 00CD 205 195.141 C38D Í
ë 235 EB 00CE 206 195.142 C38E ÃŽ
ì 236 EC 00CF 207 195.143 C38F Ï
í 237 ED 00CC 204 195.140 C38C ÃŒ
î 238 EE 00D3 211 195.147 C393 Ó
ï 239 EF 00D4 212 195.148 C394 Ô
ð 240 F0 F8FF 63743 239.163.191 EFA3BF 
ñ 241 F1 00D2 210 195.146 C392 Ã’
ò 242 F2 00DA 218 195.154 C39A Ú
ó 243 F3 00DB 219 195.155 C39B Û
ô 244 F4 00D9 217 195.153 C399 Ù
õ 245 F5 0131 305 196.177 C4B1 ı
ö 246 F6 02C6 710 203.134 CB86 ˆ
÷ 247 F7 02DC 732 203.156 CB9C Ëœ
ø 248 F8 00AF 175 194.175 C2AF ¯
ù 249 F9 02D8 728 203.152 CB98 ˘
ú 250 FA 02D9 729 203.153 CB99 Ë™
û 251 FB 02DA 730 203.154 CB9A Ëš
ü 252 FC 00B8 184 194.184 C2B8 ¸
ý 253 FD 02DD 733 203.157 CB9D ˝
þ 254 FE 02DB 731 203.155 CB9B Ë›
ÿ 255 FF 02C7 711 203.135 CB87 ˇ
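
The last four columns of Table 3.1 can be derived from the Unicode code alone. The Python sketch below shows one way to do it, including a hand-written encoder that follows the UTF-8 bit layout. Both function names are assumptions of this example, and the encoder does not reject the surrogate range 0xD800-0xDFFF, which a complete implementation would.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one code point by hand using the UTF-8 bit layout."""
    if code_point < 0x80:          # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:         # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:       # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),  # 11110xxx + three 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def utf8_row(code_point: int):
    """Return (decimal UTF-8 code, hex UTF-8 code, view in code page 1252)."""
    encoded = encode_utf8(code_point)
    decimal = ".".join(str(b) for b in encoded)
    hexadecimal = encoded.hex().upper()
    # Bytes with no character in code page 1252 are shown as U+FFFD.
    view = encoded.decode("cp1252", errors="replace")
    return decimal, hexadecimal, view


print(utf8_row(0x00C5))  # the row for char code 129
print(utf8_row(0x2122))  # the row for char code 170
```

For example, the code 0x00C5 yields 195.133 and C385, and the code 0x2122 yields 226.132.162 and E284A2, matching the corresponding rows of the table.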

XML

XML documents can be encoded using a number of different encodings. The encoding is indicated by the encoding attribute in the XML declaration at the top of the document (e.g. <?xml version="1.0" encoding="UTF-8"?>).

Writing an XML document to file

When an XML document is persisted as a file, it is safer to think of it as a stream of bytes as opposed to a stream of characters. When an XML document is serialized to a file, an encoding is applied to it. The resulting file is then correctly encoded for the encoding applied.
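
The bytes-not-characters view can be illustrated with Python's standard xml.etree.ElementTree module. This is only an analogy for the behaviour described here, not the generated-class API; save_xml is a name invented for the example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def save_xml(root: ET.Element, path: str, encoding: str = "utf-8") -> None:
    """Serialize the document straight to bytes in the given encoding.

    The declaration written to the file names the same encoding that
    was used to produce the bytes.
    """
    ET.ElementTree(root).write(path, encoding=encoding, xml_declaration=True)


root = ET.Element("name")
root.text = "Caf\u00e9"

path = os.path.join(tempfile.mkdtemp(), "name.xml")
save_xml(root, path)

with open(path, "rb") as f:  # read back as bytes, not as text
    file_bytes = f.read()
print(file_bytes)
```

The file starts with a declaration naming utf-8, and the accented character is stored as the two bytes 0xC3 0xA9, so any XML parser can decode the file without guessing.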

Turning an XML document into a string

When an XML document is created from a generated class using ToXml, the result is returned as a string. The string is encoded as Unicode (except in C++ non-debug builds); however, the XML document header does not show any encoding (<?xml version="1.0"?>).

The string returned is Unicode, which is the internal character representation for VB6, .Net & Java. As such, if it is written to a file or passed to another application, it should be passed as Unicode. If it has to be converted to a 1 byte per character representation before this, then data will likely be corrupted if complex characters have been used within the document.

If you need to persist an XML document to a file, use ToXmlFile. If you need to pass an XML document to another (non-Unicode) application, you should use ToXmlStream.

There is also a problem that commonly occurs in C++ UNICODE applications when dealing with UTF-8 encoded data. If you load a UTF-8 encoded file into a UNICODE application, the temptation is to store it in a UNICODE string (WCHAR*), and the conversion to Unicode is often implicit (part of some string/bstr class). However, these conversions typically assume the source string is in the local code page, which is rarely UTF-8 and more frequently ANSI. So when the data is converted to Unicode, the conversion function does not treat the data as UTF-8, and so does not decode it correctly. The result is a Unicode string that no longer represents the source.
In these circumstances, it is better either to treat the data as binary or to use the appropriate conversion method (UTF-8 to Unicode).
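
The same pitfall can be reproduced outside C++. The Python sketch below is an analogue, not the C++ conversion itself: code page 1252 stands in for the "local code page", and both function names are invented for the example.

```python
# Bytes as they would arrive from a UTF-8 encoded file ("Zoe" with a diaeresis).
utf8_bytes = "Zo\u00eb".encode("utf-8")


def load_assuming_code_page(data: bytes, code_page: str = "cp1252") -> str:
    """Mimic an implicit conversion that assumes the local code page."""
    return data.decode(code_page)


def load_as_utf8(data: bytes) -> str:
    """Use the conversion that matches how the data was encoded."""
    return data.decode("utf-8")


wrong = load_assuming_code_page(utf8_bytes)
right = load_as_utf8(utf8_bytes)

print(len(wrong), wrong)  # 4 characters: the 2-byte sequence became 2 characters
print(len(right), right)  # 3 characters: the original text
```

The wrongly converted string is a perfectly valid Unicode string, which is why this kind of corruption often goes unnoticed until the text is displayed.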

Passing an XML document to an ASCII or ANSI application

It is common to want to pass the XML document you have created to a non-Unicode application. If you need to do this, you may look first at ToXml. This will provide you with a Unicode string; however, converting this to an ASCII or ANSI string may corrupt complex characters (you lose information going from 2 bytes to 1 byte per character). You could take the string returned from ToXml and apply your own UTF-8 encoding; however, the encoding attribute in the header (<?xml version="1.0" encoding="UTF-8"?>) would not be present, and the XML parser decoding the document may misinterpret it.
The better solution is to use the ToXmlStream method. This allows you to specify an encoding, and returns a stream of bytes (an array of bytes in VB). This byte stream is a representation of the XML document in the given encoding, containing the correct encoding attribute in the header (<?xml version="1.0" encoding="UTF-8"?>).
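
Why the header matters can be shown with Python's standard XML parser. In this sketch, to_xml_stream is a hypothetical stand-in written for the example, not the real ToXmlStream, and ISO-8859-1 is used as the 1 byte per character encoding because a parser that finds no encoding attribute assumes UTF-8.

```python
import io
import xml.etree.ElementTree as ET


def to_xml_stream(root: ET.Element, encoding: str) -> bytes:
    """Hypothetical stand-in for ToXmlStream: return the document as bytes
    in `encoding`, with a declaration that names that encoding."""
    buffer = io.BytesIO()
    ET.ElementTree(root).write(buffer, encoding=encoding, xml_declaration=True)
    return buffer.getvalue()


root = ET.Element("name")
root.text = "Caf\u00e9"

# Do-it-yourself route: a header with no encoding, encoded by hand.
homemade = '<?xml version="1.0"?><name>Caf\u00e9</name>'.encode("iso-8859-1")
try:
    ET.fromstring(homemade)
    homemade_ok = True
except ET.ParseError:
    homemade_ok = False  # parser assumed UTF-8 and hit an invalid byte

# Stream route: the bytes and the declaration agree.
stream = to_xml_stream(root, "iso-8859-1")
round_trip = ET.fromstring(stream).text

print(homemade_ok)  # False
print(round_trip)   # the original text, accent intact
```

The hand-encoded bytes are rejected, while the stream that carries a matching encoding attribute parses back to the original text.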

Reference

UTF-8 RFC

 

Description Value
Article Created 6/2/2006