Multi-Language Support

Before looking at multi-language issues, we must first understand how characters are encoded. Characters can be encoded in various ways; we will look at the main ones: Unicode, code pages, and multibyte encoding.

UNICODE

On the Windows platform, Unicode text is encoded as UTF-16, in which each character in the basic range uses 2 bytes. This gives 0x10000 (65,536) possible values, 0x0000 to 0xFFFF; characters outside that range are stored as a pair of 2-byte units. The Unicode standard dictates the code used to represent each character (see https://www.unicode.org/charts/).
Typically, when a Unicode document is written to a file, the file is prefixed with the 2 bytes 0xFF 0xFE (the byte order mark). When the file is read, these bytes indicate that the file is Unicode, and they are not treated as part of the content.
The Windows operating system uses Unicode internally to represent characters.
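
As a rough illustration of the byte layout described above, the following Python sketch encodes a short string as little-endian UTF-16 and adds the 0xFF 0xFE prefix by hand. The sample string and the helper name strip_bom are assumptions made for this example only; they are not part of any product API.

```python
import codecs

text = "A缀"  # one ASCII letter (U+0041) and one CJK character (U+7F00)

# UTF-16 little-endian: each of these two characters becomes one 2-byte unit.
payload = text.encode("utf-16-le")

# The 2-byte marker that typically prefixes a little-endian UTF-16 file.
bom = codecs.BOM_UTF16_LE  # b'\xff\xfe'

file_bytes = bom + payload
print(file_bytes.hex(" "))  # ff fe 41 00 00 7f


def strip_bom(data: bytes) -> str:
    """Decode UTF-16LE data, dropping a leading FF FE marker if present."""
    if data.startswith(codecs.BOM_UTF16_LE):
        data = data[len(codecs.BOM_UTF16_LE):]
    return data.decode("utf-16-le")


print(strip_bom(file_bytes) == text)  # True
```

Note that the low byte comes first, so the letter A (0x0041) is stored as 41 00.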

Table 1.1 - Unicode Characters 0x00 - 0xff

(	)	;	[	]	{	}	€	‚	„	†	‡	‰	‹	‘	’	“	”	•	–	—	™	›

Table 1.2 - Unicode characters 0x7f00 - 0x7fff

0x7F00	缀	缁	缂	缃	缄	缅	缆	缇	缈	缉	缊	缋	缌	缍	缎	缏
0x7F10	缐	缑	缒	缓	缔	缕	编	缗	缘	缙	缚	缛	缜	缝	缞	缟
0x7F20	缠	缡	缢	缣	缤	缥	缦	缧	缨	缩	缪	缫	缬	缭	缮	缯
0x7F30	缰	缱	缲	缳	缴	缵	缶	缷	缸	缹	缺	缻	缼	缽	缾	缿
0x7F40	罀	罁	罂	罃	罄	罅	罆	罇	罈	罉	罊	罋	罌	罍	罎	罏
0x7F50	罐	网	罒	罓	罔	罕	罖	罗	罘	罙	罚	罛	罜	罝	罞	罟
0x7F60	罠	罡	罢	罣	罤	罥	罦	罧	罨	罩	罪	罫	罬	罭	置	罯
0x7F70	罰	罱	署	罳	罴	罵	罶	罷	罸	罹	罺	罻	罼	罽	罾	罿
0x7F80	羀	羁	羂	羃	羄	羅	羆	羇	羈	羉	羊	羋	羌	羍	美	羏
0x7F90	羐	羑	羒	羓	羔	羕	羖	羗	羘	羙	羚	羛	羜	羝	羞	羟
0x7FA0	羠	羡	羢	羣	群	羥	羦	羧	羨	義	羪	羫	羬	羭	羮	羯
0x7FB0	羰	羱	羲	羳	羴	羵	羶	羷	羸	羹	羺	羻	羼	羽	羾	羿
0x7FC0	翀	翁	翂	翃	翄	翅	翆	翇	翈	翉	翊	翋	翌	翍	翎	翏
0x7FD0	翐	翑	習	翓	翔	翕	翖	翗	翘	翙	翚	翛	翜	翝	翞	翟
0x7FE0	翠	翡	翢	翣	翤	翥	翦	翧	翨	翩	翪	翫	翬	翭	翮	翯
0x7FF0	翰	翱	翲	翳	翴	翵	翶	翷	翸	翹	翺	翻	翼	翽	翾	翿

Code Pages

On older systems, each character is encoded using a single byte, which provides 256 possible values (0x00 to 0xFF). This proved limiting, as there were not enough values for all the characters that needed to be represented. The solution was to use a code page. The code page specifies which characters the 256 values map to. An application sets its code page, thus determining which 256 characters it has available to use.
Problems arise with this system when converting data between different code pages. It is normal for many of the characters in the source code page to be missing from the destination code page (and vice versa). This corrupts the data, or the interpretation of the data. To avoid this, a system will typically try to use the same code page across all its applications.
Tables 2.1 and 2.2 each show a sample code page.

Table 2.1 - Microsoft Windows Codepage : 932 (Japanese Shift-JIS)

Table 2.2 - Microsoft Windows Codepage : 1252 (Latin I)
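
To make the conversion problem concrete, here is a small Python sketch that reads the same two bytes under both of the code pages named in the table captions. The byte values 0xA5 and 0xB1 are arbitrary picks for illustration, and Python's cp1252 and cp932 codecs are used as stand-ins for the Windows code pages.

```python
raw = bytes([0xA5, 0xB1])  # two arbitrary single-byte values

# The same bytes are different characters depending on the code page.
as_latin = raw.decode("cp1252")    # '¥±'
as_japanese = raw.decode("cp932")  # halfwidth katakana: '･ｱ'

# Converting from one code page to another goes through the characters.
# Characters that do not exist in the destination code page cannot be kept;
# with errors="replace" each one becomes a '?'.
lossy = as_japanese.encode("cp1252", errors="replace")

print(as_latin, as_japanese, lossy)
```

Without errors="replace" the last call raises UnicodeEncodeError instead, which is often the safer behaviour because the loss is not silent.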

MULTIBYTE

Now this is where things get interesting. There are a number of multibyte formats, but we are going to look primarily at UTF-8 (8-bit Unicode Transformation Format). Multibyte means that a variable number of bytes is used to encode a single character.

So why would anyone have developed such an encoding?
Let's look at an example. You develop a new application that uses Unicode internally. Everything is going fine until you need to pass some information via an older system (say, email). This old system does not understand Unicode, and when you attempt to pass your Unicode strings to it, it comes across a null byte in the string (the character A is stored as the bytes 0x41 0x00) and stops reading, thinking it has hit the end of the string.
Multibyte encoding promises a solution. Before passing your string to the older system, you encode your Unicode string as UTF-8. The legacy system can quite happily use the encoded string, even if it cannot display all of it correctly on the screen (extended characters will not appear correctly); all regular characters (0x00-0x7F) are fine. On the other side, the data can be extracted and decoded back into Unicode.
The advantage of UTF-8 over other encodings like BASE64, UUE, and HEX is that the message is still (mostly) readable, with only the extended characters appearing corrupted.
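
The difference between the two encodings can be checked with a few lines of Python. This is only a sketch: the sample string is an assumption chosen so that it contains a 1-byte, a 2-byte, and a 3-byte UTF-8 character.

```python
text = "AÄ†"  # U+0041, U+00C4, U+2020

utf16 = text.encode("utf-16-le")
utf8 = text.encode("utf-8")

# UTF-16 stores 'A' as 41 00, so a reader that treats a zero byte as the
# end of the string stops after the first character.
has_null_utf16 = b"\x00" in utf16
has_null_utf8 = b"\x00" in utf8
print(has_null_utf16, has_null_utf8)  # True False

# In UTF-8 the ASCII character keeps its single-byte value, while the
# extended characters take two or three bytes each.
per_char = {ch: ch.encode("utf-8").hex().upper() for ch in text}
print(per_char)  # {'A': '41', 'Ä': 'C384', '†': 'E280A0'}
```

The hex values for Ä and † agree with the "Hex UTF-8 char code" column of Table 3.1.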

Table 3.1 - Characters 0x80 - 0xff UTF-8 encodings

Char	Decimal char code	Hex char code	Hex Unicode char code	Decimal Unicode char code	Decimal UTF-8 char code	Hex UTF-8 char code	UTF-8 encoded string view using the Windows code page 1252
€	128	80	00C4	196	195.132	C384	Ã„
	129	81	00C5	197	195.133	C385	Ã…
‚	130	82	00C7	199	195.135	C387	Ã‡
ƒ	131	83	00C9	201	195.137	C389	Ã‰
„	132	84	00D1	209	195.145	C391	Ã‘
…	133	85	00D6	214	195.150	C396	Ã–
†	134	86	00DC	220	195.156	C39C	Ãœ
‡	135	87	00E1	225	195.161	C3A1	Ã¡
ˆ	136	88	00E0	224	195.160	C3A0	Ã
‰	137	89	00E2	226	195.162	C3A2	Ã¢
Š	138	8A	00E4	228	195.164	C3A4	Ã¤
‹	139	8B	00E3	227	195.163	C3A3	Ã£
Œ	140	8C	00E5	229	195.165	C3A5	Ã¥
	141	8D	00E7	231	195.167	C3A7	Ã§
Ž	142	8E	00E9	233	195.169	C3A9	Ã©
	143	8F	00E8	232	195.168	C3A8	Ã¨
	144	90	00EA	234	195.170	C3AA	Ãª
‘	145	91	00EB	235	195.171	C3AB	Ã«
’	146	92	00ED	237	195.173	C3AD	Ã
“	147	93	00EC	236	195.172	C3AC	Ã¬
”	148	94	00EE	238	195.174	C3AE	Ã®
•	149	95	00EF	239	195.175	C3AF	Ã¯
–	150	96	00F1	241	195.177	C3B1	Ã±
—	151	97	00F3	243	195.179	C3B3	Ã³
˜	152	98	00F2	242	195.178	C3B2	Ã²
™	153	99	00F4	244	195.180	C3B4	Ã´
š	154	9A	00F6	246	195.182	C3B6	Ã¶
›	155	9B	00F5	245	195.181	C3B5	Ãµ
œ	156	9C	00FA	250	195.186	C3BA	Ãº
	157	9D	00F9	249	195.185	C3B9	Ã¹
ž	158	9E	00FB	251	195.187	C3BB	Ã»
Ÿ	159	9F	00FC	252	195.188	C3BC	Ã¼
	160	A0	2020	8224	226.128.160	E280A0	â€
¡	161	A1	00B0	176	194.176	C2B0	Â°
¢	162	A2	00A2	162	194.162	C2A2	Â¢
£	163	A3	00A3	163	194.163	C2A3	Â£
¤	164	A4	00A7	167	194.167	C2A7	Â§
¥	165	A5	2022	8226	226.128.162	E280A2	â€¢
¦	166	A6	00B6	182	194.182	C2B6	Â¶
§	167	A7	00DF	223	195.159	C39F	ÃŸ
¨	168	A8	00AE	174	194.174	C2AE	Â®
©	169	A9	00A9	169	194.169	C2A9	Â©
ª	170	AA	2122	8482	226.132.162	E284A2	â„¢
«	171	AB	00B4	180	194.180	C2B4	Â´
¬	172	AC	00A8	168	194.168	C2A8	Â¨
	173	AD	2260	8800	226.137.160	E289A0	â‰
®	174	AE	00C6	198	195.134	C386	Ã†
¯	175	AF	00D8	216	195.152	C398	Ã˜
°	176	B0	221E	8734	226.136.158	E2889E	âˆž
±	177	B1	00B1	177	194.177	C2B1	Â±
²	178	B2	2264	8804	226.137.164	E289A4	â‰¤
³	179	B3	2265	8805	226.137.165	E289A5	â‰¥
´	180	B4	00A5	165	194.165	C2A5	Â¥
µ	181	B5	00B5	181	194.181	C2B5	Âµ
¶	182	B6	2202	8706	226.136.130	E28882	âˆ‚
·	183	B7	2211	8721	226.136.145	E28891	âˆ‘
¸	184	B8	220F	8719	226.136.143	E2888F	âˆ
¹	185	B9	03C0	960	207.128	CF80	Ï€
º	186	BA	222B	8747	226.136.171	E288AB	âˆ«
»	187	BB	00AA	170	194.170	C2AA	Âª
¼	188	BC	00BA	186	194.186	C2BA	Âº
½	189	BD	03A9	937	206.169	CEA9	Î©
¾	190	BE	00E6	230	195.166	C3A6	Ã¦
¿	191	BF	00F8	248	195.184	C3B8	Ã¸
À	192	C0	00BF	191	194.191	C2BF	Â¿
Á	193	C1	00A1	161	194.161	C2A1	Â¡
Â	194	C2	00AC	172	194.172	C2AC	Â¬
Ã	195	C3	221A	8730	226.136.154	E2889A	âˆš
Ä	196	C4	0192	402	198.146	C692	Æ’
Å	197	C5	2248	8776	226.137.136	E28988	â‰ˆ
Æ	198	C6	2206	8710	226.136.134	E28886	âˆ†
Ç	199	C7	00AB	171	194.171	C2AB	Â«
È	200	C8	00BB	187	194.187	C2BB	Â»
É	201	C9	2026	8230	226.128.166	E280A6	â€¦
Ê	202	CA	00A0	160	194.160	C2A0	Â
Ë	203	CB	00C0	192	195.128	C380	Ã€
Ì	204	CC	00C3	195	195.131	C383	Ãƒ
Í	205	CD	00D5	213	195.149	C395	Ã•
Î	206	CE	0152	338	197.146	C592	Å’
Ï	207	CF	0153	339	197.147	C593	Å“
Ð	208	D0	2013	8211	226.128.147	E28093	â€“
Ñ	209	D1	2014	8212	226.128.148	E28094	â€”
Ò	210	D2	201C	8220	226.128.156	E2809C	â€œ
Ó	211	D3	201D	8221	226.128.157	E2809D	â€
Ô	212	D4	2018	8216	226.128.152	E28098	â€˜
Õ	213	D5	2019	8217	226.128.153	E28099	â€™
Ö	214	D6	00F7	247	195.183	C3B7	Ã·
×	215	D7	25CA	9674	226.151.138	E2978A	â—Š
Ø	216	D8	00FF	255	195.191	C3BF	Ã¿
Ù	217	D9	0178	376	197.184	C5B8	Å¸
Ú	218	DA	2044	8260	226.129.132	E28184	â„
Û	219	DB	20AC	8364	226.130.172	E282AC	â‚¬
Ü	220	DC	2039	8249	226.128.185	E280B9	â€¹
Ý	221	DD	203A	8250	226.128.186	E280BA	â€º
Þ	222	DE	FB01	64257	239.172.129	EFAC81	ï¬
ß	223	DF	FB02	64258	239.172.130	EFAC82	ï¬‚
à	224	E0	2021	8225	226.128.161	E280A1	â€¡
á	225	E1	00B7	183	194.183	C2B7	Â·
â	226	E2	201A	8218	226.128.154	E2809A	â€š
ã	227	E3	201E	8222	226.128.158	E2809E	â€ž
ä	228	E4	2030	8240	226.128.176	E280B0	â€°
å	229	E5	00C2	194	195.130	C382	Ã‚
æ	230	E6	00CA	202	195.138	C38A	ÃŠ
ç	231	E7	00C1	193	195.129	C381	Ã
è	232	E8	00CB	203	195.139	C38B	Ã‹
é	233	E9	00C8	200	195.136	C388	Ãˆ
ê	234	EA	00CD	205	195.141	C38D	Ã
ë	235	EB	00CE	206	195.142	C38E	ÃŽ
ì	236	EC	00CF	207	195.143	C38F	Ã
í	237	ED	00CC	204	195.140	C38C	ÃŒ
î	238	EE	00D3	211	195.147	C393	Ã“
ï	239	EF	00D4	212	195.148	C394	Ã”
ð	240	F0	F8FF	63743	239.163.191	EFA3BF	ï£¿
ñ	241	F1	00D2	210	195.146	C392	Ã’
ò	242	F2	00DA	218	195.154	C39A	Ãš
ó	243	F3	00DB	219	195.155	C39B	Ã›
ô	244	F4	00D9	217	195.153	C399	Ã™
õ	245	F5	0131	305	196.177	C4B1	Ä±
ö	246	F6	02C6	710	203.134	CB86	Ë†
÷	247	F7	02DC	732	203.156	CB9C	Ëœ
ø	248	F8	00AF	175	194.175	C2AF	Â¯
ù	249	F9	02D8	728	203.152	CB98	Ë˜
ú	250	FA	02D9	729	203.153	CB99	Ë™
û	251	FB	02DA	730	203.154	CB9A	Ëš
ü	252	FC	00B8	184	194.184	C2B8	Â¸
ý	253	FD	02DD	733	203.157	CB9D	Ë
þ	254	FE	02DB	731	203.155	CB9B	Ë›
ÿ	255	FF	02C7	711	203.135	CB87	Ë‡

XML

XML documents can be encoded using a number of different encodings. The encoding is indicated by the encoding attribute in the XML declaration at the top of the document (e.g. <?xml version="1.0" encoding="UTF-8"?>).

Writing an XML document to file

When an XML document is persisted as a file, it is safer to think of it as a stream of bytes as opposed to a stream of characters. When an XML document is serialized to a file, an encoding is applied to it, and the resulting file is correctly encoded for that encoding.

If a Unicode encoding is applied, the resulting file is prefixed with the Unicode header 0xFF 0xFE and is encoded with 2 bytes per character.
If a UTF-8 encoding is applied, the resulting file contains a variable number of bytes per character. If this file is viewed using a tool incapable of decoding UTF-8, you may see a number of strange characters. If the file is viewed using a UTF-8 compliant application (e.g. Internet Explorer, Notepad on Windows 2000 onwards, Visual Studio .NET), the XML document will appear with the correct characters. If characters still appear corrupted or misrepresented, note that some fonts do not contain the full Unicode set.
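
The two cases above can be compared directly by encoding the same text both ways and counting bytes. The following Python sketch builds the byte content by hand rather than using any XML library; the sample element is an assumption made for illustration.

```python
import codecs

xml_text = "<name>Zoë 缀</name>"  # 18 characters, two of them non-ASCII

# Unicode (UTF-16LE) file content: the FF FE header, then 2 bytes per character.
utf16_file = codecs.BOM_UTF16_LE + xml_text.encode("utf-16-le")

# UTF-8 file content: 1 byte for each ASCII character, 2 for 'ë', 3 for '缀'.
utf8_file = xml_text.encode("utf-8")

print(len(xml_text), len(utf16_file), len(utf8_file))  # 18 38 21
```

For mostly ASCII markup UTF-8 is the smaller of the two, but the byte count no longer tells you the character count.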

Turning an XML document into a string

When an XML document is created from a generated class using ToXml (ToXml returns a string), the string returned is encoded as Unicode (except in C++ non-debug builds); however, the XML document header does not show any encoding (<?xml version="1.0"?>).

The string returned is Unicode because Unicode is the internal character representation for VB6, .NET, and Java. As such, if it is written to a file or passed to another application, it should be passed as Unicode. If it has to be converted to a 1-byte-per-character representation first, data will likely be corrupted if complex characters have been used within the document.

If you need to persist an XML document to a file, use ToXmlFile. If you need to pass an XML document to another (non-Unicode) application, use ToXmlStream.

There is also a problem that commonly occurs in C++ Unicode applications when dealing with UTF-8 encoded data. If you load a UTF-8 encoded file into a Unicode application, the temptation is to store it in a Unicode string (WCHAR*), and the conversion to Unicode is often implicit (part of some string or BSTR class). However, these conversions typically assume that the source string is in the local code page, which is rarely UTF-8 and more frequently ANSI. So when the data is converted to Unicode, the conversion function does not treat the data as UTF-8 and does not decode it correctly. The result is a Unicode string that no longer represents the source.
In these circumstances, it is better either to treat the data as binary or to use the appropriate conversion method (UTF-8 to Unicode).
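
The effect of converting with the wrong assumption is easy to reproduce. The sketch below uses Python rather than C++, and it assumes the local code page is Windows-1252; both choices are made only to keep the example short.

```python
utf8_bytes = "Zoë".encode("utf-8")  # b'Zo\xc3\xab'

# Wrong: the bytes are assumed to be in the local code page.
wrong = utf8_bytes.decode("cp1252")
print(wrong)  # ZoÃ«

# Right: the bytes are decoded as what they actually are.
right = utf8_bytes.decode("utf-8")
print(right)  # Zoë

# The wrong string has one character too many, because each byte of the
# 2-byte sequence C3 AB was turned into a character of its own.
print(len(wrong), len(right))  # 4 3
```

In a C++ application the equivalent fix is to name UTF-8 explicitly as the source encoding when calling the conversion routine, instead of relying on the default code page.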

Passing an XML document to an ASCII or ANSI application

It is common to want to pass the XML document you have created to a non-Unicode application. If you need to do this, you may look first at ToXml. This provides you with a Unicode string; however, converting this to an ASCII or ANSI string may corrupt complex characters (you lose information going from 2 bytes to 1 byte per character). You could take the string returned from ToXml and apply your own UTF-8 encoding; however, the encoding attribute in the header (<?xml version="1.0" encoding="UTF-8"?>) would not be present, and the XML parser decoding the document may misinterpret it.
The better solution is to use the ToXmlStream method. This allows you to specify an encoding and returns a stream of bytes (an array of bytes in VB). This byte stream is a representation of the XML document in the given encoding, containing the correct encoding attribute in the header (<?xml version="1.0" encoding="UTF-8"?>).
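
The same idea, bytes plus a matching encoding attribute, can be sketched with Python's standard xml.etree.ElementTree module. This is an analogy only: it does not use ToXmlStream or any generated class, and the element name and text are assumptions for the example.

```python
import xml.etree.ElementTree as ET

root = ET.Element("name")
root.text = "Zoë"

# Serialize to bytes in a named encoding. The XML declaration is written
# with a matching encoding attribute, so a parser knows how to decode it.
data = ET.tostring(root, encoding="utf-8", xml_declaration=True)

print(type(data).__name__)  # bytes
print(data)

# A parser given these bytes reads the declaration and recovers the text.
round_trip = ET.fromstring(data)
print(round_trip.text == "Zoë")  # True
```

Because the result is a byte array and not a string, it can be handed to a non-Unicode application without any further character conversion.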

Description	Value
Article Created	6/2/2006
