Before we look at multi-language issues, we must first understand how characters are encoded. There are various ways to encode characters; we will look at the main ones: Unicode, code pages and multibyte encoding.
UNICODE
On the Windows platform, Unicode is encoded as UTF-16, which uses 2 bytes for each character in the range covered here (characters beyond 0xFFFF take 4 bytes). Two bytes give 65,536 possible values, 0x0000 to 0xFFFF. The Unicode standard dictates the code used to represent each character (see https://www.unicode.org/charts/).
Typically, when a Unicode document is written to a file, the file is prefixed with the 2 bytes 0xFF 0xFE (the byte order mark). When the file is read, these bytes indicate that the file is Unicode and are not treated as part of the text.
The Windows operating system uses Unicode internally to represent characters.
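The 2-byte layout and the 0xFF 0xFE prefix described above can be checked with a short Python sketch. It uses only the standard library; the sample text is an arbitrary choice, and little-endian byte order is assumed because that is what the 0xFF 0xFE prefix indicates.

```python
import codecs

text = "A\u7f16"  # 'A' followed by the character U+7F16

# Little-endian UTF-16: each of these two characters occupies 2 bytes.
body = text.encode("utf-16-le")

# Prefix the 2 bytes 0xFF 0xFE, as a Unicode text file would have.
file_bytes = codecs.BOM_UTF16_LE + body

# Show the bytes as they would appear in the file.
print(file_bytes.hex(" "))

# A reader that understands the prefix uses it to pick the byte order
# and does not return it as part of the text.
decoded = file_bytes.decode("utf-16")
print(decoded == text)
```

Note that the low byte comes first, so 'A' (0x0041) is stored as 0x41 0x00.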
Table 1.1 - Unicode Characters 0x00 - 0xff
(each row starts at the code shown; add the column offset to get the character code)
   | +0 | +1 | +2 | +3 | +4 | +5 | +6 | +7 | +8 | +9 | +A | +B | +C | +D | +E | +F
00 |   | . | . | . | . | . | . | . | . |   |   |   |   |   | . | .
10 | . | . | . | . | . | . | . | . | . | . | . | . | . | . | . | .
20 |   | ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | /
30 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ?
40 | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O
50 | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _
60 | ` | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o
70 | p | q | r | s | t | u | v | w | x | y | z | { | | | } | ~ |  
80 | € |   | ‚ | ƒ | „ |   | † | ‡ | ˆ | ‰ | Š | ‹ | Œ |   | Ž |  
90 |   | ‘ | ’ | “ | ” | • | – | — | ˜ | ™ | š | › | œ |   | ž | Ÿ
A0 |   | ¡ | ¢ | £ | ¤ | ¥ | ¦ | § | ¨ | © | ª | « | ¬ |   | ® | ¯
B0 | ° | ± | ² | ³ | ´ | µ | ¶ | · | ¸ | ¹ | º | » | ¼ | ½ | ¾ | ¿
C0 | À | Á | Â | Ã | Ä | Å | Æ | Ç | È | É | Ê | Ë | Ì | Í | Î | Ï
D0 | Ð | Ñ | Ò | Ó | Ô | Õ | Ö | × | Ø | Ù | Ú | Û | Ü | Ý | Þ | ß
E0 | à | á | â | ã | ä | å | æ | ç | è | é | ê | ë | ì | í | î | ï
F0 | ð | ñ | ò | ó | ô | õ | ö | ÷ | ø | ù | ú | û | ü | ý | þ | ÿ

Table 1.2 - Unicode characters 0x7f00 - 0x7fff
(each row starts at the code shown; add the column offset to get the character code)
     | +0 | +1 | +2 | +3 | +4 | +5 | +6 | +7 | +8 | +9 | +A | +B | +C | +D | +E | +F
7F00 | 缀 | 缁 | 缂 | 缃 | 缄 | 缅 | 缆 | 缇 | 缈 | 缉 | 缊 | 缋 | 缌 | 缍 | 缎 | 缏
7F10 | 缐 | 缑 | 缒 | 缓 | 缔 | 缕 | 编 | 缗 | 缘 | 缙 | 缚 | 缛 | 缜 | 缝 | 缞 | 缟
7F20 | 缠 | 缡 | 缢 | 缣 | 缤 | 缥 | 缦 | 缧 | 缨 | 缩 | 缪 | 缫 | 缬 | 缭 | 缮 | 缯
7F30 | 缰 | 缱 | 缲 | 缳 | 缴 | 缵 | 缶 | 缷 | 缸 | 缹 | 缺 | 缻 | 缼 | 缽 | 缾 | 缿
7F40 | 罀 | 罁 | 罂 | 罃 | 罄 | 罅 | 罆 | 罇 | 罈 | 罉 | 罊 | 罋 | 罌 | 罍 | 罎 | 罏
7F50 | 罐 | 网 | 罒 | 罓 | 罔 | 罕 | 罖 | 罗 | 罘 | 罙 | 罚 | 罛 | 罜 | 罝 | 罞 | 罟
7F60 | 罠 | 罡 | 罢 | 罣 | 罤 | 罥 | 罦 | 罧 | 罨 | 罩 | 罪 | 罫 | 罬 | 罭 | 置 | 罯
7F70 | 罰 | 罱 | 署 | 罳 | 罴 | 罵 | 罶 | 罷 | 罸 | 罹 | 罺 | 罻 | 罼 | 罽 | 罾 | 罿
7F80 | 羀 | 羁 | 羂 | 羃 | 羄 | 羅 | 羆 | 羇 | 羈 | 羉 | 羊 | 羋 | 羌 | 羍 | 美 | 羏
7F90 | 羐 | 羑 | 羒 | 羓 | 羔 | 羕 | 羖 | 羗 | 羘 | 羙 | 羚 | 羛 | 羜 | 羝 | 羞 | 羟
7FA0 | 羠 | 羡 | 羢 | 羣 | 群 | 羥 | 羦 | 羧 | 羨 | 義 | 羪 | 羫 | 羬 | 羭 | 羮 | 羯
7FB0 | 羰 | 羱 | 羲 | 羳 | 羴 | 羵 | 羶 | 羷 | 羸 | 羹 | 羺 | 羻 | 羼 | 羽 | 羾 | 羿
7FC0 | 翀 | 翁 | 翂 | 翃 | 翄 | 翅 | 翆 | 翇 | 翈 | 翉 | 翊 | 翋 | 翌 | 翍 | 翎 | 翏
7FD0 | 翐 | 翑 | 習 | 翓 | 翔 | 翕 | 翖 | 翗 | 翘 | 翙 | 翚 | 翛 | 翜 | 翝 | 翞 | 翟
7FE0 | 翠 | 翡 | 翢 | 翣 | 翤 | 翥 | 翦 | 翧 | 翨 | 翩 | 翪 | 翫 | 翬 | 翭 | 翮 | 翯
7FF0 | 翰 | 翱 | 翲 | 翳 | 翴 | 翵 | 翶 | 翷 | 翸 | 翹 | 翺 | 翻 | 翼 | 翽 | 翾 | 翿

Code Pages
On older systems, each character is encoded using a single byte, which provides 256 possible characters (0x00 to 0xFF). This proved limiting, as there were not enough encodings for all the characters that needed to be represented. The solution was to use a code page. The code page specifies which characters the 256 encodings map to. An application sets its code page, thus determining which 256 characters it has available to use.
Problems arise with this system when converting data between different code pages. It is normal for many of the characters from the source code page to be missing from the destination code page (and vice versa). This corrupts the data, or the interpretation of the data. To avoid this, a system will typically try to use the same code page across all of its applications.
Tables 2.1 and 2.2 each show a sample code page.
Table 2.1 - Microsoft Windows Codepage : 932 (Japanese Shift-JIS)
Table 2.2 - Microsoft Windows Codepage : 1252 (Latin I)
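The code page problem described above can be illustrated with a minimal Python sketch. It assumes Python's `cp1252` and `cp932` codecs as stand-ins for the two Windows code pages named in Tables 2.1 and 2.2; the byte value and the sample character are arbitrary choices.

```python
raw = b"\xb1"  # one byte, with no encoding information attached

# The same byte maps to a different character under each code page.
as_latin = raw.decode("cp1252")     # Windows code page 1252 (Latin I)
as_japanese = raw.decode("cp932")   # Windows code page 932 (Shift-JIS)
print(repr(as_latin), repr(as_japanese))

# A character that exists in code page 932 but not in code page 1252.
kanji = "\u7f8e"
in_932 = kanji.encode("cp932")

try:
    kanji.encode("cp1252")
    lossless = True
except UnicodeEncodeError:
    # The destination code page has no slot for this character.
    lossless = False

# Forcing the conversion substitutes a placeholder, so the data is lost.
forced = kanji.encode("cp1252", errors="replace")
print(lossless, forced)
```

The forced conversion produces a `?` in place of the original character, which is the kind of corruption the text warns about.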
MULTIBYTE
Now this is where things get interesting. There are a number of multibyte formats, but we are going to look primarily at UTF-8 (Unicode Transformation Format, 8-bit). Multibyte means that a variable number of bytes is used to encode a single character.
So why would anyone have developed such an encoding?
Well, let's look at an example. You develop a new application that uses Unicode internally. Everything is going fine until you need to pass some information via an older system (say, email). This old system does not understand Unicode, and when you attempt to pass your Unicode strings to it, it comes across a null byte in the string (the A character is stored as the bytes 0x41 0x00) and stops reading, thinking it has hit the end of the string.
Multibyte encoding promises a solution. Before passing your string to the older system, you encode your Unicode string as UTF-8. The legacy system can quite happily use the encoded string, even if it can't display all of it correctly on screen: extended characters will not appear correctly, but all regular characters (0x00-0x7F) are fine. On the other side, the data can be extracted and decoded back into Unicode.
The advantage of UTF-8 over other encodings such as Base64, UUE and hex is that the message is still (mostly) readable, with only the extended characters corrupted.
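The null-byte problem from the example above can be sketched in Python. The helper `c_string_view` is hypothetical: it only imitates a legacy reader that stops at the first null byte, and the sample text is an arbitrary choice.

```python
text = "A caf\u00e9"

utf16 = text.encode("utf-16-le")  # 2 bytes per character, as held in memory on Windows
utf8 = text.encode("utf-8")       # 1 byte for 0x00-0x7F, more for other characters


def c_string_view(data: bytes) -> bytes:
    """Return what a reader of null-terminated strings would see."""
    return data.split(b"\x00", 1)[0]


# The UTF-16 form is cut off after the first character.
legacy_utf16 = c_string_view(utf16)

# The UTF-8 form contains no null bytes, so it survives intact.
legacy_utf8 = c_string_view(utf8)

print(legacy_utf16, legacy_utf8)

# On the other side the bytes decode back to the original text.
restored = legacy_utf8.decode("utf-8")
```

The regular characters keep their single-byte values in the UTF-8 form, which is why the message stays mostly readable on the older system.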
Table 3.1 - Characters 0x80 - 0xff UTF-8 encodings
Char | Decimal char code | Hex char code | Hex Unicode char code | Decimal Unicode char code | Decimal UTF-8 char code | Hex UTF-8 char code | UTF-8 encoded string view using the Windows code page 1252
€ | 128 | 80 | 00C4 | 196 | 195.132 | C384 | Ã„
  | 129 | 81 | 00C5 | 197 | 195.133 | C385 | Ã…
‚ | 130 | 82 | 00C7 | 199 | 195.135 | C387 | Ã‡
ƒ | 131 | 83 | 00C9 | 201 | 195.137 | C389 | Ã‰
„ | 132 | 84 | 00D1 | 209 | 195.145 | C391 | Ã‘
… | 133 | 85 | 00D6 | 214 | 195.150 | C396 | Ã–
† | 134 | 86 | 00DC | 220 | 195.156 | C39C | Ãœ
‡ | 135 | 87 | 00E1 | 225 | 195.161 | C3A1 | Ã¡
ˆ | 136 | 88 | 00E0 | 224 | 195.160 | C3A0 | Ã
‰ | 137 | 89 | 00E2 | 226 | 195.162 | C3A2 | Ã¢
Š | 138 | 8A | 00E4 | 228 | 195.164 | C3A4 | Ã¤
‹ | 139 | 8B | 00E3 | 227 | 195.163 | C3A3 | Ã£
Œ | 140 | 8C | 00E5 | 229 | 195.165 | C3A5 | Ã¥
  | 141 | 8D | 00E7 | 231 | 195.167 | C3A7 | Ã§
Ž | 142 | 8E | 00E9 | 233 | 195.169 | C3A9 | Ã©
  | 143 | 8F | 00E8 | 232 | 195.168 | C3A8 | Ã¨
  | 144 | 90 | 00EA | 234 | 195.170 | C3AA | Ãª
‘ | 145 | 91 | 00EB | 235 | 195.171 | C3AB | Ã«
’ | 146 | 92 | 00ED | 237 | 195.173 | C3AD | Ã
“ | 147 | 93 | 00EC | 236 | 195.172 | C3AC | Ã¬
” | 148 | 94 | 00EE | 238 | 195.174 | C3AE | Ã®
• | 149 | 95 | 00EF | 239 | 195.175 | C3AF | Ã¯
– | 150 | 96 | 00F1 | 241 | 195.177 | C3B1 | Ã±
— | 151 | 97 | 00F3 | 243 | 195.179 | C3B3 | Ã³
˜ | 152 | 98 | 00F2 | 242 | 195.178 | C3B2 | Ã²
™ | 153 | 99 | 00F4 | 244 | 195.180 | C3B4 | Ã´
š | 154 | 9A | 00F6 | 246 | 195.182 | C3B6 | Ã¶
› | 155 | 9B | 00F5 | 245 | 195.181 | C3B5 | Ãµ
œ | 156 | 9C | 00FA | 250 | 195.186 | C3BA | Ãº
  | 157 | 9D | 00F9 | 249 | 195.185 | C3B9 | Ã¹
ž | 158 | 9E | 00FB | 251 | 195.187 | C3BB | Ã»
Ÿ | 159 | 9F | 00FC | 252 | 195.188 | C3BC | Ã¼
  | 160 | A0 | 2020 | 8224 | 226.128.160 | E280A0 | â€
¡ | 161 | A1 | 00B0 | 176 | 194.176 | C2B0 | Â°
¢ | 162 | A2 | 00A2 | 162 | 194.162 | C2A2 | Â¢
£ | 163 | A3 | 00A3 | 163 | 194.163 | C2A3 | Â£
¤ | 164 | A4 | 00A7 | 167 | 194.167 | C2A7 | Â§
¥ | 165 | A5 | 2022 | 8226 | 226.128.162 | E280A2 | â€¢
¦ | 166 | A6 | 00B6 | 182 | 194.182 | C2B6 | Â¶
§ | 167 | A7 | 00DF | 223 | 195.159 | C39F | ÃŸ
¨ | 168 | A8 | 00AE | 174 | 194.174 | C2AE | Â®
© | 169 | A9 | 00A9 | 169 | 194.169 | C2A9 | Â©
ª | 170 | AA | 2122 | 8482 | 226.132.162 | E284A2 | â„¢
« | 171 | AB | 00B4 | 180 | 194.180 | C2B4 | Â´
¬ | 172 | AC | 00A8 | 168 | 194.168 | C2A8 | Â¨
  | 173 | AD | 2260 | 8800 | 226.137.160 | E289A0 | â‰
® | 174 | AE | 00C6 | 198 | 195.134 | C386 | Ã†
¯ | 175 | AF | 00D8 | 216 | 195.152 | C398 | Ã˜
° | 176 | B0 | 221E | 8734 | 226.136.158 | E2889E | âˆž
± | 177 | B1 | 00B1 | 177 | 194.177 | C2B1 | Â±
² | 178 | B2 | 2264 | 8804 | 226.137.164 | E289A4 | â‰¤
³ | 179 | B3 | 2265 | 8805 | 226.137.165 | E289A5 | â‰¥
´ | 180 | B4 | 00A5 | 165 | 194.165 | C2A5 | Â¥
µ | 181 | B5 | 00B5 | 181 | 194.181 | C2B5 | Âµ
¶ | 182 | B6 | 2202 | 8706 | 226.136.130 | E28882 | âˆ‚
· | 183 | B7 | 2211 | 8721 | 226.136.145 | E28891 | âˆ‘
¸ | 184 | B8 | 220F | 8719 | 226.136.143 | E2888F | âˆ
¹ | 185 | B9 | 03C0 | 960 | 207.128 | CF80 | Ï€
º | 186 | BA | 222B | 8747 | 226.136.171 | E288AB | âˆ«
» | 187 | BB | 00AA | 170 | 194.170 | C2AA | Âª
¼ | 188 | BC | 00BA | 186 | 194.186 | C2BA | Âº
½ | 189 | BD | 03A9 | 937 | 206.169 | CEA9 | Î©
¾ | 190 | BE | 00E6 | 230 | 195.166 | C3A6 | Ã¦
¿ | 191 | BF | 00F8 | 248 | 195.184 | C3B8 | Ã¸
À | 192 | C0 | 00BF | 191 | 194.191 | C2BF | Â¿
Á | 193 | C1 | 00A1 | 161 | 194.161 | C2A1 | Â¡
Â | 194 | C2 | 00AC | 172 | 194.172 | C2AC | Â¬
Ã | 195 | C3 | 221A | 8730 | 226.136.154 | E2889A | âˆš
Ä | 196 | C4 | 0192 | 402 | 198.146 | C692 | Æ’
Å | 197 | C5 | 2248 | 8776 | 226.137.136 | E28988 | â‰ˆ
Æ | 198 | C6 | 2206 | 8710 | 226.136.134 | E28886 | âˆ†
Ç | 199 | C7 | 00AB | 171 | 194.171 | C2AB | Â«
È | 200 | C8 | 00BB | 187 | 194.187 | C2BB | Â»
É | 201 | C9 | 2026 | 8230 | 226.128.166 | E280A6 | â€¦
Ê | 202 | CA | 00A0 | 160 | 194.160 | C2A0 | Â
Ë | 203 | CB | 00C0 | 192 | 195.128 | C380 | Ã€
Ì | 204 | CC | 00C3 | 195 | 195.131 | C383 | Ãƒ
Í | 205 | CD | 00D5 | 213 | 195.149 | C395 | Ã•
Î | 206 | CE | 0152 | 338 | 197.146 | C592 | Å’
Ï | 207 | CF | 0153 | 339 | 197.147 | C593 | Å“
Ð | 208 | D0 | 2013 | 8211 | 226.128.147 | E28093 | â€“
Ñ | 209 | D1 | 2014 | 8212 | 226.128.148 | E28094 | â€”
Ò | 210 | D2 | 201C | 8220 | 226.128.156 | E2809C | â€œ
Ó | 211 | D3 | 201D | 8221 | 226.128.157 | E2809D | â€
Ô | 212 | D4 | 2018 | 8216 | 226.128.152 | E28098 | â€˜
Õ | 213 | D5 | 2019 | 8217 | 226.128.153 | E28099 | â€™
Ö | 214 | D6 | 00F7 | 247 | 195.183 | C3B7 | Ã·
× | 215 | D7 | 25CA | 9674 | 226.151.138 | E2978A | â—Š
Ø | 216 | D8 | 00FF | 255 | 195.191 | C3BF | Ã¿
Ù | 217 | D9 | 0178 | 376 | 197.184 | C5B8 | Å¸
Ú | 218 | DA | 2044 | 8260 | 226.129.132 | E28184 | â„
Û | 219 | DB | 20AC | 8364 | 226.130.172 | E282AC | â‚¬
Ü | 220 | DC | 2039 | 8249 | 226.128.185 | E280B9 | â€¹
Ý | 221 | DD | 203A | 8250 | 226.128.186 | E280BA | â€º
Þ | 222 | DE | FB01 | 64257 | 239.172.129 | EFAC81 | ï¬
ß | 223 | DF | FB02 | 64258 | 239.172.130 | EFAC82 | ï¬‚
à | 224 | E0 | 2021 | 8225 | 226.128.161 | E280A1 | â€¡
á | 225 | E1 | 00B7 | 183 | 194.183 | C2B7 | Â·
â | 226 | E2 | 201A | 8218 | 226.128.154 | E2809A | â€š
ã | 227 | E3 | 201E | 8222 | 226.128.158 | E2809E | â€ž
ä | 228 | E4 | 2030 | 8240 | 226.128.176 | E280B0 | â€°
å | 229 | E5 | 00C2 | 194 | 195.130 | C382 | Ã‚
æ | 230 | E6 | 00CA | 202 | 195.138 | C38A | ÃŠ
ç | 231 | E7 | 00C1 | 193 | 195.129 | C381 | Ã
è | 232 | E8 | 00CB | 203 | 195.139 | C38B | Ã‹
é | 233 | E9 | 00C8 | 200 | 195.136 | C388 | Ãˆ
ê | 234 | EA | 00CD | 205 | 195.141 | C38D | Ã
ë | 235 | EB | 00CE | 206 | 195.142 | C38E | ÃŽ
ì | 236 | EC | 00CF | 207 | 195.143 | C38F | Ã
í | 237 | ED | 00CC | 204 | 195.140 | C38C | ÃŒ
î | 238 | EE | 00D3 | 211 | 195.147 | C393 | Ã“
ï | 239 | EF | 00D4 | 212 | 195.148 | C394 | Ã”
ð | 240 | F0 | F8FF | 63743 | 239.163.191 | EFA3BF | ï£¿
ñ | 241 | F1 | 00D2 | 210 | 195.146 | C392 | Ã’
ò | 242 | F2 | 00DA | 218 | 195.154 | C39A | Ãš
ó | 243 | F3 | 00DB | 219 | 195.155 | C39B | Ã›
ô | 244 | F4 | 00D9 | 217 | 195.153 | C399 | Ã™
õ | 245 | F5 | 0131 | 305 | 196.177 | C4B1 | Ä±
ö | 246 | F6 | 02C6 | 710 | 203.134 | CB86 | Ë†
÷ | 247 | F7 | 02DC | 732 | 203.156 | CB9C | Ëœ
ø | 248 | F8 | 00AF | 175 | 194.175 | C2AF | Â¯
ù | 249 | F9 | 02D8 | 728 | 203.152 | CB98 | Ë˜
ú | 250 | FA | 02D9 | 729 | 203.153 | CB99 | Ë™
û | 251 | FB | 02DA | 730 | 203.154 | CB9A | Ëš
ü | 252 | FC | 00B8 | 184 | 194.184 | C2B8 | Â¸
ý | 253 | FD | 02DD | 733 | 203.157 | CB9D | Ë
þ | 254 | FE | 02DB | 731 | 203.155 | CB9B | Ë›
ÿ | 255 | FF | 02C7 | 711 | 203.135 | CB87 | Ë‡

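The UTF-8 columns of Table 3.1 can be reproduced from the Unicode code with a few lines of Python. This is a sketch only: the function name `utf8_row` and its dictionary keys are made up for illustration, and Python's `cp1252` codec is assumed to match the Windows code page 1252 view.

```python
def utf8_row(code_point: int) -> dict:
    """Build the UTF-8 related columns for one Unicode character code."""
    encoded = chr(code_point).encode("utf-8")
    return {
        "unicode_hex": f"{code_point:04X}",
        "unicode_decimal": code_point,
        "utf8_decimal": ".".join(str(b) for b in encoded),
        "utf8_hex": encoded.hex().upper(),
        # A few byte values (such as 0x81) are unassigned in Python's
        # cp1252 codec, so undecodable bytes are replaced, not raised.
        "viewed_as_1252": encoded.decode("cp1252", errors="replace"),
    }


two_byte = utf8_row(0x00C4)    # encodes to 2 bytes
three_byte = utf8_row(0x2022)  # encodes to 3 bytes

print(two_byte)
print(three_byte)
```

The number of bytes grows with the character code: codes up to 0x7F take 1 byte, codes up to 0x7FF take 2 bytes, and the remaining codes up to 0xFFFF take 3 bytes.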
XML
XML documents can be encoded using a number of different encodings. The encoding is indicated by the encoding attribute in the XML declaration at the top of the document (e.g. <?xml version="1.0" encoding="UTF-8"?>).
Writing an XML document to file
When an XML document is persisted as a file, it is safer to think of it as a stream of bytes rather than a stream of characters. When an XML document is serialized to a file, an encoding is applied to it, and the resulting file is correctly encoded for that encoding.
- If a Unicode encoding is applied, the resulting file is prefixed with the Unicode header 0xFF 0xFE and is encoded with 2 bytes per character.
- If a UTF-8 encoding is applied, the resulting file contains a variable number of bytes per character. If this file is viewed using a tool incapable of decoding UTF-8, you may see a number of strange characters. If the file is viewed using a UTF-8 compliant application (e.g. Internet Explorer, Notepad on Windows 2000 onwards, Visual Studio .Net), the XML document will appear with the correct characters. If characters still appear corrupted or misrepresented, note that some fonts do not contain the full Unicode set.
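The two cases in the list above can be tried out with Python's standard `xml.etree.ElementTree` module. This is only an illustration of the byte-level result, not the generated-class API this article goes on to describe; the element name and text are arbitrary, and the `xml_declaration` argument assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

root = ET.Element("note")
root.text = "caf\u00e9"

# UTF-8: a variable number of bytes per character, encoding named in the header.
utf8_bytes = ET.tostring(root, encoding="utf-8", xml_declaration=True)

# UTF-16: 2 bytes per character here, prefixed with a 2-byte Unicode header.
utf16_bytes = ET.tostring(root, encoding="utf-16", xml_declaration=True)

print(utf8_bytes)
print(utf16_bytes[:2].hex(" "))

# Reading the serialized bytes back with a parser recovers the text.
round_trip = ET.fromstring(utf8_bytes).text
```

In both cases the result is a sequence of bytes, which is why the file is best treated as a byte stream once it has been serialized.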
Turning an XML document into a string
When an XML document is created from a generated class using ToXml (which returns a string), the string returned is encoded as Unicode (except in C++ non-debug builds); however, the XML document header does not show any encoding (<?xml version="1.0"?>).
Because the string returned is Unicode, which is the internal character representation for VB6, .Net and Java, it should be passed as Unicode when it is written to a file or passed to another application. If it has to be converted to a 1-byte-per-character representation first, data will likely be corrupted if complex characters have been used within the document.
If you need to persist an XML document to a file, use ToXmlFile; if you need to pass an XML document to another (non-Unicode) application, use ToXmlStream.
There is also a problem that commonly occurs in C++ Unicode applications when dealing with UTF-8 encoded data. If you load a UTF-8 encoded file into a Unicode application, the temptation is to store it in a Unicode string (WCHAR*), and the conversion to Unicode is often implicit (part of some string/bstr class). However, these conversions typically assume the source string is in the local code page, which is rarely UTF-8 and more frequently ANSI. So when the data is converted to Unicode, the conversion function does not treat the data as UTF-8 and does not decode it correctly. The result is a Unicode string that no longer represents the source.
In these circumstances, it is better either to treat the data as binary or to use the appropriate conversion method (UTF-8 to Unicode).
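The difference between the implicit and the explicit conversion can be shown with a small Python sketch. It assumes the local code page is 1252, and the sample word is an arbitrary choice; the same principle applies to whichever conversion routine a C++ application uses.

```python
# Bytes as they would be loaded from a UTF-8 encoded file.
file_bytes = "r\u00e9sum\u00e9".encode("utf-8")

# Implicit-style conversion: assumes the local (ANSI) code page, here 1252.
wrong = file_bytes.decode("cp1252")

# Explicit conversion: names the encoding the data is actually in.
right = file_bytes.decode("utf-8")

print(wrong)
print(right)

# Treating the data as binary is also safe: the bytes pass through unchanged.
passed_through = bytes(file_bytes)
```

Each accented character becomes two unrelated characters in the wrong conversion, so the resulting string no longer matches the source.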
Passing an XML document to an ASCII or ANSI application
It is common to want to pass the XML document you have created to a non-Unicode application. If you need to do this, you may look first at ToXml, which provides a Unicode string; however, converting this to an ASCII or ANSI string may corrupt complex characters (you lose information going from 2 bytes to 1 byte per character). You could take the string returned from ToXml and apply your own UTF-8 encoding, but the encoding attribute in the header (<?xml version="1.0" encoding="UTF-8"?>) would not be present, and the XML parser decoding the document may misinterpret it.
The better solution is to use the ToXmlStream method. This allows you to specify an encoding, and returns a stream of bytes (array of bytes in VB). This byte stream is a representation of the XML Document in the given encoding, containing the correct encoding attribute in the header (<?xml version="1.0" encoding="UTF-8"?>).
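The contrast between the two approaches can be sketched in Python. The standard `xml.etree.ElementTree` module is used here purely as a stand-in, on the assumption that a string result is comparable to ToXml and an encoded byte result is comparable to ToXmlStream; the element name and text are arbitrary.

```python
import xml.etree.ElementTree as ET

root = ET.Element("name")
root.text = "\u7f8e"  # a character with no 1-byte ANSI representation

# A Unicode string with no encoding in it (the ToXml-like case).
as_string = ET.tostring(root, encoding="unicode")

# Squeezing that string into a 1-byte code page destroys the character.
lossy = as_string.encode("cp1252", errors="replace")

# A byte stream whose header names its encoding (the ToXmlStream-like case).
as_stream = ET.tostring(root, encoding="utf-8", xml_declaration=True)

# The receiving parser reads the header and recovers the original text.
received = ET.fromstring(as_stream).text

print(lossy)
print(received == root.text)
```

Because the byte stream carries its own encoding attribute, the receiver does not have to guess how to decode it.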
Description | Value
Article Created | 6/2/2006