There are many ways to encode characters into the data that is ultimately stored on disk or in memory, but they roughly break down into three groups.
Unicode
Unicode is a standard that assigns each character a number. There are a lot of characters once you start to count all the symbols and the Hebrew, Chinese, Arabic and other scripts, so we need more space to store each character code. On Windows each character is stored as 2 bytes (this 2-byte form is commonly known as UTF-16), which allows 65,536 different characters to be described. So a document that is 1,000 characters long takes up 2,000 bytes.
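The 2-bytes-per-character arithmetic above can be checked directly. This is a small sketch using Python's built-in codecs; `utf-16-le` is used so that no byte-order mark is prepended and the byte count is exactly twice the character count for these characters.

```python
# A document of 1,000 characters stored as 2 bytes per character.
text = "A" * 1000
data = text.encode("utf-16-le")   # little-endian UTF-16, no BOM
print(len(text))                  # 1000 characters
print(len(data))                  # 2000 bytes
```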
Because each character is the same size, we can easily navigate these kinds of documents and display the full range of characters (font permitting).
Single Byte Encoding
These encodings use 1 byte to store a single character, so they are limited to describing only 256 possible characters. There are a number of different standards (sometimes called codepages) which define different mappings. These typically agree on the definitions for the first 128 characters but differ in their mappings for the remaining 128. Examples include ASCII, Windows-1251, ISO-8859 etc.
Because each character is 1 byte, we can easily navigate these kinds of documents. However, there is no standard way for a file to declare which encoding it uses, so the editor uses the encoding defined in the Options. This means that unless you change that setting, any byte values above 127 may not be displayed as they were intended (Notepad and other editors have the same problem).
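The "agree below 128, differ above 127" behaviour is easy to demonstrate. This sketch decodes the same byte values under two of the codepages mentioned above (ISO-8859-1, known to Python as `latin-1`, and Windows-1251):

```python
# One byte value above 127 means different characters in different codepages.
raw = bytes([0xE0])
print(raw.decode("latin-1"))   # 'à' (Latin small a with grave) in ISO-8859-1
print(raw.decode("cp1251"))    # 'а' (Cyrillic small a) in Windows-1251

# Byte values below 128 decode identically in both codepages.
print(bytes([0x41]).decode("latin-1") == bytes([0x41]).decode("cp1251"))  # True
```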
Multibyte Encoding
In this encoding the range of character values is taken from the Unicode standard, but they are stored in a way that minimizes the resulting size and tries to provide backward compatibility.
A typical example of a multibyte encoding is UTF-8. Under this encoding, values 0-127 are stored in a single byte, so the resulting output is indistinguishable from a single-byte encoding. More exotic characters (i.e. those with values above 127) are represented using more than one byte; in the worst case up to 4 bytes are used to encode a single character. This means that the space required for a document can be between 1 and 4 times the number of characters.
The following table shows how character values are encoded into byte sequences.
| Code point | Binary code point | UTF-8 bytes | Example |
|---|---|---|---|
| U+0000 to U+007F | 0xxxxxxx | 0xxxxxxx | '$' U+0024 = 00100100 → 00100100 → 0x24 |
| U+0080 to U+07FF | 00000yyy yyxxxxxx | 110yyyyy 10xxxxxx | '¢' U+00A2 = 00000000 10100010 → 11000010 10100010 → 0xC2 0xA2 |
| U+0800 to U+FFFF | zzzzyyyy yyxxxxxx | 1110zzzz 10yyyyyy 10xxxxxx | '€' U+20AC = 00100000 10101100 → 11100010 10000010 10101100 → 0xE2 0x82 0xAC |
| U+010000 to U+10FFFF | 000wwwzz zzzzyyyy yyxxxxxx | 11110www 10zzzzzz 10yyyyyy 10xxxxxx | '𤭢' U+024B62 = 00000010 01001011 01100010 → 11110000 10100100 10101101 10100010 → 0xF0 0xA4 0xAD 0xA2 |
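The example column of the table can be reproduced with any standard UTF-8 codec. This sketch uses Python's built-in `encode` to show the byte sequence and length for each of the four example characters:

```python
# Encode each of the table's example characters and show its UTF-8 bytes.
for ch in ["$", "\u00a2", "\u20ac", "\U00024b62"]:   # $, ¢, €, 𤭢
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({len(encoded)} bytes)")
```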
This type of encoding is mainly used when data is persisted to disk (or transferred over a network).
This style of data presents problems for the Large File Editor. Most editors work by pre-parsing the data and storing it internally as Unicode. When dealing with terabyte files this is not possible, and as you scan through a file the editor needs to jump into it at random intervals. If you start reading the file at some arbitrary point it is easy to land in the middle of a character. It also makes assessing your position in the file tricky: if the first half of the file is encoded using 4 bytes per character and the second half using 1 byte per character, then at the midpoint in bytes you have passed 1/8 of the file's bytes worth of characters out of 5/8 in total, so when you think you are roughly in the middle based on byte position, you are only 1/5 of the way through in terms of characters. Because of this the Large File Editor currently treats UTF-8 as a single-byte encoding, which means you will sometimes see 2-4 characters representing a single character; these are typically displayed as odd characters or '?'. They are, however, still written out correctly if the file is saved.
