Character Set
Introduction
A character set is the set of codes that represent each supported character. Each such code is called a code point.
There are two kinds of character sets:
- fixed-length character sets, where each character is represented by the same number of bytes: for instance 1 byte for the West European WE8MSWIN1252 character set and 2 bytes for the Unicode AL16UTF16 one.
- variable-length character sets, where each character is represented by a variable number of bytes: for instance 1 to 3 bytes for the older Unicode UTF8 character set or 1 to 4 bytes for the newer Unicode AL32UTF8 one.
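The difference between the two kinds can be checked by counting the bytes produced for a few characters. The sketch below is only an illustration: it uses Python codecs as stand-ins for the Oracle character sets (`cp1252` for WE8MSWIN1252, `utf-16-be` for AL16UTF16, `utf-8` for AL32UTF8), and that mapping is an assumption, not something defined by this page. The helper name `byte_lengths` is hypothetical.

```python
# Python codec names used as stand-ins for the Oracle character sets
# (an assumption): cp1252 ~ WE8MSWIN1252, utf-16-be ~ AL16UTF16, utf-8 ~ AL32UTF8.
CODECS = ("cp1252", "utf-16-be", "utf-8")

# U+0041 (A), U+00E9 (e acute), U+20AC (euro sign), U+1F600 (an emoji above U+FFFF)
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def byte_lengths(ch):
    """Return the number of bytes used for `ch` in each stand-in codec."""
    lengths = {}
    for codec in CODECS:
        try:
            lengths[codec] = len(ch.encode(codec))
        except UnicodeEncodeError:
            lengths[codec] = None  # the codec has no code for this character
    return lengths


for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", byte_lengths(ch))
```

The single-byte codec always yields 1 byte (or rejects the character), while the UTF-8 lengths grow from 1 to 4 bytes as the code point grows.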
Unicode character sets
Unicode is a standard (aligned with ISO/IEC 10646) that assigns a value to each character. Several character sets have been defined from it; the best known are UCS2, AL16UTF16, UTF8 and AL32UTF8.
UCS2 and AL16UTF16 are fixed-length character sets that code each character on 2 bytes. The difference between the two is that UCS2 does not take platform endianness into account whereas AL16UTF16 does. This means that, in a dump of a string, the bytes of each character are swapped between a big-endian and a little-endian platform in UCS2, whereas they are in the same order with AL16UTF16.
To support more than the 65536 characters allowed by 2 bytes, these character sets have been extended to use 2 groups of 2 bytes.
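The byte swap described above can be reproduced outside the database. The following sketch uses Python's `utf-16-be` and `utf-16-le` codecs purely as stand-ins for a 2-byte encoding dumped on a big-endian and on a little-endian platform; they are not the Oracle character sets themselves, and the helper names `dump` and `swap_pairs` are hypothetical.

```python
def dump(data):
    """Format bytes as space-separated hexadecimal pairs."""
    return " ".join(f"{b:02x}" for b in data)


def swap_pairs(data):
    """Swap the two bytes of every 2-byte unit (input length must be even)."""
    out = bytearray(data)
    out[0::2], out[1::2] = data[1::2], data[0::2]
    return bytes(out)


text = "A\u00e9"  # U+0041 followed by U+00E9

big = text.encode("utf-16-be")     # most significant byte first
little = text.encode("utf-16-le")  # least significant byte first

print(dump(big))     # 00 41 00 e9
print(dump(little))  # 41 00 e9 00
print(swap_pairs(big) == little)  # True
```

Both dumps hold the same code points; only the order of the two bytes inside each character differs.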
AL32UTF8 is an extension of UTF8 that supports more character families, including ones that do not belong to spoken languages.
The following table gives the matching values for the code points:
Unicode code | Unicode representation | UCS2 / AL16UTF16 | (AL32)UTF8 |
---|---|---|---|
U+0000 – U+007F | 00000000 0xxxxxxx | 00000000 0xxxxxxx | 0xxxxxxx |
U+0080 – U+07FF | 00000yyy yyxxxxxx | 00000yyy yyxxxxxx | 110yyyyy 10xxxxxx |
U+0800 – U+FFFF (*) | zzzzyyyy yyxxxxxx | zzzzyyyy yyxxxxxx | 1110zzzz 10yyyyyy 10xxxxxx |
U+10000 – U+10FFFF | 000uuuuu zzzzyyyy yyxxxxxx | 110110ww wwzzzzyy 110111yy yyxxxxxx (**) | 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx |
(*) Unicode codes from U+D800 to U+DFFF are not valid characters. These codes start with the bit string 11011.
(**) The extended Unicode codes (> U+FFFF) are represented with 2 groups of 2 UCS2 bytes; these groups (each starting with the bit string 11011) fall in the invalid range of strict Unicode codes and so cannot be misinterpreted. (Note: in the table, wwww = uuuuu – 1.)
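The bit layouts in the table can be turned into code almost directly. The sketch below is a minimal illustration, not the way any database implements these character sets: the function names `utf8_bytes` and `utf16_units` are hypothetical, and the 16-bit results are returned as integers so that no byte order has to be chosen. Subtracting 0x10000 before splitting the bits is the "wwww = uuuuu – 1" step of the note.

```python
def check_code_point(cp):
    """Reject values outside U+0000..U+10FFFF and the U+D800..U+DFFF range."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a valid Unicode code point: {cp:#x}")


def utf8_bytes(cp):
    """Encode one code point following the (AL32)UTF8 column of the table."""
    check_code_point(cp)
    if cp <= 0x7F:       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:      # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:     # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_units(cp):
    """Return the 16-bit group(s) following the UCS2 / AL16UTF16 column."""
    check_code_point(cp)
    if cp <= 0xFFFF:
        return [cp]
    v = cp - 0x10000     # turns uuuuu into wwww (uuuuu - 1)
    # 110110ww wwzzzzyy, then 110111yy yyxxxxxx
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


print(utf8_bytes(0x20AC).hex())                 # e282ac
print([hex(u) for u in utf16_units(0x1F600)])   # ['0xd83d', '0xde00']
```

Both groups returned for a code above U+FFFF start with the bit string 11011, which is why they can never be confused with a valid 2-byte code.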