Character Sets
Computers, at their core, deal only with numbers: integers and floating-point values. So how is it that we are able to type and display characters on the console? This is done by sending small integer values back and forth to the console, based on a well-defined mapping between characters and these small numbers. This mapping is referred to as the character set used by the computer. Historically, different languages have required different mappings, i.e. different character sets.
ASCII
One of the most widely used character sets is the [[ASCII]] (American Standard Code for Information Interchange) character set. It defines 128 codes: 95 printable characters and 33 control codes (including the null character NUL and DEL). The printable characters are those needed to represent English text in the United States.
Char | Code Dec | Code Hex | Char | Code Dec | Code Hex | Char | Code Dec | Code Hex | Char | Code Dec | Code Hex |
---|---|---|---|---|---|---|---|---|---|---|---|
NUL | 0 | 00 | SP | 32 | 20 | @ | 64 | 40 | ` | 96 | 60 |
SOH | 1 | 01 | ! | 33 | 21 | A | 65 | 41 | a | 97 | 61 |
STX | 2 | 02 | " | 34 | 22 | B | 66 | 42 | b | 98 | 62 |
ETX | 3 | 03 | # | 35 | 23 | C | 67 | 43 | c | 99 | 63 |
EOT | 4 | 04 | $ | 36 | 24 | D | 68 | 44 | d | 100 | 64 |
ENQ | 5 | 05 | % | 37 | 25 | E | 69 | 45 | e | 101 | 65 |
ACK | 6 | 06 | & | 38 | 26 | F | 70 | 46 | f | 102 | 66 |
BEL | 7 | 07 | ' | 39 | 27 | G | 71 | 47 | g | 103 | 67 |
BS | 8 | 08 | ( | 40 | 28 | H | 72 | 48 | h | 104 | 68 |
HT | 9 | 09 | ) | 41 | 29 | I | 73 | 49 | i | 105 | 69 |
LF | 10 | 0A | * | 42 | 2A | J | 74 | 4A | j | 106 | 6A |
VT | 11 | 0B | + | 43 | 2B | K | 75 | 4B | k | 107 | 6B |
FF | 12 | 0C | , | 44 | 2C | L | 76 | 4C | l | 108 | 6C |
CR | 13 | 0D | - | 45 | 2D | M | 77 | 4D | m | 109 | 6D |
SO | 14 | 0E | . | 46 | 2E | N | 78 | 4E | n | 110 | 6E |
SI | 15 | 0F | / | 47 | 2F | O | 79 | 4F | o | 111 | 6F |
DLE | 16 | 10 | 0 | 48 | 30 | P | 80 | 50 | p | 112 | 70 |
DC1 | 17 | 11 | 1 | 49 | 31 | Q | 81 | 51 | q | 113 | 71 |
DC2 | 18 | 12 | 2 | 50 | 32 | R | 82 | 52 | r | 114 | 72 |
DC3 | 19 | 13 | 3 | 51 | 33 | S | 83 | 53 | s | 115 | 73 |
DC4 | 20 | 14 | 4 | 52 | 34 | T | 84 | 54 | t | 116 | 74 |
NAK | 21 | 15 | 5 | 53 | 35 | U | 85 | 55 | u | 117 | 75 |
SYN | 22 | 16 | 6 | 54 | 36 | V | 86 | 56 | v | 118 | 76 |
ETB | 23 | 17 | 7 | 55 | 37 | W | 87 | 57 | w | 119 | 77 |
CAN | 24 | 18 | 8 | 56 | 38 | X | 88 | 58 | x | 120 | 78 |
EM | 25 | 19 | 9 | 57 | 39 | Y | 89 | 59 | y | 121 | 79 |
SUB | 26 | 1A | : | 58 | 3A | Z | 90 | 5A | z | 122 | 7A |
ESC | 27 | 1B | ; | 59 | 3B | [ | 91 | 5B | { | 123 | 7B |
FS | 28 | 1C | < | 60 | 3C | \ | 92 | 5C | \| | 124 | 7C |
GS | 29 | 1D | = | 61 | 3D | ] | 93 | 5D | } | 125 | 7D |
RS | 30 | 1E | > | 62 | 3E | ^ | 94 | 5E | ~ | 126 | 7E |
US | 31 | 1F | ? | 63 | 3F | _ | 95 | 5F | DEL | 127 | 7F |
Codes 0-31 and 127 are control codes, shown by their standard abbreviations (NUL through US, plus DEL); codes 32-126 are the printable characters.
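Since each character is just a small integer code, most languages let you convert between the two with a simple cast. A brief Java sketch (the class name AsciiDemo is ours, purely illustrative) using values from the table above:

```java
public class AsciiDemo {
    public static void main(String[] args) {
        // A character literal is just a small integer code.
        System.out.println((int) 'A');        // 65
        System.out.println((int) 'n');        // 110
        // Going the other way: cast a code back to a char.
        System.out.println((char) 0x2A);      // *
        // Arithmetic on codes works too: 'A' + 1 is the code for 'B'.
        System.out.println((char) ('A' + 1)); // B
    }
}
```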
While mere humans prefer to think about numbers in base 10, i.e. decimal, it is often more convenient when dealing with computers to use base 16, or [[hexadecimal]]. In hexadecimal we use the normal digits 0-9 and then add six more digits using the letters A-F. Counting in hexadecimal thus goes 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, 10, 11, and so on. Numbers written in hexadecimal are generally prefixed with 0x, as in 0x1A2B, or written with a base-16 subscript, as in 1A2B₁₆.
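Hexadecimal notation can be explored directly in code. A short Java sketch (the class name HexDemo is ours, not part of any standard) showing hex literals and conversion in both directions:

```java
public class HexDemo {
    public static void main(String[] args) {
        int n = 0x1A2B;  // a hexadecimal literal: 1*4096 + 10*256 + 2*16 + 11
        System.out.println(n);                             // 6699
        // Format an integer as hexadecimal text.
        System.out.println(Integer.toHexString(n));        // 1a2b
        // Parse hexadecimal text back into an integer.
        System.out.println(Integer.parseInt("1A2B", 16));  // 6699
    }
}
```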
Bytes
The smallest addressable unit of information in a computer is the [[byte]]. One byte can store a single integer value from 0 to 255, i.e. 256 different possible values. A byte is therefore sufficient to store one ASCII character, since ASCII codes range from 0 to 127. This leaves 128 unused codes that fit in one byte. Over the years, many computer manufacturers decided to use those extra 128 codes for Greek characters, European language characters, or even block graphics characters. This lack of consistency ultimately led to the standardization of character sets that define the codes 128-255.
ISO
There are a number of [[ISO]] standard character sets that expand on the basic Latin characters of ASCII by standardizing the codes 128-255. [[ISO-8859-1]], also known as [[ISO Latin-1]], is probably the most commonly used and is designed to support Western European languages. Other ISO standard character sets cover Eastern European, Southern European, Northern European, Hebrew, and other languages.
Unicode
In an effort to unify all the various character sets in use, the [[Unicode]] standard was developed. Unicode defines codes for all the languages of the world in one character set. The first 128 codes in Unicode match those in ASCII. Unicode codes (called codepoints) currently range from 0 to 0x10FFFF, over one million possible values, and can be browsed on the [[Unicode Charts]]. Most codepoints are thus too big to fit into one byte. There are different ways to encode Unicode codepoints into bytes, the most popular being [[UTF-8]].
In [[UTF-8]], the ASCII characters 0-127 are stored in one byte, just as in ASCII. This means that ASCII-encoded text is also valid UTF-8. Higher codepoints are stored using two to four bytes. The terminal we will use to run programs processes input and output as UTF-8, and most files are stored as UTF-8 to keep file sizes down.
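The variable byte count is easy to observe. A Java sketch (class name Utf8Demo is illustrative) that encodes single characters of increasing codepoint value:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    public static void main(String[] args) {
        // UTF-8 byte counts grow with the codepoint value.
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);  // 1 (U+0041)
        System.out.println("È".getBytes(StandardCharsets.UTF_8).length);  // 2 (U+00C8)
        System.out.println("世".getBytes(StandardCharsets.UTF_8).length); // 3 (U+4E16)
        System.out.println("😀".getBytes(StandardCharsets.UTF_8).length); // 4 (U+1F600)
    }
}
```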
[[UCS-2]] uses two bytes (16 bits) for each character but can only encode the first 65,536 codepoints, the so-called Basic Multilingual Plane (BMP). Because of this, UCS-2 is outdated, though you may still come across it.
[[UTF-16]] extends UCS-2, using the same two-byte (16-bit) encoding for the Basic Multilingual Plane, so valid UCS-2 text is also valid UTF-16. For codepoints in the higher planes, UTF-16 uses a four-byte encoding. UTF-16 text can thus be a mix of two- and four-byte codes, depending on which Unicode codepoints are represented.
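Java strings happen to be UTF-16 internally, so the four-byte case (a surrogate pair) is visible from ordinary string methods. A sketch (class name Utf16Demo is illustrative):

```java
public class Utf16Demo {
    public static void main(String[] args) {
        String s = "😀";  // U+1F600, outside the BMP
        // One codepoint, but two 16-bit char units: a surrogate pair.
        System.out.println(s.length());                       // 2
        System.out.println(s.codePointCount(0, s.length()));  // 1
        // The two char units are the high and low surrogates.
        System.out.printf("%04X %04X%n",
                (int) s.charAt(0), (int) s.charAt(1));        // D83D DE00
    }
}
```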
[[UTF-32]] (also referred to as UCS-4) uses four bytes for each character. The fixed number of bytes per codepoint simplifies encoding, but it also consumes the most space. For this reason UTF-32 is rarely used, except perhaps for the internal representation of Unicode text in memory.
Unicode text must often be converted between UTF-8, when sent to the console or a file, and the internal representation used by a program in memory. The table below illustrates how different languages store Unicode text in memory.
Language | Data Type | Internal Encoding |
---|---|---|
C++ | char | 8-bit |
C++ | wchar_t | 16-bit or 32-bit, depending on implementation |
C++ | string | ASCII or another 8-bit character set |
C++ | wstring | UCS-2 or UTF-32, depending on implementation |
Java | char | 16-bit |
Java | String | UTF-16 (before Java 9); ISO-8859-1 or UTF-16 (Java 9 or later, depending on codepoints used) |
JavaScript | string | UTF-16 |
Perl | string | ISO-8859-1 or UTF-8 (depending on codepoints used) |
Python | str | ISO-8859-1, UCS-2 or UCS-4 (depending on codepoints used) |
Rust | String | UTF-8 |
Swift | Character | 21-bit Unicode scalars |
Swift | String | 21-bit Unicode scalars |
Different languages offer varying levels of Unicode support. Some, like C++, don't specify that you are using Unicode at all, only implementation-dependent character sets, one of which may be Unicode depending on your computer's locale setting. Other languages specifically use Unicode for the internal representation of characters and strings. More recent languages have largely moved away from the idea of individual characters and work only with strings, because of the complexities of Unicode.
What are these complexities? One codepoint represents one character, right? Unfortunately, this is not the case. One character, as we think about it, is a single graphical mark. In some languages these graphical marks can have modifiers above, below or around them that change the meaning of the mark. For example, we can have the character (E), but if we combine it with a diacritical mark called a grave accent (`), we get the mark (È). In Unicode there is one codepoint, 0xC8, for the pre-composed (È); one codepoint, 0x45, for (E); and one codepoint, 0x300, for the combining grave accent. So in Unicode text you can use the single pre-composed code or the two decomposed codes. Either approach is acceptable, and the two are equivalent. European languages, Korean and Arabic are just some examples of languages that use such combining marks. This is why Unicode support varies between languages. In most languages, if you ask how many characters are in the string "È", you might get one or two as the answer, depending on whether it is a pre-composed or decomposed sequence of codepoints. Swift is an example of a language where the answer to that question is always one, because it operates not on Unicode codepoints but on what it calls extended grapheme clusters, which more closely correspond to what humans consider a character.
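The pre-composed versus decomposed distinction can be demonstrated in Java using its standard Normalizer class (the class name GraphemeDemo is ours, purely illustrative):

```java
import java.text.Normalizer;

public class GraphemeDemo {
    public static void main(String[] args) {
        String precomposed = "\u00C8";   // È as one codepoint
        String decomposed  = "E\u0300";  // E followed by the combining grave accent
        // Both render as È, but the codepoint counts differ.
        System.out.println(precomposed.length());  // 1
        System.out.println(decomposed.length());   // 2
        // They look identical yet compare unequal...
        System.out.println(precomposed.equals(decomposed));  // false
        // ...until both are normalized to the same form (NFC = pre-composed).
        System.out.println(Normalizer.normalize(decomposed,
                Normalizer.Form.NFC).equals(precomposed));   // true
    }
}
```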
```java
/******************************************************************************
 * Simple first program with Unicode strings.
 *
 * Copyright © 2020 Richard Lesh.  All rights reserved.
 *****************************************************************************/
public class HelloWorlds {
    public static void main(String[] args) {
        System.out.println("!مرحبا أيها العالم");
        System.out.println("你好, 世界!");
        System.out.println("Hello, world!");
        System.out.println("Bonjour le monde!");
        System.out.println("Hallo welt!");
        System.out.println("Γειά σου Κόσμε!");
        System.out.println("!שלום העולם");
        System.out.println("नमस्ते दुनिया!");
        System.out.println("こんにちは世界!");
        System.out.println("안녕, 월드!");
        System.out.println("Привет, мир!");
        System.out.println("¡Hola mundo!");
    }
}
```
Output
Questions
- {{What is the ASCII decimal code for 'A'?}}
- {{What is the ASCII hexadecimal code for 'n'?}}
- {{What is the ASCII hexadecimal code for '5'?}}
- {{What character has ASCII hexadecimal code 0x2A?}}
- {{What is the Unicode decimal codepoint for 'A'?}}
- {{What is the Unicode decimal codepoint for 'ñ'?}}
- {{What is the Unicode hexadecimal code for 'Γ' (Greek capital gamma)?}}
- {{What is the Unicode hexadecimal code for '☺' (white smiling face)?}}
- {{What character has Unicode decimal codepoint 203?}}
- {{What character has Unicode hexadecimal codepoint 0x2260?}}
References
- [[Unicode Charts]]
- [[Unicode Lookup]]