Character Sets

Computers, at their core, deal only with numbers: integers and floating point values. So how is it that we are able to type and display characters on the console? This is done by sending small integer values back and forth to the console based on a well-defined mapping between characters and these small numbers. This mapping is referred to as the character set used by the computer. Different languages have historically required different mappings, i.e. different character sets.

ASCII

One of the most widely used character sets is the [[ASCII]] (American Standard Code for Information Interchange) character set. It contains 128 codes: 95 printable characters (including the space) and 33 control codes (including the null character NUL and DEL). The characters it contains are those needed to represent English text in the United States.

ASCII Codes

Char Dec Hex   Char Dec Hex   Char Dec Hex   Char Dec Hex
NUL    0  00   SP    32  20   @     64  40   `     96  60
SOH    1  01   !     33  21   A     65  41   a     97  61
STX    2  02   "     34  22   B     66  42   b     98  62
ETX    3  03   #     35  23   C     67  43   c     99  63
EOT    4  04   $     36  24   D     68  44   d    100  64
ENQ    5  05   %     37  25   E     69  45   e    101  65
ACK    6  06   &     38  26   F     70  46   f    102  66
BEL    7  07   '     39  27   G     71  47   g    103  67
BS     8  08   (     40  28   H     72  48   h    104  68
HT     9  09   )     41  29   I     73  49   i    105  69
LF    10  0A   *     42  2A   J     74  4A   j    106  6A
VT    11  0B   +     43  2B   K     75  4B   k    107  6B
FF    12  0C   ,     44  2C   L     76  4C   l    108  6C
CR    13  0D   -     45  2D   M     77  4D   m    109  6D
SO    14  0E   .     46  2E   N     78  4E   n    110  6E
SI    15  0F   /     47  2F   O     79  4F   o    111  6F
DLE   16  10   0     48  30   P     80  50   p    112  70
DC1   17  11   1     49  31   Q     81  51   q    113  71
DC2   18  12   2     50  32   R     82  52   r    114  72
DC3   19  13   3     51  33   S     83  53   s    115  73
DC4   20  14   4     52  34   T     84  54   t    116  74
NAK   21  15   5     53  35   U     85  55   u    117  75
SYN   22  16   6     54  36   V     86  56   v    118  76
ETB   23  17   7     55  37   W     87  57   w    119  77
CAN   24  18   8     56  38   X     88  58   x    120  78
EM    25  19   9     57  39   Y     89  59   y    121  79
SUB   26  1A   :     58  3A   Z     90  5A   z    122  7A
ESC   27  1B   ;     59  3B   [     91  5B   {    123  7B
FS    28  1C   <     60  3C   \     92  5C   |    124  7C
GS    29  1D   =     61  3D   ]     93  5D   }    125  7D
RS    30  1E   >     62  3E   ^     94  5E   ~    126  7E
US    31  1F   ?     63  3F   _     95  5F   DEL  127  7F

Codes 0-31 and 127 (DEL) are control codes; codes 32-126 are printable characters.

While mere humans prefer to think about numbers using base 10, i.e. decimal, it is often more convenient when dealing with computers to use base 16, or [[hexadecimal]]. In hexadecimal we use the normal digits 0-9 and then add six more digits using the letters A-F. So counting in hexadecimal we have 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, 10, 11, etc. Numbers written in hexadecimal will generally be prefixed with 0x, as in 0x1A2B, or written with the base-16 subscript notation 1A2B₁₆.
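
To see the decimal/hexadecimal correspondence in code, here is a minimal Java sketch (the class name HexDemo is ours, for illustration only):

public class HexDemo {

	public static void main(String[] args) {
		int n = 0x1A2B;	// hexadecimal literal
		System.out.println(n);	// prints 6699, the decimal value
		System.out.println(Integer.toHexString(n));	// prints 1a2b
		System.out.println(Integer.parseInt("1A2B", 16));	// parses base 16, prints 6699
	}
}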

Bytes

The smallest addressable unit of information used by the computer is known as a [[byte]]. With one byte we can store a single integer value from 0 to 255, i.e. 256 different possible values. A byte is therefore sufficient to store one ASCII character, since the codes for ASCII characters range from 0-127. This leaves 128 unused codes that could be stored in one byte. Many computer manufacturers have, over the years, decided to use those extra 128 codes for Greek characters, European language characters or even block graphic characters. This lack of consistency ultimately led to standardization of character sets that use the codes 128-255.
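
As a quick illustration, this minimal Java sketch (class name ByteDemo is ours) shows the integer codes behind characters and the 0-255 range of a byte. Note that Java's byte type happens to be signed (-128 to 127), so recovering the unsigned value requires a mask:

public class ByteDemo {

	public static void main(String[] args) {
		System.out.println((int) 'A');	// prints 65, the ASCII code for 'A'
		System.out.println((char) 97);	// prints a
		byte b = (byte) 200;	// stored as the signed value -56...
		System.out.println(b & 0xFF);	// ...so mask to recover the unsigned value 200
	}
}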

ISO

There are a number of [[ISO]] standard character sets that expand upon the basic Latin characters of ASCII and standardize the codes from 128-255. [[ISO-8859-1]], or [[ISO Latin-1]], is probably the most commonly used and is designed to support Western European languages. Other ISO standard character sets cover Eastern European, Southern European, Northern European, Hebrew and other languages.

Unicode

In an effort to unify all the various character sets in use, the [[Unicode]] standard was developed. Unicode defines codes for all the languages of the world in one character set. The first 128 codes in Unicode match those in ASCII. Unicode currently contains over one million codes, ranging from 0 to 10FFFF₁₆, which can be seen on the [[Unicode Charts]]. Most Unicode codes (called codepoints) are thus too big to fit into one byte. There are different ways to encode Unicode codepoints into bytes; the most popular is [[UTF-8]].
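
In Java, for example, you can ask for the codepoint behind a character with codePointAt(). A minimal sketch (class name CodepointDemo is ours):

public class CodepointDemo {

	public static void main(String[] args) {
		String s = "È";
		int cp = s.codePointAt(0);	// the full Unicode codepoint
		System.out.printf("U+%04X%n", cp);	// prints U+00C8
	}
}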

In [[UTF-8]] the ASCII characters 0-127 are stored in one byte, just like ASCII. This means that ASCII-encoded text is also valid UTF-8. Higher codepoints are stored using two to four bytes. The terminal that we will use to run programs will process input and output as UTF-8. Most files are also stored as UTF-8 to keep file sizes down.
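
You can observe the variable-length encoding directly. This minimal Java sketch (class name Utf8Demo is ours) prints how many UTF-8 bytes each single-codepoint string needs:

import java.nio.charset.StandardCharsets;

public class Utf8Demo {

	public static void main(String[] args) {
		// Each string below is one codepoint but a different number of UTF-8 bytes.
		System.out.println("A".getBytes(StandardCharsets.UTF_8).length);	// 1 byte
		System.out.println("È".getBytes(StandardCharsets.UTF_8).length);	// 2 bytes
		System.out.println("世".getBytes(StandardCharsets.UTF_8).length);	// 3 bytes
		System.out.println("😀".getBytes(StandardCharsets.UTF_8).length);	// 4 bytes
	}
}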

[[UCS-2]] uses two bytes (16 bits) for each character but can only encode the first 65,536 codepoints, the so-called Basic Multilingual Plane (BMP). Because of this, UCS-2 is outdated, though you might still come across it.

[[UTF-16]] also uses two bytes (16 bits) and extends UCS-2, using the same encoding for the Basic Multilingual Plane (BMP); UCS-2 text is therefore valid UTF-16. For the higher planes, UTF-16 uses a four-byte encoding known as a surrogate pair. UTF-16 text can thus be a mix of two-byte and four-byte codes, depending on which Unicode codepoints are represented.
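
Because Java strings are UTF-16 internally, surrogate pairs are easy to observe. A minimal sketch (class name Utf16Demo is ours):

public class Utf16Demo {

	public static void main(String[] args) {
		String s = "😀";	// U+1F600, outside the BMP
		System.out.println(s.length());	// prints 2: two 16-bit char units (a surrogate pair)
		System.out.println(s.codePointCount(0, s.length()));	// prints 1: one codepoint
	}
}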

[[UTF-32]] (also referred to as UCS-4) uses four bytes for each character. Because the number of bytes per codepoint is fixed, encoding is simplified, but it also consumes the most space. This is why UTF-32 is rarely used except, perhaps, for the internal representation of Unicode text in memory.
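
Java has no UTF-32 string type, but an array of int codepoints is effectively the same fixed-width representation. A minimal sketch (class name Utf32Demo is ours):

public class Utf32Demo {

	public static void main(String[] args) {
		int[] codepoints = "Hello😀".codePoints().toArray();	// one 32-bit int per codepoint
		System.out.println(codepoints.length);	// prints 6 codepoints
		System.out.println("Hello😀".length());	// prints 7 UTF-16 char units
	}
}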

Unicode text must often be converted between UTF-8, used when sending text to the console or to a file, and the internal representation used by a program in memory. The table below illustrates how different languages store Unicode text in memory.

How Different Languages Encode Unicode

Language    Data Type  Internal Encoding
C++         char       8-bit
C++         wchar_t    16-bit or 32-bit, depending on implementation
C++         string     ASCII or other 8-bit character set
C++         wstring    UCS-2 or UTF-32, depending on implementation
Java        char       16-bit
Java        String     UTF-16 (before Java 9); ISO-8859-1 or UTF-16 (Java 9 or later, depending on codepoints used)
JavaScript  string     UTF-16
Perl        string     ISO-8859-1 or UTF-8, depending on codepoints used
Python      str        ISO-8859-1, UCS-2 or UCS-4, depending on codepoints used
Rust        String     UTF-8
Swift       Character  21-bit Unicode scalars
Swift       String     21-bit Unicode scalars

There are varying levels of support for Unicode in different languages. Some, like C++, don't specify that you are using Unicode at all, only implementation-dependent character sets, of which Unicode may be one depending on your computer's locale setting. Other languages specifically use Unicode for the internal representation of characters and strings. More recent languages have largely deprecated the idea of individual characters and really only work with strings, because of the complexities of Unicode.

What are these complexities? One codepoint represents one character, right? Unfortunately, this is not the case. One character, as we think about it, is a single graphical mark. In some languages these graphical marks can have modifiers above, below or around them that change the meaning of the mark. For example, we can have the character (E). If we combine it with a diacritical mark called a grave accent (`), we get the mark (È). In Unicode there is one precomposed codepoint 0xC8 for (È), and there is also a decomposed spelling: the codepoint 0x45 for (E) followed by the codepoint 0x300 for a combining grave accent. So in Unicode text you can use the single precomposed codepoint or the two-codepoint decomposed sequence. Either approach is acceptable and equivalent. European languages, Korean and Arabic are just some examples of languages that use this technique.

This is why support for Unicode varies between languages. In most languages, if you ask how many characters are in the string "È", you might get one or two as the answer, depending on whether it is a precomposed or decomposed sequence. Swift is an example of a language where the answer to that question is always one, because it operates not on Unicode codepoints but on what it calls extended grapheme clusters, which more closely correspond to what humans consider a character.
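
Java's Normalizer class lets you see and reconcile the two spellings. A minimal sketch (class name NormalizeDemo is ours):

import java.text.Normalizer;

public class NormalizeDemo {

	public static void main(String[] args) {
		String precomposed = "\u00C8";	// È as one codepoint
		String decomposed = "E\u0300";	// E plus combining grave accent
		System.out.println(precomposed.length());	// prints 1
		System.out.println(decomposed.length());	// prints 2
		System.out.println(precomposed.equals(decomposed));	// false: different codepoints
		// Normalizing to NFC composes the pair into the single codepoint.
		String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
		System.out.println(nfc.equals(precomposed));	// true
	}
}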

HelloWorlds.java
/******************************************************************************
 * Simple first program with Unicode strings.
 * 
 * Copyright © 2020 Richard Lesh.  All rights reserved.
 *****************************************************************************/

public class HelloWorlds {

	public static void main(String[] args) {
		System.out.println("!مرحبا أيها العالم");
		System.out.println("你好, 世界!");
		System.out.println("Hello, world!");
		System.out.println("Bonjour le monde!");
		System.out.println("Hallo welt!");
		System.out.println("Γειά σου Κόσμε!");
		System.out.println("!שלום העולם");
		System.out.println("नमस्ते दुनिया!");
		System.out.println("こんにちは世界!");
		System.out.println("안녕, 월드!");
		System.out.println("Привет, мир!");
		System.out.println("¡Hola mundo!");
	}
}

Output
$ javac -Xlint HelloWorlds.java
$ java -ea HelloWorlds
!مرحبا أيها العالم
你好, 世界!
Hello, world!
Bonjour le monde!
Hallo welt!
Γειά σου Κόσμε!
!שלום העולם
नमस्ते दुनिया!
こんにちは世界!
안녕, 월드!
Привет, мир!
¡Hola mundo!
