Character Sets
Computers, at their core, deal only with numbers: integers and floating-point values. So how is it that we are able to type and display characters on the console? This is done by sending small integer values back and forth to the console, based on a well-defined mapping between characters and these small numbers. This mapping is referred to as the character set used by the computer. Historically, different languages have required different mappings, i.e. different character sets.
ASCII
One of the most widely used character sets is the [[ASCII]] (American Standard Code for Information Interchange) character set. It defines 128 codes: 95 printable characters and 33 control codes (including the null character NUL and DEL). The printable characters are those needed to represent English text in the United States.
Char | Code Dec | Code Hex | Char | Code Dec | Code Hex | Char | Code Dec | Code Hex | Char | Code Dec | Code Hex |
---|---|---|---|---|---|---|---|---|---|---|---|
NUL | 0 | 00 | SP | 32 | 20 | @ | 64 | 40 | ` | 96 | 60 |
SOH | 1 | 01 | ! | 33 | 21 | A | 65 | 41 | a | 97 | 61 |
STX | 2 | 02 | " | 34 | 22 | B | 66 | 42 | b | 98 | 62 |
ETX | 3 | 03 | # | 35 | 23 | C | 67 | 43 | c | 99 | 63 |
EOT | 4 | 04 | $ | 36 | 24 | D | 68 | 44 | d | 100 | 64 |
ENQ | 5 | 05 | % | 37 | 25 | E | 69 | 45 | e | 101 | 65 |
ACK | 6 | 06 | & | 38 | 26 | F | 70 | 46 | f | 102 | 66 |
BEL | 7 | 07 | ' | 39 | 27 | G | 71 | 47 | g | 103 | 67 |
BS | 8 | 08 | ( | 40 | 28 | H | 72 | 48 | h | 104 | 68 |
HT | 9 | 09 | ) | 41 | 29 | I | 73 | 49 | i | 105 | 69 |
LF | 10 | 0A | * | 42 | 2A | J | 74 | 4A | j | 106 | 6A |
VT | 11 | 0B | + | 43 | 2B | K | 75 | 4B | k | 107 | 6B |
FF | 12 | 0C | , | 44 | 2C | L | 76 | 4C | l | 108 | 6C |
CR | 13 | 0D | - | 45 | 2D | M | 77 | 4D | m | 109 | 6D |
SO | 14 | 0E | . | 46 | 2E | N | 78 | 4E | n | 110 | 6E |
SI | 15 | 0F | / | 47 | 2F | O | 79 | 4F | o | 111 | 6F |
DLE | 16 | 10 | 0 | 48 | 30 | P | 80 | 50 | p | 112 | 70 |
DC1 | 17 | 11 | 1 | 49 | 31 | Q | 81 | 51 | q | 113 | 71 |
DC2 | 18 | 12 | 2 | 50 | 32 | R | 82 | 52 | r | 114 | 72 |
DC3 | 19 | 13 | 3 | 51 | 33 | S | 83 | 53 | s | 115 | 73 |
DC4 | 20 | 14 | 4 | 52 | 34 | T | 84 | 54 | t | 116 | 74 |
NAK | 21 | 15 | 5 | 53 | 35 | U | 85 | 55 | u | 117 | 75 |
SYN | 22 | 16 | 6 | 54 | 36 | V | 86 | 56 | v | 118 | 76 |
ETB | 23 | 17 | 7 | 55 | 37 | W | 87 | 57 | w | 119 | 77 |
CAN | 24 | 18 | 8 | 56 | 38 | X | 88 | 58 | x | 120 | 78 |
EM | 25 | 19 | 9 | 57 | 39 | Y | 89 | 59 | y | 121 | 79 |
SUB | 26 | 1A | : | 58 | 3A | Z | 90 | 5A | z | 122 | 7A |
ESC | 27 | 1B | ; | 59 | 3B | [ | 91 | 5B | { | 123 | 7B |
FS | 28 | 1C | < | 60 | 3C | \ | 92 | 5C | \| | 124 | 7C |
GS | 29 | 1D | = | 61 | 3D | ] | 93 | 5D | } | 125 | 7D |
RS | 30 | 1E | > | 62 | 3E | ^ | 94 | 5E | ~ | 126 | 7E |
US | 31 | 1F | ? | 63 | 3F | _ | 95 | 5F | DEL | 127 | 7F |
Codes 0-31 and 127 are control codes, shown by their standard abbreviations (NUL through US, plus DEL); codes 32-126 are the printable characters.
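Since each character is just a small integer code, most languages let you convert between the two with a simple cast. A brief Java sketch (the class name AsciiDemo is ours, purely illustrative) using values from the table above:

```java
public class AsciiDemo {
    public static void main(String[] args) {
        // A character literal is just a small integer code.
        System.out.println((int) 'A');        // 65
        System.out.println((int) 'n');        // 110
        // Going the other way: cast a code back to a char.
        System.out.println((char) 0x2A);      // *
        // Arithmetic on codes works too: 'A' + 1 is the code for 'B'.
        System.out.println((char) ('A' + 1)); // B
    }
}
```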
While mere humans prefer to think about numbers in base 10, i.e. decimal, it is often more convenient when dealing with computers to use base 16, or [[hexadecimal]]. In hexadecimal we use the normal digits 0-9 and then add six more digits using the letters A-F. Counting in hexadecimal thus goes 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, 10, 11, and so on. Numbers written in hexadecimal are generally prefixed with 0x, as in 0x1A2B, or written with a base-16 subscript, as in 1A2B₁₆.
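Hexadecimal notation can be explored directly in code. A short Java sketch (the class name HexDemo is ours, not part of any standard) showing hex literals and conversion in both directions:

```java
public class HexDemo {
    public static void main(String[] args) {
        int n = 0x1A2B;  // a hexadecimal literal: 1*4096 + 10*256 + 2*16 + 11
        System.out.println(n);                             // 6699
        // Format an integer as hexadecimal text.
        System.out.println(Integer.toHexString(n));        // 1a2b
        // Parse hexadecimal text back into an integer.
        System.out.println(Integer.parseInt("1A2B", 16));  // 6699
    }
}
```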
Bytes
The smallest addressable unit of information in a computer is the [[byte]]. One byte can store a single integer value from 0 to 255, i.e. 256 different possible values. A byte is therefore sufficient to store one ASCII character, since ASCII codes range from 0 to 127. This leaves 128 unused codes that fit in one byte. Over the years, many computer manufacturers decided to use those extra 128 codes for Greek characters, European language characters, or even block graphics characters. This lack of consistency ultimately led to the standardization of character sets that define the codes 128-255.
ISO
There are a number of [[ISO]] standard character sets that expand on the basic Latin characters of ASCII by standardizing the codes 128-255. [[ISO-8859-1]], also known as [[ISO Latin-1]], is probably the most commonly used and is designed to support Western European languages. Other ISO standard character sets cover Eastern European, Southern European, Northern European, Hebrew, and other languages.
Unicode
In an effort to unify all the various character sets in use, the [[Unicode]] standard was developed. Unicode defines codes for all the languages of the world in one character set. The first 128 codes in Unicode match those in ASCII. Unicode codes (called codepoints) currently range from 0 to 0x10FFFF, over one million possible values, and can be browsed on the [[Unicode Charts]]. Most codepoints are thus too big to fit into one byte. There are different ways to encode Unicode codepoints into bytes, the most popular being [[UTF-8]].
In [[UTF-8]], the ASCII characters 0-127 are stored in one byte, just as in ASCII. This means that ASCII-encoded text is also valid UTF-8. Higher codepoints are stored using two to four bytes. The terminal we will use to run programs processes input and output as UTF-8, and most files are stored as UTF-8 to keep file sizes down.
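The variable byte count is easy to observe. A Java sketch (class name Utf8Demo is illustrative) that encodes single characters of increasing codepoint value:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    public static void main(String[] args) {
        // UTF-8 byte counts grow with the codepoint value.
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);  // 1 (U+0041)
        System.out.println("È".getBytes(StandardCharsets.UTF_8).length);  // 2 (U+00C8)
        System.out.println("世".getBytes(StandardCharsets.UTF_8).length); // 3 (U+4E16)
        System.out.println("😀".getBytes(StandardCharsets.UTF_8).length); // 4 (U+1F600)
    }
}
```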
[[UCS-2]] uses two bytes (16 bits) for each character but can only encode the first 65,536 codepoints, the so-called Basic Multilingual Plane (BMP). Because of this, UCS-2 is outdated, though you may still come across it.
[[UTF-16]] extends UCS-2, using the same two-byte (16-bit) encoding for the Basic Multilingual Plane, so valid UCS-2 text is also valid UTF-16. For codepoints in the higher planes, UTF-16 uses a four-byte encoding. UTF-16 text can thus be a mix of two- and four-byte codes, depending on which Unicode codepoints are represented.
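Java strings happen to be UTF-16 internally, so the four-byte case (a surrogate pair) is visible from ordinary string methods. A sketch (class name Utf16Demo is illustrative):

```java
public class Utf16Demo {
    public static void main(String[] args) {
        String s = "😀";  // U+1F600, outside the BMP
        // One codepoint, but two 16-bit char units: a surrogate pair.
        System.out.println(s.length());                       // 2
        System.out.println(s.codePointCount(0, s.length()));  // 1
        // The two char units are the high and low surrogates.
        System.out.printf("%04X %04X%n",
                (int) s.charAt(0), (int) s.charAt(1));        // D83D DE00
    }
}
```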
[[UTF-32]] (also referred to as UCS-4) uses four bytes for each character. The fixed number of bytes per codepoint simplifies encoding, but it also consumes the most space. For this reason UTF-32 is rarely used, except perhaps for the internal representation of Unicode text in memory.
Unicode text must often be converted between UTF-8, when sent to the console or a file, and the internal representation used by a program in memory. The table below illustrates how different languages store Unicode text in memory.
Language | Data Type | Internal Encoding |
---|---|---|
C++ | char | 8-bit |
C++ | wchar_t | 16-bit or 32-bit, depending on implementation |
C++ | string | ASCII or another 8-bit character set |
C++ | wstring | UCS-2 or UTF-32, depending on implementation |
Java | char | 16-bit |
Java | String | UTF-16 (before Java 9); ISO-8859-1 or UTF-16 (Java 9 or later, depending on codepoints used) |
JavaScript | string | UTF-16 |
Perl | string | ISO-8859-1 or UTF-8 (depending on codepoints used) |
Python | str | ISO-8859-1, UCS-2 or UCS-4 (depending on codepoints used) |
Rust | String | UTF-8 |
Swift | Character | 21-bit Unicode scalars |
Swift | String | 21-bit Unicode scalars |
Different languages offer varying levels of Unicode support. Some, like C++, don't specify that you are using Unicode at all, only implementation-dependent character sets, one of which may be Unicode depending on your computer's locale setting. Other languages specifically use Unicode for the internal representation of characters and strings. More recent languages have largely moved away from the idea of individual characters and work only with strings, because of the complexities of Unicode.
What are these complexities? One codepoint represents one character, right? Unfortunately, this is not the case. One character, as we think about it, is a single graphical mark. In some languages these graphical marks can have modifiers above, below or around them that change the meaning of the mark. For example, we can have the character (E), but if we combine it with a diacritical mark called a grave accent (`), we get the mark (È). In Unicode there is one codepoint, 0xC8, for the pre-composed (È); one codepoint, 0x45, for (E); and one codepoint, 0x300, for the combining grave accent. So in Unicode text you can use the single pre-composed code or the two decomposed codes. Either approach is acceptable, and the two are equivalent. European languages, Korean and Arabic are just some examples of languages that use such combining marks. This is why Unicode support varies between languages. In most languages, if you ask how many characters are in the string "È", you might get one or two as the answer, depending on whether it is a pre-composed or decomposed sequence of codepoints. Swift is an example of a language where the answer to that question is always one, because it operates not on Unicode codepoints but on what it calls extended grapheme clusters, which more closely correspond to what humans consider a character.
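The pre-composed versus decomposed distinction can be demonstrated in Java using its standard Normalizer class (the class name GraphemeDemo is ours, purely illustrative):

```java
import java.text.Normalizer;

public class GraphemeDemo {
    public static void main(String[] args) {
        String precomposed = "\u00C8";   // È as one codepoint
        String decomposed  = "E\u0300";  // E followed by the combining grave accent
        // Both render as È, but the codepoint counts differ.
        System.out.println(precomposed.length());  // 1
        System.out.println(decomposed.length());   // 2
        // They look identical yet compare unequal...
        System.out.println(precomposed.equals(decomposed));  // false
        // ...until both are normalized to the same form (NFC = pre-composed).
        System.out.println(Normalizer.normalize(decomposed,
                Normalizer.Form.NFC).equals(precomposed));   // true
    }
}
```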
```java
/******************************************************************************
 * Simple first program with Unicode strings.
 *
 * Copyright © 2020 Richard Lesh.  All rights reserved.
 *****************************************************************************/
public class HelloWorlds {
    public static void main(String[] args) {
        System.out.println("!مرحبا أيها العالم");
        System.out.println("你好, 世界!");
        System.out.println("Hello, world!");
        System.out.println("Bonjour le monde!");
        System.out.println("Hallo welt!");
        System.out.println("Γειά σου Κόσμε!");
        System.out.println("!שלום העולם");
        System.out.println("नमस्ते दुनिया!");
        System.out.println("こんにちは世界!");
        System.out.println("안녕, 월드!");
        System.out.println("Привет, мир!");
        System.out.println("¡Hola mundo!");
    }
}
```
Output
Questions
- {{What is the ASCII decimal code for 'A'?}}
- {{What is the ASCII hexadecimal code for 'n'?}}
- {{What is the ASCII hexadecimal code for '5'?}}
- {{What character has ASCII hexadecimal code 0x2A?}}
- {{What is the Unicode decimal codepoint for 'A'?}}
- {{What is the Unicode decimal codepoint for 'ñ'?}}
- {{What is the Unicode hexadecimal code for 'Γ' (Greek capital gamma)?}}
- {{What is the Unicode hexadecimal code for '☺' (white smiling face)?}}
- {{What character has Unicode decimal codepoint 203?}}
- {{What character has Unicode hexadecimal codepoint 0x2260?}}
References
- [[Unicode Charts]]
- [[Unicode Lookup]]