Character Data

Numbers are only part of the data a typical Java program needs to read and write. Most programs also need to handle text, which is composed of characters. Since computers only really understand numbers, characters are encoded by matching each character in a given script to a particular number. For example, in the common ASCII encoding, the character A is mapped to the number 65; the character B is mapped to the number 66; the character C is mapped to the number 67; and so on. Different encodings may encode different scripts or may encode the same or similar scripts in different ways.
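As a small illustration of this mapping, the following sketch converts between characters and their numeric codes with plain casts. The class and method names (CharCodes, codeOf, charOf) are hypothetical and exist only for this example.

```java
// Hypothetical demo class; not part of any library.
class CharCodes {
    // Returns the numeric code stored for a character.
    static int codeOf(char c) {
        return (int) c;
    }

    // Returns the character associated with a numeric code.
    static char charOf(int code) {
        return (char) code;
    }

    public static void main(String[] args) {
        System.out.println(codeOf('A'));  // prints 65
        System.out.println(charOf(66));   // prints B
    }
}
```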

Java understands several dozen different character sets for a variety of languages, ranging from ASCII to Shift-JIS (SJIS, a Japanese encoding) to Unicode. Internally, Java uses the Unicode character set. Unicode is a two-byte extension of the one-byte ISO Latin-1 character set, which in turn is an eight-bit superset of the seven-bit ASCII character set.

ASCII

ASCII, the American Standard Code for Information Interchange, is a seven-bit character set. Thus it defines 2^7 or 128 different characters whose numeric values range from 0 to 127. These characters are sufficient for handling most of American English and can make reasonable approximations to most European languages (with the notable exceptions of Russian and Greek). It’s an often-used lowest common denominator format for different computers. If you were to read a byte value between 0 and 127 from a stream, then cast it to a char, the result would be the corresponding ASCII character.

ASCII characters 0-31 and character 127 are nonprinting control characters. Characters 32-47 are various punctuation and space characters. Characters 48-57 are the digits 0-9. Characters 58-64 are another group of punctuation characters. Characters 65-90 are the capital letters A-Z. Characters 91-96 are a few more punctuation marks. Characters 97-122 are the lowercase letters a-z. Finally, characters 123 through 126 are a few remaining punctuation symbols. The complete ASCII character set is shown in Table 2.1 in Appendix B.
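The ranges just listed can be sketched as a simple classifier. This is only an illustration of the layout described above; the class name AsciiRanges and the category labels are made up for the example.

```java
// Hypothetical helper that names the ASCII range a code falls in.
class AsciiRanges {
    static String classify(int code) {
        if (code < 0 || code > 127) return "not ASCII";
        if (code <= 31 || code == 127) return "control";
        if (code >= 48 && code <= 57) return "digit";
        if (code >= 65 && code <= 90) return "uppercase letter";
        if (code >= 97 && code <= 122) return "lowercase letter";
        // Everything left over: 32-47, 58-64, 91-96, and 123-126.
        return "punctuation or space";
    }

    public static void main(String[] args) {
        System.out.println(classify('A'));  // prints uppercase letter
        System.out.println(classify(10));   // prints control
        System.out.println(classify('~'));  // prints punctuation or space
    }
}
```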

All Java programs can be expressed in pure ASCII. Non-ASCII Unicode characters are encoded as Unicode escapes; that is, written as a backslash (\), followed by a u, followed by four hexadecimal digits; for example, \u00A9. This is discussed further in Section 1.3.3, later in this chapter.

ISO Latin-1

ISO Latin-1 is an eight-bit character set that’s a strict superset of ASCII. It defines 2^8 or 256 different characters whose numeric values range from 0 to 255. The first 128 characters (that is, those numbers with the high-order bit equal to zero) correspond exactly to the ASCII character set. Thus 65 is ASCII A and ISO Latin-1 A; 66 is ASCII B and ISO Latin-1 B; and so on. Where ISO Latin-1 and ASCII diverge is in the characters between 128 and 255 (characters with high bit equal to one). ASCII does not define these characters. ISO Latin-1 uses them for various accented letters like ü needed for non-English languages written in a Roman script, additional punctuation marks and symbols like ©, and additional control characters. The upper, non-ASCII half of the ISO Latin-1 character set is shown in Table 2.2.

Latin-1 provides enough characters to write most Western European languages (again with the notable exception of Greek). It’s a popular lowest common denominator format for different computers. If you were to read an unsigned byte value from a stream, then cast it to a char, the result would be the corresponding ISO Latin-1 character.
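One detail matters when the bytes come from a byte array rather than from a stream's read() method: Java's byte type is signed, so the value has to be masked to the range 0-255 before the cast. The sketch below shows this under that assumption; the class name Latin1Decoder is hypothetical.

```java
// Hypothetical helper that treats each byte as one ISO Latin-1 character.
class Latin1Decoder {
    // Masks off the sign extension so that byte -4 becomes 252, not 65532.
    static char toChar(byte b) {
        return (char) (b & 0xFF);
    }

    static String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length);
        for (byte b : bytes) {
            sb.append(toChar(b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 0x41 is A, 0xFC is u with umlaut, 0xA9 is the copyright sign.
        byte[] data = {0x41, (byte) 0xFC, (byte) 0xA9};
        System.out.println(decode(data));
    }
}
```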

Unicode

ISO Latin-1 suffices for most Western European languages, but it doesn’t have anywhere near the number of characters required to represent Cyrillic, Greek, Arabic, Hebrew, Persian, or Devanagari, not to mention pictographic languages like Chinese and Japanese. Chinese alone has over 80,000 different characters. To handle these scripts and many others, the Unicode character set was invented. Unicode is a 2-byte, 16-bit character set with 2^16 or 65,536 different possible characters. (Only about 40,000 are used in practice, the rest being reserved for future expansion.) Unicode can handle most of the world’s living languages and a number of dead ones as well.

The first 256 characters of Unicode—that is, the characters whose high-order byte is zero—are identical to the characters of the ISO Latin-1 character set. Thus 65 is ASCII A and Unicode A; 66 is ASCII B and Unicode B and so on.

Java streams do not do a good job of reading Unicode text. (This is why readers and writers were added in Java 1.1.) Streams generally read a byte at a time, but each Unicode character occupies two bytes. Thus, to read a Unicode character, you multiply the first byte read by 256, add it to the second byte read, and cast the result to a char. For example:

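int b1 = in.read();  // high-order byte (0-255), or -1 at end of stream
int b2 = in.read();  // low-order byte (0-255), or -1 at end of stream
if (b1 == -1 || b2 == -1) throw new java.io.EOFException();
char c = (char) (b1*256 + b2);  // high byte first (big-endian order)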

You must be careful to ensure that you don’t inadvertently read the last byte of one character and the first byte of the next, instead. Thus, for the most part, when reading text encoded in Unicode or any other format, you should use a reader rather than an input stream. Readers handle the conversion of bytes in one character set to Java chars without any extra effort. For similar reasons, you should use a writer rather than an output stream to write text.
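A minimal sketch of the reader-based approach follows. It assumes the bytes are two-byte Unicode with the high-order byte first, as in the stream example, and it uses StandardCharsets, which is a later addition to Java (version 7) than the readers themselves. The class name TwoByteReaderDemo is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: lets a reader pair up the bytes instead of doing it by hand.
class TwoByteReaderDemo {
    static String decode(byte[] bytes) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_16BE)) {
            StringBuilder sb = new StringBuilder();
            int c;
            // Reader.read() returns one char (0-65535) per call, or -1 at the end.
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Two characters, two bytes each: A (0x0041) and the copyright sign (0x00A9).
        byte[] data = {0x00, 0x41, 0x00, (byte) 0xA9};
        System.out.println(decode(data).length());  // prints 2
    }
}
```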

UTF-8

Unicode is a relatively inefficient encoding when most of your text consists of ASCII characters. Every character requires the same number of bytes—two—even though some characters are used much more frequently than others. A more efficient encoding would use fewer bits for the more common characters. This is what UTF-8 does.

In UTF-8 the ASCII alphabet is encoded using a single byte, just as in ASCII. The next 1,920 characters (those numbered 128 through 2047) are encoded in two bytes. The remaining Unicode characters are encoded in three bytes. However, since these three-byte characters are relatively uncommon,[3] especially in English text, the savings achieved by encoding ASCII in a single byte more than makes up for it.
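These byte counts can be checked with a short sketch. It assumes the 16-bit characters discussed here and is not meaningful for a lone surrogate char; the class name Utf8Lengths is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper that reports how many bytes UTF-8 needs for one char.
class Utf8Lengths {
    // Assumes c is not a surrogate; a lone surrogate has no valid UTF-8 form.
    static int utf8Length(char c) {
        return String.valueOf(c).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length('A'));       // prints 1
        System.out.println(utf8Length('\u00A9'));  // prints 2
        System.out.println(utf8Length('\u4E2D'));  // prints 3
    }
}
```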

Java’s .class files use UTF-8 internally to store string literals. Data input streams and data output streams also read and write strings in UTF-8. However, this is all hidden from direct view of the programmer, unless perhaps you’re trying to write a Java compiler or parse output of a data stream without using the DataInputStream class.
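To peek behind that curtain, the sketch below writes a string with DataOutputStream.writeUTF() and reads it back with DataInputStream.readUTF(). For the sample string, the output is a two-byte length followed by the UTF-8 bytes; the Java API documentation describes the format as a slightly modified UTF-8, which differs from the standard form for a few characters such as the null character. The class name WriteUtfDemo is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo of the bytes produced by writeUTF().
class WriteUtfDemo {
    static byte[] encode(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static String decode(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A takes one byte and the copyright sign takes two, plus a two-byte length.
        byte[] bytes = encode("A\u00A9");
        System.out.println(bytes.length);  // prints 5
    }
}
```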

Other encodings

ASCII, ISO Latin-1, and Unicode are hardly the only character sets in common use, though they are the ones handled most directly by Java. There are many other character sets, both that encode different scripts and that encode the same scripts in different ways. For example, IBM mainframes have long used a non-ASCII eight-bit character set called EBCDIC. EBCDIC has most of the same characters as ASCII but assigns them to different numbers. Macintoshes commonly use an eight-bit encoding called MacRoman that matches ASCII in the lower 128 places and has most of the same characters as ISO Latin-1 in the upper 128 characters but in different positions. Big-5 and SJIS are encodings of Chinese and Japanese, respectively, that are designed to allow these large scripts to be input from a standard English keyboard.

Java’s Reader, Writer, and String classes understand how to convert these character sets to and from Unicode. This will be the subject of Chapter 14.
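As a brief preview, and only as a sketch, the String class can already convert bytes in one encoding to Unicode chars and back out to another encoding. The example uses ISO Latin-1 and UTF-8 because both are guaranteed to be available on every Java platform; the class name Transcoder is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper that re-encodes ISO Latin-1 bytes as UTF-8.
class Transcoder {
    static byte[] latin1ToUtf8(byte[] latin1) {
        // Step 1: bytes in the source encoding become Unicode chars.
        String text = new String(latin1, StandardCharsets.ISO_8859_1);
        // Step 2: Unicode chars become bytes in the target encoding.
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xE9 is e with an acute accent in ISO Latin-1; UTF-8 needs two bytes for it.
        byte[] utf8 = latin1ToUtf8(new byte[] {0x41, (byte) 0xE9});
        System.out.println(utf8.length);  // prints 3
    }
}
```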

The char Data Type

Character-oriented data in Java is primarily composed of the char primitive data type, char arrays, and Strings, which are stored as arrays of chars internally. Just as you need to understand bytes to really grasp how input and output streams work, so too do you need to understand chars to understand how readers and writers work.

In Java, a char is a two-byte, unsigned integer, the only unsigned type in Java. Thus, possible char values range from 0 to 65,535. Each char represents a particular character in the Unicode character set. A char may be assigned an int literal in this range; for example:

char copyright = 169;

chars may also be assigned to by using char literals; that is, the character itself enclosed in single quotes:

char copyright = '©';

Sun’s javac compiler can translate many different encodings to Unicode by using the -encoding command-line flag to specify the encoding in which the file is written. For example, if you know a file is written in ISO Latin-1, you might compile it as follows:

% javac -encoding 8859_1 CharTest.java

The complete list of available encodings is given in Table 2.4.

With the exception of Unicode itself, most character sets understood by Java do not have equivalents for all the Unicode characters. To encode characters that do not exist in the character set you’re programming with, you can use Unicode escapes. A Unicode escape sequence is an unescaped backslash, followed by any number of u characters, followed by four hexadecimal digits specifying the character to be used. For example:

char copyright = '\u00A9';

Note

The double backslash, \\, is an escaped backslash, which is replaced by a single backslash that only means the backslash character. It is not further interpreted. Thus a Java compiler interprets the string \u00A9 as © but \\u00A9 as the literal string \u00A9 and the string \\\u00A9 as \©. Whenever an odd number of backslashes precedes the u and the four hex digits, the last backslash, the u, and the digits are interpreted together as a single Unicode character. Whenever an even number of backslashes precedes them, the u and the four hex digits are interpreted as separate literal characters.

Unicode escapes may be used not just in char literals, but also in strings, identifiers, comments, and even in keywords, separators, operators, and numeric literals. The compiler translates Unicode escapes to actual Unicode characters before it does anything else with a source code file. However, the actual use of Unicode escapes inside keywords, separators, operators, and numeric literals is unnecessary and can only lead to obfuscation. With the possible exception of identifiers, comments, and string and char literals, Java programs can be expressed in pure ASCII without using Unicode escapes.
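The following sketch illustrates these rules: an escape in a char literal, an escape in a string literal, a doubled backslash that suppresses the escape, and an escape inside an identifier (shown only to demonstrate that it compiles, not as a recommendation). The class and field names are hypothetical.

```java
// Hypothetical demo of where Unicode escapes are recognized.
class UnicodeEscapes {
    // One char: the copyright sign, code 169.
    static final char COPYRIGHT = '\u00A9';

    // The escape is translated first, so this string holds a single char.
    static final String ESCAPED = "\u00A9";

    // The doubled backslash is an escaped backslash, so this string holds
    // six chars: a backslash, the letter u, and the digits 0, 0, A, 9.
    static final String DOUBLED = "\\u00A9";

    // The escape below stands for the letter a, so this field is named abc.
    static final int \u0061bc = 3;

    public static void main(String[] args) {
        System.out.println(ESCAPED.length());  // prints 1
        System.out.println(DOUBLED.length());  // prints 6
        System.out.println(abc);               // prints 3
    }
}
```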

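A char used in arithmetic is promoted to int. This presents the same problem as it does for bytes: the int result cannot be assigned back to a char without a cast, unless the whole expression is a compile-time constant whose value fits in a char. For instance, the second of the following lines causes the compiler to emit an error message such as “Incompatible type for declaration. Explicit cast needed to convert int to char.” (the exact wording varies between compiler versions):

char a = 'a';
char c = a + 'b';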

Admittedly, you rarely need to perform mathematical operations on chars.
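When such arithmetic is needed, an explicit cast back to char resolves the error. The sketch below shows a few common cases; it assumes ASCII input for the digit and case conversions, and the class name CharArithmetic is hypothetical.

```java
// Hypothetical helpers showing char arithmetic with explicit casts.
class CharArithmetic {
    // c + 1 is an int, so the result must be cast back to char.
    static char next(char c) {
        return (char) (c + 1);
    }

    // Assumes c is one of the ASCII digits 0-9 (codes 48-57).
    static int digitValue(char c) {
        return c - '0';
    }

    // ASCII lowercase letters sit exactly 32 places above their capitals.
    static char toUpperAscii(char c) {
        if (c >= 'a' && c <= 'z') {
            return (char) (c - 32);
        }
        return c;
    }

    public static void main(String[] args) {
        System.out.println(next('a'));          // prints b
        System.out.println(digitValue('7'));    // prints 7
        System.out.println(toUpperAscii('q'));  // prints Q
    }
}
```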



[3] The vast majority of the characters above 2047 are the pictograms used for Chinese, Japanese, and Korean.
