Numbers are only part of the data a typical Java program needs to read and write. Most programs also need to handle text, which is composed of characters. Since computers only really understand numbers, characters are encoded by matching each character in a given script to a particular number. For example, in the common ASCII encoding, the character A is mapped to the number 65; the character B is mapped to the number 66; the character C is mapped to the number 67; and so on. Different encodings may encode different scripts or may encode the same or similar scripts in different ways.
Java understands several dozen different character sets for a variety of languages, ranging from ASCII to Shift-JIS (SJIS, a Japanese Industrial Standards encoding) to Unicode. Internally, Java uses the Unicode character set. Unicode is a two-byte extension of the one-byte ISO Latin-1 character set, which in turn is an eight-bit superset of the seven-bit ASCII character set.
ASCII, the American
Standard Code for Information Interchange, is a seven-bit character
set. Thus it defines 2^7, or 128, different
characters whose numeric values range from
0 to 127. These characters are sufficient for handling most of American
English and can make reasonable approximations to most European
languages (with the notable exceptions of Russian and Greek).
It’s an often used lowest common denominator format for
different computers. If you were to read a byte value between 0
and 127 from a stream, then cast it to a char, the
result would be the corresponding ASCII character.
ASCII characters 0-31 and character 127 are nonprinting control characters. Characters 32-47 are various punctuation and space characters. Characters 48-57 are the digits 0-9. Characters 58-64 are another group of punctuation characters. Characters 65-90 are the capital letters A-Z. Characters 91-96 are a few more punctuation marks. Characters 97-122 are the lowercase letters a-z. Finally, characters 123 through 126 are a few remaining punctuation symbols. The complete ASCII character set is shown in Table 2.1 in Appendix B.
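The ranges above can be sketched in a few lines of Java. This is only an illustration: the class name AsciiRanges and the helper classify are hypothetical and not part of any Java library.

```java
public class AsciiRanges {
    // Hypothetical helper: names the ASCII range a numeric value falls into.
    static String classify(int value) {
        if (value < 0 || value > 127) return "not ASCII";
        if (value <= 31 || value == 127) return "control";
        if (value >= 48 && value <= 57) return "digit";
        if (value >= 65 && value <= 90) return "uppercase letter";
        if (value >= 97 && value <= 122) return "lowercase letter";
        return "punctuation or space";
    }

    public static void main(String[] args) {
        int value = 65;         // a value as it might be read from a stream
        char c = (char) value;  // casting yields the corresponding ASCII character
        System.out.println(c + ": " + classify(value));
    }
}
```

Run as written, the program casts 65 to the character A and reports it as an uppercase letter.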
All Java programs can be expressed in pure ASCII. Non-ASCII Unicode
characters are encoded as Unicode escapes; that is, written as a
backslash (\), followed by a u, followed by
four hexadecimal digits; for example, \u00A9 for the copyright
sign ©. Unicode escapes are discussed further in Section 1.3.3,
later in this chapter.
ISO Latin-1 is an eight-bit character set that’s a strict superset of ASCII. It defines 2^8, or 256, different characters whose numeric values range from 0 to 255. The first 128 characters (that is, those numbers with the high-order bit equal to zero) correspond exactly to the ASCII character set. Thus 65 is ASCII A and ISO Latin-1 A; 66 is ASCII B and ISO Latin-1 B; and so on. Where ISO Latin-1 and ASCII diverge is in the characters between 128 and 255 (characters with high bit equal to one). ASCII does not define these characters. ISO Latin-1 uses them for various accented letters like ü needed for non-English languages written in a Roman script, additional punctuation marks and symbols like ©, and additional control characters. The upper, non-ASCII half of the ISO Latin-1 character set is shown in Table 2.2.
Latin-1 provides enough characters to write most Western European
languages (again with the notable exception of Greek). It’s a
popular lowest common denominator format for different computers. If
you were to read an unsigned
byte value from a
stream, then cast it to a
char, the result would
be the corresponding ISO Latin-1 character.
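The word "unsigned" matters here because Java's byte type is signed. An InputStream's read() method already returns a value from 0 to 255, but a byte taken from an array must be masked before the cast. The following is a minimal sketch; the class name Latin1Cast and the helper toChar are hypothetical names chosen for illustration.

```java
public class Latin1Cast {
    // Hypothetical helper: interprets one byte as an ISO Latin-1 character.
    // The mask discards the sign extension that Java applies to a negative byte.
    static char toChar(byte b) {
        return (char) (b & 0xFF);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xFC;                 // 252, the Latin-1 code for u with diaeresis
        System.out.println((int) toChar(b));  // masked: 252
        System.out.println((int) (char) b);   // unmasked: sign-extended to 65532
    }
}
```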
ISO Latin-1 suffices for most Western European languages, but it doesn’t have anywhere near the number of characters required to represent Cyrillic, Greek, Arabic, Hebrew, Persian, or Devanagari, not to mention pictographic languages like Chinese and Japanese. Chinese alone has over 80,000 different characters. To handle these scripts and many others, the Unicode character set was invented. Unicode is a 2-byte, 16-bit character set with 2^16, or 65,536, different possible characters. (Only about 40,000 are used in practice, the rest being reserved for future expansion.) Unicode can handle most of the world’s living languages and a number of dead ones as well.
The first 256 characters of Unicode—that is, the characters whose high-order byte is zero—are identical to the characters of the ISO Latin-1 character set. Thus 65 is ASCII A and Unicode A; 66 is ASCII B and Unicode B and so on.
Java streams do not do a good job of reading Unicode text. (This is
why readers and writers were added in Java 1.1.) Streams generally
read a byte at a time, but each Unicode character occupies two bytes.
Thus, to read a Unicode character, you multiply the first byte read
by 256, add it to the second byte read, and cast the result to a
char. For example:
int b1 = in.read();
int b2 = in.read();
char c = (char) (b1*256 + b2);
You must be careful to ensure that you don’t inadvertently read
the last byte of one character and the first byte of the next,
instead. Thus, for the most part, when reading text encoded in
Unicode or any other format, you should use a reader rather than an
input stream. Readers handle the conversion of bytes in one character
set to Java
chars without any extra effort. For
similar reasons, you should use a writer rather than an output stream
to write text.
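As a rough sketch of the reader-based approach, the example below decodes two-byte, big-endian Unicode data with an InputStreamReader. It assumes a Java version that provides java.nio.charset.StandardCharsets (Java 7 or later); the class name ReaderVersusStream and the helper readWithReader are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ReaderVersusStream {
    // Hypothetical helper: decodes big-endian two-byte Unicode data with a reader,
    // so the pairing of bytes into chars is handled by the library.
    static String readWithReader(byte[] data) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_16BE)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two characters, two bytes each: A (0x0041) and the copyright sign (0x00A9).
        byte[] data = {0x00, 0x41, 0x00, (byte) 0xA9};
        String s = readWithReader(data);
        System.out.println(s.length());  // number of chars decoded
    }
}
```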
Unicode is a relatively inefficient encoding when most of your text consists of ASCII characters. Every character requires the same number of bytes—two—even though some characters are used much more frequently than others. A more efficient encoding would use fewer bits for the more common characters. This is what UTF-8 does.
In UTF-8 the ASCII alphabet is encoded using a single byte, just as in ASCII. The next 1,920 characters (those numbered 128 through 2047) are encoded in two bytes. The remaining Unicode characters are encoded in three bytes. However, since these three-byte characters are relatively uncommon, especially in English text, the savings achieved by encoding ASCII in a single byte more than makes up for it.
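The one-, two-, and three-byte ranges can be checked with a short sketch. It assumes java.nio.charset.StandardCharsets is available (Java 7 or later); the class name Utf8Lengths and the helper utf8Length are hypothetical, and the helper is only meaningful for chars that are not surrogates.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    // Hypothetical helper: number of bytes UTF-8 uses for a single
    // (non-surrogate) char.
    static int utf8Length(char c) {
        return String.valueOf(c).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length('A'));       // an ASCII character
        System.out.println(utf8Length('\u00A9'));  // copyright sign, in the 128-2047 range
        System.out.println(utf8Length('\u4E2D'));  // a Chinese character, above 2047
    }
}
```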
.class files use UTF-8 internally
to store string literals. Data input streams and data output streams
also read and write strings in UTF-8. However, this is all hidden
from direct view of the programmer, unless perhaps you’re
trying to write a Java compiler or parse output of a data stream
without using the DataInputStream class.
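For the curious, the bytes a data output stream produces can be inspected with a sketch like the one below. The class name DataStreamUtf and its two helpers are hypothetical. Note that writeUTF() prefixes the string with a two-byte length and uses Java's slightly modified form of UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class DataStreamUtf {
    // Hypothetical helper: returns the bytes writeUTF() emits for a string.
    static byte[] writeUtf(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Hypothetical helper: reads the string back with readUTF().
    static String readUtf(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A followed by the copyright sign: 2 length bytes + 1 byte + 2 bytes.
        byte[] encoded = writeUtf("A\u00A9");
        System.out.println(encoded.length);
        System.out.println(readUtf(encoded).length());
    }
}
```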
ASCII, ISO Latin-1, and Unicode are hardly the only character sets in common use, though they are the ones handled most directly by Java. There are many other character sets, both that encode different scripts and that encode the same scripts in different ways. For example, IBM mainframes have long used a non-ASCII eight-bit character set called EBCDIC. EBCDIC has most of the same characters as ASCII but assigns them to different numbers. Macintoshes commonly use an eight-bit encoding called MacRoman that matches ASCII in the lower 128 places and has most of the same characters as ISO Latin-1 in the upper 128 characters but in different positions. Big-5 and SJIS are encodings of Chinese and Japanese, respectively, that are designed to allow these large scripts to be input from a standard English keyboard.
Java’s Reader, Writer, and String classes understand how to convert these
character sets to and from Unicode. This will be the subject of Chapter 14.
Character-oriented data in Java is primarily composed of the
char primitive data type,
char arrays, and
Strings, which are stored as arrays of
chars internally. Just as you need to understand
bytes to really grasp how input and output streams
work, so too do you need to understand chars to
understand how readers and writers work.
In Java, a
char is a two-byte, unsigned integer,
the only unsigned type in Java. Thus, possible
char values range from
0 to 65,535. Each
char represents a particular
character in the Unicode character set.
chars may be assigned to by using
int literals in this
range; for example:
char copyright = 169;
chars may also be assigned to by using
char literals; that is, the character itself
enclosed in single quotes:
char copyright = '©';
Sun’s javac compiler can translate many
different encodings to Unicode by using the
-encoding command-line flag to specify the
encoding in which the file is written. For example, if you know a
file is written in ISO Latin-1, you might compile it as follows:
% javac -encoding 8859_1 CharTest.java
The complete list of available encodings is given in Table 2.4.
With the exception of Unicode itself, most character sets understood by Java do not have equivalents for all the Unicode characters. To encode characters that do not exist in the character set you’re programming with, you can use Unicode escapes. A Unicode escape sequence is an unescaped backslash, followed by one or more u characters, followed by four hexadecimal digits specifying the character to be used. For example:
char copyright = '\u00A9';
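The different ways of writing the same char value can be compared in a short sketch. The class name CopyrightChar and the helper codeOf are hypothetical names used only for this illustration.

```java
public class CopyrightChar {
    // Hypothetical helper: returns the numeric value of a char.
    static int codeOf(char c) {
        return c;
    }

    public static void main(String[] args) {
        char fromInt = 169;          // decimal int literal
        char fromHex = 0xA9;         // hexadecimal int literal
        char fromEscape = '\u00A9';  // Unicode escape inside a char literal
        char fromMultiU = '\uu00A9'; // more than one u is also permitted
        System.out.println(fromInt == fromHex);
        System.out.println(fromInt == fromEscape);
        System.out.println(fromEscape == fromMultiU);
        System.out.println(codeOf(fromEscape));
    }
}
```

All three comparisons should print true, since each declaration names the same character.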
The double backslash,
\\, is an escaped backslash, which is replaced by
a single backslash that only means the backslash character. It is not
further interpreted. Thus a Java compiler interprets the string
\u00A9 as ©, but the string \\u00A9 as
the literal string \u00A9, and the string \\\u00A9
as \©. Whenever an odd number of backslashes precede the four
hex digits, they will be interpreted as a single Unicode character.
Whenever an even number of backslashes precede the four hex digits,
they will be interpreted as four separate characters.
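The odd-versus-even rule can be observed by checking string lengths. This is a small sketch; the class name BackslashCount and its constants are hypothetical.

```java
public class BackslashCount {
    // One backslash: a Unicode escape, so the string holds a single character.
    static final String ONE = "\u00A9";
    // Two backslashes: an escaped backslash, then the letter u and four
    // ordinary digits, six characters in all.
    static final String TWO = "\\u00A9";
    // Three backslashes: an escaped backslash followed by a Unicode escape,
    // two characters in all.
    static final String THREE = "\\\u00A9";

    public static void main(String[] args) {
        System.out.println(ONE.length());
        System.out.println(TWO.length());
        System.out.println(THREE.length());
    }
}
```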
Unicode escapes may be used not just in char
literals, but also in strings, identifiers, comments, and even in
keywords, separators, operators, and numeric literals. The compiler
translates Unicode escapes to actual Unicode characters before it
does anything else with a source code file. However, the actual use
of Unicode escapes inside keywords, separators, operators, and
numeric literals is unnecessary and can only lead to obfuscation.
With the possible exception of identifiers, comments, and string and
char literals, Java programs can be expressed in
pure ASCII without using Unicode escapes.
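A char used in arithmetic is promoted to
int. This presents the same problem as it does for
bytes. For instance, if a is a char variable, the following line causes
the compiler to emit an error message like “Incompatible type for
declaration. Explicit cast needed to convert int to char.”
char c = a + 'b';
(A sum of two char literals such as 'a' + 'b' is a compile-time constant that fits in a char, so the compiler accepts it without a cast.)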
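When the arithmetic involves a char variable, the usual remedy is an explicit cast back to char. The sketch below assumes nothing beyond the core language; the class name CharArithmetic and the helper next are hypothetical.

```java
public class CharArithmetic {
    // Hypothetical helper: returns the character that follows c.
    // The expression c + 1 has type int, so it must be cast back to char.
    static char next(char c) {
        return (char) (c + 1);
    }

    public static void main(String[] args) {
        char a = 'a';
        int sum = a + 'b';  // promotion to int: 97 + 98
        char b = next(a);   // explicit cast inside next()
        System.out.println(sum);
        System.out.println(b);
    }
}
```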
 The vast majority of the characters above 2047 are the pictograms used for Chinese, Japanese, and Korean.