Lexical Structure
This section explains the lexical structure of a Java program. It starts with a discussion of the Unicode character set in which Java programs are written. It then covers the tokens that make up a Java program, explaining comments, identifiers, reserved words, literals, and so on.
The Unicode Character Set
Java programs are written using Unicode. You can use Unicode characters anywhere in a Java program, including comments and identifiers such as variable names. Unlike the 7-bit ASCII character set, which is useful only for English, and the 8-bit ISO Latin-1 character set, which is useful only for major Western European languages, the Unicode character set can represent virtually every written language in common use on the planet. Java's 16-bit Unicode characters are typically written to files using an encoding known as UTF-8, which converts them into a stream of bytes. The format is designed so that plain ASCII text (that is, the 7-bit characters that Latin-1 shares with ASCII) is already a valid UTF-8 byte stream. Thus, you can simply write plain ASCII programs, and they will work as valid Unicode.
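The relationship between ASCII and UTF-8 described above can be checked with a short program. This is a minimal sketch, not taken from the book: the class name Utf8Demo and the helper utf8Length are hypothetical, and it assumes Java 7 or later, where java.nio.charset.StandardCharsets is available.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Demo {
    // Hypothetical helper: encode a string as UTF-8 and count the bytes.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Hypothetical helper: true if the UTF-8 and US-ASCII encodings of s
    // are byte-for-byte identical.
    static boolean sameAsAscii(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        return Arrays.equals(utf8, ascii);
    }

    public static void main(String[] args) {
        // Plain ASCII text encodes to the same bytes in both encodings.
        System.out.println(sameAsAscii("Hello"));   // true
        System.out.println(utf8Length("Hello"));    // 5

        // Characters outside ASCII need more than one byte in UTF-8.
        System.out.println(utf8Length("\u00e9"));   // 2 (e with acute accent)
        System.out.println(utf8Length("\u20ac"));   // 3 (euro sign)
    }
}
```

The non-ASCII characters are written with escapes so that the source file itself stays plain ASCII.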
If you do not use a Unicode-enabled text editor, or if you do not want to force other programmers who view or edit your code to use a Unicode-enabled editor, you can embed Unicode characters into your Java programs using the special Unicode escape sequence \uxxxx, in other words, a backslash and a lowercase u, followed by four hexadecimal characters. For example, \u0020 is the ...
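As an illustration of Unicode escapes in practice, the following sketch uses them in an identifier and in a string literal while keeping the source file pure ASCII. The class name UnicodeEscapeDemo and its members are hypothetical names chosen for this example, not code from the book.

```java
public class UnicodeEscapeDemo {
    // An escape may appear in an identifier: this field is named with the
    // Greek small letter pi (U+03C0), written using only ASCII characters.
    static final double \u03c0 = Math.PI;

    // An escape inside a string literal: U+00E9 is e with an acute accent.
    static String cafe() {
        return "caf\u00e9";
    }

    // The escaped identifier is used like any other variable name.
    static double circleArea(double r) {
        return \u03c0 * r * r;
    }

    public static void main(String[] args) {
        String s = cafe();
        System.out.println(s.length());                       // 4
        System.out.println(Integer.toHexString(s.charAt(3))); // e9
        System.out.println(circleArea(1.0) == Math.PI);       // true
    }
}
```

Because the compiler translates these escapes before it does anything else with the source text, an escape counts as the character it names everywhere, including inside comments.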