Lexical Structure
This section explains the lexical structure of a Java program. It starts with a discussion of the Unicode character set in which Java programs are written . It then covers the tokens that comprise a Java program, explaining comments, identifiers, reserved words, literals, and so on.
The Unicode Character Set
Java programs are written using Unicode. You can use Unicode characters anywhere in a Java program, including comments and identifiers such as variable names. Unlike the 7-bit ASCII character set, which is useful only for English, and the 8-bit ISO Latin-1 character set, which is useful only for major Western European languages, the Unicode character set can represent virtually every written language in common use on the planet. 16-bit Unicode characters are typically written to files using an encoding known as UTF-8, which converts the 16-bit characters into a stream of bytes. The format is designed so that plain ASCII text (and the 7-bit characters of Latin-1) are valid UTF-8 byte streams. Thus, you can simply write plain ASCII programs, and they will work as valid Unicode.
If you do not use a Unicode-enabled
text editor, or if you do not want to force other programmers who
view or edit your code to use a Unicode-enabled editor, you can embed
Unicode characters into your Java programs using the special Unicode
escape sequence \u
xxxx,
in other words, a backslash and a lowercase u, followed by four
hexadecimal characters. For example, \u0020 is the ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Read now
Unlock full access