Unicode

Java uses the Unicode character encoding. (Java 1.3 uses Unicode Version 2.1. Support for Unicode 3.0 will be included in Java 1.4 or another future release.) Unicode is a 16-bit character encoding established by the Unicode Consortium, which describes the standard as follows (see http://unicode.org):
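As a minimal sketch of what this means in practice (the class name UnicodeDemo is ours, chosen for illustration), a Java char literal can be written with a \u escape, and widening it to an int reveals the 16-bit code value it holds:

    public class UnicodeDemo {
        public static void main(String[] args) {
            char pi = '\u03C0';            // Greek small letter pi, written as a Unicode escape
            System.out.println(pi);        // prints the character itself, if the console font has it
            System.out.println((int) pi);  // prints 960, the character's 16-bit Unicode code value
        }
    }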

The Unicode Standard defines codes for characters used in the major languages written today. Scripts include the European alphabetic scripts, Middle Eastern right-to-left scripts, and scripts of Asia. The Unicode Standard also includes punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, etc. ... In all, the Unicode Standard provides codes for 49,194 characters from the world’s alphabets, ideograph sets, and symbol collections.

In the canonical form of Unicode encoding, which is what Java char and String types use, every character occupies two bytes. The Unicode characters \u0020 to \u007E are equivalent to the ASCII and ISO8859-1 (Latin-1) characters 0x20 through 0x7E. The Unicode characters \u00A0 to \u00FF are identical to the ISO8859-1 characters 0xA0 to 0xFF. Thus, there is a trivial mapping between Latin-1 and Unicode characters. A number of other portions of the Unicode encoding are based on preexisting standards, such as ISO8859-5 (Cyrillic) and ISO8859-8 (Hebrew), though the mappings between these standards and Unicode may not be as trivial as the Latin-1 mapping.
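The trivial Latin-1 mapping described above can be verified directly in Java. The following sketch (the class name Latin1Mapping is ours) compares \u escapes against their ASCII and ISO8859-1 code values:

    public class Latin1Mapping {
        public static void main(String[] args) {
            // ASCII range: Unicode \u0041 is the same character as ASCII 0x41 ('A')
            System.out.println('\u0041' == 'A');       // true
            System.out.println((char) 0x41);           // A

            // Latin-1 range: Unicode \u00E9 is ISO8859-1 0xE9 (e with acute accent)
            char eAcute = '\u00E9';
            System.out.println((int) eAcute == 0xE9);  // true

            // Because the mapping is trivial, a Latin-1 byte value widens directly to
            // the corresponding Unicode char (mask first to treat the byte as unsigned)
            byte latin1 = (byte) 0xE9;
            char c = (char) (latin1 & 0xFF);
            System.out.println(c == eAcute);           // true
        }
    }

The masking step matters because Java bytes are signed; without it, a Latin-1 value above 0x7F would sign-extend to the wrong char.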

Note that Unicode support may be limited on many platforms. One of ...
