5.4.2. Decomposition

Occasionally, a character or sequence of characters can be described in more than one way in Unicode. For example, an “Å” can be the single Unicode character U+00C5, or it can be expressed as a plain A (U+0041) followed by a “combining ring above” (U+030A). Perhaps more surprisingly, the letter sequence “ffi” can be described with a single character, “Latin small ligature ffi,” with code U+FB03. (One could argue that this is a presentation issue that should not have resulted in different Unicode characters, but we don’t make the rules.)
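To see the two descriptions of “Å” side by side, here is a minimal sketch that uses the java.text.Normalizer class (available since Java SE 6). The class name DecompositionDemo and the helper methods decompose and codePoints are our own, chosen only for illustration.

```java
import java.text.Normalizer;

class DecompositionDemo {
    // Returns the canonical decomposition (normalization form D) of s.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Lists the code points of s in U+XXXX notation, separated by spaces.
    static String codePoints(String s) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (result.length() > 0) result.append(' ');
            result.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String composed = "\u00C5";  // the single character Å
        String sequence = "A\u030A"; // A followed by combining ring above

        // The two strings look alike on screen but differ char by char.
        System.out.println(composed.equals(sequence)); // false

        // Decomposing the single character yields the two-character sequence.
        System.out.println(codePoints(decompose(composed))); // U+0041 U+030A
        System.out.println(decompose(composed).equals(sequence)); // true
    }
}
```

Note that String.equals compares char values only, so the two forms are unequal until one of them is normalized.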

The Unicode standard defines four normalization forms (D, KD, C, and KC) for strings. See www.unicode.org/unicode/reports/tr15/tr15-23.html for the details. Two of them are used for collation. In the ...
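The following sketch applies each of the four forms to the two examples above, again through java.text.Normalizer. It only illustrates the normalization forms themselves and does not show how a collator uses them; the class name NormalizationForms and the method normalize are hypothetical names of our own.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

class NormalizationForms {
    // Thin wrapper so that the calls below read more compactly.
    static String normalize(String s, Form form) {
        return Normalizer.normalize(s, form);
    }

    public static void main(String[] args) {
        String aRing = "\u00C5";    // Å as a single character
        String ligature = "\uFB03"; // Latin small ligature ffi

        for (Form form : Form.values()) {
            // Print the form name and the length of each normalized string.
            System.out.println(form
                + ": Å has length " + normalize(aRing, form).length()
                + ", ffi ligature has length " + normalize(ligature, form).length());
        }

        // D and KD decompose Å into A + U+030A; C and KC keep it composed.
        System.out.println(normalize(aRing, Form.NFD).equals("A\u030A"));  // true
        System.out.println(normalize("A\u030A", Form.NFC).equals(aRing));  // true

        // Only the compatibility forms KD and KC expand the ligature.
        System.out.println(normalize(ligature, Form.NFD).equals(ligature)); // true
        System.out.println(normalize(ligature, Form.NFKD).equals("ffi"));   // true
    }
}
```

The canonical forms (D and C) treat the ligature as a distinct character, whereas the compatibility forms (KD and KC) replace it with the three letters f, f, and i.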
