Language-Sensitive Comparison on Unicode Text
The considerations mentioned previously apply regardless of which encoding standard you use for your characters. Unicode adds a few more interesting complications of its own.
Unicode Normalization
Unlike most other encoding schemes, Unicode gives many characters and sequences of characters multiple legal representations. One of the requirements of supporting Unicode is that (provided you support all of the characters involved) all equivalent representations of a character be treated as equal. Thus, whether you represent “ä” with
U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
or
U+0061 LATIN SMALL LETTER A U+0308 COMBINING DIAERESIS
it should look and behave the same way everywhere. ...
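As a rough illustration of the point above, the following Python sketch compares the two representations of “ä”. It uses the standard library's `unicodedata.normalize`; the helper name `canonically_equal` is hypothetical, and normalizing both strings to NFC before comparing is only one possible approach, not necessarily the one this book goes on to describe.

```python
import unicodedata

# The two representations of "ä" from the text.
precomposed = "\u00E4"   # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # U+0061 LATIN SMALL LETTER A + U+0308 COMBINING DIAERESIS


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A plain code-point comparison sees two different strings.
print(precomposed == decomposed)

# After normalization, the two representations compare as equal.
print(canonically_equal(precomposed, decomposed))

# NFD goes the other way: it decomposes the precomposed form.
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

A raw comparison works on code points, so it cannot see that the two strings represent the same character. Normalizing both sides to the same form first is what lets them compare as equal.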