Normalization Forms

For reasons of compatibility with legacy character sets, as well as out-and-out mistakes, a number of characters have more than one representation in Unicode. For example, the letter u with an umlaut can be represented either as the single character ü (U+00FC) or as a u followed by a combining diaeresis (U+0308). XML 1.0[1] treats these two forms as distinct. For example, München spelled with the single character ü is not the same as München spelled with a u followed by a combining diaeresis, even though the two look identical on screen. You can see that this might be a bit of a problem.

[1] This is one of the few changes that may be made in XML 1.1. However, exactly how or when characters will be normalized has not yet been finalized.
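
The difference between the two spellings is easy to see in code. The following is a minimal sketch in Python, which is not a language the text itself uses; it relies on the standard library's unicodedata module, and the variable names are illustrative assumptions. It builds both spellings of München from explicit code points, shows that a plain comparison treats them as different strings, and then converts each to the other with Unicode normalization forms NFC (composed) and NFD (decomposed).

```python
import unicodedata

# Precomposed spelling: the single code point U+00FC (ü).
precomposed = "M\u00fcnchen"
# Decomposed spelling: u (U+0075) followed by a combining diaeresis (U+0308).
decomposed = "Mu\u0308nchen"

# A code-point-by-code-point comparison sees two different strings.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 7 8

# NFC composes the pair into one code point; NFD splits it back apart.
composed_again = unicodedata.normalize("NFC", decomposed)
decomposed_again = unicodedata.normalize("NFD", precomposed)
print(composed_again == precomposed)      # True
print(decomposed_again == decomposed)     # True
```

Normalizing both strings to the same form before comparing them is one way to make the two spellings compare as equal.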

While such differences are not significant to XML parsing, they ...
