Unicode and ISO/IEC 10646

Far more than 256 symbols are in use around the world, so no single-byte character set, including any part of ISO 8859, can represent them all. One obvious solution is to use more than one byte to encode each character, and two standards have emerged that use this technique: Unicode and ISO/IEC 10646 (see www.unicode.org).
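The one-byte limit can be illustrated with a minimal Python sketch. The helper names fits_in_one_byte and two_byte_form are hypothetical, chosen only for this illustration, and the two-byte form shown is simply the character's code point written as a big-endian 16-bit value, which is an assumption about byte order rather than something either standard mandates here.

```python
def fits_in_one_byte(ch: str, codec: str = "iso8859_1") -> bool:
    """Return True if the 8-bit codec can encode ch as a single byte."""
    try:
        return len(ch.encode(codec)) == 1
    except UnicodeEncodeError:
        # The character has no position in this 256-entry table.
        return False


def two_byte_form(ch: str) -> bytes:
    """Write the code point of ch as two bytes, most significant first."""
    code_point = ord(ch)
    if code_point > 0xFFFF:
        raise ValueError("code point does not fit in two bytes")
    return code_point.to_bytes(2, "big")


# Two bytes give 2**16 distinct values.
TWO_BYTE_CAPACITY = 2 ** 16

if __name__ == "__main__":
    for ch in ("A", "\u00e9", "\u4e2d"):  # Latin A, e-acute, a CJK ideograph
        print(ch, fits_in_one_byte(ch), two_byte_form(ch).hex())
```

The first two characters fit in ISO 8859-1, while the ideograph does not fit in any single-byte table, yet all three have a two-byte form.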

Unicode

The Unicode standard, now at version 3.0 (September 1999), was the first of these initiatives. It uses two bytes for each character, immediately raising the capacity to 65,536 characters (though it actually contains just under 50,000 at the time of writing). Online charts of the characters covered can be found at www.unicode.org/charts. The number of characters of each type is listed below:

  • Alphabetics and ...
