UTF-32

The simplest (and newest) of the Unicode encoding forms is UTF-32, which was first defined in Unicode Standard Annex #19 (now officially part of Unicode 3.1). To go from the 21-bit abstract code point value to UTF-32, you simply zero-pad the value out to 32 bits.
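The zero-padding step can be sketched in a few lines of Python. This is only an illustration: the function name `encode_utf32_be` and the choice of big-endian byte order are assumptions made for the example, not something the text above specifies.

```python
import struct

MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point (fits in 21 bits)


def encode_utf32_be(code_point: int) -> bytes:
    """Return one UTF-32 code unit as 4 bytes (big-endian assumed here)."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogate code points are not allowed in well-formed UTF-32.
        raise ValueError(f"surrogate code point: {code_point:#x}")
    # ">I" packs an unsigned 32-bit integer, so the upper bits are zero-filled.
    return struct.pack(">I", code_point)


# U+0041 LATIN CAPITAL LETTER A and U+1D11E MUSICAL SYMBOL G CLEF
print(encode_utf32_be(0x41).hex())     # 00000041
print(encode_utf32_be(0x1D11E).hex())  # 0001d11e
```

Both results agree with Python's built-in `utf-32-be` codec, which is a convenient cross-check for the sketch.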

UTF-32 exists for three basic reasons:

  1. It's the Unicode standard's counterpart to UCS-4, the four-byte format from ISO 10646.

  2. It provides a way to represent every Unicode code point value with a single code unit, which can make for simpler implementations.

  3. It can be useful as an in-memory format on systems with a 32-bit word length. Some systems either don't give you a way to access individual bytes of a 32-bit word or impose a performance penalty for doing so. If memory is cheap, ...
