Appendix C. A Unicode Primer

This isn’t a complete or comprehensive introduction to Unicode; it’s just enough for you to understand the parts of Unicode that we present in Learning Perl. Unicode is tricky not only because it’s a new way to think about strings, with lots of adjusted vocabulary, but also because computer languages in general have implemented it so poorly. Perl 5.14 makes lots of improvements to Perl’s Unicode compliance, but it’s not perfect (yet). It is, arguably, the best Unicode support that you will find, though.

Unicode

The Universal Character Set (UCS) is an abstract mapping of characters to code points. It has nothing to do with a particular representation in memory, which means we can agree on at least one way to talk about characters no matter which platform we’re on. An encoding turns the code points into a particular representation in memory, taking the abstract mapping and representing it physically within a computer. You probably think of this storage in terms of bytes, although when talking about Unicode, we use the term octets (see Figure C-1). Different encodings store the characters differently. To go the other way, interpreting the octets as characters, you decode them. You don’t have to worry too much about these because Perl can handle most of the details for you.

The code point of a character is not its storage. The encoding transforms characters into storage.
Figure C-1. The code point of a character is not its storage. The encoding transforms ...

Get Learning Perl, 6th Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.