Chapter 6. Unicode
If you’ve never heard of Unicode, you must have been living on a desert island with nothing but a manual typewriter for the last 20 years. Unicode celebrated its 20th birthday back in early 2010. Even if you have heard of it, you may not really know what it is, or how to work with it. This is not something to be embarrassed about; the fact of the matter is that everyone is still learning about Unicode, including its inventors. Although we can’t hope to cover all the nuanced intricacies of Unicode in this chapter or even this book, we can certainly get you started using Unicode in Perl.
Working with Unicode these days isn’t an option: it’s a necessity. The majority of the Web is in Unicode,[107] and many large corpora are 100% Unicode. Because web browsers do their best to make do with whatever character set web servers give them, you probably haven’t noticed how much Unicode is really out there now. Programming languages without solid Unicode support are decades behind the curve, as are programs written in those languages. They might have worked okay in the 1980s, even the 1990s, but today we need the real thing.
So how did we get here?
Computers store characters as numbers. In the early days these were small integers, 5, 6, 7, or 8 bits long. EBCDIC used 8 bits and was based on punch cards. ASCII used up only 7 bits, leaving precisely 1 bit in each byte for other purposes—many, many other purposes, all contradictory, as it turned out.
So, in those days, pretty much ...