Atoms
Although there are various invisible things going on behind the scenes that we’ll explain presently, the smallest things you generally work with in Perl are individual characters. And we do mean characters; historically, Perl freely confused bytes with characters and characters with bytes, but in this new era of global networking, we must be careful to distinguish the two.
Perl may, of course, be written entirely in the 7-bit ASCII character set. For historical reasons, bytes in the range 128–255 are understood by Perl as being from the ISO-8859-1 (Latin-1) character set, whose codepoints are the same as the first 256 Unicode codepoints. To tell Perl that bytes in the current source file are to be treated as Unicode encoded as UTF-8, put this declaration at the top of your file:
use utf8;
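As a minimal sketch of what the declaration changes, the script below assumes the file itself is saved as UTF-8. The variable names and the sample word are illustrative only. With `use utf8` in effect, the accented letter in the literal is read as one character rather than as the two bytes that encode it on disk.

```perl
use strict;
use warnings;
use utf8;                    # string literals in this file are decoded as UTF-8

my $word  = "café";          # the é occupies two bytes in the source file
my $count = length $word;    # length counts characters, not bytes

# Encode characters back to UTF-8 on the way out to avoid "Wide character" warnings.
binmode STDOUT, ':encoding(UTF-8)';
print "$word has $count characters\n";
```

Without the `use utf8` line, the same literal would be read byte by byte and `length` would report 5 instead of 4.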
As described in Chapter 6, Perl has had Unicode support since the last millennium. This support is pervasive throughout the language: you can use Unicode characters in identifiers (variable names and such) as well as within literal strings. When you are using Unicode, you don’t need to worry about how many bits or bytes it takes to represent a character. Perl just pretends all characters are the same size (that is, size 1), even though any given character might be represented by multiple bytes internally. Perl normally represents characters internally as UTF-8, a variable-length encoding. (For instance, a Unicode smiley character ☺, U+263A, would be represented internally as a three-byte sequence, but you aren’t supposed ...
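To make the distinction between characters and bytes concrete, here is a small sketch using the smiley mentioned above. It uses the core Encode module to produce an explicit UTF-8 byte string. This shows the encoded form you would write to a file or socket, not a way of inspecting Perl's internal storage, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $smiley = "\x{263A}";                  # U+263A written as an escape, so no `use utf8` is needed
my $chars  = length $smiley;              # character count as Perl sees it

my $bytes  = encode('UTF-8', $smiley);    # explicit UTF-8 byte string
my $nbytes = length $bytes;               # byte count of the encoded form

# Show each encoded byte in hexadecimal.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;
print "$chars character, $nbytes bytes: $hex\n";
```

The character string has length 1, while its UTF-8 encoding is the three bytes E2 98 BA.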