Show, Don’t Tell
If a picture is worth a thousand words, putting the actual characters you want into your program has to be worth at least fifty or so. So you’ll want to start off by telling Perl that your source code really is in Unicode characters, not just in bytes.[109] You don’t have to do this, but it makes some things easier if you can enter real Unicode into your source.
So far, Perl assumes every source unit is in ASCII unless you tell it otherwise (though, arguably, the default should change to Unicode someday). You can always specify Unicode codepoints through the circumlocutions we mentioned above, but literals will be treated as separate bytes. If Perl sees a literal UTF-8 character, it won’t realize it should treat it as one logical character, and it will show up as one, two, three, or even four separate Perl characters, all with ordinals under 256. You don’t want that to happen, so use these declarations:
use v5.14;   # includes the unicode_strings feature
use utf8;    # handles UTF-8 literals
The first makes sure that strings containing codepoints in the tricky range of 128–255 are treated as Unicode strings, while the second tells the Perl compiler that this entire source file is in the UTF-8 encoding of Unicode. Under the utf8 pragma, you can now use Unicode in your string and regex literals.
my $dwarf   = "Þórinn Eikinskjaldi";
my $search  = "búsqueda";
my $measure = "Ångström";
my $how     = "à contre–cœur";
my $motto   = "";
That’s a lot easier to read, although maybe not as easy to ...
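To see what the two declarations actually change, here is a minimal sketch rather than a definitive test. It assumes the file is saved as UTF-8, and the variable names are made up for illustration. It switches each pragma off inside a small block so the same code can be compared with and without it.

```perl
use v5.14;   # turns on strict, say, and the unicode_strings feature
use utf8;    # assumption: this file is saved in the UTF-8 encoding

# With utf8 in effect, the literal is decoded into logical characters.
my $as_chars = "búsqueda";

# With utf8 switched off for one block, the same literal stays raw bytes,
# so each non-ASCII character shows up as several Perl characters.
my $as_bytes = do { no utf8; "búsqueda" };

say "length with utf8:    ", length($as_chars);
say "length without utf8: ", length($as_bytes);

# Decoding the raw bytes by hand recovers the character string.
my $decoded = $as_bytes;
utf8::decode($decoded);
say "decoded copy matches: ", ($decoded eq $as_chars ? "yes" : "no");

# unicode_strings matters for the 128-255 range: "\xE0" is a lowercase
# a with grave, and its uppercase form is "\xC0".
my $a_grave = "\xE0";
utf8::downgrade($a_grave);   # make sure it is stored as a plain byte string

my $upper_unicode = uc $a_grave;
my $upper_legacy  = do { no feature 'unicode_strings'; uc $a_grave };

say "uc with unicode_strings changes it:    ",
    ($upper_unicode eq "\xC0" ? "yes" : "no");
say "uc without unicode_strings changes it: ",
    ($upper_legacy eq "\xC0" ? "yes" : "no");
```

Both pragmas are lexically scoped, which is why no utf8 and no feature 'unicode_strings' can be confined to a single do block here. In real code you would normally leave both switched on for the whole file.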