Getting at Unicode Data
Internally, Perl keeps all codepoints in a format that’s compatible with Unicode, meaning that the bottom 21 bits are the same as Unicode’s, just as Unicode’s bottom 8 bits are the same as Latin-1’s. How these codepoints are actually stored internally is not something average Perl users should ever have to worry about.
However, as soon as you have to interact with the outside world, you are going to have to interpret the input data being fed to you and, in turn, generate output data that’s in a format the receiving program finds palatable. Characters inside Perl have been decoded from their external format into abstract characters, but when you need to emit those characters, you’ll have to encode them into whatever format is expected of you. If you forget to do this, you’re liable to generate mutterings about “wide character” or “Malformed UTF-8 character”.
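The decode-on-input, encode-on-output cycle described above can be sketched with the core Encode module. This is only a minimal illustration under assumed conditions: the sample string and the choice of UTF-8 for input and ISO-8859-1 for output are assumptions made for the example, not something the text prescribes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from outside: "café" encoded as UTF-8 (5 bytes).
# (Sample data assumed for illustration.)
my $bytes_in = "caf\xC3\xA9";

# Decode the external bytes into Perl's abstract characters (4 characters).
my $chars = decode("UTF-8", $bytes_in);

# Encode again for a consumer that expects Latin-1 (é becomes the byte 0xE9).
my $bytes_out = encode("ISO-8859-1", $chars);

printf "%d bytes in, %d characters, %d bytes out\n",
    length($bytes_in), length($chars), length($bytes_out);
```

Calling decode and encode by hand like this suits data that does not arrive through a filehandle. For whole streams, it is usually more convenient to let an encoding layer do the same work.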
Perl has two main ways to mark the encoding of an entire stream, plus various shortcuts to make this even easier. If your stream is already opened, you can set its encoding by passing a second argument to the binmode function:
binmode(STDIN, ":encoding(CP1252)")
    || die "can't binmode to cp1252: $!";
binmode(STDOUT, ":encoding(UTF-8)")
    || die "can't binmode to UTF-8: $!";
If you haven’t opened the file yet, then you can use the mode argument in a call to open to specify the encoding right there.
open(OUTPUT, "> :raw :encoding(UTF-16LE) :crlf", $filename)
    || die "can't open $filename: $!";
print OUTPUT for @stuff;
...
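To see what a layer stack like this one puts on disk, a short round trip can help. The sketch below is an illustration only: the temporary directory, the file name utf16_demo.txt, and the sample string are all hypothetical. It writes one line through the same layers and then reads the file back as raw bytes.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical scratch location, used only for this illustration.
my $dir      = tempdir(CLEANUP => 1);
my $filename = "$dir/utf16_demo.txt";

# Write one line through the same layer stack as the open call above.
open(my $out, "> :raw :encoding(UTF-16LE) :crlf", $filename)
    || die "can't open $filename: $!";
print $out "caf\x{E9}\n";
close($out) || die "can't close $filename: $!";

# Read the file back with no decoding, to inspect the bytes as written.
open(my $in, "< :raw", $filename)
    || die "can't open $filename: $!";
my $bytes = do { local $/; <$in> };
close($in);

# Show every byte in hex; the newline should appear as a CR LF pair,
# with each character taking two bytes in UTF-16LE.
print join(" ", map { sprintf "%02X", ord } split //, $bytes), "\n";
```

The order of the layers matters here: :crlf sits on top, so the newline is expanded into two characters before :encoding turns the characters into bytes.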