Effects of Character Semantics

The upshot of all this is that a typical built-in operator will operate on characters unless it is in the scope of a use bytes pragma. However, even outside the scope of use bytes, if all of the operands of the operator are stored as 8-bit characters (that is, none of the operands are stored in utf8), then character semantics are indistinguishable from byte semantics, and the result of the operator will be stored in 8-bit form internally. This preserves backward compatibility as long as you don't feed your program any characters wider than Latin-1.
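The storage rule above can be sketched as follows. This assumes a Perl newer than the 5.6 release discussed here: utf8::is_utf8() appeared in 5.8.1 and is not part of the original text, and the variable names are purely illustrative. The script reports whether each result is held in 8-bit or utf8 form internally:

```perl
use strict;
use warnings;

# Assumption: Perl 5.8.1 or later, where utf8::is_utf8() is available
# without loading the utf8 pragma. Variable names are illustrative.
my $latin1 = "caf\xe9";      # four characters, all within Latin-1
my $wide   = "\x{263A}";     # one character wider than Latin-1

my $narrow_result = $latin1 . "!";     # every operand is stored as 8-bit
my $wide_result   = $latin1 . $wide;   # one operand is stored as utf8

for my $pair ([narrow => $narrow_result], [wide => $wide_result]) {
    my ($name, $str) = @$pair;
    printf "%-6s length=%d utf8=%s\n",
        $name, length($str), utf8::is_utf8($str) ? "yes" : "no";
}
```

Both results are five characters long. Only the second should report utf8 storage, because only that concatenation involved an operand wider than Latin-1.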

The utf8 pragma is primarily a compatibility device that enables recognition of UTF-8 in literals and identifiers encountered by the parser. It may also be used for enabling some of the more experimental Unicode support features. Our long-term goal is to turn the utf8 pragma into a no-op.
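To illustrate the effect on literals, here is a minimal sketch that assumes the source file itself is saved as UTF-8 (the variable names are illustrative). The same literal should be parsed as one character inside the scope of use utf8 and as two 8-bit characters outside it:

```perl
use strict;
use warnings;

# Assumption: this file is saved in UTF-8, so the literal below is the
# two bytes 0xC3 0xA9 on disk.
my ($decoded, $raw);
{
    use utf8;            # parser treats the literal as UTF-8 text
    $decoded = "é";
}
{
    no utf8;             # parser treats the literal as raw 8-bit bytes
    $raw = "é";
}

printf "with utf8: %d, without: %d\n", length($decoded), length($raw);
```

Because the pragma is lexically scoped, each block affects only the literal that the parser encounters inside it.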

The use bytes pragma will never turn into a no-op. Not only is it necessary for byte-oriented code, but it also has the side effect of defining byte-oriented wrappers around certain functions for use outside the scope of use bytes. As of this writing, the only defined wrapper is for length, but there are likely to be more as time goes by. To use such a wrapper, say:

use bytes ();   # Load wrappers without importing byte semantics.
…
$charlen =        length("\x{ffff_ffff}");   # Returns 1.
$bytelen = bytes::length("\x{ffff_ffff}");   # Returns 7.

Outside the scope of a use bytes declaration, Perl version 5.6 works (or at least, is ...
