By Tom Christiansen, Nathan Torkington
substr for
that. Like all data types in Perl, strings grow on demand. Space is
reclaimed by Perl's garbage collection system when no longer used,
typically when the variables have gone out of scope or when the
expression in which they were used has been evaluated. In other
words, memory management is already taken care of, so you don't have
to worry about it.
The only undefined value is undef. All other values are defined, even numeric
0 and the empty string. Definedness is not the same as Boolean truth,
though; to check whether a value is defined, use the defined function.
The substr function lets you read from and write to
specific portions of the string.
$value = substr($string, $offset, $count);
$value = substr($string, $offset);
substr($string, $offset, $count) = $newstring;
substr($string, $offset, $count, $newstring); # same as previous
substr($string, $offset) = $newtail;
The unpack function gives only read access, but is
faster when you have many substrings to extract.
# get a 5-byte string, skip 3 bytes,
# then grab two 8-byte strings, then the rest;
# (NB: only works on ASCII data, not Unicode)
($leading, $s1, $s2, $trailing) =
unpack("A5 x3 A8 A8 A*", $data);
# split at 5-byte boundaries
@fivers = unpack("A5" x (length($string)/5), $string);
# chop string into individual single-byte characters
@chars = unpack("A1" x length($string), $string);
unpack or
substr to access individual characters or a
portion of the string.
The offset argument to substr indicates the start
of the substring you're interested in, counting from the front if
positive and from the end if negative. If the offset is 0, the
substring starts at the beginning. The count argument is the length
of the substring.
$string = "This is what you have";
#         +012345678901234567890  Indexing forwards  (left to right)
#          109876543210987654321- Indexing backwards (right to left)
#           note that 0 means 10 or 20, etc. above
$first = substr($string, 0, 1); # "T"
$start = substr($string, 5, 2); # "is"
$rest = substr($string, 13); # "you have"
$last = substr($string, -1); # "e"
$end = substr($string, -4); # "have"
$piece = substr($string, -8, 3); # "you"
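The write forms of substr listed earlier can be sketched the same way. This is a minimal illustration using the same sample string; the variable names are chosen here for the example only.

```perl
use strict;
use warnings;

my $string = "This is what you have";

# Read access: copy two characters starting at offset 5.
my $piece = substr($string, 5, 2);

# Lvalue form: the replacement may be longer than the part it overwrites.
substr($string, 5, 2) = "wasn't";

# Four-argument form: replace the last 12 characters
# and return the text that was there before.
my $old = substr($string, -12, 12, "ondrous");

print "$piece\n$old\n$string\n";
```

Because strings grow and shrink on demand, neither replacement needs to match the length of the text it overwrites.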
Use the || or
||= operator, which work on both strings and
numbers:
# use $b if $b is true, else $c
$a = $b || $c;
# set $x to $y unless $x is already true
$x ||= $y;
If 0, "0", and
"" are valid values for your variables, use
defined instead:
# use $b if $b is defined, else $c
$a = defined($b) ? $b : $c;
# the "new" defined-or operator from future perl
use v5.9;
$a = $b // $c;
The difference between the two techniques (defined
and ||) is what they test: definedness versus
truth. Three defined values are still false in the world of Perl:
0, "0", and
"". If your variable already held one of those,
and you wanted to keep that value, a || wouldn't
work. You'd have to use the more elaborate three-way test with
defined instead. It's often convenient to arrange
for your program to care about only true or false values, not defined
or undefined ones.
Perl's || operator has an
interesting property: it returns its first operand (the
lefthand side) if that operand is true; otherwise it returns its
second operand. The && operator also
returns the last evaluated expression, but is less often used for
this property. These operators don't care whether their operands are
strings, numbers, or references—any scalar will do. They just
return the first one that makes the whole expression true or false.
This doesn't affect the Boolean sense of the return value, but it
does make the operators' return values more
useful.
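A short sketch of the distinction, assuming a counter that may legitimately be zero. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $count   = 0;     # defined, but false
my $default = 10;

# || tests truth, so a legitimate 0 is thrown away.
my $by_truth = $count || $default;

# defined tests definedness, so the 0 survives.
my $by_defined = defined($count) ? $count : $default;

# Both operators return one of their operands, not just 1 or 0.
my $name = "" || "anonymous";       # first operand is false
my $last = "camel" && "llama";      # first operand is true

print "$by_truth $by_defined $name $last\n";
```

Here `$by_truth` ends up as 10 while `$by_defined` keeps the 0, which is exactly the case where the three-way test is needed.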
($VAR1, $VAR2) = ($VAR2, $VAR1);
$temp = $a;
$a = $b;
$b = $temp;
$a = "alpha";
$b = "omega";
($a, $b) = ($b, $a); # the first shall be last -- and versa vice
($alpha, $beta, $production) = qw(January March August);
# move beta to alpha,
# move production to beta,
# move alpha to production
($alpha, $beta, $production) = ($beta, $production, $alpha);
When this code finishes, $alpha,
$beta, and $production have the
values "March", "August", and
"January".
Use ord to convert a
character to a number, or use chr to convert a
number to its corresponding character:
$num = ord($char);
$char = chr($num);
The %c format used in printf
and sprintf also converts a number to a character:
$char = sprintf("%c", $num); # slower than chr($num)
printf("Number %d is character %c\n", $num, $num);
Number 101 is character e
A C* template used with pack
and unpack can quickly convert many 8-bit bytes;
similarly, use U* for Unicode characters.
@bytes = unpack("C*", $string);
$string = pack("C*", @bytes);
$unistr = pack("U4",0x24b6,0x24b7,0x24b8,0x24b9);
@unichars = unpack("U*", $unistr);
Use chr and ord to convert between
a character and its corresponding ordinal value:
$value = ord("e"); # now 101
$character = chr(101); # now "e"
print or the
%s format in printf and
sprintf. The %c format forces
printf or sprintf to convert a
number into a character; it's not used for printing a character
that's already in character format (that is, a string).
printf("Number %d is character %c\n", 101, 101);
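As a rough sketch of these conversions working together, the following round-trips a string through its byte values and shifts each letter of a word by one code point. The sample strings are chosen only for illustration.

```perl
use strict;
use warnings;

# String to list of byte values and back again.
my $word  = "sample";
my @codes = unpack("C*", $word);
my $again = pack("C*", @codes);

# Shift every character up by one code point using ord and chr.
my $next = join "", map { chr(ord($_) + 1) } split //, "HAL";

print "@codes\n$again\n$next\n";
```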
The pack, unpack,
chr, and ord functions are all
faster than sprintf.
Place use charnames at the
top of your file, then freely insert
"\N{CHARSPEC}"
escapes into your string literals.
The use charnames pragma lets you
use symbolic names for Unicode characters. These are compile-time
constants that you access with the
\N{CHARSPEC} double-quoted
string sequence. Several subpragmas are supported. The
:full subpragma grants access to the full range of
character names, but you have to write them out in full, exactly as
they occur in the Unicode character database, including the loud,
all-capitals notation. The :short subpragma gives
convenient shortcuts. Any import without a colon tag is taken to be a
script name, giving case-sensitive shortcuts for those scripts.
use charnames ':full';
print "\N{GREEK CAPITAL LETTER DELTA} is called delta.\n";
Δ is called delta.
use charnames ':short';
print "\N{greek:Delta} is an upper-case delta.\n";
Δ is an upper-case delta.
use charnames qw(cyrillic greek);
print "\N{Sigma} and \N{sigma} are Greek sigmas.\n";
print "\N{Be} and \N{be} are Cyrillic bes.\n";
Σ and σ are Greek sigmas.
Б and б are Cyrillic bes.
Two functions, charnames::viacode and
charnames::vianame, can translate between numeric
code points and the long names. The Unicode documents use the
notation U+XXXX to indicate the Unicode
character whose code point is XXXX, so we'll use
that here in our output.
use charnames qw(:full);
for $code (0xC4, 0x394) {
printf "Character U+%04X (%s) is named %s\n",
$code, chr($code), charnames::viacode($code);
}
Character U+00C4 (Ä) is named LATIN CAPITAL LETTER A WITH DIAERESIS
Character U+0394 (Δ) is named GREEK CAPITAL LETTER DELTA
Use split with a null pattern to break up the string
into individual characters, or use unpack if you
just want the characters' values:
@array = split(//, $string); # each element a single character
@array = unpack("U*", $string); # each element a code point (number)
while (/(.)/g) { # . is never a newline here
# $1 has character, ord($1) its number
}
/X*/ matches all possible strings,
including the empty string. Odds are you will find others when you
don't mean to.
This example prints the unique characters used in the string "an apple a
day", sorted in ascending order:
%seen = ( );
$string = "an apple a day";
foreach $char (split //, $string) {
$seen{$char}++;
}
print "unique chars are: ", sort(keys %seen), "\n";
unique chars are:  adelnpy
The split and unpack
solutions give an array of characters to work with. If you don't want
an array, use a pattern match with the /g flag in
a while loop, extracting one character at a time:
%seen = ( );
$string = "an apple a day";
while ($string =~ /(.)/g) {
$seen{$1}++;
}
print "unique chars are: ", sort(keys %seen), "\n";
Use the reverse function in scalar context for flipping
characters:
$revchars = reverse($string);
To flip words, use reverse in list context with
split and
join:
$revwords = join(" ", reverse split(" ", $string));
The reverse function is two different functions in
one. Called in scalar context, it joins together its arguments and
returns that string in reverse order. Called in list context, it
returns its arguments in the opposite order. When using
reverse for its character-flipping behavior, use
scalar to force scalar context unless it's
entirely obvious.
$gnirts = reverse($string); # reverse letters in $string
@sdrow = reverse(@words); # reverse elements in @words
$confused = reverse(@words); # reverse letters in join("", @words)
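The context rules above can be seen side by side in a small sketch. The is_palindrome helper is a hypothetical name introduced here for illustration; it uses scalar to make the character-flipping behavior explicit.

```perl
use strict;
use warnings;

my @words = qw(one two);

my $chars   = reverse("stressed");  # scalar context: flips characters
my @flipped = reverse(@words);      # list context: flips element order
my $joined  = reverse(@words);      # scalar context: flips join("", @words)

# Hypothetical helper: true if a word reads the same in both directions.
sub is_palindrome {
    my $word = lc shift;
    return $word eq scalar reverse($word);
}

print "$chars @flipped $joined\n";
print "reviver is a palindrome\n" if is_palindrome("reviver");
```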
Using " " as the pattern to split is a special case. It causes
split to use contiguous whitespace as the
separator and also discard leading null fields, just like
awk. Normally, split discards
only trailing null fields.
# reverse word order
$string = 'Yoda said, "can you see this?"';
@allwords = split(" ", $string);
$revwords = join(" ", reverse @allwords);
print $revwords, "\n";
this?" see you "can said, Yoda
We could remove the temporary array @allwords and
do it on one line:
$revwords = join(" ", reverse split(" ", $string));
Multiple whitespace in $string becomes a single
space in $revwords. If you want to preserve
whitespace, use this:
$revwords = join("", reverse split(/(\s+)/, $string));
One use of reverse is to test whether a word is a
palindrome (a word that reads the same backward or forward).
Use \X in a regular expression.
$string = "fac\x{0327}ade"; # "façade"
$string =~ /fa.ade/; # fails
$string =~ /fa\Xade/; # succeeds
@chars = split(//, $string); # 7 letters in @chars
@chars = $string =~ /(.)/g; # same thing
@chars = $string =~ /(\X)/g; # 6 "letters" in @chars
\x{E7}", a
character right out of Latin1 (ISO 8859-1). These characters might be
encoded into a two-byte sequence under the UTF-8 encoding that Perl
uses internally, but those two bytes still only count as one single
character. That works just fine.
\x{0327}". Code point U+0327 is a non-spacing
combining character that means to go back and put a cedilla
underneath the preceding base character.
substr, length, and regular
expression metacharacters, such as in /./ or
/[^abc]/.
NFD( ) function from the Unicode::Normalize
module.
use Unicode::Normalize;
$s1 = "fa\x{E7}ade";
$s2 = "fac\x{0327}ade";
if (NFD($s1) eq NFD($s2)) { print "Yup!\n" }
$string = v231.780;
# LATIN SMALL LETTER C WITH CEDILLA
# COMBINING CARON
$string = v99.807.780;
# LATIN SMALL LETTER C
# COMBINING CEDILLA
# COMBINING CARON
$string = v99.780.807;
# LATIN SMALL LETTER C
# COMBINING CARON
# COMBINING CEDILLA
NFD( ) for canonical
decomposition and NFC( ) for canonical
decomposition followed by canonical composition. No matter which of
these three ways you used to specify your
The use bytes pragma makes all Perl operations in its
lexical scope treat the string as a group of octets. Use it when your
code is calling Perl's character-aware functions directly:
$ff = "\x{FB00}"; # ff ligature
$chars = length($ff); # length is one character
{
use bytes; # force byte semantics
$octets = length($ff); # length is three octets (UTF-8: EF AC 80)
}
$chars = length($ff); # back to character semantics
The Encode module lets you convert
a Unicode string to a string of octets, and back again. Use it when
the character-aware code isn't in your lexical scope:
use Encode qw(encode_utf8);
sub somefunc; # defined elsewhere
$ff = "\x{FB00}"; # ff ligature
$ff_oct = encode_utf8($ff); # convert to octets
$chars = somefunc($ff); # work with character string
$octets = somefunc($ff_oct); # work with octet string
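One place this matters is counting octets for something like a message body. The sketch below assumes the body will be sent as UTF-8; the sample string and variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# "façade" followed by a space and the ff ligature
my $body = "fa\x{E7}ade \x{FB00}";

my $chars  = length($body);               # counts characters
my $octets = length(encode_utf8($body));  # counts UTF-8 octets

print "characters: $chars, octets: $octets\n";
```

The two counts differ because U+00E7 takes two octets in UTF-8 and U+FB00 takes three.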
length) check this
flag and give character or octet semantics accordingly.
Content-Length
header that specifies the size of the body of a message in octets.
You can't simply use Perl's length function to
calculate the size, because if the string you're calling
length on is marked as UTF-8, you'll get the size
in characters.
use bytes
while ($string =~ s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e) {
# spin in empty loop until substitution finally fails
}
use Text::Tabs;
@expanded_lines = expand(@lines_with_tabs);
@tabulated_lines = unexpand(@lines_without_tabs);
$` variable, whose
very mention currently slows down every pattern match in the program.
This is explained in Special
Variables in Chapter 6. You could use this
algorithm to make a filter to expand its input's tabstops to eight
spaces each:
while (<>) {
1 while s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;
print;
}
To avoid $`, you could use a slightly more
complicated alternative that uses the numbered variables for explicit
capture; this one expands tabstops to four each instead of eight:
1 while s/^(.*?)(\t+)/$1 . ' ' x (length($2) * 4 - length($1) % 4)/e;
Another approach is to use the @+ and @- arrays. This also
expands to four-space positions:
1 while s/\t+/' ' x (($+[0] - $-[0]) * 4 - $-[0] % 4)/e;
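To check that the hand-rolled substitution and Text::Tabs agree, a small comparison such as the following can help. It is only a sketch on one sample line, with the tabstop set to eight.

```perl
use strict;
use warnings;
use Text::Tabs;   # exports expand, unexpand, and $tabstop

my $line = "a\tbc\tdef";

# Module version: tab stops every 8 columns.
$tabstop = 8;
my $wide = expand($line);

# Manual version: each run of tabs becomes enough spaces
# to reach the next multiple of 8.
my $manual = $line;
1 while $manual =~ s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;

print "same result\n" if $wide eq $manual;
```

The first tab follows one character, so it becomes seven spaces; the second follows ten characters, so it becomes six.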
1 while loops
and wondering why they couldn't have been written as part of a simple
You owe $debt to me.
$debt in the string with
its value.
$text =~ s/\$(\w+)/${$1}/g;
But use a double /ee if they might be lexical
(my) variables:
$text =~ s/(\$\w+)/$1/gee;
If $1 contains the string somevar,
${$1} will be whatever $somevar
contains. This won't work if the use
strict 'refs' pragma is in
effect because that bans symbolic dereferencing.
our ($rows, $cols);
no strict 'refs'; # for ${$1}/g below
my $text;
($rows, $cols) = (24, 80);
$text = q(I am $rows high and $cols long); # like single quotes!
$text =~ s/\$(\w+)/${$1}/g;
print $text;
I am 24 high and 80 long
You may have seen the /e substitution modifier used to evaluate the
replacement as code rather than as a string. It's designed for
situations where you don't know the exact replacement value, but you
do know how to calculate it. For example, doubling every whole number
in a string:
$text = "I am 17 years old";
$text =~ s/(\d+)/2 * $1/eg;
When Perl is compiling your program and sees a /e
on a substitute, it compiles the code in the replacement block along
with the rest of your program, long before the substitution actually
happens. When a substitution is made, $1 is
replaced with the string that matched. The code to evaluate would
then be something like:
2 * 17
$text = 'I am $AGE years old'; # note single quotes
$text =~ s/(\$\w+)/$1/eg; # WRONG
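The difference between a single /e and a double /ee can be sketched as follows, assuming a package variable $AGE declared just for this illustration.

```perl
use strict;
use warnings;

our $AGE = 17;
my $template = 'I am $AGE years old';   # single quotes: no interpolation

# One /e: the replacement code is just $1, which evaluates to
# the literal text '$AGE', so nothing changes.
(my $once = $template) =~ s/(\$\w+)/$1/eg;

# Two /e: the result '$AGE' is evaluated again as Perl code,
# which yields the variable's value.
(my $twice = $template) =~ s/(\$\w+)/$1/eeg;

print "$once\n$twice\n";
```

Since /ee evaluates text taken from the string as code, it is best kept for templates you control.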
Use the lc and uc functions or the
\L and \U string escapes.
$big = uc($little); # "bo peep" -> "BO PEEP"
$little = lc($big); # "JOHN" -> "john"
$big = "\U$little"; # "bo peep" -> "BO PEEP"
$little = "\L$big"; # "JOHN" -> "john"
To alter just one character, use the lcfirst and
ucfirst functions or the \l and
\u string escapes.
$big = "\u$little"; # "bo" -> "Bo"
$little = "\l$big"; # "BoPeep" -> "boPeep"
$beast = "dromedary";
# capitalize various parts of $beast
$capit = ucfirst($beast); # Dromedary
$capit = "\u\L$beast"; # (same)
$capall = uc($beast); # DROMEDARY
$capall = "\U$beast"; # (same)
$caprest = lcfirst(uc($beast)); # dROMEDARY
$caprest = "\l\U$beast"; # (same)
# titlecase each word's first character, lowercase the rest
$text = "thIS is a loNG liNE";
$text =~ s/(\w+)/\u\L$1/g;
print $text;
This Is A Long Line
tc( ) titlecasing function:
INIT {
our %nocap;
for (qw(
a an the
and but or
as at but by for from in into of off on onto per to with
))
{
$nocap{$_}++;
}
}
sub tc {
local $_ = shift;
# put into lowercase if on stop list, else titlecase
s/(\pL[\pL']*)/$nocap{$1} ? lc($1) : ucfirst(lc($1))/ge;
s/^(\pL[\pL']*) /\u\L$1/x; # first word guaranteed to cap
s/ (\pL[\pL']*)$/\u\L$1/x; # last word guaranteed to cap
# treat parenthesized portion as a complete title
s/\( (\pL[\pL']*) /(\u\L$1/x;
s/(\pL[\pL']*) \) /\u\L$1)/x;
# capitalize first word following colon or semi-colon
s/ ( [:;] \s+ ) (\pL[\pL']* ) /$1\u\L$2/x;
return $_;
}
s/(\w+\S*\w*)/\u\L$1/g;
tc function is as defined in
the Solution.
# with apologies (or kudos) to Stephen Brust, PJF,
# and to JRRT, as always.
@data = (
"the enchantress of \x{01F3}ur mountain",
"meeting the enchantress of \x{01F3}ur mountain",
"the lord of the rings: the fellowship of the ring",
);
$mask = "%-20s: %s\n";
sub tc_lame {
local $_ = shift;
s/(\w+\S*\w*)/\u\L$1/g;
return $_;
}
for $datum (@data) {
printf $mask, "ALL CAPITALS", uc($datum);
printf $mask, "no capitals", lc($datum);
printf $mask, "simple titlecase", tc_lame($datum);
printf $mask, "better titlecase", tc($datum);
print "\n";
}
$answer = $var1 . func( ) . $var2; # scalar only
Or use the @{[ LIST EXPR ]}
or ${ \( SCALAR EXPR ) }
expansions:
$answer = "STRING @{[ LIST EXPR ]} MORE STRING";
$answer = "STRING ${\( SCALAR EXPR )} MORE STRING";
$phrase = "I have " . ($n + 1) . " guanacos.";
$phrase = "I have ${\($n + 1)} guanacos.";
Because print effectively concatenates its entire argument
list, if we were going to print
$phrase, we could have just said:
print "I have ", $n + 1, " guanacos.\n";
Only @, $, and \
are special within double quotes and most backquotes. (As with
m// and s///, the qx( ) synonym is not subject to double-quote expansion if its
delimiter is single quotes! $home = qx'echo home is $HOME'; would get the shell
$HOME variable, not one in Perl.) So, the only way
to force arbitrary expressions to expand is by expanding a
${ } or @{ } whose block
contains a reference.
$phrase = "I have ${\( count_em( ) )} guanacos.";
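Both expansion forms can be tried in a short sketch. The @herd array and $n counter are made up for the example.

```perl
use strict;
use warnings;

my @herd = qw(llama alpaca guanaco);
my $n    = 2;

# Scalar expression: take a reference to the value, then dereference it.
my $scalar = "I have ${\( $n + 1 )} guanacos.";

# List expressions: build an anonymous array, then dereference it.
# Elements are joined with the list separator $", a space by default.
my $list = "Herd size: @{[ scalar @herd ]}; names: @{[ map { ucfirst($_) } @herd ]}";

print "$scalar\n$list\n";
```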
Use a s///
operator to strip out leading whitespace.
# all in one
($var = <<HERE_TARGET) =~ s/^\s+//gm;
    your text
    goes here
HERE_TARGET
# or with two steps
$var = <<HERE_TARGET;
    your text
    goes here
HERE_TARGET
$var =~ s/^\s+//gm;
The /m
modifier lets the ^ character match at the start
of each line in the string, and the /g modifier
makes the pattern-matching engine repeat the substitution as often as
it can (i.e., for every line in the here document).
($definition = <<'FINIS') =~ s/^\s+//gm;
The five varieties of camelids
are the familiar camel, his friends
the llama and the alpaca, and the
rather less well-known guanaco
and vicuña.
FINIS
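Because \s also matches newlines, the choice of character class decides whether blank lines survive. A minimal comparison, with here-document text invented for the example:

```perl
use strict;
use warnings;

# \s+ can match across a newline, so the blank line is swallowed.
(my $loose = <<'END') =~ s/^\s+//gm;
    first line

    third line
END

# [^\S\n]+ matches whitespace other than newline, so the blank line stays.
(my $careful = <<'END') =~ s/^[^\S\n]+//gm;
    first line

    third line
END

print $loose, "---\n", $careful;
```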
The patterns in this recipe use \s,
meaning one whitespace character, which will also match newlines.
This means they will remove any blank lines in your here document. If
you don't want this, replace \s with
[^\S\n] in the patterns.
The parenthesized assignment serves as the lefthand operand of =~. This lets
us do it all in one line, but works only when assigning to a
variable. When you're using the here document directly, it would be
considered a constant value, and you wouldn't be able to modify it.
In fact, you can't change a here document's value
unless you first put it into a variable.
use Text::Wrap;
@output = wrap($leadtab, $nexttab, @para);
use Text::Autoformat;
$formatted = autoformat $rawtext;
The Text::Wrap module provides the wrap function,
shown in Example 1-3, which takes a list of lines
and reformats them into a paragraph with no line more than
$Text::Wrap::columns characters long. We set
$columns to 20, ensuring that no line will be
longer than 20 characters. We pass wrap two
arguments before the list of lines: the first is the indent for the
first line of output, the second the indent for every subsequent
line.
#!/usr/bin/perl -w
# wrapdemo - show how Text::Wrap works
@input = ("Folding and splicing is the work of an editor,",
"not a mere collection of silicon",
"and",
"mobile electrons!");
use Text::Wrap qw($columns &wrap);
$columns = 20;
print "0123456789" x 2, "\n";
print wrap("    ", "  ", @input), "\n";
01234567890123456789
    Folding and
  splicing is the
  work of an
  editor, not a
  mere collection
  of silicon and
  mobile electrons!
sprintf and
want to convert literal % signs into
%%.
# backslash
$var =~ s/([CHARLIST])/\\$1/g;
# double
$var =~ s/([CHARLIST])/$1$1/g;
$var is the variable to be altered. The
CHARLIST is a list of characters to escape and can
contain backslash escapes like \t and
\n. If you just have one character to escape, omit
the brackets:
$string =~ s/%/%%/g;
system and exec to run
programs, shown in Recipe 16.2. They avoid
the shell altogether.)
$string = q(Mom said, "Don't do that.");
$string =~ s/(['"])/\\$1/g;
$string = q(Mom said, "Don't do that.");
$string =~ s/(['"])/$1$1/g;
$string =~ s/^\s+//;
$string =~ s/\s+$//;
$string = trim($string);
@many = trim(@many);
sub trim {
my @out = @_;
for (@out) {
s/^\s+//; # trim left
s/\s+$//; # trim right
}
    return @out == 1
? $out[0] # only one to return
: @out; # or many
}
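A quick sketch of calling trim both ways. The function is repeated here so the example runs on its own, and the sample strings are invented.

```perl
use strict;
use warnings;

sub trim {
    my @out = @_;
    for (@out) {
        s/^\s+//;   # trim left
        s/\s+$//;   # trim right
    }
    return @out == 1 ? $out[0] : @out;
}

# One argument in, one string out.
my $one = trim("  camel \n");

# Several arguments in, a list out; the originals are not modified.
my @many = trim(" llama", "alpaca  ", "\tvicuna\t");

print "[$one] [@many]\n";
```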
chop function. Be careful not to confuse this with
the similar but different chomp function, which
removes the last part of the string contained within that variable if
and only if it is contained in the $/ variable,
"\n" by default. These are often used to remove
the trailing newline from input:
# print what's typed, but surrounded by > < symbols
while (<STDIN>) {
chomp;
print ">$_<\n";
}
like \"this\"", use the standard Text::ParseWords
and this simple code:
use Text::ParseWords;
sub parse_csv0 {
return quotewords("," => 0, $_[0]);
}
like ""this""", you could use the following procedure from
Mastering Regular Expressions, Second Edition:
sub parse_csv1 {
my $text = shift; # record containing comma-separated values
my @fields = ( );
while ($text =~ m{
# Either some non-quote/non-comma text:
( [^"',] + )
# ...or...
|
# ...a double-quoted field: (with "" allowed inside)
" # field's opening quote; don't save this
( # now a field is either
(?: [^"] # non-quotes or
|
"" # adjacent quote pairs
) * # any number
)
" # field's closing quote; unsaved
}gx)
{
if (defined $1) {
$field = $1;
} else {
($field = $2) =~ s/""/"/g;
}
push @fields, $field;
}
return @fields;
}
use Text::CSV;
sub parse_csv1 {
my $line = shift;
my $csv = Text::CSV->new( );
return $csv->parse($line) && $csv->fields( );
}
tie @data, "Tie::CSV_File", "data.csv";
for ($i = 0; $i < @data; $i++) {
printf "Row %d (Line %d) is %s\n", $i, $i+1, "@{$data[$i]}";
for ($j = 0; $j < @{$data[$i]}; $j++) {
print "Column $j is <$data[$i][$j]>\n";
}
}
The use
constant pragma will
work:
use constant AVOGADRO => 6.02252e23;
printf "You need %g of those for guac\n", AVOGADRO;
*AVOGADRO = \6.02252e23;
print "You need $AVOGADRO of those for guac\n";
tie
class whose STORE method raises an exception:
package Tie::Constvar;
use Carp;
sub TIESCALAR {
my ($class, $initval) = @_;
my $var = $initval;
return bless \$var => $class;
}
sub FETCH {
my $selfref = shift;
return $$selfref;
}
sub STORE {
confess "Meddle not with the constants of the universe";
}
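A sketch of how such a tied scalar might be used. The class is repeated so the example is self-contained, and the eval block is added here only to catch the exception and show that the stored value is untouched.

```perl
use strict;
use warnings;

package Tie::Constvar;
use Carp;
sub TIESCALAR {
    my ($class, $initval) = @_;
    my $var = $initval;
    return bless \$var => $class;
}
sub FETCH { my $selfref = shift; return $$selfref }
sub STORE { confess "Meddle not with the constants of the universe" }

package main;

tie my $AVOGADRO, "Tie::Constvar", 6.02252e23;
my $before = $AVOGADRO;

# Any assignment calls STORE, which dies; eval traps the exception.
my $ok    = eval { $AVOGADRO = 6.6256e-34; 1 };
my $error = $@;
my $after = $AVOGADRO;

print "You need $after of those for guac\n";
print "assignment refused\n" unless $ok;
```

Unlike a use constant name, the tied variable interpolates in double-quoted strings.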
The use constant pragma is the easiest to use, but
has a few drawbacks. The biggest one is that it doesn't give you a
variable that you can expand in double-quoted strings. Another is
that it isn't scoped; it puts a subroutine of that name into the
package namespace.
sub AVOGADRO( ) { 6.02252e23 }
use subs qw(AVOGADRO);
local *AVOGADRO = sub ( ) { 6.02252e23 };
use Text::Soundex;
$CODE = soundex($STRING);
@CODES = soundex(@LIST);
use Text::Metaphone;
$phoned_words = Metaphone('Schwern');