Perl Cookbook, Second Edition

By Tom Christiansen, Nathan Torkington

Chapter 1: Strings
He multiplieth words without knowledge.
—Job 35:16
Introduction
Many programming languages force you to work at an uncomfortably low level. You think in lines, but your language wants you to deal with pointers. You think in strings, but it wants you to deal with bytes. Such a language can drive you to distraction. Don't despair; Perl isn't a low-level language, so lines and strings are easy to handle.
Perl was designed for easy but powerful text manipulation. In fact, Perl can manipulate text in so many ways that they can't all be described in one chapter. Check out other chapters for recipes on text processing. In particular, see Chapter 6 and Chapter 8, which discuss interesting techniques not covered here.
Perl's fundamental unit for working with data is the scalar, that is, single values stored in single (scalar) variables. Scalar variables hold strings, numbers, and references. Array and hash variables hold lists or associations of scalars, respectively. References are used for referring to values indirectly, not unlike pointers in low-level languages. Numbers are usually stored in your machine's double-precision floating-point notation. Strings in Perl may be of any length, within the limits of your machine's virtual memory, and can hold any arbitrary data you care to put there—even binary data containing null bytes.
A string in Perl is not an array of characters—nor of bytes, for that matter. You cannot use array subscripting on a string to address one of its characters; use substr for that. Like all data types in Perl, strings grow on demand. Space is reclaimed by Perl's garbage collection system when no longer used, typically when the variables have gone out of scope or when the expression in which they were used has been evaluated. In other words, memory management is already taken care of, so you don't have to worry about it.
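A minimal sketch of these points, using illustrative variable names that do not come from the text: substr stands in for subscripting, an embedded null byte is ordinary data, and appending makes the string grow without any allocation code.

```perl
use strict;
use warnings;

my $string = "Perl";
my $second = substr($string, 1, 1);    # "e": substr takes the place of subscripting

my $binary = "ab\0cd";                 # a null byte is just another character
my $binary_length = length($binary);   # 5

$string .= " Cookbook";                # the string grows on demand

print "$second $binary_length $string\n";
```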
A scalar value is either defined or undefined. If defined, it may hold a string, number, or reference. The only undefined value is undef. All other values are defined, even numeric 0 and the empty string. Definedness is not the same as Boolean truth, though; to check whether a value is defined, use the
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Accessing Substrings
You want to access or modify just a portion of a string, not the whole thing. For instance, you've read a fixed-width record and want to extract individual fields.
The substr function lets you read from and write to specific portions of the string.
$value = substr($string, $offset, $count);
$value = substr($string, $offset);

substr($string, $offset, $count) = $newstring;
substr($string, $offset, $count, $newstring);  # same as previous
substr($string, $offset)         = $newtail;
The unpack function gives only read access, but is faster when you have many substrings to extract.
# get a 5-byte string, skip 3 bytes,
# then grab two 8-byte strings, then the rest;
# (NB: only works on ASCII data, not Unicode)
($leading, $s1, $s2, $trailing) =
    unpack("A5 x3 A8 A8 A*", $data);

# split at 5-byte boundaries
@fivers = unpack("A5" x (length($string)/5), $string);

# chop string into individual single-byte characters
@chars  = unpack("A1" x length($string), $string);
Strings are a basic data type; they aren't arrays of a basic data type. Instead of using array subscripting to access individual characters as you sometimes do in other programming languages, in Perl you use functions like unpack or substr to access individual characters or a portion of the string.
The offset argument to substr indicates the start of the substring you're interested in, counting from the front if positive and from the end if negative. If the offset is 0, the substring starts at the beginning. The count argument is the length of the substring.
$string = "This is what you have";
#         +012345678901234567890  Indexing forwards  (left to right)
#          109876543210987654321- Indexing backwards (right to left)
#           note that 0 means 10 or 20, etc. above

$first  = substr($string, 0, 1);  # "T"
$start  = substr($string, 5, 2);  # "is"
$rest   = substr($string, 13);    # "you have"
$last   = substr($string, -1);    # "e"
$end    = substr($string, -4);    # "have"
$piece  = substr($string, -8, 3); # "you"
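As a hedged illustration of the fixed-width record case mentioned in the problem statement, the sketch below reads fields with unpack and then rewrites two of them with substr. The record layout (10-character name, 3-character age, 8-character city) and the sample data are assumptions made up for this example.

```perl
use strict;
use warnings;

# Hypothetical layout: 10-character name, 3-character age, 8-character city.
my $record = sprintf("%-10s%03d%-8s", "Alice", 30, "Boston");

# Read access: unpack extracts every field in one call ("A" strips trailing spaces).
my ($name, $age, $city) = unpack("A10 A3 A8", $record);

# Write access: substr as an lvalue, then the four-argument form,
# which returns the text it replaced.
substr($record, 10, 3) = "031";
my $old_city = substr($record, 13, 8, "Chicago ");

my $new_age  = substr($record, 10, 3);   # "031"
my $new_city = substr($record, -8);      # "Chicago "

print "$name|$age|$city|$new_age|$new_city\n";
```

Because each replacement has the same length as the field it overwrites, the record keeps its fixed width.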
Establishing a Default Value
You would like to supply a default value to a scalar variable, but only if it doesn't already have one. It often happens that you want a hardcoded default value for a variable that can be overridden from the command line or through an environment variable.
Use the || or ||= operator, which work on both strings and numbers:
# use $b if $b is true, else $c
$a = $b || $c;

# set $x to $y unless $x is already true
$x ||= $y;
If 0, "0", and "" are valid values for your variables, use defined instead:
# use $b if $b is defined, else $c
$a = defined($b) ? $b : $c;

# the defined-or operator, available since Perl 5.10
use v5.10;
$a = $b // $c;
The big difference between the two techniques (defined and ||) is what they test: definedness versus truth. Three defined values are still false in the world of Perl: 0, "0", and "". If your variable already held one of those, and you wanted to keep that value, a || wouldn't work. You'd have to use the more elaborate three-way test with defined instead. It's often convenient to arrange for your program to care about only true or false values, not defined or undefined ones.
Rather than being restricted in its return values to a mere 1 or 0 as in most other languages, Perl's || operator has a much more interesting property: it returns its first operand (the lefthand side) if that operand is true; otherwise it returns its second operand. The && operator also returns the last evaluated expression, but is less often used for this property. These operators don't care whether their operands are strings, numbers, or references—any scalar will do. They just return the first one that makes the whole expression true or false. This doesn't affect the Boolean sense of the return value, but it does make the operators' return values more useful.
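The difference between testing truth and testing definedness, and the way || and && hand back one of their operands, can be sketched as follows. The variable names and the "vi" fallback are illustrative assumptions, not values from the text.

```perl
use strict;
use warnings;

my $count = 0;                                     # defined, but false
my $by_truth   = $count || 10;                     # 10: the 0 is discarded
my $by_defined = defined($count) ? $count : 10;    # 0: the 0 survives

my $unset;                                         # undefined
my $fallback = $unset || "default";                # "default"

my ($left, $right) = ("first", "second");
my $and_result = $left && $right;                  # "second", the last operand evaluated

# Typical use: an environment variable overrides a hardcoded default.
my $editor = $ENV{EDITOR} || "vi";

print "$by_truth $by_defined $fallback $and_result\n";
```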
Exchanging Values Without Using Temporary Variables
You want to exchange the values of two scalar variables, but don't want to use a temporary variable.
Use list assignment to reorder the variables.
($VAR1, $VAR2) = ($VAR2, $VAR1);
Most programming languages require an intermediate step when swapping two variables' values:
$temp    = $a;
$a       = $b;
$b       = $temp;
Not so in Perl. It tracks both sides of the assignment, guaranteeing that you don't accidentally clobber any of your values. This eliminates the temporary variable:
$a       = "alpha";
$b       = "omega";
($a, $b) = ($b, $a);        # the first shall be last -- and versa vice
You can even exchange more than two variables at once:
($alpha, $beta, $production) = qw(January March August);
# move beta       to alpha,
# move production to beta,
# move alpha      to production
($alpha, $beta, $production) = ($beta, $production, $alpha);
When this code finishes, $alpha, $beta, and $production have the values "March", "August", and "January".
See also: the section on "List value constructors" in perldata(1) and the section on "List Values and Arrays" in Chapter 2 of Programming Perl.
Converting Between Characters and Values
You want to print the number represented by a given character, or you want to print a character given a number.
Use ord to convert a character to a number, or use chr to convert a number to its corresponding character:
$num  = ord($char);
$char = chr($num);
The %c format used in printf and sprintf also converts a number to a character:
$char = sprintf("%c", $num);                # slower than chr($num)
printf("Number %d is character %c\n", $num, $num);
Number 101 is character e
            
A C* template used with pack and unpack can quickly convert many 8-bit bytes; similarly, use U* for Unicode characters.
@bytes = unpack("C*", $string);
$string = pack("C*", @bytes);

$unistr = pack("U4",0x24b6,0x24b7,0x24b8,0x24b9);
@unichars = unpack("U*", $unistr);
Unlike low-level, typeless languages such as assembler, Perl doesn't treat characters and numbers interchangeably; it treats strings and numbers interchangeably. That means you can't just assign characters and numbers back and forth. Perl provides Pascal's chr and ord to convert between a character and its corresponding ordinal value:
$value     = ord("e");    # now 101
$character = chr(101);    # now "e"
If you already have a character, it's really represented as a string of length one, so just print it out directly using print or the %s format in printf and sprintf. The %c format forces printf or sprintf to convert a number into a character; it's not used for printing a character that's already in character format (that is, a string).
printf("Number %d is character %c\n", 101, 101);
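A short sketch of converting in bulk rather than one character at a time: unpack and pack with C* handle a whole string of 8-bit characters, and ord plus chr can shift each character individually. The sample strings are chosen only for illustration.

```perl
use strict;
use warnings;

my @ascii = unpack("C*", "sample");     # (115, 97, 109, 112, 108, 101)
my $word  = pack("C*", @ascii);         # "sample" again

# Shift each character of a string one code point up.
my $hal = "HAL";
my $ibm = join "", map { chr(ord($_) + 1) } split //, $hal;   # "IBM"

print "@ascii\n$word\n$ibm\n";
```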
The pack, unpack, chr, and ord functions are all faster than
Using Named Unicode Characters
You want to use Unicode names for fancy characters in your code without worrying about their code points.
Place a use charnames at the top of your file, then freely insert "\N{CHARSPEC}" escapes into your string literals.
The use charnames pragma lets you use symbolic names for Unicode characters. These are compile-time constants that you access with the \N{CHARSPEC} double-quoted string sequence. Several subpragmas are supported. The :full subpragma grants access to the full range of character names, but you have to write them out in full, exactly as they occur in the Unicode character database, including the loud, all-capitals notation. The :short subpragma gives convenient shortcuts. Any import without a colon tag is taken to be a script name, giving case-sensitive shortcuts for those scripts.
use charnames ':full';
print "\N{GREEK CAPITAL LETTER DELTA} is called delta.\n";

Δ is called delta.

use charnames ':short';
print "\N{greek:Delta} is an upper-case delta.\n";

Δ is an upper-case delta.

use charnames qw(cyrillic greek);
print "\N{Sigma} and \N{sigma} are Greek sigmas.\n";
print "\N{Be} and \N{be} are Cyrillic bes.\n";

Σ and σ are Greek sigmas.
Б and б are Cyrillic bes.
Two functions, charnames::viacode and charnames::vianame, can translate between numeric code points and the long names. The Unicode documents use the notation U+XXXX to indicate the Unicode character whose code point is XXXX, so we'll use that here in our output.
use charnames qw(:full);
for $code (0xC4, 0x394) { 
    printf "Character U+%04X (%s) is named %s\n",
        $code, chr($code), charnames::viacode($code);
}

Character U+00C4 (Ä) is named LATIN CAPITAL LETTER A WITH DIAERESIS
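The loop above goes from code point to name. As a minimal sketch of the opposite direction, charnames::vianame maps a long name to its code point, and charnames::viacode maps it back:

```perl
use strict;
use warnings;
use charnames ':full';

my $name = "GREEK CAPITAL LETTER DELTA";
my $code = charnames::vianame($name);     # numeric code point, 0x394
my $back = charnames::viacode($code);     # the long name again

printf "U+%04X is named %s\n", $code, $back;
```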
Processing a String One Character at a Time
You want to process a string one character at a time.
Use split with a null pattern to break up the string into individual characters, or use unpack if you just want the characters' values:
@array = split(//, $string);      # each element a single character
@array = unpack("U*", $string);   # each element a code point (number)
Or extract each character in turn with a loop:
while (/(.)/g) {         # . is never a newline here
    # $1 has character, ord($1) its number
}
As we said before, Perl's fundamental unit is the string, not the character. Needing to process anything a character at a time is rare. Usually some kind of higher-level Perl operation, like pattern matching, solves the problem more handily. See, for example, Recipe 7.14, where a set of substitutions is used to find command-line arguments.
Splitting on a pattern that matches the empty string returns a list of individual characters in the string. This is a convenient feature when done intentionally, but it's easy to do unintentionally. For instance, /X*/ matches all possible strings, including the empty string. Odds are you will find others when you don't mean to.
Here's an example that prints the characters used in the string "an apple a day", sorted in ascending order:
%seen = ( );
$string = "an apple a day";
foreach $char (split //, $string) {
    $seen{$char}++;
}
print "unique chars are: ", sort(keys %seen), "\n";
unique chars are:  adelnpy
            
These split and unpack solutions give an array of characters to work with. If you don't want an array, use a pattern match with the /g flag in a while loop, extracting one character at a time:
%seen = ( );
$string = "an apple a day";
while ($string =~ /(.)/g) {
    $seen{$1}++;
}
print "unique chars are: ", sort(keys %seen), "\n";
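Both styles can also feed a running calculation instead of a hash. The sketch below, which is an illustration rather than a listing from the text, sums the byte values of the same string twice, once over the list from unpack and once in a /g loop, to show that the two approaches visit the same characters.

```perl
use strict;
use warnings;

my $string = "an apple a day";

# Array-style: unpack returns every byte value at once.
my $sum_unpack = 0;
$sum_unpack += $_ for unpack("C*", $string);

# Loop-style: visit one character at a time with /g.
my $sum_loop = 0;
while ($string =~ /(.)/g) {
    $sum_loop += ord($1);
}

print "sum is $sum_unpack\n";
```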
Reversing a String by Word or Character
You want to reverse the words or characters of a string.
Use the reverse function in scalar context for flipping characters:
$revchars = reverse($string);
To flip words, use reverse in list context with split and join:
$revwords = join(" ", reverse split(" ", $string));
The reverse function is two different functions in one. Called in scalar context, it joins together its arguments and returns that string in reverse order. Called in list context, it returns its arguments in the opposite order. When using reverse for its character-flipping behavior, use scalar to force scalar context unless it's entirely obvious.
$gnirts   = reverse($string);       # reverse letters in $string

@sdrow    = reverse(@words);        # reverse elements in @words

$confused = reverse(@words);        # reverse letters in join("", @words)
Here's an example of reversing words in a string. Using a single space, " ", as the pattern to split is a special case. It causes split to use contiguous whitespace as the separator and also discard leading null fields, just like awk. Normally, split discards only trailing null fields.
# reverse word order
$string = 'Yoda said, "can you see this?"';
@allwords    = split(" ", $string);
$revwords    = join(" ", reverse @allwords);
print $revwords, "\n";
this?" see you "can said, Yoda
            
We could remove the temporary array @allwords and do it on one line:
$revwords = join(" ", reverse split(" ", $string));
Multiple whitespace in $string becomes a single space in $revwords. If you want to preserve whitespace, use this:
$revwords = join("", reverse split(/(\s+)/, $string));
One use of reverse is to test whether a word is a palindrome (a word that reads the same backward or forward):
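A minimal sketch of such a test follows. The helper name is_palindrome and the sample words are assumptions made for this example, and the helper ignores case, which is a design choice rather than part of the definition.

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from the book's listing.
sub is_palindrome {
    my ($word) = @_;
    my $folded = lc $word;                       # ignore case
    return $folded eq scalar reverse($folded);   # scalar forces character reversal
}

my @candidates = qw(Reviver level Perl);
for my $word (@candidates) {
    printf "%-8s %s\n", $word, is_palindrome($word) ? "palindrome" : "not a palindrome";
}
```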
Treating Unicode Combined Characters as Single Characters
You have a Unicode string that contains combining characters, and you'd like to treat each of these sequences as a single logical character.
Process them using \X in a regular expression.
$string = "fac\x{0327}ade";         # "façade"
$string =~ /fa.ade/;                # fails
$string =~ /fa\Xade/;               # succeeds

@chars = split(//, $string);        # 7 letters in @chars
@chars = $string =~ /(.)/g;         # same thing
@chars = $string =~ /(\X)/g;        # 6 "letters" in @chars
In Unicode, you can combine a base character with one or more non-spacing characters following it; these are usually diacritics, such as accent marks, cedillas, and tildes. Due to the presence of precombined characters, for the most part to accommodate legacy character systems, there can be two or more ways of writing the same thing.
For example, the word "façade" can be written with one character between the two a's, "\x{E7}", a character right out of Latin1 (ISO 8859-1). These characters might be encoded into a two-byte sequence under the UTF-8 encoding that Perl uses internally, but those two bytes still only count as one single character. That works just fine.
There's a thornier issue. Another way to write U+00E7 is with two different code points: a regular "c" followed by "\x{0327}". Code point U+0327 is a non-spacing combining character that means to go back and put a cedilla underneath the preceding base character.
There are times when you want Perl to treat each combined character sequence as one logical character. But because they're distinct code points, Perl's character-related operations treat non-spacing combining characters as separate characters, including substr, length, and regular expression metacharacters, such as in /./ or /[^abc]/.
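The gap between code points and logical characters can be made concrete with a small sketch. It counts the string both ways and then reverses it by \X clusters, so that the cedilla stays attached to its base character; the reversal is an illustrative use chosen for this example.

```perl
use strict;
use warnings;

my $string = "fac\x{0327}ade";               # "c" followed by COMBINING CEDILLA

my $code_points = length($string);           # 7: the cedilla counts separately
my @clusters    = $string =~ /(\X)/g;        # each element is one logical character
my $logical     = scalar @clusters;          # 6

# Reversing code points would detach the cedilla from its "c";
# reversing clusters keeps each base character with its marks.
my $reversed = join "", reverse @clusters;

print "$code_points code points, $logical logical characters\n";
```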
In a regular expression, the
Canonicalizing Strings with Unicode Combined Characters
You have two strings that look the same when you print them out, but they don't test as string equal and sometimes even have different lengths. How can you get Perl to consider them the same strings?
When you have otherwise equivalent strings, at least some of which contain Unicode combining character sequences, instead of comparing them directly, compare the results of running them through the NFD( ) function from the Unicode::Normalize module.
use Unicode::Normalize;
$s1 = "fa\x{E7}ade";                
$s2 = "fac\x{0327}ade";                
if (NFD($s1) eq NFD($s2)) { print "Yup!\n" }
The same character sequence can sometimes be specified in multiple ways. Sometimes this is because of legacy encodings, such as the letters from Latin1 that contain diacritical marks. These can be specified directly with a single character (like U+00E7, LATIN SMALL LETTER C WITH CEDILLA) or indirectly via the base character (like U+0063, LATIN SMALL LETTER C) followed by a combining character (U+0327, COMBINING CEDILLA).
Another possibility is that you have two or more marks following a base character, but the order of those marks varies in your data. Imagine you wanted the letter "c" to have both a cedilla and a caron on top of it in order to print ç̌. That could be specified in any of these ways:
$string = v231.780;
#         LATIN SMALL LETTER C WITH CEDILLA
#         COMBINING CARON

$string = v99.807.780;
#         LATIN SMALL LETTER C
#         COMBINING CEDILLA
#         COMBINING CARON

$string = v99.780.807;
#         LATIN SMALL LETTER C
#         COMBINING CARON
#         COMBINING CEDILLA
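As a hedged check of the three spellings above, the sketch below writes them with \x{} escapes instead of v-strings and counts how many distinct strings remain before and after normalization. The hash-based counting is just a convenience for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

my @spellings = (
    "\x{E7}\x{30C}",        # C WITH CEDILLA, COMBINING CARON
    "c\x{327}\x{30C}",      # C, COMBINING CEDILLA, COMBINING CARON
    "c\x{30C}\x{327}",      # C, COMBINING CARON, COMBINING CEDILLA
);

# Use hash keys to count distinct strings in each form.
my (%raw, %nfd, %nfc);
for my $s (@spellings) {
    $raw{$s}      = 1;
    $nfd{NFD($s)} = 1;
    $nfc{NFC($s)} = 1;
}

my $distinct_raw = keys %raw;    # 3 different spellings
my $distinct_nfd = keys %nfd;    # 1 after canonical decomposition
my $distinct_nfc = keys %nfc;    # 1 after decomposition plus composition

print "$distinct_raw raw, $distinct_nfd after NFD, $distinct_nfc after NFC\n";
```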
The normalization functions rearrange those into a reliable ordering. Several are provided, including NFD( ) for canonical decomposition and NFC( ) for canonical decomposition followed by canonical composition. No matter which of these three ways you used to specify your
Treating a Unicode String as Octets
You have a Unicode string but want Perl to treat it as octets (e.g., to calculate its length or for purposes of I/O).
The use bytes pragma makes all Perl operations in its lexical scope treat the string as a group of octets. Use it when your code is calling Perl's character-aware functions directly:
$ff = "\x{FB00}";             # ff ligature
$chars = length($ff);         # length is one character
{
  use bytes;                  # force byte semantics
  $octets = length($ff);      # length is three octets
}
$chars = length($ff);         # back to character semantics
Alternatively, the Encode module lets you convert a Unicode string to a string of octets, and back again. Use it when the character-aware code isn't in your lexical scope:
use Encode qw(encode_utf8);

sub somefunc;                 # defined elsewhere

$ff = "\x{FB00}";             # ff ligature
$ff_oct = encode_utf8($ff);   # convert to octets

$chars = somefunc($ff);       # work with character string
$octets = somefunc($ff_oct);  # work with octet string
As explained in this chapter's Introduction, Perl knows about two types of string: those made of simple uninterpreted octets, and those made of Unicode characters whose UTF-8 representation may require more than one octet. Each individual string has a flag associated with it, identifying the string as either UTF-8 or octets. Perl's I/O and string operations (such as length) check this flag and give character or octet semantics accordingly.
Sometimes you need to work with bytes and not characters. For example, many protocols have a Content-Length header that specifies the size of the body of a message in octets. You can't simply use Perl's length function to calculate the size, because if the string you're calling length on is marked as UTF-8, you'll get the size in characters.
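A minimal sketch of the Content-Length case, assuming the body will be sent as UTF-8: encode the string first, then take the length of the octets. The sample body is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Six characters: c, a, f, e-acute (U+00E9), space, ff ligature (U+FB00).
my $body = "caf\x{E9} \x{FB00}";

my $chars  = length($body);                 # 6 characters
my $octets = length(encode_utf8($body));    # 9 octets: 3 + 2 + 1 + 3

my $header = "Content-Length: $octets";
print "$header\n";
```

Using the character count here would understate the size of the message body by three octets.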
The use bytes
Expanding and Compressing Tabs
You want to convert tabs in a string to the appropriate number of spaces, or vice versa. Converting spaces into tabs can be used to reduce file size when the file has many consecutive spaces. Converting tabs into spaces may be required when producing output for devices that don't understand tabs or think they are at different positions than you do.
Either use a rather funny looking substitution:
while ($string =~ s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e) {
    # spin in empty loop until substitution finally fails
}
or use the standard Text::Tabs module:
use Text::Tabs;
@expanded_lines  = expand(@lines_with_tabs);
@tabulated_lines = unexpand(@lines_without_tabs);
Assuming tab stops are set every N positions (where N is customarily eight), it's easy to convert them into spaces. The standard textbook method does not use the Text::Tabs module but suffers slightly from being difficult to understand. Also, it uses the $` variable, whose very mention currently slows down every pattern match in the program. This is explained in Special Variables in Chapter 6. You could use this algorithm to make a filter to expand its input's tabstops to eight spaces each:
while (<>) {
    1 while s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;
    print;
}
To avoid $`, you could use a slightly more complicated alternative that uses the numbered variables for explicit capture; this one expands tabstops to four each instead of eight:
1 while s/^(.*?)(\t+)/$1 . ' ' x (length($2) * 4 - length($1) % 4)/e;
Another approach is to use the offsets directly from the @+ and @- arrays. This also expands to four-space positions:
1 while s/\t+/' ' x (($+[0] - $-[0]) * 4 - $-[0] % 4)/e;
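To compare the approaches on a concrete line, the sketch below expands the same input with Text::Tabs and with the numbered-variable substitution shown earlier, both at four-column tab stops. The sample line is an assumption; $tabstop is the variable that Text::Tabs exports for changing its default of eight.

```perl
use strict;
use warnings;
use Text::Tabs;            # exports expand, unexpand, and $tabstop

$tabstop = 4;              # Text::Tabs defaults to 8

my $line = "a\tbc\tdef";

# Module version.
my $by_module = expand($line);

# Substitution version, repeated until no tabs remain.
my $by_regex = $line;
1 while $by_regex =~ s/^(.*?)(\t+)/$1 . ' ' x (length($2) * 4 - length($1) % 4)/e;

print "same result\n" if $by_module eq $by_regex;
```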
If you're looking at all of these 1 while loops and wondering why they couldn't have been written as part of a simple
Expanding Variables in User Input
You've read a string with an embedded variable reference, such as:
You owe $debt to me.
Now you want to replace $debt in the string with its value.
Use a substitution with symbolic references if the variables are all globals:
$text =~ s/\$(\w+)/${$1}/g;
But use a double /ee if they might be lexical (my) variables:
$text =~ s/(\$\w+)/$1/gee;
The first technique is basically to find what looks like a variable name, then use symbolic dereferencing to interpolate its contents. If $1 contains the string somevar, ${$1} will be whatever $somevar contains. This won't work if the use strict 'refs' pragma is in effect because that bans symbolic dereferencing.
Here's an example:
our ($rows, $cols);
no strict 'refs';                   # for ${$1}/g below
my $text;

($rows, $cols) = (24, 80);
$text = q(I am $rows high and $cols long);  # like single quotes!
$text =~ s/\$(\w+)/${$1}/g;
print $text;
I am 24 high and 80 long
            
You may have seen the /e substitution modifier used to evaluate the replacement as code rather than as a string. It's designed for situations where you don't know the exact replacement value, but you do know how to calculate it. For example, doubling every whole number in a string:
$text = "I am 17 years old";
$text =~ s/(\d+)/2 * $1/eg;
When Perl is compiling your program and sees a /e on a substitute, it compiles the code in the replacement block along with the rest of your program, long before the substitution actually happens. When a substitution is made, $1 is replaced with the string that matched. The code to evaluate would then be something like:
2 * 17
If we tried saying:
$text = 'I am $AGE years old';      # note single quotes
$text =~ s/(\$\w+)/$1/eg;           # WRONG
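For contrast with the single /e attempt, here is a hedged sketch of the /ee form from the Solution applied to a lexical variable. The variable name and value are illustrative. Note that /ee runs the matched text as Perl code, so the pattern should stay as narrow as the \$\w+ used here.

```perl
use strict;
use warnings;

my $debt = 42;                              # a lexical, invisible to symbolic references
my $text = 'You owe $debt to me.';          # single quotes: no interpolation yet

# The first /e evaluates $1, yielding the string '$debt';
# the second /e evaluates that string as Perl code, yielding 42.
(my $expanded = $text) =~ s/(\$\w+)/$1/gee;

print "$expanded\n";
```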
Controlling Case
A string in uppercase needs converting to lowercase, or vice versa.
Use the lc and uc functions or the \L and \U string escapes.
$big = uc($little);             # "bo peep" -> "BO PEEP"
$little = lc($big);             # "JOHN"    -> "john"
$big = "\U$little";             # "bo peep" -> "BO PEEP"
$little = "\L$big";             # "JOHN"    -> "john"
To alter just one character, use the lcfirst and ucfirst functions or the \l and \u string escapes.
$big = "\u$little";             # "bo"      -> "Bo"
$little = "\l$big";             # "BoPeep"  -> "boPeep"
The functions and string escapes look different, but both do the same thing. You can set the case of either just the first character or the whole string. You can even do both at once to force uppercase (actually, titlecase; see later explanation) on initial characters and lowercase on the rest.
$beast   = "dromedary";
# capitalize various parts of $beast
$capit   = ucfirst($beast);         # Dromedary
$capit   = "\u\L$beast";            # (same)
$capall  = uc($beast);              # DROMEDARY
$capall  = "\U$beast";              # (same)
$caprest = lcfirst(uc($beast));     # dROMEDARY
$caprest = "\l\U$beast";            # (same)
These capitalization-changing escapes are commonly used to make a string's case consistent:
# titlecase each word's first character, lowercase the rest
$text = "thIS is a loNG liNE";
$text =~ s/(\w+)/\u\L$1/g;
print $text;
This Is A Long Line
            
You can also use these for case-insensitive comparison:
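A minimal sketch of such a comparison, which is not the book's own listing: force both sides to the same case with either the function or the escape, then compare with eq. The sample strings are made up.

```perl
use strict;
use warnings;

my ($first, $second) = ("Bo Peep", "bo peep");

my $same_by_function = uc($first) eq uc($second);     # true
my $same_by_escape   = "\U$first" eq "\U$second";     # true
my $same_exactly     = $first eq $second;             # false

print "equal ignoring case\n" if $same_by_function && $same_by_escape;
```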
Properly Capitalizing a Title or Headline
You have a string representing a headline, the title of a book, or some other work that needs proper capitalization.
Use a variant of this tc( ) titlecasing function:
INIT {
    our %nocap;
    for (qw(
            a an the
            and but or
            as at but by for from in into of off on onto per to with
        ))
    {
        $nocap{$_}++;
    }
}

sub tc {
    local $_ = shift;

    # put into lowercase if on stop list, else titlecase
    s/(\pL[\pL']*)/$nocap{$1} ? lc($1) : ucfirst(lc($1))/ge;

    s/^(\pL[\pL']*) /\u\L$1/x;  # first  word guaranteed to cap
    s/ (\pL[\pL']*)$/\u\L$1/x;  # last word guaranteed to cap

    # treat parenthesized portion as a complete title
    s/\( (\pL[\pL']*) /(\u\L$1/x;
    s/(\pL[\pL']*) \) /\u\L$1)/x;

    # capitalize first word following colon or semi-colon
    s/ ( [:;] \s+ ) (\pL[\pL']* ) /$1\u\L$2/x;

    return $_;
}
The rules for correctly capitalizing a headline or title in English are more complex than simply capitalizing the first letter of each word. If that's all you need to do, something like this should suffice:
s/(\w+\S*\w*)/\u\L$1/g;
Most style guides tell you that the first and last words in the title should always be capitalized, along with every other word that's not an article, the particle "to" in an infinitive construct, a coordinating conjunction, or a preposition.
Here's a demo, this time demonstrating the distinguishing property of titlecase. Assume the tc function is as defined in the Solution.
# with apologies (or kudos) to Steven Brust, PJF,
# and to JRRT, as always.
@data = (
            "the enchantress of \x{01F3}ur mountain",
    "meeting the enchantress of \x{01F3}ur mountain",
    "the lord of the rings: the fellowship of the ring",
);

$mask = "%-20s: %s\n";

sub tc_lame {
    local $_ = shift;
    s/(\w+\S*\w*)/\u\L$1/g;
    return $_;
}

for $datum (@data) { 
    printf $mask, "ALL CAPITALS",       uc($datum);
    printf $mask, "no capitals",        lc($datum);
    printf $mask, "simple titlecase",   tc_lame($datum);
    printf $mask, "better titlecase",   tc($datum);
    print "\n";
}

Interpolating Functions and Expressions Within Strings
You want a function call or expression to expand within a string. This lets you construct more complex templates than with simple scalar variable interpolation.
Break up your expression into distinct concatenated pieces:
$answer = $var1 . func( ) . $var2;   # scalar only
Or use the slightly sneaky @{[ LIST EXPR ]} or ${ \(SCALAR EXPR ) } expansions:
$answer = "STRING @{[ LIST EXPR ]} MORE STRING";
$answer = "STRING ${\( SCALAR EXPR )} MORE STRING";
This code shows both techniques. The first line shows concatenation; the second shows the expansion trick:
$phrase = "I have " . ($n + 1) . " guanacos.";
$phrase = "I have ${\($n + 1)} guanacos.";
The first technique builds the final string by concatenating smaller strings, avoiding interpolation but achieving the same end. Because print effectively concatenates its entire argument list, if we were going to print $phrase, we could have just said:
print "I have ",  $n + 1, " guanacos.\n";
When you absolutely must have interpolation, you need the punctuation-riddled interpolation from the Solution. Only @, $, and \ are special within double quotes and most backquotes. (As with m// and s///, the qx( ) synonym is not subject to double-quote expansion if its delimiter is single quotes! $home = qx'echo home is $HOME'; would get the shell $HOME variable, not one in Perl.) So, the only way to force arbitrary expressions to expand is by expanding a ${ } or @{ } whose block contains a reference.
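As a minimal sketch of the two expansions side by side, the following assumes a hypothetical helper named camelids (not part of the recipe) that returns a list. The @{[ ]} form interpolates a whole list, joined with spaces by default, while the ${\ } form interpolates a single scalar value:

```perl
use strict;
use warnings;

# camelids() is a hypothetical helper used only for illustration.
sub camelids { return ("llama", "alpaca", "guanaco") }

my $n     = 5;
my @kinds = camelids();

# ${\( EXPR )} takes a reference to the value and dereferences it in place.
my $sum   = "I have ${\( $n + 1 )} guanacos.";

# @{[ LIST ]} builds an anonymous array and dereferences it in place.
my $list  = "Known camelids: @{[ camelids() ]}.";

# scalar() forces a count rather than the list itself.
my $count = "That makes ${\ scalar(@kinds) } kinds.";

print "$sum\n$list\n$count\n";
```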
In the example:
$phrase = "I have ${\( count_em( ) )} guanacos.";
the function call within the parentheses is not in scalar context; it is still in list context. The following overrules that:
Additional content appearing in this section has been removed.
Indenting Here Documents
When using the multiline quoting mechanism called a here document, the text must be flush against the margin, which looks out of place in the code. You would like to indent the here document text in the code, but not have the indentation appear in the final string value.
Use a s/// operator to strip out leading whitespace.
# all in one
($var = <<HERE_TARGET) =~ s/^\s+//gm;
    your text
    goes here
HERE_TARGET

# or with two steps
$var = <<HERE_TARGET;
    your text
    goes here
HERE_TARGET
$var =~ s/^\s+//gm;
The substitution is straightforward. It removes leading whitespace from the text of the here document. The /m modifier lets the ^ character match at the start of each line in the string, and the /g modifier makes the pattern-matching engine repeat the substitution as often as it can (i.e., for every line in the here document).
($definition = << 'FINIS') =~ s/^\s+//gm;
    The five varieties of camelids
    are the familiar camel, his friends
    the llama and the alpaca, and the
    rather less well-known guanaco
    and vicuña.
FINIS
Be warned: all patterns in this recipe use \s, meaning one whitespace character, which will also match newlines. This means they will remove any blank lines in your here document. If you don't want this, replace \s with [^\S\n] in the patterns.
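The difference between the two patterns is easiest to see on a here document that contains a blank line. This is only a sketch; the variable names and the sample text are made up for the illustration:

```perl
use strict;
use warnings;

# \s+ also swallows the newline of an empty line, so the blank line vanishes.
(my $squeezed = <<'END_TEXT') =~ s/^\s+//gm;
    first stanza

    second stanza
END_TEXT

# [^\S\n]+ matches whitespace other than newline, so the blank line survives.
(my $kept = <<'END_TEXT') =~ s/^[^\S\n]+//gm;
    first stanza

    second stanza
END_TEXT

print "--- with \\s ---\n$squeezed";
print "--- with [^\\S\\n] ---\n$kept";
```

If you can rely on Perl 5.26 or later, the <<~ form of the here document strips the common leading whitespace for you, which may make the substitution unnecessary.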
The substitution uses the property that the result of an assignment can be used as the lefthand side of =~. This lets us do it all in one line, but works only when assigning to a variable. When you're using the here document directly, it would be considered a constant value, and you wouldn't be able to modify it. In fact, you can't change a here document's value unless you first put it into a variable.
Not to worry, though, because there's an easy way around this, particularly if you're going to do this a lot in the program. Just write a subroutine:
Additional content appearing in this section has been removed.
Reformatting Paragraphs
Your string is too big to fit the screen, and you want to break it up into lines of words, without splitting a word between lines. For instance, a style correction script might read a text file a paragraph at a time, replacing bad phrases with good ones. Replacing a phrase like utilizes the inherent functionality of with uses will change the length of lines, so it must somehow reformat the paragraphs when they're output.
Use the standard Text::Wrap module to put line breaks at the right place:
use Text::Wrap;
@output = wrap($leadtab, $nexttab, @para);
Or use the more discerning CPAN module, Text::Autoformat, instead:
use Text::Autoformat;
$formatted = autoformat $rawtext;
The Text::Wrap module provides the wrap function, shown in Example 1-3, which takes a list of lines and reformats them into a paragraph with no line more than $Text::Wrap::columns characters long. We set $columns to 20, ensuring that no line will be longer than 20 characters. We pass wrap two arguments before the list of lines: the first is the indent for the first line of output, the second the indent for every subsequent line.
Example 1-3. wrapdemo
  #!/usr/bin/perl -w
  # wrapdemo - show how Text::Wrap works
  @input = ("Folding and splicing is the work of an editor,",
            "not a mere collection of silicon",
            "and",
            "mobile electrons!");
  use Text::Wrap qw($columns &wrap);
  $columns = 20;
  print "0123456789" x 2, "\n";
  print wrap("    ", "  ", @input), "\n";
The result of this program is:
01234567890123456789
    Folding and
  splicing is the
  work of an
  editor, not a
  mere collection
  of silicon and
Additional content appearing in this section has been removed.
Escaping Characters
You need to output a string with certain characters (quotes, commas, etc.) escaped. For instance, you're producing a format string for sprintf and want to convert literal % signs into %%.
Use a substitution to backslash or double each character to be escaped:
# backslash
$var =~ s/([CHARLIST])/\\$1/g;

# double
$var =~ s/([CHARLIST])/$1$1/g;
$var is the variable to be altered. The CHARLIST is a list of characters to escape and can contain backslash escapes like \t and \n. If you just have one character to escape, omit the brackets:
$string =~ s/%/%%/g;
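To see why the doubling matters for the sprintf case mentioned in the Problem, here is a small sketch. The $label text and the variable names are assumptions chosen for the illustration:

```perl
use strict;
use warnings;

my $label = "100% wool";

# Copy the label, then double every % so sprintf treats it literally.
(my $escaped = $label) =~ s/%/%%/g;

# The escaped label is now safe to splice into a format string.
my $line = sprintf($escaped . ": %d items", 3);

print "$line\n";
```

Without the substitution, the "% w" in the label would be read as the start of a format directive rather than as literal text.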
The following code lets you do escaping when preparing strings to submit to the shell. (In practice, you would need to escape more than just ' and " to make any arbitrary string safe for the shell. Getting the list of characters right is so hard, and the risks if you get it wrong are so great, that you're better off using the list form of system and exec to run programs, shown in Recipe 16.2. They avoid the shell altogether.)
$string = q(Mom said, "Don't do that.");
$string =~ s/(['"])/\\$1/g;
We had to use two backslashes in the replacement because the replacement section of a substitution is read as a double-quoted string, and to get one backslash, you need to write two. Here's a similar example for VMS DCL, where you need to double every quote to get one through:
$string = q(Mom said, "Don't do that.");
$string =~ s/(['"])/$1$1/g;
Microsoft command interpreters are harder to work with. In Windows, COMMAND.COM recognizes double quotes but not single ones, disregards backquotes for running commands, and requires a backslash to make a double quote into a literal. Any of the many free or commercial Unix-like shell environments available for Windows will work just fine, though.
Additional content appearing in this section has been removed.
Trimming Blanks from the Ends of a String
You have read a string that may have leading or trailing whitespace, and you want to remove it.
Use a pair of pattern substitutions to get rid of them:
$string =~ s/^\s+//;
$string =~ s/\s+$//;
Or write a function that returns the new value:
$string = trim($string);
@many   = trim(@many);

sub trim {
    my @out = @_;
    for (@out) {
        s/^\s+//;          # trim left
        s/\s+$//;          # trim right
    }
    return @out == 1
              ? $out[0]   # only one to return
              : @out;     # or many
}
This problem has various solutions, but this one is the most efficient for the common case. This function returns new versions of the strings passed in to it with their leading and trailing whitespace removed. It works on both single strings and lists.
To remove the last character from the string, use the chop function. Be careful not to confuse this with the similar but different chomp function, which removes the end of the string only if that end matches the contents of the $/ variable, "\n" by default. These are often used to remove the trailing newline from input:
# print what's typed, but surrounded by > < symbols
while (<STDIN>) {
    chomp;
    print ">$_<\n";
}
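A short sketch of the difference, assuming $/ still has its default value of "\n" (the variable names here are made up for the illustration):

```perl
use strict;
use warnings;

my $line = "guanaco\n";

# chomp removes the trailing newline because it matches $/.
chomp(my $chomped = $line);

# A second chomp is harmless: there is no trailing newline left to remove.
chomp(my $twice = $chomped);

# chop removes the last character unconditionally, whatever it is.
chop(my $chopped = $chomped);

print "chomped: <$chomped>\n";
print "twice:   <$twice>\n";
print "chopped: <$chopped>\n";
```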
The trim function can be embellished in any of several ways.
First, what should you do if several strings are passed in, but the return context demands a single scalar? As written, the function given in the Solution does a somewhat silly thing: it (inadvertently) returns a scalar representing the number of strings passed in. This isn't very useful. You could issue a warning or raise an exception. You could also squash the list of return values together.
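One way to squash the return values together might look like the following sketch. The name trim_joined and the choice of a single space as the separator are assumptions made for the illustration, not part of the Solution:

```perl
use strict;
use warnings;

# trim_joined is a hypothetical variant of trim.
sub trim_joined {
    my @out = @_;
    for (@out) {
        s/^\s+//;          # trim left
        s/\s+$//;          # trim right
    }
    # wantarray is true in list context and false in scalar context.
    return wantarray ? @out : join(" ", @out);
}

my @list   = trim_joined("  llama ", " alpaca  ");   # list context
my $joined = trim_joined("  llama ", " alpaca  ");   # scalar context

print "list:   <@list>\n";
print "joined: <$joined>\n";
```

Raising an exception with die or croak in the scalar case would be an equally reasonable design if silently joining the strings could hide a caller's mistake.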
For strings with spans of extra whitespace at points other than their ends, you could have your function collapse any remaining stretch of whitespace characters in the interior of the string down to a single space each by adding this line as the new last line of the loop:
Additional content appearing in this section has been removed.
Parsing Comma-Separated Data
You have a data file containing comma-separated values that you need to read, but these data fields may have quoted commas or escaped quotes in them. Most spreadsheets and database programs use comma-separated values as a common interchange format.
If your data file follows normal Unix quoting and escaping conventions, where quotes within a field are backslash-escaped "like \"this\"", use the standard Text::ParseWords and this simple code:
use Text::ParseWords;
sub parse_csv0 {
    return quotewords("," => 0, $_[0]);
}
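As a quick illustration of what parse_csv0 returns, here is a sketch run against a made-up record. The sample line is an assumption; with a false second argument, quotewords drops the surrounding quotes and the escaping backslashes:

```perl
use strict;
use warnings;
use Text::ParseWords;

sub parse_csv0 {
    return quotewords("," => 0, $_[0]);
}

# A hypothetical record with a quoted comma and backslash-escaped quotes.
my $line   = q(alpaca,"llama, tame","say \"hi\"",42);
my @fields = parse_csv0($line);

for my $i (0 .. $#fields) {
    print "$i : <$fields[$i]>\n";
}
```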
However, if quotes within a field are doubled "like ""this""", you could use the following procedure from Mastering Regular Expressions, Second Edition:
sub parse_csv1 {
    my $text = shift;      # record containing comma-separated values
    my @fields  = ( );

    while ($text =~ m{
        # Either some non-quote/non-comma text:
        ( [^"',] + )

         # ...or...
         | 

        # ...a double-quoted field: (with "" allowed inside)

        " # field's opening quote; don't save this
         (   # now a field is either
          (?:     [^"]    # non-quotes or
              |
                  ""      # adjacent quote pairs
           ) *  # any number
         )
        " # field's closing quote; unsaved

    }gx)
    {
      if (defined $1) {
          $field = $1;
      } else {
          ($field = $2) =~ s/""/"/g;
      }
      push @fields, $field;
    }
    return @fields;
}
Or use the CPAN Text::CSV module:
use Text::CSV;
sub parse_csv2 {
    my $line = shift;
    my $csv = Text::CSV->new( );              
    return $csv->parse($line) && $csv->fields( );           
}
Or use the CPAN Tie::CSV_File module:
tie @data, "Tie::CSV_File", "data.csv";

for ($i = 0; $i < @data; $i++) {
    printf "Row %d (Line %d) is %s\n", $i, $i+1, "@{$data[$i]}";
    for ($j = 0; $j < @{$data[$i]}; $j++) {
        print "Column $j is <$data[$i][$j]>\n";
    } 
}
Additional content appearing in this section has been removed.
Constant Variables
You want a variable whose value cannot be modified once set.
If you don't need it to be a scalar variable that can interpolate, the use constant pragma will work:
use constant AVOGADRO => 6.02252e23;

printf "You need %g of those for guac\n", AVOGADRO;
If it does have to be a variable, assign to the typeglob a reference to a literal string or number, then use the scalar variable:
*AVOGADRO = \6.02252e23;
print "You need $AVOGADRO of those for guac\n";
But the most foolproof way is via a small tie class whose STORE method raises an exception:
package Tie::Constvar;
use Carp;
sub TIESCALAR {
    my ($class, $initval) = @_;
    my $var = $initval;
    return bless \$var => $class;
}
sub FETCH {
    my $selfref = shift;
    return $$selfref;
}
sub STORE {
    confess "Meddle not with the constants of the universe";
}
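The class above is used through tie. The following is a sketch of how that might look in a single file; the package main section, the eval wrapper, and the replacement value are assumptions added for the illustration:

```perl
use strict;
use warnings;

package Tie::Constvar;
use Carp;
sub TIESCALAR {
    my ($class, $initval) = @_;
    my $var = $initval;
    return bless \$var => $class;
}
sub FETCH {
    my $selfref = shift;
    return $$selfref;
}
sub STORE {
    confess "Meddle not with the constants of the universe";
}

package main;

# Bind a lexical scalar to the class, passing the initial value.
tie my $AVOGADRO, 'Tie::Constvar', 6.02252e23;
print "You need $AVOGADRO of those for guac\n";

# Any assignment calls STORE, which raises an exception.
my $ok = eval { $AVOGADRO = 6.6256e-34; 1 };
print "Assignment refused\n" unless $ok;
```

Because the variable is tied rather than aliased through a typeglob, this approach also works for lexical variables declared with my.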
The use constant pragma is the easiest to use, but has a few drawbacks. The biggest one is that it doesn't give you a variable that you can expand in double-quoted strings. Another is that it isn't scoped; it puts a subroutine of that name into the package namespace.
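The interpolation drawback can be worked around with the @{[ ]} expansion. This is a sketch, with variable names chosen only for the illustration:

```perl
use strict;
use warnings;
use constant AVOGADRO => 6.02252e23;

# A constant's name is just text inside double quotes.
my $plain  = "AVOGADRO molecules";

# Wrapping the call in @{[ ]} forces it to be evaluated and expanded.
my $forced = "@{[ AVOGADRO ]} molecules";

print "$plain\n$forced\n";
```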
The way the pragma really works is to create a subroutine of that name that takes no arguments and always returns the same value (or values if a list is provided). That means it goes into the current package's namespace and isn't scoped. You could do the same thing yourself this way:
sub AVOGADRO( ) { 6.02252e23 }
If you wanted it scoped to the current block, you could make a temporary subroutine by assigning an anonymous subroutine to the typeglob of that name:
use subs qw(AVOGADRO);
local *AVOGADRO = sub ( ) { 6.02252e23 };
But that's pretty magical, so you should comment the code if you don't plan to use the pragma.
If instead of assigning to the typeglob a reference to a subroutine, you assign to it a reference to a constant scalar, then you'll be able to use the variable of that name. That's the second technique given in the Solution. Its disadvantage is that typeglobs are available only for package variables, not for lexicals created via
Additional content appearing in this section has been removed.
Soundex Matching
You have two English surnames and want to know whether they sound somewhat similar, regardless of spe