
Perl Cookbook
Tips and Tricks for Perl Programmers

By Tom Christiansen, Nathan Torkington

Chapter 1: Strings
He multiplieth words without knowledge.
—Job 35:16
Many programming languages force you to work at an uncomfortably low level. You think in lines, but your language wants you to deal with pointers. You think in strings, but it wants you to deal with bytes. Such a language can drive you to distraction. Don't despair, though—Perl isn't a low-level language; lines and strings are easy to handle.
Perl was designed for text manipulation. In fact, Perl can manipulate text in so many ways that they can't all be described in one chapter. Check out other chapters for recipes on text processing. In particular, see Chapter 6 and Chapter 8, which discuss interesting techniques not covered here.
Perl's fundamental unit for working with data is the scalar, that is, single values stored in single (scalar) variables. Scalar variables hold strings, numbers, and references. Array and hash variables hold lists or associations of scalars, respectively. References are used for referring to other values indirectly, not unlike pointers in low-level languages. Numbers are usually stored in your machine's double-precision floating-point notation. Strings in Perl may be of any length (within the limits of your machine's virtual memory) and contain any data you care to put there—even binary data containing null bytes.
A string is not an array of bytes: You cannot use array subscripting on a string to address one of its characters; use substr for that. Like all data types in Perl, strings grow and shrink on demand. They get reclaimed by Perl's garbage collection system when they're no longer used, typically when the variables holding them go out of scope or when the expression they were used in has been evaluated. In other words, memory management is already taken care of for you, so you don't have to worry about it.
A scalar value is either defined or undefined. If defined, it may hold a string, number, or reference. The only undefined value is
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Accessing Substrings
You want to access or modify just a portion of a string, not the whole thing. For instance, you've read a fixed-width record and want to extract the individual fields.
The substr function lets you read from and write to bits of the string.
$value = substr($string, $offset, $count);
$value = substr($string, $offset);
    
substr($string, $offset, $count) = $newstring;
substr($string, $offset)         = $newtail;
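As a minimal sketch of the read and write forms above (the sample string and the replacement words are our own, chosen only for illustration):

```perl
use strict;
use warnings;

my $string = "This is what you have";

# Read: 4 characters starting at offset 8.
my $word = substr($string, 8, 4);       # "what"

# Write: replace those 4 characters; the string grows as needed.
substr($string, 8, 4) = "everything";

# Write: replace the first 4 characters.
substr($string, 0, 4) = "That";

print "$word\n";      # what
print "$string\n";    # That is everything you have
```

Assigning to substr changes the string in place, so the replacement does not have to be the same length as the text it overwrites.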
The unpack function gives only read access, but is faster when you have many substrings to extract.
# get a 5-byte string, skip 3, then grab 2 8-byte strings, then the rest
($leading, $s1, $s2, $trailing) =
    unpack("A5 x3 A8 A8 A*", $data);

# split at five byte boundaries
@fivers = unpack("A5" x (length($string)/5), $string);

# chop string into individual characters
@chars  = unpack("A1" x length($string), $string);
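For the fixed-width record mentioned in the Problem, a small sketch might look like the following. The record layout (a 10-byte name, a 3-byte age, then a city) is hypothetical and exists only for this example.

```perl
use strict;
use warnings;

# Hypothetical layout: 10-byte name, 3-byte age, rest of the line is the city.
my $record = "Alice     042Springfield";

# The "A" format strips trailing spaces from each field it extracts.
my ($name, $age, $city) = unpack("A10 A3 A*", $record);

print "$name is $age and lives in $city\n";
```

Because every field comes out of a single unpack call, there is no need to track offsets by hand as there would be with repeated substr calls.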
Unlike many other languages that represent strings as arrays of bytes (or characters), in Perl, strings are a basic data type. This means that you must use functions like unpack or substr to access individual characters or a portion of the string.
The offset argument to substr indicates the start of the substring you're interested in, counting from the front if positive and from the end if negative. If offset is 0, the substring starts at the beginning. The count argument is the length of the substring.
$string = "This is what you have";
#         +012345678901234567890  Indexing forwards  (left to right)
#          109876543210987654321- Indexing backwards (right to left)
#           note that 0 means 10 or 20, etc. above

$first  = substr($string, 0, 1);  # "T"
$start  = substr($string, 5, 2);  # "is"
$rest   = substr($string, 13);    # "you have"
$last   = substr($string, -1);    # "e"
$end    = substr($string, -4);    # "have"
$piece  = substr($string, -8, 3); # "you"
You can do more than just look at parts of the string with
Establishing a Default Value
You would like to give a default value to a scalar variable, but only if it doesn't already have one. It often happens that you want a hard-coded default value for a variable that can be overridden from the command-line or through an environment variable.
Use the || or ||= operator, which work on both strings and numbers:
# use $b if $b is true, else $c
$a = $b || $c;              

# set $x to $y unless $x is already true
$x ||= $y;
If 0 or "0" are valid values for your variables, use defined instead:
# use $b if $b is defined, else $c
$a = defined($b) ? $b : $c;
The big difference between the two techniques (defined and ||) is what they test: definedness versus truth. Three defined values are still false in the world of Perl: 0, "0", and "". If your variable already held one of those, and you wanted to keep that value, a || wouldn't work. You'd have to use the clumsier tests with defined instead. It's often convenient to arrange for your program to care only about true or false values, not defined or undefined ones.
Rather than being restricted in its return values to a mere 1 or 0 as in most other languages, Perl's || operator has a much more interesting property: It returns its first operand (the left-hand side) if that operand is true; otherwise it returns its second operand. The && operator also returns the last evaluated expression, but is less often used for this property. These operators don't care whether their operands are strings, numbers, or references; any scalar will do. They just return the first one that makes the whole expression true or false. This doesn't affect the Boolean sense of the return value, but it does make the operators more convenient to use.
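The difference between testing truth and testing definedness is easiest to see with a value of 0. The sketch below uses variable names of our own; the last assignment uses the // (defined-or) operator, which assumes Perl 5.10 or later and is not part of the recipe as written.

```perl
use strict;
use warnings;

my $count     = 0;     # defined, but false
my $env_shell = "";    # defined, but false

# || tests truth, so a legitimate 0 is replaced by the default.
my $by_truth = $count || 10;                        # 10

# defined tests definedness, so the 0 survives.
my $by_defined = defined($count) ? $count : 10;     # 0

# || returns the operand itself, not a bare 1 or 0.
my $shell = $env_shell || "/bin/sh";                # "/bin/sh"

# && returns the last operand it evaluated.
my $both = "left" && "right";                       # "right"

# Assumption: Perl 5.10+ provides //, which tests definedness like the ?: form.
my $by_dor = $count // 10;                          # 0

print "$by_truth $by_defined $shell $both $by_dor\n";
```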
This property lets you provide a default value to a variable, function, or longer expression in case the first part doesn't pan out. Here's an example of
Exchanging Values Without Using Temporary Variables
You want to exchange the values of two scalar variables, but don't want to use a temporary variable.
Use list assignment to reorder the variables.
($VAR1, $VAR2) = ($VAR2, $VAR1);
Most programming languages force you to use an intermediate step when swapping two variables' values:
$temp    = $a;
$a       = $b;
$b       = $temp;
Not so in Perl. It tracks both sides of the assignment, guaranteeing that you don't accidentally clobber any of your values. This lets you eliminate the temporary variable:
$a       = "alpha";
$b       = "omega";
($a, $b) = ($b, $a);        # the first shall be last -- and versa vice
You can even exchange more than two variables at once:
($alpha, $beta, $production) = qw(January March August);
# move beta       to alpha,
# move production to beta,
# move alpha      to production
($alpha, $beta, $production) = ($beta, $production, $alpha);
When this code finishes, $alpha, $beta, and $production have the values "March", "August", and "January".
The section on "List value constructors" in perldata(1) and Chapter 2 of Programming Perl
Converting Between ASCII Characters and Values
You want to print out the number represented by a given ASCII character, or you want to print out an ASCII character given a number.
Use ord to convert a character to a number, or use chr to convert a number to a character:
$num  = ord($char);
$char = chr($num);
The %c format used in printf and sprintf also converts a number to a character:
$char = sprintf("%c", $num);                # slower than chr($num)
printf("Number %d is character %c\n", $num, $num);

Number 101 is character e
A C* template used with pack and unpack can quickly convert many characters.
@ascii  = unpack("C*", $string);
$string = pack("C*", @ascii);
Unlike low-level, typeless languages like assembler, Perl doesn't treat characters and numbers interchangeably; it treats strings and numbers interchangeably. That means you can't just assign characters and numbers back and forth. Perl provides Pascal's chr and ord to convert between a character and its corresponding ordinal value:
$ascii_value = ord("e");    # now 101
$character   = chr(101);    # now "e"
If you already have a character, it's really represented as a string of length one, so just print it out directly using print or the %s format in printf and sprintf. The %c format forces printf or sprintf to convert a number into a character; it's not used for printing a character that's already in character format (that is, a string).
printf("Number %d is character %c\n", 101, 101);
The pack, unpack, chr, and ord functions are all faster than sprintf. Here are pack and unpack in action:
@ascii_character_numbers = unpack("C*", "sample");
print "@ascii_character_numbers\n";
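A round trip through unpack and pack can also be used to do arithmetic on character codes. The following is only an illustrative sketch with a sample string of our own; it assumes plain ASCII input.

```perl
use strict;
use warnings;

my @codes = unpack("C*", "HAL");      # (72, 65, 76)
$_++ for @codes;                      # add one to each character code
my $shifted = pack("C*", @codes);     # "IBM"

# chr and ord do the same job one character at a time.
my $same = join "", map { chr(ord($_) + 1) } split //, "HAL";

print "$shifted $same\n";             # IBM IBM
```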
Processing a String One Character at a Time
You want to process a string one character at a time.
Use split with a null pattern to break up the string into individual characters, or use unpack if you just want their ASCII values:
@array = split(//, $string);

@array = unpack("C*", $string);
Or extract each character in turn with a loop:
while (/(.)/g) {    # . is never a newline here
    # do something with $1
}
As we said before, Perl's fundamental unit is the string, not the character. Needing to process anything a character at a time is rare. Usually some kind of higher-level Perl operation, like pattern matching, solves the problem more easily. See, for example, Section 7.7, where a set of substitutions is used to find command-line arguments.
Splitting on a pattern that matches the empty string returns a list of the individual characters in the string. This is a convenient feature when done intentionally, but it's easy to do unintentionally. For instance, /X*/ matches the empty string. Odds are you will find others when you don't mean to.
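To make the /X*/ case concrete, here is a small sketch using a sample string of our own. Where an X is present the pattern consumes it as a separator; everywhere else it matches the empty string and splits between characters.

```perl
use strict;
use warnings;

# Intentional: an empty pattern yields one element per character.
my @chars = split(//, "abc");

# Probably unintentional: "bc" is split apart too, because X* also
# matches the empty string between "b" and "c".
my @pieces = split(/X*/, "aXXbc");

print join("|", @chars),  "\n";    # a|b|c
print join("|", @pieces), "\n";    # a|b|c
```

If the intent was to split only where an X actually occurs, /X+/ would be the pattern to use.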
Here's an example that prints the characters used in the string "an apple a day", sorted in ascending ASCII order:
%seen = ();
$string = "an apple a day";
foreach $byte (split //, $string) {
    $seen{$byte}++;
}
print "unique chars are: ", sort(keys %seen), "\n";

unique chars are:  adelnpy
These split and unpack solutions give you an array of characters to work with. If you don't want an array, you can use a pattern match with the /g flag in a while loop, extracting one character at a time:
%seen = ();
$string = "an apple a day";
while ($string =~ /(.)/g) {
    $seen{$1}++;
}
print "unique chars are: ", sort(keys %seen), "\n";

unique chars are:  adelnpy
Reversing a String by Word or Character
You want to reverse the characters or words of a string.
Use the reverse function in scalar context for flipping bytes.
$revbytes = reverse($string);
To flip words, use reverse in list context with split and join:
$revwords = join(" ", reverse split(" ", $string));
The reverse function is two different functions in one. When called in scalar context, it joins together its arguments and returns that string in reverse order. When called in list context, it returns its arguments in the opposite order. When using reverse for its byte-flipping behavior, use scalar to force scalar context unless it's entirely obvious.
$gnirts   = reverse($string);       # reverse letters in $string

@sdrow    = reverse(@words);        # reverse elements in @words

$confused = reverse(@words);        # reverse letters in join("", @words)
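The context trap is most visible with print, which supplies list context to its arguments. A short sketch, using a sample word of our own:

```perl
use strict;
use warnings;

my $word = "stressed";

my ($as_list)  = reverse($word);    # list context: a one-element list, unchanged
my $as_scalar  = reverse($word);    # scalar context: characters flipped

print reverse($word), "\n";             # prints "stressed"
print scalar(reverse($word)), "\n";     # prints "desserts"
```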
Here's an example of reversing words in a string. Using a single space, " ", as the pattern to split is a special case. It causes split to use contiguous whitespace as the separator and also discard any leading null fields, just like awk. Normally, split discards only trailing null fields.
# reverse word order
$string = 'Yoda said, "can you see this?"';
@allwords    = split(" ", $string);
$revwords    = join(" ", reverse @allwords);
print $revwords, "\n";

this?" see you "can said, Yoda
We could remove the temporary array @allwords and do it on one line:
$revwords = join(" ", reverse split(" ", $string));
Multiple whitespace in $string becomes a single space in $revwords. If you want to preserve whitespace, use this:
$revwords = join("", reverse split(/(\s+)/, $string));
One use of reverse is to test whether a word is a palindrome (a word that reads the same backward or forward):
Expanding and Compressing Tabs
You want to convert tabs in a string to the appropriate number of spaces, or vice versa. Converting spaces into tabs can be used to reduce file size when the file has many consecutive spaces. Converting tabs into spaces may be required when producing output for devices that don't understand tabs or think they're at different positions than you do.
Either use a rather funny looking substitution:
while ($string =~ s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e) {
    # spin in empty loop until substitution finally fails
}
Or the standard Text::Tabs module:
use Text::Tabs;
@expanded_lines  = expand(@lines_with_tabs);
@tabulated_lines = unexpand(@lines_without_tabs);
Assuming that tab stops are set every N positions (where N is customarily eight), it's easy to convert them into spaces. The standard, textbook method does not use the Text::Tabs module but suffers from being difficult to understand. Also, it uses the $` variable, whose very mention currently slows down every pattern match in the program. The reason for this is given in Section 6.0.3 in Chapter 6.
while (<>) {
    1 while s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;
    print;
}
If you're looking at the second while loop and wondering why it couldn't have been written as part of a simple s///g instead, it's because you need to recalculate the length from the start of the line again each time (stored in $`) rather than merely from where the last match occurred.
The obscure convention 1 while CONDITION is the same as while (CONDITION) { }, but shorter. Its origins date to when Perl ran the first form far faster than the second. The second is now almost as fast, but the first remains convenient, and the habit has stuck.
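To see that the substitution and the module agree, the sketch below expands the same sample line both ways. It assumes the default Text::Tabs tab stop of 8; the sample string is our own.

```perl
use strict;
use warnings;
use Text::Tabs;    # exports expand() and unexpand()

my $line = "ab\t\tc";

# The substitution from the Solution, with tab stops every 8 columns.
my $by_regex = $line;
1 while $by_regex =~ s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;

# The module version.
my $by_module = expand($line);

# Two tabs after "ab" move "c" to column 16, so 14 spaces are needed.
print length($by_regex), "\n";                            # 17
print $by_regex eq $by_module ? "same\n" : "different\n";
```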
The standard Text::Tabs module provides conversion functions to convert both directions, exports a
Expanding Variables in User Input
You've read in a string with an embedded variable reference, such as:
You owe $debt to me.
Now you want to replace $debt in the string with its value.
Use a substitution with symbolic references if the variables are all globals:
$text =~ s/\$(\w+)/${$1}/g;
But use a double /ee if they might be lexical (my) variables:
$text =~ s/(\$\w+)/$1/gee;
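As a minimal sketch of the /ee form, here it is applied to the sentence from the Problem with a lexical $debt. The value 100 is made up for the example, and the sketch assumes every name in the text corresponds to a variable that is in scope.

```perl
use strict;
use warnings;

my $debt = 100;                        # lexical, so ${$1} could not see it
my $text = 'You owe $debt to me.';     # single quotes: no interpolation yet

# First evaluation turns $1 into the string '$debt';
# the second evaluates that string as Perl code.
$text =~ s/(\$\w+)/$1/gee;

print "$text\n";                       # You owe 100 to me.
```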
The first technique is basically "find what looks like a variable name, and then use symbolic dereferencing to interpolate its contents." If $1 contains the string somevar, then ${$1} will be whatever $somevar contains. This won't work if the use strict 'refs' pragma is in effect because that bans symbolic dereferencing.
Here's an example:
use vars qw($rows $cols);
no strict 'refs';                   # for ${$1}/g below
my $text;

($rows, $cols) = (24, 80);
$text = q(I am $rows high and $cols long);  # like single quotes!
$text =~ s/\$(\w+)/${$1}/g;
print $text;

I am 24 high and 80 long
You may have seen the /e substitution modifier used to evaluate the replacement as code rather than as a string. It's designed for situations such as doubling every whole number in a string:
$text = "I am 17 years old";
$text =~ s/(\d+)/2 * $1/eg;
When Perl is compiling your program and sees a /e on a substitution, it compiles the code in the replacement block along with the rest of your program, long before the substitution actually happens. When a substitution is made, $1 is replaced with the string that matched. The code to evaluate would then be something like:
2 * 17
If we tried saying:
$text = 'I am $AGE years old';      # note single quotes
$text =~ s/(\$\w+)/$1/eg;           # WRONG
Controlling Case
A string in uppercase needs converting to lowercase, or vice versa.
Use the lc and uc functions or the \L and \U string escapes.
use locale;                     # needed in 5.004 or above

$big = uc($little);             # "bo peep" -> "BO PEEP"
$little = lc($big);             # "JOHN"    -> "john"
$big = "\U$little";             # "bo peep" -> "BO PEEP"
$little = "\L$big";             # "JOHN"    -> "john"
To alter just one character, use the lcfirst and ucfirst functions or the \l and \u string escapes.
$big = "\u$little";             # "bo"      -> "Bo"
$little = "\l$big";             # "BoPeep"  -> "boPeep"
The functions and string escapes look different, but both do the same thing. You can set the case of either the first character or the whole string. You can even do both at once to force uppercase on initial characters and lowercase on the rest.
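Doing both at once might look like the sketch below. The sample name is our own and the data is plain ASCII, so use locale is left out here.

```perl
use strict;
use warnings;

my $name = "bO pEEP";

# Lowercase everything, then uppercase the first character.
my $by_function = ucfirst(lc($name));    # "Bo peep"
my $by_escape   = "\u\L$name";           # "Bo peep"

# Apply the same idea to every word in the string.
(my $title = $name) =~ s/(\w+)/\u\L$1/g; # "Bo Peep"

print "$by_function\n$by_escape\n$title\n";
```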
The use locale directive tells Perl's case-conversion functions and pattern matching engine to respect your language environment, allowing for characters with diacritical marks, and so on. A common mistake is to use tr/// to convert case. (We're aware that the old Camel book recommended tr/A-Z/a-z/. In our defense, that was the only way to do it back then.) This won't work in all situations because when you say tr/A-Z/a-z/ you have omitted all characters with umlauts, accent marks, cedillas, and other diacritics used in dozens of languages, including English. The uc and \U case-changing commands understand these characters and convert them properly, at least when you've said use locale. (An exception is that in German, the uppercase form of ß is SS, but it's not in Perl.)
Interpolating Functions and Expressions Within Strings
You want a function call or expression to expand within a string. This lets you construct more complex templates than with simple scalar variable interpolation.
You can break up your expression into distinct concatenated pieces:
$answer = $var1 . func() . $var2;   # scalar only
Or you can use the slightly sneaky @{[ LIST EXPR ]} or ${\( SCALAR EXPR )} expansions:
$answer = "STRING @{[ LIST EXPR ]} MORE STRING";
$answer = "STRING ${\( SCALAR EXPR )} MORE STRING";
This code shows both techniques. The first line shows concatenation; the second shows the expansion trick:
$phrase = "I have " . ($n + 1) . " guanacos.";
$phrase = "I have ${\($n + 1)} guanacos.";
The first technique builds the final string by concatenating smaller strings, avoiding interpolation but achieving the same end. Because print effectively concatenates its entire argument list, if we were going to print $phrase, we could have just said:
print "I have ",  $n + 1, " guanacos.\n";
When you absolutely must have interpolation, you need the punctuation-riddled interpolation from the Solution. Only @, $, and \ are special within double quotes and most backquotes. (As with m// and s///, the qx() synonym is not subject to double-quote expansion if its delimiter is single quotes! $home = qx'echo home is $HOME'; would get the shell $HOME variable, not one in Perl.) So, the only way to force arbitrary expressions to expand is by expanding a ${} or @{} whose block contains a reference.
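A short sketch of both expansions with a function call follows. The area subroutine is a hypothetical helper written just for this example.

```perl
use strict;
use warnings;

# Hypothetical helper, defined only for this illustration.
sub area { my ($w, $h) = @_; return $w * $h }

my ($w, $h) = (3, 4);

# @{[ ... ]} builds an anonymous array and interpolates it.
my $with_array = "Area is @{[ area($w, $h) ]} square units";

# ${\( ... )} takes a reference to the value and dereferences it.
my $with_ref = "Area is ${\(area($w, $h))} square units";

# A list with several elements is joined with a space, as any array would be.
my $doubled = "Doubled: @{[ $w * 2, $h * 2 ]}";

print "$with_array\n$with_ref\n$doubled\n";
```

Note that the @{[ ]} form evaluates its contents in list context, which matters for functions that return different things in list and scalar context.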
You can do more than simply assign to a variable after interpolation. It's a general mechanism that can be used in any double-quoted string. For instance, this example will build a string with an interpolated expression and pass the result to a function:
Indenting Here Documents
When using the multiline quoting mechanism called a here document, the text must be flush against the margin, which looks out of place in the code. You would like to indent the here document text in the code, but not have the indentation appear in the final string value.
Use a s/// operator to strip out leading whitespace.
# all in one
($var = <<HERE_TARGET) =~ s/^\s+//gm;
    your text
    goes here
HERE_TARGET

# or with two steps
$var = <<HERE_TARGET;
    your text
    goes here
HERE_TARGET
$var =~ s/^\s+//gm;
The substitution is straightforward. It removes leading whitespace from the text of the here document. The /m modifier lets the ^ character match at the start of each line in the string, and the /g modifier makes the pattern matching engine repeat the substitution as often as it can (i.e., for every line in the here document).
($definition = <<'FINIS') =~ s/^\s+//gm;
    The five varieties of camelids
    are the familiar camel, his friends
    the llama and the alpaca, and the
    rather less well-known guanaco
    and vicuña.
FINIS
Be warned: all the patterns in this recipe use \s, which will also match newlines. This means they will remove any blank lines in your here document. If you don't want this, replace \s with [^\S\n] in the patterns.
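The effect on blank lines can be checked with a two-paragraph here document. This is only a sketch; the text and the END terminator are our own.

```perl
use strict;
use warnings;

# \s also matches "\n", so the blank line is swallowed.
(my $squeezed = <<'END') =~ s/^\s+//gm;
    first paragraph

    second paragraph
END

# [^\S\n] matches whitespace other than newline, so the blank line survives.
(my $kept = <<'END') =~ s/^[^\S\n]+//gm;
    first paragraph

    second paragraph
END

print $squeezed, "---\n", $kept;
```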
The substitution makes use of the property that the result of an assignment can be used as the left-hand side of =~. This lets us do it all in one line, but it only works when you're assigning to a variable. When you're using the here document directly, it would be considered a constant value and you wouldn't be able to modify it. In fact, you can't change a here document's value unless you first put it into a variable.
Not to worry, though, because there's an easy way around this, particularly if you're going to do this a lot in the program. Just write a subroutine to do it:
Reformatting Paragraphs
Your string is too big to fit the screen, and you want to break it up into lines of words, without splitting a word between lines. For instance, a style correction script might read a text file a paragraph at a time, replacing bad phrases with good ones. Replacing a phrase like utilizes the inherent functionality of with uses will change the length of lines, so it must somehow reformat the paragraphs when they're output.
Use the standard Text::Wrap module to put line breaks at the right place.
use Text::Wrap;
@OUTPUT = wrap($LEADTAB, $NEXTTAB, @PARA);
The Text::Wrap module provides the wrap function, shown in Example 1.3, which takes a list of lines and reformats them into a paragraph having no line more than $Text::Wrap::columns characters long. We set $columns to 20, ensuring that no line will be longer than 20 characters. We pass wrap two arguments before the list of lines: the first is the indent for the first line of output, the second the indent for every subsequent line.
Example 1.3. wrapdemo
#!/usr/bin/perl -w
# wrapdemo - show how Text::Wrap works

@input = ("Folding and splicing is the work of an editor,",
          "not a mere collection of silicon",
          "and",
          "mobile electrons!");

use Text::Wrap qw($columns &wrap);

$columns = 20;
print "0123456789" x 2, "\n";
print wrap("    ", "  ", @input), "\n";
The result of this program is:
01234567890123456789
    Folding and
  splicing is the
  work of an
  editor, not a
  mere collection
Escaping Characters
You need to output a string with certain characters (quotes, commas, etc.) escaped. For instance, you're producing a format string for sprintf and want to convert literal % signs into %%.
Use a substitution to backslash or double each character to be escaped.
# backslash
$var =~ s/([CHARLIST])/\\$1/g;

# double
$var =~ s/([CHARLIST])/$1$1/g;
$var is the variable to be altered. The CHARLIST is a list of characters to escape and can contain backslash escapes like \t and \n. If you just have one character to escape, omit the brackets:
$string =~ s/%/%%/g;
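For the sprintf case from the Problem, a sketch might look like this. The label text and the item count are made up for the example.

```perl
use strict;
use warnings;

my $label = "100% pure";

# Double each % so sprintf treats it as a literal percent sign.
(my $safe = $label) =~ s/%/%%/g;          # "100%% pure"

# The escaped label can now be embedded in a format string.
my $out = sprintf($safe . ": %d items", 3);

print "$out\n";                           # 100% pure: 3 items
```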
The following lets you do escaping when preparing strings to submit to the shell. (In practice, you would need to escape more than just ' and " to make any arbitrary string safe for the shell. Getting the list of characters right is so hard, and the risks if you get it wrong are so great, that you're better off using the list form of system and exec to run programs, shown in Section 16.2. They avoid the shell altogether.)
$string = q(Mom said, "Don't do that.");
$string =~ s/(['"])/\\$1/g;
We had to use two backslashes in the replacement because the replacement section of a substitution is read as a double-quoted string, and to get one backslash, you need to write two. Here's a similar example for VMS DCL, where you need to double every quote to get one through:
$string = q(Mom said, "Don't do that.");
$string =~ s/(['"])/$1$1/g;
Microsoft command interpreters are harder to work with. In DOS and Windows, COMMAND.COM recognizes double quotes but not single ones, has no clue what to do with backquotes, and requires a backslash to make a double quote a literal. Almost any of the free or commercial Unix-like shell environments for Windows will improve this depressing situation.
Because we're using character classes in the regular expressions, we can use
Trimming Blanks from the Ends of a String
You have read a string that may have leading or trailing whitespace, and you want to remove it.
Use a pair of pattern substitutions to get rid of them:
$string =~ s/^\s+//;
$string =~ s/\s+$//;
You can also write a function that returns the new value:
$string = trim($string);
@many   = trim(@many);

sub trim {
    my @out = @_;
    for (@out) {
        s/^\s+//;
        s/\s+$//;
    }
    return wantarray ? @out : $out[0];
}
This problem has various solutions, but this is the most efficient for the common case.
If you want to remove the last character from the string, use the chop function. Version 5 added chomp, which removes the last character if and only if it is contained in the $/ variable, "\n" by default. These are often used to remove the trailing newline from input:
# print what's typed, but surrounded by >< symbols
while(<STDIN>) {
    chomp;
    print ">$_<\n";
}
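The difference between the two functions shows up when they are applied more than once. A small sketch, assuming $/ still holds its default of "\n":

```perl
use strict;
use warnings;

my $line = "camel\n";

chomp(my $once = $line);       # "camel": trailing newline removed
chomp(my $twice = $once);      # still "camel": nothing left to remove
chop(my $chopped = $once);     # "came": chop always removes one character

print "$once $twice $chopped\n";
```

This is why chomp is the safer choice for input lines: calling it on a string that has no trailing newline leaves the string alone.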
The s/// operator in perlre(1) and perlop(1) and the "Pattern Matching" section of Chapter 2 of Programming Perl; the chomp and chop functions in perlfunc(1) and Chapter 3 of Programming Perl; we trim leading and trailing whitespace in the getnum function in Section 2.1.
Parsing Comma-Separated Data
You have a data file containing comma-separated values that you need to read in, but these data fields may have quoted commas or escaped quotes in them. Most spreadsheets and database programs use comma-separated values as a common interchange format.
Use the procedure in Mastering Regular Expressions.
sub parse_csv {
    my $text = shift;      # record containing comma-separated values
    my @new  = ();
    push(@new, $+) while $text =~ m{
        # the first part groups the phrase inside the quotes.
        # see explanation of this pattern in MRE
        "([^\"\\]*(?:\\.[^\"\\]*)*)",?
           |  ([^,]+),?
           | ,
    }gx;
    push(@new, undef) if substr($text, -1,1) eq ',';
    return @new;      # list of values that were comma-separated
}
Or use the standard Text::ParseWords module.
use Text::ParseWords;

sub parse_csv {
    return quotewords(",", 0, $_[0]);
}
Comma-separated input is a deceptive and complex format. It sounds simple, but involves a fairly complex escaping system because the fields themselves can contain commas. This makes the pattern matching solution complex and rules out a simple split /,/.
Fortunately, Text::ParseWords hides the complexity from you. Pass its quotewords function two arguments and the CSV string. The first argument is the separator (a comma, in this case) and the second is a true or false value controlling whether the strings are returned with quotes around them.
If you want to represent quotation marks inside a field delimited by quotation marks, escape them with backslashes "like\"this\"". Quotation marks and backslashes are the only characters that have meaning backslashed. Any other use of a backslash will be left in the output string.
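The effect of the second argument and of backslashed quotes can be sketched as follows. The sample records are our own and the behavior shown is that of quotewords as described above.

```perl
use strict;
use warnings;
use Text::ParseWords;

my $line = q{Perl,"Wall, Larry",1987};

# Second argument false: surrounding quotes are stripped.
my @plain = quotewords(",", 0, $line);

# Second argument true: surrounding quotes are kept.
my @quoted = quotewords(",", 1, $line);

# Backslashed quotes inside a quoted field come through as plain quotes.
my @escaped = quotewords(",", 0, q{"say \"hi\"",x});

print join("|", @plain),   "\n";    # Perl|Wall, Larry|1987
print join("|", @quoted),  "\n";    # Perl|"Wall, Larry"|1987
print join("|", @escaped), "\n";    # say "hi"|x
```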
Here's how you'd use the parse_csv subroutines. The
Soundex Matching
You have two English surnames and want to know whether they sound somewhat similar, regardless of spelling. This would let you offer users a "fuzzy search" of names in a telephone book to catch "Smith" and "Smythe" and others within the set, such as "Smite" and "Smote."
Use the standard Text::Soundex module:
use Text::Soundex;

 $CODE  = soundex($STRING);
 @CODES = soundex(@LIST);
The soundex algorithm hashes words (particularly English surnames) into a small space using a simple model that approximates an English speaker's pronunciation of the words. Roughly speaking, each word is reduced to a four-character string. The first character is an uppercase letter; the remaining three are digits. By comparing the soundex values of two strings, we can guess whether they sound similar.
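To make the reduction to a letter plus three digits concrete, here is a simplified pure-Perl sketch of the classic algorithm. The function name simple_soundex is hypothetical, and this is not the Text::Soundex source; it only follows the textbook rules (letters grouped into sound classes, repeated codes collapsed, vowels dropped, result padded with zeros), so edge cases may differ from the module.

```perl
use strict;
use warnings;

# Simplified, hypothetical soundex: NOT the Text::Soundex implementation.
sub simple_soundex {
    my ($name) = @_;
    $name = uc $name;
    $name =~ tr/A-Z//cd;                 # keep letters only
    return unless length $name;

    my $first = substr($name, 0, 1);     # the leading letter is kept as-is

    # Map every letter to its sound class; vowels and H, W, Y map to 0.
    (my $codes = $name) =~
        tr/AEHIOUWYBFPVCGJKQSXZDTLMNR/00000000111122222222334556/;

    $codes =~ tr/0-6//s;                 # squeeze runs of the same code
    $codes = substr($codes, 1);          # drop the code of the first letter
    $codes =~ tr/0//d;                   # discard the vowel class

    return substr($first . $codes . "000", 0, 4);   # pad or cut to 4 chars
}

for my $name (qw(Smith Smythe Smite Smote Jones)) {
    printf "%-7s %s\n", $name, simple_soundex($name);
}
```

The four "Smith" variants all hash to the same code, while "Jones" lands elsewhere, which is the property a fuzzy name search relies on.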
The following program prompts for a name and looks for similarly sounding names from the password file. This same approach works on any database with names, so you could key the database on the soundex values if you wanted to. Such a key wouldn't be unique, of course.
use Text::Soundex;
use User::pwent;

print "Lookup user: ";
chomp($user = <STDIN>);
exit unless defined $user;
$name_code = soundex($user);

while ($uent = getpwent()) {
    ($firstname, $lastname) = $uent->gecos =~ /(\w+)[^,]*\b(\w+)/;

    if ($name_code eq soundex($uent->name) ||
        $name_code eq soundex($lastname)   ||
        $name_code eq soundex($firstname)  )
    {
        printf "%s: %s %s\n", $uent->name, $firstname, $lastname;
    }
}
The documentation for the standard Text::Soundex and User::pwent modules (also in Chapter 7 of Programming Perl); your system's passwd(5) manpage; Volume 3, Chapter 6 of The Art of Computer Programming
Program: fixstyle
Imagine you have a table with both old and new strings, such as the following.
Old Words    New Words
bonnet       hood
rubber       eraser
lorry        truck
trousers     pants
The program in Example 1.4 is a filter that changes all occurrences of each element in the first set to the corresponding element in the second set.
When called without filename arguments, the program is a simple filter. If filenames are supplied on the command line, an in-place edit writes the changes to the files, with the original versions safely saved in a file with a ".orig" extension. See Section 7.9 for a description. A -v command-line option writes notification of each change to standard error.
The table of original strings and their replacements is stored below __END__ in the main program as described in Section 7.6. Each pair of strings is converted into carefully escaped substitutions and accumulated into the $code variable like the popgrep2 program in Section 6.10.
A -t check for an interactive run tells whether we're expecting to read from the keyboard if no arguments are supplied. That way, if users forget to give an argument, they aren't left wondering why the program appears to be hung.
Example 1.4. fixstyle
#!/usr/bin/perl -w
# fixstyle - switch first set of <DATA> strings to second set
#   usage: $0 [-v] [files ...]
use strict;
my $verbose = (@ARGV && $ARGV[0] eq '-v' && shift);

if (@ARGV) {
    $^I = ".orig";          # preserve old files
} else {
    warn "$0: Reading from stdin\n" if -t STDIN;
}

my $code = "while (<>) {\n";
# read in config, build up code to eval
while (<DATA>) {
    chomp;
    my ($in, $out) = split /\s*=>\s*/;
    next unless $in && $out;
    $code .= "s{\\Q$in\\E}{$out}g";
    $code .= "&& printf STDERR qq($in => $out at \$ARGV line \$.\\n)" 
                                                        if $verbose;
    $code .= ";\n";
}
$code .= "print;\n}\n";

eval "{ $code } 1" || die;

__END__
analysed        => analyzed
built-in        => builtin
chastized       => chastised
commandline     => command-line
de-allocate     => deallocate
dropin          => drop-in
hardcode        => hard-code
meta-data       => metadata
multicharacter  => multi-character
multiway        => multi-way
non-empty       => nonempty
non-profit      => nonprofit
non-trappable   => nontrappable
pre-define      => predefine
preextend       => pre-extend
re-compiling    => recompiling
reenter         => re-enter
turnkey         => turn-key
Program: psgrep
Many programs, including ps, netstat, lsof, ls -l, find -ls, and tcpdump, can produce more output than can be conveniently summarized. Logfiles also often grow too long to be easily viewed. You could send these through a filter like grep to pick out only certain lines, but regular expressions and complex logic don't mix well; just look at the hoops we jump through in Section 6.17.
What we'd really like is to make full queries on the program output or logfile. For example, to ask ps something like, "Show me all the processes that exceed 10K in size but which aren't running as the superuser." Or, "Which commands are running on pseudo-ttys?"
The psgrep program does this—and infinitely more—because the specified selection criteria are not mere regular expressions; they're full Perl code. Each criterion is applied in turn to every line of output. Only lines matching all arguments are output. The following is a list of things to find and how to find them.
Lines containing "sh" at the end of a word:
% psgrep '/sh\b/'
Processes whose command names end in "sh":
% psgrep 'command =~ /sh$/'
Processes running with a user ID below 10:
% psgrep 'uid < 10'
Login shells with active ttys:
% psgrep 'command =~ /^-/' 'tty ne "?"'
Processes running on pseudo-ttys:
% psgrep 'tty =~ /^[p-t]/'
Non-superuser processes running detached:
% psgrep 'uid && tty eq "?"'
Huge processes that aren't owned by the superuser:
% psgrep 'size > 10 * 2**10' 'uid != 0'
The last call to psgrep produced the following output when run on our system. As one might expect, only netscape and its spawn qualified.
            
 FLAGS   UID   PID  PPID PRI  NI   SIZE   RSS WCHAN     STA TTY TIME COMMAND
     0   101  9751     1   0   0  14932  9652 do_select S   p1  0:25 netscape
100000   101  9752  9751   0   0  10636   812 do_select S   p1  0:00 (dns helper)
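The idea of treating each criterion as Perl code can be sketched as follows. This is not the psgrep source: the helper names make_filter and parse_ps_line, the lowercase field names (taken from the header shown above), and the third sample line are all assumptions made for illustration. Each criterion string is rewritten so that bare field names become hash lookups, then compiled once with eval into a closure.

```perl
use strict;
use warnings;

# Column names are an assumption, taken from the sample header above.
my @fields = qw(flags uid pid ppid pri ni size rss wchan sta tty time command);

# Compile a criterion such as 'size > 10 * 2**10' into a closure by
# rewriting each bare field name into a lookup in the record hash.
sub make_filter {
    my ($criterion) = @_;
    my $names = join "|", @fields;
    (my $expr = $criterion) =~ s/\b($names)\b/\$_[0]{$1}/g;
    my $filter = eval "sub { $expr }"
        or die "cannot compile '$criterion': $@";
    return $filter;
}

# Split one line of ps-style output into a hash keyed by column name.
sub parse_ps_line {
    my ($line) = @_;
    my %rec;
    @rec{@fields} = split " ", $line, scalar @fields;
    return \%rec;
}

my @lines = (
    "     0   101  9751     1   0   0  14932  9652 do_select S   p1  0:25 netscape",
    "100000   101  9752  9751   0   0  10636   812 do_select S   p1  0:00 (dns helper)",
    "     0     0     1     0   0   0    760   340 do_select S   ?   0:03 init",  # made up
);

my @filters = map { make_filter($_) } 'size > 10 * 2**10', 'uid != 0';

for my $line (@lines) {
    my $rec = parse_ps_line($line);
    # A line is printed only if every criterion is true for it.
    print "$line\n" unless grep { !$_->($rec) } @filters;
}
```

The naive substitution would also rewrite a field name that happens to appear inside a string or pattern, and bare patterns such as '/sh\b/' are not handled; a real tool would need more care on both counts.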
Chapter 2: Numbers
Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin.
—John von Neumann (1951)
Numbers, the most basic data type of almost any programming language, can be surprisingly tricky. Random numbers, numbers with decimal points, series of numbers, and the conversion of strings to numbers all pose trouble.
Perl works hard to make life easy for you, and the facilities it provides for manipulating numbers are no exception to that rule. If you treat a scalar value as a number, Perl converts it to one. This means that when you read ages from a file, extract digits from a string, or acquire numbers from any of the other myriad textual sources that Real Life pushes your way, you don't need to jump through the hoops created by other languages' cumbersome requirements to turn an ASCII string into a number.
Perl tries its best to interpret a string as a number when you use it as one (such as in a mathematical expression), but it has no direct way of reporting that a string doesn't represent a valid number. Perl quietly converts non-numeric strings to zero, and it will stop converting the string once it reaches a non-numeric character—so "A7" is still 0, and "7A" is just 7. (Note, however, that the -w flag will warn of such improper conversions.) Sometimes (such as when validating input) you need to know if a string represents a valid number. We show you how in Section 2.1.
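A small sketch of the conversion rule just described; the sample strings are our own, and the numeric warnings are switched off locally so that only the converted values are shown.

```perl
use strict;
use warnings;

my @inputs = ("42", "7A", "A7", "3.5kg", "");

{
    no warnings 'numeric';       # silence the "isn't numeric" warnings here
    for my $s (@inputs) {
        my $n = $s + 0;          # adding zero forces numeric context
        print qq{"$s" -> $n\n};
    }
}
```

Removing the no warnings line shows the diagnostics that -w would produce for every string except "42".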
Section 2.16 shows how to get a number from strings containing hexadecimal or octal representations of numbers like "0xff". Perl automatically converts literals in your program code (so $a = 3 + 0xff will set $a to 258) but not data read by that program (you can't read "0xff" into $b and then say $a = 3 + $b to make $a become 258).
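As a brief sketch of the difference between literals and data, the built-in hex and oct functions do the conversion that Perl will not do implicitly. The variable names below are ours.

```perl
use strict;
use warnings;

my $input = "0xff";            # pretend this was read from a file or STDIN

my $from_hex = hex($input);    # hex() accepts an optional leading 0x
my $sum      = 3 + $from_hex;

# oct() looks at the prefix: 0x means hexadecimal, 0b binary, otherwise octal.
my $also_hex = oct("0xff");
my $perms    = oct("0755");

print "$sum $also_hex $perms\n";
```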
As if integers weren't giving us enough grief, floating-point numbers can cause even more headaches. Internally, a computer represents numbers with decimal points as floating-point numbers in binary format. Floating-point numbers are not the same as real numbers; they are an approximation of real numbers, with limited precision. Although infinitely many real numbers exist, you only have finite space to represent them, usually about 64 bits or so. You have to cut corners to fit them all in.
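The corner-cutting is easy to observe. The sketch below assumes a perl built with ordinary 64-bit doubles; on such a build the sum of 0.1 and 0.2 typically differs from 0.3 in the last few digits, although the difference is tiny.

```perl
use strict;
use warnings;

my $sum = 0.1 + 0.2;

printf "0.1 + 0.2 = %.17g\n", $sum;
printf "0.3       = %.17g\n", 0.3;

print "The two values are not identical\n" if $sum != 0.3;
printf "Difference: %g\n", abs($sum - 0.3);
```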
Checking Whether a String Is a Valid Number
You want to check whether a string represents a valid number. This is a common problem when validating input, as in a CGI script.
Compare it against a regular expression that matches the kinds of numbers you're interested in.
if ($string =~ /PATTERN/) {
    # is a number
} else {
    # is not
}
This problem gets to the heart of what we mean by a number. Even things that sound simple, like integer, make you think hard about what you will accept ("Is a leading + for positive numbers optional, mandatory, or forbidden?"). The many ways that floating-point numbers can be represented could overheat your brain.
You must decide what you will and will not accept. Then, construct a regular expression to match those things alone. Here are some precooked solutions (the cookbook's equivalent of just-add-water meals) for most common cases.
warn "has nondigits"        if     /\D/;
warn "not a natural number" unless /^\d+$/;             # rejects -3
warn "not an integer"       unless /^-?\d+$/;           # rejects +3
warn "not an integer"       unless /^[+-]?\d+$/;
warn "not a decimal number" unless /^-?\d+\.?\d*$/;     # rejects .2
warn "not a decimal number" unless /^-?(?:\d+(?:\.\d*)?|\.\d+)$/;
warn "not a C float"
       unless /^([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?$/;
These lines do not catch the IEEE notations of "Infinity" and "NaN", but unless you're worried that IEEE committee members will stop by your workplace and beat you over the head with copies of the relevant standards documents, you can probably forget about these strange numbers.
If your number has leading or trailing whitespace, those patterns won't work. Either add the appropriate logic directly, or call the trim function from Section 1.14.
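One way to combine these patterns with whitespace trimming is sketched below. The helper name classify_number and the order of the checks are our own choices; the patterns themselves are the ones listed above.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a string using the patterns listed above,
# after stripping leading and trailing whitespace.
sub classify_number {
    my ($string) = @_;
    return "not a number" unless defined $string;
    (my $s = $string) =~ s/^\s+|\s+$//g;

    return "integer" if $s =~ /^[+-]?\d+$/;
    return "decimal" if $s =~ /^-?(?:\d+(?:\.\d*)?|\.\d+)$/;
    return "C float"
        if $s =~ /^([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?$/;
    return "not a number";
}

for my $input ("42", " -17 ", "3.14", ".2", "6.02e23", "1e", "abc", "") {
    printf "%-10s %s\n", "'$input'", classify_number($input);
}
```

Because the checks run from strictest to loosest, a string is reported under the narrowest category that accepts it.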
If you're on a POSIX system, Perl supports the POSIX::strtod function. Its semantics are cumbersome, so here's a getnum
Comparing Floating-Point Numbers
Floating-point arithmetic isn't precise. You want to compare two floating-point numbers and know if they're equal when carried out to a certain number of decimal places. Most of the time, this is the way you should compare floating-point numbers for equality.
Use sprintf to format the numbers to a certain number of decimal places, then compare the resulting strings:
# equal(NUM1, NUM2, ACCURACY) : returns true if NUM1 and NUM2 are
# equal to ACCURACY number of decimal places

sub equal {
    my ($A, $B, $dp) = @_;

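    # %f fixes the number of digits after the decimal point
    # (%g would count significant digits instead)
    return sprintf("%.${dp}f", $A) eq sprintf("%.${dp}f", $B);
}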
Alternatively, store the numbers as integers by assuming the decimal place.
You need the equal routine because most computers' floating-point representations are approximations rather than exact values. See the Introduction for a discussion of this issue.
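A short usage sketch follows. It defines its own variant, equal_dp, so that it runs on its own; the name and the choice of the %f format (which counts digits after the decimal point) are our assumptions.

```perl
use strict;
use warnings;

# Illustrative variant of the comparison: format both numbers to $dp digits
# after the decimal point and compare the resulting strings.
sub equal_dp {
    my ($x, $y, $dp) = @_;
    return sprintf("%.${dp}f", $x) eq sprintf("%.${dp}f", $y);
}

print "0.1 + 0.2 equals 0.3 to 10 places\n"
    if equal_dp(0.1 + 0.2, 0.3, 10);
print "3.14159 equals 3.1416 to 2 places\n"
    if equal_dp(3.14159, 3.1416, 2);
print "3.14159 differs from 3.1416 at 5 places\n"
    unless equal_dp(3.14159, 3.1416, 5);
```

Note that this compares rounded representations, so two values that straddle a rounding boundary can compare unequal even when they are very close.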
If you have a fixed number of decimal places, as with currency, you can sidestep the problem by storing your values as integers. Storing $3.50 as 350 instead of 3.5 removes the need for floating-point values. Reintroduce the decimal point on output:
$wage = 536;                # $5.36/hour
$week = 40 * $wage;         # $214.40
printf("One week's wage is: \$%.2f\n", $week/100);


One week's wage is: $214.40
It rarely makes sense to compare to more than 15 decimal places.
The sprintf function in perlfunc(1) and Chapter 3 of Programming Perl; the entry on $# in the perlvar(1) manpage and Chapter 2 of Programming Perl; the documentation for the standard Math::BigFloat module (also in Chapter 7 of Programming Perl); we use sprintf in Section 2.3; Volume 2, Section 4.2.2 of The Art of Computer Programming
Rounding Floating-Point Numbers
You want to round a floating-point value to a certain number of decimal places. This problem arises as a result of the same inaccuracies in representation that make testing for equality difficult (see Section 2.2), as well as in situations where you must reduce the precision of your answers for readability.
Use the Perl function sprintf, or printf if you're just trying to produce output:
$rounded = sprintf("%FORMATf", $unrounded);
Rounding can seriously affect some algorithms, so the rounding method used should be specified precisely. In sensitive applications like financial computations and thermonuclear missiles, prudent programmers will implement their own rounding function instead of relying on the programming language's built-in logic, or lack thereof.
Usually, though, we can just use sprintf. The f format lets you specify a particular number of decimal places to round its argument to. Perl looks at the following digit, rounds up if it is 5 or greater, and rounds down otherwise.
$a = 0.255;
$b = sprintf("%.2f", $a);
print "Unrounded: $a\nRounded: $b\n";

printf "Unrounded: $a\nRounded: %.2f\n", $a;


Unrounded: 0.255
Rounded: 0.26
Unrounded: 0.255
Rounded: 0.26
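For the sensitive cases mentioned earlier, where the rounding rule has to be pinned down, a hand-written function might look like the sketch below. The name round_half_away is hypothetical, and the rule chosen (ties go away from zero) is only one of several reasonable conventions. The inputs are still binary floating-point values, so a number that merely looks like a tie in decimal may already sit slightly above or below it.

```perl
use strict;
use warnings;

# Hypothetical helper: round to $places decimal places, with exact ties
# going away from zero.
sub round_half_away {
    my ($value, $places) = @_;
    my $scale  = 10 ** $places;
    my $offset = $value < 0 ? -0.5 : 0.5;
    return int($value * $scale + $offset) / $scale;
}

print round_half_away( 2.5,     0), "\n";
print round_half_away(-2.5,     0), "\n";
print round_half_away( 1.25,    1), "\n";
print round_half_away( 3.14159, 2), "\n";
```

Spelling the rule out this way makes the tie-breaking behavior explicit instead of leaving it to the underlying C library.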
               
            
Three functions that may be useful if you want to round a floating-point value to an integral value are int, ceil, and floor. int, built into Perl, returns the integral portion of the floating-point number passed to it (int will use $_ if it was called without an argument). The POSIX module's floor and ceil functions round their argument down and up to the next integer, respectively.
use POSIX;
print "number\tint\tfloor\tceil\n";

@a = ( 3.3 , 3.5 , 3.7, -3.3 );
foreach (@a) {
    printf( "%.1f\t%.1f\t%.1f\t%.1f\n", 
        $_, int($_), floor($_), ceil($_) );
}

Converting Between Binary and Decimal
You have an integer whose binary representation you'd like to print out, or a binary representation that you'd like to convert into an integer. You might want to do this if you were displaying non-textual data, such as what you get from interacting with certain system programs and functions.
To convert a Perl integer to a text string of ones and zeros, first pack the integer into a number in network byte order (the "N" format), then unpack it again bit by bit (the "B32"</