BUY THIS BOOK

Safari Books Online

What is this?

Looking to Reprint this content?


Mastering Regular Expressions
Mastering Regular Expressions, Second Edition By Jeffrey E. F. Friedl
July 2002
Pages: 484

Cover | Table of Contents | Colophon


Table of Contents

Chapter 1: Introduction to Regular Expressions
Here's the scenario: you're given the job of checking the pages on a web server for doubled words (such as "this this"), a common problem with documents subject to heavy editing. Your job is to create a solution that will:
  • Accept any number of files to check, report each line of each file that has doubled words, highlight (using standard ANSI escape sequences) each doubled word, and ensure that the source filename appears with each line in the report.
  • Work across lines, even finding situations where a word at the end of one line is repeated at the beginning of the next.
  • Find doubled words despite capitalization differences, such as with 'The the···', as well as allow differing amounts of whitespace (spaces, tabs, newlines, and the like) to lie between the words.
  • Find doubled words even when separated by HTML tags. HTML tags are for marking up text on World Wide Web pages, for example, to make a word bold: '···it is <B>very</B> very important···'.
That's certainly a tall order! But, it's a real problem that needs to be solved. At one point while working on the manuscript for this book, I ran such a tool on what I'd written so far and was surprised at the way numerous doubled words had crept in. There are many programming languages one could use to solve the problem, but one with regular expression support can make the job substantially easier.
Regular expressions are the key to powerful, flexible, and efficient text processing. Regular expressions themselves, with a general pattern notation almost like a mini programming language, allow you to describe and parse text. With additional support provided by the particular tool being used, regular expressions can add, remove, isolate, and generally fold, spindle, and mutilate all kinds of text and data.
It might be as simple as a text editor's search command or as powerful as a full text processing language. This book shows you the many ways regular expressions can increase your productivity. It teaches you how to think regular expressions so that you can master them, taking advantage of the full magnitude of their power.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Solving Real Problems
Knowing how to wield regular expressions unleashes processing powers you might not even know were available. Numerous times in any given day, regular expressions help me solve problems both large and small (and quite often, ones that are small but would be large if not for regular expressions).
Showing an example that provides the key to solving a large and important problem illustrates the benefit of regular expressions clearly, but perhaps not so obvious is the way regular expressions can be used throughout the day to solve rather "uninteresting" problems. I use "uninteresting" in the sense that such problems are not often the subject of bar-room war stories, but quite interesting in that until they're solved, you can't get on with your real work.
As a simple example, I needed to check a lot of files (the 70 or so files comprising the source for this book, actually) to confirm that each file contained 'SetSize' exactly as often (or as rarely) as it contained 'ResetSize'. To complicate matters, I needed to disregard capitalization (such that, for example, 'setSIZE' would be counted just the same as 'SetSize'). Inspecting the 32,000 lines of text by hand certainly wasn't practical.
Even using the normal "find this word" search in an editor would have been arduous, especially with all the files and all the possible capitalization differences.
Regular expressions to the rescue! Typing just a single, short command, I was able to check all files and confirm what I needed to know. Total elapsed time: perhaps 15 seconds to type the command, and another 2 seconds for the actual check of all the data. Wow! (If you're interested to see what I actually used, peek ahead to Section 2.1.)
As another example, I was once helping a friend with some email problems on a remote machine, and he wanted me to send a listing of messages in his mailbox file. I could have loaded a copy of the whole file into a text editor and manually removed all but the few header lines from each message, leaving a sort of table of contents. Even if the file wasn't as huge as it was, and even if I wasn't connected via a slow dial-up line, the task would have been slow and monotonous. Also, I would have been placed in the uncomfortable position of actually seeing the text of his personal mail.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Regular Expressions as a Language
Unless you've had some experience with regular expressions, you won't understand the regular expression ^(From|Subject): from the last example, but there's nothing magic about it. For that matter, there is nothing magic about magic. The magician merely understands something simple which doesn't appear to be simple or natural to the untrained audience. Once you learn how to hold a card while making your hand look empty, you only need practice before you, too, can "do magic." Like a foreign language — once you learn it, it stops sounding like gibberish.
Since you have decided to use this book, you probably have at least some idea of just what a "regular expression" is. Even if you don't, you are almost certainly already familiar with the basic concept.
You know that report.txt is a specific filename, but if you have had any experience with Unix or DOS/Windows, you also know that the pattern "*.txt" can be used to select multiple files. With filename patterns like this (called file globs or wildcards), a few characters have special meaning. The star means "match anything," and a question mark means "match any one character." So, with the file glob "*.txt", we start with a match-anything * and end with the literal .txt , so we end up with a pattern that means "select the files whose names start with anything and end with .txt".
Most systems provide a few additional special characters, but, in general, these filename patterns are limited in expressive power. This is not much of a shortcoming because the scope of the problem (to provide convenient ways to specify groups of files) is limited, well, simply to filenames.
On the other hand, dealing with general text is a much larger problem. Prose and poetry, program listings, reports, HTML, code tables, word lists... you name it, if a particular need is specific enough, such as "selecting files," you can develop some kind of specialized scheme or tool to help you accomplish it. However, over the years, a
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
The Regular-Expression Frame of Mind
As we'll soon see, complete regular expressions are built up from small buildingblock units. Each individual building block is quite simple, but since they can be combined in an infinite number of ways, knowing how to combine them to achieve a particular goal takes some experience. So, this chapter provides a quick overview of some regular-expression concepts. It doesn't go into much depth, but provides a basis for the rest of this book to build on, and sets the stage for important side issues that are best discussed before we delve too deeply into the regular expressions themselves.
While some examples may seem silly (because some are silly), they represent the kind of tasks that you will want to do — you just might not realize it yet. If each point doesn't seem to make sense, don't worry too much. Just let the gist of the lessons sink in. That's the goal of this chapter.
If you're already familiar with regular expressions, much of this overview will not be new, but please be sure to at least glance over it anyway. Although you may be aware of the basic meaning of certain metacharacters, perhaps some of the ways of thinking about and looking at regular expressions will be new.
Just as there is a difference between playing a musical piece well and making music, there is a difference between knowing about regular expressions and really understanding them. Some of the lessons present the same information that you are already familiar with, but in ways that may be new and which are the first steps to really understanding.
Finding text is one of the simplest uses of regular expressions—many text editors and word processors allow you to search a document using a regular-expression pattern. Even simpler is the utility egrep. Give egrep a regular expression and some files to search, and it attempts to match the regular expression to each line of each file, displaying only those lines in which a match is found. egrep is freely available for many systems, including DOS, MacOS, Windows, Unix, and so on. See this book's web site, regex.info, for links on how to obtain a copy of
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Egrep Metacharacters
Let's start to explore some of the egrep metacharacters that supply its regular expression power. I'll go over them quickly with a few examples, leaving the detailed examples and descriptions for later chapters.
Typographical Conventions Before we begin, please make sure to review the typographical conventions explained in the preface . This book forges a bit of new ground in the area of typesetting, so some of my notations may be unfamiliar at first.
Probably the easiest metacharacters to understand are ^ (caret) and $ (dollar), which represent the start and end, respectively, of the line of text as it is being checked. As we've seen, the regular expression cat finds c·a·t anywhere on the line, but ^cat matches only if the c·a·t is at the beginning of the line—the ^ is used to effectively anchor the match (of the rest of the regular expression) to the start of the line. Similarly, cat$ finds c·a·t only at the end of the line, such as a line ending with scat.
It's best to get into the habit of interpreting regular expressions in a rather literal way. For example, don't think
^cat matches a line with cat at the beginning
but rather:
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Expanding the Foundation
I hope the examples and explanations so far have helped to establish the basis for a solid understanding of regular expressions, but please realize that what I've provided so far lacks depth. There's so much more out there.
I mentioned a number of regular expression features that most versions of egrep support. There are other features, some of which are not supported by all versions, which I'll leave for later chapters.
Unfortunately, the regular expression language is no different from any other in that it has various dialects and accents. It seems each new program employing regular expressions devises its own "improvements." The state of the art continually moves forward, but changes over the years have resulted in a wide variety of regular expression "flavors." We'll see many examples in the following chapters.
From the broadest top-down view, a regular expression either matches within a lump of text (with egrep, each line) or it doesn't. When crafting a regular expression, you must consider the ongoing tug-of-war between having your expression match the lines you want, yet still not matching lines you don't want.
Also, while egrep doesn't care where in the line the match occurs, this concern is important for many other regular-expression uses. If your text is something such as
...zip is 44272. If you write, send $4.95 to cover postage and...
and you merely want to find lines matching [0-9]+ , you don't care which numbers are matched. However, if your intent is to do something with the number (such as save to a file, add, replace, and such — we will see examples of this kind of processing in the next chapter), you'll care very much exactly which numbers are matched.
As with any language, experience is a very good thing, so I'm including a few more examples of regular expressions to match some common constructs.
Half the battle when writing regular expressions is getting successful matches when and where you want them. The other half is to
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Personal Glimpses
The doubled-word task at the start of this chapter might seem daunting, yet regular expressions are so powerful that we could solve much of the problem with a tool as limited as egrep, right here in the first chapter. I'd like to fill this chapter with flashy examples, but because I've concentrated on the solid foundation for the later chapters, I fear that someone completely new to regular expressions might read this chapter, complete with all the warnings and cautions and rules and such, and feel "why bother?"
Recently, my brothers were teaching some friends how to play schaffkopf, a card game that's been in my family for generations. It is much more exciting than it appears at first glance, but has a rather steep learning curve. After about half an hour, my sister-in-law Liz, normally the quintessence of patience, got frustrated with the seemingly complex rules and said "Can't we just play rummy?" Yet, as it turned out, they ended up playing late into the night. Once they were able to get over the initial hump of the learning curve, a first-hand taste of the excitement was all it took to hook them. My brothers knew it would, but it took some time and work to get to the point where Liz and the others new to the game could appreciate what they were getting into.
It might take some time to become acclimated to regular expressions, so until you get a real taste of the excitement by using them to solve your problems, it might all feel just a bit too academic. If so, I hope you will resist the desire to "play rummy." Once you understand the power that regular expressions provide, the small amount of work spent learning them will feel trivial indeed.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Quiz Answer
Answers to the questions in Section 1.4 .
^cat$
Literally means: matches if the line has a beginning-of-line (which, of course, all lines have), followed immediately by c·a·t, and then followed immediately by the end of the line.
Effectively means: a line that consists of only cat — no extra words, spaces, punctuation... just 'cat'.
^$
Literally means: matches if the line has a beginning-of-line, followed immediately by the end of the line.
Effectively means: an empty line (with nothing in it, not even spaces).
^
Literally means: matches if the line has a beginning-of-line.
Effectively meaningless! Since every line has a beginning, every line will match—even lines that are empty!
Answer to the question in Section 1.4.3 .
Why doesn't
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 2: Extended Introductory Examples
Remember the doubled-word problem from the first chapter? I said that a full solution could be written in just a few lines in a language like Perl. Such a solution might look like:
$/ = ".\n";
while (< >) {
  next if !s/\b([a-z]+)((?:\s|<[^>]+>)+)(\1\b)/\e[7m$1\e[m$2\e[7m$3\e[m/ig;
  s/^(?:[^\e]*\n)+//mg;        # Remove any unmarked lines.
  s/^/$ARGV: /mg;              # Ensure lines begin with filename.
  print;
}
Yup, that's the whole program.
Even if you're familiar with Perl, I don't expect you to understand it (yet! ). Rather, I wanted to show an example beyond what egrep can allow, and to whet your appetite for the real power of regular expressions.
Most of this program's work revolves around its three regular expressions:
  • \b([a-z]+)((?:\s|<[^>]+>)+)(\1\b)
  • ^(?:[^\e]*\n)+
  • ^
Though this is a Perl example, these three regular expressions can be used verbatim (or with only a few changes) in many other languages, including Python, Java, Visual Basic .NET, Tcl, and more.
Now, looking at these, that last ^ is certainly recognizable, but the other expressions have items unfamiliar to our egrep-only experience. This is because Perl's regex flavor is not the same as egrep's. Some of the notations are different, and Perl (as well as most modern tools) tend to provide a much richer set of metacharacters than egrep. We'll see many examples throughout this chapter.
This chapter takes a few sample problems — validating user input; working with email headers; converting plain text to HTML — and wanders through the regular expression landscape with them. As I develop them, I'll "think out loud" to offer a few insights into the thought processes that go into crafting a regex. During our journey, we'll see some constructs and features that
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
About the Examples
This chapter takes a few sample problems — validating user input; working with email headers; converting plain text to HTML — and wanders through the regular expression landscape with them. As I develop them, I'll "think out loud" to offer a few insights into the thought processes that go into crafting a regex. During our journey, we'll see some constructs and features that egrep doesn't have, and we'll take plenty of side trips to look at other important concepts as well.
Toward the end of this chapter, and in subsequent chapters, I'll show examples in a variety of languages including Java and Visual Basic .NET, but the examples throughout most of this chapter are in Perl. Any of these languages, and most others for that matter, allow you to employ regular expressions in much more complex ways than egrep, so using any of them for the examples would allow us to see interesting things. I choose to start with Perl primarily because it has the most ingrained, easily accessible regex support among the popular languages. Also, Perl provides many other concise data-handling constructs that alleviate much of the "dirty work" of our example tasks, letting us concentrate on regular expressions.
Just to quickly demonstrate some of these powers, recall the file-check example from Section 1.1, where I needed to ensure that each file contained 'ResetSize' exactly as many times as 'SetSize'. The utility I used was Perl, and the command was:
% perl -0ne 'print "$ARGV\n" if s/ResetSize//ig != s/SetSize//ig' *
(I don't expect that you understand this yet — I hope merely that you'll be impressed with the brevity of the solution.)
I like Perl, but it's important not to get too caught up in its trappings here. Remember, this chapter concentrates on regular expressions. As an analogy, consider the words of a computer science professor in a first-year course: "You're going to learn computer-science concepts here, but we'll use Pascal to show you."
Since this chapter doesn't assume that you know Perl, I'll be sure to introduce enough to make the examples understandable. (Chapter 7, which looks at all the nitty-gritty details of Perl, does assume some basic knowledge.) Even if you have experience with a variety of programming languages, normal Perl may seem quite odd at first glance because its syntax is very compact and its semantics thick. In the interest of clarity, I won't take advantage of much that Perl has to offer, instead presenting programs in a more generic, almost pseudo-code style. While not "bad," the examples are not the best models of The Perl Way of programming. But, we
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Matching Text with Regular Expressions
Perl uses regular expressions in many ways, the simplest being to check if a regex matches text (or some part thereof) held in a variable. This snippet checks the string held in variable $reply and reports whether it contains only digits:
     if ($reply =~ m/^[0-9]+$/) {
          print "only digits\n";
     } else {
          print "not only digits\n";
     }
The mechanics of the first line might seem a bit strange: the regular expression is ^[0-9]+$ , while the surrounding m/···/ tells Perl what to do with it. The m means to attempt a regular expression match, while the slashes delimit the regex itself. The preceding =~ links m/···/ with the string to be searched, in this case the contents of the variable $reply.
Don't confuse =~ with = or == . The operator == tests whether two numbers are the same. (The operator eq , as we will soon see, is used to test whether two strings are the same.) The = operator is used to assign a value to a variable, as with $celsius = 20 . Finally, =~ links a regex search with the target string to be searched. In the example, the search is m/^[0-9]+$/ and the target is $reply. Other languages approach this differently, and we'll see examples in the next chapter.
It might be convenient to read =~ as "matches," such that
     if ($reply =~ m/^[0-9]+$/)
         
becomes:
if the text contained in the variable $reply matches the regex ^[0-9]+$ , then ...
The whole result of $reply =~ m/^[0-9]+$/ is a true value if the
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Modifying Text with Regular Expressions
So far, the examples have centered on finding, and at times, "plucking out" information from a string. Now we look at substitution (also called search and replace), a regex feature that Perl and many tools offer.
As we have seen, / regex / attempts to match the given regular expression to the text in the given variable, and returns true or false appropriately. The similar construct / regex / replacement / takes it a step further: if the regex is able to match somewhere in the string held by $var, the text actually matched is replaced by replacement. The regex is the same as with m/···/, but the replacement (between the middle and final slash) is treated as a double-quoted string. This means that you can include references to variables, including $1, $2, and so on to refer to parts of what was just matched.
Thus, with $var =~ s/···/···/ the value of the variable is actually changed. (If there is no match to begin with, no replacement is made and the variable is left unchanged.) For example, if $var contained Jeff•Friedl and we ran
     $var =~ s/Jeff/Jeffrey/;
$var would end up with Jeffrey•Friedl. And if we did that again, it would end up with Jeffreyrey•Friedl. To avoid that, perhaps we should use a wordboundary metacharacter. As mentioned in the first chapter, some versions of egrep support \< and \> for their start-of-word and end-of-word metacharacters. Perl, however, provides the catch-all \b , which matches either:
     $var =~ s/\bJeff\b/Jeffrey/;
Here's a slightly tricky quiz: like
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Quiz Answers
Answer to the question in Section 2.2.3.
How do [•]* and •*|* compare?
(•*|*) allows either •* or * to match, which allows either some spaces (or nothing) or some tabs (or nothing). It doesn't, however, allow a combination of spaces and tabs.
On the other hand, [•]* matches [•] any number of times. With a string such as ' ••' it matches three times, a tab the first time and spaces the rest.
[•]* is logically equivalent to (•|)* , although for reasons shown in Chapter 4, a character class is often much more efficient.
Answer to the question in Section 2.3.
Just what does $var =~ s/\bJeff\b/Jeff/i do?
It might be tricky because of the way I posed it. Had I used \bJEFF\b or
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 3: Overview of Regular Expression Features and Flavors
Now that you have a feel for regular expressions and a few diverse tools that use them, you might think we're ready to dive into using them wherever they're found. But even a simple comparison among the egrep versions of the first chapter and the Perl and Java in the previous chapter shows that regular expressions and the way they're used can vary wildly from tool to tool.
When looking at regular expressions in the context of their host language or tool, there are three broad issues to consider:
  • What metacharacters are supported, and their meaning. Often called the regex "flavor."
  • How regular expressions "interface" with the language or tool, such as how to specify regular-expression operations, what operations are allowed, and what text they operate on.
  • How the regular-expression engine actually goes about applying a regular expression to some text. The method that the language or tool designer uses to implement the regular-expression engine has a strong influence on the results one might expect from any given regular expression.
Regular Expressions and Cars
The considerations just listed parallel the way one might think while shopping for a car. With regular expressions, the metacharacters are the first thing you notice, just as with a car it's the body shape, shine, and nifty features like a CD player and leather seats. These are the types of things you'll find splashed across the pages of a glossy brochure, and a list of metacharacters like the one in Section 1.5.6 is the regular-expression equivalent. It's important information, but only part of the story.
How regular expressions interface with their host program is also important. The interface is partly cosmetic, as in the syntax of how to actually provide a regular expression to the program. Other parts of the interface are more functional, defining what operations are supported, and how convenient they are to use. In our car comparison, this would be how the car "interfaces" with us and our lives. Some issues might be cosmetic, such as what side of the car you put gas in, or whether the windows are powered. Others might be a bit more important, such as if it has an automatic or manual transmission. Still others deal with functionality: can you fit the thing in your garage? Can you transport a king-size mattress? Skis? Five adults? (And how easy is it for those five adults to get in and out of the car—easier with four doors than with two.) Many of these issues are also mentioned in the glossy brochure, although you might have to read the small print in the back to get all the details.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
A Casual Stroll Across the Regex Landscape
I'd like to start with the story about the evolution of some regular expression flavors and their associated programs. So, grab a hot cup (or frosty mug) of your favorite brewed beverage and relax as we look at the sometimes wacky history behind the regular expressions we have today. The idea is to add color to our regex understanding, and to develop a feeling as to why "the way things are" are the way things are. There are some footnotes for those that are interested, but for the most part, this should be read as a light story for enjoyment.
The seeds of regular expressions were planted in the early 1940s by two neurophysiologists, Warren McCulloch and Walter Pitts, who developed models of how they believed the nervous system worked at the neuron level. Regular expressions became a reality several years later when mathematician Stephen Kleene formally described these models in an algebra he called regular sets. He devised a simple notation to express these regular sets, and called them regular expressions.
Through the 1950s and 1960s, regular expressions enjoyed a rich study in theoretical mathematics circles. Robert Constable has written a good summary for the mathematically inclined.
Although there is evidence of earlier work, the first published computational use of regular expressions I have actually been able to find is Ken Thompson's 1968 article Regular Expression Search Algorithm in which he describes a regularexpression compiler that produced IBM 7094 object code. This led to his work on qed, an editor that formed the basis for the Unix editor ed.
ed 's regular expressions were not as advanced as those in qed, but they were the first to gain widespread use in non-technical fields. ed had a command to display lines of the edited file that matched a given regular expression. The command, " g/ Regular Expression /p ", was read "Global Regular Expression Print." This particular function was so useful that it was made into its own utility, grep (after which egrep —extended grep —was later modeled).

Section 3.1.1.1: Grep's metacharacters

Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Care and Handling of Regular Expressions
The second concern outlined at the start of the chapter is the syntactic packaging that tells an application "Hey, here's a regex, and this is what I want you to do with it." egrep is a simple example because the regular expression is expected as an argument on the command line. Any extra syntactic sugar, such as the single quotes I used throughout the first chapter, are needed only to satisfy the command shell, not egrep. Complex systems, such as regular expressions in programming languages, require more complex packaging to inform the system exactly what the regex is and how it should be used.
The next step, then, is to look at what you can do with the results of a match. Again, egrep is simple in that it pretty much always does the same thing (displays lines that contain a match), but as the previous chapter began to show, the real power is in doing much more interesting things. The two basic actions behind those interesting things are match (to check if a regex matches in a string, and to perhaps pluck information from the string), and search-and-replace, to modify a string based upon a match. There are many variations of these actions, and many variations on how individual languages let you perform them.
In general, a programming language can take one of three approaches to regular expressions: integrated, procedural, and object-oriented. With the first, regular expression operators are built directly into the language, as with Perl. In the other two, regular expressions are not part of the low-level syntax of the language. Rather, normal strings are passed as arguments to normal functions, which then interpret the strings as regular expressions. Depending on the function, one or more regex-related actions are then performed. One derivative or another of this style is use by most (non-Perl) languages, including Java, the .NET languages, Tcl, Python, PHP, Emacs lisp, and Ruby.
We've already seen a bit of Perl's integrated approach, such as this example from Section 2.3.4:
if ($line =~ m/
               ^Subject: (.*)/i) {
    $subject = $1;
}
Here, for clarity, variable names I've chosen are in italic, while the regex-related items are bold, and the regular expression itself is underlined. We know that Perl applies the regular expression
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Strings, Character Encodings, and Modes
Before getting into the various type of metacharacters generally available, there are a number of global issues to understand: regular expressions as strings, character encodings, and match modes.
These are simple concepts, in theory, and in practice, some indeed are. With most, though, the small details, subtleties, and inconsistencies among the various implementations sometimes makes it hard to pin down exactly how they work in practice. The next sections cover some of the common and sometimes complex issues you'll face.
The concept is simple: in most languages except Perl, awk, and sed, the regex engine accepts regular expressions as normal strings — strings that are often provided as string literals like "^From:(.*)". What confuses many, especially early on, is the need to deal with the language's own string-literal metacharacters when composing a string to be used as a regular expression.
Each language's string literals have their own set of metacharacters, and some languages even have more than one type of string literal, so there's no one rule that works everywhere, but the concepts are all the same. Many languages' string literals recognize escape sequences like \t, \\, and \x2A, which are interpreted while the string's value is being composed. The most common regex-related aspect of this is that each backslash in a regex requires two backslashes in the corresponding string literal. For example, "\\n" is required to get the regex \n .
If you forgot the extra backslash for the string literal and used "\n", with many languages you'd then get , which just happens to do exactly the same thing as \n . Well, actually, if the regex is in an /x type of free-spacing mode,
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Common Metacharacters and Features
The following overview of current regex metacharacters covers common items and concepts. It doesn't discuss every issue, and no one tool includes everything presented here. In one respect, this is just a summary of much of what you've seen in the first two chapters, but in light of the wider, more complex world presented at the beginning of this chapter. During your first pass through this section, a light glance should allow you to continue on to the next chapters. You can come back here to pick up details as you need them.
Some tools add a lot of new and rich functionality and some gratuitously change common notations to suit their whim or special needs. Although I'll sometimes comment about specific utilities, I won't address too many tool-specific concerns here. Rather, in this section I'll just try to cover some common metacharacters and their uses, and some concerns to be aware of. I encourage you to follow along with the manual of your favorite utility.
The following is an outline of the constructs covered in this section, with pointers to the page where each sub-section starts:
Character Representationssee Section 3.4.1.1 see Section 3.4.1.3 see Section 3.4.1.4 see Section 3.4.1.5 Character Shorthands: \n\t\e Octal Escapes: num Hex/Unicode Escapes: num {num} num,num, Control Characters: char
Character Classes and class-like constructssee Section 3.4.2.1 see Section 3.4.2.2 see Section 3.4.2.4 see Section 3.4.2.5 see Section 3.4.2.6 see Section 3.4.2.7 see Section 3.4.2.8 see Section 3.4.2.9 see Section 3.4.2.10 see Section 3.4.2.11 Normal classes: [a-z] and [^a-z]
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Guide to the Advanced Chapters
Now that we're familiar with metacharacters, flavors, syntactic packaging, and the like, it's time to start getting into the nitty-gritty details of the third concern raised at the start of this chapter, the specifics of how a tool's regex engine goes about applying a regex to some text. In Chapter 4, The Mechanics of Expression Processing, we see how the implementation of the match engine influences whether a match is achieved, which text is matched, and how much time the whole thing takes. We'll look at all the details. As a byproduct of this knowledge, you'll find it much easier to craft complex expressions with confidence. Chapter 5, Practical Regex Techniques helps to solidify that knowledge with extended examples.
That brings us to Chapter 6, Crafting an Efficient Expression. Once you know the basics about how an engine works, you can learn techniques to take full advantage of that knowledge. Chapter 6 looks at regex pitfalls that often lead to unwelcome surprises, and turns the tables to put them to use for us.
Chapters 4, 5, and 6 are the central core of this book. These first three chapters merely lead up to them, and the discussions in the tool-specific chapters that follow rely on them. They're not necessarily what you would call "light reading," but I've taken great care to stay away from math, algebra, and all that stuff that's just mumbo-jumbo to most of us. As with any large amount of new information, it likely takes time to sink in and internalize.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 4: The Mechanics of Expression Processing
The previous chapter started with an analogy between cars and regular expressions. The bulk of the chapter discussed features, regex flavors, and other "glossy brochure" issues of regular expressions. This chapter continues with that analogy, talking about the all-important regular-expression engine, and how it goes about its work.
Why would you care how it works? As we'll see, there are several types of regex engines, and the type most commonly used — the type used by Perl, Tcl, Python, the .NET languages, Ruby, PHP, all Java packages I've seen, and more — works in such a way that how you craft your expression can influence whether it can match a particular string, where in the string it matches, and how quickly it finds the match or reports the failure. If these issues are important to you, this chapter is for you.
Let's see how much I can milk this engine analogy. The whole point of having an engine is so that you can get from Point A to Point B without doing much work. The engine does the work for you so you can relax and enjoy the sound system. The engine's primary task is to turn the wheels, and how it does that isn't really a concern of yours. Or is it?
Well, what if you had an electric car? They've been around for a long time, but they aren't as common as gas cars because they're hard to design well. If you had one, though, you would have to remember not to put gas in it. If you had a gasoline engine, well, watch out for sparks! An electric engine more or less just runs, but a gas engine might need some babysitting. You can get much better performance just by changing little things like your spark plug gaps, air filter, or brand of gas. Do it wrong and the engine's performance deteriorates, or, worse yet, it stalls.
Each engine might do its work differently, but the end result is that the wheels turn. You still have to steer properly if you want to get anywhere, but that's an entirely different issue.
Let's stoke the fire by adding another variable: the California Emissions Standards. Some engines adhere to California's strict pollution standards, and some engines don't. These aren't really different kinds of engines, just new variations on what's already around. The standard regulates a result of the engine's work, the emissions, but doesn't say anything about how the engine should go about achieving those cleaner results. So, our two classes of engine are divided into four types: electric (adhering and non-adhering) and gasoline (adhering and non-adhering).
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Start Your Engines!
Let's see how much I can milk this engine analogy. The whole point of having an engine is so that you can get from Point A to Point B without doing much work. The engine does the work for you so you can relax and enjoy the sound system. The engine's primary task is to turn the wheels, and how it does that isn't really a concern of yours. Or is it?
Well, what if you had an electric car? They've been around for a long time, but they aren't as common as gas cars because they're hard to design well. If you had one, though, you would have to remember not to put gas in it. If you had a gasoline engine, well, watch out for sparks! An electric engine more or less just runs, but a gas engine might need some babysitting. You can get much better performance just by changing little things like your spark plug gaps, air filter, or brand of gas. Do it wrong and the engine's performance deteriorates, or, worse yet, it stalls.
Each engine might do its work differently, but the end result is that the wheels turn. You still have to steer properly if you want to get anywhere, but that's an entirely different issue.
Let's stoke the fire by adding another variable: the California Emissions Standards. Some engines adhere to California's strict pollution standards, and some engines don't. These aren't really different kinds of engines, just new variations on what's already around. The standard regulates a result of the engine's work, the emissions, but doesn't say anything about how the engine should go about achieving those cleaner results. So, our two classes of engine are divided into four types: electric (adhering and non-adhering) and gasoline (adhering and non-adhering).
Come to think of it, I bet that an electric engine can qualify for the standard without much change—the standard just "blesses" the clean results that are already par for the course. The gas engine, on the other hand, needs some major tweaking and a bit of re-tooling before it can qualify. Owners of this kind of engine need to pay particular care to what they feed it—use the wrong kind of gas and you're in big trouble.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Match Basics
Before looking at the differences among these engine types, let's first look at their similarities. Certain aspects of the drive train are the same (or for all practical purposes appear to be the same), so these examples can cover all engine types.
This chapter is primarily concerned with a generic, full-function regex engine, so some tools won't support exactly everything presented. In my examples, the dipstick might be to the left of the oil filter, while under your hood it might be behind the distributor cap. Your goal is to understand the concepts so that you can drive and maintain your favorite regex package (and ones you find interest in later).
I'll continue to use Perl's notation for most of the examples, although I'll occasionally show others to remind you that the notation is superficial and that the issues under discussion transcend any one tool or flavor. To cut down on wordiness here, I'll rely on you to check Chapter 3 (see Section 3.4) if I use an unfamiliar construct.
This chapter details the practical effects of how a match is carried out. It would be nice if everything could be distilled down to a few simple rules that could be memorized without needing to understand what is going on. Unfortunately, that's not the case. In fact, with all this chapter offers, I identify only two all-encompassing rules:
  1. The match that begins earliest (leftmost) wins.
  2. The standard quantifiers ( * , + , ? , and {m,n} ) are greedy.
We'll look at these rules, their effects, and much more throughout this chapter. Let's start by diving into the details of the first rule.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Regex-Directed Versus Text-Directed
The two basic engine types reflect a fundamental difference in algorithms available for applying a regular expression to a string. I call the gasoline-driven NFA engine "regex-directed," and the electric-driven DFA "text-directed."
Let's consider one way an engine might match to(nite|knight|night) against the text '···tonight···' . Starting with the t , the regular expression is examined one component at a time, and the "current text" is checked to see whether it is matched by the current component of the regex. If it does, the next component is checked, and so on, until all components have matched, indicating that an overall match has been achieved.
With the to(nite;knight;night) example, the first component is t , which repeatedly fails until a 't' is reached in the target string. Once that happens, the o is checked against the next character, and if it matches, control moves to the next component. In this case, the "next component" is (nite|knight|night) which really means " nite or knight or night . " Faced with three possibilities, the engine just tries each in turn. We (humans with advanced neural nets between our ears) can see that if we're matching
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Backtracking
The essence of an NFA engine is this: it considers each subexpression or component in turn, and whenever it needs to decide between two equally viable options, it selects one and remembers the other to return to later if need be.
Situations where it has to decide among courses of action include anything with a quantifier (decide whether to try another match), and alternation (decide which alternative to try, and which to leave for later).
Whichever course of action is attempted, if it's successful and the rest of the regex is also successful, the match is finished. If anything in the rest of the regex eventually causes failure, the regex engine knows it can backtrack to where it chose the first option, and can continue with the match by trying the other option. This way, it eventually tries all possible permutations of the regex (or at least as many as needed until a match is found).
Backtracking is like leaving a pile of bread crumbs at every fork in the road. If the path you choose turns out to be a dead end, you can retrace your steps, giving up ground until you come across a pile of crumbs that indicates an untried path. Should that path, too, turn out to be a dead end, you can backtrack further, retracing your steps to the next pile of crumbs, and so on, until you eventually find a path that leads to your goal, or until you run out of untried paths.
There are various situations when the regex engine needs to choose between two (or more) options — the alternation we saw earlier is only one example. Another example is that upon reaching ···x?··· , the engine must decide whether it should attempt x . Upon reaching ···x+···