|
|
|
|
Mastering Regular ExpressionsPowerful Techniques for Perl and Other ToolsBy Jeffrey E. F. Friedl1st Edition January 1997 1-56592-257-3, Order Number: 2573 368 pages, $34.95 |
Chapter 4
Now that we have some background under our belt, let's delve into the mechanics of how a regex engine really goes about its work. Here we don't care much about the Shine and Finish of the previous chapter; this chapter is all about the engine and the drive train, the stuff that grease monkeys talk about in bars. We'll spend a fair amount of time under the hood, so expect to get a bit dirty with some practical hands-on experience.
The Mechanics of Expression ProcessingStart Your Engines!
Let's see how much I can milk this engine analogy for. The whole point of having an engine is so that you can get from Point A to Point B without doing much work. The engine does the work for you so you can relax and enjoy the Rich Corinthian Leather. The engine's primary task is to turn the wheels, and how it does that isn't really a concern of yours. Or is it?Two Kinds of Engines
Well, what if you had an electric car? They've been around for a long time, but they aren't as common as gas cars because they're hard to design well. If you had one, though, you would have to remember not to put gas in it. If you had a gasoline engine, well, watch out for sparks! An electric engine more or less just runs, but a gas engine might need some babysitting. You can get much better performance just by changing little things like your spark plug gaps, air filter, or brand of gas. Do it wrong and the engine's performance deteriorates, or, worse yet, it stalls.Each engine might do its work differently, but the end result is that the wheels turn. You still have to steer properly if you want to get anywhere, but that's an entirely different issue.
New Standards
Let's stoke the fire by adding another variable: the California Emissions Standards.1 Some engines adhere to California's strict pollution standards, and some engines don't. These aren't really different kinds of engines, just new variations on what's already around. The standard regulates a result of the engine's work, the emissions, but doesn't say one thing or the other about how the engine should go about achieving those cleaner results. So, our two classes of engine are divided into four types: electric (adhering and non-adhering) and gasoline (adhering and non-adhering).
1 California has rather strict standards regulating the amount of pollution a car can produce. Because of this, many cars sold in America come in "for California'' and "non-California'' models. Come to think of it, I bet that an electric engine can qualify for the standard without much change, so it's not really impacted very much -- the standard just "blesses'' the clean results that are already par for the course. The gas engine, on the other hand, needs some major tweaking and a bit of re-tooling before it can qualify. Owners of this kind of engine need to pay particular care to what they feed it -- use the wrong kind of gas and you're in big trouble in more ways than one.
The impact of standards
Better pollution standards are a good thing, but they require that the driver exercise more thought and foresight (well, at least for gas engines, as I noted in the previous paragraph). Frankly, however, the standard doesn't impact most people since all the other states still do their own thing and don't follow California's standard... yet. It's probably just a matter of time.Okay, so you realize that these four types of engines can be classified into three groups (the two kinds for gas, and electric in general). You know about the differences, and that in the end they all still turn the wheels. What you don't know is what the heck this has to do with regular expressions!
More than you might imagine.
Regex Engine Types
There are two fundamentally different types of regex engines: one called "DFA'' (the electric engine of our story) and one called "NFA'' (the gas engine). The details follow shortly (=>101), but for now just consider them names, like Bill and Ted. Or electric and gas.Both engine types have been around for a long time, but like its gasoline counterpart, the NFA type seems to be used more often. Tools that use an NFA engine include Tcl, Perl, Python, GNU Emacs, ed, sed, vi, most versions of grep, and even a few versions of egrep and awk. On the other hand, a DFA engine is found in almost all versions of egrep and awk, as well as lex and flex. Table 4-1 lists a few common programs available for a wide variety of platforms and the regex engine that most versions use. A generic version means that it's an old tool with many clones -- I have listed notably different clones that I'm aware of.2
2 Where I could find them, I used comments in the source code to identify the author (or, for the generic tools, the original author). I relied heavily on Libes and Ressler's Life With Unix (Prentice Hall, 1989) to fill in the gaps. As Chapter 3 illustrated, 20 years of development with both DFAs and NFAs resulted in a lot of needless variety. Things were dirty. The POSIX standard came in to clean things up by specifying clearly which metacharacters an engine should support, and exactly the results you could expect from them. Superficial details aside, the DFAs (our electric engines) were already well suited to adhere to the standard, but the kind of results an NFA traditionally provided were quite different from the new standard, so changes were needed. As a result, broadly speaking, we have three types of regex engines:
- DFA (POSIX or not -- similar either way)
- Traditional NFA
- POSIX NFA
Table 4-1: Some Tools and Their Regex Engines Program (Original) Author Version Regex Engine awk Aho, Weinberger, Kernighan generic DFA new awk Brian Kernighan generic DFA GNU awk Arnold Robbins recent Mostly DFA, some NFA MKS awk Mortice Kern Systems POSIX NFA mawk Mike Brennan all POSIX NFA egrep Alfred Aho generic DFA MKS egrep Mortice Kern Systems POSIX NFA GNU Emacs Richard Stallman all Trad. NFA (POSIX NFA available) Expect Don Libes all Traditional NFA expr Dick Haight generic Traditional NFA grep Ken Thompson generic Traditional NFA GNU grep Mike Haertel Version 2.0 Mostly DFA, but some NFA GNU find GNU Traditional NFA lex Mike Lesk generic DFA flex Vern Paxson all DFA lex Mortice Kern Systems POSIX NFA more Eric Schienbrood generic Traditional NFA less Mark Nudelman Variable (usually Trad. NFA) Perl Larry Wall all Traditional NFA Python Guido van Rossum all Traditional NFA sed Lee McMahon generic Traditional NFA Tcl John Ousterhout all Traditional NFA vi Bill Joy generic Traditional NFA
POSIX standardized the workings of over 70 programs, including traditional regex-wielding tools such as awk, ed, egrep, expr, grep, lex, and sed. Most of these tools' regex flavor had (and still have) the weak powers equivalent to a moped. So weak, in fact, that I don't find them interesting for discussing regular expressions. Although they can certainly be extremely useful tools, you won't find much mention of expr, ed, and sed in this book. Well, to be fair, some modern versions of these tools have been retrofitted with a more-powerful flavor. This is commonly done to grep, a direct regex sibling of sed, ed, and expr.
On the other hand, egrep, awk, and lex were normally implemented with the electric DFA engine, so the new standard primarily just confirmed the status quo -- no big changes. However, there were some gas-powered versions of these programs which had to be changed if they wanted to be POSIX-compliant. The gas engines that passed the California Emission Standards tests (POSIX NFA) were fine in that they produced results according to the standard, but the necessary changes only enhanced their fickleness to proper tuning. Where before you might get by with slightly misaligned spark plugs, you now have absolutely no tolerance. Gasoline that used to be "good enough'' now causes knocks and pings. But so long as you know how to maintain your baby, the engine runs smoothly. And cleanly.
From the Department of Redundancy Department
At this point I'll ask you to go back and review the story about engines. Every sentence there rings with some truth about regular expressions. A second reading should raise some questions. Particularly, what does it mean that an electric DFA more or less "just runs?'' What kind of things affect a gas-powered NFA? How can I tune my NFA? What special concerns does an emissions-controlled POSIX NFA have? What's a "stalled engine'' in the regex world? Last, and certainly least, just what is the regex counterpart to Rich Corinthian Leather?Match Basics
Before looking at the differences among these engine types, let's first look at their similarities. Certain aspects of the drive train are the same (or for all practical purposes appear to be the same), so these examples can cover all engine types.About the Examples
This chapter is primarily concerned with a generic, full-function regex engine, so some tools won't support exactly everything presented. In my examples, the dip stick might be to the left of the oil filter, while under your hood it might be behind the distributor cap. Your goal is to understand the concepts so that you can drive and maintain your favorite regex package (and ones you find interest in later).I'll continue to use Perl's notation for most of the examples, although I'll occasionally show others to remind you that the notation is superficial and that the issues under discussion transcend any one tool or flavor. To cut down on wordiness here, I'll rely on you to check Chapter 3 if I use an unfamiliar construct.
This chapter details the practical effects of how a match is carried out. It would be nice if everything could be distilled down to a few simple rules that could be memorized without needing to understand what is going on. Unfortunately, that's not the case. In fact, with all this chapter offers, I identify only two all-encompassing rules I can list for you:
- the earliest match wins
- the quantifiers are greedy
We'll look at these rules, their effects, and much more throughout this chapter. Let's start by diving into the details of the first rule.
Rule 1: The Earliest Match Wins
Let's get down to business with The First Rule of Regular Expressions:The match that begins earliest wins
This rule says that any match that begins earlier in the string is always preferred over any plausible match that begins later. This rule doesn't say anything about how long the winning match might be (we'll get into that shortly), merely that among all the matches possible anywhere in the string, the one that begins the leftmost in the string is chosen. Actually, since more than one plausible match can start at the same earliest point, perhaps the rule should read "a match...'' instead of "the match...,'' but that sounds odd.
Here's how the rule comes about: the match is first attempted at the very beginning (just before the first character) of the string to be searched. "Attempted'' means that every permutation of the entire (perhaps complex) regex is tested starting right at that spot. If all possibilities are exhausted and a match is not found, the complete expression is re-tried starting from just before the second character. This full retry occurs at each position in the string until a match is found. A "no match'' result is reported only if no match is found after the full retry has been attempted at each position all the way to the end of the string (just after the last character).
Thus, when trying to match
ORAagainstFLORAL, the first attempt at the start of the string fails (sinceORAcan't matchFLO). The attempt starting at the second character also fails (it doesn't matchLOReither). The attempt starting at the third position, however, does match, so the engine stops and reports the match:FLORAL.If you didn't know this rule, results might sometimes surprise you. For example, when matching
catagainstthe match is in
The dragging belly indicates your cat is too fatindicates, not at the wordcatthat appears later in the line. This wordcatcould match, but thecatinindicateappears earlier in the string, so it is the one that is matched. For an application like egrep which cares only whether there is a match, not where the match might be, the distinction is irrelevant. For other uses, such as with a search-and-replace, it becomes paramount.Remember, the regex is tried completely each time, so something such as
fat|cat|belly|yourmatches `belly'rather than
The dragging belly indicates your cat is too fatfat, even thoughfatis listed first among the alternatives. Sure, the regex could conceivably matchfatand the others, but since they are not the earliest (starting furthest to the left) possible match, they are not the one chosen. As I said, the entire regex is attempted completely from one spot before moving along the string to try again from the next spot, and in this case that means trying each alternativefat,cat,belly, andyourat each position before moving on.The "Transmission'' and the Bump-Along
It might help to think of this rule as the car's transmission, connecting the engine to the drive train while adjusting for the gear you're in (or changing gears for you if it's an automatic -- perhaps the automotive equivalent of some internal optimizations we'll be talking about in the next chapter). The engine itself does the real work (turning the crank); the transmission transfers this work to the wheels.The transmission's main work: the bump-along
If the engine can't find a match starting at the beginning of the string, it's the transmission that bumps the regex engine along to attempt a match at the next position in the string, and the next, and the next, and so on. Usually. If, for instance, a regex begins with a start-of-string anchor, the transmission can realize that any bump-along would be futile, for only the attempt at the start of the string could possibly be successful. This is an example of the "String/Line Anchors" optimization discussed in the next chapter.Engine Pieces and Parts
An engine is made up of parts of various types and sizes. You can't possibly hope to understand how the whole thing works if you don't know much about the individual parts. In a regex, these parts are the individual units -- literal characters, quantifiers (star and friends), character classes, parentheses, and so on. The combination of these parts (and the engine's treatment of them) makes a regex what it is, so looking at ways they can be combined and how they interact is our primary interest. First, let's take a look at some of the individual parts:
- Literal text
- With a literal, non-metacharacter like
zor!, the match attempt is simply "Does this literal character match the current text character?'' If your regex is only literal text, such asusa, it is treated as "uand thensand thena.'' It's a bit more complicated if you have the engine do a case-insensitive match, wherebmatchesBand vice-versa, but it's still pretty straightforward.- Character classes, dot, and the like
- Matching a character class is not difficult either. Regardless of the length of the character class, it still matches just one character.3 A character class represents a set of characters that can match. Characters are included explicitly, or in a negated class excluded explicitly. Dot is just a shorthand for a large character class that matches any character (any character except newline and/or null in some flavors), so it's not a problem either. The same applies to other shorthand conveniences such as
\w,\W,\d,\D,\s,\S, and the like.
3 Actually, as we saw in the previous chapter (=>81), a POSIX collating sequence can match multiple characters, but this is not common. - Anchors
- A few other metacharacters are almost as simple, but they don't actually match characters in the target string; rather, they match a position in the target string. This includes string/line boundaries (caret and dollar), as well as word boundaries
\<,\b, and such. The tests are simple because, for the most part, they simply compare two adjacent characters in the target string.- Simple parentheses
- Certain parentheses used only for capturing text, as opposed to those used merely for grouping, have some performance impact (discussed in Chapter 5), but for the most part, they don't change how the match is carried out.
No electric parentheses or backreferences
I'd like to first concentrate on the similarities among the engines, but as foreshadowing, I'll show an interesting difference. Capturing parentheses (and the associated backreferences) are like a gas additive -- they affect a gasoline engine, but they are irrelevant to an electric engine because it doesn't use gas in the first place. The way a DFA engine works completely precludes the concept of backreferences and capturing parentheses. It just can't happen.4 This explains why tools developed with DFAs don't provide these features. You'll notice that awk, lex, and egrep don't have backreferences or any$1type functionality.
4 This does not, of course, mean that there can't be some mixing of technologies to try to get the best of both worlds. This is discussed in a sidebar on page 121. You might, however, notice that GNU's version of egrep does support backreferences. It does so by having two complete engines under the hood! It first uses a DFA engine to see whether a match is likely, and then uses an NFA engine (which supports the full flavor, including backreferences) to confirm the match. Later in this chapter, we'll see why a DFA engine can't deal with backreferences or capturing, and why anyone ever bothers with such an engine at all. (It has some major advantages, such as being able to match very quickly.)
Rule 2: Some Metacharacters Are Greedy
So far, we've seen matching that is quite straightforward. It is also rather uninteresting -- you can't do much without involving more-powerful metacharacters such as star, plus, alternation, and so on. Their added power requires more information to understand them fully.First, you need to know that the quantifiers (
?,*,+, and{min,max}) are greedy. When one of these governs a subexpression, such as theaina?, the(expr)in(expr)*, or the[0-9]in[0-9]+, there is a minimum number of matches that are required before it can be considered successful, and a maximum number that it will ever attempt to match. This has been mentioned in earlier chapters -- what's new here concerns The Second Rule of Regular Expressions:Items that are allowed to match a variable number of times always attempt to match as much as possible. They are greedy.
In other words, the quantifiers settle for the minimum number of required matches if they have to, but they always attempt to match as many times as they can, up to their maximum allowed.
The only time they settle for anything less than their maximum allowed is when matching too much ends up causing some later part of the regex to fail. A simple example is using
\<\w+s\>to match words ending with an `s', such asregexes. The\w+alone is happy to match the entire word, but if it does, thescan not match. To achieve the overall match, the\w+settles for matchingregexes, thereby allowings\>, and thus the full regex, to match.If it turns out that the only way the rest of the regex can succeed is when the greedy construct in question matches nothing (and if zero matches are allowed, as with star, question, and
{0,max}intervals), well, that's perfectly fine too. However, it turns out this way only if the requirements of some later subexpression forces the issue. Because the quantifiers always (or, at least, try to) take more than they minimally need, they are called greedy.This rule has many useful (but sometimes troublesome) implications. It explains, for example, why
[0-9]+matches the full number inMarch·1998. Once the1has been matched, the plus has fulfilled its minimum requirement, but because it tries for its maximum number of matches, it continues and matches the998before being forced to stop by the end of the string. (Since[0-9]can't match the nothingness at the end of the string, the plus finally stops.)A subjective example
Of course, this method of grabbing things is useful for more than just numbers. Let's say you have a line from an email header and want to check whether it is the subject line. As we saw in earlier chapters, you simply use^Subject:·. However, if you use^Subject:·(.*), you can later access the text of the subject itself via the tool's after-the-fact parenthesis memory (for example,$1in Perl).5
5 This example uses capturing as a forum for presenting greediness, so the example itself is appropriate only for NFAs (because only NFAs support capturing). The lessons on greediness, however, apply to all engines, including the non-capturing DFA. Before looking at why
.*matches the entire subject, be sure to understand that once the^Subject:·part matches, you're guaranteed that the entire regular expression will eventually match. You know this because there's nothing after^Subject:·that could cause the expression to fail --.*can never fail since the worst case of "no matches" is still considered successful for star.Why do we even bother adding
.*? Well, we know that because star is greedy, it attempts to match dot as many times as possible, so we use it to "fill"$1. In fact, the parentheses add nothing to the logic of what the regular expression matches -- in this case we use them simply to capture the text matched by.*.Once
.*hits the end of the string, the dot isn't able to match, so the star finally stops and lets the next item in the regular expression attempt to match (for even though the starred dot could match no further, perhaps a subexpression later in the regex could). Ah, but since it turns out that there is no next item, we reach the end of the regex and we know that we have a successful match.
So, with a variable
$lineholding a string such as
Subject: Re: happy birthday
the Perl code
if ( $line =~ m/^Subject: (.*)/ ) { print "The subject is: $1\n"; }
produces `The subject is: Re: happy birthday'.To make the example more concrete, here's the snippet in Tcl
if [regexp "^Subject: (.*)" $line all exp1] { puts "The subject is: $exp1" }
and in Python:
reg = regex.compile("Subject: \(.*\)") if reg.match(line) > 0: print "The subject is:", reg.group(1)As you can see, each language handles regular expressions in its own way, but the concept (and result) of greediness stays the same.
Regarding replies
To extend this example, let's consider bypassing that pesky `Re:·' that most mail systems prepend to a subject to indicate the message is a reply. Our goal is to ignore any `Re:·' that might begin the subject, printing only the "real" subject part.We can use greediness to our advantage and take care of this right in the regex. Consider
^Subject:·(Re:·)?(.*), with(Re:·)?added before(.*). Both subexpressions are greedy, but(Re:·)?gets to be greedy first, nabbing `Re:·' before.*takes what's left. In fact, we could use(Re:·)*just as well, which scoops up all theRe:that might have been stacked up over the course of back and forth replies.The parentheses in
(Re:·)?are intended only for grouping, but they still count as a set of parentheses. This means that our original parentheses, which grab what.*matches, become the second set. This, in turn, means that the subject of our interest becomes$2, not$1. So, if our code iswe get `
if ( $line =~ m/^Subject: (Re: )?(.*)/ ) { print "The subject is: $2\n"; }The subject is: happy birthday'.
You might even imagine something like:
if ( $line =~ m/^Subject: (Re: )?(.*)/ ) { # we have a match -- alter our report depending upon $1if ($1 eq "Re: ") { print "The reply is: $2\n"; } else { print "The subject is: $2\n"; } }Even if the set of parentheses that fills
$1is not able to match, the second set still stuffs its matched text into$2.
As a final comparison, let's look at the same expression with one of the parentheses moved:
Since sets of parentheses are always labeled by counting open parentheses from the left, the
Re:·parentheses become$2, and the whole subject becomes$1. However, this time the `Re:·' that might be matched into$2is also within the first set of parentheses, so$1will have that same `Re:·' (as well as the rest of the subject). Although this isn't useful with our example so far, it would be if you wanted to access the subject with any `Re:·' intact, but also want a simple way to tell whether it is a reply.Being too greedy
Let's get back to the concept of a greedy quantifier being as greedy as it can be. Consider how the matching and results would change if we add another.*:^Subject:·(.*).*. The answer is: nothing would change. The initial.*(inside the parentheses) is so greedy that it matches all the subject text, never leaving anything for the second.*to match. Again, failing to match is not a problem, since the star does not require a match to be successful. Were the second.*in parentheses as well, the resulting$2would always be empty.Does this mean that after
.*, a regular expression can never have anything that is expected to actually match? No, of course not. It is possible for something later in the regex to force something previously greedy to give something back (that is, relinquish or "unmatch") if that's what is necessary to achieve an overall match.Let's consider the possibly useful
^.*([0-9][0-9]), which matches the last two digits in the string, wherever they might be, and saves them to$1. Here's how it works: At first,.*matches the entire string. Because the following([0-9][0-9])is required, its failure to match, in effect, tells.*"Hey, you took too much! Give me back something so that I can have a chance to match." Greedy components first try to take as much as they can, but they always defer to the greater need to achieve an overall match. They're just stubborn about it, and only do so when forced. Of course, they'll never give up something that hadn't been optional in the first place, such as a plus quantifier's first match.With this in mind, let's apply
^.*([0-9][0-9])to `about·24·characters·long'. Once.*matches the whole string, the requirement for the first[0-9]to match forces.*to give up `g' (the last thing it had matched). That doesn't, however, allow[0-9]to match, so.*is again forced to relinquish something, this time theninlong. This cycle continues 15 more times until.*finally gets around to giving up4.Unfortunately, even though the first
[0-9]can then match, the second still cannot. So,.*is forced to relinquish more in search of the overall match. This time.*gives up the2, which the first[0-9]can match. Now, the4is free for the second[0-9]to match, and the entire expression matches `about·24·char...', with$1getting `24'.First come, first served
Consider changing that regex to^.*([0-9]+), ostensibly to match not just the last two digits, but the last whole number however long it might be. It won't work. As before,.*is forced to relinquish some of what it had matched because the subsequent[0-9]+requires a match to be successful. With the `about·24·char...' example, that means unmatching until the4. Like before,[0-9]can then match. Unlike before, there's then nothing further that must match, so.*is not forced to give up the2or any other digits it might have matched. Were.*to do so, the[0-9]+would certainly be a grateful recipient, but nope, first come first served. Greedy constructs are very possessive -- once something is in their clutches, they'll give it up if forced, but never just to be nice.If this feels counter-intuitive, realize that
[0-9]+is just one match away from[0-9]*, which is in the same league as.*. Substituting into^.*([0-9]+), we get^.*(.*)as our regex, which looks suspiciously like the^Subject:·(.*).*from a page or so ago.Getting down to the details
I should clear up a few things here. Phrases like "the.*gives up...'' and "the[0-9]forces...'' are slightly misleading. I used these terms because they're easy to grasp, and the end result appears to be the same as reality. However, what really happens behind the scenes depends on the basic engine type, DFA or NFA. So it's time to see what these really are.Regex-Directed vs. Text-Directed
The two basic engine types reflect a fundamental difference in how one might apply a regular expression in checking a string. I call the gasoline-driven NFA engine "regex-directed,'' and the electric-driven DFA "text-directed.''NFA Engine: Regex-Directed
Let's consider one way an engine might matchto(nite|knight|night)against the text `...tonight...'. Starting with thet, the regular expression is examined one component at a time, and the "current text'' is checked to see whether it matches. If it does, the next component is checked, and so on, until all components have matched, indicating that an overall match has been achieved.With the
to(nite|knight|night)example, the first component ist, which repeatedly fails until atis reached. Once that happens, theois checked against the next character, and if it matches, control moves to the next component. In this case, the "next component'' is(nite|knight|night)which really means "niteorknightornight.'' Faced with three possibilities, the engine just tries each in turn. We (humans with advanced neural nets between our ears) can see that if we're matchingtonight, the third alternative is the one that leads to a match. Despite their brainy origins (=>60), a regex-directed engine can't come to that conclusion until actually going through the motions to check.Attempting the first alternative,
nite, involves the same component-at-a-time treatment as before: "Try to matchn, theni, thent, and finallye.'' If this fails, as it eventually does, the engine tries another alternative, and so on until it achieves a match or must report failure. Control moves within the regex from component to component, so I call it "regex-directed.''The control benefits of an NFA engine
In essence, each subexpression of a regex in a regex-directed match is checked independently of the others -- their only interrelation is that of proximity to one another by virtue of being subexpressions thrown together to make a single regex. The layout of the components is what controls an engine's overall movement through a match.Since the regex directs the NFA engine, the driver (the writer of the regular expression) has considerable opportunity to craft just what he or she wants to happen. (Chapter 5 is devoted to this very subject.) What this really means may seem vague now, but it will all be spelled out just after the mysteries of life are revealed
DFA Engine: Text-Directed
Contrast the regex-directed NFA engine with an engine that, while scanning the string, keeps track of all matches "currently in the works.'' In thetonightexample, the engine knows a possible match has started the moment it hitst:
in string in regex after ...t|onight...possible matches: t|o(nite|knight|night)
Each subsequent character scanned updates the list of possible matches. After a few more characters are matched, the situation becomes
in string in regex after ...toni|ght...possible matches: to(ni|te|knight|ni|ght)
with two possible matches in the works (and one alternative,knight, ruled out). Yet, with thegthat follows, only the third alternative remains viable. Once thehandtare scanned as well, the engine realizes it has a complete match and can return success.I call this "text-directed'' matching because each character scanned controls the engine. As in the example above, a partial match might be the start of any number of different, yet possible, matches. Matches that are no longer viable are pruned as subsequent characters are scanned. There are even situations where a "partial match in progress'' is also a full match. With
to(...)?, for example, if any match in the works ends inside the parentheses, a full match (of `to') is already confirmed and in reserve in case the longer matches don't pan out.If you reach a character in the text that invalidates all the matches in the works, you must either: 1) revert to one of the full matches in reserve, or, failing that, 2) declare that there are no matches at the attempt's starting point.
Foreshadowing
If you compare these two engines based only on what I've mentioned so far, you might conclude that the text-directed DFA engine is generally faster. The regex-directed NFA engine might waste time attempting to match different subexpressions against the same text (such as the three alternatives in the example).You would be right. During the course of an NFA match, the same character of the target might be checked by many different parts of the regex (or even by the same part, over and over). Even if a subexpression can match, it might have to be applied again (and again and again) as it works in concert with the rest of the regex to find a match. A local subexpression can fail or match, but you just never know about the overall match until you eventually work your way to the end of the regex. (You know, if I could find a way to include "It's not over until the fat lady sings.'' in this paragraph, I would.) On the other hand, a DFA engine is determinate -- each character in the target is checked once (at most). When a character matches, you don't know yet if it will be part of the final match (it could be part of a possible match that doesn't pan out), but since the engine keeps track of all possible matches in parallel, it need be checked only once, period.
The Mysteries of Life Revealed
The foreshadowing in the last section might have been a bit thick, so I'd better come clean now, at least about some of it. The two basic technologies behind regular-expression engines have the somewhat imposing names Nondeterministic Finite Automaton (NFA) and Deterministic Finite Automaton (DFA). With mouthfuls like this, you see why I stick to just "NFA'' and "DFA.'' We won't be seeing these phrases spelled out again.6
6 I suppose I could explain the underlying theory that goes into these names... if I only knew it! As I hinted, the word determinate is pretty important, but for the most part the theory is not so long as we understand the practical effects. By the end of this chapter, we will. However, do see the sidebar on page 104. Because of the regex-directed nature of an NFA, the details of how the engine attempts a match are very important. As I said before, the writer can exercise a fair amount of control simply by changing how the regex is written. With the
tonightexample, perhaps less work would have been wasted had the regex been written differently, such asto(ni(ght|te)|knight),tonite|toknight|tonight, or perhapsto(k?night|nite). With any given text, they all end up matching exactly the same text, but in doing so direct the engine in different ways. At this point, we don't know enough to judge which regexes, if any, are better than others, but that's coming soon.It's the exact opposite with a DFA -- since the engine keeps track of all matches simultaneously, none of these differences in representation matter so long as in the end they all represent the same possible matches. There could be a hundred different ways to achieve the same result, but since the DFA keeps track of them all simultaneously (almost magically -- more on this later), it doesn't matter which form the regex takes. To a pure DFA, even expressions that appear as different as
abcand[aa-a](b|b{1}|b)care utterly indistinguishable.It all boils down to...
Three things come to my mind when describing a DFA engine:
· DFA matching is very fast
· DFA matching is very consistent
· Talking about DFA matching is very boring
(I'll eventually expand on all these points.)The regex-directed nature of an NFA makes it interesting to talk about. NFAs provide plenty of room for creative juices to flow. There are great benefits in crafting an expression well, and even greater penalties for doing it poorly. A gasoline engine is not the only engine that can stall and conk out completely. To get to the bottom of this, we need to look at the essence of an NFA engine: backtracking.
Backtracking
The essence of an NFA engine is this: it considers each subexpression or component in turn, and whenever it needs to decide between two equally viable options, it selects one and remembers the other to return to later if need be. If the attempted option is successful and the rest of the regex is also successful, you are finished with the match. If anything in the rest of the regex eventually causes failure, the regex engine knows it can backtrack to where it chose the option and can continue with the match by trying another. This way, it eventually tries all possible permutations of the regex (or at least as many as needed until a match is found).A Really Crummy Analogy
Backtracking is like leaving a pile of bread crumbs at every fork in the road. If the path you choose turns out to be a dead end, you can retrace your steps, giving up ground until you come across a pile of crumbs that indicates an untried path. Should that path, too, turn out to be a dead end, you can continue to backtrack, retracing your steps to the next pile of crumbs, and so on, until you eventually find a path that leads to your goal or until you run out of untried paths.There are various situations when the regex engine needs to choose between two (or more) options -- the alternation we saw earlier is only one example. Another example is that upon reaching
...x?..., the engine must decide whether it should attemptxor not. Upon reaching...x+..., however, there is no question about trying to matchxat least once -- the plus requires at least one match, and that's non-negotiable. Once the firstxhas been matched, though, the requirement is lifted and it then must decide to match anotherxor not. If it decides to match, it must decide if it will then attempt to match another... and another... and so on. At each of these many decision points, a virtual "pile of crumbs'' is left behind as a reminder that another option (to match or not to match, whichever wasn't chosen at each point) remains viable at that point.A crummy little example
Let's look at a full example using our earlierto(nite|knight|night)regex on the string `hot·tonic·tonight!' (silly, yes, but a good example). The first componenttis attempted at the start of the string. It fails to matchh, so the entire regex fails at that point. The engine's transmission then bumps along to retry the regex from the second position (which also fails), and again at the third. This time thetmatches, but the subsequentofails in matching the `·', so again the whole attempt fails.The attempt that eventually starts at
...|tonic...is more interesting. Once thetohas been matched, the three alternatives become three available options. The regex engine picks one to try, remembering the others ("leaving some bread crumbs'') in case the first fails. For the purposes of discussion, let's say that the engine first choosesnite. That expression breaks down to "n+i+t...,'' which gets to...toni|c...before failing. Unlike the earlier failures, this failure doesn't mean the end of the overall attempt because other options still remain. The engine chooses one, we'll sayknight, but it fails right away. That leaves only one final option,night, but it eventually fails. Since that was the final untried option, its failure means the failure of the entire attempt starting at...|tonic..., so the transmission kicks in again.Once the engine gets to the attempt starting at
...|tonight!it gets interesting again, but this time, thenightalternative successfully matches to the end. The successful matching to the end of the regex means an overall match, so the engine can report success at that point.Two Important Points on Backtracking
The general idea of how backtracking works is fairly simple, but some of the details are quite important for real-world use. Specifically, when faced with multiple choices, which choice should be tried first? Secondly, when forced to backtrack, which saved choice should the engine use?In situations where the decision is between "make an attempt'' and "skip an attempt,'' as with items governed by a question, star, and the like, the engine always chooses to first make the attempt. It will return later (to try skipping the item) only if forced by the overall need to reach a global expression-wide match.
This simple rule has far-reaching repercussions. For starters, it helps explain regex greediness, but not completely. To complete the picture, we need to know which (among possibly many) saved options to use when we backtrack. Simply put:
The most recently saved option is the one returned to when a local failure forces backtracking. It's LIFO (last in first out).
This is easily understood in the crummy analogy -- if your path becomes blocked, you simply retrace your steps until you come across a pile of bread crumbs. The first you'll return to is the most recently laid. The traditional analogy for describing LIFO also holds: like stacking and unstacking dishes, the most-recently stacked will be the first you'll unstack.
Saved States
In NFA regular expression nomenclature, the piles of bread crumbs are known as saved states. A state indicates where a test can restart from if need be. It reflects both the position in the regex and the point in the string where an untried option begins. Because this is the basis for NFA matching, let me go over what I've already said with some simple but verbose examples. If you're comfortable with the discussion so far, feel free to skip ahead.A match without backtracking
Let's look a simple example, matchingab?cagainstabc. Once theahas matched, the current state of the match is reflected by:
at ` a|bc'matching a|b?c
However, now that
b?is up to match, the regex engine has a decision to make: attempt thebor not. Well, since?is greedy, it attempts the match. But so that it can recover if that attempt fails or eventually leads to failure, it adds
at ` a|bc'matching ab?|c
to its otherwise empty list of saved states. This indicates that the engine can later pick up the match in the regex just after theb?, picking up in the text from just before theb(that is, where it is now). Thus, in effect, skipping thebas the question mark allows.Once the engine carefully places that pile of crumbs, it goes ahead and checks the
b. With the example text, it matches, so the new current state becomes:
at ` ab|c'matching ab?|c
The finalcmatches as well, so we have an overall match. The one saved state is no longer needed, so it is simply forgotten.A match after backtracking
Now, if `ac' had been the text to match, everything would have been the same until thebattempt was made. Of course, this time it wouldn't match. This means that the path that resulted from actually attempting the...?failed. Since there is a saved state available to return to, this "local failure'' does not mean overall failure. The engine backtracks, meaning that it takes the most recently saved state as its new current state. In this case, that would be the
at ` a|c'matching ab?|c
that had been saved as the untried option before thebhad been attempted. This time, thecandcmatch up, so the overall match is achieved.A non-match
Now let's look at the same expression, but againstabX. Before thebis attempted, the question mark causes the state
at ` a|bX'matching ab?|c
to be saved. Thebmatches, but that avenue later turns out to be a dead end because thecfails to matchX. The failure results in a backtrack to the saved state. The engine next testscagainst thebthat the backtrack effectively "unmatched.'' Obviously, this test fails, too. If there were other saved states, another backtrack would occur, but since there aren't any, the overall match at the current starting position is deemed a failure.Are we done? No. The engine's transmission will still do its "bump along the string and retry the regex,'' which might be thought of as a pseudo-backtrack. The match restarts at:
at ` a|bX'matching |ab?c
The whole match is attempted again from the new spot, but like before, all paths lead to failure. After the next two attempts (from
ab|XandabX|) similarly fail, a true overall failure is finally reported.Backtracking and Greediness
For tools that use this NFA regex-directed backtracking engine, understanding how backtracking works with your regular expression is the key to writing expressions that accomplish what you want, and accomplish it quickly. We've seen how?greediness works, so let's look at star (and plus) greediness.Star, plus, and their backtracking
If you considerx*to be more or less the same asx?x?x?x?x?x?...(or, more appropriately,(x(x(x(x...?)?)?)?)?),7 it's not too different from what we have already seen. Before checking the item quantified by the star, the engine saves a state indicating that if the check fails (or leads to failure), the match can pick up after the star. This is done over and over, until the star match actually does fail.
7 Just for comparison, remember that a DFA doesn't care much about the form you use to express which matches are possible; the three examples are identical to a DFA.
Thus, when matching
[0-9]+against `a·1234·num', once[0-9]fails trying to match the space after the4, there are four saved states indicating that the match can pick up in the regex at[0-9]+|at each of the string positions:These represent the fact that the attempt of
a 1|234 num a 12|34 num a 123|4 num a 1234|num[0-9]had been optional at each of these positions. When it fails to match the space, the engine backtracks to the most recently saved state (the last one listed), picking up at `a·1234|·num' in the text and at[0-9]+|in the regex. Well, that's at the end of the regex. Now that we're actually there and notice it, we realize that we have an overall match.Note that in the above list of four string positions, `
a·|1234·num' is not a member because the first match using the plus quantifier is required, not optional. Would it have been in the list had the regex been[0-9]*? (hint: it's a trick question) ¤ Turn the page to check your answer.
Where does [0-9]*match?¤ Answer to the question on page 106.
No, `
a·|1234·num' would not be part of a saved state during a match of[0-9]*. I posed this question because it's a mistake that new users commonly make.Remember, a component that has star applied can always match. If that's the entire regex, the entire regex can always match. This certainly includes the attempt when the transmission applies the engine the first time, at the start of the string. In this case, the regex matches at `
|a·1234·num' and that's the end of it -- it never even gets as far the digits.Revisiting a fuller example
With our more detailed understanding, let's revisit the^.*([0-9][0-9])example from page 97. This time, instead of just pointing to "greediness'' to explain why the match turns out as it does, we can use our knowledge of the mechanics of a match to explain why in precise terms. (If you're feeling a bit snowed in with the details, feel free to skim ahead for the general feel of what this chapter offers -- you can review the examples in more detail later on.)I'll use `
CA·95472,·USA' as an example. Once the.*has successfully matched to the end of the string, there are a dozen saved states accumulated from the star-governed dot matching 12 things that are (if need be) optional. These states note that the match can pick up in the regex at^.*|([0-9][0-9]), and in the string at each point where a state was created.Now that we've reached the end of the string and pass control to the first
[0-9], the match obviously fails. No problem: we have a saved state to try (a baker's dozen of them, actually). We backtrack, resetting the current state to the one most recently saved, to just before where.*matched the finalA. Skipping that match (or "unmatching'' it, if you like) leaves us trying thatAagainst the first[0-9]. It fails.This backtrack-and-test cycle continues until the engine effectively unmatches the
2, at which point the first[0-9]can match. The second, however, can't, so we must continue to backtrack. It's now irrelevant that the first[0-9]matched during the previous attempt -- the backtrack resets the current state to before the first[0-9]. As it turns out, the same backtrack resets the string position to just before the7, so the first[0-9]can match again. This time, so can the second (matching the2). Thus, we have a match: `CA·95472,·USA', with$1getting72.A few observations: first, the backtracking also entailed maintaining the status of the text being matched by the subexpression within parentheses. The backtracks always caused the match to be picked up at
^.*|([0-9][0-9]). As far as the simple match attempt is concerned, this is the same as^.*|[0-9][0-9], so I used phrases such as "picks up before the first[0-9].'' However, moving in and out of the parentheses involves updating the status of what$1should be, and this impacts efficiency. This is discussed fully in the next chapter (=> 150).It is important to realize that something governed by star (or any of the quantifiers) first matches as much as it can without regard to what might follow in the regex. In our example, the
.*does not magically know to stop at the first digit, or the second to the last digit, or any other place until the dot finally fails. We saw this earlier when looking at how^.*([0-9]+)would never match more than a single digit by the[0-9]+part (=> 98).More About Greediness
Many concerns (and benefits) of greediness are shared by both an NFA and a DFA. I'd like to look at some ramifications of greediness for both, but with examples explained in terms of an NFA. The lessons apply to a DFA just as well, but not for the same reasons. A DFA is greedy, period, and there's not much more to say after that. It's very constant to use, but pretty boring to talk about. An NFA, however, is interesting because of the creative outlet its regex-directed nature provides. An NFA engine affords the regex author direct control over how a match is carried out. This provides many benefits, as well as some efficiency-related pitfalls. (Discussions of efficiency are taken up in the next chapter.)Despite these differences, the match results are often similar. For the next few pages, I'll talk of both engine types, but offer the description in the more easily understandable regex-directed terms of an NFA. By the end of this chapter, we'll have a firm grasp of just when the results might differ, as well as exactly why.
Problems of Greediness
As we saw with the last example,.*always marches to the end of the line.8 This is because.*just thinks of itself and grabs what it can, only later giving up something if it is required to achieve an overall match.
8 Or, with a tool where a dot can also match a newline, and strings that contain multi-line data, it matches through all the logical lines to the end of the string. Sometimes this can be a real pain. Consider a regex to match text wrapped in doublequotes. At first, you might want to write
".*", but knowing what we know about.*, guess where it will match in:
The name "McDonald's" is said "makudonarudo" in JapaneseActually, since we understand the mechanics of matching, we don't need to guess -- we can know. Once the initial quote matches,
.*is free to match, and immediately does so all the way to the end of the string. It will back off (or, perhaps more appropriately, be backed off by the regex engine) only as much as is needed until the final quote can match. Running this through in your head, you realize that it will matchwhich is obviously not the doublequoted string that was intended. This is one reason why I caution against the overuse of
The name "McDonald's" is said "makudonarudo" in Japanese.*, as it can often lead to surprising results if you don't pay careful attention to greediness.So, how can we have it match only
"McDonald's"? The key is to realize that we don't want "anything'' between the quotes, but rather "anything except a quote.'' If we use[^"]*rather than.*, it won't overshoot the closing quote.The regex engine's basic approach with
"[^"]*"is exactly the same as before. Once the initial quote matches,[^"]*gets a shot at matching as much as it can. In this case, that's up to the quote afterMcDonald's, at which point it finally stops because[^"]can't match the quote. So, at that point, control moves to the closing quote. It happily matches, resulting in overall success:
The name "McDonald's" is said "makudonarudo" in JapaneseMulti-Character "Quotes''
In the first chapter, I talked a bit about matching HTML tags, such as with the sequence...<B>very</B>...causing the "very'' to be rendered in bold if the browser can do so. Attempting to match a<B>...</B>sequence seems similar to matching a quoted string, except the "quotes'' in this case are the multi-character sequences<B>and</B>. Like the quoted string example, multiple sets of "quotes'' cause problems:If we use
...<B>Billions</B> and <B>Zillions</B> of suns...<B>.*</B>, the greedy.*causes the match-in-progress to zip to the end of the line, backtracking only far enough to allow the</B>to match, matching the last</B>on the line instead of the one corresponding to the opening<B>at the start of the match.Unfortunately, since the closing delimiter is more than one character, we can't solve the problem in the same way as we did with doublequoted strings. We can't expect something as ridiculous as
<B>[^</B>]*</B>to work. A character class represents only one character and not the full</B>sequence that we want.9
9 Don't let the apparent structure of [^</B>]fool you. It is just a class to match one character, any character except<,>,/, andB. It is the same as, say[^/<>B], and certainly won't work as an "anything not</B>'' construct.Laziness?
This whole problem arises because star and friends (the quantifiers) are greedy. For a moment, let's imagine they are "lazy'' (or "lax'' or "minimal-matching'' or "non-greedy'' or "ungreedy'' or whatever you'd like to call it). With a lazy<B>.*</B>and theexample, after the initial
...<B>Billions</B> and <B>Zillions</B> of suns...<B>has matched, a lazy.*would immediately decide that since it didn't require any matches, it would lazily not bother trying to perform any. So, it would immediately pass control to the following<.The
<wouldn't match at that point, so control would return back to the lazy.*where it still had its untried option to attempt a match (to attempt multiple matches, actually). It would begrudgingly do so, with the dot then matching...<B>Billions.... Again, the star has the option to match more, or to stop. We're assuming it's lazy for the example, so it first tries stopping. The subsequent<still fails, so.*has to again exercise its untried match option. After eight cycles,.*will have eventually matchedBillions, at which point the subsequent<(and the whole</B>subexpression) will finally be able to match:
...<B>Billions</B> and <B>Zillions</B> of suns...So, as we've seen, the greediness of star and friends can be a real boon at times, while troublesome at others. Having non-greedy, lazy versions is wonderful, as they allow you to do things that are otherwise very difficult (or even impossible). As it turns out, Perl provides ungreedy quantifiers in addition to the normal greedy versions. Like most great inventions, the idea is simple; we just had to wait for someone to think of it (in this case, Perl's author, Larry Wall).
Unfortunately, if you are not using Perl and don't have a lazy star quantifier, you are still faced with how to solve the
<B>...</B>multi-character quote problem. Frankly, it is quite difficult to solve using a single, straight regular expression -- I recommend splitting the work into two parts, one to find the opening delimiter, the other to search from that point to find the closing delimiter.Greediness Always Favors a Match.
Recall the price-fixing (so to speak) example from Chapter 2 (=> 46). Due to floating-point representation problems, values that should have been "1.625'' or 3.00'' were sometimes coming out like "1.62500000002828'' and "3.00000000028822''. To fix this, I usedto lop off all but the first two or three decimal digits from the value stored in the variable
$price =~ s/(\.\d\d[1-9]?)\d*/$1/$price. The\.\d\dmatches the first two decimal digits regardless, while the[1-9]?matches the third digit only if it is non-zero.I then noted:
Anything matched so far is what we want to keep, so we wrap it in parentheses to capture to$1. We can then use$1in the replacement string. If this is the only thing that matches, we replace exactly what was matched with itself -- not very useful. However, we go on to match other items outside the$1parentheses. They don't find their way to the replacement string, so the effect is that they're removed. In this case, the "to be removed'' text is any extra digits, the\d*at the end of the regex.
So far so good, but let's consider what happens when the contents of the variable
$priceis already well-formed. When it is27.625, the(\.\d\d[1-9]?)part matches the entire decimal part. Since the trailing\d*doesn't match anything, the substitution replaces the `.625' with `.625' -- an effective no-op.This is the desired result, but wouldn't it be just a bit more efficient to do the replacement only when it would have some real effect? In such a case, the
\d*would have to actually match at least one digit, since only it matches text to be omitted.Well, we know how to write "at least one digit''! Simply replace
\d*with\d+:
$price =~ s/(\.\d\d[1-9]?)\d+/$1/With crazy numbers like "1.62500000002828'', it still works as before, but with something such as "9.43'', the trailing
\d+isn't able to match, so rightly, no substitution occurs. So, this is a great modification, yes? No! What happens with a three-digit decimal value like27.625?Stop for a moment to work through the match of
27.625yourself.In hindsight, the problem is really fairly simple. Picking up in the action once
(\.\d\d[1-9]?)\d+has matched27.625, we find that\d+can't match. That's no problem for the regex engine, since the match of `5' by[1-9]was optional and there is still a saved state to try. This is the option of having[1-9]?match nothing, leaving the5to fulfill the must-match-one requirement of\d+. Thus, we get the match, but not the right match:.625is replaced by.62, and the value becomes incorrect.The lesson here is that a match always takes precedence over a non-match, and this includes taking from what had been greedy if that's what is required to achieve a match.10
Is Alternation Greedy?
The only major control I haven't yet discussed in depth is|, alternation. How alternation works is an important point because it can work in fundamentally different ways with different regex engines. When alternation is reached, any number of the alternatives might be able to match in the string, but which will? The Second Rule of Regular Expressions refers to quantifiers (star and friends) but not alternation. Is alternation greedy?Let's look at an NFA engine. When faced with alternation, it tries each alternative in turn. You can pretty much count on each alternative being checked in the order given in the expression. Let's say the regex is
^(Subject|Date):·. When the alternation is reached, the first alternative,Subject, is attempted. If it matches, the rest of the regex,:·, is given a chance. If it turns out that it can't match, and if other alternatives remain (in this case,Date), the regex engine will backtrack to try them. This is just another case of the regex engine backtracking to a point where untried options are still available. This continues until an overall match is achieved, or until all options (in this case, alternatives) are exhausted.What text will actually be matched by
tour|to|tournamentwhen applied to the string `three·tournaments·won'? All the alternatives are attempted (and fail) during each attempt (at the 1st character position, 2nd, 3rd, and so on) until the transmission starts the attempt at `three·|tournaments·won'. This time, the first alternative,tour, matches. Since the alternation is the last thing in the regex, the moment thetourmatches, the whole regex is done. The other alternatives are not even tried again.So, we see that alternation is not greedy, at least not for an NFA. Well, to be specific, alternation is not greedy for a Traditional NFA. Greedy alternation would have matched the longest possible alternative (
tournament), wherever in the list it happened to be. A POSIX NFA, or any DFA would have indeed done just that, but I'm getting a bit ahead of myself.To make sure you are on your toes, let me ask: which kind of alternation would result in
tour|to|tournamentmatching the same text asto(ur(nament)?)?? Before answering, make sure you realize that both are logically the same: they can matchtour,to, andtournament, but nothing else. The question here is, in practice, which text willto(ur(nament)?)?actually match:tour(as with non-greedy alternation),tournament(as with greedy alternation), or something else altogether? ¤ Turn the page to check your answer.
Is to(ur(nament)?)?Greedy?¤ Answer to the question on page 112.
Once the initial
tohas matched, we know that an overall match is guaranteed since nothing in the regex that follows is required. Although this may well end up being the final overall match, the engine can't decide that yet, as there is still the possibility that more can be matched -- the question marks are greedy, so they attempt to match what they quantify.The
(ur(nament)?)quantified by the outermost question mark matches if possible, and within that, thenamentalso matches if possible. So overall, the entireto+ur+namentmatches if at all possible. In practice, it matches the same text as a greedy-alternationtour|to|tournament: the longest possible.Uses for Non-Greedy Alternation
Let's revisit the(\.\d\d[1-9]?)\d*example from page 110. If we realize that\.\d\d[1-9]?, in effect, says "allow either\.\d\dor\.\d\d[1-9]'', we can rewrite the entire expression as(\.\d\d|\.\d\d[1-9])\d*. (There is not a compelling reason to make this change -- it's merely a handy example.) Is it really the same as(\.\d\d[1-9]?)\d*? If alternation is greedy, then yes, it is, but the two are quite different with non-greedy alternation.Let's consider it as non-greedy for the moment. If the first alternative,
\.\d\d, is able to match, the\d*that follows the alternation will certainly succeed. This might include matching a non-zero digit (which, if you'll recall the original problem, is a digit we really want to match only within the parentheses). Also, realize that the second alternative begins with a copy of the entire first alternative -- if the first alternative fails, the second will certainly fail as well. The regex engine will nevertheless make the attempt, but I'll leave that issue of efficiency to the next chapter.Interestingly, if we use
(\.\d\d[1-9]|\.\d\d)\d*, which is the same except that the alternatives have been swapped, we do effectively get a replica of the original(\.\d\d[1-9]?)\d*. The alternation has meaning in this case because if the first alternative fails due to the[1-9], the second alternative still stands a chance.In distributing the
[1-9]?to two alternatives and placing the shorter one first, we fashioned a non-greedy?of sorts. It ends up being meaningless in this particular example because there is nothing that could ever allow the second alternative to match if the first fails. I see this kind of faux-alternation often, and it is invariably a mistake. In one book I've read,a*((ab)*|b*)is used as an example in explaining something about regex parentheses. A rather silly example, isn't it? Since the first alternative,(ab)*, can never fail, any other alternatives (justb*in this case) are utterly meaningless. You could add
a*((ab)*|b*|.*|partridge·in·a·pear·tree|[a-z])and it wouldn't change the meaning a bit.
Non-greedy alternation pitfalls
You can often use non-greedy alternation to your advantage by crafting just the match you want, but non-greedy alternation can also lead to unexpected pitfalls for the unaware. Consider wanting to match a January date of the form `Jan 31'. We need something more sophisticated than, say,Jan·[0123][0-9], as that allows "dates'' such as `Jan 00', `Jan 39', and disallows, say, `Jan 7'.One simple way to match the date part is to attack it in sections. To match from the first through the ninth, using
0?[1-9]allows a leading zero. Adding[12][0-9]allows for the tenth through the 29th, and3[01]rounds it out. Putting it all together, we getJan·(0?[1-9]|[12][0-9]|3[01]).Where do you think this matches in `
Jan 31 is my dad's birthday'? We want it to match `Jan 31', of course, but non-greedy alternation actually matches only `Jan 3'. Surprised? During the match of the first alternative,0?[1-9], the leading0?fails, but the alternative matches because the subsequent[1-9]has no trouble matching the3. Since that's the end of the expression, the match is complete.Were the order of the alternatives reversed, or the alternation greedy, this problem would not have surfaced. Another approach to our date-matching task could be
Jan·(31|[123]0|[012]?[1-9]). Like the first solution, this requires careful arrangement of the alternatives to avoid the problem. Yet a third approach isJan·(0[1-9]|[12][0-9]?|3[01]?|[4-9]), which works properly regardless of the ordering. Comparing and contrasting these three expressions can prove quite interesting (an exercise I'll leave for your free time, although the "A Few Ways to Slice and Dice a Date'' sidebar on page 115 should be helpful).Greedy Alternation in Perspective
As we've seen, non-greedy alternation is more powerful than greedy alternation because you have more control over just how a match is attempted -- it doesn't say "Take any of these'' so much as "Try this, then that, and finally the other.''With an NFA, alternation can entail a lot of backtracking, and finding ways to reduce it is usually a good way to make the regex more efficient, which means faster execution. We'll see some examples soon, and more in the next chapter.
Character Classes vs. Alternation
Because of the superficial similarity between[abc]anda|b|c, you might tend to think that character classes are implemented similarly, but with an NFA nothing could be further from the truth. With a DFA, it makes no difference one way or the other, but with an NFA a character class is an atomic unit which acts like a sieve, allowing a character to pass only if it is one of the target characters. This test is, in effect, done in parallel. It is much more efficient than the comparable NFA alternation, which checks each alternative in turn (entailing a lot of backtracking).NFA, DFA, and POSIX
"The Longest-Leftmost''
Let me repeat: when the transmission starts a DFA engine from some particular point in the string, if any match is to be found from that position, the DFA will find the longest possible, period. Since it's the longest from among all possible matches that start equally furthest to the left, it's the "longest-leftmost'' match.Really, the longest
Issues of which match is longest aren't confined to alternation. Consider how an NFA matches the (horribly contrived)one(self)?(selfsufficient)?against the stringoneselfsufficient. An NFA first matchesoneand then the greedy(self)?, leaving(selfsufficient)?left to try againstsufficient. That doesn't match, but that's okay since it is optional. So, the Traditional NFA returnsoneselfsufficientand discards the untried states. (A POSIX NFA, on the other hand, is another story that we'll get to shortly.)On the other hand, a DFA finds the longer
oneselfsufficient. If the(self)?were to be non-greedy and go unmatched, the(selfsufficient)?would be able to match, yielding a longer overall match. That is the longest possible, so is the one that a DFA finds.I chose this silly example because it's easy to talk about, but I want you to realize that this issue is important in real life. For example, consider trying to match continuation lines. It's not uncommon for a data specification to allow one logical line to extend across multiple real lines if the real lines end with a backslash before the newline. As an example, consider the following:11
SRC=array.c builtin.c eval.c field.c gawkmisc.c io.c main.c \ missing.c msg.c node.c re.c version.c
11 The actual text is irrelevant to the example, but for what it's worth, this is from the GNU awk makefile. You might normally want to use
^\w+=.*to match this kind of "var = value'' assignment line, but this regex doesn't consider the continuation lines. I'm assuming, for the example, that the tool's dot won't match a newline -- you could substitute it with something like[^\n]if need be.To match continuation lines, you might want to try appending
(\\\n.*)*to the regex. Ostensibly, this says that any number of additional logical lines are allowed so long as they follow an escaped newline. This seems reasonable, but it will never work with a traditional NFA. By the time the original.*has reached the newline, it has already passed the backslash, and nothing in what was added forces it to backtrack. Yet a DFA would find the longer multi-line match if it existed, simply because it was, indeed, the longest.POSIX and the Longest-Leftmost Rule
The POSIX standard requires that if you have multiple possible matches that start at the same position, the one matching the most text must be the one returned. Period.The standard uses the phrase "longest of the leftmost.''12 It doesn't say you have to use a DFA, so if you want to use an NFA when creating a tool, what's a programmer to do? If you want to implement a POSIX NFA, you'd have to find the full
oneselfsufficientand all the continuation lines, despite these results being "unnatural'' to your NFA.
12 The rationale associated with the standard uses the phrase "leftmost-longest,'' which is incorrect English considering what we know they're trying to say. It means "of all the equally 'longest', choose the leftmost,'' which is quite different from "longest of the leftmost.'' A Traditional NFA engine stops with the first match it finds, but what if it were to continue to try all the remaining options? Each time it reached the end of the regex, it would have another plausible match. By the time all options are exhausted, it could simply report the longest of the plausible matches it had found. Thus, a POSIX NFA.
An NFA applied to the first example would, after matching
(self)?, have saved an option noting that it could pick up matchingone(self)?|(selfsufficient)?atone|selfsufficient. Even after finding theoneselfsufficientthat a Traditional NFA returns, a POSIX NFA would continue to exhaustively check the remaining options, eventually realizing that yes, there was a way to match the longeroneselfsufficient.Really, the leftmost
Not only does POSIX mandate that the longest-leftmost match be found, but that if subexpression capturing (via parentheses) is supported, each captured subexpression must capture the maximum amount of text consistent with the overall leftmost-longest, and with its place in the regex. This means that the overall match is chosen based only on the longest-leftmost rule, but once chosen, the first set of capturing parentheses gets the maximum possible from that. After that, the second set gets the maximum possible of what's left. And so on.
(to|top)(o|polo)?(gical|o?logical)can match `topological' in various ways, but a POSIX engine would have to match `topological' as shown (the part matched by each parenthesized expression is marked). Compare this to, say the `topological' that a Traditional NFA would find. The former's first parentheses'topis longer than the latter'sto, so it would have to be the particular match chosen for a POSIX match. Similarly, even though it could then match nothingness in the second set of parentheses (matchingologicalin the third), it takes the longest match that is consistent with both the longest overall match and the earlier parentheses taking their longest match.Note, though, that many POSIX engines do not support any kind of capturing, so this issue is normally not a concern.
Speed and Efficiency
If efficiency is an issue with a Traditional NFA (and with all that backtracking, believe me, it is), it is doubly so with a POSIX NFA since there can be so much more backtracking. A POSIX NFA engine really does have to try every possible permutation of the regex every time. Examples in the next chapter show that poorly written regexes can suffer extremely severe performance penalties.DFA efficiency
The text-directed DFA is a really fantastic way around all the inefficiency of backtracking. It gets its matching speed by keeping track of all possible ongoing matches at once. How does it achieve this magic?The DFA engine spends extra time and memory before a match attempt to analyze the regular expression more thoroughly (and in a different way) than an NFA. Once it starts actually looking at the string, it has an internal map describing "If I read such-and-such a character now, it will be part of this-and-that possible match.'' As each character of the string is checked, the engine simply follows the map.
Building that map (called compiling the regex, something that must be done for an NFA as well, but it's not nearly as complex) can sometimes take a fair amount of time and memory, but once done for any particular regular expression, the results can be applied to an unlimited amount of text. It's sort of like charging the batteries of your electric car. First, your car sits in the garage for a while, plugged into the wall like a Dust Buster, but when you actually use it, you get consistent, clean power.
DFA and NFA in Comparison
Both DFA and NFA engines have their good and bad points:Differences in the pre-use compile
Before applying a regex to a search, both types of engines compile the regex to an internal form suited to their respective matching algorithms. An NFA compile is generally faster, and requires less memory. There's no real difference between a Traditional and POSIX NFA compile.Differences in match speed
For simple literal-match tests in "normal'' situations, both types match at about the same rate. A DFA's matching speed is unrelated to the particular regex, while an NFA's is directly related. For a Traditional NFA to conclude that there is no match, it must try every possible permutation of the regex. This is why I spend the entire next chapter on techniques to write an NFA regex that will match quickly. As we'll see, an NFA match can sometimes take forever (well, almost). At least a Traditional NFA can stop if and when it finds a match. A POSIX NFA, on the other hand, must always try every possible permutation to ensure that it has found the longest possible match, so it generally takes the same (possibly very long) amount of time to complete a successful match as it does completing a failed attempt. Writing efficient regexes is doubly important for a POSIX NFA.In one sense, I speak a bit too strongly in the previous paragraph, since optimizations can often reduce the work needed to return an answer. We've already seen the optimization of not trying
^-anchored regexes beyond the start of the string (=>92), and we'll see many more in the next chapter. In general, the need for optimizations is less pressing with a DFA (since its matching is so fast to begin with), but for the most part, the extra work done during the DFA's pre-use compile affords better optimizations than most NFA engines take the trouble to do.Modern DFA engines often try to reduce the time and memory used during the compile by postponing some work until a match is attempted. Often, much of the compile-time work goes unused because of the nature of the text actually checked. A fair amount of time and memory can sometimes be saved by postponing the work until it's actually needed during the match. (The technobabble term for this is lazy evaluation.) It does, however, create cases where there can be a relationship between the regex and the match speed.
Differences in what is matched
A DFA (or anything POSIX) finds the longest leftmost match. A Traditional NFA might also, or it might find something else. Any individual engine will always treat the same regex/text combination in the same way, so in that sense it's not "random,'' but other NFA engines may decide to do slightly different things. Virtually all NFA engines I've seen work exactly the way I've described here,13but it's not something absolutely guaranteed by technology or any standard.
13 I have seen two tools employ slightly different engines. Older versions of GNU awk (gawk), such as version 2.15.6, had neither greedy nor non-greedy alternation -- it seemed rather random what alternative would match. The other is MIFES, a popular Japanese editor. Some versions sometimes turn .*Xinto[^X]*X(in an effort, I suppose, to make regexes seem more "natural'' to those that don't understand them).Differences in capabilities
An NFA engine can support many things that a DFA cannot. Among them are:
- Capturing text matched by a parenthesized subexpression. Related features are backreferences and after-match information saying where in the text each parenthesized subexpression matched.
- Lookahead. Although we haven't discussed it in this chapter (because only Perl supports it:14 => 228), positive lookahead allows you to, in effect, say "This subexpression must match for me to continue on, but just check it, don't 'consume' any of the text.'' Negative lookahead is the analogous "this subexpression mustn't match.''
14 lex has trailing context, which is exactly the same thing as zero-width positive lookahead at the end of the regex, but it can't be generalized to embedded use. - [Traditional NFA only] Non-greedy quantifiers and alternation. A DFA could easily support a guaranteed shortest overall match (although for whatever reason, this option never seems to be made available to the user), but it cannot implement the local laziness that we've talked about.
Differences in implementation ease
Although they have limitations, simple versions of DFA and NFA engines are easy enough to understand and to implement. The desire for efficiency (both in time and memory) and enhanced features drives the implementation to greater and greater complexity. With code length as a metric, consider that the NFA regex support in the Version 7 (January 1979) edition of ed was less than 350 lines of C code. (For that matter, the entire source for grep was a scant 478 lines.) Henry Spencer's 1986 freely available implementation of the Version 8 regex routines was almost 1,900 lines of C, and Tom Lord's 1992 POSIX NFA packagerx(used in GNU sed, among other tools) is a stunning 9,700 lines long. For DFA implementations, the Version 7 egrep regex engine was a bit over 400 lines long, while Henry Spencer's 1992 full-featured POSIX DFA package is over 4,500 lines long. To provide the best of both worlds, GNU egrep Version 2.0 utilizes two fully functional engines (about 8,300 lines of code).Simple, however, does not necessarily mean "lack of features.'' I recently wanted to use regular expressions for some text processing with Delphi, Borland's Pascal development environment. I hadn't used Pascal since college, but it still didn't take long to write a simple NFA regex engine. It doesn't have a lot of bells and whistles, and it is not built for maximum speed, but the flavor is relatively full-featured and so even the simple package is quite usable.
DFA Speed With NFA Capabilities: Regex Nirvana? I've said several times that a DFA can't provide capturing parentheses or backreferences. This is quite true, but it certainly doesn't preclude hybrid approaches which mix technologies in an attempt to reach regex nirvana. The sidebar on page 104 told how NFAs have diverged from the theoretical straight and narrow in search of more power, and it's only natural that the same happens with DFAs. A DFAs construction makes it more difficult, but that doesn't mean impossible.
GNU grep takes a simple but effective approach. It uses a DFA when possible, reverting to an NFA when backreferences are used. GNU awk does something similar -- it uses GNU grep's fast shortest-leftmost DFA engine for simple "does it match'' checks, and reverts to a different engine for checks where the actual extent of the match must be known. Since that other engine is an NFA, GNU awk can conveniently offer capturing parentheses, and it does via its special
gensubfunction.Until recently, there seemed to be little practical work done on extending the DFA itself, but there is active research in this area. As I finish up work on this book, Henry Spencer writes that his recent and mostly DFA package supports capturing parentheses, and is "at worst quadratic in text size, while an NFA is exponential.'' This new technology needs more time to mature before it becomes widely available, but it holds promise for the future.
Practical Regex Techniques
Now that we've touched upon the basic concerns for writing regular expressions, I'd like to put this understanding to work and move on to more advanced techniques for constructing regular expressions. I'll try to pick up the pace a bit, but be aware that some of the issues are still fairly complex.Contributing Factors
Writing a good regex involves striking a balance among several concerns:
- matching what you want, but only what you want
- keeping the regex manageable and understandable
- for an NFA, being efficient (creating a regex that leads the engine quickly to a match or a non-match, as the case may be)
These concerns are often context-dependent. If I'm working on the command line and just want to grep something quickly, I'll probably not care if I match a bit more than I need, and I won't usually be too concerned to craft just the right regex for it. I'll allow myself to be sloppy in the interest of time, since I can quickly peruse the output for what I want. However, when I'm working on an important script, it's worth the time and effort to get it right: a complex regular expression is okay if that's what it takes. There is a balance among all these issues.
Even in a script, efficiency is also context-dependent. For example, with an NFA, something long like
^-(display|geometry|cemap|...|quick24|random|raw)$to check command-line arguments is inefficient because of all that alternation, but since it is only checking command-line arguments (something done perhaps a few times at the start of the program) it wouldn't matter if it took 100 times longer than needed. It's just not an important place to worry much about efficiency. Were it used to check each line of a potentially large file, the inefficiency would penalize you for the duration of the program.Be Specific
Continuing with the continuation-line example from page 116, we found that^\w+=.*(\\\n.*)*applied with a Traditional NFA wouldn't match both lines of:
SRC=array.c builtin.c eval.c field.c gawkmisc.c io.c main.c \ missing.c msg.c node.c re.c version.cThe problem is that the first
.*matches past the backslash, pulling it out from under the(\\\n.*)*that we want it to be matched by. If we don't want to match past the backslash, we should say that in the regex:15
15 Notice how I made sure to include \nin the class? You'll remember that one of the assumptions of the original regex was that dot didn't match a newline, and we don't want its replacement to match a newline either (=> 79).
^\w+=[^\n\\]*(\\\n[^\n\\]*)*This might be too specific, though, since it doesn't allow backslashes except those at the end of lines. You can take advantage of a regex-directed NFA's non-greedy alternation, starting over with the original expression and changing the
[^\n\\]*to(\\\n|.)*, which gives us:
^\w+=(\\\n|.)*The part we appended earlier, added to match the escaped newlines and their subsequent continuation lines, is now unneeded -- the main
(\\\n|.)*matches right through newlines, but only if they are escaped.Well, not exactly. A line ending with
\\(not common, but possible) happens to have a backslash before the newline, but the newline is not escaped so there is no continuation line. The problem is that the dot will match the first backslash, allowing the second to match\\\non the next cycle of the star, resulting in a false continuation line. We weren't specific enough -- if we want that the second alternative should not match an escape, we should say that using the[^\n\\]from before.The problem with
(\\\n|[^\n\\])*is that it now doesn't allow anything to be escaped except for newlines. We really want the first alternative to say "any escaped byte.'' If dot doesn't match newline, don't make the silly mistake of trying[\n.]. If octal escapes are supported,(\\[\000-\377]|[^\n\\])*works.Finally, a hint to what the next chapter covers: Since there's now no ambiguity between the two alternatives, we may as well swap them so that the one likely to be used most often comes first. This will allow an NFA to match a bit more quickly.
Matching an IP address
As another example that we'll take much further, let's match an IP (Internet Protocol) address: four numbers separated by periods, such as1.2.3.4. Often, the numbers are padded to three digits, as in001.002.003.004. If you want to check a string for one of these, you could use[0-9]*\.[0-9]*\.[0-9]*\.[0-9]*, but that's so vague that it even matches `and then.....?'. Look at the regex: it doesn't even require any numbers to exist -- its only requirements are three periods (with nothing but digits, if anything, between).To fix this regex, we first change the star to a plus, since we know that each number must have at least one digit. To ensure that the entire string is only the IP address, we wrap the regex with
^...$. This gives us:
^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$Using Perl's
\das a shorthand for[0-9], this becomes the more-readable16^\d+\.\d+\.\d+\.\d+$, but it still allows things that aren't IP addresses, like1234.5678.9101112.131415. (Each number must be in the range of 0-255.) To enforce that each number is three digits long, you could use
16 Or maybe not -- it depends on what you are used to. If you are new to regular expressions, at first they all seem odd. Perl has perhaps the richest regular expression flavor to be found, and lacking enough symbols to serve as metacharacters, many items such as \dcombine a backslash and a letter for their representation. To some, these added "features'' are merely superficial gimmicks that add more backslashes. Personally, I don't like a lot of backslashes either, but I enjoy the features (superficial or not) and so use them.
^\d\d\d\.\d\d\d\.\d\d\d\.\d\d\d$but now we are too specific. We still need to allow one- and two-digit numbers (as in
1.2.3.4). If the tool's regex flavor provides the{min,max}notation, you can use^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$, and if not, you can always use\d\d?\d?or\d(\d\d?)?for each part. Either regex allows one to three digits, but in a slightly different way.Depending on your needs, you might be happy with some of the various degrees of vagueness in the expressions so far (just as my editor was comfortable with the simple expression he used on page 5). If you really want to be strict, you have to worry that
\d\d\dcan match999, which is above 255, and thus an invalid component of an IP address.Several approaches would ensure that only numbers from 0 to 255 appear. One silly approach is
0|1|2|3|...253|254|255. Actually, this doesn't allow the zero-padding that is allowed, so you really need0|00|000|1|01|001|..., which is even more ridiculous. For a DFA engine, it is ridiculous only in that it's so long and verbose -- it still matches just as fast as any regex describing the same text. For an NFA, however, all the alternation kills efficiency.A realistic approach would concentrate on which digits are allowed in a number, and where. If a number is only one or two digits long, there is no worry as to whether the value is within range, so
\d|\d\dtakes care of them. There's also no worry about the value for a three-digit number beginning with a0or1, since such a number is in the range 000-199. This lets us add[01]\d\d, leaving us with\d|\d\d|[01]\d\d. You might recognize this as being similar to the date example earlier in this chapter (=> 115), and the time example in Chapter 1 (=> 24).Continuing, a three-digit number beginning with a
2is allowed if the number is 255 or less, so a second digit less than5means the number is valid. If the second digit is5, the third must be less than6. This can all be expressed as2[0-4]\d|25[0-5].This may seem confusing at first, but the approach makes sense when you think about it. The result is
\d|\d\d|[01]\d\d|2[0-4]\d|25[0-5]. Actually, we can combine the first three alternatives to yield[01]?\d\d?|2[0-4]\d|25[0-5]. Doing so is more efficient for an NFA, since any alternative that fails results in a backtrack. Note that using\d\d?in the first alternative, rather than\d?\d, allows an NFA to fail just a bit more quickly when there is no digit at all. I'll leave the analysis to you -- walking through a simple test case with both should illustrate the difference. We could do other things to make this part of the expression more efficient, but I'll leave that facet of the discussion for the next chapter.Now that we have a subexpression to match a single number from 0 through 255, we can wrap it in parentheses and insert it in place of each
\d{1,3}in the earlier regex. This gives us (broken across lines to fit the width of the page):
^([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])$Quite a mouthful! Was the trouble worth it? You have to decide yourself based upon your own needs. This still allows
0.0.0.0, which is invalid because all the digits are zero, but creating a regex to disallow this would be much tougher. You can't just disallow0from each number, since something like123.202.0.188is valid. At some point, depending on your needs, you have to decide when it is not worth the trouble to be more specific -- the cost/benefit ratio starts to suffer from diminishing returns. Sometimes it's better to take some of the work out of the regex. For example, going back to^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$and wrapping each component in parentheses will stuff the numbers into$1,$2,$3, and$4, which can then be validated by other programming constructs.One thing to consider, though, is the negative lookahead that Perl provides. It can disallow specific cases before the engine even tries the "main'' expression. In this case, prepending
(?!0+\.0+\.0+\.0+$)causes immediate failure when every component is zero. This is explained further in Chapter 7 (=> 230).Know your context
It's important to realize that the two anchors are required to make this regex work. Without them, it would be more than happy to matchip=72123.3.21.993, or for a Traditional NFA, evenip=123.3.21.223.In that second case, it does not even fully match the final
223that should have been allowed. Well, it is allowed, but there's nothing (such as a separating period, or the trailing anchor) to force that match. The final group's first alternative,[01]?\d\d?, matched the first two digits, and then that was the end of the regex. Like with the date-matching problem on page 114, we can arrange the order of the alternatives to achieve the desired effect. In this case, that would be such that the alternatives matching three digits come first, so any proper three-digit number is matched in full before the two-digit-okay alternative is g