O'Reilly logo

Introducing Regular Expressions by Michael Fitzgerald

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Chapter 4. Alternation, Groups, and Backreferences

You have already seen groups in action. Groups surround text with parentheses to help perform some operation, such as the following:

  • Performing alternation, a choice between two or more optional patterns

  • Creating subpatterns

  • Capturing a group to later reference with a backreference

  • Applying an operation to a grouped pattern, such as a quantifer

  • Using non-capturing groups

  • Atomic grouping (advanced)

We’ll be using a few contrived examples, in addition to the text from “The Rime of the Ancyent Mariner” again, in rime.txt. This time, I’ll use the desktop version of RegExr, as well as other tools like sed. You can download the desktop version of RegExr from http://www.regexr.com, for Windows, Mac, or Linux (it was written with Adobe AIR). Click the Desktop Version link on the RegExr web page (lower-right corner) for more information.

Alternation

Simply said, alternation gives you a choice of alternate patterns to match. For example, let’s say you wanted to find out how many occurrences of the article the are in the “The Rime of the Ancient Mariner.” The problem is, the word occurs as THE, The, and the in the poem. You can use alternation to deal with this peculiarity.

Open the RegExr desktop application by double-clicking on its icon. It looks very much like the online version but has the advantage of being local on your machine, so you won’t suffer the network issues that sometimes occur when using web applications. I’ve copied and pasted the entire poem in RegExr desktop for the next exercise. I’m using it on a Mac running OS X Lion.

In the top text box, enter the pattern:

(the|The|THE)

and you’ll see all occurrences of the in the poem highlighted in the lower box (see Figure 4-1). Use the scroll bar to view more of the result.

Using alternation in RegExr desktop version

Figure 4-1. Using alternation in RegExr desktop version

We can make this group shorter by applying an option. Options let you specify the way you would like to search for a pattern. For example, the option:

(?i)

makes your pattern case-insensitive, so instead of using the original pattern with alternation, you can do this instead:

(?i)the

Try this in RegExr to see how it works. You can also specify case-insensitivity by checking ignoreCase in RegExr, but both will work. This and other options or modifiers are listed in Table 4-1.

Table 4-1. Options in regular expressions

OptionDescriptionSupported by

(?d)

Unix lines

Java

(?i)

Case insensitive

PCRE, Perl, Java

(?J)

Allow duplicate names

PCRE[a]

(?m)

Multiline

PCRE, Perl, Java

(?s)

Single line (dotall)

PCRE, Perl, Java

(?u)

Unicode case

Java

(?U)

Default match lazy

PCRE

(?x)

Ignore whitespace, comments

PCRE, Perl, Java

(?-…)

Unset or turn off options

PCRE

[a] See “Named Subpatterns” in http://www.pcre.org/pcre.txt.

Let’s now use alternation with grep. The options in Table 4-1, by the way, don’t work with grep, so you are going to use the original alternation pattern. To count the number of lines where the word the occurs, regardless of case, one or more times, use:

grep -Ec "(the|The|THE)" rime.txt

and get this answer:

327

This result does not tell the whole story. Stay tuned.

Here is an analysis of the grep command:

  • The -E option means that you want to use extended regular expressions (EREs) rather than basic regular expressions (BREs). This, for example, saves you from having to escape the parentheses and the vertical bar, like \(THE\|The\|the\), as you must with BREs.

  • The -c option returns a count of the matched lines (not matched words).

  • The parentheses group the choice or alternation of the, The, or THE.

  • The vertical bar separates possible choices, which are evaluated left to right.

To get a count of actual words used, this approach will return each occurrence of the word, one per line:

grep -Eo "(the|The|THE)" rime.txt | wc -l

This returns:

412

And here is a bit more analysis:

  • The -o option means to show only that part of the line that matches the pattern, though this is not apparent due to the pipe (|) to wc.

  • The vertical bar, in this context, pipes the output of the grep command to the input of the wc command. wc is a word count command, and -l counts the number of lines of the input.

Why the big difference between 327 and 412? Because -c gives you a count of matching lines, but there can be more than one match on each line. If you use -o with wc -l, then each occurrence of the various forms of the word will appear on a separate line and be counted, giving the higher number.

To perform this same match with Perl, write your command this way:

perl -ne 'print if /(the|The|THE)/' rime.txt

Or better yet, you can do it with the (?i) option mentioned earlier, but without alternation:

perl -ne 'print if /(?i)the/' rime.txt

Or even better yet, append the i modifier after the last pattern delimiter:

perl -ne 'print if /the/i' rime.txt

and you will get the same outcome. The simpler the better. For a list of additional modifiers (also called flags), see Table 4-2. Also, compare options (similar but with a different syntax) in Table 4-1.

Table 4-2. Perl modifiers (flags)[1]

ModifierDescription

a

Match \d, \s, \w, and POSIX in ASCII range only

c

Keep current position after match fails

d

Use default, native rules of the platform

g

Global matching

i

Case-insensitive matching

l

Use current locale’s rules

m

Multiline strings

p

Preserve the matched string

s

Treat strings as a single line

u

Use Unicode rules when matching

x

Ignore whitespace and comments

Subpatterns

Most often, when you refer to subpatterns in regular expressions, you are referring to a group or groups within groups. A subpattern is a pattern within a pattern. Often, a condition in a subpattern is matchable when a preceding pattern is matched, but not always. Subpatterns can be designed in a variety of ways, but we’re concerned primarily with those defined within parentheses here.

In one sense, the pattern you saw earlier:

(the|The|THE)

has three subpatterns: the is the first subpattern, The is the second, and THE the third, but matching the second subpattern, in this instance, is not dependent on matching the first. (The leftmost pattern is matched first.)

Now here is one where the subpattern(s) depend on the previous pattern:

(t|T)h(e|eir)

In plain language, this will match the literal characters t or T followed by an h followed by either an e or the letters eir. Accordingly, this pattern will match any of:

  • the

  • The

  • their

  • Their

In this case, the second subpattern (e|eir) is dependent on the first (tT).

Subpatterns don’t require parentheses. Here is an example of subpatterns done with character classes:

\b[tT]h[ceinry]*\b

This pattern can match, in addition to the or The, words such as thee, thy and thence. The two word boundaries (\b) mean the pattern will match whole words, not letters embedded in other words.

Here is a complete analysis of this pattern:

  • \b matches a beginning word boundary.

  • [tT] is a character class that matches either an lowercase t or an uppercase T. We can consider this the first subpattern.

  • Then the pattern matches (or attempts to match) a lowercase h.

  • The second or last subpattern is also expressed as a character class [ceinry] followed by a quantifier * for zero or more.

  • Finally, another word boundary \b ends the pattern.

Note

One interesting aspect of the state of regular expressions is that terminology, while usually close in meaning, can also range far. In defining subpattern and other terms in this book, I’ve examined a variety of sources and have tried to bring them together under one roof. But I suspect that there are some who would argue that a character class is not a subpattern. My take is they can function as subpatterns, so I lump them in.

Capturing Groups and Backreferences

When a pattern groups all or part of its content into a pair of parentheses, it captures that content and stores it temporarily in memory. You can reuse that content if you wish by using a backreference, in the form:

\1

or:

$1

where \1 or $1 reference the first captured group, \2 or $2 reference the second captured group, and so on. sed will only accept the \1 form, but Perl accepts both.

Note

Originally, sed supported backreferences in the range \1 through \9, but that limitation does not appear to exist any longer.

You have already seen this in action, but I’ll demonstrate it here again. We’ll use it to rearrange the wording of a line of the poem, with apologies to Samuel Taylor Coleridge. In the top text box in RegExr, after clicking the Replace tab, enter this pattern:

(It is) (an ancyent Marinere)

Scroll the subject text (third text area) down until you can see the highlighted line, and then in the second box, enter:

$2 $1

and you’ll see in the lowest box the line rearranged as:

an ancyent Marinere It is,

(See Figure 4-2.)

Referencing backreferences with $1 and $2

Figure 4-2. Referencing backreferences with $1 and $2

Here is how to accomplish the same result with sed:

sed -En 's/(It is) (an ancyent Marinere)/\2 \1/p' rime.txt

and the output will be:

an ancyent Marinere It is,

just as in RegExr. Let’s analyze the sed command to help you understand everything that is going on:

  • The -E option once again invokes EREs, so you don’t have to quote the parentheses, for example.

  • The -n option suppresses the default behavior of printing every line.

  • The substitute command searches for a match for the text “It is an ancyent Marinere,” capturing it into two groups.

  • The substitute command also replaces the match by rearranging the captured text in the output, with the backreference \2 first, then \1.

  • The p at the end of the substitute command means you want to print the line.

A similar command in Perl will do the same thing:

perl -ne 'print if s/(It is) (an ancyent Marinere)/\2 \1/' rime.txt

Notice that this uses the \1 style syntax. You can, of course, use the $1 syntax, too:

perl -ne 'print if s/(It is) (an ancyent Marinere)/$2 $1/' rime.txt

I like how Perl lets you print a selected line without jumping through hoops.

I’d like to point out something about the output:

an ancyent Marinere It is,

The capitalization got mixed up in the transformation. Perl can fix that with \u and \l. Here’s how:

perl -ne 'print if s/(It is) (an ancyent Marinere)/\u$2 \l$1/' rime.txt

Now the result looks much better:

An ancyent Marinere it is,

And here is why:

  • The \l syntax does not match anything, but it changes the character that follows to lowercase.

  • The \u syntax capitalizes the character that follows it.

  • The \U directive (not shown) turns the text string that follows into all uppercase.

  • The \L directive (not shown) turns the text string that follows into all lowercase.

These directives remain in effect until another is found (like \l or \E, the end of a quoted string). Experiment with these to see how they work.

Named Groups

Named groups are captured groups with names. You can access those groups by name later, rather than by integer. I’ll show you how here in Perl:

perl -ne 'print if s/(?<one>It is) (?<two>an ancyent Marinere)/\u$+{two} 
       \l$+{one}/' rime.txt

Let’s look at it:

  • Adding ?<one> and ?<two> inside the parentheses names the groups one and two, respectively.

  • $+{one} references the group named one, and $+{two}, the group named two.

You can also reuse named groups within the pattern where the group was named. I’ll show you what I mean. Let’s say you were searching for a string that contained six zeros all together:

000000

It’s a shallow example, but serves to show you how this works. So name a group of three zeros with this pattern (the z is arbitrary):

(?<z>0{3})

You can then use the group again like this:

(?<z>0{3})\k<z>

Or this:

(?<z>0{3})\k'z'

Or this:

(?<z>0{3})\g{z}

Try this in RegExr for quick results. All these examples will work. Table 4-3 shows many of the possibilities with named group syntax.

Table 4-3. Named group syntax

SyntaxDescription

(?<name>…)

A named group

(?name…)

Another named group

(?P<name>…)

A named group in Python

\k<name>

Reference by name in Perl

\k'name'

Reference by name in Perl

\g{name}

Reference by name in Perl

\k{name}

Reference by name in .NET

(?P=name)

Reference by name in Python

Non-Capturing Groups

There are also groups that are non-capturing groups—that is, they don’t store their content in memory. Sometimes this is an advantage, especially if you never intend to reference the group. Because it doesn’t store its content, it is possible it may yield better performance, though performance issues are hardly perceptible when running the simple examples in this book.

Remember the first group discussed in this chapter? Here it is again:

(the|The|THE)

You don’t need to backreference anything, so you could write a non-capturing group this way:

(?:the|The|THE)

Going back to the beginning of this chapter, you could add an option to make the pattern case-insensitive, like this (though the option obviates the need for a group):

(?i)(?:the)

Or you could do it this way:

(?:(?i)the)

Or, better yet, the pièce de résistance:

(?i:the)

The option letter i can be inserted between the question mark and the colon.

Atomic Groups

Another kind of non-capturing group is the atomic group. If you are using a regex engine that does backtracking, this group will turn backtracking off, not for the entire regular expression but just for that part enclosed in the atomic group. The syntax looks like this:

(?>the)

When would you want to use atomic groups? One of the things that can really slow regex processing is backtracking. The reason why is, as it tries all the possibilities, it takes time and computing resources. Sometimes it can gobble up a lot of time. When it gets really bad, it’s called catastrophic backtracking.

You can turn off backtracking altogether by using a non-backtracking engine like re2 (http://code.google.com/p/re2/) or by turning it off for parts of your regular expression with atomic grouping.

Note

My focus in this book is to introduce syntax. I talk very little about performance tuning here. Atomic groups are mainly a performance consideration in my view.

In Chapter 5, you’ll learn about character classes.

What You Learned in Chapter 4

  • That alternation allows a choice between two or more patterns

  • What options modifiers are and how to use them in a pattern

  • Different kinds of subpatterns

  • How to use capturing groups and backreferences

  • How to use named groups and how to reference them

  • How to use non-capturing groups.

  • A little about atomic grouping.

Technical Notes

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required