In the context of this book, a regular expression is a specific kind of text pattern that you can use with many modern applications and programming languages. You can use them to verify whether input fits into the text pattern, to find text that matches the pattern within a larger body of text, to replace text matching the pattern with other text or rearranged bits of the matched text, to split a block of text into a list of subtexts, and to shoot yourself in the foot. This book helps you understand exactly what you’re doing and avoid disaster.
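As a rough illustration of the four everyday uses just listed, here is a minimal sketch using Python's re module. The date pattern and the sample text are our own assumptions, chosen only for illustration; they do not come from a recipe in this book.

```python
import re

# Hypothetical pattern for illustration: a date written as YYYY-MM-DD.
date = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

text = "Released 2012-08-27; revised 2013-01-15."

# 1. Verify whether an entire input fits the pattern.
is_date = date.fullmatch("2012-08-27") is not None

# 2. Find text that matches the pattern within a larger body of text.
found = date.findall(text)  # list of (year, month, day) tuples

# 3. Replace matches with rearranged bits of the matched text.
rearranged = date.sub(r"\3/\2/\1", text)

# 4. Split a block of text into a list of subtexts.
parts = re.split(r"\s*;\s*", text)
```

The same pattern object serves for verifying, finding, and replacing; only the method called on it changes.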
If you use regular expressions with skill, they simplify many programming and text processing tasks, and make possible many that wouldn’t be at all feasible without them. You would need dozens if not hundreds of lines of procedural code to extract all email addresses from a document, and that code would be tedious to write and hard to maintain. But with the proper regular expression, as shown in Recipe 4.1, it takes just a few lines of code, or maybe even one line.
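To give a feel for the "few lines of code" claim, here is a sketch in Python. The pattern is a deliberately simplified assumption of ours, not the regex from Recipe 4.1, and the function name extract_emails is hypothetical.

```python
import re

# Simplified, hypothetical pattern: it is NOT the regex from Recipe 4.1 and
# will miss or over-match some valid addresses.
EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def extract_emails(document):
    """Return every substring of document that looks like an email address."""
    return EMAIL.findall(document)


sample = "Contact alice@example.com or bob.smith@mail.example.org for details."
emails = extract_emails(sample)
```

The extraction itself is the single findall call; everything else is setup.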
But if you try to do too much with just one regular expression, or use regexes where they’re not really appropriate, you’ll find out why some people say:
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
The second problem those people have is that they didn’t read the owner’s manual, which you are holding now. Read on. Regular expressions are a powerful tool. If your job involves manipulating or extracting text on a computer, a firm grasp of regular expressions will save you plenty of overtime.
All right, the title of the previous section was a lie. We didn’t define what regular expressions are. We can’t. There is no official standard that defines exactly which text patterns are regular expressions and which aren’t. As you can imagine, every designer of programming languages and every developer of text processing applications has a different idea of exactly what a regular expression should be. So now we’re stuck with a whole palette of regular expression flavors.
Fortunately, most designers and developers are lazy. Why create something totally new when you can copy what has already been done? As a result, all modern regular expression flavors, including those discussed in this book, can trace their history back to the Perl programming language. We call these flavors Perl-style regular expressions. Their regular expression syntax is very similar, and mostly compatible, but not completely so.
Writers are lazy, too. We’ll usually type regex or regexp to denote a single regular expression, and regexes to denote the plural.
Regex flavors do not correspond one-to-one with programming languages. Scripting languages tend to have their own, built-in regular expression flavor. Other programming languages rely on libraries for regex support. Some libraries are available for multiple languages, while certain languages can draw on a choice of different libraries.
This introductory chapter deals with regular expression flavors only and completely ignores any programming considerations. Chapter 3 begins the code listings, so you can peek ahead to “Programming Languages and Regex Flavors” in Chapter 3 to find out which flavors you’ll be working with. But ignore all the programming stuff for now. The tools listed in the next section are an easier way to explore the regex syntax through “learning by doing.”
For this book, we selected the most popular regex flavors in use today. These are all Perl-style regex flavors. Some flavors have more features than others. But if two flavors have the same feature, they tend to use the same syntax. We’ll point out the few annoying inconsistencies as we encounter them.
All these regex flavors are part of programming languages and libraries that are in active development. The list of flavors tells you which versions this book covers. Further along in the book, we mention the flavor without any versions if the presented regex works the same way with all flavors. This is almost always the case. Aside from bug fixes that affect corner cases, regex flavors tend not to change, except to add features by giving new meaning to syntax that was previously treated as an error:
The Microsoft .NET Framework provides a full-featured Perl-style regex flavor through the System.Text.RegularExpressions package. This book covers .NET versions 1.0 through 4.0. Strictly speaking, there are only two versions of the .NET regex flavor: 1.0 and 2.0. No changes were made to the Regex classes at all in .NET 1.1, 3.0, and 3.5. The Regex class got a few new methods in .NET 4.0, but the regex syntax is unchanged.
Any .NET programming language, including C#, VB.NET, Delphi for .NET, and even COBOL.NET, has full access to the .NET regex flavor. If an application developed with .NET offers you regex support, you can be quite certain it uses the .NET flavor, even if it claims to use “Perl regular expressions.” For a long time, a glaring exception was Visual Studio (VS) itself. Up until Visual Studio 2010, the VS integrated development environment (IDE) had continued to use the same old regex flavor it has had from the beginning, which was not Perl-style at all. Visual Studio 11, which is in beta when we write this, finally uses the .NET regex flavor in the IDE too.
Java 4 is the first Java release to provide built-in regular expression support through the java.util.regex package. It has quickly eclipsed the various third-party regex libraries for Java. Besides being standard and built in, it offers a full-featured Perl-style regex flavor and excellent performance, even when compared with applications written in C. This book covers the java.util.regex package in Java 4, 5, 6, and 7.
PCRE is the “Perl-Compatible Regular Expressions” C library developed by Philip Hazel. You can download this open source library at http://www.pcre.org. This book covers versions 4 through 8 of PCRE.
Though PCRE claims to be Perl-compatible, and is so more than any other flavor in this book, it really is just Perl-style. Some features, such as Unicode support, are slightly different, and you can’t mix Perl code into your regex, as Perl itself allows.
Because of its open source license and solid programming, PCRE has found its way into many programming languages and applications. It is built into PHP and wrapped into numerous Delphi components. If an application claims to support “Perl-compatible” regular expressions without specifically listing the actual regex flavor being used, it’s likely PCRE.
Perl’s built-in support for regular expressions is the main reason why regexes are popular today. This book covers Perl 5.6, 5.8, 5.10, 5.12, and 5.14. Each of these versions adds new features to Perl’s regular expression syntax. When this book indicates that a certain regex works with a certain version of Perl, then it works with that version and all later versions covered by this book.
Many applications and regex libraries that claim to use Perl or Perl-compatible regular expressions in reality merely use Perl-style regular expressions. They use a regex syntax similar to Perl’s, but don’t support the same set of regex features. Quite likely, they’re using one of the regex flavors further down this list. Those flavors are all Perl-style.
Python supports regular expressions through its re module. This book covers Python 2.4 through 3.2. The differences between the re modules in Python 2.4, 2.5, 2.6, and 2.7 are negligible. Python 3.0 improved Python’s handling of Unicode in regular expressions. Python 3.1 and 3.2 brought no regex-related changes.
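One visible aspect of Unicode handling in Python 3 is that patterns written as ordinary (str) strings are Unicode-aware by default, with the re.ASCII flag available to opt out. The sketch below assumes Python 3 and uses a sample string of our own; it shows only this one aspect, not every Unicode-related difference between Python 2 and 3.

```python
import re

# Sample text with two non-ASCII letters (e-acute and i-diaeresis).
text = "caf\u00e9 na\u00efve"

# In Python 3, \w in a str pattern matches Unicode word characters by default.
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w to the ASCII range [a-zA-Z0-9_], so the accented
# letters act as word separators.
ascii_words = re.findall(r"\w+", text, re.ASCII)
```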
Ruby’s regular expression support is part of the Ruby language itself, similar to Perl. This book covers Ruby 1.8 and 1.9. A default compilation of Ruby 1.8 uses the regular expression flavor provided directly by the Ruby source code. A default compilation of Ruby 1.9 uses the Oniguruma regular expression library. Ruby 1.8 can be compiled to use Oniguruma, and Ruby 1.9 can be compiled to use the older Ruby regex flavor. In this book, we denote the native Ruby flavor as Ruby 1.8, and the Oniguruma flavor as Ruby 1.9.
To test which Ruby regex flavor your site uses, try to use the regular expression ‹a++›. Ruby 1.8 will say the regular expression is invalid, because it does not support possessive quantifiers, whereas Ruby 1.9 will match a string of one or more a characters.
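The same probing idea carries over to other flavors. As an analogy only (this is Python, not Ruby), the sketch below tries to compile a possessive quantifier and reports whether the flavor accepts it. The function name is hypothetical. In CPython, the re module accepts this syntax starting with version 3.11 and rejects it as an error in earlier versions, so the result depends on the interpreter you run it with.

```python
import re


def supports_possessive_quantifiers():
    """Probe the regex flavor by compiling a pattern that uses a++."""
    try:
        re.compile(r"a++")
    except re.error:
        # This flavor treats a++ as a syntax error.
        return False
    return True


supported = supports_possessive_quantifiers()

# Where the syntax is accepted, a++ matches a run of one or more "a" characters.
matched = supported and re.fullmatch(r"a++", "aaa") is not None
```

This is also a small instance of how flavors tend to evolve: syntax that used to be an error is given a meaning in a later version.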
The Oniguruma library is designed to be backward-compatible with Ruby 1.8, simply adding new features that will not break existing regexes. The implementors even left in features that arguably should have been changed, such as using ‹(?m)› to mean “the dot matches line breaks,” where other regex flavors use ‹