Regular Expressions Cookbook by Steven Levithan, Jan Goyvaerts

Tools for Working with Regular Expressions

Unless you have been programming with regular expressions for some time, we recommend that you first experiment with regular expressions in a tool rather than in source code. The sample regexes in this chapter and Chapter 2 are plain regular expressions that don’t contain the extra escaping that a programming language (even a Unix shell) requires. You can type these regular expressions directly into an application’s search box.

Chapter 3 explains how to mix regular expressions into your source code. Quoting a literal regular expression as a string makes it even harder to read, because string escaping rules compound regex escaping rules. We leave that until Recipe 3.1. Once you understand the basics of regular expressions, you’ll be able to see the forest through the backslashes.
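As a rough illustration of how string escaping compounds regex escaping, here is a minimal sketch in Python, one of the flavors this book covers. The regex and the sample text are made up for this example; the details of quoting regexes in each language are left to Recipe 3.1.

```python
import re

# The plain regex \\ matches one literal backslash.
# In an ordinary string literal, each of those backslashes must be doubled again.
escaped = "\\\\"      # four characters in the source code, two in the regex
raw = r"\\"           # a raw string keeps the regex as you would type it into a tool

sample = "C:\\temp"   # hypothetical subject text: C:\temp

print(escaped == raw)             # both literals build the same regex
print(re.findall(raw, sample))    # finds the single backslash in the sample
```

The raw string is closest to what you would type into a tool's search box, which is why the plain form is easier to read while experimenting.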

The tools described in this section also provide debugging, syntax checking, and other feedback that you won’t get from most programming environments. Therefore, as you develop regular expressions in your applications, you may find it useful to build a complicated regular expression in one of these tools before you plug it into your program.

RegexBuddy

RegexBuddy (Figure 1-1) is the most full-featured tool available at the time of this writing for creating, testing, and implementing regular expressions. It has the unique ability to emulate all the regular expression flavors discussed in this book, and even convert among the different flavors.

RegexBuddy was designed and developed by Jan Goyvaerts, one of this book’s authors. Designing and developing RegexBuddy made Jan an expert on regular expressions, and using RegexBuddy helped get coauthor Steven hooked on regular expressions to the point where he pitched this book to O’Reilly.

Figure 1-1. RegexBuddy

If the screenshot (Figure 1-1) looks a little busy, that’s because we’ve arranged most of the panels side by side to show off RegexBuddy’s extensive functionality. The default view tucks all the panels neatly into a row of tabs. You can also drag panels off to a secondary monitor.

To try one of the regular expressions shown in this book, simply type it into the edit box at the top of RegexBuddy’s window. RegexBuddy automatically applies syntax highlighting to your regular expression, making errors and mismatched brackets obvious.

The Create panel automatically builds a detailed English-language analysis while you type in the regex. Double-click on any description in the regular expression tree to edit that part of your regular expression. You can insert new parts into your regular expression by hand, or by clicking the Insert Token button and selecting what you want from a menu. For instance, if you don’t remember the complicated syntax for positive lookahead, you can ask RegexBuddy to insert the proper characters for you.

Type or paste in some sample text on the Test panel. When the Highlight button is active, RegexBuddy automatically highlights the text matched by the regex.

Some of the buttons you’re most likely to use are:

List All

Displays a list of all matches.

Replace

The Replace button at the top displays a new window that lets you enter replacement text. The Replace button in the Test box then lets you view the subject text after the replacements are made.

Split (The button on the Test panel, not the one at the top)

Treats the regular expression as a separator, and splits the subject text into tokens at each point where the regex finds a match.

Click any of these buttons and select Update Automatically to make RegexBuddy keep the results dynamically in sync as you edit your regex or subject text.
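The three operations behind these buttons exist in most regex libraries. The following is a minimal sketch of their Python counterparts, not of anything RegexBuddy runs internally; the regex and subject text are hypothetical.

```python
import re

regex = r"\d+"                      # hypothetical regex: one or more digits
subject = "10 apples, 20 pears"     # hypothetical subject text

all_matches = re.findall(regex, subject)   # comparable to List All
replaced = re.sub(regex, "#", subject)     # comparable to Replace
tokens = re.split(regex, subject)          # comparable to Split: the regex is the separator

print(all_matches)
print(replaced)
print(tokens)
```

Note that splitting yields an empty first token here, because the subject text starts with a match.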

To see exactly how your regex works (or doesn’t), click on a highlighted match or at the spot where the regex fails to match on the Test panel, and click the Debug button. RegexBuddy will switch to the Debug panel, showing the entire matching process step by step. Click anywhere on the debugger’s output to see which regex token matched the text you clicked on. Click on your regular expression to highlight that part of the regex in the debugger.

On the Use panel, select your favorite programming language. Then, select a function to instantly generate source code to implement your regex. RegexBuddy’s source code templates are fully editable with the built-in template editor. You can add new functions and even new languages, or change the provided ones.

To test your regex on a larger set of data, switch to the GREP panel to search (and replace) through any number of files and folders.

When you find a regex in source code you’re maintaining, copy it to the clipboard, including the delimiting quotes or slashes. In RegexBuddy, click the Paste button at the top and select the string style of your programming language. Your regex will then appear in RegexBuddy as a plain regex, without the extra quotes and escapes needed for string literals. Use the Copy button at the top to create a string in the desired syntax, so you can paste it back into your source code.

As your experience grows, you can build up a handy library of regular expressions on the Library panel. Make sure to add a detailed description and a test subject when you store a regex. Regular expressions can be cryptic, even for experts.

If you really can’t figure out a regex, click on the Forum panel and then the Login button. If you’ve purchased RegexBuddy, the login screen appears. Click OK and you are instantly connected to the RegexBuddy user forum. Steven and Jan often hang out there.

RegexBuddy runs on Windows 98, ME, 2000, XP, and Vista. For Linux and Apple fans, RegexBuddy also runs well on VMware, Parallels, and CrossOver Office, and with a few issues on WINE. You can download a free evaluation copy of RegexBuddy at http://www.regexbuddy.com/RegexBuddyCookbook.exe. Except for the user forum, the trial is fully functional for seven days of actual use.

RegexPal

RegexPal (Figure 1-2) is an online regular expression tester created by Steven Levithan, one of this book’s authors. All you need to use it is a modern web browser. RegexPal is written entirely in JavaScript. Therefore, it supports only the JavaScript regex flavor, as implemented in the web browser you’re using to access it.

Figure 1-2. RegexPal

To try one of the regular expressions shown in this book, browse to http://www.regexpal.com. Type the regex into the box that says “Enter regex here.” RegexPal automatically applies syntax highlighting to your regular expression, which immediately reveals any syntax errors in the regex. RegexPal is aware of the cross-browser issues that can ruin your day when dealing with JavaScript regular expressions. If certain syntax doesn’t work correctly in some browsers, RegexPal will highlight it as an error.

Now type or paste some sample text into the box that says “Enter test data here.” RegexPal automatically highlights the text matched by your regex.

There are no buttons to click, making RegexPal one of the most convenient online regular expression testers.

More Online Regex Testers

Creating a simple online regular expression tester is easy. If you have some basic web development skills, the information in Chapter 3 is all you need to roll your own. Hundreds of people have already done this; a few have added some extra features that make them worth mentioning.

regex.larsolavtorvik.com

Lars Olav Torvik has put a great little regular expression tester online at http://regex.larsolavtorvik.com (see Figure 1-3).

Figure 1-3. regex.larsolavtorvik.com

To start, select the regular expression flavor you’re working with by clicking on the flavor’s name at the top of the page. Lars offers PHP PCRE, PHP POSIX, and JavaScript. PHP PCRE, the PCRE regex flavor discussed in this book, is used by PHP’s preg functions. POSIX is an old and limited regex flavor used by PHP’s ereg functions, which are not discussed in this book. If you select JavaScript, you’ll be working with your browser’s JavaScript implementation.

Type your regular expression into the Pattern field and your subject text into the Subject field. A moment later, the Matches field displays your subject text with highlighted regex matches. The Code field displays a single line of source code that applies your regex to your subject text. Copying and pasting this into your code editor saves you the tedious job of manually converting your regex into a string literal. Any string or array returned by the code is displayed in the Result field. Because Lars used Ajax technology to build his site, results are updated in just a few moments for all flavors. To use the tool, you have to be online, as PHP is processed on the server rather than in your browser.

The second column displays a list of regex commands and regex options. These depend on the regex flavor. The regex commands typically include match, replace, and split operations. The regex options consist of common options such as case insensitivity, as well as implementation-specific options. These commands and options are described in Chapter 3.

Nregex

http://www.nregex.com (Figure 1-4) is a straightforward online regex tester built on .NET technology by David Seruyange. Although the site doesn’t say which flavor it implements, it’s .NET 1.x at the time of this writing.

Figure 1-4. Nregex

The layout of the page is somewhat confusing. Enter your regular expression into the field under the Regular Expression label, and set the regex options using the checkboxes below that. Enter your subject text in the large box at the bottom, replacing the default text, “If I just had $5.00 then "she" wouldn't be so @#$! mad.” If your subject is a web page, type the URL in the Load Target From URL field, and click the Load button under that input field. If your subject is a file on your hard disk, click the Browse button, find the file you want, and then click the Load button under that input field.

Your subject text will appear duplicated in the “Matches & Replacements” field at the center of the web page, with the regex matches highlighted. If you type something into the Replacement String field, the result of the search-and-replace is shown instead. If your regular expression is invalid, ... appears.

The regex matching is done in .NET code running on the server, so you need to be online for the site to work. If the automatic updates are slow, perhaps because your subject text is very long, tick the Manually Evaluate Regex checkbox above the field for your regular expression to show the Evaluate button. Click that button to update the “Matches & Replacements” display.

Rubular

Michael Lovitt put a minimalistic regex tester online at http://www.rubular.com (Figure 1-5), using the Ruby 1.8 regex flavor.

Figure 1-5. Rubular

Enter your regular expression in the box between the two forward slashes under “Your regular expression.” You can turn on case insensitivity by typing an i in the small box after the second slash. Similarly, if you like, turn on the option “a dot matches a line break” by typing an m in the same box. im turns on both options. Though these conventions may seem a bit user-unfriendly if you’re new to Ruby, they conform to the /regex/im syntax used to specify a regex in Ruby source code.
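The same two options exist under different names in other flavors. As a hedged sketch, this is how they map onto Python's re module; the test string and regexes are made up for the example. The point to notice is that Ruby's m ("a dot matches a line break") corresponds to re.DOTALL in Python, not to re.MULTILINE.

```python
import re

text = "First line\nsecond LINE"   # hypothetical test string

# Rubular's i option: case insensitivity.
insensitive = re.findall(r"line", text, re.IGNORECASE)

# Rubular's m option: a dot matches a line break.
without_option = re.search(r"First.+second", text)
with_option = re.search(r"First.+second", text, re.DOTALL)

# Both options together, like typing im in Rubular.
both = re.search(r"FIRST.+line", text, re.IGNORECASE | re.DOTALL)

print(insensitive)
print(without_option is None, with_option is not None)
print(both.group() == text)
```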

Type or paste your subject text into the “Your test string” box, and wait a moment. A new “Match result” box appears to the right, showing your subject text with all regex matches highlighted.

myregexp.com

Sergey Evdokimov created several regular expression testers for Java developers. The home page at http://www.myregexp.com (Figure 1-6) offers an online regex tester. It’s a Java applet that runs in your browser. The Java 4 (or later) runtime needs to be installed on your computer. The applet evaluates your regular expressions with the java.util.regex package, which is new in Java 4. In this book, the “Java” regex flavor refers to this package.

Figure 1-6. myregexp.com

Type your regular expression into the Regular Expression box. Use the Flags menu to set the regex options you want. Three of the options also have direct checkboxes.

If you want to test a regex that already exists as a string in Java code, copy the whole string to the clipboard. In the myregexp.com tester, click on the Edit menu, and then “Paste Regex from Java String”. In the same menu, pick “Copy Regex for Java Source” when you’re done editing the regular expression. The Edit menu has similar commands for JavaScript and XML as well.

Below the regular expression, there are four tabs that run four different tests:

Find

Highlights all regular expression matches in the sample text. These are the matches found by the Matcher.find() method in Java.

Match

Tests whether the regular expression matches the sample text entirely. If it does, the whole text is highlighted. This is what the String.matches() and Matcher.matches() methods do.

Split

The second box at the right shows the array of strings returned by String.split() or Pattern.split() when used with your regular expression and sample text.

Replace

Type in a replacement text, and the box at the right shows the text returned by String.replaceAll() or Matcher.replaceAll().
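The difference between the Find and Match tabs is easy to overlook. The sketch below shows the same distinction using Python's counterparts (finditer and fullmatch) rather than the Java methods named above; the regex and sample text are hypothetical.

```python
import re

pattern = re.compile(r"\d{4}")       # hypothetical regex: four digits
sample = "Born 1984, moved 2001"     # hypothetical sample text

# Find: every match anywhere in the text, with its starting position.
found = [(m.start(), m.group()) for m in pattern.finditer(sample)]

# Match: does the regex match the sample text entirely?
whole_sample = pattern.fullmatch(sample) is not None
whole_year = pattern.fullmatch("1984") is not None

print(found)
print(whole_sample, whole_year)
```

The regex finds two matches inside the sample, yet it does not match the sample as a whole; it only matches entirely when the text is nothing but four digits.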

You can find Sergey’s other regex testers via the links at the top of the page at http://www.myregexp.com. One is a plug-in for Eclipse, and the other is a plug-in for IntelliJ IDEA.

reAnimator

Oliver Steele’s reAnimator at http://osteele.com/tools/reanimator (Figure 1-7) won’t bring a dead regex back to life. Rather, it’s a fun little tool that shows a graphic representation of the finite state machines that a regular expression engine uses to perform a regular expression search.

Figure 1-7. reAnimator

reAnimator’s regex syntax is very limited. It is compatible with all the flavors discussed in this book. Any regex you can animate with reAnimator will work with any of this book’s flavors, but the reverse is definitely not true. This is because reAnimator’s regular expressions are regular in the mathematical sense. The sidebar History of the Term ‘Regular Expression’ explains this briefly.

Start by going up to the Pattern box at the top of the page and pressing the Edit button. Type your regular expression into the Pattern field and click Set. Slowly type the subject text into the Input field.

As you type in each character, colored balls will move through the state machine to indicate the end point reached in the state machine by your input so far. Blue balls indicate that the state machine accepts the input, but needs more input for a full match. Green balls indicate that the input matches the whole pattern. No balls means the state machine can’t match the input.

reAnimator will show a match only if the regular expression matches the whole input string, as if you had put it between ^ and $ anchors. This is another property of expressions that are regular in the mathematical sense.
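To make the idea concrete, here is a small hand-built state machine in Python for the hypothetical regex ab*c. It is only a sketch of the concept, not of how reAnimator is implemented; the state numbers and the function names run and accepts are invented for this example. The three outcomes correspond to the blue balls, green balls, and no balls described above.

```python
import re

# Hand-built state machine for the hypothetical regex ab*c.
# States: 0 = start, 1 = seen "a" and any number of "b", 2 = seen the final "c".
TRANSITIONS = {(0, "a"): 1, (1, "b"): 1, (1, "c"): 2}
ACCEPTING = {2}

def run(text):
    """Feed text one character at a time and report the outcome after each step."""
    state = 0
    trace = []
    for ch in text:
        state = TRANSITIONS.get((state, ch))   # None means no state can be reached
        if state is None:
            trace.append("dead")               # no balls
            break
        # "match" = green balls, "needs more input" = blue balls
        trace.append("match" if state in ACCEPTING else "needs more input")
    return trace

def accepts(text):
    """True only if the whole input was consumed and ended in an accepting state."""
    trace = run(text)
    return len(trace) == len(text) and bool(trace) and trace[-1] == "match"

print(run("abbc"))
print(accepts("abbc"), accepts("xabc"))
```

Because the machine must consume the entire input, accepts("xabc") is false even though ab*c can be found inside that string. This is the same behavior as re.fullmatch in Python.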

More Desktop Regular Expression Testers

Expresso

Expresso (not to be confused with caffeine-laden espresso) is a .NET application for creating and testing regular expressions. You can download it at http://www.ultrapico.com/Expresso.htm. The .NET framework 2.0 or later must be installed on your computer.

The download is a free 60-day trial. After the trial, you have to register or Expresso will (mostly) stop working. Registration is free, but requires you to give the Ultrapico folks your email address. The registration key is sent by email.

Expresso displays a screen like the one shown in Figure 1-8. The Regular Expression box where you type in your regular expression is permanently visible. No syntax highlighting is available. The Regex Analyzer box automatically builds a brief English-language analysis of your regular expression. It too is permanently visible.

Figure 1-8. Expresso

In Design Mode, you can set matching options such as “Ignore Case” at the bottom of the screen. Most of the screen space is taken up by a row of tabs where you can select the regular expression token you want to insert. If you have two monitors or one large monitor, click the Undock button to float the row of tabs. Then you can build up your regular expression in the other mode (Test Mode) as well.

In Test Mode, type or paste your sample text in the lower-left corner. Then, click the Run Match button to get a list of all matches in the Search Results box. No highlighting is applied to the sample text. Click on a match in the results to select that match in the sample text.

The Expression Library shows a list of sample regular expressions and a list of recent regular expressions. Your regex is added to that list each time you press Run Match. You can edit the library through the Library menu in the main menu bar.

The Regulator

The Regulator, which you can download from http://sourceforge.net/projects/regulator, is not safe for SCUBA diving or cooking-gas canisters; it is another .NET application for creating and testing regular expressions. The latest version requires .NET 2.0 or later. Older versions for .NET 1.x can still be downloaded. The Regulator is open source, and no payment or registration is required.

The Regulator does everything in one screen (Figure 1-9). The New Document tab is where you enter your regular expression. Syntax highlighting is automatically applied, but syntax errors in your regex are not made obvious. Right-click to select the regex token you want to insert from a menu. You can set regular expression options via the buttons on the main toolbar. The icons are a bit cryptic. Wait for the tooltip to see which option you’re setting with each button.

Figure 1-9. The Regulator

Below the area for your regex and to the right, click on the Input button to display the area for pasting in your sample text. Click the “Replace with” button to type in the replacement text, if you want to do a search-and-replace. Below the regex and to the left, you can see the results of your regex operation. Results are not updated automatically; you must click the Match, Replace, or Split button in the toolbar to update the results. No highlighting is applied to the input. Click on a match in the results to select it in the subject text.

The Regex Analyzer panel shows a simple English-language analysis of your regular expression, but it is not automatic or interactive. To update the analysis, select Regex Analyzer in the View menu, even if it is already visible. Clicking on the analysis only moves the text cursor.

grep

The name grep is derived from the g/re/p command that performed a regular expression search in the Unix text editor ed, one of the first applications to support regular expressions. This command was so popular that all Unix systems now have a dedicated grep utility for searching through files using a regular expression. If you’re using Unix, Linux, or OS X, type man grep into a terminal window to learn all about it.
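The core of what grep does fits in a few lines. The following is a minimal sketch in Python, assuming Python's own regex flavor rather than the flavors the real grep utility supports, and ignoring all of grep's options. The function name grep and the sample lines are made up for this example.

```python
import re

def grep(regex, lines):
    """Return (line_number, line) for each line in which the regex finds a match."""
    pattern = re.compile(regex)
    return [(n, line) for n, line in enumerate(lines, start=1) if pattern.search(line)]

# Hypothetical sample lines standing in for the contents of a file.
sample_lines = ["alpha 1", "beta", "gamma 22"]

for number, line in grep(r"\d+", sample_lines):
    print(f"{number}:{line}")
```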

The following three tools are Windows applications that do what grep does, and more.

PowerGREP

PowerGREP, developed by Jan Goyvaerts, one of this book’s authors, is probably the most feature-rich grep tool available for the Microsoft Windows platform (Figure 1-10). PowerGREP uses a custom regex flavor that combines the best of the flavors discussed in this book. This flavor is labeled “JGsoft” in RegexBuddy.

Figure 1-10. PowerGREP

To run a quick regular expression search, simply select Clear in the Action menu and type your regular expression into the Search box on the Action panel. Click on a folder in the File Selector panel, and select “Include File or Folder” or “Include Folder and Subfolders” in the File Selector menu. Then, select Execute in the Action menu to run your search.

To run a search-and-replace, select “search-and-replace” in the “action type” drop-down list at the top-left corner of the Action panel after clearing the action. A Replace box will appear below the Search box. Enter your replacement text there. All the other steps are the same as for searching.

PowerGREP has the unique ability to use up to three lists of regular expressions at the same time, with any number of regular expressions in each list. While the previous two paragraphs provide all you need to run simple searches like you can in any grep tool, unleashing PowerGREP’s full potential will take a bit of reading through the tool’s comprehensive documentation.

PowerGREP runs on Windows 98, ME, 2000, XP, and Vista. You can download a free evaluation copy at http://www.powergrep.com/PowerGREPCookbook.exe. Except for saving results and libraries, the trial is fully functional for 15 days of actual use. Though the trial won’t save the results shown on the Results panel, it will modify all your files for search-and-replace actions, just like the full version does.

Windows Grep

Figure 1-11. Windows Grep

Windows Grep (http://www.wingrep.com) is one of the oldest grep tools for Windows. Its age shows a bit in its user interface (Figure 1-11), but it does what it says on the tin just fine. It supports a limited regular expression flavor called POSIX ERE. For the features that it supports, it uses the same syntax as the flavors in this book. Windows Grep is shareware, which means you can download it for free, but payment is expected if you want to keep it.

To prepare a search, select Search in the Search menu. The screen that appears differs depending on whether you’ve selected Beginner Mode or Expert Mode in the Options menu. Beginners get a step-by-step wizard, whereas experts get a tabbed dialog.

When you’ve set up the search, Windows Grep immediately executes it, presenting you with a list of files in which matches were found. Click once on a file to see its matches in the bottom panel, and double-click to open the file. Select “All Matches” in the View menu to make the bottom panel show everything.

To run a search-and-replace, select Replace in the Search menu.

RegexRenamer

RegexRenamer (Figure 1-12) is not really a grep tool. Instead of searching through the contents of files, it searches and replaces through the names of files. You can download it at http://regexrenamer.sourceforge.net. RegexRenamer requires version 2.0 or later of the Microsoft .NET framework.

Figure 1-12. RegexRenamer

Type your regular expression into the Match box and the replacement text into the Replace box. Click /i to turn on case insensitivity, and /g to replace all matches in each filename rather than just the first. /x turns on free-spacing syntax, which isn’t very useful, since you have only one line to type in your regular expression.

Use the tree at the left to select the folder that holds the files you want to rename. You can set a file mask or a regex filter in the top-right corner. This restricts the list of files to which your search-and-replace regex will be applied. Using one regex to filter and another to replace is much handier than trying to do both tasks with just one regex.
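The combination of a filter regex and a search-and-replace regex can be sketched as follows. This is a hypothetical dry run in Python that only computes the new names and never touches the disk; the function plan_renames, its parameters, and the filenames are invented for this example, and it uses Python's regex and replacement syntax rather than RegexRenamer's.

```python
import re

def plan_renames(names, match, replace, ignore_case=False, replace_all=False, name_filter=None):
    """Return (old, new) pairs for the names that would change; nothing is renamed on disk."""
    flags = re.IGNORECASE if ignore_case else 0         # comparable to /i
    pattern = re.compile(match, flags)
    keep = re.compile(name_filter) if name_filter else None
    plan = []
    for old in names:
        if keep and not keep.search(old):
            continue                                    # the filter regex restricts the file list
        # count=0 replaces every match (comparable to /g); count=1 only the first
        new = pattern.sub(replace, old, count=0 if replace_all else 1)
        if new != old:
            plan.append((old, new))
    return plan

files = ["IMG_001.JPG", "IMG_002.jpg", "notes_img.txt"]   # hypothetical filenames
print(plan_renames(files, r"img_", "photo-", ignore_case=True, name_filter=r"(?i)\.jpg$"))
```

Keeping the filter separate means the replacement regex can stay simple: it does not also have to make sure that it only applies to JPEG files.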

Popular Text Editors

Most modern text editors have at least basic support for regular expressions. In the search or search-and-replace panel, you’ll typically find a checkbox to turn on regular expression mode. Some editors, such as EditPad Pro, also use regular expressions for various features that process text, such as syntax highlighting or class and function lists. The documentation with each editor explains all these features. Some popular text editors with regular expression support include:

  • Boxer Text Editor (PCRE)

  • Dreamweaver (JavaScript)

  • EditPad Pro (custom flavor that combines the best of the flavors discussed in this book; labeled “JGsoft” in RegexBuddy)

  • Multi-Edit (PCRE, if you select the “Perl” option)

  • NoteTab (PCRE)

  • UltraEdit (PCRE)

  • TextMate (Ruby 1.9 [Oniguruma])
