Regular Expressions Cookbook, 2nd Edition

by Jan Goyvaerts, Steven Levithan

August 2012

Intermediate to advanced

609 pages

19h 16m

English

O'Reilly Media, Inc.

Regular Expressions Cookbook
Preface
Caught in the Snarls of Different Versions
Intended Audience
Technology Covered
Organization of This Book
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us

Acknowledgments
1. Introduction to Regular Expressions
Regular Expressions Defined
Many Flavors of Regular Expressions
Regex Flavors Covered by This Book
Search and Replace with Regular Expressions
Many Flavors of Replacement Text
Tools for Working with Regular Expressions
RegexBuddy
RegexPal
RegexMagic
More Online Regex Testers
RegexPlanet
regex.larsolavtorvik.com
Nregex
Rubular
myregexp.com
More Desktop Regular Expression Testers
Expresso
The Regulator
SDL Regex Fuzzer
grep
PowerGREP
Windows Grep
RegexRenamer
Popular Text Editors
2. Basic Regular Expression Skills
2.1. Match Literal Text
2.2. Match Nonprintable Characters
2.3. Match One of Many Characters
2.4. Match Any Character
2.5. Match Something at the Start and/or the End of a Line
2.6. Match Whole Words
2.7. Unicode Code Points, Categories, Blocks, and Scripts
2.8. Match One of Several Alternatives
2.9. Group and Capture Parts of the Match
2.10. Match Previously Matched Text Again
2.11. Capture and Name Parts of the Match
2.12. Repeat Part of the Regex a Certain Number of Times
2.13. Choose Minimal or Maximal Repetition
2.14. Eliminate Needless Backtracking
2.15. Prevent Runaway Repetition
2.16. Test for a Match Without Adding It to the Overall Match
2.17. Match One of Two Alternatives Based on a Condition
2.18. Add Comments to a Regular Expression
2.19. Insert Literal Text into the Replacement Text
2.20. Insert the Regex Match into the Replacement Text
2.21. Insert Part of the Regex Match into the Replacement Text
2.22. Insert Match Context into the Replacement Text
3. Programming with Regular Expressions
Programming Languages and Regex Flavors
Languages Covered in This Chapter
More Programming Languages
3.1. Literal Regular Expressions in Source Code
3.2. Import the Regular Expression Library
3.3. Create Regular Expression Objects
3.4. Set Regular Expression Options
3.5. Test If a Match Can Be Found Within a Subject String
3.6. Test Whether a Regex Matches the Subject String Entirely
3.7. Retrieve the Matched Text
3.8. Determine the Position and Length of the Match
3.9. Retrieve Part of the Matched Text
3.10. Retrieve a List of All Matches
3.11. Iterate over All Matches
3.12. Validate Matches in Procedural Code
3.13. Find a Match Within Another Match
3.14. Replace All Matches
3.15. Replace Matches Reusing Parts of the Match
3.16. Replace Matches with Replacements Generated in Code
3.17. Replace All Matches Within the Matches of Another Regex
3.18. Replace All Matches Between the Matches of Another Regex
3.19. Split a String
3.20. Split a String, Keeping the Regex Matches
3.21. Search Line by Line
Construct a Parser
4. Validation and Formatting
4.1. Validate Email Addresses
4.2. Validate and Format North American Phone Numbers
4.3. Validate International Phone Numbers
4.4. Validate Traditional Date Formats
4.5. Validate Traditional Date Formats, Excluding Invalid Dates
4.6. Validate Traditional Time Formats
4.7. Validate ISO 8601 Dates and Times
4.8. Limit Input to Alphanumeric Characters
4.9. Limit the Length of Text
4.10. Limit the Number of Lines in Text
4.11. Validate Affirmative Responses
4.12. Validate Social Security Numbers
4.13. Validate ISBNs
4.14. Validate ZIP Codes
4.15. Validate Canadian Postal Codes
4.16. Validate U.K. Postcodes
4.17. Find Addresses with Post Office Boxes
4.18. Reformat Names From “FirstName LastName” to “LastName, FirstName”
4.19. Validate Password Complexity
4.20. Validate Credit Card Numbers
4.21. European VAT Numbers
5. Words, Lines, and Special Characters
5.1. Find a Specific Word
5.2. Find Any of Multiple Words
5.3. Find Similar Words
5.4. Find All Except a Specific Word
5.5. Find Any Word Not Followed by a Specific Word
5.6. Find Any Word Not Preceded by a Specific Word
5.7. Find Words Near Each Other
5.8. Find Repeated Words
5.9. Remove Duplicate Lines
5.10. Match Complete Lines That Contain a Word
5.11. Match Complete Lines That Do Not Contain a Word
5.12. Trim Leading and Trailing Whitespace
5.13. Replace Repeated Whitespace with a Single Space
5.14. Escape Regular Expression Metacharacters
6. Numbers
6.1. Integer Numbers
6.2. Hexadecimal Numbers
6.3. Binary Numbers
6.4. Octal Numbers
6.5. Decimal Numbers
6.6. Strip Leading Zeros
6.7. Numbers Within a Certain Range
6.8. Hexadecimal Numbers Within a Certain Range
6.9. Integer Numbers with Separators
6.10. Floating-Point Numbers
6.11. Numbers with Thousand Separators
6.12. Add Thousand Separators to Numbers
6.13. Roman Numerals
7. Source Code and Log Files
Keywords
Identifiers
Numeric Constants
Operators
Single-Line Comments
Multiline Comments
All Comments
Strings
Strings with Escapes
Regex Literals
Here Documents
Common Log Format
Combined Log Format
Broken Links Reported in Web Logs
8. URLs, Paths, and Internet Addresses
8.1. Validating URLs
8.2. Finding URLs Within Full Text
8.3. Finding Quoted URLs in Full Text
8.4. Finding URLs with Parentheses in Full Text
8.5. Turn URLs into Links
8.6. Validating URNs
8.7. Validating Generic URLs
8.8. Extracting the Scheme from a URL
8.9. Extracting the User from a URL
8.10. Extracting the Host from a URL
8.11. Extracting the Port from a URL
8.12. Extracting the Path from a URL
8.13. Extracting the Query from a URL
8.14. Extracting the Fragment from a URL
8.15. Validating Domain Names
8.16. Matching IPv4 Addresses
8.17. Matching IPv6 Addresses
8.18. Validate Windows Paths
8.19. Split Windows Paths into Their Parts
8.20. Extract the Drive Letter from a Windows Path
8.21. Extract the Server and Share from a UNC Path
8.22. Extract the Folder from a Windows Path
8.23. Extract the Filename from a Windows Path
8.24. Extract the File Extension from a Windows Path
8.25. Strip Invalid Characters from Filenames
9. Markup and Data Formats
Processing Markup and Data Formats with Regular Expressions
Basic Rules for Formats Covered in This Chapter
9.1. Find XML-Style Tags
9.2. Replace <b> Tags with <strong>
9.3. Remove All XML-Style Tags Except <em> and <strong>
9.4. Match XML Names
9.5. Convert Plain Text to HTML by Adding <p> and <br> Tags
9.6. Decode XML Entities
9.7. Find a Specific Attribute in XML-Style Tags
9.8. Add a cellspacing Attribute to <table> Tags That Do Not Already Include It
9.9. Remove XML-Style Comments
9.10. Find Words Within XML-Style Comments
9.11. Change the Delimiter Used in CSV Files
9.12. Extract CSV Fields from a Specific Column
9.13. Match INI Section Headers
9.14. Match INI Section Blocks
9.15. Match INI Name-Value Pairs
Index
About the Authors
Colophon
Copyright

Content preview from Regular Expressions Cookbook, 2nd Edition

2.15. Prevent Runaway Repetition

Problem

Use a single regular expression to match a complete HTML file, checking for properly nested html, head, title, and body tags. The regular expression must fail efficiently on HTML files that do not have the proper tags.

Solution

<html>(?>.*?<head>)(?>.*?<title>)(?>.*?</title>)↵
(?>.*?</head>)(?>.*?<body[^>]*>)(?>.*?</body>).*?</html>

Regex options: Case insensitive, dot matches line breaks

Regex flavors: .NET, Java, PCRE, Perl, Ruby

JavaScript does not support atomic grouping, and Python's re module did not support it when this edition was published (atomic groups were added in Python 3.11). Without atomic grouping, there is no way to eliminate needless backtracking in these two regex flavors. When programming in JavaScript or in Python before 3.11, you can solve this problem by doing a literal text search for each of the tags one by one, searching for the next tag through the remainder of the subject text after the one last found.
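The tag-by-tag literal search described above could be sketched in Python roughly as follows. This is an illustration rather than code from the book: the function name `has_tags_in_order`, the `TAGS` list, and the sample strings are all assumptions, and lowercasing the subject is used as a simple stand-in for the case-insensitive option.

```python
# Tags to find, in order. "<body" followed by ">" stands in for <body[^>]*>,
# so a body tag with attributes is still accepted.
TAGS = ["<html>", "<head>", "<title>", "</title>",
        "</head>", "<body", ">", "</body>", "</html>"]


def has_tags_in_order(html: str) -> bool:
    """Return True if every tag occurs, each one after the previous match."""
    text = html.lower()  # rough equivalent of case-insensitive matching
    pos = 0
    for tag in TAGS:
        found = text.find(tag, pos)
        if found == -1:
            return False  # give up at once; nothing is retried
        pos = found + len(tag)  # continue in the remainder of the subject
    return True


# Hypothetical sample inputs
good = "<HTML><head><title>t</title></head><body class='x'>hi</body></html>"
bad = "<html><body></body></html>"  # no head or title section

print(has_tags_in_order(good))  # True
print(has_tags_in_order(bad))   # False
```

Each tag is searched for exactly once, so a file with a missing tag is rejected after a single pass over the text.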

Discussion

The proper solution to this problem is more easily understood if we start from this naïve solution:

<html>.*?<head>.*?<title>.*?</title>↵
.*?</head>.*?<body[^>]*>.*?</body>.*?</html>

Regex options: Case insensitive, dot matches line breaks

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

When you test this regex on a proper HTML file, it works perfectly well. ‹.*?› skips over anything, because we turn on “dot matches line breaks.” The lazy asterisk makes sure the regex goes ahead only one character at a time, each time checking whether the next tag can be matched. Recipes 2.4 and 2.13 explain all this.
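As a minimal sketch of the behavior just described, the naïve regex can be tried with Python's re module. The name `NAIVE` and the sample document are assumptions made for this illustration; the two flags correspond to the "case insensitive" and "dot matches line breaks" options listed above.

```python
import re

# The naive pattern from the text, written as one regex. The line-wrap
# marker in the printed version is not part of the pattern.
NAIVE = re.compile(
    r"<html>.*?<head>.*?<title>.*?</title>"
    r".*?</head>.*?<body[^>]*>.*?</body>.*?</html>",
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical, properly structured sample document spanning several lines
sample = """<HTML>
<head><title>Test</title></head>
<body class="main">
<p>Hello</p>
</body>
</html>"""

match = NAIVE.search(sample)
print(match is not None)         # True
print(match.group(0) == sample)  # True: the match spans the whole document
```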

But this regex gets you into trouble when ...

Publisher Resources

ISBN: 9781449327453
