book

Regular Expressions Cookbook

by Jan Goyvaerts, Steven Levithan

May 2009

Intermediate to advanced

510 pages

15h

English

O'Reilly Media, Inc.


Regular Expressions Cookbook
SPECIAL OFFER: Upgrade this ebook with O’Reilly
Preface
Caught in the Snarls of Different Versions
Intended Audience
Technology Covered
Organization of This Book
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us

Acknowledgments
1. Introduction to Regular Expressions
Regular Expressions Defined; Many Flavors of Regular Expressions; Regex Flavors Covered by This Book
Searching and Replacing with Regular Expressions
Many Flavors of Replacement Text
Tools for Working with Regular Expressions
RegexBuddy; RegexPal; More Online Regex Testers; regex.larsolavtorvik.com; Nregex; Rubular; myregexp.com; reAnimator; More Desktop Regular Expression Testers; Expresso; The Regulator; grep; PowerGREP; Windows Grep; RegexRenamer; Popular Text Editors
2. Basic Regular Expression Skills
2.1. Match Literal Text
Problem; Solution; Discussion; Variations; Block escape; Case-insensitive matching; See Also
2.2. Match Nonprintable Characters
Problem; Solution; Discussion; Variations on Representations of Nonprinting Characters; The 26 control characters; The 7-bit character set; See Also
2.3. Match One of Many Characters
Problem; Solution; Calendar with misspellings; Hexadecimal character; Nonhexadecimal character; Discussion; Variations; Shorthands; Case insensitivity; Flavor-Specific Features; .NET character class subtraction; Java character class union, subtraction, and intersection; See Also
2.4. Match Any Character
Problem; Solution; Any character except line breaks; Any character including line breaks; Discussion; Any character except line breaks; Any character including line breaks; Dot abuse; Variations; See Also
2.5. Match Something at the Start and/or the End of a Line
Problem; Solution; Start of the subject; End of the subject; Start of a line; End of a line; Discussion; Anchors and lines; Start of the subject; End of the subject; Start of a line; End of a line; Zero-length matches; Variations; See Also
2.6. Match Whole Words
Problem; Solution; Word boundaries; Nonboundaries; Discussion; Word boundaries; Nonboundaries; Word Characters; See Also
2.7. Unicode Code Points, Properties, Blocks, and Scripts
Problem; Solution; Unicode code point; Unicode property or category; Unicode block; Unicode script; Unicode grapheme; Discussion; Unicode code point; Unicode property or category; Unicode block; Unicode script; Unicode grapheme; Variations; Negated variant; Character classes; Listing all characters; See Also
2.8. Match One of Several Alternatives
Problem; Solution; Discussion; See Also
2.9. Group and Capture Parts of the Match
Problem; Solution; Discussion; Variations; Noncapturing groups; Group with mode modifiers; See Also
2.10. Match Previously Matched Text Again
Problem; Solution; Discussion; See Also
2.11. Capture and Name Parts of the Match
Problem; Solution; Named capture; Named backreferences; Discussion; Named capture; Named backreferences; See Also
2.12. Repeat Part of the Regex a Certain Number of Times
Problem; Solution; Googol; Hexadecimal number; Hexadecimal number; Floating-point number; Discussion; Fixed repetition; Variable repetition; Infinite repetition; Making something optional; Repeating groups; See Also
2.13. Choose Minimal or Maximal Repetition
Problem; Solution; Discussion; See Also
2.14. Eliminate Needless Backtracking
Problem; Solution; Discussion; See Also
2.15. Prevent Runaway Repetition
Problem; Solution; Discussion; Variations; See Also
2.16. Test for a Match Without Adding It to the Overall Match
Problem; Solution; Discussion; Lookaround; Negative lookaround; Different levels of lookbehind; Matching the same text twice; Lookaround is atomic; Solution Without Lookbehind; See Also
2.17. Match One of Two Alternatives Based on a Condition
Problem; Solution; Discussion; See Also
2.18. Add Comments to a Regular Expression
Problem; Solution; Discussion; Free-spacing mode; Java has free-spacing character classes; Variations
2.19. Insert Literal Text into the Replacement Text
Problem; Solution; Discussion; When and how to escape characters in replacement text; .NET and JavaScript; Java; PHP; Perl; Python and Ruby; More escape rules for string literals; See Also
2.20. Insert the Regex Match into the Replacement Text
Problem; Solution; Regular expression; Replacement; Discussion; See Also
2.21. Insert Part of the Regex Match into the Replacement Text
Problem; Solution; Regular expression; Replacement; Discussion; Replacements using capturing groups; $10 and higher; References to nonexistent groups; Solution Using Named Capture; Regular expression; Replacement; Flavors that support named capture; See Also
2.22. Insert Match Context into the Replacement Text
Problem; Solution; Discussion; See Also
3. Programming with Regular Expressions
Programming Languages and Regex Flavors; Languages Covered in This Chapter; More Programming Languages
3.1. Literal Regular Expressions in Source Code
Problem; Solution; C#; VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; Discussion; C#; VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; See Also
3.2. Import the Regular Expression Library
Problem; Solution; C#; VB.NET; Java; Python; Discussion; C#; VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby
3.3. Creating Regular Expression Objects
Problem; Solution; C#; VB.NET; Java; JavaScript; Perl; Python; Ruby; Discussion; .NET; Java; JavaScript; PHP; Perl; Python; Ruby; Compiling a Regular Expression Down to CIL; C#; VB.NET; Discussion; See Also
3.4. Setting Regular Expression Options
Problem; Solution; C#; VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; Discussion; .NET; Java; JavaScript; PHP; Perl; Python; Ruby; Additional Language-Specific Options; .NET; Java; JavaScript; PHP; Perl; Python; Ruby; See Also
3.5. Test Whether a Match Can Be Found Within a Subject String
Problem; Solution; C#; VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; Discussion; C# and VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; See Also
3.6. Test Whether a Regex Matches the Subject String Entirely
Problem; Solution; C#; VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; Discussion; C# and VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; See Also
3.7. Retrieve the Matched Text
Problem; Solution; C#; VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; Discussion; .NET; Java; JavaScript; PHP; Perl; Python; Ruby; See Also
3.8. Determine the Position and Length of the Match
Problem; Solution; C#; VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; Discussion; .NET; Java; JavaScript; PHP; Perl; Python; Ruby; See Also
3.9. Retrieve Part of the Matched Text
Problem; Solution; C#; VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; Discussion; .NET; Java; JavaScript; PHP; Perl; Python; Ruby; Named Capture; C#; VB.NET; PHP; Perl; Python; See Also
3.10. Retrieve a List of All Matches
Problem; Solution; C#; VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; Discussion; .NET; Java; JavaScript; PHP; Perl; Python; Ruby; See Also
3.11. Iterate over All Matches
Problem; Solution; C#; VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; Discussion; .NET; Java; JavaScript; PHP; Perl; Python; Ruby; See Also
3.12. Validate Matches in Procedural Code
Problem; Solution; C#; VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; Discussion; See Also
3.13. Find a Match Within Another Match
Problem; Solution; C#; VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; Discussion; See Also
3.14. Replace All Matches
Problem; Solution; C#; VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; Discussion; .NET; Java; JavaScript; PHP; Perl; Python; Ruby; See Also
3.15. Replace Matches Reusing Parts of the Match
Problem; Solution; C#; VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; Discussion; .NET; Java; JavaScript; PHP; Perl; Python; Ruby; Named Capture; C#; VB.NET; PHP; Perl; Python; Ruby; See Also
3.16. Replace Matches with Replacements Generated in Code
Problem; Solution; C#; VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; Discussion; C#; VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; See Also
3.17. Replace All Matches Within the Matches of Another Regex
Problem; Solution; C#; VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; Discussion; See Also
3.18. Replace All Matches Between the Matches of Another Regex
Problem; Solution; C#; VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; Discussion; Perl and Ruby; Python; See Also
3.19. Split a String
Problem; Solution; C#; VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; Discussion; C# and VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; See Also
3.20. Split a String, Keeping the Regex Matches
Problem; Solution; C#; VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; Discussion; .NET; Java; JavaScript; PHP; Perl; Python; Ruby; See Also
3.21. Search Line by Line
Problem; Solution; C#; VB.NET; Java; JavaScript; PHP; Perl; Python; Ruby; Discussion; See Also
4. Validation and Formatting
4.1. Validate Email Addresses
Problem; Solution; Simple; Simple, with restrictions on characters; Simple, with all characters; No leading, trailing, or consecutive dots; Top-level domain has two to six letters; Discussion; About email addresses; Regular expression syntax; Building a regex step-by-step; Variations; See Also
4.2. Validate and Format North American Phone Numbers
Problem; Solution; Regular expression; Replacement; C#; JavaScript; Other programming languages; Discussion; Variations; Eliminate invalid phone numbers; Find phone numbers in documents; Allow a leading “1”; Allow seven-digit phone numbers; See Also
4.3. Validate International Phone Numbers
Problem; Solution; Regular expression; JavaScript; Other programming languages; Discussion; Variations; Validate international phone numbers in EPP format; See Also
4.4. Validate Traditional Date Formats
Problem; Solution; Discussion; Variations; See Also
4.5. Accurately Validate Traditional Date Formats
Problem; Solution; C#; Perl; Pure regular expression; Discussion; See Also
4.6. Validate Traditional Time Formats
Problem; Solution; Discussion; Variations; See Also
4.7. Validate ISO 8601 Dates and Times
Problem; Solution; Discussion; See Also
4.8. Limit Input to Alphanumeric Characters
Problem; Solution; Regular expression; Ruby; Other programming languages; Discussion; Variations; Limit input to ASCII characters; Limit input to ASCII non-control characters and line breaks; Limit input to shared ISO-8859-1 and Windows-1252 characters; Limit input to alphanumeric characters in any language; See Also
4.9. Limit the Length of Text
Problem; Solution; Regular expression; Perl; Other programming languages; Discussion; Variations; Limit the length of an arbitrary pattern; Limit the number of nonwhitespace characters; Limit the number of words; See Also
4.10. Limit the Number of Lines in Text
Problem; Solution; Regular expression; PHP (PCRE); Other programming languages; Discussion; Variations; Working with esoteric line separators; See Also
4.11. Validate Affirmative Responses
Problem; Solution; Regular expression; JavaScript; Other programming languages; Discussion; See Also
4.12. Validate Social Security Numbers
Problem; Solution; Regular expression; Python; Other programming languages; Discussion; Variations; Find Social Security numbers in documents; See Also
4.13. Validate ISBNs
Problem; Solution; Regular expressions; JavaScript; Python; Other programming languages; Discussion; ISBN-10 checksum; ISBN-13 checksum; Variations; Find ISBNs in documents; Eliminate incorrect ISBN identifiers; See Also
4.14. Validate ZIP Codes
Problem; Solution; Regular expression; VB.NET; Other programming languages; Discussion; See Also
4.15. Validate Canadian Postal Codes
Problem; Solution; Discussion; See Also
4.16. Validate U.K. Postcodes
Problem; Solution; Discussion; See Also
4.17. Find Addresses with Post Office Boxes
Problem; Solution; Regular expression; C#; Other programming languages; Discussion; See Also
4.18. Reformat Names From “FirstName LastName” to “LastName, FirstName”
Problem; Solution; Regular expression; Replacement; JavaScript; Other programming languages; Discussion; Variations; List surname particles at the beginning of the name
4.19. Validate Credit Card Numbers
Problem; Solution; Strip spaces and hyphens; Validate the number; Example web page with JavaScript; Discussion; Strip spaces and hyphens; Validate the number; Incorporating the solution into a web page; Extra Validation with the Luhn Algorithm
4.20. European VAT Numbers
Problem; Solution; Strip whitespace and punctuation; Validate the number; Discussion; Strip whitespace and punctuation; Validate the number; Variations; See Also
5. Words, Lines, and Special Characters
5.1. Find a Specific Word
Problem; Solution; Discussion; See Also
5.2. Find Any of Multiple Words
Problem; Solution; Using alternation; Example JavaScript solution; Discussion; Using alternation; Example JavaScript solution; See Also
5.3. Find Similar Words
Problem; Solution; Color or colour; Bat, cat, or rat; Words ending with “phobia”; Steve, Steven, or Stephen; Variations of “regular expression”; Discussion; Use word boundaries to match complete words; Color or colour; Bat, cat, or rat; Words ending with “phobia”; Steve, Steven, or Stephen; Variations of “regular expression”; See Also
5.4. Find All Except a Specific Word
Problem; Solution; Discussion; Variations; Find words that don’t contain another word; See Also
5.5. Find Any Word Not Followed by a Specific Word
Problem; Solution; Discussion; Variations; See Also
5.6. Find Any Word Not Preceded by a Specific Word
Problem; Solution; Lookbehind you; Words not preceded by “cat”; Simulate lookbehind; Discussion; Fixed, finite, and infinite length lookbehind; Simulate lookbehind; Variations; See Also
5.7. Find Words Near Each Other
Problem; Solution; Discussion; Variations; Using a conditional; Match three or more words near each other; Exponentially increasing permutations; The ugly solution; Exploiting empty backreferences; JavaScript backreferences by its own rules; Multiple words, any distance from each other; See Also
5.8. Find Repeated Words
Problem; Solution; Discussion; See Also
5.9. Remove Duplicate Lines
Problem; Solution; Option 1: Sort lines and remove adjacent duplicates; Option 2: Keep the last occurrence of each duplicate line in an unsorted file; Option 3: Keep the first occurrence of each duplicate line in an unsorted file; Discussion; Option 1: Sort lines and remove adjacent duplicates; Option 2: Keep the last occurrence of each duplicate line in an unsorted file; Option 3: Keep the first occurrence of each duplicate line in an unsorted file; See Also
5.10. Match Complete Lines That Contain a Word
Problem; Solution; Discussion; Variations; See Also
5.11. Match Complete Lines That Do Not Contain a Word
Problem; Solution; Discussion; See Also
5.12. Trim Leading and Trailing Whitespace
Problem; Solution; Discussion; Variations; See Also
5.13. Replace Repeated Whitespace with a Single Space
Problem; Solution; Clean any whitespace characters; Clean horizontal whitespace characters; Discussion; Clean any whitespace characters; Clean horizontal whitespace characters; See Also
5.14. Escape Regular Expression Metacharacters
Problem; Solution; Built-in solutions; Regular expression; Replacement; Example JavaScript function; Discussion; Variations; See Also
6. Numbers
6.1. Integer Numbers
Problem; Solution; Discussion; See Also
6.2. Hexadecimal Numbers
Problem; Solution; Discussion; See Also
6.3. Binary Numbers
Problem; Solution; Discussion; See Also
6.4. Strip Leading Zeros
Problem; Solution; Regular expression; Replacement; Getting the numbers in Perl; Stripping leading zeros in PHP; Discussion; See Also
6.5. Numbers Within a Certain Range
Problem; Solution; Discussion; See Also
6.6. Hexadecimal Numbers Within a Certain Range
Problem; Solution; Discussion; See Also
6.7. Floating Point Numbers
Problem; Solution; Discussion; See Also
6.8. Numbers with Thousand Separators
Problem; Solution; Discussion; See Also
6.9. Roman Numerals
Problem; Solution; Discussion; Convert Roman Numerals to Decimal; See Also
7. URLs, Paths, and Internet Addresses
7.1. Validating URLs
Problem; Solution; Discussion; See Also
7.2. Finding URLs Within Full Text
Problem; Solution; Discussion; See Also
7.3. Finding Quoted URLs in Full Text
Problem; Solution; Discussion; See Also
7.4. Finding URLs with Parentheses in Full Text
Problem; Solution; Discussion; See Also
7.5. Turn URLs into Links
Problem; Solution; Discussion; See Also
7.6. Validating URNs
Problem; Solution; Discussion; See Also
7.7. Validating Generic URLs
Problem; Solution; Discussion; See Also
7.8. Extracting the Scheme from a URL
Problem; Solution; Extract the scheme from a URL known to be valid; Extract the scheme while validating the URL; Discussion; See Also
7.9. Extracting the User from a URL
Problem; Solution; Extract the user from a URL known to be valid; Extract the user while validating the URL; Discussion; See Also
7.10. Extracting the Host from a URL
Problem; Solution; Extract the host from a URL known to be valid; Extract the host while validating the URL; Discussion; See Also
7.11. Extracting the Port from a URL
Problem; Solution; Extract the port from a URL known to be valid; Extract the host while validating the URL; Discussion; See Also
7.12. Extracting the Path from a URL
Problem; Solution; Discussion; See Also
7.13. Extracting the Query from a URL
Problem; Solution; Discussion; See Also
7.14. Extracting the Fragment from a URL
Problem; Solution; Discussion; See Also
7.15. Validating Domain Names
Problem; Solution; Discussion; See Also
7.16. Matching IPv4 Addresses
Problem; Solution; Regular expression; Perl; Discussion; See Also
7.17. Matching IPv6 Addresses
Problem; Solution; Standard notation; Mixed notation; Standard or mixed notation; Compressed notation; Compressed mixed notation; Standard, mixed, or compressed notation; Discussion; Standard notation; Mixed notation; Standard or mixed notation; Compressed notation; Compressed mixed notation; Standard, mixed, or compressed notation; See Also
7.18. Validate Windows Paths
Problem; Solution; Drive letter paths; Drive letter and UNC paths; Drive letter, UNC, and relative paths; Discussion; Drive letter paths; Drive letter and UNC paths; Drive letter, UNC, and relative paths; See Also
7.19. Split Windows Paths into Their Parts
Problem; Solution; Drive letter paths; Drive letter and UNC paths; Drive letter, UNC, and relative paths; Discussion; Drive letter paths; Drive letter and UNC paths; Drive letter, UNC, and relative paths; See Also
7.20. Extract the Drive Letter from a Windows Path
Problem; Solution; Discussion; See Also
7.21. Extract the Server and Share from a UNC Path
Problem; Solution; Discussion; See Also
7.22. Extract the Folder from a Windows Path
Problem; Solution; Discussion; See Also
7.23. Extract the Filename from a Windows Path
Problem; Solution; Discussion; See Also
7.24. Extract the File Extension from a Windows Path
Problem; Solution; Discussion; See Also
7.25. Strip Invalid Characters from Filenames
Problem; Solution; Regular expression; Replacement; Discussion; See Also
8. Markup and Data Interchange
8.1. Find XML-Style Tags
Problem; Solution; Quick and dirty; Allow > in attribute values; (X)HTML tags (loose); (X)HTML tags (strict); XML tags (strict); Discussion; A few words of caution; Quick and dirty; Allow > in attribute values; (X)HTML tags (loose); (X)HTML tags (strict); XML tags (strict); Skip tricky (X)HTML and XML sections; Outer regex for (X)HTML; Outer regex for XML; Variations; Match valid HTML 4 tags; See Also
8.2. Replace Tags with 
Problem; Solution; Discussion; Variations; Replace a list of tags; See Also
8.3. Remove All XML-Style Tags Except and 
Problem; Solution; Solution 1: Match tags except and; Solution 2: Match tags except and , and any tags that contain attributes; Discussion; Variations; Whitelist specific attributes; See Also
8.4. Match XML Names
Problem; Solution; XML 1.0 names (approximate); XML 1.1 names (exact); Discussion; XML 1.0 names; XML 1.1 names; Variations; See Also
8.5. Convert Plain Text to HTML by Adding and Tags
Problem; Solution; Step 1: Replace HTML special characters with character entity references; Step 2: Replace all line breaks with; Step 3: Replace double tags with; Step 4: Wrap the entire string with ⋯; JavaScript example; Discussion; Step 1: Replace HTML special characters with character entity references; Step 2: Replace all line breaks with; Step 3: Replace double tags with; Step 4: Wrap the entire string with ⋯; See Also
8.6. Find a Specific Attribute in XML-Style Tags
Problem; Solution; Tags that contain an id attribute (quick and dirty); Tags that contain an id attribute (more reliable); <div> tags that contain an id attribute; Tags that contain an id attribute with the value “my-id”; Tags that contain “my-class” within their class attribute value; Discussion; See Also
8.7. Add a cellspacing Attribute to <table> Tags That Do Not Already Include It
Problem; Solution; Regex 1: Simplistic solution; Regex 2: More reliable solution; Insert the new attribute; Discussion; See Also
8.8. Remove XML-Style Comments
Problem; Solution; Discussion; How it works; When comments can’t be removed; Variations; Find valid XML-style comments; Find C-style comments; See Also
8.9. Find Words Within XML-Style Comments
Problem; Solution; Two-step approach; Single-step approach; Discussion; Two-step approach; Single-step approach; Variations; See Also
8.10. Change the Delimiter Used in CSV Files
Problem; Solution; JavaScript example; Discussion; See Also
8.11. Extract CSV Fields from a Specific Column
Problem; Solution; JavaScript example; Discussion; Variations; Match a CSV record and capture the field in column 1 to backreference 1; Match a CSV record and capture the field in column 2 to backreference 1; Match a CSV record and capture the field in column 3 or higher to backreference 1; Replacement string
8.12. Match INI Section Headers
Problem; Solution; Discussion; See Also
8.13. Match INI Section Blocks
Problem; Solution; Discussion; See Also
8.14. Match INI Name-Value Pairs
Problem; Solution; Discussion; See Also
Index
About the Authors
Colophon

Content preview from Regular Expressions Cookbook

7.14. Extracting the Fragment from a URL

Problem

You want to extract the fragment from a string that holds a URL. For example, you want to extract top from http://www.regexcookbook.com#top or from /index.html#top.

Solution

#(.+)

Regex options: None

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Discussion

Extracting the fragment from a URL is trivial if you know that your subject text is a valid URL. The fragment is delimited from the part of the URL before it with a hash sign. The fragment is the only part of URLs in which hash signs are allowed, and the fragment is always the last part of the URL. Thus, we can easily extract the fragment by finding the first hash sign and grabbing everything until the end of the string. ‹#.+› does that nicely. Make sure to turn off free-spacing mode; otherwise, the unescaped hash sign starts a comment, and you need to escape the literal hash sign with a backslash.
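The free-spacing caveat can be seen in a short sketch. It assumes Python, where free-spacing mode is the re.VERBOSE flag; the variable names are illustrative only and do not come from the recipe.

```python
import re

url = "http://www.regexcookbook.com#top"

# In free-spacing mode (re.VERBOSE), an unescaped "#" starts a comment,
# so the hash sign has to be escaped to be matched literally.
verbose_regex = re.compile(r"""
    \#      # literal hash sign that starts the fragment
    (.+)    # group 1: the fragment itself
""", re.VERBOSE)

# Without the escape, the entire pattern is treated as a comment,
# leaving an empty regex with no capturing groups.
commented_out = re.compile(r"#(.+)", re.VERBOSE)

fragment_match = verbose_regex.search(url)
print(fragment_match.group(1))   # the fragment, without the hash sign
print(commented_out.groups)      # number of capturing groups left in the pattern
```

With free-spacing mode turned off, the unescaped ‹#(.+)› from the solution works as written.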

This regular expression will find a match only for URLs that actually contain a fragment. The overall match is the fragment together with the hash sign that delimits it from the rest of the URL. The solution adds a capturing group to retrieve just the fragment, without the delimiting #.
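As a minimal sketch of the difference between the overall match and the capturing group, here is the solution regex applied in Python. The helper name extract_fragment is hypothetical, chosen for illustration, and the subject strings are assumed to be valid URLs.

```python
import re

# The solution regex: a hash sign, then group 1 captures the fragment.
FRAGMENT_REGEX = re.compile(r"#(.+)")

def extract_fragment(url):
    """Return the fragment without the leading '#', or None if the URL has none."""
    match = FRAGMENT_REGEX.search(url)
    return match.group(1) if match else None

for subject in ("http://www.regexcookbook.com#top", "/index.html#top", "/index.html"):
    print(subject, "->", extract_fragment(subject))

# The overall match (group 0) still includes the delimiting hash sign.
overall_match = FRAGMENT_REGEX.search("/index.html#top").group(0)
print(overall_match)
```

The last subject contains no hash sign, so the search finds no match and the helper returns None rather than an empty string.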

If you don’t already know that your subject text is a valid URL, you can use one of the regexes from Recipe 7.7. The first regex in that recipe captures the fragment, if one is present in the URL, into capturing group number 13.