Regular Expressions Cookbook

by Jan Goyvaerts, Steven Levithan

May 2009

Intermediate to advanced

510 pages

15h

English

O'Reilly Media, Inc.

Content preview from Regular Expressions Cookbook

4.9. Limit the Length of Text

Problem

You want to test whether a string is composed of between 1 and 10 letters from A to Z.

Solution

All the programming languages covered by this book provide a simple, efficient way to check the length of text. For example, JavaScript strings have a length property that holds an integer indicating the string’s length. However, using a regular expression to check text length can be useful in some situations, particularly when length is only one of several rules that determine whether the subject text fits the desired pattern. The following regular expression ensures that the text is between 1 and 10 characters long, and additionally limits the text to the uppercase letters A–Z. You can modify the regular expression to allow any minimum or maximum text length, or to allow characters other than A–Z.

Regular expression

^[A-Z]{1,10}$

Regex options: None

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
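
As the Solution notes, the length bounds and the allowed characters can be changed. The following Python sketch shows one way to generate such variations; the helper name build_length_regex and its parameters are hypothetical and do not come from the book, and the character class is assumed to already be valid character-class syntax.

```python
import re

def build_length_regex(min_len, max_len, char_class="A-Z"):
    """Compile a regex that accepts min_len..max_len characters from char_class.

    Hypothetical helper for illustration. char_class is inserted verbatim
    between [ and ], so the caller must supply valid character-class syntax.
    """
    if not 0 <= min_len <= max_len:
        raise ValueError("need 0 <= min_len <= max_len")
    return re.compile(r"^[%s]{%d,%d}$" % (char_class, min_len, max_len))

# The recipe's regex: 1 to 10 uppercase letters.
letters_1_to_10 = build_length_regex(1, 10)

# A variation: 3 to 8 ASCII letters or digits.
alnum_3_to_8 = build_length_regex(3, 8, "A-Za-z0-9")
```

Calling build_length_regex(1, 10) produces the same pattern as the recipe, so only the two numbers inside the braces and the contents of the character class differ between variations.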

Perl

# Validate the first command-line argument against the regex.
if ($ARGV[0] =~ /^[A-Z]{1,10}$/) {
    print "Input is valid\n";
} else {
    print "Input is invalid\n";
}

Other programming languages

See Recipe 3.5 for help with implementing this regular expression in other programming languages.
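
As a rough illustration of the same check in another covered language, here is a minimal Python sketch. The function name is_valid is an assumption made for this example; Recipe 3.5 remains the reference for the book's own implementations.

```python
import re

# Same pattern as the Perl example above.
LENGTH_REGEX = re.compile(r"^[A-Z]{1,10}$")

def is_valid(text):
    """Return True if text consists of 1 to 10 uppercase letters A-Z.

    Note: in Python, $ also matches just before a single line break at
    the very end of the string, so "ABC\n" is accepted by this pattern.
    """
    return LENGTH_REGEX.search(text) is not None

if __name__ == "__main__":
    import sys
    # Mirror the Perl script: validate the first command-line argument.
    if len(sys.argv) > 1:
        print("Input is valid" if is_valid(sys.argv[1]) else "Input is invalid")
```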

Discussion

Here’s the breakdown for this very straightforward regex:

^         # Assert position at the beginning of the string.
[A-Z]     # Match one letter from "A" to "Z"...
  {1,10}  #   between 1 and 10 times.
$         # Assert position at the end of the string.

Regex options: Free-spacing

Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
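
The free-spacing breakdown above can be used directly in flavors that support it. The sketch below assumes Python, where the re.VERBOSE flag turns on free-spacing mode, and compares the commented version with the compact regex on a few sample strings chosen for this example.

```python
import re

# Free-spacing version: whitespace and # comments in the pattern are ignored.
free_spacing = re.compile(r"""
    ^         # Assert position at the beginning of the string.
    [A-Z]     # Match one letter from "A" to "Z"...
      {1,10}  #   between 1 and 10 times.
    $         # Assert position at the end of the string.
""", re.VERBOSE)

# Compact version from the Solution.
compact = re.compile(r"^[A-Z]{1,10}$")

# Sample inputs (illustrative only): valid, boundary, and invalid cases.
samples = ["A", "HELLO", "ABCDEFGHIJ", "", "ABCDEFGHIJK", "hello", "AB1"]

# Each tuple holds the sample, the free-spacing result, and the compact result.
results = [
    (s, bool(free_spacing.search(s)), bool(compact.search(s)))
    for s in samples
]

for sample, verbose_ok, compact_ok in results:
    print(repr(sample), verbose_ok, compact_ok)
```

Both versions should agree on every sample, since free-spacing mode changes only how the pattern is written, not what it matches.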

The ‹^›