
Regular Expressions Cookbook

by Jan Goyvaerts, Steven Levithan

May 2009

Intermediate to advanced

510 pages

15h

English

O'Reilly Media, Inc.




8.4. Match XML Names

Problem

You want to check whether a string is a legitimate XML name (a common syntactic construct). XML provides precise rules for the characters that can occur in a name, and reuses those rules for element, attribute, and entity names, processing instruction targets, and more. A name must begin with a letter, underscore, or colon, followed by any combination of letters, digits, underscores, colons, hyphens, and periods. That’s an approximate description, but it’s pretty close. The exact list of permitted characters depends on the version of XML in use.

Alternatively, you might want to splice a pattern for matching valid names into other XML-handling regexes, when the extra precision warrants the added complexity.

Following are some examples of valid names:

thing
_thing_2_
:Российские-Вещь
fantastic4:the.thing
日本の物

Note that letters from non-Latin scripts are allowed, even including the ideographic characters in the last example. Likewise, any Unicode digit is allowed after the first character, not just the Arabic numerals 0–9.
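To make the point about non-Latin letters and digits concrete, the short Python sketch below prints the Unicode general category of a few characters, some taken from the examples above. It is only an illustration: the helper name `describe` is hypothetical, and Unicode general categories are a rough proxy for the XML character lists, which depend on the XML version in use.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the Unicode general category of a single character."""
    return unicodedata.category(ch)


# Letters fall in the L* categories; decimal digits fall in Nd.
samples = {
    "t": describe("t"),    # Latin small letter
    "В": describe("В"),    # Cyrillic capital letter Ve
    "物": describe("物"),   # CJK ideograph
    "4": describe("4"),    # ASCII digit
    "٤": describe("٤"),    # Arabic-Indic digit four
}

for ch, category in samples.items():
    print(ch, category)
```

Both the ASCII digit and the Arabic-Indic digit report the same category, which is why a name pattern limited to 0–9 would be too strict.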

For comparison, here are several examples of invalid names that should not be matched by the regex:

thing!
thing with spaces
.thing.with.a.dot.in.front
-thingamajig
2nd_thing
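The approximate rule described in the Problem section can be sketched in Python as follows. This is not the book's solution and not the exact XML 1.0 or 1.1 grammar: it assumes Python's Unicode-aware `\w` is a good enough stand-in for "letters, digits, and underscore", and the names `APPROX_XML_NAME` and `is_approx_xml_name` are hypothetical.

```python
import re

# Approximation only, not the exact XML name grammar.
# First character: a word character that is not a decimal digit, or a colon.
# Remaining characters: word characters, colons, periods, and hyphens.
APPROX_XML_NAME = re.compile(r"(?:[^\W\d]|:)[\w:.\-]*")


def is_approx_xml_name(s: str) -> bool:
    """Return True if the whole string fits the approximate name rule."""
    # fullmatch requires the pattern to cover the entire string,
    # so no explicit anchors are needed.
    return APPROX_XML_NAME.fullmatch(s) is not None


valid = [
    "thing",
    "_thing_2_",
    ":Российские-Вещь",
    "fantastic4:the.thing",
    "日本の物",
]
invalid = [
    "thing!",
    "thing with spaces",
    ".thing.with.a.dot.in.front",
    "-thingamajig",
    "2nd_thing",
]

for name in valid + invalid:
    print(f"{name!r}: {is_approx_xml_name(name)}")
```

Because `\w` is defined by Python rather than by the XML specification, this check may accept or reject a few unusual characters differently from a conforming XML parser.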

Solution

As with identifiers in many programming languages, there is a set of characters that can occur in an XML name, and a subset that can be used as the first character. Those character lists are dramatically different for XML 1.0 Fourth Edition ...