Regular Expressions Cookbook, 2nd Edition

by Jan Goyvaerts, Steven Levithan

August 2012

Intermediate to advanced

609 pages

19h 16m

English

O'Reilly Media, Inc.

Content preview from Regular Expressions Cookbook, 2nd Edition

9.3. Remove All XML-Style Tags Except <em> and <strong>

Problem

You want to remove all tags in a string except <em> and <strong>.

In a separate case, you want to remove not only all tags other than <em> and <strong>, but also any <em> and <strong> tags that contain attributes.

Solution

This is a perfect setting to put negative lookahead (explained in Recipe 2.16) to use. Applied to this problem, negative lookahead lets you match what looks like a tag, except when certain words come immediately after the opening < or </. If you then replace all matches with an empty string (following the code in Recipe 3.14), only the approved tags are left behind.

Solution 1: Match tags except <em> and <strong>

</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>

Regex options: Case insensitive

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
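
As a minimal sketch of the approach described above (replace every match with an empty string), the Solution 1 regex might be applied in Python as follows. The helper name strip_tags_except_em_strong and the sample string are illustrative assumptions, not part of the recipe.

```python
import re

# Solution 1 regex, compiled with the case-insensitive option the recipe calls for.
TAG_EXCEPT_EM_STRONG = re.compile(
    r"""</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)


def strip_tags_except_em_strong(text: str) -> str:
    """Hypothetical helper: remove every tag except <em> and <strong>."""
    return TAG_EXCEPT_EM_STRONG.sub("", text)


# Illustrative input, not taken from the book.
sample = '<p class="x">Some <em>very</em> <b>bold</b> and <strong>strong</strong> text</p>'
print(strip_tags_except_em_strong(sample))
# Some <em>very</em> bold and <strong>strong</strong> text
```

Because of the word boundary inside the lookahead, a tag such as <embed> is still removed even though its name begins with "em".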

In free-spacing mode:

< /?                   # Permit closing tags
(?!
    (?: em | strong )  # List of tags to avoid matching
    \b                 # Word boundary avoids partial word matches
)
[a-z]                  # Tag name initial character must be a-z
(?: [^>"']             # Any character except >, ", or '
  | "[^"]*"            # Double-quoted attribute value
  | '[^']*'            # Single-quoted attribute value
)*
>

Regex options: Case insensitive, free-spacing

Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
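
In Python, the free-spacing option corresponds to the re.VERBOSE flag. The sketch below compiles the commented version and checks it against the compact one on a few inputs; the variable names and sample strings are assumptions made for illustration.

```python
import re

# Free-spacing version: whitespace and # comments are ignored under re.VERBOSE.
FREE_SPACING = re.compile(
    r"""
    < /?                   # Permit closing tags
    (?!
        (?: em | strong )  # List of tags to avoid matching
        \b                 # Word boundary avoids partial word matches
    )
    [a-z]                  # Tag name initial character must be a-z
    (?: [^>"']             # Any character except >, ", or '
      | "[^"]*"            # Double-quoted attribute value
      | '[^']*'            # Single-quoted attribute value
    )*
    >
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Compact version of the same regex, for comparison.
COMPACT = re.compile(
    r"""</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)

# Illustrative inputs, including a quoted attribute value that contains ">".
samples = [
    "<div id='a>b'>x</div>",
    "<em>kept</em> <i>dropped</i>",
    "<STRONG>kept</STRONG><br/>",
]
for s in samples:
    # Both forms should remove exactly the same tags.
    assert FREE_SPACING.sub("", s) == COMPACT.sub("", s)
    print(FREE_SPACING.sub("", s))
```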

Solution 2: Match tags except <em> and <strong>, and any tags that contain attributes

With one change (replacing the ‹\b› with ‹\s*>›), you can make the regex also match any <em> and <strong> tags that contain ...
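
The preview cuts off here, so the following is only a sketch of the one change named above, applied to the Solution 1 regex: the ‹\b› in the lookahead is swapped for ‹\s*>›. The resulting pattern, the helper name strip_tags_and_attributed, and the sample input are assumptions derived from that description, not text from the book.

```python
import re

# Solution 1 with \b replaced by \s*> inside the negative lookahead (assumed form).
# Only bare <em>, </em>, <strong>, and </strong> (optionally with trailing
# whitespace before the >) are now protected from matching.
TAG_OR_ATTRIBUTED = re.compile(
    r"""</?(?!(?:em|strong)\s*>)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)


def strip_tags_and_attributed(text: str) -> str:
    """Hypothetical helper: remove matches of the modified regex."""
    return TAG_OR_ATTRIBUTED.sub("", text)


# Illustrative input, not taken from the book.
sample = '<em class="x">a</em> <em>b</em> <span>c</span>'
print(strip_tags_and_attributed(sample))
# a</em> <em>b</em> c
```

Note that in this sketch the closing </em> that paired with the removed <em class="x"> is left in place, since the closing tag itself carries no attributes.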

Publisher Resources

ISBN: 9781449327453
