Regular Expressions Cookbook, 2nd Edition

by Jan Goyvaerts, Steven Levithan

August 2012

Intermediate to advanced

609 pages

19h 16m

English

O'Reilly Media, Inc.



Content preview from Regular Expressions Cookbook, 2nd Edition

Processing Markup and Data Formats with Regular Expressions

This final chapter focuses on common tasks that come up when working with an assortment of common markup languages and data formats: HTML, XHTML, XML, CSV, and INI. Although we’ll assume at least basic familiarity with these technologies, a brief description of each is included next to make sure we’re on the same page before digging in. The descriptions concentrate on the basic syntax rules needed to correctly search through the data structures of each format. Other details will be introduced as we encounter relevant issues.

Although it’s not always apparent on the surface, some of these formats can be surprisingly complex to process and manipulate accurately, at least using regular expressions. When programming, it’s usually best to use dedicated parsers and APIs instead of regular expressions when performing many of the tasks in this chapter, especially if accuracy is paramount (e.g., if your processing might have security implications). However, we don’t subscribe to a dogmatic view that XML-style markup should never be processed with regular expressions. There are cases when regular expressions are a great tool for the job, such as when making one-time edits in a text editor, scraping data from a limited set of HTML files, fixing broken XML files, or dealing with file formats that look like but aren’t quite XML. There are some issues to be aware of, but reading through this chapter will ensure that you don’t stumble into ...
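To make the accuracy caveat concrete, here is a minimal sketch (not taken from the book) that contrasts a deliberately naive tag-matching regex with Python's standard-library `html.parser`. The sample string, the `NAIVE_TAG` pattern, and the helper names `naive_tags`, `parsed_tags`, and `TagCollector` are all assumptions made up for this illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: the attribute value contains a literal ">".
SAMPLE = '<p title="a > b">text</p>'

# Naive pattern (an assumption for illustration): it treats the first ">"
# after "<" as the end of the tag, even inside a quoted attribute value.
NAIVE_TAG = re.compile(r"<[^>]*>")


class TagCollector(HTMLParser):
    """Collects start tags and their attributes using the stdlib parser."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def naive_tags(html):
    """Return every substring the naive regex considers a tag."""
    return NAIVE_TAG.findall(html)


def parsed_tags(html):
    """Return (tag name, attributes) pairs found by the dedicated parser."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print(naive_tags(SAMPLE))   # the first "tag" is cut off at the inner ">"
    print(parsed_tags(SAMPLE))  # the parser keeps the attribute value whole
```

For a one-time edit on files you control, the naive pattern may be good enough; when the input is untrusted or accuracy matters, the parser-based version is the safer choice.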


Publisher Resources

ISBN: 9781449327453
