Regular Expressions Cookbook, 2nd Edition

by Jan Goyvaerts, Steven Levithan

August 2012

Intermediate to advanced

609 pages

19h 16m

English

O'Reilly Media, Inc.


Content preview from Regular Expressions Cookbook, 2nd Edition

5.9. Remove Duplicate Lines

Problem

You have a log file, database query output, or some other type of file or string with duplicate lines. You need to remove all but one of each duplicate line using a text editor or other similar tool.

Solution

There is a variety of software (including the Unix command-line utility uniq and Windows PowerShell cmdlet Get-Unique) that can help you remove duplicate lines in a file or string. The following sections contain three regex-based approaches that can be especially helpful when trying to accomplish this task in a nonscriptable text editor with regular expression search-and-replace support.

When you’re programming, avoid Options 2 and 3, since they are inefficient compared to other available approaches, such as using a hash object to keep track of unique lines. However, Option 1 (which requires that you sort the lines in advance, unless you only want to remove adjacent duplicates) may be an acceptable approach since it’s quick and easy.
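As a rough illustration of the hash-object approach mentioned above, here is a minimal Python sketch. It is not taken from the book; the function name unique_lines and the sample data are hypothetical, and it assumes that keeping the first occurrence of each line is what you want.

```python
def unique_lines(text: str) -> str:
    """Keep the first occurrence of each line, preserving the original order."""
    seen = set()   # hash-based set of lines already encountered
    kept = []      # lines to keep, in their original order
    for line in text.splitlines():
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return "\n".join(kept)


# Hypothetical sample input with two duplicated lines
log = "alpha\nbeta\nalpha\ngamma\nbeta"
print(unique_lines(log))
```

Each line is looked up in the set once, so the work grows roughly linearly with the number of lines, and no sorting is needed.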

Option 1: Sort lines and remove adjacent duplicates

If you’re able to sort the lines in the file or string you’re working with so that any duplicate lines appear next to each other, you should do so, unless the order of the lines must be preserved. Sorting first lets you remove the duplicates with a simpler and more efficient search-and-replace operation than would otherwise be possible.
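To make the idea concrete, the sketch below sorts the lines and then collapses runs of identical adjacent lines with one search-and-replace. The pattern and the helper name sort_and_dedupe are assumptions made for illustration, not necessarily the regex this recipe gives, and the code assumes lines are separated by "\n" only.

```python
import re

# Assumed pattern: a whole line, followed by one or more exact copies of it.
# re.MULTILINE makes ^ and $ match at the start and end of every line.
ADJACENT_DUPLICATES = re.compile(r"^(.*)(?:\n\1)+$", re.MULTILINE)


def sort_and_dedupe(text: str) -> str:
    """Sort the lines, then replace each run of identical lines with one copy."""
    sorted_text = "\n".join(sorted(text.splitlines()))
    return ADJACENT_DUPLICATES.sub(r"\1", sorted_text)


# Hypothetical sample input
print(sort_and_dedupe("b\na\nb\na\nc"))
```

Because the pattern is anchored at both ends of a line, a line that is only a prefix of the next one (such as "a" followed by "ab") is not treated as a duplicate.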

After sorting the lines, use the following regex and replacement string to get rid of the duplicates: