5.8. Find Repeated Words


You’re editing a document and would like to check it for any incorrectly repeated words. You want to find these doubled words despite capitalization differences, such as with “The the.” You also want to allow differing amounts of whitespace between words, even if this causes the words to extend across more than one line. Any separating punctuation, however, should cause the words to no longer be treated as if they are repeating.


A backreference matches something that has been matched before, and therefore provides the key ingredient for this recipe:

Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
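As a minimal sketch of how a backreference can find doubled words, the Python snippet below uses a pattern that is an assumption on our part, chosen to fit the requirements stated above rather than quoted from the recipe: a run of letters captured in group 1, one or more whitespace characters, then the same text again via `\1`, all bounded by word boundaries. The helper name `find_repeated_words` and the sample text are hypothetical.

```python
import re

# Assumed pattern (not quoted from the recipe): a word, whitespace, the same word again.
# re.IGNORECASE lets [A-Z] match lowercase letters and makes \1 compare case-insensitively.
REPEATED_WORD = re.compile(r"\b([A-Z]+)\s+\1\b", re.IGNORECASE)


def find_repeated_words(text):
    """Return each doubled-word match exactly as it appears in the text."""
    return [m.group(0) for m in REPEATED_WORD.finditer(text)]


sample = "The the cat sat on\n   on the mat. Well, well."
print(find_repeated_words(sample))
# → ['The the', 'on\n   on']
```

Because `\s` matches line breaks, the second match spans two lines. "Well, well" is not reported, since the comma between the words is not whitespace.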

If you want to use this regular expression to keep the first word but remove subsequent duplicate words, replace all matches with backreference 1. Another approach is to highlight matches by surrounding them with other characters (such as an HTML tag), so you can more easily identify them during later inspection. Recipe 3.15 shows how you can use backreferences in your replacement text, which you’ll need to do to implement either of these approaches.
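Both replacement approaches can be sketched in Python with `re.sub`. The pattern is the same assumed one described by the recipe's requirements, and the function names and the `<mark>` tag are illustrative choices, not something the recipe prescribes.

```python
import re

# Assumed pattern: a word captured in group 1, whitespace, then the same word again.
REPEATED_WORD = re.compile(r"\b([A-Z]+)\s+\1\b", re.IGNORECASE)


def remove_repeats(text):
    """Keep the first word of each doubled pair by replacing the match with group 1."""
    return REPEATED_WORD.sub(r"\1", text)


def highlight_repeats(text):
    """Wrap each doubled pair in an HTML tag; \\g<0> stands for the whole match."""
    return REPEATED_WORD.sub(r"<mark>\g<0></mark>", text)


print(remove_repeats("The the cat sat on\n   on the mat."))
# → The cat sat on the mat.
print(highlight_repeats("It is is fine"))
# → It <mark>is is</mark> fine
```

Note that this pattern matches one repetition at a time, so a run of three identical words would still leave two after a single pass of `remove_repeats`.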

If you just want to find repeated words so you can manually examine whether they need to be corrected, Recipe 3.7 shows the code you need. A text editor or grep-like tool, such as those mentioned in Tools for Working with Regular Expressions in Chapter 1, can help you find repeated words while providing the context needed to determine whether ...
