Detecting Duplicate Words
Problem
You want to check for doubled words in a document.
Solution
Use backreferences in your regular expression.
Discussion
Parentheses in a pattern make the regular expression engine remember what matched that part of the pattern. Later in your pattern, you can refer to the actual string that matched with \1 (the string matched by the first set of parentheses), \2 (the string matched by the second set of parentheses), and so on. Don’t use $1; it would be treated as a variable and interpolated before the match began. If you match /([A-Z])\1/, that says to match a capital letter followed not just by any capital letter, but by whichever one was captured by the first set of parentheses in that pattern.
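To make the backreference behavior concrete, here is a small sketch that tries /([A-Z])\1/ against a few strings. The sample strings and the %doubled hash are illustrative assumptions, not part of the recipe.

```perl
use strict;
use warnings;

# Illustrative inputs (assumed for this sketch, not from the recipe).
my @samples = ('BOOK', 'BoOk', 'ABBA', 'ABCA');

my %doubled;
for my $s (@samples) {
    # \1 inside the pattern means "whatever the first parentheses captured",
    # so this matches only when the same capital letter appears twice in a row.
    if ($s =~ /([A-Z])\1/) {
        $doubled{$s} = $1;    # after a successful match, $1 holds that letter
    }
}

for my $s (@samples) {
    printf "%-4s %s\n", $s,
        exists $doubled{$s} ? "doubled $doubled{$s}" : "no doubled capital";
}
```

Only 'BOOK' and 'ABBA' should be reported as containing a doubled capital. 'BoOk' has two capitals, but they are different letters and are not adjacent.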
This sample code reads its input files by paragraph, with the definition of paragraph following Perl’s notion of two or more contiguous newlines. Within each paragraph, it finds all duplicate words. It ignores case and can match across newlines.
Here we use /x to embed whitespace and comments to make the regular expression readable. The /i modifier lets us match both instances of "is" in the sentence "Is is this ok?". We use /g in a while loop to keep finding duplicate words until we run out of text. Within the pattern, use \b (word boundary) and \s (whitespace) to help pick out whole words and avoid matching the "is" at the end of "This".
    $/ = '';                      # paragrep mode
    while (<>) {
        while ( m{
                    \b            # start at a word boundary (begin letters)
                    (\S+)         # find chunk of non-whitespace
                    \b            # ...
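Since the listing above breaks off partway through the pattern, here is a separate minimal sketch of the same idea. It is not the book’s code. The helper name find_duplicate_words and the sample text are assumptions made for illustration, and the helper works on a string rather than on input files so that it can be run as is.

```perl
use strict;
use warnings;

# Hypothetical helper: returns each word that is immediately repeated in the
# given text, ignoring case and allowing the repeat to sit on the next line.
sub find_duplicate_words {
    my ($text) = @_;
    my @found;
    while ( $text =~ m{
                \b          # start at a word boundary
                (\S+)       # a chunk of non-whitespace, captured as \1
                \b          # end of that chunk at a word boundary
                (?:
                    \s+     # whitespace, which may include newlines
                    \1      # the same chunk again
                    \b      # ending at a word boundary
                )+          # one or more repeats
            }xig )
    {
        push @found, $1;    # first spelling of the repeated word
    }
    return @found;
}

my @dups = find_duplicate_words("Is is this ok?\nThe the\nthe end.");
print "dup word '$_'\n" for @dups;

# "This is" must not count: the "is" inside "This" does not start at a
# word boundary, so the pattern cannot begin there.
my @none = find_duplicate_words("This is fine");
print "no duplicates in 'This is fine'\n" unless @none;
```

To apply the same pattern to files one paragraph at a time, one option is to set $/ = '' and run the inner loop against $_ inside while (<>), as the opening lines of the listing do.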