Referring to Wildcards: Back-Referencing
We have seen that we can find, say, words that contain two consecutive vowels; the expression \w*[aeiouy][aeiouy]\w* does that. This finds words like tough, blood, clear, and too. But what if we want to find words with two consecutive identical vowels, so that we find blood and too, but not clear and tough? So what we're after is something like "Find a word with a vowel followed by that same vowel." This is possible in a regex:
\w*([aeiouy])\1\w*
This expression differs from the first one in two respects. First, the first vowel class is enclosed in parentheses. Like the parentheses in earlier examples, they group something, even though the group here consists of just one element, namely a character class. But the parentheses also create a referent, that is, something that can be referred to later in the expression. Second, the second vowel class is replaced by \1, which refers back to whatever the parenthesized group actually matched. Referrers are escaped numbers starting with 1 (\1, \2, and so on); an unescaped number is interpreted as a literal digit.
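To make the difference concrete, here is a minimal sketch in Python's re module, which is an assumption on our part: the text does not name a regex engine, and the variable names are purely illustrative. It runs both expressions over the four sample words.

```python
import re

# Illustrative names; the patterns themselves are the two from the text.
ANY_TWO_VOWELS = re.compile(r"\w*[aeiouy][aeiouy]\w*")
SAME_VOWEL_TWICE = re.compile(r"\w*([aeiouy])\1\w*")

words = ["tough", "blood", "clear", "too"]

# fullmatch() requires the whole word to fit the pattern.
any_two = [w for w in words if ANY_TWO_VOWELS.fullmatch(w)]
same_twice = [w for w in words if SAME_VOWEL_TWICE.fullmatch(w)]

print(any_two)     # ['tough', 'blood', 'clear', 'too']
print(same_twice)  # ['blood', 'too']

# group(1) holds the text captured by the parentheses, i.e. what \1 refers to.
print(SAME_VOWEL_TWICE.fullmatch("blood").group(1))  # o
```

The back-reference repeats the text that was captured, not the character class, which is why clear and tough drop out of the second list.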
A similar pattern matches words like travelling, focussed, and formatting, that is, verb forms that double their final consonant in the past tense or the present participle. The expression to find these verb forms is:
(?x) \w+ ([lst]) \1 (ed|ing)
We match l, s, or t followed by the same letter (so we match ll, ss, and tt), followed by ed or ing. We add \w+ at the beginning to find whole words. The (?x) at the start turns on free-spacing mode, so the spaces in the expression are ignored.
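The doubled-consonant expression can be tried out the same way. This is again a sketch in Python's re module, chosen here only for illustration. The word list beyond the three examples from the text is made up for the test.

```python
import re

# The expression from the text; (?x) makes the engine ignore the spaces.
DOUBLED_FINAL = re.compile(r"(?x) \w+ ([lst]) \1 (ed|ing)")

# The first three words come from the text; the rest are illustrative.
candidates = ["travelling", "focussed", "formatting",
              "traveled", "walked", "meeting"]

for word in candidates:
    m = DOUBLED_FINAL.fullmatch(word)
    if m:
        # group(1) is the doubled consonant, group(2) the ending.
        print(word, "->", m.group(1) * 2, "+", m.group(2))
    else:
        print(word, "-> no match")

matched = [w for w in candidates if DOUBLED_FINAL.fullmatch(w)]
```

Note that the pattern only checks spelling, so it also accepts words such as telling or passing, where the double consonant already belongs to the base verb.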
Here is another example. Earlier we saw an example to find dates:
\d\d?-\d\d?-(\d\d)\d\d
This expression is reasonably flexible ...