2.16. Test for a Match Without Adding It to the Overall Match

Problem

Find any word that occurs between a pair of HTML bold tags, without including the tags in the regex match. For instance, if the subject is My <b>cat</b> is furry, the only valid match should be cat.

Solution

(?<=<b>)\w+(?=</b>)
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9

Ruby 1.8 and JavaScript engines that predate ECMAScript 2018 support the lookahead (?=</b>), but not the lookbehind (?<=<b>).
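
As a quick illustration, the solution can be tried in Python, one of the flavors listed above. This is a minimal sketch using the standard re module. The names bold_word and words are chosen for the example and are not part of the recipe.

```python
import re

# The recipe's regex, compiled with the case-insensitive option it lists.
bold_word = re.compile(r"(?<=<b>)\w+(?=</b>)", re.IGNORECASE)

subject = "My <b>cat</b> is furry"
words = bold_word.findall(subject)
print(words)  # ['cat']
```

Because the tags are only tested by the lookaround groups, findall returns the bare word rather than the tagged text.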

Discussion

Lookaround

The four kinds of lookaround groups supported by modern regex flavors (positive and negative lookahead, and positive and negative lookbehind) share a special property: they give up the text matched by the part of the regex inside the lookaround. In effect, lookaround checks whether certain text can be matched without making that text part of the overall match.
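
The difference between testing text and matching it can be seen by comparing the recipe's regex with a version that has no lookaround. The sketch below assumes Python's re module. The comparison regex <b>\w+</b> and the variable names are illustrative and do not come from the recipe.

```python
import re

subject = "My <b>cat</b> is furry"

# Without lookaround, the tags are consumed and become part of the match.
with_tags = re.search(r"<b>\w+</b>", subject)

# With lookaround, the tags are only tested, so the match is the word alone.
without_tags = re.search(r"(?<=<b>)\w+(?=</b>)", subject)

print(with_tags.group(), with_tags.span())        # <b>cat</b> (3, 13)
print(without_tags.group(), without_tags.span())  # cat (6, 9)
```

Both regexes require the same surrounding tags, but only the first one includes them in the reported match.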

Lookaround that looks backward is called lookbehind. This is the only regular expression construct that will traverse the text from right to left instead of from left to right. The syntax for positive lookbehind is (?<=text). The four characters (?<= form the opening bracket. What you can put inside the lookbehind, here represented by text, varies among regular expression flavors. But simple literal text, such as (?<=<b>), always works.

Lookbehind checks to see whether the text inside the lookbehind occurs immediately to the left of the position that the regular expression engine has reached. If you match (?<=<b>) against My <b>cat</b> is furry, the lookbehind will fail to match until the regular expression starts the match ...
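
The position check performed by a lookbehind can be observed directly. As a rough sketch, assuming Python's re module, the lookbehind is applied on its own and the positions where it succeeds are collected. The name positions is made up for the example.

```python
import re

subject = "My <b>cat</b> is furry"

# A lookbehind on its own yields a zero-length match at each position
# where the text immediately to the left is "<b>".
positions = [m.start() for m in re.finditer(r"(?<=<b>)", subject)]

print(positions)                # [6]
print(subject[positions[0]:])   # cat</b> is furry
```

The match itself is empty. The lookbehind only reports that the text to the left of the position fits, and leaves the position unchanged.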
