2.13. Choose Minimal or Maximal Repetition

Problem

Match a pair of <p> and </p> XHTML tags and the text between them. The text between the tags can include other XHTML tags.

Solution

<p>.*?</p>
Regex options: Dot matches line breaks
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
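
As a rough sketch of how this solution might be applied, the following Python snippet runs the regex over a small sample string. The sample text and the variable names (xhtml, paragraph_regex, paragraphs) are assumptions made for illustration; re.DOTALL is Python's name for the "dot matches line breaks" option.

```python
import re

# Hypothetical sample input, invented for this illustration.
xhtml = (
    "<p>\n"
    "The very <em>first</em> task.\n"
    "</p>\n"
    "<p>\n"
    "Second paragraph\n"
    "</p>\n"
)

# re.DOTALL lets the dot match line breaks, so a paragraph may span lines.
paragraph_regex = re.compile(r"<p>.*?</p>", re.DOTALL)

# findall returns each non-overlapping match as a string.
paragraphs = paragraph_regex.findall(xhtml)

for text in paragraphs:
    print(repr(text))
```

Because the asterisk is followed by a question mark, each match stops at the nearest closing tag, so the two paragraphs come back as separate list items.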

Discussion

All the quantifiers discussed in Recipe 2.12 are greedy, meaning they try to repeat as many times as possible, giving back only when required to allow the remainder of the regular expression to match.
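
The "giving back" behavior can be observed with a small Python sketch. The pattern and subject string below are assumptions chosen only to make the effect visible; they are not taken from the recipe.

```python
import re

# Illustrative pattern: a greedy \d+ followed by exactly two more digits.
match = re.match(r"(\d+)(\d\d)", "12345")

# The greedy \d+ first consumes all five digits, then gives back two
# so that \d\d can still match at the end of the string.
greedy_part = match.group(1)
remainder = match.group(2)

print(greedy_part, remainder)  # prints: 123 45
```

The greedy quantifier keeps as much as it can, surrendering only the two digits the rest of the regex needs.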

This greediness can make it hard to pair tags in XHTML (which is a version of XML and therefore requires every opening tag to be matched by a closing tag). Consider the following simple excerpt of XHTML:

<p>
The very <em>first</em> task is to find the beginning of a paragraph.
</p>
<p>
Then you have to find the end of the paragraph
</p>

There are two opening <p> tags and two closing </p> tags in the excerpt. You want to match the first <p> with the first </p>, because they mark a single paragraph. Note that this paragraph contains a nested <em> tag, so the regex can’t simply stop when it encounters a < character.

Take a look at one incorrect solution for the problem in this recipe:

<p>.*</p>
Regex options: Dot matches line breaks
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

The only difference from the solution is that this incorrect regex lacks the extra question mark after the asterisk. It therefore uses the same greedy asterisk explained in Recipe 2.12.
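
To see the two regexes side by side, here is a minimal Python sketch that applies both to the XHTML excerpt shown earlier. The variable names are assumptions made for this illustration, and re.DOTALL stands in for the "dot matches line breaks" option.

```python
import re

# The XHTML excerpt from the discussion, joined into one string.
excerpt = (
    "<p>\n"
    "The very <em>first</em> task is to find the beginning of a paragraph.\n"
    "</p>\n"
    "<p>\n"
    "Then you have to find the end of the paragraph\n"
    "</p>\n"
)

greedy = re.search(r"<p>.*</p>", excerpt, re.DOTALL)
lazy = re.search(r"<p>.*?</p>", excerpt, re.DOTALL)

# Count how many closing tags each match swallowed.
greedy_closers = greedy.group().count("</p>")
lazy_closers = lazy.group().count("</p>")

print(greedy_closers)  # prints: 2 (runs on to the last </p>)
print(lazy_closers)    # prints: 1 (stops at the first </p>)
```

The greedy version spans both paragraphs in a single match, while the lazy version ends at the first closing tag.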

After matching the first <p> tag in ...
