2.15. Prevent Runaway Repetition
Problem
Use a single regular expression to match a complete HTML file, checking for properly nested html, head, title, and body tags. The regular expression must fail efficiently on HTML files that do not have the proper tags.
Solution
<html>(?>.*?<head>)(?>.*?<title>)(?>.*?</title>)(?>.*?</head>)(?>.*?<body[^>]*>)(?>.*?</body>).*?</html>
Regex options: Case insensitive, dot matches line breaks
Regex flavors: .NET, Java, PCRE, Perl, Ruby
JavaScript does not support atomic grouping, and Python's re module supports it only from version 3.11 onward. Without atomic grouping, there is no way to eliminate needless backtracking in this regex. When programming in JavaScript or an older version of Python, you can solve this problem by doing a literal text search for each of the tags one by one, starting each search in the subject text just after the tag found last.
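The tag-by-tag literal search described above might be sketched in Python roughly as follows. The names `REQUIRED_TAGS` and `has_tags_in_order` are made up for this illustration and are not from the recipe. Two simplifications are assumed: case-insensitive matching is emulated by lowercasing the subject text, and the `<body[^>]*>` part of the regex is approximated by searching for `<body` and then for the next `>`.

```python
# Hypothetical helper, not from the recipe: checks that the tags occur
# in the required order by running one literal search per tag.
REQUIRED_TAGS = [
    "<html>", "<head>", "<title>", "</title>", "</head>",
    "<body", ">",  # approximates <body[^>]*> : "<body", then the next ">"
    "</body>", "</html>",
]

def has_tags_in_order(html: str) -> bool:
    """Return True if every required tag is found, each after the previous one."""
    lowered = html.lower()  # emulate the "case insensitive" regex option
    position = 0
    for tag in REQUIRED_TAGS:
        found = lowered.find(tag, position)
        if found == -1:
            return False  # fail immediately; nothing to backtrack into
        position = found + len(tag)  # continue after the tag just found
    return True
```

Each call to `str.find` scans forward from the end of the previous tag and never revisits earlier text, so a missing tag makes the function give up after a single pass over the remaining text.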
Discussion
The proper solution to this problem is more easily understood if we start from this naïve solution:
<html>.*?<head>.*?<title>.*?</title>.*?</head>.*?<body[^>]*>.*?</body>.*?</html>
Regex options: Case insensitive, dot matches line breaks
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
When you test this regex on a proper HTML file, it works perfectly well. ‹.*?› skips over anything, because we turn on “dot matches line breaks.” The lazy asterisk makes sure the regex goes ahead only one character at a time, each time checking whether the next tag can be matched. Recipes 2.4 and 2.13 explain all this.
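As a quick check of the well-formed case, the naïve regex can be tried in Python. This is only a sketch: the sample `page` and the variable names are invented for the illustration, and `re.IGNORECASE` and `re.DOTALL` stand in for the "case insensitive" and "dot matches line breaks" options.

```python
import re

# The naive regex from this recipe, split across two string literals
# only for readability; Python concatenates them into one pattern.
NAIVE = re.compile(
    r"<html>.*?<head>.*?<title>.*?</title>"
    r".*?</head>.*?<body[^>]*>.*?</body>.*?</html>",
    re.IGNORECASE | re.DOTALL,
)

# Made-up sample document with all required tags in the right order.
page = """<HTML>
<head><title>Test</title></head>
<body class="main">
<p>Hello</p>
</body>
</html>"""

match = NAIVE.search(page)
# On this sample the match spans the whole document.
matched_whole_page = match is not None and match.group(0) == page
```

Because each `.*?` is lazy, it expands one character at a time and stops as soon as the tag that follows it can be matched.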
But this regex gets you into trouble when ...