Skip to Main Content
Regular Expressions Cookbook, 2nd Edition
book

Regular Expressions Cookbook, 2nd Edition

by Jan Goyvaerts, Steven Levithan
August 2012
Intermediate to advanced content levelIntermediate to advanced
609 pages
19h 16m
English
O'Reilly Media, Inc.
Content preview from Regular Expressions Cookbook, 2nd Edition

2.15. Prevent Runaway Repetition

Problem

Use a single regular expression to match a complete HTML file, checking for properly nested html, head, title, and body tags. The regular expression must fail efficiently on HTML files that do not have the proper tags.

Solution

<html>(?>.*?<head>)(?>.*?<title>)(?>.*?</title>)↵
(?>.*?</head>)(?>.*?<body[^>]*>)(?>.*?</body>).*?</html>
Regex options: Case insensitive, dot matches line breaks
Regex flavors: .NET, Java, PCRE, Perl, Ruby

JavaScript and Python do not support atomic grouping. There is no way to eliminate needless backtracking with these two regex flavors. When programming in JavaScript or Python, you can solve this problem by doing a literal text search for each of the tags one by one, searching for the next tag through the remainder of the subject text after the one last found.

Discussion

The proper solution to this problem is more easily understood if we start from this naïve solution:

<html>.*?<head>.*?<title>.*?</title>↵
.*?</head>.*?<body[^>]*>.*?</body>.*?</html>
Regex options: Case insensitive, dot matches line breaks
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

When you test this regex on a proper HTML file, it works perfectly well. .*? skips over anything, because we turn on “dot matches line breaks.” The lazy asterisk makes sure the regex goes ahead only one character at a time, each time checking whether the next tag can be matched. Recipes 2.4 and 2.13 explain all this.

But this regex gets you into trouble when ...

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Start your free trial

You might also like

Mastering Regular Expressions, 3rd Edition

Mastering Regular Expressions, 3rd Edition

Jeffrey E.F. Friedl
Regular Expressions Cookbook

Regular Expressions Cookbook

Jan Goyvaerts, Steven Levithan

Publisher Resources

ISBN: 9781449327453Supplemental ContentErrata Page