Chapter 14. Scanning HTML

Sean M. Burke

Tip

This article turned out to be so popular that I ended up writing a whole book, Perl & LWP (O’Reilly), which goes into great detail about the many ways of pulling data out of markup languages like HTML.

In the previous article, Ken MacFarlane describes how the HTML::Parser module scans HTML source as a stream of start tags, end tags, text, comments, and so on. In another issue of TPJ (and republished in Computer Science & Perl Programming: Best of the Perl Journal), I described tree data structures. Now I’ll tie it together by discussing trees of HTML.

The CPAN module HTML::TreeBuilder takes the tags that HTML::Parser extracts and builds a parse tree: a tree-shaped network of objects representing the structured content of an HTML document. Once the document is parsed as a tree, common tasks such as extracting data from it become quite straightforward.

HTML::Parser, HTML::TreeBuilder, and HTML::Element

You can have HTML::TreeBuilder construct a parse tree from an HTML source file simply by saying:

use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new();
$tree->parse_file('foo.html');

$tree now contains a parse tree built from the HTML in foo.html. The parse tree is represented as a network of objects—$tree is the root, an element with tag name html. Its children typically include head and body elements, and so on. Each element in the tree is an object of the class HTML::Element.
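To make the shape of that tree concrete, here is a minimal sketch that parses a small document from a string rather than from foo.html, then inspects the root and its children. The sample markup and the variable names ($html, $root_tag, @child_tags, $title_text) are assumptions made for illustration. The methods used (parse, eof, tag, content_list, look_down, as_text, delete) are part of HTML::TreeBuilder and HTML::Element as I understand them; check the documentation for your installed version.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup, parsed from a string so the sketch needs no file.
my $html = '<html><head><title>Sample</title></head>'
         . '<body><h1>Hello</h1><p>Some <i>text</i>.</p></body></html>';

my $tree = HTML::TreeBuilder->new();
$tree->parse($html);
$tree->eof();    # tell the parser that the document is complete

# The root object is itself an element; its tag name should be "html".
my $root_tag = $tree->tag;
print "root tag: $root_tag\n";

# content_list returns the child nodes: element objects or plain text strings.
my @child_tags;
for my $child ($tree->content_list) {
    if (ref $child) {
        push @child_tags, $child->tag;
        print "child element: ", $child->tag, "\n";
    }
    else {
        print "child text: $child\n";
    }
}

# look_down searches this element and its descendants for a match.
my $title = $tree->look_down(_tag => 'title');
my $title_text = $title ? $title->as_text : '';
print "title: $title_text\n";

# Release the tree explicitly once it is no longer needed.
$tree->delete;
```

Copying the values into plain variables before calling delete is deliberate: once the tree is deleted, its element objects should no longer be used.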

If you take this source:

<html><head><title>Doc ...
