Chapter 14. Scanning HTML
Tip
This article turned out to be so popular that I ended up writing a whole book, Perl & LWP (O’Reilly), which goes into great detail about the many ways of pulling data out of markup languages like HTML.
In the previous article, Ken MacFarlane describes how the HTML::Parser module scans HTML source as a stream of start tags, end tags, text, comments, and so on. In another issue of TPJ (and republished in Computer Science & Perl Programming: Best of the Perl Journal), I described tree data structures. Now I’ll tie it together by discussing trees of HTML.
The CPAN module HTML::TreeBuilder takes the tags that HTML::Parser extracts, and builds a parse tree—a tree-shaped network of objects representing the structured content of an HTML document. Once the document is parsed as a tree, you’ll find the common tasks of extracting data from that HTML document/tree to be quite straightforward.
HTML::Parser, HTML::TreeBuilder, and HTML::Element
HTML::TreeBuilder can construct a parse tree out of an HTML source file simply by saying:
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new();
$tree->parse_file('foo.html');
$tree now contains a parse tree built from the HTML in foo.html. The parse tree is represented as a network of objects: $tree is the root, an element with tag name html. Its children typically include head and body elements, and so on. Each element in the tree is an object of the class HTML::Element.
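To make the shape of that tree concrete, here is a minimal sketch that inspects the root and its immediate children. The sample markup and the variable names are made up for illustration, and the document is parsed from a string rather than from foo.html so the snippet needs no file on disk.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup, parsed from a string so no file is needed.
my $html = '<html><head><title>Demo</title></head>'
         . '<body><p>Hello</p></body></html>';

my $tree = HTML::TreeBuilder->new();
$tree->parse($html);
$tree->eof();    # signal end of input so the tree is finalized

# The root is an element whose tag name is "html".
my $root_tag = $tree->tag();

# Children may be elements (objects) or plain text (strings);
# keep only the elements and collect their tag names.
my @child_tags = map { $_->tag() } grep { ref $_ } $tree->content_list();

# The root object is itself an HTML::Element.
my $is_element = $tree->isa('HTML::Element') ? 1 : 0;

print "root: $root_tag\n";
print "children: @child_tags\n";

$tree->delete();    # free the tree explicitly when done
```

For this input the root should report html, with head and body as its element children. The grep on ref is there because a content list can mix element objects with plain text strings, and only the objects respond to tag().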
If you take this source:
<html><head><title>Doc ...