Chapter 9. HTML Processing with Trees
Treating HTML as a stream of tokens is an imperfect solution to the problem of extracting information from HTML. In particular, the token model obscures the hierarchical nature of markup. Nested structures such as lists within lists or tables within tables are difficult to process as just tokens. Such structures are best represented as trees, and the HTML::Element class does just this.
This chapter teaches you how to use the HTML::TreeBuilder module to construct trees from HTML, and how to process those trees to extract information. Chapter 10 shows how to modify HTML using trees.
Introduction to Trees
<ul>
  <li>Ice cream.</li>
  <li>Whipped cream.
  <li>Hot apple pie <br>(mmm pie)</li>
</ul>
In the language of trees, each part of the tree (such as br) is a node. There are two kinds of nodes in an HTML tree: text nodes, which are strings with no tags, and elements, which represent not mere strings, but things that can have attributes (such as align=left), and which generally came from an open tag (such as <li>) and were possibly closed by an end-tag (such as </li>).
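To make the two kinds of nodes concrete, here is a minimal sketch that parses the list fragment from this section and labels each node as it walks the tree. It assumes HTML::TreeBuilder's default settings, under which text nodes are plain Perl strings and elements are HTML::Element objects; the describe routine is a hypothetical helper written for this illustration, not part of the module.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# The list fragment from this section.
my $html = <<'END_HTML';
<ul>
  <li>Ice cream.</li>
  <li>Whipped cream.
  <li>Hot apple pie <br>(mmm pie)</li>
</ul>
END_HTML

# Build the tree: feed the parser the HTML, then signal end of input.
my $root = HTML::TreeBuilder->new;
$root->parse($html);
$root->eof;

# Hypothetical helper: walk the tree depth-first and label each node.
sub describe {
    my ($node, $depth) = @_;
    my $indent = '  ' x $depth;
    if (ref $node) {
        # An element: an HTML::Element object with a tag name.
        print $indent, 'element: ', $node->tag, "\n";
        describe($_, $depth + 1) for $node->content_list;
    }
    else {
        # A text node: under the default settings, just a string.
        print $indent, 'text: "', $node, "\"\n";
    }
}

describe($root, 0);

# Free the tree explicitly when finished with it.
$root->delete;
```

Note that the output should start with html, head, and body elements even though the fragment contains none of them, because by default the parser supplies the tags that a complete document implies. The unclosed li in the fragment should likewise come out as an ordinary li element, closed implicitly when the next li begins.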
When several nodes are contained by another, as the li elements are contained by the ul element, the contained ones are called children. Children ...