Chapter 9. HTML Processing with Trees

Treating HTML as a stream of tokens is an imperfect solution to the problem of extracting information from HTML. In particular, the token model obscures the hierarchical nature of markup. Nested structures such as lists within lists or tables within tables are difficult to process as just tokens. Such structures are best represented as trees, and the HTML::Element class does just this.

This chapter teaches you how to use the HTML::TreeBuilder module to construct trees from HTML, and how to process those trees to extract information. Chapter 10 shows how to modify HTML using trees.

Introduction to Trees

The HTML in Example 9-1 can be represented by the tree in Figure 9-1.

Example 9-1. Simple HTML
<ul>
  <li>Ice cream.</li>
  <li>Whipped cream.
  <li>Hot apple pie <br>(mmm pie)</li>
</ul>
Figure 9-1. HTML tree
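
As a minimal sketch of how such a tree might be built, the following parses the HTML from Example 9-1. It assumes the HTML::TreeBuilder module is installed from CPAN; the variable names ($html, $tree, @items, $item_count) are illustrative only and do not come from the text.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# The HTML from Example 9-1; the second <li> deliberately has no end-tag.
my $html = <<'END_HTML';
<ul>
  <li>Ice cream.</li>
  <li>Whipped cream.
  <li>Hot apple pie <br>(mmm pie)</li>
</ul>
END_HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;    # signal end of input so the parser can finish the tree

# Print an indented outline of the whole tree to standard output.
$tree->dump;

# Collect every li element found anywhere beneath the root.
my @items      = $tree->look_down(_tag => 'li');
my $item_count = scalar @items;
print "Found $item_count list items\n";

# Release the tree once it is no longer needed.
$tree->delete;
```

The outline printed by dump should include html and body elements that do not appear in the source, because the parser supplies them implicitly. The unclosed second li still ends up as a sibling of the other two rather than as their parent.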

In the language of trees, each part of the tree (such as html, li, Ice cream., and br) is a node. There are two kinds of nodes in an HTML tree: text nodes, which are strings with no tags, and elements, which represent not mere strings but things that can have attributes (such as align=left). An element generally comes from a start-tag (such as <li>) and may have been closed by an end-tag (such as </li>).
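
The distinction between the two kinds of nodes can be sketched in code. In a tree built by HTML::TreeBuilder, elements are HTML::Element objects, while text nodes are ordinary Perl strings, so ref() is one way to tell them apart. The helper describe_node below is hypothetical, written only for this illustration, and the example again assumes HTML::TreeBuilder is installed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse('<ul><li>Ice cream.</li><li>Hot apple pie <br>(mmm pie)</li></ul>');
$tree->eof;

# Hypothetical helper: returns one description line per node,
# indented according to the node's depth in the tree.
sub describe_node {
    my ($node, $depth) = @_;
    my $indent = '  ' x $depth;

    if (ref $node) {
        # An element: report its tag name, then recurse into its children.
        my @lines = ($indent . 'element: ' . $node->tag);
        push @lines, describe_node($_, $depth + 1) for $node->content_list;
        return @lines;
    }

    # A text node: just a plain string with no tags.
    return ($indent . qq{text: "$node"});
}

my @lines = describe_node($tree, 0);
print "$_\n" for @lines;

$tree->delete;
```

Each line of output is labeled either element or text, which mirrors the two kinds of nodes described above. Note that br is an element even though it has no content and no end-tag.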

When several nodes are contained by another, as the li elements are contained by the ul element, the contained ones are called children. Children ...
