Chapter 9. HTML Processing with Trees
Treating HTML as a stream of tokens is an imperfect solution to the problem of extracting information from HTML. In particular, the token model obscures the hierarchical nature of markup. Nested structures such as lists within lists or tables within tables are difficult to process as just tokens. Such structures are best represented as trees, and the HTML::Element class does just this.
This chapter teaches you how to use the HTML::TreeBuilder module to construct trees from HTML, and how to process those trees to extract information. Chapter 10 shows how to modify HTML using trees.
Introduction to Trees
The HTML in Example 9-1 can be represented by the tree in Figure 9-1.
<ul>
  <li>Ice cream.</li>
  <li>Whipped cream.
  <li>Hot apple pie <br>(mmm pie)</li>
</ul>
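To see this tree's shape for yourself, you can parse the markup and print an outline of the result. The following is a minimal sketch, assuming the HTML::TreeBuilder module from CPAN is installed; the helper name build_tree is made up for illustration and is not part of the module.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # from CPAN; not part of core Perl

# The markup from Example 9-1. Note that the second <li> has no end-tag.
my $html = <<'END_HTML';
<ul>
  <li>Ice cream.</li>
  <li>Whipped cream.
  <li>Hot apple pie <br>(mmm pie)</li>
</ul>
END_HTML

# build_tree is a hypothetical helper: parse a string of HTML and
# return the root element of the resulting tree.
sub build_tree {
    my ($source) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($source);
    $tree->eof;               # tell the parser there is no more input
    return $tree;
}

my $tree = build_tree($html);
$tree->dump;                  # print an indented outline of the tree to STDOUT
$tree->delete;                # release the tree when finished with it
```

The outline should show html, head, and body elements that do not appear in the source, because the parser supplies them when the markup leaves them out.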
In the language of trees, each part of the tree (such as html, li, Ice cream., and br) is a node. There are two kinds of nodes in an HTML tree: text nodes, which are strings with no tags, and elements, which represent not mere strings but things that can have attributes (such as align=left). An element generally comes from a start-tag (such as <li>) and may have been closed by an end-tag (such as </li>).
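In code, the two kinds of nodes are easy to tell apart: in an HTML::Element tree, elements are objects, while text nodes are ordinary Perl strings, so ref returns a true value only for elements. The sketch below assumes HTML::TreeBuilder is installed from CPAN; describe_nodes is a hypothetical helper written for this illustration, and the align attribute is added to the sample markup only to have something to display.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # from CPAN; not part of core Perl

# describe_nodes is a hypothetical helper: it walks the tree below $node
# and appends one indented description line per node to the array ref $out.
sub describe_nodes {
    my ($node, $depth, $out) = @_;
    my $indent = '  ' x $depth;
    if (ref $node) {
        # An element: an object with a tag name and, possibly, attributes.
        my $line  = $indent . 'element: ' . $node->tag;
        my $align = $node->attr('align');
        $line .= " (align=$align)" if defined $align;
        push @$out, $line;
        # Recurse into whatever the element contains.
        describe_nodes($_, $depth + 1, $out) for $node->content_list;
    }
    else {
        # A text node: just a string, with no tag and no attributes.
        push @$out, $indent . qq{text: "$node"};
    }
    return $out;
}

my $tree = HTML::TreeBuilder->new;
$tree->parse('<ul><li align="left">Ice cream.</li>'
           . '<li>Hot apple pie <br>(mmm pie)</li></ul>');
$tree->eof;

print "$_\n" for @{ describe_nodes($tree, 0, []) };
$tree->delete;
```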
When several nodes are contained by another, as the li elements are contained by the ul element, the contained ones are called children. Children ...
