Chapter 10. Modifying HTML with Trees
In Chapter 9, we saw how to extract information from HTML trees. But that’s not the only thing you can use trees for. HTML::TreeBuilder trees can be altered and can even be written back out as HTML, using the as_HTML( ) method.
There are four ways in which a tree can be altered: you can alter a node’s attributes; you can delete a node; you can detach a node and reattach it elsewhere; and you can add a new node. We’ll treat each of these in turn.
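As a quick preview, the four kinds of alteration can be sketched in one short program. This is only an illustrative sketch, not code from the chapter: the sample markup, the id values, and the variable names are invented for the example, and it assumes HTML::TreeBuilder is installed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Element;

# Hypothetical sample markup, invented for this sketch.
my $source = '<div id="a"><p class="old">One</p><p>Two</p></div>'
           . '<div id="b"><hr></div>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($source);
$tree->eof;

my $div_a = $tree->look_down(_tag => 'div', id => 'a');
my $div_b = $tree->look_down(_tag => 'div', id => 'b');
my ($first_p, $second_p) = $div_a->look_down(_tag => 'p');

# 1. Alter a node's attributes.
$first_p->attr('class', 'new');

# 2. Delete a node (the <hr> inside the second div).
my $hr = $div_b->look_down(_tag => 'hr');
$hr->delete;

# 3. Detach a node and reattach it elsewhere.
$second_p->detach;
$div_b->push_content($second_p);

# 4. Add a new node.
my $em = HTML::Element->new('em');
$em->push_content('Three');
$div_a->push_content($em);

# Write the altered tree back out as HTML.
my $altered = $tree->as_HTML;
print $altered;

$tree->delete;    # free the tree's memory when done
```

Note that as_HTML( ) emits a whole document, so the output includes the html, head, and body elements that the parser supplied implicitly.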
Changing Attributes
Suppose that in your new role as fixer of large sets of HTML documents, you are given a bunch of documents that have headings like this:
<h3 align=center>Free Monkey</h3>
<h3 color=red>Inquire Within</h3>
that need to be changed like this:
<h2 class=scream>Free Monkey</h2>
<h4 class=mutter>Inquire Within</h4>
Before you start phrasing this in terms of HTML::Element methods, you should consider whether this can be done with a search-and-replace operation in an editor. In this case, it cannot, because you’re not just changing every <h3 align=center> to <h2 class=scream> and every <h3 color=red> to <h4 class=mutter> (which are apparently simple search-and-replace operations); you also have to change </h3> to </h2> or to </h4>, depending on what you did to the element that it closes. That sort of context dependency puts this well outside the realm of simple search-and-replace operations. One could try to implement this with HTML::TokeParser, reading every token and printing it back out, after ...
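By way of comparison, here is a minimal sketch of how the heading change might look when expressed with tree methods. It is one possible approach rather than the chapter's own listing: it assumes the input is the two-heading sample above held in a string, and the variable names are invented for the example. Because each heading is a single node in the tree, changing its tag name takes care of the start-tag and the end-tag together.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# The sample headings from the text, held in a string for this sketch.
my $html = '<h3 align=center>Free Monkey</h3> '
         . '<h3 color=red>Inquire Within</h3>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# Collect every h3 first, then decide what each one should become.
foreach my $heading ($tree->look_down(_tag => 'h3')) {
    my $align = $heading->attr('align');
    my $color = $heading->attr('color');

    if (defined $align and lc($align) eq 'center') {
        $heading->tag('h2');               # renames start- and end-tag
        $heading->attr('align', undef);    # undef removes the attribute
        $heading->attr('class', 'scream');
    }
    elsif (defined $color and lc($color) eq 'red') {
        $heading->tag('h4');
        $heading->attr('color', undef);
        $heading->attr('class', 'mutter');
    }
    # Any other h3 is left untouched.
}

my $fixed = $tree->as_HTML;
print $fixed;

$tree->delete;    # free the tree's memory when done
```

The output is wrapped in the html, head, and body elements that the parser adds, and as_HTML( ) quotes the attribute values, so the result is equivalent to the target markup but not byte-for-byte identical to it.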