June 2002
As we said, you should use a subclassed HTML parser if you want a better interface to HTML parsing features than what HTML::Parser gives you. HTML::TokeParser by Gisle Aas is one such example. Strictly speaking, HTML::TokeParser is a subclass of HTML::PullParser, which is itself a subclass of HTML::Parser. It can help you do many useful things, such as link extraction and HTML checking.
In short, HTML::TokeParser breaks an HTML document into tokens, attributes, and content. For instance, the HTML <a href="http://url">link</a> would break down as:
token: a
attrib: href
content: http://url
content: link
token: /a
For example, you can use HTML::TokeParser to extract links from a string that contains HTML:
#!/usr/local/bin/perl -w
use strict;
require HTML::TokeParser;

# Our string that turns out to be HTML!
my $html = '<p>Some text. <a href="http://blah">My name is Nate!</a></p>';
my $parser = HTML::TokeParser->new(\$html);

# get_tag( ) tells TokeParser to match a tag by name
while (my $token = $parser->get_tag("a")) {
    # $token->[1] is a hash reference holding the tag's attributes
    my $url = $token->[1]{href} || "-";
    # Collect the text up to the closing </a>, with whitespace trimmed
    my $text = $parser->get_trimmed_text("/a");
    print "URL is: $url.\nURL text is: $text.\n";
}
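The token breakdown described earlier can also be inspected directly, one token at a time. The following is a minimal sketch, assuming HTML::TokeParser is installed. It uses get_token( ), which returns an array reference whose first element identifies the token type: "S" for a start tag, "E" for an end tag, and "T" for text. The describe_tokens( ) helper and its output labels are hypothetical names chosen for illustration. They are not part of the module.

```perl
#!/usr/local/bin/perl -w
use strict;
use HTML::TokeParser;

# describe_tokens( ) is a hypothetical helper, not part of HTML::TokeParser.
# It returns one descriptive line per start tag, attribute, text run, and end tag.
sub describe_tokens {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @lines;
    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        if ($type eq 'S') {
            # Start tag: ["S", $tag, \%attr, \@attrseq, $text]
            my ($tag, $attr, $attrseq) = @{$token}[1 .. 3];
            push @lines, "start tag: $tag";
            push @lines, "attribute: $_ = $attr->{$_}" for @$attrseq;
        }
        elsif ($type eq 'E') {
            # End tag: ["E", $tag, $text]
            push @lines, "end tag: $token->[1]";
        }
        elsif ($type eq 'T') {
            # Text: ["T", $text, $is_data]
            push @lines, "text: $token->[1]";
        }
        # Comments, declarations, and processing instructions are skipped here.
    }
    return @lines;
}

print "$_\n" for describe_tokens('<a href="http://url">link</a>');
```

For the one-link example, this lists the start tag a, its href attribute with the value http://url, the text link, and the end tag a, in that order. get_tag( ) and get_trimmed_text( ) are usually more convenient for extraction, while get_token( ) is the better fit when every token has to be examined, as in HTML checking.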