Perl in a Nutshell, 2nd Edition

by Nathan Patwardhan, Ellen Siever, Stephen Spainhour
June 2002
759 pages
English
O'Reilly Media, Inc.

HTML::TokeParser

As we said, you should use a subclassed HTML parser if you want a more convenient interface to HTML parsing than HTML::Parser gives you. HTML::TokeParser by Gisle Aas is one such example. Strictly speaking, HTML::TokeParser is a subclass of HTML::PullParser, which is in turn built on HTML::Parser. It can help you do many useful things, such as link extraction and HTML checking.

In short, HTML::TokeParser breaks an HTML document into tokens, attributes, and content. For example, the HTML <a href="http://url">link</a> would break down as:

token: a
    attrib: href
    content: http://url
content: link
token: /a
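
To make the breakdown above concrete, here is a minimal sketch that walks the same fragment with get_token(  ) and prints one line per token. The helper name describe_tokens is hypothetical and exists only for this illustration. The sketch assumes the token layout described in the HTML::TokeParser documentation ("S" for start tags, "E" for end tags, "T" for text) and ignores comments, declarations, and processing instructions. Its output should be close to the listing above, with the attribute value printed beside the attribute name.

```perl
#!/usr/local/bin/perl -w
use strict;
use HTML::TokeParser;

# describe_tokens is a hypothetical helper, not part of HTML::TokeParser.
# It returns one descriptive line per start tag, attribute, text run,
# and end tag found in the HTML string it is given.
sub describe_tokens {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @lines;
    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        if ($type eq "S") {
            # Start tag: ["S", $tag, \%attr, \@attrseq, $text]
            my ($tag, $attr, $attrseq) = @{$token}[1, 2, 3];
            push @lines, "token: $tag";
            push @lines, "    attrib: $_ = $attr->{$_}" for @$attrseq;
        }
        elsif ($type eq "E") {
            # End tag: ["E", $tag, $text]
            push @lines, "token: /$token->[1]";
        }
        elsif ($type eq "T") {
            # Text: ["T", $text, $is_data]
            push @lines, "content: $token->[1]";
        }
        # Other token types (comments, declarations, etc.) are skipped.
    }
    return @lines;
}

print "$_\n" for describe_tokens('<a href="http://url">link</a>');
```

Returning a list rather than printing inside the loop keeps the helper easy to reuse, for instance to compare two documents token by token.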

For example, you can use HTML::TokeParser to extract links from a string that contains HTML:

#!/usr/local/bin/perl -w

use strict;
use HTML::TokeParser;

# Our string that turns out to be HTML!
my $html = '<p>Some text. <a href="http://blah">My name is Nate!</a></p>';
my $parser = HTML::TokeParser->new(\$html);

# get_tag(  ) tells TokeParser to match a tag by name
while (my $token = $parser->get_tag("a")) {
    my $url = $token->[1]{href} || "-";
    my $text = $parser->get_trimmed_text("/a");
    print "URL is: $url.\nURL text is: $text.\n";
}
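
The same get_tag(  ) loop can be adapted to the HTML checking mentioned earlier. The following is only a sketch of one possible check: it reports every img tag that has no alt attribute, or an empty one. The helper name images_missing_alt and the sample HTML are assumptions made for this illustration and are not part of the module.

```perl
#!/usr/local/bin/perl -w
use strict;
use HTML::TokeParser;

# images_missing_alt is a hypothetical helper, not part of HTML::TokeParser.
# It returns the src value of every <img> tag whose alt attribute is
# missing or empty.
sub images_missing_alt {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @missing;
    # For a start tag, get_tag(  ) returns [$tag, \%attr, \@attrseq, $text]
    while (my $token = $parser->get_tag("img")) {
        my $attr = $token->[1];
        next if defined $attr->{alt} && length $attr->{alt};
        push @missing, defined $attr->{src} ? $attr->{src} : "(no src)";
    }
    return @missing;
}

# Sample input, made up for this example
my $html = '<p><img src="a.png" alt="Logo"> <img src="b.png"></p>';
print "Missing alt text: $_\n" for images_missing_alt($html);
```

Because get_tag(  ) skips everything except the named tag, a check like this never has to look at text or unrelated markup.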

Publisher Resources

ISBN: 0596002416