Chapter 7. HTML Processing with Tokens

Regular expressions are powerful, but they’re a painfully low-level way of dealing with HTML. You’re forced to worry about spaces and newlines, single and double quotes, HTML comments, and a lot more. The next step up from a regular expression is an HTML tokenizer. In this chapter, we’ll use HTML::TokeParser to extract information from HTML files. Using these techniques, you can extract information from any HTML file, and never again have to worry about character-level trivia of HTML markup.

HTML as Tokens

Your experience with HTML code probably involves seeing raw text such as this:

<p>Dear Diary,
<br>I'm gonna be a superstar, because I'm learning to play
the <a href="http://MyBalalaika.com">balalaika</a> &amp; the <a
href='http://MyBazouki.com'>bazouki</a>!!!

The HTML::TokeParser module divides the HTML into units called tokens, the basic units of parsing. The source above is parsed as this series of tokens:

start-tag token: p with no attributes
text token: Dear Diary,\n
start-tag token: br with no attributes
text token: I'm gonna be a superstar, because I'm learning to play\nthe
start-tag token: a, with attribute href whose value is http://MyBalalaika.com
text token: balalaika
end-tag token: a
text token: &amp; the , which means & the
start-tag token: a, with attribute href whose value is http://MyBazouki.com
text token: bazouki
end-tag token: a
text token: !!!\n

This representation of things is more abstract, focusing on markup concepts and not individual characters. So whereas ...
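
The token series listed above can be reproduced with a short script. What follows is only a minimal sketch, assuming the HTML::TokeParser module (part of the HTML-Parser distribution on CPAN) is installed. The helper name describe_tokens and the wording of its output lines are hypothetical, chosen here to resemble the list above.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# The diary snippet from the text, held in a string.
my $html = <<'END_HTML';
<p>Dear Diary,
<br>I'm gonna be a superstar, because I'm learning to play
the <a href="http://MyBalalaika.com">balalaika</a> &amp; the <a
href='http://MyBazouki.com'>bazouki</a>!!!
END_HTML

# Hypothetical helper: returns one description line per token.
sub describe_tokens {
    my ($source) = @_;

    # Passing a reference to a string parses that string, not a file.
    my $stream = HTML::TokeParser->new(\$source)
        or die "Couldn't create a token stream";

    my @lines;
    while (my $token = $stream->get_token) {
        my $type = $token->[0];
        if ($type eq 'S') {
            # Start-tag: ['S', tag, attribute hash, attribute order, source text]
            my ($tag, $attr, $attrseq) = @{$token}[1, 2, 3];
            my $attrs = join ', ', map { "$_=$attr->{$_}" } @$attrseq;
            push @lines, "start-tag token: $tag"
                . (length($attrs) ? " ($attrs)" : " with no attributes");
        }
        elsif ($type eq 'E') {
            # End-tag: ['E', tag, source text]
            push @lines, "end-tag token: $token->[1]";
        }
        elsif ($type eq 'T') {
            # Text: ['T', text, ...]; show newlines as a visible \n
            (my $text = $token->[1]) =~ s/\n/\\n/g;
            push @lines, "text token: $text";
        }
        # Comments, declarations, and processing instructions are skipped here.
    }
    return @lines;
}

print "$_\n" for describe_tokens($html);
```

Note that the text returned by get_token is the raw source text, so the entity in the diary snippet should appear as &amp; and not as a decoded ampersand, which matches the "which means" remark in the list above.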

Perl & LWP by Sean M. Burke
