Chapter 7. HTML Processing with Tokens
Regular expressions are powerful, but they’re a painfully low-level way of dealing with HTML. You’re forced to worry about spaces and newlines, single and double quotes, HTML comments, and a lot more. The next step up from a regular expression is an HTML tokenizer. In this chapter, we’ll use HTML::TokeParser to extract information from HTML files. Using these techniques, you can extract information from any HTML file, and never again have to worry about character-level trivia of HTML markup.
HTML as Tokens
Your experience with HTML code probably involves seeing raw text such as this:
<p>Dear Diary,
<br>I'm gonna be a superstar, because I'm learning to play
the <a href="http://MyBalalaika.com">balalaika</a> &amp; the <a href='http://MyBazouki.com'>bazouki</a>!!!
The HTML::TokeParser module divides the HTML into units called tokens, the basic units of parsing. The source above is parsed as this series of tokens:
- start-tag token p, with no attributes
- text token Dear Diary,\n
- start-tag token br, with no attributes
- text token I'm gonna be a superstar, because I'm learning to play\nthe
- start-tag token a, with attribute href whose value is http://MyBalalaika.com
- text token balalaika
- end-tag token a
- text token &amp; the, which means & the
- start-tag token a, with attribute href whose value is http://MyBazouki.com
- text token bazouki
- end-tag token a
- text token !!!\n
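The token list above can be reproduced with a short script. The following is a minimal sketch, not code from this chapter: it assumes HTML::TokeParser (part of the HTML-Parser distribution on CPAN) is installed and that the sample markup is held in a string. The helper name describe_tokens and the format of the lines it returns are hypothetical, chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;   # from the HTML-Parser distribution on CPAN

# The sample markup from the text, with the line breaks the token list implies.
my $html = <<'END_HTML';
<p>Dear Diary,
<br>I'm gonna be a superstar, because I'm learning to play
the <a href="http://MyBalalaika.com">balalaika</a> &amp; the <a href='http://MyBazouki.com'>bazouki</a>!!!
END_HTML

# describe_tokens() is a hypothetical helper for this sketch: it returns
# one human-readable line per token found in the given HTML string.
sub describe_tokens {
    my ($source) = @_;

    # A scalar reference makes the constructor parse the string itself
    # instead of treating the argument as a filename.
    my $stream = HTML::TokeParser->new(\$source)
        or die "Couldn't create a token stream";

    my @lines;
    while (my $token = $stream->get_token) {
        my $type = $token->[0];
        if ($type eq 'S') {
            # Start-tag token: ['S', $tag, \%attr, \@attrseq, $text]
            my ($tag, $attr, $attrseq) = @{$token}[1, 2, 3];
            my $attrs = join ', ', map { $_ . '=' . $attr->{$_} } @$attrseq;
            push @lines, "start-tag $tag" . (length $attrs ? " [$attrs]" : '');
        }
        elsif ($type eq 'E') {
            # End-tag token: ['E', $tag, $text]
            push @lines, "end-tag $token->[1]";
        }
        elsif ($type eq 'T') {
            # Text token: ['T', $text, $is_data]; show newlines as \n
            (my $text = $token->[1]) =~ s/\n/\\n/g;
            push @lines, "text \"$text\"";
        }
        else {
            # Comments, declarations, and processing instructions
            push @lines, "other ($type)";
        }
    }
    return @lines;
}

print "$_\n" for describe_tokens($html);
```

Text tokens returned by get_token carry the source text as written, so the entity appears as &amp; in the output. Passing the text through decode_entities from HTML::Entities turns it into a plain & character.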
This representation of things is more abstract, focusing on markup concepts and not individual characters. So whereas ...