Chapter 7. HTML Processing with Tokens
Regular expressions are powerful, but they’re a painfully low-level way of dealing with HTML. You’re forced to worry about spaces and newlines, single and double quotes, HTML comments, and a lot more. The next step up from a regular expression is an HTML tokenizer. In this chapter, we’ll use HTML::TokeParser to extract information from HTML files. Using these techniques, you can extract information from any HTML file, and never again have to worry about character-level trivia of HTML markup.
HTML as Tokens
Your experience with HTML code probably involves seeing raw text such as this:
<p>Dear Diary,
<br>I'm gonna be a superstar, because I'm learning to play
the <a href="http://MyBalalaika.com">balalaika</a> &amp; the <a href='http://MyBazouki.com'>bazouki</a>!!!
The HTML::TokeParser module divides the HTML into units called tokens, the basic units of parsing. The source code above is parsed as this series of tokens:
- start-tag token: p, with no attributes
- text token: Dear Diary,\n
- start-tag token: br, with no attributes
- text token: I'm gonna be a superstar, because I'm learning to play\nthe
- start-tag token: a, with attribute href whose value is http://MyBalalaika.com
- text token: balalaika
- end-tag token: a
- text token: &amp; the, which means & the
- start-tag token: a, with attribute href equals http://MyBazouki.com
- text token: bazouki
- end-tag token: a
- text token: !!!\n
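The token list above can be reproduced with a short script. What follows is a minimal sketch, not the chapter's own code: it assumes the HTML::TokeParser and HTML::Entities modules are installed from CPAN, and the helper name describe_tokens is hypothetical, introduced only for illustration. It walks the token stream with get_token and reports each token's type. Text tokens arrive with entities still encoded, so the sketch decodes them before printing.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;                      # CPAN, part of the HTML-Parser distribution
use HTML::Entities qw(decode_entities);

# Hypothetical helper (not from the chapter): returns one description
# string for each token found in the HTML source held in $html.
sub describe_tokens {
    my ($html) = @_;
    my $stream = HTML::TokeParser->new(\$html)
        or die "Couldn't create a token stream";
    my @lines;
    while (my $token = $stream->get_token) {
        my $type = $token->[0];
        if ($type eq 'S') {        # start-tag: ['S', $tag, \%attr, \@attrseq, $text]
            my ($tag, $attr, $attrseq) = @$token[1 .. 3];
            my $attrs = join ', ', map { "$_=$attr->{$_}" } @$attrseq;
            push @lines, "start-tag token: $tag"
                . (length $attrs ? " ($attrs)" : " with no attributes");
        }
        elsif ($type eq 'E') {     # end-tag: ['E', $tag, $text]
            push @lines, "end-tag token: $token->[1]";
        }
        elsif ($type eq 'T') {     # text: ['T', $text, $is_data]
            my $text = decode_entities($token->[1]);   # &amp; becomes &
            $text =~ s/\n/\\n/g;                       # make newlines visible
            push @lines, "text token: $text";
        }
        else {                     # comments, declarations, processing instructions
            push @lines, "other token of type $type";
        }
    }
    return @lines;
}

# The sample source from the text, in a non-interpolating here-document.
my $html = <<'END_HTML';
<p>Dear Diary,
<br>I'm gonna be a superstar, because I'm learning to play
the <a href="http://MyBalalaika.com">balalaika</a> &amp; the <a href='http://MyBazouki.com'>bazouki</a>!!!
END_HTML

print "$_\n" for describe_tokens($html);
```

Run against the sample, the script should report twelve tokens in the same order as the list above. The single- and double-quoted href values come out in the same form, because the tokenizer has already handled the quoting.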
This representation of things is more abstract, focusing on markup concepts and not individual characters. So whereas ...