Recipe 18-1: Parsing HTML

HTML is a very common markup language, but much of the HTML in circulation is poorly written, which makes it difficult to parse reliably. This recipe shows a structure that strips the tags (<a>, <li>, and so on) from the HTML. The downloader.sh script acts on the <a> tags by saving the page at the linked URL to a file named after the anchor text. Input of <a href="http://www.example.com/">This is an example web site</a> will download the index page of www.example.com to a file called “This is an example web site”.

Technologies Used

  • tr
  • ((suffix++))
  • wget

Concepts

The action this recipe takes is not the important part; wget -Fi (--force-html together with --input-file) can do something very similar to what this script achieves. The real subject of the script is stripping tags from the HTML input.

Some HTML terminology is used in this recipe; in the input <a href="/eg.shtml">example pages</a>, /eg.shtml is the link, and example pages is the anchor text. By default, the anchor text is displayed in blue underlined text in the browser, and the link is the address of the page that will be displayed if the anchor text is clicked.
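To make the terminology concrete, the following is a minimal sketch of separating the link from the anchor text using bash parameter expansion. It is not the recipe's downloader.sh; the function name parse_anchor is hypothetical, and the sketch assumes a single, well-formed anchor whose href value is enclosed in double quotes.

```shell
#!/bin/bash
# Hypothetical helper: split one simple anchor element into link and anchor text.
# Assumes exactly the form <a href="LINK">TEXT</a>, with a double-quoted href.
parse_anchor() {
  local html=$1 rest link text
  rest=${html#*href=\"}      # drop everything up to and including href="
  link=${rest%%\"*}          # keep what precedes the closing double quote
  rest=${rest#*>}            # drop the remainder of the opening tag
  text=${rest%%'</a>'*}      # keep what precedes the closing tag
  printf '%s\n%s\n' "$link" "$text"
}

# Prints the link on the first line and the anchor text on the second.
parse_anchor '<a href="/eg.shtml">example pages</a>'
```

Anything outside that narrow form (single-quoted attributes, extra attributes before href, nested tags in the anchor text) would need more careful handling, which is part of why parsing real-world HTML is awkward.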

The recipe uses a very crude state machine to keep track of what position in the HTML input the script has reached. Without this, it would be necessary to make many more assumptions about the format of the input file.
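The idea of a crude state machine can be sketched as follows. This is an illustration under stated assumptions rather than the recipe's own code: the function name strip_tags and the two state names are hypothetical, and the sketch assumes that every < opens a tag and every > closes one, which is not true of comments, scripts, or attribute values containing those characters.

```shell
#!/bin/bash
# Hypothetical two-state machine: "text" (outside a tag) and "tag" (inside < ... >).
# Reads HTML on standard input and writes the text with the tags removed.
strip_tags() {
  local state=text line char out i
  while IFS= read -r line || [ -n "$line" ]; do
    out=
    for (( i = 0; i < ${#line}; i++ )); do
      char=${line:i:1}
      case "$state:$char" in
        'text:<') state=tag ;;     # a tag opens: stop copying
        'tag:>')  state=text ;;    # the tag closes: resume copying
        text:*)   out+=$char ;;    # ordinary text is kept
        tag:*)    ;;               # characters inside a tag are dropped
      esac
    done
    printf '%s\n' "$out"
  done
}

# Example: prints "example pages"
printf '%s\n' '<li><a href="/eg.shtml">example pages</a></li>' | strip_tags
```

Because the state variable survives from one input line to the next, a tag that is split across lines is still removed, without any assumption that each tag sits on a single line.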

Potential Pitfalls

There are a number of pitfalls in processing HTML; there is no single definition of the language, although most HTML today is ...
