Shell Scripting: Expert Recipes for Linux, Bash, and More

by Steve Parker
August 2011
Beginner to intermediate
600 pages
14h 29m
English
Wrox

Recipe 18-1: Parsing HTML

HTML is a very common markup language, but a lot of it is poorly written, which makes parsing it quite difficult. This recipe shows a structure that strips the tags (<a>, <li>, and so on) from the HTML. The downloader.sh script acts on the <a> tags by saving the linked URL to a file named after the anchor text. Given the input <a href="http://www.example.com/">This is an example web site</a>, the script downloads the index page of www.example.com to a file called “This is an example web site.”

Technologies Used

  • tr
  • ((suffix++))
  • wget
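
As a rough illustration of how the first two items might be used (a sketch only, not the book's downloader.sh), the helpers below use tr to put each tag on its own line and ((suffix++)) to find an unused file name. The function names split_tags and unique_name, and the base.1, base.2 naming scheme, are assumptions made for this example.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; not taken from downloader.sh.

# Join the input into one line, then start a new line at every '<',
# so that each output line has the form "tag>text that follows".
split_tags() {
  { tr '\n' ' '; echo; } | tr '<' '\n'
}

# Print the first name in the series base, base.1, base.2, ... that does
# not already exist in the current directory.
unique_name() {
  local base=$1 candidate=$1 suffix=1
  while [ -e "$candidate" ]; do
    candidate="$base.$suffix"
    ((suffix++))   # arithmetic increment; no $ or expr needed
  done
  printf '%s\n' "$candidate"
}

printf '<li>one</li>\n<li>two</li>\n' | split_tags
```

Starting suffix at 1 rather than 0 also sidesteps a quirk: ((suffix++)) returns a non-zero exit status when the old value is 0, which matters in scripts that run under set -e.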

Concepts

The action this recipe takes is incidental; wget -F -i, which forces the input file to be read as HTML, can do something very similar to what this script achieves. The real subject of the script is stripping tags from the HTML input.

Some HTML terminology is used in this recipe; in the input <a href="/eg.shtml">example pages</a>, /eg.shtml is the link, and example pages is the anchor text. By default, the browser displays the anchor text in blue and underlined, and the link is the address of the page that will be displayed if the anchor text is clicked.
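
The two terms can be made concrete with a short sketch that pulls the link and the anchor text out of a single <a> element using bash parameter expansion. The function name parse_anchor is hypothetical, and the code assumes a lowercase, double-quoted href with the whole element on one line, which real-world HTML often violates.

```shell
#!/bin/bash
# Hypothetical helper: split one well-formed <a> element into its link
# and its anchor text. Assumes href="..." in lowercase with double quotes.
parse_anchor() {
  local html=$1 link text
  link=${html#*href=\"}      # drop everything up to and including href="
  link=${link%%\"*}          # drop the closing quote and what follows it
  text=${html#*>}            # drop the opening tag
  text=${text%%"</a>"*}      # drop the closing tag and what follows it
  printf '%s\n%s\n' "$link" "$text"
}

# Prints the link on the first line and the anchor text on the second:
#   /eg.shtml
#   example pages
parse_anchor '<a href="/eg.shtml">example pages</a>'
```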

The recipe uses a very crude state machine to keep track of what position in the HTML input the script has reached. Without this, it would be necessary to make many more assumptions about the format of the input file.
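
One possible shape for such a state machine is sketched below; it is not the book's implementation. It tracks just two states, OUTSIDE and IN_ANCHOR, and prints each link with its anchor text instead of downloading anything. The function name list_links and the state names are assumptions made for this example, and uppercase tags or unquoted attributes are deliberately not handled.

```shell
#!/bin/bash
# Crude two-state machine: OUTSIDE any <a> element, or inside one (IN_ANCHOR).
# Hypothetical sketch; prints "link<TAB>anchor text" instead of downloading.
list_links() {
  local state=OUTSIDE link="" text="" tag rest
  # Join lines, then start a new line at every '<' so each line is "tag>text".
  # The echo supplies the final newline that read needs for the last line.
  { tr '\n' ' '; echo; } | tr '<' '\n' |
  while IFS='>' read -r tag rest; do
    case "$state:$tag" in
      OUTSIDE:a\ *href=\"*)
        link=${tag#*href=\"}; link=${link%%\"*}
        text=$rest
        state=IN_ANCHOR ;;
      IN_ANCHOR:/a)
        # A downloader could call something like wget -O "$text" "$link" here.
        printf '%s\t%s\n' "$link" "$text"
        state=OUTSIDE ;;
      IN_ANCHOR:*)
        # Nested tag such as <b>: keep only the text that follows it.
        text+=$rest ;;
    esac
  done
}

# The anchor text here spans two input lines and contains a nested tag.
printf '%s\n' '<p>See <a href="/eg.shtml">example' \
  '<b>pages</b></a> for details.</p>' | list_links
```

Because the state records whether an <a> element is currently open, the input does not have to keep each link on a single line, and tags nested inside the anchor text are dropped rather than ending up in the file name. Relative links such as /eg.shtml would still need a base URL before they could be fetched.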

Potential Pitfalls

There are a number of pitfalls in processing HTML; there is no single definition of the language, although most HTML today is ...


ISBN: 9781118166321