Shell Scripting: Expert Recipes for Linux, Bash, and More

by Steve Parker

August 2011

Beginner to intermediate

600 pages

14h 29m

English

Wrox

Content preview from Shell Scripting: Expert Recipes for Linux, Bash, and More

Recipe 18-1: Parsing HTML

HTML is a very common markup language, but a great deal of it is poorly written, which makes parsing such files difficult. This recipe shows a structure that strips the tags (<a>, <li>, and so on) from the HTML. The downloader.sh script acts on the <a> tags by downloading the linked URL to a file named after the anchor text. For example, the input <a href="http://www.example.com/">This is an example web site</a> downloads the index page of www.example.com to a file called “This is an example web site”.

Technologies Used

tr
((suffix++))
wget
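
As a rough illustration of the first two items, the sketch below uses tr to fold line breaks into spaces and ((suffix++)) to find an unused filename. It is not the book's downloader.sh; the function names fold_whitespace and unique_name are hypothetical, and the ".N" naming scheme is an assumption made for this example.

```shell
#!/bin/bash

# Hypothetical helper: turn newlines and tabs into spaces so that a tag
# split across several lines can be handled as a single line.
fold_whitespace() {
  tr '\n\t' '  '
}

# Hypothetical helper: print the first name that does not exist yet,
# trying base, base.1, base.2 and so on (an assumed naming scheme).
unique_name() {
  local base=$1 candidate=$1 suffix=1
  while [ -e "$candidate" ]; do
    candidate="${base}.${suffix}"
    ((suffix++))          # arithmetic increment, as listed above
  done
  printf '%s\n' "$candidate"
}
```

A name produced this way could be passed to wget -O so that two links with the same anchor text do not overwrite each other.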

Concepts

The action taken by this recipe is not the point; wget -F -i file (that is, --force-html together with --input-file) can do something very similar to what this script achieves. The script is really about stripping tags from the HTML input.

Some HTML terminology is used in this recipe; in the input <a href="/eg.shtml">example pages</a>, /eg.shtml is the link, and example pages is the anchor text. By default, the anchor text is displayed in blue underlined text in the browser, and the link is the address of the page that will be displayed if the anchor text is clicked.
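
To make the two terms concrete, here is a minimal sketch that pulls the link and the anchor text out of a single <a> element using bash parameter expansion. It assumes a well-formed element with a double-quoted href attribute, which real-world HTML does not guarantee; get_link and get_anchor_text are hypothetical names, not functions from the book's script.

```shell
#!/bin/bash

# Hypothetical helper: print the link (the href value) of one <a> element.
get_link() {
  local tag=$1
  tag=${tag#*href=\"}     # drop everything up to and including href="
  tag=${tag%%\"*}         # drop the closing quote and everything after it
  printf '%s\n' "$tag"
}

# Hypothetical helper: print the anchor text of one <a> element.
get_anchor_text() {
  local tag=$1
  tag=${tag#*>}           # drop the opening <a ...> tag
  tag=${tag%%<*}          # drop the closing </a> and anything after it
  printf '%s\n' "$tag"
}

element='<a href="/eg.shtml">example pages</a>'
get_link "$element"          # prints /eg.shtml
get_anchor_text "$element"   # prints example pages
```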

The recipe uses a very crude state machine to keep track of what position in the HTML input the script has reached. Without this, it would be necessary to make many more assumptions about the format of the input file.
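
The idea of a crude state machine can be sketched in a few lines. The version below is not the recipe's own code: it is a two-state illustration (inside a tag or outside one) that walks the input one character at a time and keeps only the text found outside tags. The function name strip_tags and the state names text and tag are assumptions made for this example.

```shell
#!/bin/bash

# Hypothetical two-state tag stripper: "text" means outside a tag,
# "tag" means between a '<' and the next '>'.
strip_tags() {
  local input=$1 state=text out="" char i
  for ((i = 0; i < ${#input}; i++)); do
    char=${input:i:1}
    case "$state:$char" in
      text:'<') state=tag ;;     # a tag starts: stop copying
      tag:'>')  state=text ;;    # the tag ends: resume copying
      text:*)   out+=$char ;;    # ordinary text is kept
      tag:*)    ;;               # characters inside a tag are dropped
    esac
  done
  printf '%s\n' "$out"
}

strip_tags '<li><a href="/eg.shtml">example pages</a></li>'   # prints example pages
```

Because the state is carried from one character to the next, the sketch does not care whether a tag shares a line with its text. It is still crude: a > inside a quoted attribute value or an HTML comment would end the tag early.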

Potential Pitfalls

There are a number of pitfalls in processing HTML; there is no single definition of the language, although most HTML today is ...

Publisher Resources

ISBN: 9781118166321
