A Proper Link Checker

I’ve done a lot of badmouthing of the link-checking scripts we’ve developed so far in this chapter, so let’s move on now and develop a “proper” link checker. This final version of our link-checking script gives up on the whole problematic notion of trying to extract SRC and HREF attributes from our HTML pages using simple regex patterns. It also throws out the practice of trying to use -e tests of the local filesystem to identify the presence or absence of local images and HTML files. Instead, it crawls through our site like a search engine’s spider program, testing each link with a web request issued via LWP.
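
To make that idea concrete before the full listing, here is a minimal sketch of the two core steps: pulling link attributes out of a page with a real HTML parser, and testing a URL with a web request. It is not the script from Example 11-4. The helper names extract_links and link_ok, the example.com URLs, and the policy of falling back from HEAD to GET are all assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;
use LWP::UserAgent;
use URI;

# Hypothetical helper: return absolute URLs for every HREF and SRC
# attribute found in $html, resolved against the page's own URL ($base).
sub extract_links {
    my ($html, $base) = @_;
    my @links;

    # HTML::LinkExtor calls this once per link-bearing tag, passing the
    # tag name followed by attribute-name/value pairs.
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        for my $name (qw(href src)) {
            push @links, URI->new_abs($attr{$name}, $base)->as_string
                if defined $attr{$name};
        }
    });
    $parser->parse($html);
    $parser->eof;
    return @links;
}

# Hypothetical helper: true if the URL answers successfully. Trying HEAD
# first and retrying with GET is an assumption, since some servers
# refuse HEAD requests.
sub link_ok {
    my ($ua, $url) = @_;
    my $response = $ua->head($url);
    $response = $ua->get($url) unless $response->is_success;
    return $response->is_success;
}

# Link extraction needs no network access.
my $html  = '<a href="about.html">About</a> <img src="/img/logo.gif">';
my @links = extract_links($html, 'http://www.example.com/docs/index.html');
print "$_\n" for @links;

# Only URLs given on the command line are actually requested.
if (@ARGV) {
    my $ua = LWP::UserAgent->new(timeout => 10);
    for my $url (@ARGV) {
        printf "%s %s\n", (link_ok($ua, $url) ? 'OK ' : 'BAD'), $url;
    }
}
```

Run with no arguments, the sketch just prints the two resolved links from the sample HTML. Run with one or more URLs as arguments, it also reports each one as OK or BAD. A real spider would additionally have to remember which pages it has already visited and stay within the starting site.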

Without further ado, the script is given in Example 11-4. There’s a lot going on here, in particular a lot of magic being performed via imported modules, but we’ll cover it all in detail after taking a look at the script.

Example 11-4. A link-checking script that uses LWP to check for “badness”

#!/usr/bin/perl -w

# link_check3.plx

# This is a third version of an HTML link checker.
# Beginning with a URL (required as a command-line argument),
# it spiders out the entire site (or as much of it as it can
# reach via links followed recursively from the starting page),
# checking all HREF and SRC attributes to make sure they work
# using GET and HEAD requests from LWP::UserAgent. It then
# reports on the bad links.

use strict;
use LWP::UserAgent;
use HTTP::Request;
use HTML::LinkExtor;
use URI::URL; # required by HTML::LinkExtor, when invoked with base

my ...
