A Proper Link Checker
I’ve done a lot of badmouthing of the link-checking scripts
we’ve developed so far in this chapter, so let’s move on
now and develop a “proper” link checker. This final
version of our link-checking script gives up on the whole problematic
notion of trying to extract
SRC and HREF attributes from our HTML
pages using simple regex patterns. It also throws out the practice of
trying to use -e tests of the local filesystem to
identify the presence or absence of local images and HTML files.
Instead, it crawls through our site like a search engine’s
spider program, testing each link with a web request issued via
LWP.
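To make the idea of testing a link with a web request concrete before we reach the full script, here is a minimal sketch of checking a single URL. It is not taken from Example 11-4: the link_is_good helper, the 15-second timeout, and the HEAD-then-GET fallback order are assumptions chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Hypothetical helper (not from the book's script): returns 1 if the URL
# answers with a success status, trying HEAD first and then GET.
sub link_is_good {
    my ($ua, $url) = @_;

    # HEAD is cheaper because the server sends headers only.
    my $response = $ua->request(HTTP::Request->new(HEAD => $url));
    return 1 if $response->is_success;

    # Some servers reject HEAD, so retry with GET before calling it bad.
    $response = $ua->request(HTTP::Request->new(GET => $url));
    return $response->is_success ? 1 : 0;
}

# The timeout value is an arbitrary choice for this sketch.
my $ua = LWP::UserAgent->new(timeout => 15);

# Check every URL given on the command line.
for my $url (@ARGV) {
    printf "%s %s\n", (link_is_good($ua, $url) ? 'OK ' : 'BAD'), $url;
}
```

Saved under a name of your choosing and run with one or more URLs as arguments, this prints one OK or BAD line per URL. The point of the design is that the web server, rather than the local filesystem, decides whether a link works.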
Without further ado, the script is given in Example 11-4. There’s a lot going on here, in particular a lot of magic being performed via imported modules, but we’ll cover it all in detail after taking a look at the script.
Example 11-4. A link-checking script that uses LWP to check for “badness”
#!/usr/bin/perl -w

# link_check3.plx

# This is a third version of an HTML link checker.
# Beginning with a URL (required as a command-line argument),
# it spiders out the entire site (or as much of it as it can
# reach via links followed recursively from the starting page),
# checking all HREF and SRC attributes to make sure they work
# using GET and HEAD requests from LWP::UserAgent. It then
# reports on the bad links.

use strict;
use LWP::UserAgent;
use HTTP::Request;
use HTML::LinkExtor;
use URI::URL; # required by HTML::LinkExtor, when invoked with base

my ...
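The header comments above mention invoking HTML::LinkExtor with a base URL. As a separate illustration (not a continuation of the listing), the following sketch shows roughly what that buys us: relative HREF and SRC values come back as absolute URLs. The extract_links helper, the sample HTML, and the example.com base URL are all made up for this sketch.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper (not from the book's script): returns the absolute
# URL of every HREF and SRC attribute found in $html, resolved against $base.
sub extract_links {
    my ($html, $base) = @_;
    my @found;

    # HTML::LinkExtor calls this once per link-bearing tag, passing the
    # tag name followed by attribute-name/URL pairs.
    my $callback = sub {
        my ($tag, %attr) = @_;
        for my $name (qw(href src)) {
            push @found, "$attr{$name}" if defined $attr{$name};
        }
    };

    # Supplying a base URL makes the parser absolutize relative links.
    my $parser = HTML::LinkExtor->new($callback, $base);
    $parser->parse($html);
    $parser->eof;

    return @found;
}

# Made-up sample input for illustration only.
my $html = '<a href="about.html">About</a> <img src="/img/logo.gif">';
print "$_\n" for extract_links($html, 'http://www.example.com/docs/index.html');
```

With the sample input, the relative about.html resolves against the /docs/ directory of the base, while /img/logo.gif resolves against the site root. That is what lets a spider turn whatever it finds on a page into URLs it can actually request.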