A Proper Link Checker
I’ve done a lot of badmouthing of the link-checking scripts we’ve developed so far in this chapter, so let’s move on now and develop a “proper” link checker. This final version of our link-checking script gives up on the whole problematic notion of trying to extract SRC and HREF attributes from our HTML pages using simple regex patterns. It also throws out the practice of trying to use -e tests of the local filesystem to identify the presence or absence of local images and HTML files. Instead, it crawls through our site like a search engine’s spider program, testing each link with a web request issued via LWP.
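To make the idea of testing a link with a web request concrete, here is a minimal sketch of what a single check might look like with LWP::UserAgent. It is not the book’s script: the link_is_good helper, the 15-second timeout, and the choice to try HEAD first and fall back to GET are all assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Hypothetical helper (not from the book's script): returns 1 if the
# URL answers with a successful status, 0 otherwise.
sub link_is_good {
    my ($ua, $url) = @_;

    # Try a cheap HEAD request first; it asks for headers only.
    my $response = $ua->request(HTTP::Request->new(HEAD => $url));
    return 1 if $response->is_success;

    # Some servers mishandle HEAD, so retry with GET before giving up.
    $response = $ua->request(HTTP::Request->new(GET => $url));
    return $response->is_success ? 1 : 0;
}

# The timeout value is an arbitrary choice for this sketch.
my $ua = LWP::UserAgent->new(timeout => 15);

# Check each URL given on the command line and report the result.
for my $url (@ARGV) {
    printf "%s: %s\n", $url, link_is_good($ua, $url) ? 'ok' : 'BAD';
}
```

Run with one or more URLs as arguments, this prints one line per URL. Because the check goes through a real request rather than a filesystem test, it treats local pages and remote pages the same way.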
Without further ado, the script is given in Example 11-4. There’s a lot going on here, in particular a lot of magic being performed via imported modules, but we’ll cover it all in detail after taking a look at the script.
Example 11-4. A link-checking script that uses LWP to check for “badness”
#!/usr/bin/perl -w

# link_check3.plx

# This is a third version of an HTML link checker.
# Beginning with a URL (required as a command-line argument),
# it spiders out the entire site (or as much of it as it can
# reach via links followed recursively from the starting page),
# checking all HREF and SRC attributes to make sure they work
# using GET and HEAD requests from LWP::UserAgent. It then
# reports on the bad links.

use strict;
use LWP::UserAgent;
use HTTP::Request;
use HTML::LinkExtor;
use URI::URL; # required by HTML::LinkExtor, when invoked with base

my ...
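The comment on the URI::URL line mentions invoking HTML::LinkExtor “with base.” As a rough illustration of what that means (a sketch under stated assumptions, not a continuation of the listing), the following parses a small chunk of HTML and asks the parser to resolve every link against a base URL. The page content and the example.com and example.org addresses are made up, and the sketch loads URI directly instead of URI::URL.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;
use URI;    # used by HTML::LinkExtor to resolve links against a base

# Hypothetical base URL and page content, for illustration only.
my $base = 'http://www.example.com/docs/index.html';
my $html = <<'END_HTML';
<a href="about.html">About</a>
<img src="/images/logo.gif">
<a href="http://www.example.org/">Elsewhere</a>
END_HTML

# With no callback (undef), the parser collects links internally so
# they can be fetched later with links(). The second argument is the
# base URL, which makes the parser return absolute URLs.
my $parser = HTML::LinkExtor->new(undef, $base);
$parser->parse($html);
$parser->eof;

my @urls;
for my $link ($parser->links) {
    # Each entry is [ tag name, attribute => URL, ... ].
    my ($tag, %attr) = @$link;
    push @urls, map { "$_" } values %attr;    # stringify URI objects
}

print "$_\n" for @urls;
```

With the base supplied, a relative link such as about.html comes back as http://www.example.com/docs/about.html, which is the form a spider needs before it can issue a request for it.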