Let’s take stock of what we’ve done so far. We’ve
written a script that will descend recursively through a filesystem,
reading in the contents of any HTML files it encounters and
extracting all the
<A HREF="..."> and
<IMG SRC="..."> attributes from those files.
We’ve also created a subroutine that will take a directory name
and a list of links extracted from a file in that directory, identify
which links point to local files, and convert them to full (that is,
absolute) filesystem pathnames.
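To make that second step concrete, here is a minimal, hypothetical sketch of such a conversion (this is not the book's actual subroutine; the name `localize` and the sample paths are made up for illustration). It uses the standard File::Spec module to turn relative links into absolute filesystem paths:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Hypothetical sketch: given a directory and a list of links,
# return absolute filesystem paths for the links that point
# to local files, skipping full URLs and non-file schemes.
sub localize {
    my ($dir, @links) = @_;
    my @local;
    for my $link (@links) {
        next if $link =~ m{^[a-z]+://}i;           # skip full URLs (http://, ftp://, ...)
        next if $link =~ /^(mailto|javascript):/i; # skip non-file schemes
        $link =~ s/#.*//;                          # strip any fragment identifier
        next unless length $link;                  # skip pure-fragment links
        push @local, File::Spec->rel2abs($link, $dir);
    }
    return @local;
}

my @abs = localize('/w1/s/socalsail',
                   'images/logo.gif', 'http://example.com/', 'page.html#top');
print "$_\n" for @abs;   # e.g. /w1/s/socalsail/images/logo.gif
```

The real script's version would also need to handle server-relative links (those beginning with `/`), which depend on where the document root lives on the filesystem.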
The fast-but-stupid version of our link-checker is almost finished. The main thing left is to define the data structures that will hold the information on the bad links it discovers.
For that, we go back to the top of the script, just below the configuration section, and add the following:
my %bad_links; # A "hash of arrays" with keys consisting of URLs
               # under $start_base, and values consisting of lists
               # of bad links on those pages.

my %good;      # A hash mapping filesystem paths to 0 or 1
               # (for bad or good). Used to cache the results of
               # previous checks so they needn't be repeated for
               # subsequent pages.
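To make the "hash of arrays" idea concrete, here is a small hypothetical sketch of how these two hashes might be filled in. The `check_link` subroutine and the page names are made up for illustration; a link is deemed "good" here simply if the file exists:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %bad_links;   # page => list of bad links found on that page
my %good;        # path => 1 (good) or 0 (bad), cached

# Hypothetical check: test each filesystem path at most once,
# caching the result in %good.
sub check_link {
    my ($page, $path) = @_;
    $good{$path} = (-e $path ? 1 : 0) unless exists $good{$path};
    # Autovivification creates the array the first time we push.
    push @{ $bad_links{$page} }, $path unless $good{$path};
}

check_link('index.html', '/no/such/file.html');
check_link('about.html', '/no/such/file.html');   # cache hit: no second -e test

print "Bad links on $_: @{ $bad_links{$_} }\n" for sort keys %bad_links;
```

The `push @{ $bad_links{$page} }, ...` line is the hash-of-arrays part: each key's value is a reference to an array of that page's bad links, created automatically the first time it is needed.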
Here we’ve declared two new hashes that are going to be used in the link-checking part of the script.
%good is fairly straightforward; we’re going
to use it to store the result of testing the links our script
processes. The keys of the
%good hash are the
local filesystem paths for the files we are checking (e.g.,
/w1/s/socalsail/index.html). A link that turns out to be bad ...