Finding Stale Links
Problem
You want to check whether a document contains invalid links.
Solution
Use the technique outlined in Section 20.3 to extract
each link, and then use the LWP::Simple module’s
head function to make sure that link exists.
Discussion
Example 20-5 is an applied example of the
link-extraction technique. Instead of just printing the name of the
link, we call the LWP::Simple module’s head
function on it. The HEAD method fetches the remote document’s
metainformation to determine its status without downloading
the whole document. If the call fails, the link is bad, so we print an
appropriate message.
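As a minimal sketch of that behavior (the URLs here are illustrative, not from the recipe): in scalar context head returns a simple true/false, which is all the link checker needs, while in list context it returns the metainformation itself.

```perl
#!/usr/bin/perl -w
# Sketch: how LWP::Simple's head behaves in scalar and list
# context.  Pass one or more URLs on the command line to try it.
use LWP::Simple qw(head);

sub report {
    my $url = shift;

    # In scalar context, head returns true if the HEAD request
    # succeeded and false otherwise -- enough for link checking.
    print "$url: ", head($url) ? "OK" : "BAD", "\n";

    # In list context, head returns the document's metainformation:
    # content type, document length, modification time, expiry
    # time, and server name.
    my ($type, $length, $mod_time) = head($url);
    print "  type: $type, length: $length\n" if defined $type;
}

report($_) for @ARGV;
```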
Because this program uses the get function from
LWP::Simple, it expects a URL, not a filename. If you want to be able
to supply either, use the
URI::Heuristic module described in Section 20.1.
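One way that might look (a sketch, not part of the recipe; the argument handling and the URI::file fallback are assumptions): URI::Heuristic’s uf_urlstr expands shorthand like "www.perl.com" into a full URL, and an existing local file can be converted to a file:// URL.

```perl
#!/usr/bin/perl -w
# Sketch: accept either a filename or a URL-ish string and
# normalize it to a proper URL before handing it to get().
use URI::Heuristic qw(uf_urlstr);
use URI::file;

my $arg = shift || "www.perl.com";
my $url;
if (-e $arg) {
    # Turn an existing local path into an absolute file:// URL.
    $url = URI::file->new_abs($arg)->as_string;
} else {
    # Expand shorthand ("www.perl.com") into "http://www.perl.com".
    $url = uf_urlstr($arg);
}
print "$url\n";
```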
Example 20-5. churl
#!/usr/bin/perl -w
# churl - check urls

use HTML::LinkExtor;
use LWP::Simple qw(get head);

$base_url = shift
    or die "usage: $0 <start_url>\n";
$parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse(get($base_url));
@links = $parser->links;
print "$base_url: \n";
foreach $linkarray (@links) {
    my @element  = @$linkarray;
    my $elt_type = shift @element;
    while (@element) {
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        if ($attr_value->scheme =~ /\b(ftp|https?|file)\b/) {
            print " $attr_value: ", head($attr_value) ? "OK" : "BAD", "\n";
        }
    }
}

Here’s an example of a program run:
% churl http://www.wizards.com
http://www.wizards.com:
FrontPage/FP_Color.gif: ...
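A possible refinement not shown in the recipe (the helper name and cache are assumptions): a page often links to the same URL many times, and each occurrence triggers a separate HEAD request. Caching the result of each check avoids that.

```perl
#!/usr/bin/perl -w
# Sketch: cache head() results so each distinct URL is only
# checked once, no matter how often it appears in the page.
use LWP::Simple qw(head);

my %checked;    # URL => "OK" or "BAD"

sub check_link {
    my $url = shift;
    $checked{$url} = head($url) ? "OK" : "BAD"
        unless exists $checked{$url};
    return $checked{$url};
}
```

Inside the loop, `head($attr_value) ? "OK" : "BAD"` would then become `check_link($attr_value)`.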