Perl for Web Site Management by John Callender

Extracting

At this point, we are ready to move on to the next level: having the script extract just the links from those files, or more specifically, having it extract the values of all the SRC and HREF attributes.

Warning

As was discussed in Chapter 4, trying to parse HTML files with simple pattern matching is an inherently error-prone undertaking. The accompanying example fails in the face of several kinds of HTML markup that are perfectly valid as HTML, but break the simplistic assumptions in this script. For a “correct” link checker that will handle those variations more gracefully, see the example at the end of this chapter.
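
To make the warning concrete, here is a small sketch of the kind of markup that is valid HTML but slips past a pattern that only looks for double-quoted attribute values, such as the one used in this section. The subroutine name find_targets and the sample strings are invented for this illustration and are not part of the chapter's script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns every double-quoted HREF or SRC value in a string.
sub find_targets {
    my ($html) = @_;
    my @found = ($html =~ /(?:href|src)\s*=\s*"([^"]+)"/gi);
    return @found;
}

# Sample markup invented for this illustration.
my @samples = (
    '<a href="index.html">double-quoted</a>',   # matched
    "<a href='about.html'>single-quoted</a>",   # valid HTML, but missed
    '<img src=logo.gif>',                       # unquoted, valid HTML, but missed
);

foreach my $html (@samples) {
    my @found = find_targets($html);
    printf "%-45s => %s\n", $html, @found ? "@found" : '(nothing found)';
}
```

Only the first sample produces a target. The other two are silently skipped, which is exactly the sort of gap the warning describes.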

We begin by deleting the line from the end of the &process subroutine that prints out the current filename and the entire contents of the $data variable, and replacing it with the following chunk of code:

my @targets = ($data =~ /(?:href|src)\s*=\s*"([^"]+)"/gi);
print "In file $file, found the following targets:\n";
foreach (@targets) {
    print " $_\n";
}

Let’s concentrate on that first line. It looks challenging, but assuming you’ve been doing your regular expressions homework, it’s really not that tough.

The first thing to focus on is the regular expression search pattern itself: /(?:href|src)\s*=\s*"([^"]+)"/gi. In order, from left to right, this pattern says to match a string that begins with either href or src, then has zero or more whitespace characters, then an equal sign (=), then zero or more whitespace characters, then a double quote ("), then ...
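
One way to read a pattern like this is to spread it out with Perl's x modifier, which ignores whitespace inside the pattern and allows a comment on each piece. The following is a minimal sketch of that idea using the same pattern. The subroutine name extract_targets and the sample string are hypothetical, invented here for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper wrapping the pattern from this section,
# written with the x modifier so each piece can carry a comment.
sub extract_targets {
    my ($data) = @_;
    my @targets = ($data =~ /
        (?:href|src)   # the attribute name, grouped without capturing
        \s*            # zero or more whitespace characters
        =              # an equal sign
        \s*            # zero or more whitespace characters
        "              # an opening double quote
        ([^"]+)        # capture one or more characters that are not a double quote
        "              # the closing double quote
    /gix);
    return @targets;
}

# Sample data invented for this illustration.
my $data = '<a HREF = "page.html"><img src="pic.gif"></a>';
print " $_\n" foreach extract_targets($data);
```

Because the match runs in list context with the g modifier, it returns every captured value at once, and the i modifier lets HREF match as readily as href.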
