Extracting
At this point, we are ready to move on to the next
level: having the script extract just the links from those files, or
more specifically, having it extract the values of all the
SRC and HREF attributes.
Warning
As was discussed in Chapter 4, trying to parse HTML files with simple pattern matching is an inherently error-prone undertaking. The accompanying example fails on several kinds of markup that are perfectly valid HTML but break the simplistic assumptions in this script. For a “correct” link checker that handles those variations more gracefully, see the example at the end of this chapter.
We begin by deleting the line from the end of the
&process subroutine that prints out the
current filename and the entire contents of the
$data variable, and replacing it with the
following chunk of code:
my @targets = ($data =~ /(?:href|src)\s*=\s*"([^"]+)"/gi);
print "In file $file, found the following targets:\n";
foreach (@targets) {
    print " $_\n";
}

Let’s concentrate on that first line. It looks challenging, but assuming you’ve been doing your regular expressions homework it’s really not that tough.
The first thing to focus on is the regular expression search pattern
itself: /(?:href|src)\s*=\s*"([^"]+)"/gi. In
order, from left to right, this pattern says to match a string that
begins with either href or src,
then has zero or more whitespace characters, then an equal sign
(=), then zero or more whitespace characters, then
a double quote ("), then ...
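To see the pattern at work outside the link-checking script, here is a minimal standalone sketch. The sample markup assigned to $data is invented for this illustration and is not from the book's example files. It also includes two attribute styles that are valid HTML but that this pattern will not catch, which is the kind of limitation the warning above refers to.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample markup, made up for this illustration.
my $data = <<'END_HTML';
<a href="index.html">Home</a>
<IMG SRC = "images/logo.gif">
<a href='single.html'>Single-quoted value</a>
<a href=bare.html>Unquoted value</a>
END_HTML

# In list context, a match with the /g modifier returns every
# captured substring (here, the text between the double quotes).
# The /i modifier lets the pattern match HREF and SRC in any case.
my @targets = ($data =~ /(?:href|src)\s*=\s*"([^"]+)"/gi);

print "Found the following targets:\n";
foreach (@targets) {
    print " $_\n";
}
```

Run on this sample, the script should report only index.html and images/logo.gif. The single-quoted and unquoted values are skipped because the pattern insists on a double quote immediately after the equal sign and any optional whitespace.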