Extracting
At this point, we are ready to move on to the next level: having the script extract just the links from those files, or more specifically, having it extract the values of all the SRC and HREF attributes.
Warning
As was discussed in Chapter 4, trying to parse HTML files with simple pattern matching is an inherently error-prone undertaking. The accompanying example fails on several kinds of markup that are perfectly valid HTML but break this script's simplistic assumptions. For a “correct” link checker that will handle those variations more gracefully, see the example at the end of this chapter.
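To make the warning concrete, here is a minimal sketch that runs this section's pattern against a few hand-written tags. The sample markup and the find_targets helper are assumptions invented for illustration; they are not part of the chapter's script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns every value captured by this section's pattern.
sub find_targets {
    my ($html) = @_;
    my @targets = ($html =~ /(?:href|src)\s*=\s*"([^"]+)"/gi);
    return @targets;
}

# Made-up sample markup, one way of writing a link per entry.
my @samples = (
    '<a href="page.html">double-quoted</a>',
    "<a href='page.html'>single-quoted</a>",
    '<a href=page.html>unquoted</a>',
    '<!-- <a href="old.html">commented out</a> -->',
);

foreach my $html (@samples) {
    my @targets = find_targets($html);
    my $found   = @targets ? join(', ', @targets) : '(nothing)';
    print "$html\n    => $found\n";
}
```

Only the double-quoted form is picked up as intended. The single-quoted and unquoted links are silently missed, and the link inside the comment is reported even though a browser would never follow it.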
We begin by deleting the line from the end of the &process subroutine that prints out the current filename and the entire contents of the $data variable, and replacing it with the following chunk of code:
    my @targets = ($data =~ /(?:href|src)\s*=\s*"([^"]+)"/gi);
    print "In file $file, found the following targets:\n";
    foreach (@targets) {
        print "  $_\n";
    }
Let’s concentrate on that first line. It looks challenging, but assuming you’ve been doing your regular expressions homework, it’s really not that tough.
The first thing to focus on is the regular expression search pattern itself: /(?:href|src)\s*=\s*"([^"]+)"/gi. In order, from left to right, this pattern says to match a string that begins with either href or src, then has zero or more whitespace characters, then an equal sign (=), then zero or more whitespace characters, then a double quote ("), then ...
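As an aid to reading the pattern piece by piece, the sketch below rewrites it with Perl's /x modifier, which lets whitespace and comments sit inside the expression without changing what it matches. The values given to $file and $data are made-up stand-ins for what the script would have read from disk.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for the script's $file and $data variables.
my $file = 'index.html';
my $data = '<A HREF="about.html">About</A> <img src = "logo.gif">';

# The same pattern, spread out with the x modifier so each piece is labeled.
my @targets = ($data =~ /
    (?:href|src)   # either attribute name, grouped but not captured
    \s*            # zero or more whitespace characters
    =              # an equal sign
    \s*            # zero or more whitespace characters
    "              # an opening double quote
    ([^"]+)        # one or more characters that are not double quotes, captured
    "              # a closing double quote
/gix);

print "In file $file, found the following targets:\n";
print "  $_\n" foreach @targets;
```

Because the match uses the g modifier and is evaluated in list context, it returns every captured value at once, so @targets ends up holding about.html and logo.gif. The i modifier is what allows the uppercase HREF to match.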