Perl for Web Site Management
by John Callender
October 2001
Beginner
528 pages
15h 20m
English
O'Reilly Media, Inc.
Extracting

At this point, we are ready to move on to the next level: having the script extract just the links from those files, or more specifically, having it extract the values of all the SRC and HREF attributes.

Warning

As was discussed in Chapter 4, trying to parse HTML files with simple pattern matching is an inherently error-prone undertaking. The accompanying example fails in the face of several kinds of HTML markup that are perfectly valid as HTML, but break the simplistic assumptions in this script. For a “correct” link checker that will handle those variations more gracefully, see the example at the end of this chapter.
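To make the warning concrete, here is a small sketch that runs the pattern used in this section's script against three made-up anchor tags. The samples and the first_target helper are assumptions added for illustration, not part of the original script. Single-quoted and unquoted attribute values are both valid HTML, but the pattern only recognizes double-quoted values.

```perl
use strict;
use warnings;

# The pattern used in this section's script, stored for reuse.
my $pattern = qr/(?:href|src)\s*=\s*"([^"]+)"/i;

# Hypothetical helper: returns the first target found, or undef if none.
sub first_target {
    my ($html) = @_;
    return $html =~ $pattern ? $1 : undef;
}

# Made-up samples: all three are valid HTML, but only the first matches.
my @samples = (
    '<a href="page.html">double-quoted</a>',
    "<a href='page.html'>single-quoted</a>",
    '<a href=page.html>unquoted</a>',
);

for my $html (@samples) {
    my $target = first_target($html);
    printf "%-45s => %s\n", $html, defined $target ? $target : '(missed)';
}
```

Only the first sample yields page.html; the other two are reported as missed, which is the kind of silent failure the warning describes.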

We begin by deleting the line from the end of the &process subroutine that prints out the current filename and the entire contents of the $data variable, and replacing it with the following chunk of code:

my @targets = ($data =~ /(?:href|src)\s*=\s*"([^"]+)"/gi);
print "In file $file, found the following targets:\n";
foreach (@targets) {
    print " $_\n";
}
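
For readers who want to try the chunk above outside the full script, the following is a minimal standalone sketch. It assumes a hypothetical extract_targets subroutine wrapping the match, a made-up filename, and a here-document standing in for the contents of $data. None of these come from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): returns the values of
# all double-quoted HREF and SRC attributes found in a string of HTML.
sub extract_targets {
    my ($data) = @_;
    # In list context, a /g match returns every capture from every match.
    my @targets = ($data =~ /(?:href|src)\s*=\s*"([^"]+)"/gi);
    return @targets;
}

# Made-up sample standing in for a file's contents.
my $file = 'sample.html';
my $data = <<'HTML';
<a href="index.html">Home</a>
<IMG SRC = "images/logo.gif" ALT="Logo">
<a HREF="http://www.example.com/">Example</a>
HTML

my @targets = extract_targets($data);
print "In file $file, found the following targets:\n";
foreach (@targets) {
    print " $_\n";
}
```

Run against this sample, the sketch lists index.html, images/logo.gif, and http://www.example.com/. The ALT value is skipped because the pattern only looks at HREF and SRC.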

Let’s concentrate on that first line. It looks challenging, but assuming you’ve been doing your regular expressions homework, it’s really not that tough.

The first thing to focus on is the regular expression search pattern itself: /(?:href|src)\s*=\s*"([^"]+)"/gi. Read from left to right, this pattern matches a string that begins with either href or src, followed by zero or more whitespace characters, an equals sign (=), zero or more whitespace characters, a double quote ("), then ...
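
One way to read the pattern piece by piece is to spread it out with the /x modifier, which lets a regular expression carry whitespace and comments. The sketch below is an annotated restatement added for illustration. The $link_re variable name and the sample tag are assumptions, and the /g modifier is left off because a compiled qr// pattern does not take it.

```perl
use strict;
use warnings;

# The same pattern, spread out with /x so each piece can carry a comment.
my $link_re = qr{
    (?:href|src)   # either "href" or "src" (grouped, but not captured)
    \s*            # zero or more whitespace characters
    =              # an equals sign
    \s*            # zero or more whitespace characters
    "              # an opening double quote
    ([^"]+)        # one or more characters that are not double quotes (captured)
    "              # the closing double quote
}xi;

my $tag = '<IMG SRC = "images/logo.gif">';   # made-up sample input
if ($tag =~ $link_re) {
    print "Captured: $1\n";                  # prints "Captured: images/logo.gif"
}
```

The i modifier is what lets the uppercase SRC in the sample match, and the parentheses around [^"]+ are what make the attribute value, rather than the whole match, end up in @targets.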

ISBN: 1565926471