Spidering Hacks: 100 Industrial-Strength Tips & Tools

By Kevin Hemenway, Tara Calishain
Book Price: $24.95 USD
£17.50 GBP
PDF Price: $19.99



Table of Contents

Chapter 1: Walking Softly
With over three billion pages on the Web, serious surfers eventually find themselves asking two questions: where's the good stuff and what can I do with it? Everyone has their own idea of what the "good stuff" is, and most people come up with some creative idea of what to do once they find it. In some corners of the Web, repurposing data in interesting ways is encouraged: it inspires those "Eureka!" moments when unusual information combinations bubble forth unimpeded.
From the Web's standpoint, the utility of universally accessible data has only recently been broached. Once Google opened their search listings via an API (see Google Hacks), Amazon.com quickly followed (see Amazon Hacks), and both have benefited by the creative utilities that have resulted. In this short and sweet chapter, we'll introduce you to the fine art of scraping and spidering: what they are and aren't, what's most likely allowed and what might create risk, finding alternative avenues to your desired data, and how to reassure—and, indeed, educate—webmasters who spot your automation and wonder what you're up to.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Hacks #1-7
A Crash Course in Spidering and Scraping
A few of the whys and wherefores of spidering and scraping.
There is a wide and ever-increasing variety of computer programs gathering and sifting information, aggregating resources, and comparing data. Humans are just one part of a much larger and automated equation. But despite the variety of programs out there, they all have some basic characteristics in common.
Spiders are programs that traverse the Web, gathering information. If you've ever taken a gander at your own web site's logs, you'll see them peppered with User-Agent names like Googlebot, Scooter, and MSNbot. These are all spiders, or bots, as some prefer to call them.
Throughout this book, you'll hear us referring to spiders and scrapers. What's the difference? Broadly speaking, they're both programs that go out on the Internet and grab things. For the purposes of this book, however, it's probably best for you to think of spiders as programs that grab entire pages, files, or sets of either, while scrapers grab very specific bits of information within these files. For example, one of the spiders [Hack #44] in this book grabs entire collections of Yahoo! Group messages to turn into mailbox files for use by your email application, while one of the scrapers [Hack #76] grabs train schedule information. Spiders follow links, gathering up content, while scrapers pull data from web pages. Spiders and scrapers usually work in concert; you might have a program that uses a spider to follow links but then uses a scraper to gather particular information.
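To make the distinction concrete, here is a minimal sketch rather than one of the book's hacks. It assumes a made-up, in-memory "site" (the %site hash, its page names, and the "Departs:" markup are all invented for illustration) so that it runs without touching the network. The spider follows links and gathers whole pages; the scraper pulls one specific bit out of each page.

```perl
#!/usr/bin/perl -w
use strict;

# A made-up, in-memory "site" so the sketch runs without a network.
my %site = (
    '/index.html' => '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    '/a.html'     => '<p>Departs: 09:15</p> <a href="/b.html">B</a>',
    '/b.html'     => '<p>Departs: 17:40</p> <a href="/index.html">Home</a>',
);

# Spider: start at one page and follow every link, visiting each page once.
sub spider {
    my ($start) = @_;
    my %seen;
    my @queue = ($start);
    while (defined(my $url = shift @queue)) {
        next if $seen{$url} or !exists $site{$url};
        $seen{$url} = 1;
        # Queue every href found on the page we just "fetched".
        push @queue, $site{$url} =~ m/href="([^"]+)"/g;
    }
    return sort keys %seen;
}

# Scraper: pull one specific bit of information out of a single page.
sub scrape_departure {
    my ($html) = @_;
    my ($time) = $html =~ m/Departs:\s*(\d{2}:\d{2})/;
    return $time;    # undef if the page lists no departure time
}

# Working in concert: the spider finds the pages, the scraper reads them.
for my $url (spider('/index.html')) {
    my $time = scrape_departure($site{$url});
    $time = 'no departure listed' unless defined $time;
    print "$url: $time\n";
}
```

In a real program, the hash lookup would be replaced by an actual fetch of each URL; the division of labor between the two routines stays the same.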
When learning about a technology or way of using technology, it's always good to ask the big question: why? Why bother to spider? Why take the time to write a spider, make sure it works as expected, get permission from the appropriate site's owner to use it, make it available to others, and spend time maintaining it? Trust us; once you've started using spiders, you'll find no end to the ways and places they can be used to make your online life easier:
Best Practices for You and Your Spider
Some rules for the road as you're writing your own well-behaved spider.
To make your spider as effective, polite, and useful as possible, there are some general things you'll have to keep in mind as you create it.
To spider, you must pull information from a web site. To pull information from a web site, you must wade your way through some flavor of tag soup, be it HTML, XML, plain text, or something else entirely. This is an inexact science, to put it mildly. If even one tag or bit of file formatting changes, your spider will probably break, leaving you dataless until such time as you retool. Thankfully, most sites aren't doing huge revamps every six months like they used to, but they still change often enough that you'll have to watch out for this.
To minimize the fragility of your scraping, use as little boundary data as you can when gleaning data from the page. Boundary data is the fluff around the actual goodness you want: the tags, superfluous verbiage, spaces, newlines, and such. For example, the title of an average web page looks something like this:
<title>This is the title</title>
If you're after the title, the boundary data is the <title> and </title> tags.
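As a rough sketch of what "as little boundary data as possible" can look like in Perl, the following anchors on nothing but the two title tags and tolerates changes in case, attributes, and surrounding whitespace. The page_title routine and the sample page are our own illustrations, not code from a later hack.

```perl
#!/usr/bin/perl -w
use strict;

# A sample page; note the newlines and indentation around the title text.
my $html = "<html><head>\n  <title>\n    This is the title\n  </title>\n</head></html>";

# Rely only on the <title> and </title> tags as boundary data.
#   /i  ignores case (<TITLE> works too)
#   /s  lets . match newlines, so a title split across lines is found
sub page_title {
    my ($page) = @_;
    my ($title) = $page =~ m{<title\b[^>]*>\s*(.*?)\s*</title>}is;
    return $title;    # undef if the page has no title
}

my $title = page_title($html);
print "$title\n";
```

A pattern that also depended on the indentation, or on the tags that happen to surround the title today, would break the first time the site's layout shifted.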
Monitor your spider's output on a regular basis to make sure it's working as expected [Hack #31], make the appropriate adjustments as soon as possible to avoid losing ground with your data gathering, and design your spider to be as adaptive to site redesigns [Hack #32] as possible.
Just because you're working with the Web doesn't mean you're restricted to spidering HTML documents. If you're considering only web pages, you're potentially narrowing your dataset arbitrarily. There are images, sounds, movies, PDFs, text files—all worthy of spidering for your collection.
While it's tempting to think what you're up to is unique, chances are, someone's already spidered and scraped the same or similar sites, leaving clear footprints in the form of code, raw data, or instructions.
Anatomy of an HTML Page
Getting the knack of scraping is more than just code; it takes knowing HTML and other kinds of web page files.
If you're new to spidering, figuring out what to scrape and why is not easy. Relative scraping newbies might try to take too much information, too little, or information that's too likely to change from what they want. If you know how HTML files are structured, however, you'll find it easier to scrape them and zero in on the information you need.
HTML files are just text files with special formatting. And that's just the kind of file you'll spend most of your time scraping, both in this book and in your own spidering adventures. While we'll also be spidering and grabbing multimedia files—images, movies, and audio files—we won't be scraping and parsing them to glean embedded information.
That's not to say, however, that there aren't about as many ways to format an HTML page as there are pages on the Web. To understand how your spider might be able to find patterns of information on an HTML page, you'll need to start with the basics—the very basics—of how an HTML web page looks, and then get into how the information within the body can be organized.
The core of an HTML page looks like this:
<html>
<head>
  <title>
    Title of the page
  </title>
</head>
<body>
  Body of the page
</body>
</html>
That's it. 99% of the HTML pages on the Web start out like this. They can get a lot more elaborate but, in the end, this is the core. What does this mean to your spider? It means that there's only one piece of information that's clearly marked by tags, and that's the page title. If all you need is the title, you're in gravy.
But if you need information from the body of a page—say, a headline or a date—you have some detective work ahead of you. Many times, the body of a page has several tables, JavaScript, and other code that obscures what you're truly looking for—all annoyances that have much more to do with formatting information than truly organizing it. But, at the same time, the HTML language contains several standards for organizing data. Some of these standards make the information larger on the page, representing a heading. Some of the standards organize information into lists within the body. If you understand how the standards work, you'll find it easier to pluck the information you want from the heavily coded confines of a web page body.
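Here is a small sketch of how those organizing tags can be put to work, assuming a tidy, made-up page body (the headline and list items are invented). It uses plain regular expressions to collect headings and list items. That is fine for an illustration, but pages with nested or unclosed tags will usually call for a real HTML parser instead.

```perl
#!/usr/bin/perl -w
use strict;

# A made-up page body, kept deliberately tidy.
my $body = <<'HTML';
<body>
  <h1>Today's Headlines</h1>
  <ul>
    <li>Spiders found in server logs</li>
    <li>Scraper survives site redesign</li>
  </ul>
</body>
HTML

# Headings, h1 through h6: remember the level and the text.
# \1 makes sure the closing tag matches the level of the opening tag.
my @headings;
while ($body =~ m{<h([1-6])\b[^>]*>\s*(.*?)\s*</h\1>}gis) {
    push @headings, [ $1, $2 ];
}

# List items: \b keeps <li> from also matching tags such as <link>.
my @items = $body =~ m{<li\b[^>]*>\s*(.*?)\s*</li>}gis;

print "h$_->[0]: $_->[1]\n" for @headings;
print "item: $_\n"          for @items;
```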
Registering Your Spider
If you have a spider you're programming or planning on using even a minimal amount, you need to make sure it can be easily identified. The most low-key of spiders can be the subject of lots of attention.
On the Internet, any number of "arms races" are going on at the same time. You know: spammers versus antispammers, file sharers versus non-file sharers, and so on. A lower-key arms race rages between web spiders and webmasters who don't want the attention.
Who might not want to be spidered? Unfortunately, not all spiders are as benevolent as the Googlebot, Google's own indexer. Many spiders go around searching for email addresses to spam. Still others don't abide by the rules of gentle scraping and data access [Hack #2]. Therefore, spiders have gotten to the point where they're viewed with deep suspicion by experienced webmasters.
In fact, it's gotten to the point where, when in doubt, your spider might be blocked. With that in mind, it's important to name your spider wisely, register it with online databases, and make sure it has a reasonably high profile online.
By the way, you might think that your spider is minimal or low-key enough that nobody's going to notice it. That's probably not the case. In fact, sites like Webmaster World (http://www.webmasterworld.com) have entire forums devoted to identifying and discussing spiders. Don't think that your spider is going to get ignored just because you're not using a thousand online servers and spidering millions of pages a day.
The first thing you want to do is name your spider. Choose a name that gives some kind of indication of what your spider's about and what it does. Examplebot isn't a good name. NewsImageScraper is better. If you're planning to do a lot of development, consider including a version number (such as NewsImageScraper/1.03).
If you're running several spiders, you might want to consider giving your spider a common name. For example, if Kevin runs different spiders, he might consider giving them a naming convention starting with disobeycom: disobeycomNewsImageScraper, disobeycomCamSpider, disobeycomRSSfeeds, and so on. If you establish your spiders as polite and well behaved, a webmaster who sees a spider named similarly to yours might give it the benefit of the doubt. On the other hand, if you program rude, bandwidth-sucking spiders, giving them similar names makes it easier for webmasters to ban 'em all (which you deserve).
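One way to keep such names consistent is to build them in a single place. The sketch below is only an illustration: the agent_name routine is hypothetical, and the 1.0 version number given to the disobeycom spiders is made up.

```perl
#!/usr/bin/perl -w
use strict;

# Build a spider name: optional shared prefix, descriptive name, version.
sub agent_name {
    my ($name, $version, $prefix) = @_;
    $prefix = '' unless defined $prefix;
    return "$prefix$name/$version";
}

# A single spider, named for what it does.
my $agent = agent_name('NewsImageScraper', '1.03');

# A family of spiders sharing one prefix (version number invented).
my @family = map { agent_name($_, '1.0', 'disobeycom') }
             qw(NewsImageScraper CamSpider RSSfeeds);

print "$agent\n";
print "$_\n" for @family;
```

If you fetch pages with LWP::UserAgent, the resulting string can be handed to its constructor as the agent attribute, for example LWP::UserAgent->new( agent => $agent ), so that it shows up as the User-Agent in the logs of the sites you visit.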
Preempting Discovery
Rather than await discovery, introduce yourself!
No matter how gentle and polite your spider is, sooner or later you're going to be noticed. Some webmaster's going to see what your spider is up to, and they're going to want some answers. Rather than wait for that to happen, why not take the initiative and make the first contact yourself? Let's look at the ways you can preempt discovery, make the arguments for your spider, and announce it to the world.
If you've written a great spider, why not tell the site about it? For a small site, this is relatively easy and painless: just look for the Feedback, About, or Contact links. For larger sites, though, figuring out whom to contact is more difficult. Try the technical contacts first, and then web feedback contacts. I've found that public relations contacts are usually best to reach last. Their addresses are usually easy to find, which makes them tempting, but PR folk like to concentrate on dealing with press people (which you're probably not) and they probably won't know enough programming to understand your request. (PR people, this isn't meant pejoratively. We still love you. Keep helping us promote O'Reilly books. Kiss, kiss.)
If you absolutely can't find anyone to reach out to, try these three steps:
  1. Many sites, especially technical ones, have employees with weblogs. See if you can find them via a Google search. For example, if you're looking for Yahoo! employees, the search "work for yahoo" (weblog | blog) does nicely. Sometimes, you can contact these people and let them know what you're doing, and they can either pass your email to someone who can approve it, or give you some other feedback.
  2. 99.9% of the time, an email to webmaster@ will work (e.g., webmaster@example.com). But it's not always guaranteed that anyone reads this email more than once a month, if at all.
  3. If you're absolutely desperate, you can't find email addresses or contact information anywhere on the site, and your emails to
Keeping Your Spider Out of Sticky Situations
You see tasty data here, there, and everywhere. Before you dive in, check the site's acceptable use policies.
Because the point of Spidering Hacks is to get to data that APIs can't (or haven't been created to) reach, sometimes you might end up in a legal gray area. Here's what you can do to help make sure you don't get anywhere near a "cease and desist" letter or the threat of a lawsuit.
Perhaps, one fine day, you visit a site and find some data you'd simply love to get your hands on. Before you start hacking, it behooves you to spend a little time looking around for an Acceptable Use Policy (AUP) or Terms of Service (TOS), or occasionally a Terms of Use (TOU), and familiarize yourself with what you can and can't do with the site itself and its underlying data. Usually, you'll find a link at the bottom of the home page, often along with the site's copyright information. Yahoo! has a Terms of Service link as almost the last entry on its front page, while Google's is at the bottom of their About page. If you can't find it on the front page, look at the corporate information or any About sections. In some cases, sites (mostly smaller ones) won't have them, so you should consider contacting the webmaster (just about always webmaster@sitename.com) and asking.
So, you've found the AUP or TOS. Just what is it you're supposed to be looking for? What you're after is anything that has to do with spidering or scraping data. In the case of eBay, their position is made clear with this excerpt from their User Agreement:
You agree that you will not use any robot, spider, scraper or other automated means to access the Site for any purpose without our express written permission.
Clear enough, isn't it? But sometimes it won't be this obvious. Some usage agreements don't make any reference whatsoever to spidering or scraping. In such cases, look for a contact address for the site itself or technical issues relating to its operation, and ask.
Finding the Patterns of Identifiers
If you find that the online database or resource you want uses unique identification numbers, you can stretch what it does by combining it with other sites and identification values.
Some online data collections are just that—huge collections, put together in one place, and relying on a search engine or database program to provide organization. These collections have no unique ID numbers, no rhyme or reason to their organization. But that's not always the case.
As more and more libraries put their collection information online, more and more records and pages have their own unique identification numbers.
So what? Here's what: when a web site uses an identifying method for its information that is recognized by other web sites, you can scrape data across multiple sites using that identifying method. For example, say you want to tour the country playing golf but you're afraid of pollution, so you want to play only in the cleanest areas. You could write a script that searches for golf courses at http://www.golfcourses.com, then takes the Zip Codes of the courses returned and checks them against http://www.scorecard.org to see which have the most (or least) polluted environment.
This is a silly example, but it shows how two different online data sources (a golf course database and an environmental pollution guide) can be linked together with a unique identifying number (a Zip Code, in this case).
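The linking step itself takes only a few lines. In this sketch, both data sets are invented and hardcoded, as if they had already been scraped from the two sites; the course names, Zip Codes, and pollution scores are all made up. The shared Zip Code is the key that joins them.

```perl
#!/usr/bin/perl -w
use strict;

# Made-up results, as if already scraped from a golf course database.
# Zip Codes are kept as strings so that leading zeros survive.
my @courses = (
    { name => 'Pine Hollow',  zip => '03301' },
    { name => 'Dune Links',   zip => '90210' },
    { name => 'Quarry Ridge', zip => '60601' },
);

# Made-up results from a pollution guide: Zip Code => score (lower is cleaner).
my %pollution = (
    '03301' => 12,
    '90210' => 71,
    '60601' => 55,
);

# Join the two sources on the Zip Code, then sort cleanest first.
my @ranked = sort { $pollution{ $a->{zip} } <=> $pollution{ $b->{zip} } }
             grep { exists $pollution{ $_->{zip} } } @courses;

printf "%-13s %s  score %d\n", $_->{name}, $_->{zip}, $pollution{ $_->{zip} }
    for @ranked;
```

The same pattern applies to any identifier that two sites both recognize: scrape each source into a structure keyed by that identifier, then look one up in the other.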
Speaking generally, there are three types of deliberate web data organization:
  • Arbitrary classification systems within a collection
  • Classification systems that use an established universal taxonomy within a collection
  • Classification systems that identify documents across a wide number of collections
An arbitrary classification system is either not based on an established taxonomy or only loosely based on an established taxonomy. If I give 10 photographs unique codes based on their topics and how much blue they have in them, I have established an arbitrary classification system.
Chapter 2: Assembling a Toolbox
The idea behind scraping sites often arises out of pure, immediate, and frantic desire: it's late at night, you've forgotten your son's soccer game for the twelfth time in a row, and you're vowing never to let it happen again. Sure, you could place a bookmark to the school calendar in your browser toolbar, but you want something even more insidious, something you couldn't possibly forget or grow accustomed to seeing.
A bit later, you've got a Perl script that automatically emails you every hour of every day that a game is scheduled. You've just made your life less forgetful, your computer more useful, and your son more loving. This is where spidering and scraping shine: when you've got an itch that can best be scratched by getting your computer involved. And if there's one programming language that can quickly scratch an itch better than any other, it's Perl.
Perl is renowned for "making easy things easier and hard things possible," earning the reputation of "Swiss Army chainsaw," "Internet duct tape," or the ultimate "glue language." Since it's a scripting language (as opposed to a compiled one, like C), rapid development is its modus operandi; throw together bits and pieces from code here and there, try it out, tweak, hem, haw, and deploy. Along with its immense repository of existing code (see CPAN, the Comprehensive Perl Archive Network, at http://www.cpan.org) and the uncanny ability to "do what you mean," it's a perfect language on which to base a spidering hacks book.
In this book, we're going to assume you have a rudimentary knowledge of Perl. You may not be much more than an acolyte, but we're hoping you can create something a little more advanced than "Hello, World." What we're not going to assume, however, is that you've done much, if any, network programming before. We still hear tales of those who have stayed away from Internet programming because they're scared of how difficult it might be.
Trust us, like a lot of things with Perl, it's a lot easier than you think. In this chapter, we'll devote a decent amount of time to getting you up to speed with what you need to know: installing the network access modules for Perl [Hack #8] and then learning how to use them, from the simplest query [Hack #9] on up to progress bars [Hack #18], faking HTTP headers [Hack #11], and so on.
Hacks #8-32
Perl Modules
You may have been tripped up by the word modules in the previous paragraph. Don't fret, dear reader. A module is simply an encapsulated bit of Perl code, written by someone else, that you employ in your own application. By leaving the implementation details and much of the dirty work to the module author, using a module rather than writing all the code yourself makes a complicated task far, far easier. When we say we're going to install a module, we really mean we're going to get a copy from CPAN (http://www.cpan.org), test to make sure it'll work in our environment, ensure it doesn't require other modules that we don't yet have, install it, and then prepare it for general use within our own scripts.
Sounds pretty complicated, right? Repeat ad infinitum: don't fret, dear reader, as CPAN has you covered. One of Perl's greatest accomplishments, CPAN is a large and well-categorized selection of modules created and contributed by hundreds of authors. Mirrored worldwide, there's a good chance your "I wish I had a . . . " wonderings have been placated, bug-tested, and packaged for your use.
Since CPAN is such a powerful accoutrement to the Perl language, the task of installing a module and ensuring its capabilities has been made far easier than the mumbo jumbo I uttered previously. We cover exactly how to install modules in our first hack of this chapter [Hack #8].
As you browse through this book, you'll see we use a number of noncore modules—where noncore is defined as "not already part of your Perl installation." Following are a few of the more popular ones you'll be using in your day-to-day scraping. Again, worry not if you don't understand some of this stuff; we'll cover it in time:
LWP
A package of modules for web access, also known as libwww-perl
LWP::Simple
Resources You May Find Helpful
If you're new to Perl, or if you would like to brush up on your skills, leaf through the following O'Reilly books, both well-respected additions to any Perl programmer's library:
  • The Perl Cookbook (http://www.oreilly.com/catalog/perlckbk2/) by Tom Christiansen and Nathan Torkington
  • Programming Perl (http://www.oreilly.com/catalog/pperl3/) by Larry Wall, Tom Christiansen, and Jon Orwant
Installing Perl Modules
A fair number of our hacks require modules not included with the standard Perl distribution. Here, we'll show you how to install these modules on Windows, Mac OS X, and Unix-based systems.
As you go through this book, you'll notice that we're constantly mentioning a variety of different Perl modules. Some of them aren't standard installation fare, so it's unlikely that you'll already have them available to your coding.
Why do we mention nonstandard modules? Quite simply, we didn't want to reinvent the wheel. We do some pretty odd toting and lifting in this book, and without the help of many of these modules we'd have to do a lot of extra coding (which means just that many more breakable bits in our scripts).
If you're new to Perl, however, you may be feeling intimidated by the idea of installing modules. Don't worry; it's a snap! If you're running ActiveState Perl for Windows, you'll want to use the Programmer's Package Manager (PPM). Otherwise, you'll use CPAN.
LWP is used throughout this book, because it's the workhorse for any Perl script with designs on interacting with the Internet. Some Perl installations already have it installed; others don't. We'll use it here as an example of how to install a typical Perl module; the steps apply to almost any available noncore module that we use in our other hacks—and, indeed, that you may encounter in your ongoing programming.
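Before installing anything, it can be worth checking what you already have. The following is a minimal sketch of one way to do that; the module_version routine is our own helper, not part of Perl or LWP. It simply tries to load each module and reports the version if the load succeeds.

```perl
#!/usr/bin/perl -w
use strict;

# Returns the module's version if it loads, '' if it loads but declares
# no version, and undef if it isn't installed (or fails to load).
sub module_version {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # LWP::Simple -> LWP/Simple.pm
    return undef unless eval { require $file; 1 };
    my $version = $module->VERSION;
    return defined $version ? $version : '';
}

# strict ships with Perl; the LWP modules may or may not be present.
for my $module (qw(strict LWP LWP::Simple)) {
    my $version = module_version($module);
    my $status  = defined $version ? "installed ($version)" : "not installed";
    print "$module: $status\n";
}
```

If LWP shows up as not installed, carry on with the installation steps in this hack.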
Usually, the easiest way to install a Perl module is via another Perl module. The CPAN module, included with just about every modern Perl distribution, automates the installation of Perl modules, fetching components and any prerequisites and building the whole kit and kaboodle for you on the fly.
CPAN installs modules into standard system-wide locations and, therefore, assumes you're running as the root user. If you have no more than regular user access, you'll have to install your module by hand (see Unix and Mac OS X installation by hand later in this hack).
Simply Fetching with LWP::Simple
Suck web content easily using the aptly named LWP::Simple.
LWP (short for "Library for WWW in Perl") is a popular group of Perl modules for accessing data on the Web. Like most Perl module distributions, each of LWP's component modules comes with documentation that is a complete reference to its interface. However, there are so many modules in LWP that it's hard to know where to look for information on doing even the simplest things.
Introducing you to all aspects of using LWP would require a whole book—a book that just so happens to exist, mind you (see Sean Burke's Perl & LWP at http://oreilly.com/catalog/perllwp/).
If you just want to access a particular URL, the simplest way to do so is to use LWP::Simple's functions. In a Perl program, you can simply call its get($url) routine, where $url is the location of the content you're interested in. LWP::Simple will try to fetch the content at the end of the URL. If it's successful, you'll be handed the content; if there's an error of some sort, the get function will return undef, the undefined value. The get represents an aptly named HTTP GET request, which reads as "get me the content at the end of this URL":
#!/usr/bin/perl -w
use strict;
use LWP::Simple;

# Just an example: the URL for the most recent /Fresh Air/ show 
my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';

my $content = get($url);
die "Couldn't get $url" unless defined $content;

# Do things with $content:
if ($content =~ m/jazz/i) {
    print "They're talking about jazz today on Fresh Air!\n";
} else { print "Fresh Air is apparently jazzless today.\n"; }
A handy variant of get is getprint, useful in Perl one-liners. If it can get the page whose URL you provide, it sends it straight to STDOUT; otherwise, it complains to STDERR—both usually are your screen:
% perl -MLWP::Simple -e "getprint 'http://cpan.org/RECENT'"
MIRRORED.BY
MIRRORING.FROM
RECENT
RECENT.html
SITES
SITES.html
authors/00whois.html
authors/01mailrc.txt.gz
authors/id/A/AB/ABW/CHECKSUMS
authors/id/A/AB/ABW/Pod-POM-0.17.tar.gz
...
More Involved Requests with LWP::UserAgent
Knowing how to download web pages is great, but it doesn't help us when we want to submit forms, fake browser settings, or get more information about our request. Here, we'll jump into the more useful LWP::UserAgent.
LWP::Simple's functions [Hack #9] are handy for simple cases, but they don't support cookies or authorization; they don't support setting header lines in the HTTP request; and, generally, they don't support reading header lines in the HTTP response (most notably, the full HTTP error message, in case of problems). To get at all those features, you'll have to use the full LWP class model.
While LWP consists of dozens of classes, the two that you have to understand are LWP::UserAgent and HTTP::Response. LWP::UserAgent is a class for virtual browsers, which you use for performing requests, and HTTP::Response is a class for the responses (or error messages) that you get back from those requests.
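To get a feel for HTTP::Response before any request is made, you can build one by hand and poke at it. This is only a sketch: the 404 status, the header, and the body are made-up values, and in real code the object comes back from the browser rather than from HTTP::Response->new.

```perl
#!/usr/bin/perl -w
use strict;
use HTTP::Response;

# Build a response by hand (made-up values); normally a request
# through an LWP::UserAgent object hands you one of these.
my $response = HTTP::Response->new(404, 'Not Found');
$response->header('Content-Type' => 'text/html; charset=iso-8859-1');
$response->content('<html><body>No such page.</body></html>');

my $status = $response->status_line;    # numeric code plus message
my $type   = $response->content_type;   # media type only, no charset

print "Status:  $status\n";
print "Type:    $type\n";
print "Success? ", ($response->is_success ? "yes" : "no"), "\n";
```

The same accessors (status_line, content_type, is_success) work on a response that came from a live request.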
The basic idiom is $response = $browser->get($url), like so:
#!/usr/bin/perl -w
use strict;
use LWP 5.64; # Loads all important LWP classes, and makes
              # sure your version is reasonably recent.

my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';

my $browser = LWP::UserAgent->new;
my $response = $browser->get( $url );
die "Can't get $url -- ", $response->status_line
   unless $response->is_success;

die "Hey, I was expecting HTML, not ", $response->content_type
   unless $response->content_type eq 'text/html';
   # or whatever content-type you're dealing with.

# Otherwise, process the content somehow:
if ($response->content =~ m/jazz/i) {
    print "They're talking about jazz today on Fresh Air!\n";
} else {print "Fresh Air is apparently jazzless today.\n"; }
There are two objects involved: $browser, which holds an object of the class LWP::UserAgent, and the $response object, which is of the class HTTP::Response. You really need only one browser object per program; but every time you make a request, you get back a new
Adding HTTP Headers to Your Request
Add more functionality to your programs, or mimic common browsers, to circumvent server-side filtering of unknown user agents.
The most commonly used syntax for LWP::UserAgent requests is $response = $browser->get($url), but in truth you can add extra HTTP header lines to the request by adding a list of key/value pairs after the URL, like so:
$response = $browser->get( $url, $key1, $value1, $key2, $value2, ... );
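To see what such key/value pairs turn into, you can build the request without sending it. The sketch below uses HTTP::Request::Common's GET function, which accepts the same URL-plus-pairs argument list; the URL and header values are placeholders chosen for illustration.

```perl
#!/usr/bin/perl -w
use strict;
use HTTP::Request::Common qw(GET);

# Hypothetical URL and header values, chosen only for illustration.
my $url = 'http://www.example.com/';
my $request = GET($url,
    'Accept-Language' => 'en-US',
    'Accept-Charset'  => 'iso-8859-1,*',
);

# Each key/value pair has become one header line in the request.
print $request->as_string;

my $language = $request->header('Accept-Language');
my $method   = $request->method;
```

A request object built this way can be handed to $browser->request($request) when you're ready to send it.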
Why is adding HTTP headers sometimes necessary? It really depends on the site that you're pulling data from; some will respond only to actions that appear to come from common end-user browsers, such as Internet Explorer, Netscape, Mozilla, or Safari. Others, in a desperate attempt to minimize bandwidth costs, will send only compressed data [Hack #16], requiring decoding on the client end. All these client necessities can be enabled through the use of HTTP headers. For example, here's how to send more Netscape-like headers:
my @ns_headers = (
    'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, '
              . 'image/pjpeg, image/png, */*',
    'Accept-Charset' => 'iso-8859-1,*',
    'Accept-Language' => 'en-US',
);

$response = $browser->get($url, @ns_headers);
Or, alternatively, without the interim array:
$response = $browser->get($url,
    'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, '
              . 'image/pjpeg, image/png, */*',
    'Accept-Charset' => 'iso-8859-1,*',
    'Accept-Language' => 'en-US',
);
In these headers, you're telling the remote server which types of data you're willing to Accept and in what order: GIFs, bitmaps, JPEGs, PNGs, and then anything else (you'd rather have a GIF first, but an HTML file is fine if the server can't provide the data in your preferred formats). For servers that cater to international users by offering translated documents, the Accept-Language and
Posting Form Data with LWP
Automate form submission, whether username and password authentication, supplying your Zip Code for location-based services, or simply filling out a number of customizable fields for search engines.
Say you search Google for three blind mice. Your result URL will vary depending on the preferences you've set, but it will look something like this:
http://www.google.com/search?num=100&hl=en&q=%22three+blind+mice%22
The query itself turns into an ungodly mess, &q=%22three+blind+mice%22, but why? Whenever you send data through a form submission, that data has to be encoded so that it can safely arrive at its destination, the server, intact. Characters like spaces and quotes—in essence, anything not alphanumeric—must be turned into their encoded equivalents, like + and %22. LWP will automatically handle most of this encoding (and decoding) for you, but you can request it at will with URI::Escape's uri_escape and uri_unescape functions.
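Here's a small sketch of both routes: escaping a value by hand with URI::Escape, and letting the URI class's query_form method build the whole query string. Note one difference: uri_escape writes a space as %20, while form-style query strings use +.

```perl
#!/usr/bin/perl -w
use strict;
use URI;
use URI::Escape qw(uri_escape uri_unescape);

my $phrase = '"three blind mice"';

# Escape by hand: quotes become %22 and spaces become %20.
my $escaped  = uri_escape($phrase);
my $restored = uri_unescape($escaped);
print "$escaped\n$restored\n";

# Or let URI build the query string from key/value pairs;
# in form data, spaces are written as + instead of %20.
my $uri = URI->new('http://www.google.com/search');
$uri->query_form( num => 100, hl => 'en', q => $phrase );
print "$uri\n";
```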
Let's break down what those other bits in the URL mean.
num=100 refers to the number of search results to a page, 100 in this case. Google accepts any number from 10 to 100. Altering the value of num in the URL and reloading the page is a nice shortcut for changing the size of your result set without having to meander over to the Advanced Search (http://www.google.com/advanced_search?hl=en) and rerun your query.
hl=en means that the language interface (the language in which you use Google, reflected in the home page, messages, and buttons) is English. Google's Language Tools (http://www.google.com/language_tools?hl=en) provide a list of language choices.
The three variables q, num, and hl and their associated values represent a GET form request; you can always tell when you have one by the URL in your browser's address bar, where you'll see the URL, then a question mark (?), followed by key/value pairs separated by ampersands (&). To run the same search from within
Authentication, Cookies, and Proxies
Access restricted resources programmatically by supplying proper authentication tokens, cookies, or proxy server information.
Accessing public resources assumes that you have the correct privileges to do so. The vast majority of sites you encounter every day on the Web are usually wide open to any visitor anxious to satisfy his browsing desires. Some sites, however, require password authentication before you're allowed in. Still others will give you a special file called a cookie, without which you'll not get any further. And sometimes, your ISP or place of work may require that you use a proxy server, a sort of handholding middleman that preprocesses everything you view. All three of these techniques will break any LWP::UserAgent [Hack #10] code we've previously written.
Many web sites restrict access to documents by using HTTP Authentication, a mechanism whereby the web server sends the browser an HTTP code that says "You are entering a protected realm, accessible only by rerequesting it along with some special authorization headers." Your typical web browser deals with this request by presenting you with a username/password prompt, as shown in Figure 2-1, passing whatever you enter back to the web server as the appropriate authentication headers.
Figure 2-1: A typical browser authentication prompt
For example, the Unicode.org administrators stop email-harvesting bots from spidering the contents of their mailing list archives by protecting them with HTTP Authentication and then publicly stating the username and password (at http://www.unicode.org/mail-arch/)—namely, username "unicode-ml" and password "unicode".
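As a rough sketch of what those "special authorization headers" look like, the following builds a request and attaches the publicly stated credentials by hand. It assumes the server uses the Basic authentication scheme, which the text doesn't confirm, and it only constructs the request without sending it.

```perl
#!/usr/bin/perl -w
use strict;
use HTTP::Request;
use MIME::Base64 qw(decode_base64);

# The publicly stated Unicode.org credentials from the text above.
my ($user, $pass) = ('unicode-ml', 'unicode');

my $request = HTTP::Request->new(
    GET => 'http://www.unicode.org/mail-arch/');
$request->authorization_basic($user, $pass);   # assumes the Basic scheme

# The header is "Basic " plus base64("username:password").
my $header  = $request->header('Authorization');
my ($token) = $header =~ /^Basic (\S+)$/;
my $decoded = decode_base64($token);
print "Authorization: $header\n";
print "Decodes to:    $decoded\n";
```

Base64 is an encoding, not encryption, so anyone who can see the header can recover the password. In everyday code you would more likely register the username and password once with the browser object's credentials method and let LWP answer the server's challenge for you.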
Consider this URL, part of the protected area of the web site:
http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html
If you access this URL with a browser, you'll be prompted to "Enter username and password for `Unicode-MailList-Archives' at server `www.unicode.org'". Attempting to access this URL via
Handling Relative and Absolute URLs
Glean the full URL of any relative reference, such as "sample/index.html" or "../../images/flowers.gif", by using the helper functions of URI.
Occasionally, when you're parsing HTML or accepting command-line input, you'll receive a relative URL, something that looks like images/bob.jpg instead of the more specific http://www.example.com/images/bob.jpg. The longer version, called the absolute URL, is more desirable for parsing and display, as it ensures that no confusion can arise over where a resource is located.
The URI class provides all sorts of methods for accessing and modifying parts of URLs (such as asking what sort of URL it is with $url->scheme, asking which host it refers to with $url->host, and so on, as described in the docs for the URI class). However, the methods of most immediate interest are the query_form method [Hack #12] and the new_abs method for taking a URL string that is most likely relative and getting back an absolute URL, as shown here:
use URI;
my $abs = URI->new_abs($maybe_relative, $base);
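With concrete values plugged in, a minimal sketch looks like this. The base URL is a made-up page on www.example.com; the relative paths are the ones mentioned earlier in this hack.

```perl
#!/usr/bin/perl -w
use strict;
use URI;

# Hypothetical base: the page on which the links were found.
my $base = 'http://www.example.com/sample/deep/index.html';

my @absolute;
foreach my $maybe_relative (
    'images/bob.jpg',                   # relative to the base's directory
    '../../images/flowers.gif',         # climbs two directories first
    'http://www.cpan.org/RECENT.html',  # already absolute: left alone
) {
    my $abs = URI->new_abs($maybe_relative, $base);
    push @absolute, "$abs";
    print "$maybe_relative => $abs\n";
}
```

Because a URL that is already absolute passes through unchanged, it's safe to run every link you find through new_abs without checking first.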
For example, consider the following simple program, which scrapes for URLs in the HTML list of new modules available at your local CPAN mirror:
#!/usr/bin/perl -w
use strict;
use LWP 5.64;

my $browser = LWP::UserAgent->new;
my $url = 'http://www.cpan.org/RECENT.html';
my $response = $browser->get($url);

die "Can't get $url -- ", $response->status_line
  unless $response->is_success;

my $html = $response->content;
while( $html =~ m/<A HREF=\"(.*?)\"/g ) { 
    print "$1\n"; 
}
It returns a list of relative URLs for Perl modules and other assorted files:
% perl get_relative.pl
MIRRORING.FROM
RECENT
RECENT.html
authors/00whois.html
authors/01mailrc.txt.gz
authors/id/A/AA/AASSAD/CHECKSUMS
...
However, if you actually want to retrieve those URLs, you'll need to convert them from relative (e.g., authors/00whois.html) to absolute (e.g., http://www.cpan.org/authors/00whois.html). The URI module's new_abs
Secured Access and Browser Attributes
If you're planning on accessing secured resources, such as your online banking, intranet, or the like, you'll need to send and receive data over a secured LWP connection.
Some sites are purveyors of such important data that simple password authentication doesn't provide the security necessary. A banking site, for instance, will use a username and password system to ensure you are who you say you are, but it will also encrypt all the traffic between your computer and its servers. By doing so, it ensures that a malicious user can't "sniff" the data you're transmitting back and forth: credit card information, account histories, and social security numbers. To prevent this unwanted snooping, the server installs an SSL (Secure Sockets Layer) certificate, a contract of sorts between your browser and the web server, agreeing on how to hide the data passed back and forth.
You can tell a secured site by its URL: it will start with https://.
When you access an HTTPS URL, it'll work for you just like an HTTP URL, but only if your LWP installation has HTTPS support (via an appropriate SSL library). For example:
#!/usr/bin/perl -w
use strict;
use LWP 5.64;

my $url = 'https://www.paypal.com/';   # Yes, HTTPS!
my $browser = LWP::UserAgent->new;
my $response = $browser->get($url);

die "Error at $url\n ", $response->status_line,
    "\n Aborting" unless $response->is_success;

print "Whee, it worked!  I got that ",
    $response->content_type, " document!\n";
If your LWP installation doesn't yet have HTTPS support installed, the script's response will be unsuccessful and you'll receive this error message:
Error at https://www.paypal.com/
   501 Protocol scheme 'https' is not supported
If your LWP installation does have HTTPS support installed, then the response should be successful and you should be able to consult $response just as you would any normal HTTP response [Hack #10].
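If you'd rather find out before making a request, LWP::UserAgent can be asked which URL schemes it is able to handle. This is a small sketch using its is_protocol_supported method; what it reports for https depends entirely on what is installed on your machine.

```perl
#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Ask LWP which URL schemes it can handle on this machine.
my %supported;
foreach my $scheme (qw(http https)) {
    $supported{$scheme} = $browser->is_protocol_supported($scheme) ? 1 : 0;
    print "$scheme: ",
        ($supported{$scheme} ? "supported" : "not supported"), "\n";
}
```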
For information about installing HTTPS support for your
Respecting Your Scrapee's Bandwidth
Be a better Net citizen by reducing load on remote sites, either by ensuring you're downloading only changed content, or by supporting compression.
Everybody has bills, and the more services you partake in, the higher those bills become. It's a blatantly obvious concept, but one that is easily forgotten when you're writing a scraper. See, when you're physically sitting at your computer, clicking through a site's navigation with your browser, you're an active user: sites love you and they want your traffic but, more importantly, your eyeballs.
With a spider, there are no eyeballs; you run a command line, then go watch the latest anime fansub. Behind the scenes, your spider could be making hundreds or thousands of requests. Of course, it depends on what your spider actually purports to solve, but the fact remains: it's an automated process, and one which could be causing the remote site additional bandwidth costs.
It doesn't have to be this way. In this hack, we'll demonstrate three different ways you can save some bandwidth (both for the site, and for your own rehandling of data you've already seen). The first two methods compare metadata you've saved previously with server data; the last covers compression.
In "Adding HTTP Headers to Your Request" [Hack #11], we learned how to fake our User-Agent or add a Referer to get past certain server-side filters. HTTP headers aren't always used for subversion, though, and If-Modified-Since is a perfect example of one that isn't. The following script downloads a web page and returns the Last-Modified HTTP header, as reported by the server:
#!/usr/bin/perl -w
use strict;
use LWP 5.64;
use HTTP::Date;

my $url = 'http://disobey.com/amphetadesk/';
my $browser = LWP::UserAgent->new;
my $response = $browser->get( $url );
print "Got: ", $response->status_line;

print "\n". "Epoch: " . $response->last_modified . "\n";
print "English: " . time2str($response->last_modified) . "\n";
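Once you've saved that Last-Modified value, you can hand it back to the server as an If-Modified-Since header on your next visit. The following is only a sketch of building such a request: the saved timestamp is a made-up value standing in for whatever your previous run stored, and nothing is actually sent.

```perl
#!/usr/bin/perl -w
use strict;
use HTTP::Date qw(time2str str2time);
use HTTP::Request::Common qw(GET);

# Hypothetical saved value: the epoch seconds of the Last-Modified
# header that a previous run stored for this URL.
my $url        = 'http://disobey.com/amphetadesk/';
my $last_fetch = 1057017600;

# Ask for the page only if it has changed since then.
my $request = GET($url, 'If-Modified-Since' => time2str($last_fetch));
print $request->as_string;

# The HTTP date format round-trips back to the same epoch value.
my $stamp     = $request->header('If-Modified-Since');
my $roundtrip = str2time($stamp);
```

Send it with $browser->request($request). If the page hasn't changed, a server that honors the header should answer with status 304 and no body, so checking $response->code for 304 tells you whether there's anything new to process.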
Respecting robots.txt
The robots.txt file is a bastion of fair play, allowing a site to restrict what visiting scrapers are allowed to see and do or, indeed, keep them out entirely. Play fair by respecting their requests.
If you've ever built your own web site, you may have come across something called a robots.txt file (http://www.robotstxt.org), a magical bit of text that you, as web developer and site owner, can create to control the capabilities of third-party robots, agents, scrapers, spiders, or what have you. Here is an example of a robots.txt file that blocks any robot's access to three specific directories:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
Applications that understand your robots.txt file will resolutely abstain from indexing those parts of your site, or they'll leave dejectedly if you deny them outright, as per this example:
User-agent: *
Disallow: /
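To see how a well-behaved client might interpret rules like these, here's a small sketch using WWW::RobotRules, a CPAN module for parsing robots.txt files. The bot name and the www.example.com site are hypothetical, and the rules are fed in as a string so that nothing is fetched from the network.

```perl
#!/usr/bin/perl -w
use strict;
use WWW::RobotRules;

# Hypothetical bot name and site, used only for illustration.
my $rules = WWW::RobotRules->new('SuperBot/1.34');
my $robots_txt = join "\n",
    'User-agent: *',
    'Disallow: /cgi-bin/',
    'Disallow: /tmp/',
    'Disallow: /private/',
    '';
$rules->parse('http://www.example.com/robots.txt', $robots_txt);

# Check a couple of paths against the parsed rules.
my %verdict;
foreach my $path ('/index.html', '/private/diary.html') {
    $verdict{$path} = $rules->allowed("http://www.example.com$path");
    print "$path: ", ($verdict{$path} ? "allowed" : "disallowed"), "\n";
}
```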
If you're planning on releasing your scraper or spider into the wild, it's important that you make every possible attempt to support robots.txt. Its power comes solely from the number of clients that choose to respect it. Thankfully, with LWP, we can rise to the occasion quite simply.
If you want to make sure that your LWP-based program respects robots.txt, you can use the LWP::RobotUA class (http://search.cpan.org/author/GAAS/libwww-perl/lib/LWP/RobotUA.pm) instead of LWP::UserAgent. Doing so also ensures that your script doesn't make requests too many times a second, saturating the site's bandwidth unnecessarily. LWP::RobotUA is just like LWP::UserAgent, and you can use it like so:
use LWP::RobotUA;

# Your bot's name and your email address
my $browser = LWP::RobotUA->new('SuperBot/1.34', 'you@site.com');
my $response = $browser->get($url);
If the robots.txt file on $url's server forbids you from accessing $url, then the $browser object (assuming it's of the class LWP::RobotUA) won't actually request it, but instead will give you back (in $response) a 403 error with a message "Forbidden by robots.txt." Trap such an eventuality like so:
Adding Progress Bars to Your Scripts
Give a visual indication that a download is progressing smoothly.
With all this downloading, it's often helpful to have some visual representation of its progress. In most of the scripts in this book, there's always a bit of visual information being displayed to the screen: that we're starting this URL here, processing this data there, and so on. These helpful bits usually come before or after the actual data has been downloaded. But what if we want visual feedback while we're in the middle of a large MP3, movie, or database leech?
If you're using a fairly recent vintage of the LWP library, you'll be able to interject your own subroutine to run at regular intervals during download. In this hack, we'll show you four different ways of adding various types of progress bars to your current applications. To get the most from this hack, you should have ready a URL that's roughly 500 KB or larger; it'll give you a good chance to see the progress bar in action.
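The hook for this is LWP's ':content_cb' option, which names a subroutine to be called with each chunk of data as it arrives. Before working against a real download, you can watch the mechanism in isolation. In this sketch, a 10,000-byte temporary file stands in for your large URL and is fetched through a file: URL, an assumption made purely so the example runs without network access.

```perl
#!/usr/bin/perl -w
use strict; $|++;
use LWP::UserAgent;
use File::Temp qw(tempfile);
use URI::file;

# Stand-in for a large download: a 10,000-byte temporary file, served
# through a file: URL so the sketch runs without network access.
my ($fh, $path) = tempfile(UNLINK => 1);
print $fh 'x' x 10_000;
close $fh;
my $url = URI::file->new_abs($path);

my ($final_data, $chunks) = ('', 0);

# LWP calls this once per chunk instead of storing the content itself.
sub callback {
    my ($data, $response, $protocol) = @_;
    $final_data .= $data;
    $chunks++;
    print ".";   # the visual heartbeat
}

my $ua = LWP::UserAgent->new;
my $response = $ua->get($url, ':content_cb' => \&callback);
print "\nGot ", length($final_data), " bytes in $chunks chunk(s).\n";
```

Because the callback takes over the data, $response->content stays empty; anything you want to keep has to be collected by the callback itself, as $final_data is here.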
The first progress bar is the simplest, providing only a visual heartbeat so that you can be sure things are progressing and not just hanging. Save the following code to a file called progress_bar.pl and run it from the command line as perl scriptname URL, where URL is the online location of your appropriately large piece of sample data:
#!/usr/bin/perl -w
#
# Progress Bar: Dots - Simple example of an LWP progress bar.
# http://disobey.com/d/code/ or contact morbus@disobey.com.
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#

use strict; $|++;
my $VERSION = "1.0";

# make sure we have the modules we need, else die peacefully.
eval("use LWP 5.6.9;");  die "[err] LWP 5.6.9 or greater required.\n" if $@;

# now, check for passed URLs for downloading.
die "[err] No URLs were passed for processing.\n" unless @ARGV;

# our downloaded data.
my $final_data = undef;

# loop through each URL.
foreach my $url (@ARGV) {
   print "Downloading URL at ", substr($url, 0, 40), "... ";

   # create a new useragent and download the actual URL.
   # all the data gets thrown into $final_data, which
   # the callback subroutine appends to.
   my $ua = LWP::UserAgent->new(  );
   my $response = $ua->get($url, '
Scraping with HTML::TreeBuilder
One of many popular HTML parsers available in Perl, HTML::TreeBuilder approaches the art of HTML parsing as a parent/child relationship.
Sometimes regular expressions [Hack #23] won't get you all the way to the data you want and you'll need to use a real HTML parser. CPAN has a few of these, the main two being HTML::TreeBuilder and HTML::TokeParser [Hack #20], both of which are friendly façades for HTML::Parser. This hack covers the former.
The Tree in TreeBuilder represents a parsing ideology: trees are a good way to represent HTML. The <head> tag is a child of the <html> tag. The <title> and <met