Chapter 1. Walking Softly

Hacks #1-7

With over three billion pages on the Web, serious surfers eventually find themselves asking two questions: where’s the good stuff and what can I do with it? Everyone has their own idea of what the “good stuff” is, and most people come up with some creative idea of what to do once they find it. In some corners of the Web, repurposing data in interesting ways is encouraged: it inspires those “Eureka!” moments when unusual information combinations bubble forth unimpeded.

From the Web’s standpoint, the utility of universally accessible data has only recently been broached. Once Google opened their search listings via an API (see Google Hacks), Amazon.com quickly followed (see Amazon Hacks), and both have benefited by the creative utilities that have resulted. In this short and sweet chapter, we’ll introduce you to the fine art of scraping and spidering: what they are and aren’t, what’s most likely allowed and what might create risk, finding alternative avenues to your desired data, and how to reassure—and, indeed, educate—webmasters who spot your automation and wonder what you’re up to.

Hack #1. A Crash Course in Spidering and Scraping

A few of the whys and wherefores of spidering and scraping.

There is a wide and ever-increasing variety of computer programs gathering and sifting information, aggregating resources, and comparing data. Humans are just one part of a much larger and automated equation. But despite the variety of programs out there, they all have some basic characteristics in common.

Spiders are programs that traverse the Web, gathering information. If you’ve ever taken a gander at your own web site’s logs, you’ll see them peppered with User-Agent names like Googlebot, Scooter, and MSNbot. These are all spiders—or bots, as some prefer to call them.

Throughout this book, you’ll hear us referring to spiders and scrapers. What’s the difference? Broadly speaking, they’re both programs that go out on the Internet and grab things. For the purposes of this book, however, it’s probably best for you to think of spiders as programs that grab entire pages, files, or sets of either, while scrapers grab very specific bits of information within these files. For example, one of the spiders [Hack #44] in this book grabs entire collections of Yahoo! Group messages to turn into mailbox files for use by your email application, while one of the scrapers [Hack #76] grabs train schedule information. Spiders follow links, gathering up content, while scrapers pull data from web pages. Spiders and scrapers usually work in concert; you might have a program that uses a spider to follow links but then uses a scraper to gather particular information.
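
To make the distinction concrete, here’s a minimal sketch of both halves in Perl, using the CPAN module WWW::Mechanize [Hack #22] against a placeholder URL and User-Agent of our own invention: the spidering half collects the links it could follow next, and the scraping half pulls out one specific piece of data, the page title.

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( agent => 'ExampleSpider/0.1' );
$mech->get('http://www.example.com/');
die "Couldn't fetch the page\n" unless $mech->success;

# The spidering half: gather up the links we could follow next.
print "Found link: ", $_->url_abs, "\n" for $mech->links;

# The scraping half: pull one specific piece of data from the page.
my ($title) = $mech->content =~ m{<title>(.*?)</title>}is;
print "Page title: $title\n" if defined $title;

A real spider would push those links onto a queue and visit them in turn, politely; the hacks later in the book flesh out both halves.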

Why Spider?

When learning about a technology or way of using technology, it’s always good to ask the big question: why? Why bother to spider? Why take the time to write a spider, make sure it works as expected, get permission from the appropriate site’s owner to use it, make it available to others, and spend time maintaining it? Trust us; once you’ve started using spiders, you’ll find no end to the ways and places they can be used to make your online life easier:

Gain automated access to resources

Sure, you can visit every site you want to keep up with in your web browser every day, but wouldn’t it be easier to have a program do it for you, passing on only content that should be of interest to you? Having a spider bring you the results of a favorite Google search can save you a lot of time, energy, and repetitive effort. The more you automate, the more time you can spend having fun with and making use of the data.

Gather information and present it in an alternate format

Gather marketing research in the form of search engine results and import them into Microsoft Excel for use in presentations or tracking over time [Hack #93]. Grab a copy of your favorite Yahoo! Groups archive in a form your mail program can read just like the contents of any other mailbox [Hack #43]. Keep up with the latest on your favorite sites without actually having to pay them a visit one after another [Hack #81]. Once you have raw data at your disposal, it can be repurposed, repackaged, and reformatted to your heart’s content.

Aggregate otherwise disparate data sources

No web site is an island, but you wouldn’t know it, given the difficulty of manually integrating data across various sites. Spidering automates this drudgery, providing a 15,000-foot view of otherwise disparate data. Watch Google results change over time [Hack #93] or combine syndicated content [Hack #69] from multiple weblogs into one RSS feed. Spiders can be trained to aggregate data, both across sources and over time.

Combine the functionalities of sites

There might be a search engine you love, but which doesn’t do everything you want. Another fills in some of those gaps, but doesn’t fill the need on its own. A spider can bridge the gap between two such resources [Hack #48], querying one and providing that information to another.

Find and gather specific kinds of information

Perhaps what you seek needs to be searched for first. A spider can run web queries on your behalf, filling out forms and sifting through the results [Hack #51].

Perform regular webmaster functions

Let a spider take care of the drudgery of daily webmastering. Have it check your HTML to be sure it is standards-compliant and tidy (http://tidy.sourceforge.net/), that your links aren’t broken, or that you’re not linking to any prurient content.

For more detail on spiders, robots, crawlers, and scrapers, visit the Web Robot FAQ at http://www.robotstxt.org/wc/faq.html.

Hack #2. Best Practices for You and Your Spider

Some rules for the road as you’re writing your own well-behaved spider.

In order to make your spider as effective, polite, and useful as possible, there are some general things you’ll have to keep in mind as you create it.

Be Liberal in What You Accept

To spider, you must pull information from a web site. To pull information from a web site, you must wade your way through some flavor of tag soup, be it HTML, XML, plain text, or something else entirely. This is an inexact science, to put it mildly. If even one tag or bit of file formatting changes, your spider will probably break, leaving you dataless until such time as you retool. Thankfully, most sites aren’t doing huge revamps every six months like they used to, but they still change often enough that you’ll have to watch out for this.

To minimize the fragility of your scraping, use as little boundary data as you can when gleaning data from the page. Boundary data is the fluff around the actual goodness you want: the tags, superfluous verbiage, spaces, newlines, and such. For example, the title of an average web page looks something like this:

<title>This is the title</title>

If you’re after the title, the boundary data is the <title> and </title> tags.
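
In Perl terms (assuming the fetched page is already in a variable called $html, which is our own placeholder), the difference between fragile and sturdy scraping looks something like this:

# Fragile: leans on surrounding markup that's likely to change in a redesign.
my ($title) = $html =~ m{<head>\s*<title>(.*?)</title>\s*<meta}is;

# Sturdier: uses only the <title> tags themselves as boundary data.
($title) = $html =~ m{<title>(.*?)</title>}is;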

Monitor your spider’s output on a regular basis to make sure it’s working as expected [Hack #31], make the appropriate adjustments as soon as possible to avoid losing ground with your data gathering, and design your spider to be as adaptive to site redesigns [Hack #32] as possible.

Don’t Limit Your Dataset

Just because you’re working with the Web doesn’t mean you’re restricted to spidering HTML documents. If you’re considering only web pages, you’re potentially narrowing your dataset arbitrarily. There are images, sounds, movies, PDFs, text files—all worthy of spidering for your collection.

Don’t Reinvent the Wheel

While it’s tempting to think what you’re up to is unique, chances are, someone’s already spidered and scraped the same or similar sites, leaving clear footprints in the form of code, raw data, or instructions.

CPAN (http://www.cpan.org), the Comprehensive Perl Archive Network, is a treasure trove of Perl modules for programming to the Internet, shuffling through text in search of data, manipulating gleaned datasets—all the functionality you’re bound to be building into your spider. And these modules are free to take, use, alter, and augment. Who knows, by the time you finish your spider, perhaps you’ll end up with a module or three of your own to pass on to the next guy.

Before you even start coding, check the site to make sure you’re not spending an awful lot of effort building something the site already offers. If you want a weather forecast delivered to your email inbox every morning, check your local newspaper’s site or sites like weather.com (http://www.weather.com) to see if they offer such a service; they probably do. If you want the site’s content as an RSS feed and they don’t appear to sport that orange “XML” button, try a Google search for it (rss site:example.com ( filetype:rss | filetype:xml | filetype:rdf )) or check Syndic8 (http://www.syndic8.com) for an original or scraped version.

Then, of course, you can always contact the site owner, asking him if he has a particular service or data format available for public consumption. Your query might just be the one that convinces him that an RSS feed of or web service API to his content is a good idea.

See [Hack #100] for more pointers on scraping resources and communities.

Best Practices for You

Just as it is important to follow certain rules when programming your spider, it’s important to follow certain rules when designing it as well.

Choose the most structured format available

HTML files are fairly unstructured, focusing more on presentation than on the underlying raw data. Often, sites have more than one flavor of their content available; look or ask for an XHTML or XML version, which is a cleaner and more structured file format. RSS, a simple form of XML, is everywhere.

If you must scrape HTML, do so sparingly

If the information you want is available only embedded in an HTML page, try to find a “Text Only” or “Print this Page” variant; these usually have far less complicated HTML and a higher content-to-presentation markup quotient, and they don’t tend to change all that much (by comparison) during site redesigns.

Regardless of what you eventually use as your source data, try to scrape as little HTML surrounding the information you want as possible. You want just enough HTML to uniquely identify the information you desire. The less HTML, the less fragile your spider will be. See “Anatomy of an HTML Page” [Hack #3] for more information.

Use the right tool for the job

Should you scrape the page using regular expressions? Or would a more comprehensive tool like WWW::Mechanize [Hack #22] or HTML::TokeParser [Hack #20] fit the bill better? This depends very much on the data you’re after and the crafting of the page’s HTML. Is it handcrafted and irregular, or is it tool-built and regular as a bran muffin? Choose the simplest and least fragile method for the job at hand—with an emphasis on the latter.
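
For instance, if all you need is one string from a page you already have in hand, a pattern match may be plenty; but if you have to submit a search form or follow links first, a module like WWW::Mechanize [Hack #22] saves you from reimplementing half a browser. A rough sketch, with a made-up search page and field name:

use WWW::Mechanize;

my $mech = WWW::Mechanize->new( agent => 'ExampleSpider/0.1' );
$mech->get('http://www.example.com/search.html');

# Fill out and submit the site's search form -- no hand-rolled POST required.
# The form number and field name here are hypothetical; inspect the real page first.
$mech->submit_form(
    form_number => 1,
    fields      => { query => 'spidering hacks' },
);

# Now scrape the results page as usual.
my @headlines = $mech->content =~ m{<h2>(.*?)</h2>}gis;
print "$_\n" for @headlines;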

Don’t go where you’re not wanted

Your script may be the coolest thing ever, but it doesn’t matter if the site you want to spider doesn’t allow it. Before you go to all that trouble, make sure that the site doesn’t mind being spidered and that you’re doing it in such a way that you’re having the minimal possible impact on site bandwidth and resources [Hack #16]. For more information on this issue, including possible legal risks, see [Hack #6] and [Hack #17].

Choose a good identifier

When you’re writing an identifier for your spider, choose one that clearly specifies what the spider does: what information it’s intended to scrape and what it’s used for. There’s no need to write a novel; a sentence will do fine. These identifiers are called User-Agents, and you’ll learn how to set them in [Hack #11].

Whatever you do, do not use an identifier that impersonates an existing spider, such as Googlebot, or an identifier that’s confusingly similar to an existing spider. Not only will your spider get iced, you’ll also get into trouble with Google or whoever you’re imitating. See [Hack #6] for the possible consequences of playing mimicry games.
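
Setting the identifier is a one-liner in most Perl HTTP libraries. As a preview of [Hack #11], here’s roughly how it looks with LWP::UserAgent; the name, URL, and address are placeholders for your own:

use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Point curious webmasters at a page about the spider and a contact address.
$ua->agent('NewsImageScraper/1.03 (+http://www.example.com/spider.html; spider@example.com)');

my $response = $ua->get('http://www.example.com/');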

Make information on your spider readily available

Put up a web page that provides information about your spider and a contact address. Be sure that it’s accessible from your friendly neighborhood search engine. See [Hack #4] for some ways and places to get the word out about its existence.

Don’t demand unlimited site access or support

You may have written the greatest application since Google’s PageRank, but it’s up to the webmaster to decide if that entitles you to more access to site content or restricted areas. Ask nicely, and don’t demand. Share what you’re doing; consider giving them the code! After all, you’re scraping information from their web site. It’s only fair that you share the program that makes use of their information.

Best Practices for Your Spider

When you write your spider, there are some good manners you should follow.

Respect robots.txt

robots.txt is a file that lives at the root of a site and tells spiders what they can and cannot access on that server. It can even tell particular spiders to leave the site entirely unseen. Many webmasters use your spider’s respect—or lack thereof—for robots.txt as a benchmark; if you ignore it, you’ll likely be banned. See [Hack #17] for detailed guidelines.

Secondary to the robots.txt file is the Robots META tag (http://www.robotstxt.org/wc/exclusion.html#meta), which gives indexing instructions to spiders on a page-by-page basis. The Robots META tag protocol is not nearly as universal as robots.txt, and fewer spiders respect it.
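
If you’re working in Perl, you don’t even have to parse robots.txt yourself: LWP::RobotUA, which ships with the libwww-perl distribution, is a drop-in replacement for LWP::UserAgent that fetches robots.txt, honors it, and throttles your requests. A minimal sketch, with placeholder identifiers:

use LWP::RobotUA;

my $ua = LWP::RobotUA->new('NewsImageScraper/1.03', 'spider@example.com');
$ua->delay(1/60);    # delay() takes minutes, so 1/60 is roughly one second between requests

my $response = $ua->get('http://www.example.com/some/page.html');
if ($response->is_success) {
    print $response->content;
} else {
    # Requests forbidden by robots.txt come back as errors rather than being sent.
    print "Couldn't fetch the page: ", $response->status_line, "\n";
}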

Go light on the bandwidth

You might love a site’s content and want to make the most of it for your application, but that’s no reason to be greedy. If your spider tries to slurp up too much content in a short stretch of time—dozens or even hundreds of pages per second—you could hurt both the bandwidth allowances of the site you’re scraping and the ability of other visitors to access the site. This is often called hammering (as in, “That stupid spider is hammering my site and the page delivery has slowed to a crawl!”).

There is no agreement on how quickly spiders can politely access pages. One or two requests per second has been proposed by contributors to WebmasterWorld.com.

WebmasterWorld.com (http://www.webmasterworld.com) is an online gathering of search engine enthusiasts and webmasters from all over the world. Many good discussions happen there. The best part about WebmasterWorld.com is that representatives from several search engines and sites participate in the discussions.

Unfortunately, it seems that it’s easier to define what’s unacceptable than to figure out a proper limit. If you’re patient, one or two requests a second is probably fine; beyond that, you run the risk of making somebody mad. Anywhere’s walking distance if you have the time; in the same manner, if you’re in no rush to retrieve the data, impart that to your spider. Refer to [Hack #16] for more information on minimizing the amount of bandwidth you consume.
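
The simplest way to impart that patience to a spider is an explicit pause between requests. A sketch, assuming a short list of pages you already know you’re allowed to fetch (the URLs and User-Agent are placeholders):

use LWP::UserAgent;

my $ua = LWP::UserAgent->new( agent => 'NewsImageScraper/1.03' );
my @urls = qw(
    http://www.example.com/page1.html
    http://www.example.com/page2.html
);

foreach my $url (@urls) {
    my $response = $ua->get($url);
    warn "Couldn't fetch $url\n" unless $response->is_success;
    # ... do something with $response->content here ...
    sleep 2;    # a couple of seconds between requests keeps you off the hammering list
}

If you’re already using LWP::RobotUA, as in the earlier sketch, its delay() method accomplishes the same thing.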

Take just enough, and don’t take too often

Overscraping is, simply, taking more than you need and thus taking more of the site’s bandwidth than necessary. If you need a page, take a page. Don’t take the entire directory or (heaven help you) the entire site.

This also applies to time. Don’t scrape the site any more often than is necessary. If your program will run with data scraped from the site once a day, stick with that. I wouldn’t go more than once an hour, unless I absolutely had to (and had permission from the site owner).

Hack #3. Anatomy of an HTML Page

Getting the knack of scraping is more than just code; it takes knowing HTML and other kinds of web page files.

If you’re new to spidering, figuring out what to scrape and why is not easy. Relative scraping newbies might take too much information, too little, or information that’s likely to change out from under them. If you know how HTML files are structured, however, you’ll find it easier to scrape them and zero in on the information you need.

HTML files are just text files with special formatting. And that’s just the kind of file you’ll spend most of your time scraping, both in this book and in your own spidering adventures. While we’ll also be spidering and grabbing multimedia files—images, movies, and audio files—we won’t be scraping and parsing them to glean embedded information.

Anatomy of an HTML Page

That’s not to say, however, that there aren’t about as many ways to format an HTML page as there are pages on the Web. To understand how your spider might be able to find patterns of information on an HTML page, you’ll need to start with the basics—the very basics—of how an HTML web page looks, and then get into how the information within the body can be organized.

The core of an HTML page looks like this:

<html>
<head>
  <title>
    Title of the page
  </title>
</head>
<body>
  Body of the page
</body>
</html>

That’s it. 99% of the HTML pages on the Web start out like this. They can get a lot more elaborate but, in the end, this is the core. What does this mean to your spider? It means that there’s only one piece of information that’s clearly marked by tags, and that’s the page title. If all you need is the title, you’re in gravy.

But if you need information from the body of a page—say, a headline or a date—you have some detective work ahead of you. Many times, the body of a page has several tables, JavaScript, and other code that obscures what you’re truly looking for—all annoyances that have much more to do with formatting information than truly organizing it. But, at the same time, the HTML language contains several standards for organizing data. Some of these standards make the information larger on the page, representing a heading. Some of the standards organize information into lists within the body. If you understand how the standards work, you’ll find it easier to pluck the information you want from the heavily coded confines of a web page body.

Header Information with the H Tags

Important information on a page (headlines, subheads, notices, and so forth) is usually noted with an <Hx> tag, where x is a number from 1 to 6. An <H1> tag is normally displayed as the largest, as it is highest in the headline hierarchy.

Depending on how the site is using them, you can sometimes get a good summary of information from a site just by scraping the H tags. For example, if you’re scraping a news site and you know they always put headlines in <H2> tags and subheads in <H4> tags, you can scrape for that specific markup and get brief story extractions, without having to figure out the rest of the story’s coding. In fact, if you know a site always does that, you can scrape the entire site just for those tags, without having to look at the rest of the site’s page structure at all.
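
A sketch of that approach, using HTML::TokeParser [Hack #20] on a page already fetched into $html (a placeholder variable of our own); the assumption that headlines live in <H2> tags is, of course, specific to the site you’re scraping:

use HTML::TokeParser;

my $parser = HTML::TokeParser->new(\$html);

# Each time we find an <h2>, print everything up to the matching </h2>.
while ( $parser->get_tag('h2') ) {
    print "Headline: ", $parser->get_trimmed_text('/h2'), "\n";
}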

List Information with Special HTML Tags

Not all web wranglers use specific HTML tags to organize lists; some of them just start new numbered paragraphs. But, for the more meticulous page-builder, there are specific tags for lists.

Ordered lists (lists of information that are automatically numbered) are bounded with <ol> and </ol> tags, and each item within is bounded by <li> and </li> tags. If you’re using regular expressions to scrape for information, you can grab everything between <ol> and </ol>, parse each <li></li> element into an array, and go from there. Here’s an ordered list:

<ol>
 <li>eggs</li>
 <li>milk</li>
 <li>butter</li>
 <li>sugar</li>
</ol>

Unordered lists are just like ordered lists, except that they appear in the browser with bullets instead of numbers, and the list is bounded with <ul></ul> instead of <ol></ol>.
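
Here’s a minimal sketch of that regular expression approach, run against the ordered list above; it works the same way for unordered lists if you swap ol for ul:

#!/usr/bin/perl
use strict;
use warnings;

my $html = join '', <DATA>;

# Grab everything between <ol> and </ol>, then split out the <li> items.
my ($list) = $html =~ m{<ol>(.*?)</ol>}is;
my @items  = $list =~ m{<li>(.*?)</li>}gis;
print "$_\n" for @items;    # prints eggs, milk, butter, sugar

__DATA__
<ol>
 <li>eggs</li>
 <li>milk</li>
 <li>butter</li>
 <li>sugar</li>
</ol>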

Non-HTML Files

Some non-HTML files are just as nebulous as HTML files, while some are far better defined. Plain .txt files, for example (and there are plenty of them available on the Web), have no formatting at all—not even as basic as “this is the title and this is the body.” On the other hand, text files are sometimes easier to parse, because they have no HTML code soup to wade through.

At the other extreme are XML (Extensible Markup Language) files. XML’s parts are defined more rigidly than HTML’s. RSS, a syndication format and a simple form of XML, has clearly defined parts in its files for titles, content, links, and additional information. We often work with RSS files in this book; the precisely defined parts are easy to parse and write using Perl. See “Using XML::RSS to Repurpose Everything” [Hack #94].

The first thing you’ll need to do when you decide you want to scrape something is determine what kind of file it is. If it’s a plain .txt file, you won’t be able to pinpoint your scraping. If it’s an XML file, you’ll be able to zoom in on what you want with regular expressions, or use any number of Perl XML modules (such as XML::Simple, XML::RSS, or XML::LibXML).
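
For example, pulling the titles and links out of an RSS feed takes only a few lines with XML::RSS; the feed URL here is a placeholder:

use XML::RSS;
use LWP::Simple qw(get);

my $feed = get('http://www.example.com/index.rss') or die "Couldn't fetch the feed\n";

my $rss = XML::RSS->new;
$rss->parse($feed);

print $rss->channel('title'), "\n";
foreach my $item ( @{ $rss->{items} } ) {
    print "$item->{title}\n  $item->{link}\n";
}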

Hack #4. Registering Your Spider

If you have a spider you’re programming or planning on using even a minimal amount, you need to make sure it can be easily identified. The most low-key of spiders can be the subject of lots of attention.

On the Internet, any number of “arms races” are going on at the same time. You know: spammers versus antispammers, file sharers versus non-file sharers, and so on. A lower-key arms race rages between web spiders and webmasters who don’t want the attention.

Who might not want to be spidered? Unfortunately, not all spiders are as benevolent as the Googlebot, Google’s own indexer. Many spiders go around searching for email addresses to spam. Still others don’t abide by the rules of gentle scraping and data access [Hack #2]. Therefore, spiders have gotten to the point where they’re viewed with deep suspicion by experienced webmasters.

In fact, it’s gotten to the point where, when in doubt, your spider might be blocked. With that in mind, it’s important to name your spider wisely, register it with online databases, and make sure it has a reasonably high profile online.

By the way, you might think that your spider is minimal or low-key enough that nobody’s going to notice it. That’s probably not the case. In fact, sites like Webmaster World (http://www.webmasterworld.com) have entire forums devoted to identifying and discussing spiders. Don’t think that your spider is going to get ignored just because you’re not using a thousand online servers and spidering millions of pages a day.

Naming Your Spider

The first thing you want to do is name your spider. Choose a name that gives some kind of indication of what your spider’s about and what it does. Examplebot isn’t a good name. NewsImageScraper is better. If you’re planning to do a lot of development, consider including a version number (such as NewsImageScraper/1.03).

If you’re running several spiders, you might want to consider giving them a common naming scheme. For example, if Kevin runs different spiders, he might consider giving them a naming convention starting with disobeycom: disobeycomNewsImageScraper, disobeycomCamSpider, disobeycomRSSfeeds, and so on. If you establish your spiders as polite and well behaved, a webmaster who sees a spider named similarly to yours might give it the benefit of the doubt. On the other hand, if you program rude, bandwidth-sucking spiders, giving them similar names makes it easier for webmasters to ban ’em all (which you deserve).

Considering what you’re going to name your spider might give you what, at first glance, looks like a clever idea: why not just name your spider after one that already exists? After all, most corners of the web make their resources available to the Googlebot; why not just name your spider Googlebot?

As we noted earlier, this is a bad idea for several reasons. First, the owner of the spider you imitate is likely to ice your spider; there are web sites, like http://www.iplists.com, devoted to tracking the IP addresses of legitimate spiders (there’s a whole list associated with the legitimate Googlebot spider, for example). Second, though there isn’t much legal precedent addressing fraudulent spiders, Google has already established that they don’t take kindly to anyone misappropriating, or even just using without permission, the Google name.

A Web Page About Your Spider

Once you’ve created a spider, you’ll need to register it. But I also believe you should create a web page for it, so a curious and mindful webmaster has to do no more than a quick search to find information. The page should include:

  • Its name, as it would appear in the logs (via User-Agent)

  • A brief summary of what the spider was intended for and what it does (as well as a link to the resources it provides, if they’re publicly available)

  • Contact information for the spider’s programmer

  • Information on what webmasters can do to block the spider, or to make their information more available and usable to it, whichever they prefer

Places to Register Your Spider

Even if you have a web page that describes your spider, be sure to register it in the online spider databases. Why? Because webmasters might default to searching those databases instead of doing web searches for spider names. Furthermore, webmasters might use the databases as a basis for deciding which spiders they’ll allow on their sites. Here are some databases to get you started:

Web Robots Database (http://www.robotstxt.org/wc/active.html)

Viewable in several different formats. Adding your spider requires filling out a template and emailing it to a submission address.

Search engine robots (http://www.jafsoft.com/searchengines/webbots.html)

User-Agents and spiders organized into different categories—search engine robots, browsers, link checkers, and so on—with a list of “fakers” at the end, including some webmaster commentary.

List of User-Agents (http://www.psychedelix.com/agents.html)

Divided over several pages and updated often. There’s no clear submission process, though there’s an email address at the bottom of each page.

The User Agent Database (http://www.icehousedesigns.com/useragents/)

Almost 300 agents listed, searchable in several different ways. This site provides an email address to which you can submit your spider.

Hack #5. Preempting Discovery

Rather than await discovery, introduce yourself!

No matter how gentle and polite your spider is, sooner or later you’re going to be noticed. Some webmaster’s going to see what your spider is up to, and they’re going to want some answers. Rather than wait for that to happen, why not take the initiative and make the first contact yourself? Let’s look at the ways you can preempt discovery, make the arguments for your spider, and announce it to the world.

Making Contact

If you’ve written a great spider, why not tell the site about it? For a small site, this is relatively easy and painless: just look for the Feedback, About, or Contact links. For larger sites, though, figuring out whom to contact is more difficult. Try the technical contacts first, and then web feedback contacts. I’ve found that public relations contacts are usually best to reach last. Although tempting, because it’s usually easy to find their addresses, PR folk like to concentrate on dealing with press people (which you’re probably not) and they probably won’t know enough programming to understand your request. (PR people, this isn’t meant pejoratively. We still love you. Keep helping us promote O’Reilly books. Kiss, kiss.)

If you absolutely can’t find anyone to reach out to, try these three steps:

  1. Many sites, especially technical ones, have employees with weblogs. See if you can find them via a Google search. For example, if you’re looking for Yahoo! employees, the search "work for yahoo" (weblog | blog) does nicely. Sometimes, you can contact these people and let them know what you’re doing, and they can either pass your email to someone who can approve it, or give you some other feedback.

  2. 99.9% of the time, an email to webmaster@ will work (e.g., webmaster@example.com). But there’s no guarantee that anyone reads this mailbox more than once a month, if at all.

  3. If you’re absolutely desperate, you can’t find email addresses or contact information anywhere on the site, and your emails to webmaster@ have bounced, try looking up the domain registration at http://www.whois.org or a similar domain lookup site. Most of the time, you’ll find a contact email in the registration record, but again, there’s no guarantee that anyone checks it, or even that it’s still active. And remember, this works only at the domain level. In other words, you might be able to get the contact information for www.example.com but not for www.example.com/resource/.

Making the Arguments for Your Spider

Now that you have a contact address, give a line of reasoning for your spider. If you can clearly describe what your spider’s all about, great. But it may get to the point where you have to code up an example to show to the webmaster. If the person you’re talking to isn’t Perl-savvy, consider making a client-side version of your script with Perl2Exe (http://www.indigostar.com/perl2exe.htm) or PAR (http://search.cpan.org/author/AUTRIJUS/PAR) and sending it to her to test drive.

Offer to show her the code. Explain what it does. Give samples of the output. If she really likes it, offer to let her distribute it from her site! Remember, all the average, nonprogramming webmaster is going to hear is “Hi! I wrote this Program and it Does Stuff to your site! Mind if I use it?” Understand if she wants a complete explanation and a little reassurance.

Making Your Spider Easy to Find and Learn About

Another good way to make sure that someone knows about your spider is to include contact information in the spider’s User-Agent [Hack #11]. Contact information can be an email or a web address. Whatever it is, be sure to monitor the address and make sure the web site has adequate information.

Considering Legal Issues

Despite making contact, getting permission, and putting plenty of information about your spider on the Web, you may still have questions. Is your spider illegal? Are you going to get in trouble for using it?

There are many open issues with respect to the laws relating to the Web, and cases, experts, and scholars—not to mention members of the Web community—disagree heartily on most of them. Getting permission and operating within its limits probably reduces your risk, particularly if the site’s a small one (that is, run by a person or two instead of a corporation). If you don’t have permission and the site’s terms of service aren’t clear, risk is greater. That’s probably also true if you’ve not asked permission and you’re spidering a site that makes an API available and has very overt terms of service (like Google).

Legal issues on the Internet are constantly evolving; the medium is just too new to make sweeping statements about fair use and what’s going to be okay and what’s not. It’s not just how your spider does its work, but also what you do with what you collect. In fact, we need to warn you that just because a hack is in the book doesn’t mean that we can promise that it won’t create risks or that no webmaster will ever consider the hack a violation of the relevant terms of service or some other legal rights.

Use your common sense (don’t suck everything off a web site, put it on yours, and think you’re okay), keep copyright laws in mind (don’t take entire wire service stories and stick them on your site), and ask permission (the worst thing they can say is no, right?). If you’re really worried, your best results will come from talking to an experienced lawyer.

Hack #6. Keeping Your Spider Out of Sticky Situations

You see tasty data here, there, and everywhere. Before you dive in, check the site’s acceptable use policies.

Because the point of Spidering Hacks is to get to data that APIs can’t (or haven’t been created to) reach, sometimes you might end up in a legal gray area. Here’s what you can do to help make sure you don’t get anywhere near a “cease and desist” letter or the threat of a lawsuit.

Perhaps, one fine day, you visit a site and find some data you’d simply love to get your hands on. Before you start hacking, it behooves you to spend a little time looking around for an Acceptable Use Policy (AUP) or Terms of Service (TOS)—occasionally you’ll see a Terms of Use (TOU)—and familiarize yourself with what you can and can’t do with the site itself and its underlying data. Usually, you’ll find a link at the bottom of the home page, often along with the site’s copyright information. Yahoo! has a Terms of Service link as almost the last entry on its front page, while Google’s is at the bottom of their About page. If you can’t find it on the front page, look at the corporate information or any About sections. In some cases, sites (mostly smaller ones) won’t have them, so you should consider contacting the webmaster—just about always webmaster@sitename.com—and ask.

So, you’ve found the AUP or TOS. Just what is it you’re supposed to be looking for? What you’re after is anything that has to do with spidering or scraping data. In the case of eBay, their position is made clear with this excerpt from their User Agreement:

You agree that you will not use any robot, spider, scraper or other automated means to access the Site for any purpose without our express written permission.

Clear enough, isn’t it? But sometimes it won’t be this obvious. Some usage agreements don’t make any reference whatsoever to spidering or scraping. In such cases, look for a contact address for the site itself or technical issues relating to its operation, and ask.

Bad Spider, No Biscuit!

Even with adherence to the terms of service and usage agreements you find on its pages, a web site might simply have a problem with how you’re using its data. There are several ways in which a spider might be obeying the letter of a service agreement yet still doing something unacceptable from the perspective of the owners of the content. For example, a site might say that it doesn’t want its content republished on a web site. Then, a spider comes along and turns its information into an RSS feed. An RSS feed is not, technically speaking, a web page. But the site owners might still find this use unacceptable. There is nothing stopping a disgruntled site from revising its TOS to deny a spider’s access, and then sending you a “cease and desist” letter.

But let’s go beyond that for a minute. Of course we don’t want you to violate Terms of Service, dance with lawyers, and so on. The Terms of Service are there for a reason. Usually, they’re the parameters under which a site needs to operate in order to stay in business. Whatever your spider does, it needs to do it in the spirit of keeping the site from which it draws information healthy. If you write a spider that sucks away all information from advertiser-supported sites, and they can’t sell any more advertising, what happens? The site dies. You lose the site, and your program doesn’t work any more.

Though it’s rarely done in conjunction with spidering, framing data is a long-established legal no-no. Basically, framing data means that you’re putting the content of someone else’s site under a frame of your own design (in effect, branding another site’s data with your own elements). The frame usually contains ads that are paying you for the privilege. Spidering another site’s content and reappropriating it into your own framed pages is bad. Don’t do it.

Violating Copyright

I shouldn’t even have to say this, but reiteration is a must. If you’re spidering for the purpose of using someone else’s intellectual property on your web site, you’re violating copyright law. I don’t care if your spider is scrupulously obeying a site’s Terms of Service and is the best-behaved spider in the world, it’s still doing something illegal. In this case, you can’t fix the spider; it’s not the code that’s at fault. Instead, you’d better fix the intent of the script you wrote. For more information about copyright and intellectual property on the Web, check out Lawrence Lessig’s weblog at http://www.lessig.org/blog/ (Professor Lessig is a Professor of Law at Stanford Law School); the Electronic Frontier Foundation (http://www.eff.org); and Copyfight, the Politics of IP (http://www.copyfight.org/).

Aggregating Data

Aggregating data means gathering data from several different places and putting it together in one place. Think of a site that gathers different airline ticket prices in one place, or a site that compares prices from several different online bookstores. These are online aggregators, which represent a gray area in Internet etiquette. Some companies resent their data being aggregated and compared to the data on other sites (like comparison price shopping). Other companies don’t care. Some companies actually have agreements with certain sites to have their information aggregated! You won’t often find this spelled out in a site’s Terms of Service, so when in doubt, ask.

Competitive Intelligence

Some sites complain because their competitors access and spider their data—data that’s publicly available to any browser—and use it in their competitive activities. You might agree with them and you might not, but the fact is that such scraping has been the object of legal action in the past. Bidder’s Edge was sued by eBay (http://pub.bna.com/lw/21200.htm) for such a spider.

Possible Consequences of Misbehaving Spiders

What’s going to happen if you write a misbehaving spider and unleash it on the world? There are several possibilities. Often, sites will simply block your IP address. In the past, Google has blocked groups of IP addresses in an attempt to keep a single automated process from violating its TOS. Otherwise, the first course of action is usually a “cease and desist” letter, telling you to knock it off. From there, the conflict could escalate into a lawsuit, depending on your response.

Besides the damages that are assessed against people who lose lawsuits, some of the laws governing content and other web issues—for example, copyright laws—carry criminal penalties, which means fines and imprisonment in really extreme situations.

Writing a misbehaving spider is rarely going to have the police kicking down your door, unless you write something particularly egregious, like something that floods a web site with data or otherwise interferes with that site’s ability to operate (referred to as denial of service). But considering lawyers’ fees, the time it’ll take out of your life, and the monetary penalties that might be imposed on you, a lawsuit is bad enough, and it’s a good enough reason to make sure that your spiders are behaving and your intent is fair.

Tracking Legal Issues

To keep an eye on ongoing legal and scraping issues, try the Blawg Search (http://blawgs.detod.com/) search engine, which indexes only weblogs that cover legal issues and events. Try a search for spider, scraper, or spider lawsuit. If you’re really interested, note that Blawg Search’s results are also available as an RSS feed for use in popular syndicated news aggregators. You could use one of the hacks in this book and start your own uber-RSS feed for intellectual property concerns.

Other resources for keeping on top of brewing and current legal issues include: Slashdot (http://slashdot.org/search.pl?topic=123), popular geek hangout; the Electronic Frontier Foundation (http://www.eff.org), keeping tabs on digital rights; and the Berkman Center for Internet & Society at Harvard Law School (http://cyber.law.harvard.edu/home/), a research program studying cyberspace and its implications.

Hack #7. Finding the Patterns of Identifiers

If you find that the online database or resource you want uses unique identification numbers, you can stretch what it does by combining it with other sites and identification values.

Some online data collections are just that—huge collections, put together in one place, and relying on a search engine or database program to provide organization. These collections have no unique ID numbers, no rhyme or reason to their organization. But that’s not always the case.

As more and more libraries put their collection information online, more and more records and pages have their own unique identification numbers.

So what? Here’s what: when a web site uses an identifying method for its information that is recognized by other web sites, you can scrape data across multiple sites using that identifying method. For example, say you want to tour the country playing golf but you’re afraid of pollution, so you want to play only in the cleanest areas. You could write a script that searches for golf courses at http://www.golfcourses.com, then takes the Zip Codes of the courses returned and checks them against http://www.scorecard.org to see which have the most (or least) polluted environment.

This is a silly example, but it shows how two different online data sources (a golf course database and an environmental pollution guide) can be linked together with a unique identifying number (a Zip Code, in this case).
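
Structurally, that kind of mashup is just a join on the shared identifier. The sketch below fakes both lookups with canned data so it runs on its own; the subroutine names, Zip Codes, and scores are all hypothetical, and real versions would scrape the two sites (with their permission). The join logic in the middle is the part that carries over to any shared identifier:

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for the two scrapers; real versions would pull
# live data from the golf course and pollution sites.
sub golf_courses_by_zip {
    return ( '11111' => 'Example Golf Club', '22222' => 'Sample Fairways' );
}

sub pollution_scores_by_zip {
    my %score = ( '11111' => 42, '22222' => 17 );
    return map { $_ => $score{$_} } @_;
}

my %courses   = golf_courses_by_zip();
my %pollution = pollution_scores_by_zip( keys %courses );

# The join: match the two datasets on the shared Zip Code, cleanest first.
for my $zip ( sort { $pollution{$a} <=> $pollution{$b} } keys %pollution ) {
    print "$courses{$zip} ($zip): pollution score $pollution{$zip}\n";
}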

Speaking generally, there are three types of deliberate web data organization:

  • Arbitrary classification systems within a collection

  • Classification systems that use an established universal taxonomy within a collection

  • Classification systems that identify documents across a wide number of collections

Arbitrary Classification Systems Within a Collection

An arbitrary classification system is either not based on an established taxonomy or only loosely based on an established taxonomy. If I give 10 photographs unique codes based on their topics and how much blue they have in them, I have established an arbitrary classification system.

The arbitrary classification system’s usefulness is limited. You cannot use its identifying code on other sites. You might be able to detect a pattern in it that allows you to spider large volumes of data, but, on the other hand, you may not. (In other words, files labeled 10A, 10B, 10C, and 10D might be useful, but files labeled CMSH113, LFFD917, and MDFS214 would not.)

Classification Systems that Use an Established Universal Taxonomy Within a Collection

The most overt example of classification systems that use an established universal taxonomy is a library card catalog that follows the Dewey Decimal, Library of Congress, or other established classification.

The benefit of such systems is mixed. Say I look up Google Hacks at the University of Tulsa. I’ll discover that the LOC number is ZA4251.G66 C3 2003. Now, if I plug that number into Google, I’ll find about 13 results. Here’s the cool part: the results will be from a variety of libraries. So, if I wanted to, I could plug that search into Google and find other libraries that carry Google Hacks and extend that idea into a spidering script [Hack #65].

That’s the good thing. The bad thing is that such a search won’t list all libraries that carry Google Hacks. Other libraries have different classification systems, so if you’re looking for a complete list of libraries carrying the book, you’re not going to find it solely with this method. But you may find enough libraries to work well enough for your needs.

Classification Systems that Identify Documents Across a Wide Number of Collections

Beyond site classifications that are based on an established taxonomy, there are systems that use an identification number that is universally recognized and applied. Examples of such systems include:

ISBN (International Standard Book Number)

As you might guess, this number is an identification system for books. There are similar numbers for serials, music, scientific reports, etc. You’ll see ISBNs used everywhere from library catalogs to Amazon.com—anywhere books are listed.

EIN (Employer Identification Number)

Used by the IRS. You’ll see this number in tax-filing databases, references to businesses and nonprofits, and so on.

Zip Code

Allows the U.S. Post Office to identify unique areas.

This list barely scratches the surface. Of course, you can go even further, including unique characteristics such as latitude and longitude, other business identification numbers, or even area codes. The challenge is finding an identifier that’s distinctive enough to be usable by your spider without much supporting context. "918" is a three-digit string that turns up plenty of search results that have nothing to do with area codes, so you might not be able to eliminate the false positives when building a spider that depends on area codes for results.

On the other hand, an extended classification number, such as an LOC catalog number or ISBN, is going to have few, if any, false positives. The longer or more complicated an identifying number is, the better it serves the purposes of spidering.

Some Large Collections with ID Numbers

There are several places online that use unique classification numbers, which you can use across other sites. Here are a few you might want to play with:

Amazon.com (http://www.amazon.com), Abebooks (http://www.abebooks.com)

These sites both use ISBN numbers. Combining the two result sets can aggregate seller information to find you the cheapest price on used books.

The International Standard Serial Number Register (http://www.issn.org)

You need to be a subscriber to use this site, but trials are available. ISSNs are used for both online and offline magazines.

United States Post Office (http://www.usps.com)

This site allows you to do both standard and nine-digit Zip Code lookups, so you can pinpoint more specific areas within a Zip Code (and eliminate false positive results from your spider).

GuideStar, a database of nonprofit organizations, has a search page that allows you to search by EIN (http://www.guidestar.org/search/). A variety of other business databases also allow you to search by EIN.
