Chapter 1. Introduction to Web Automation

LWP (short for “Library for World Wide Web in Perl”) is a set of Perl modules and object-oriented classes for getting data from the Web and for extracting information from HTML. This chapter provides essential background on the LWP suite. It describes the nature and history of LWP, which platforms it runs on, and how to download and install it. This chapter ends with a quick walkthrough of several LWP programs that illustrate common tasks, such as fetching web pages, extracting information using regular expressions, and submitting forms.

The Web as Data Source

Most web sites are designed for people. User Interface gurus consult for large sums of money to build HTML code that is easy to use and displays correctly on all browsers. User Experience gurus wag their fingers and tell web designers to study their users, so they know the human foibles and desires of the ape descendants who will be viewing the web site.

Fundamentally, though, a web site is home to data and services. A stockbroker has stock prices and the value of your portfolio (data) and forms that let you buy and sell stock (services). Amazon has book ISBNs, titles, authors, reviews, prices, and rankings (data) and forms that let you order those books (services).

It’s assumed that the data and services will be accessed by people viewing the rendered HTML. But many a programmer has eyed those data sources and services on the Web and thought “I’d like to use those in a program!” For example, they could page you when your portfolio falls past a certain point or could calculate the “best” book on Perl based on the ratio of its price to its average reader review.

LWP lets you do this kind of web automation. With it, you can fetch web pages, submit forms, authenticate, and extract information from HTML. Once you’ve used it to grab news headlines or check links, you’ll never view the Web in the same way again.

As with everything in Perl, there’s more than one way to automate accessing the Web. In this book, we’ll show you everything from the basic way to access the Web (via the LWP::Simple module), through forms, all the way to the gory details of cookies, authentication, and other types of complex requests.

Screen Scraping

Once you’ve tackled the fundamentals of how to ask a web server for a particular page, you still have to find the information you want, buried in the HTML response. Most often you won’t need more than regular expressions to achieve this. Chapter 6 describes the art of extracting information from HTML using regular expressions, although you’ll see the beginnings of it as early as Chapter 2, where we query AltaVista for a word, and use a regexp to match the number in the response that says “We found [number] results.”
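As a small illustration of the regular-expression approach, here is a sketch that pulls the number out of a line like the one just quoted. The sample HTML string and the name result_count are made up for this illustration; a real results page will be worded differently and will need its own pattern.

```perl
use strict;
use warnings;

# Hypothetical fragment of a results page; real pages will differ.
my $html = '<p>We found 1,234 results.</p>';

# Return the number from "We found N results", or undef if it is absent.
sub result_count {
    my ($text) = @_;
    return undef unless $text =~ m/We found ([\d,]+) results/;
    (my $number = $1) =~ tr/,//d;   # drop thousands separators
    return $number;
}

my $count = result_count($html);
print defined $count ? "$count\n" : "No count found\n";
```

Returning undef when the pattern fails lets the caller tell "zero results" apart from "the page no longer looks the way I expected."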

The more discerning LWP connoisseur, however, treats the HTML document as a stream of tokens (Chapter 7, with an extended example in Chapter 8) or as a parse tree (Chapter 9). For example, you’ll use a token view and a tree view to consider such tasks as how to catch <img...> tags that are missing some of their attributes, how to get the absolute URLs of all the headlines on the BBC News main page, and how to extract content from one web page and insert it into a different template.

In the old days of 80x24 terminals, “screen scraping” referred to the art of programmatically extracting information from the screens of interactive applications. That term has been carried over to mean the act of automatically extracting data from the output of any system that was basically designed for interactive use. That’s the term used for getting data out of HTML that was meant to be looked at in a browser, not necessarily extracted for your programs’ use.


Brittleness

In some lucky cases, your LWP-related task consists of downloading a file without requiring your program to parse it in any way. But most tasks involve having to extract a piece of data from some part of the returned document, using the screen-scraping tactics mentioned earlier. An unavoidable problem is that the format of most web content can change at any time. For example, in Chapter 8, I discuss the task of extracting data from the program listings at the web site for the radio show Fresh Air. The principle I demonstrate for that specific case is true for all extraction tasks: no pattern in the data is permanent and so any data-parsing program will be “brittle.”

For example, if you want to match text in section headings, you can write your program to depend on them being inside <h2>...</h2> tags, but tomorrow the site’s template could be redesigned, and headings could then be in <h3 class='hdln'>...</h3> tags, at which point your program won’t see anything it considers a section heading. In practice, any given site’s template won’t change on a daily basis (nor even yearly, for most sites), but as you read this book and see examples of data extraction, bear in mind that each solution can’t be the solution, but is just a solution, and a temporary and brittle one at that.
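The following sketch makes the heading scenario concrete. The two sample strings stand in for a site's old and new templates, and the headings sub is a hypothetical helper, not something from LWP. It accepts any heading level and any attributes, which survives this particular redesign but is still only a heuristic.

```perl
use strict;
use warnings;

# Extract heading text, accepting any heading level and any attributes.
# Still only a heuristic: it misses headings marked up some other way.
sub headings {
    my ($html) = @_;
    my @found;
    while ($html =~ m{<h([1-6])\b[^>]*>(.*?)</h\1>}gis) {
        push @found, $2;
    }
    return @found;
}

my $old_template = '<h2>Perl News</h2>';
my $new_template = "<h3 class='hdln'>Perl News</h3>";

# A pattern tied to the old template finds nothing after the redesign.
my $strict_hits = () = $new_template =~ m{<h2>(.*?)</h2>}g;
print "strict pattern on new template: $strict_hits match(es)\n";

print "tolerant: $_\n" for headings($old_template), headings($new_template);
```

A looser pattern buys some time, but the next redesign may just as easily put headings in elements this pattern does not anticipate.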

As somewhat of a lesson in brittleness, in this book I show you data from various web sites (the BBC News web site and many others) and show how to write programs to extract data from them. However, that code is fragile. Some sites get redesigned only every few years; others seem to change something every few weeks. So while I’ve made every effort to provide accurate code for the web sites as they exist at the time of this writing, I hope you will consider the programs in this book valuable as learning tools even after the sites have changed beyond recognition.

Web Services

Programmers have begun to realize the great value in automating transactions over the Web. There is now a booming industry in web services, which is the buzzword for data or services offered over the Web. What differentiates web services from web sites is that web services don’t emit HTML for the ultimate reading pleasure of humans; they emit XML for programs.

This removes the need to scrape information out of HTML, neatly solving the problem of ever-changing web sites made brittle by the fickle tastes of the web-browsing public. Some web services standards (SOAP and XML-RPC) even make the remote web service appear to be a set of functions you call from within your program—if you use a SOAP or XML-RPC toolkit, you don’t even have to parse XML!

However, there will always be information on the Web that isn’t accessible as a web service. For that information, screen scraping is the only choice.

History of LWP

The following history of LWP was written by Gisle Aas, one of the creators of LWP and its current maintainer.

The libwww-perl project was started at the very first WWW conference held in Geneva in 1994. At the conference, Martijn Koster met Roy Fielding who was presenting the work he had done on MOMspider. MOMspider was a Perl program that traversed the Web looking for broken links and built an index of the documents and links discovered. Martijn suggested turning the reusable components of this program into a library. The result was the libwww-perl library for Perl 4 that Roy maintained.

Later the same year, Larry Wall made the first “stable” release of Perl 5 available. It was obvious that the module system and object-oriented features that the new version of Perl provided would make Roy’s library even better. At one point, both Martijn and I had made our own separate modifications of libwww-perl. We joined forces, merged our designs, and made several alpha releases. Unfortunately, Martijn ended up in disagreement with his employer about the intellectual property rights of work done outside working hours. To safeguard the code’s continued availability to the Perl community, he asked me to take over maintenance of it.

The LWP:: module namespace was introduced by Martijn in one of the early alpha releases. This name choice was the subject of lively discussion on the libwww mailing list. It was soon pointed out that this name could be confused with what certain implementations of threads called themselves, but no better name alternatives emerged. In the last message on this matter, Martijn concluded, “OK, so we all agree LWP stinks :-).” The name stuck and has established itself.

If you search for “LWP” on Google today, you have to go to 30th position before you find a link about threads.

In May 1996, we made the first non-beta release of libwww-perl for Perl 5. It was called release 5.00 because it was for Perl 5. This made some room for Roy to maintain libwww-perl for Perl 4, called libwww-perl-0.40. Martijn continued to contribute but was unfortunately “rolled over by the Java train.”

In 1997-98, I tried to redesign LWP around the concept of an event loop under the name LWPng. This allowed many nice things: multiple requests could be handled in parallel and on the same connection, requests could be pipelined to improve round-trip time, and HTTP/1.1 was actually supported. But the tuits to finish it up never came, so this branch must by now be regarded as dead. I still hope some brave soul shows up and decides to bring it back to life.

1998 was also the year that the HTML:: modules were unbundled from the core LWP distribution and the year after Sean M. Burke showed up and took over maintenance of the HTML-Tree distribution, actually making it handle all the real-world HTML that you will find. I had kind of given up on dealing with all the strange HTML that the web ecology had let develop. Sean had enough dedication to make sense of it.

Today LWP is in strict maintenance mode with a much slower release cycle. The code base seems to be quite solid and capable of doing what most people expect it to.

Installing LWP

LWP and the associated modules are available in various distributions free from the Comprehensive Perl Archive Network (CPAN). The main distributions are listed at the start of Appendix A, although the details of which modules are in which distributions change occasionally.

If you’re using ActivePerl for Windows or MacPerl for Mac OS 9, you already have LWP. If you’re on Unix and you don’t already have LWP installed, you’ll need to install it from CPAN using instructions given in the next section.

To test whether you already have LWP installed:

% perl -MLWP -le "print(LWP->VERSION)"

(The second character in -le is a lowercase L, not a digit one.)

If you see:

Can't locate LWP in @INC (@INC contains: ...lots of paths...).
BEGIN failed--compilation aborted.

or if you see a version number lower than 5.64, you need to install LWP on your system.
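The same check can be made from inside a program, which is handy when a script should refuse to run against an LWP that is too old. This is a minimal sketch: installed_version is a hypothetical helper name, and 5.64 is simply the threshold given above.

```perl
use strict;
use warnings;

# Return the installed version of a module, or undef if it cannot be loaded.
sub installed_version {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # LWP::Simple -> LWP/Simple.pm
    return undef unless eval { require $file; 1 };
    return $module->VERSION;
}

my $minimum = 5.64;   # the threshold given in the text
my $have    = installed_version('LWP');

if (!defined $have) {
    print "LWP is not installed.\n";
} elsif ($have < $minimum) {
    print "LWP $have is older than $minimum; upgrade it.\n";
} else {
    print "LWP $have is recent enough.\n";
}
```

Wrapping the require in eval keeps a missing module from killing the program, so the script can print a friendlier message than the @INC error shown above.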

There are two ways to install modules: using the CPAN shell or the old-fashioned manual way.

Installing LWP from the CPAN Shell

The CPAN shell is a command-line environment for automatically downloading, building, and installing modules from CPAN.


If you have never used the CPAN shell, you will need to configure it before you can use it. It will prompt you for some information before building its configuration file.

Invoke the CPAN shell by entering the following command at a system shell prompt:

% perl -MCPAN -eshell

If you’ve never run it before, you’ll see this:

We have to reconfigure due to following uninitialized parameters:

followed by a number of questions. For each question, the default answer is typically fine, but you may answer otherwise if you know that the default setting is wrong or not optimal. Once you’ve answered all the questions, a configuration file is created and you can start working with the CPAN shell.

Obtaining help

If you need help at any time, you can read the CPAN shell’s manual page by typing perldoc CPAN or by starting up the CPAN shell (with perl -MCPAN -eshell at a system shell prompt) and entering h at the cpan> prompt:

cpan> h
Display Information
 command  argument          description
 a,b,d,m  WORD or /REGEXP/  about authors, bundles, distributions, modules
 i        WORD or /REGEXP/  about anything of above
 r        NONE              reinstall recommendations
 ls       AUTHOR            about files in the author's directory
Download, Test, Make, Install...
 get                        download
 make                       make (implies get)
 test      MODULES,         make test (implies make)
 install   DISTS, BUNDLES   make install (implies test)
 clean                      make clean
 look                       open subshell in these dists' directories
 readme                     display these dists' README files
 h,?           display this menu       ! perl-code   eval a perl command
 o conf [opt]  set and query options   q             quit the cpan shell
 reload cpan   load again      reload index  load newer indices
 autobundle    Snapshot                force cmd     unconditionally do cmd

Installing LWP

All you have to do is enter:

cpan> install Bundle::LWP

The CPAN shell will show messages explaining what it’s up to. You may need to answer questions to configure the various modules (e.g., libnet asks for mail hosts and so on for testing purposes).

After much activity, you should then have a fresh copy of LWP on your system, with far less work than installing it manually one distribution at a time. At the time of this writing, install Bundle::LWP installs not just the libwww-perl distribution, but also URI and HTML-Parser. It does not install the HTML-Tree distribution that we’ll use in Chapter 9 and Chapter 10. To do that, enter:

cpan> install HTML::Tree

These commands do not install the HTML-Format distribution, which was also once part of the LWP distribution. I do not discuss HTML-Format in this book, but if you want to install it so that you have a complete LWP installation, enter this command:

cpan> install HTML::Format

Remember, LWP may be just about the most popular distribution in CPAN, but that’s not all there is! Look around the web-related parts of CPAN, as there are dozens of modules, from WWW::Automate to SOAP::Lite, that can simplify your web-related tasks.

Installing LWP Manually

The normal Perl module installation procedure is summed up in the document perlmodinstall. You can read this by running perldoc perlmodinstall at a shell prompt.

CPAN is a network of sites that mirror a large collection of Perl software and documentation. See the CPAN FAQ for more information about CPAN and modules.

Download distributions

First, download the module distributions. LWP requires several other modules to operate successfully. You’ll need to install the distributions given in Table 1-1, in the order in which they are listed.

Table 1-1. Modules used in this book

[Table body not recoverable; the one surviving column heading is “CPAN directory.”]

Fetch these modules from one of the FTP or web sites that form CPAN. Sometimes CPAN has several versions of a module in the authors directory. Be sure to check the version number and get the latest.

For example, to install MIME-Base64, you might first fetch the author’s directory listing to see which versions are there, then fetch the latest version and install that.

Unpack and configure

The distributions are gzipped tar archives of source code. Extracting a distribution creates a directory, and in that directory is a Makefile.PL Perl program that builds a Makefile for you.

% tar xzf MIME-Base64-2.12.tar.gz
% cd MIME-Base64-2.12
% perl Makefile.PL
Checking if your kit is complete...
Looks good
Writing Makefile for MIME::Base64

Make, test, and install

Compile the code with the make command:

% make
cp blib/lib/MIME/
cp blib/lib/MIME/
/usr/bin/perl -I/opt/perl5/5.6.1/i386-freebsd -I/opt/perl5/5.6.1
/opt/perl5/5.6.1/ExtUtils/xsubpp  -typemap
/opt/perl5/5.6.1/ExtUtils/typemap Base64.xs > Base64.xsc && mv
  Base64.xsc Base64.c
cc -c  -fno-strict-aliasing -I/usr/local/include -O    -DVERSION=\"2.12\"
  -DXS_VERSION=\"2.12\" -DPIC -fpic -I/opt/perl5/5.6.1/i386-freebsd/CORE
Running Mkbootstrap for MIME::Base64 (  )
chmod 644
rm -f blib/arch/auto/MIME/Base64/
LD_RUN_PATH="" cc -o blib/arch/auto/MIME/Base64/  -shared
  -L/opt Base64.o
chmod 755 blib/arch/auto/MIME/Base64/
cp blib/arch/auto/MIME/Base64/
chmod 644 blib/arch/auto/MIME/Base64/
Manifying blib/man3/MIME::Base64.3
Manifying blib/man3/MIME::QuotedPrint.3

Then make sure everything works on your system with make test:

% make test
PERL_DL_NONLAZY=1 /usr/bin/perl -Iblib/arch -Iblib/lib 
-I/opt/perl5/5.6.1/i386-freebsd -I/opt/perl5/5.6.1 -e 'use Test::Harness
  qw(&runtests $verbose); $verbose=0; runtests @ARGV;' t/*.t
t/unicode.........skipped test on this platform
All tests successful, 1 test skipped.
Files=3, Tests=306,  1 wallclock secs ( 0.52 cusr +  0.06 csys =  0.58 CPU)

If it passes the tests, install it with make install (as the superuser):

# make install
Installing /opt/perl5/site_perl/5.6.1/i386-freebsd/auto/MIME/Base64/
Installing /opt/perl5/site_perl/5.6.1/i386-freebsd/auto/MIME/Base64/
Files found in blib/arch: installing files in blib/lib into architecture
  dependent library tree
Installing /opt/perl5/site_perl/5.6.1/i386-freebsd/MIME/
Installing /opt/perl5/site_perl/5.6.1/i386-freebsd/MIME/
Installing /usr/local/man/man3/MIME::Base64.3
Installing /usr/local/man/man3/MIME::QuotedPrint.3
Writing /opt/perl5/site_perl/5.6.1/i386-freebsd/auto/MIME/Base64/.packlist
Appending installation info to /opt/perl5/5.6.1/i386-freebsd/perllocal.pod

Words of Caution

In theory, the underlying mechanisms of the Web make no difference between a browser getting data and displaying it to you, and your LWP-based program getting data and doing something else with it. However, in practice, almost all the data on the Web was put there with the assumption (sometimes implicit, sometimes explicit) that it would be looked at directly in a browser. When you write an LWP program that downloads that data, you are working against that assumption. The trick is to do this in as considerate a way as possible.

Network and Server Load

When you access a web server, you are using scarce resources. You are using your bandwidth and the web server’s bandwidth. Moreover, processing your request places a load on the remote server, particularly if the page you’re requesting has to be dynamically generated, and especially if that dynamic generation involves database access. If you’re writing a program that requests several pages from a given server but you don’t need the pages immediately, you should write delays into your program (such as sleep 60; to sleep for one minute), so that the load that you’re placing on the network and on the web server is spread unobtrusively over a longer period of time.
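One way to build such delays into a program is to route every request through a single routine that pauses between fetches. The sketch below is an illustration only: fetch_politely is a hypothetical helper, the URLs are placeholders, and the fetching routine is passed in as a code reference so the example can run without touching the network.

```perl
use strict;
use warnings;

# Fetch each URL in turn, pausing $delay seconds between requests.
# $fetch is a code reference that takes a URL and returns its content
# (or undef on failure), so any fetching routine can be plugged in.
sub fetch_politely {
    my ($fetch, $delay, @urls) = @_;
    my %content;
    for my $i (0 .. $#urls) {
        sleep $delay if $i > 0 && $delay > 0;   # no pause before the first request
        $content{ $urls[$i] } = $fetch->($urls[$i]);
    }
    return \%content;
}

# Typical use with LWP::Simple (commented out so the sketch runs offline):
#   use LWP::Simple;
#   my $pages = fetch_politely(\&get, 60, @urls);

# Offline demonstration with a stand-in fetcher and no delay.
my $pages = fetch_politely(sub { "contents of $_[0]" }, 0,
                           'http://example.invalid/a', 'http://example.invalid/b');
print "$_ => $pages->{$_}\n" for sort keys %$pages;
```

Keeping the delay in one place makes it easy to lengthen later if the server’s administrators ask you to slow down.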

If possible, you might even want to consider having your program run in the middle of the night (modulo the relevant time zones), when network usage is low and the web server is not likely to be busy handling a lot of requests. Do this only if you know there is no risk of your program behaving unpredictably. In Chapter 12, we discuss programs with a definite risk of that happening; do not let such programs run unattended until you have added appropriate safeguards and carefully checked that they behave as you expect them to.


Copyright

While the complexities of national and international copyright law can’t be covered in a page or two (or even a library or two), the short story is that just because you can get some data off the Web doesn’t mean you can do whatever you want with it. The things you do with data on the Web form a continuum, as far as their relation to copyright law. At the one end is direct use, where you sit at your browser, downloading and reading pages as the site owners clearly intended. At the other end is illegal use, where you run a program that hammers a remote server as it copies and saves copyrighted data that was not meant for free public consumption, then saves it all to your public web server, which you then encourage people to visit so that you can make money off of the ad banners you’ve put there. Between these extremes, there are many gray areas involving considerations of “fair use,” a tricky concept. The safest guide in trying to stay on the right side of copyright law is to ask: by using the data this way, could I possibly be depriving the original web site of some money that it would/could otherwise get?

For example, suppose that you set up a program that copies data every hour from the Yahoo! Weather site, for the 50 most populous towns in your state. You then copy the data directly to your public web site and encourage everyone to visit it. Even though “no one owns the weather,” even if any particular bit of weather data is in the public domain (which it may be, depending on its source), Yahoo! Weather put time and effort into making a collection of that data, presented in a certain way. And as such, the collection of data is copyrighted.

Moreover, by posting the data publicly, you are almost definitely taking viewers away from Yahoo! Weather, which means less ad revenue for them. Even if Yahoo! Weather didn’t have any ads and so wasn’t obviously making any money off of viewers, your having the data online elsewhere means that if Yahoo! Weather wanted to start having ads tomorrow, they’d be unable to make as much money at it, because there would be people in the habit of looking at your web site’s weather data instead of at theirs.

Acceptable Use

Besides the protection provided by copyright law, many web sites have “terms of use” or “acceptable use” policies, where the web site owners basically say “as a user, you may do this and this, but not that or that, and if you don’t abide by these terms, then we don’t want you using this web site.” For example, a search engine’s terms of use might stipulate that you should not make “automated queries” to their system, nor should you show the search data on another site.

Before you start pulling data off of a web site, you should put good effort into looking around for its terms of service document, and take the time to read it and reasonably interpret what it says. When in doubt, ask the web site’s administrators whether what you have in mind would bother them.

LWP in Action

Enough of why you should be careful when you automate the Web. Let’s look at the types of things you’ll be learning in this book. Chapter 2 introduces web automation and LWP, presenting straightforward functions to let you fetch web pages. Example 1-1 shows how to fetch the O’Reilly home page and count the number of times Perl is mentioned.

Example 1-1. Count “Perl” in the O’Reilly catalog
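#!/usr/bin/perl -w
use strict;
use LWP::Simple;
my $catalog = get("");
die "Couldn't fetch the page\n" unless defined $catalog;  # get( ) returns undef on error
my $count = 0;
$count++ while $catalog =~ m{Perl}gi;  # count case-insensitive matches
print "$count\n";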

The LWP::Simple module’s get( ) function returns the document at a given URL or undef if an error occurred. A regular expression match in a loop counts the number of occurrences.

The Object-Oriented Interface

Chapter 3 goes beyond LWP::Simple to show LWP’s larger and more powerful object-oriented interface. Most useful of all the features it covers are how to set headers in requests and check the headers of responses. Example 1-2 prints the identifying string that every server returns.

Example 1-2. Identify a server
#!/usr/bin/perl -w
use strict;
use LWP;
my $browser = LWP::UserAgent->new(  );
my $response = $browser->get("");
print $response->header("Server"), "\n";

The two variables, $browser and $response, are references to objects. The LWP::UserAgent object $browser makes requests of a server and creates HTTP::Response objects such as $response to represent the server’s reply. In Example 1-2, we call the header( ) method on the response to check one of the HTTP header values.
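To see where headers live without contacting any server, you can construct the request and response objects by hand. This is only a sketch for experimenting: the URL, the header values, and the server string are all placeholders, and in a real program the response would come from the user agent rather than from new( ).

```perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Response;

# Build a request by hand to show where request headers live.
# The URL and header values are placeholders.
my $request = HTTP::Request->new(
    GET => 'http://www.example.invalid/',
    [ 'Accept-Language' => 'en', 'User-Agent' => 'ExampleBot/0.1' ],
);
print $request->header('Accept-Language'), "\n";

# Build a response by hand instead of contacting a server.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Server' => 'ExampleServer/1.0', 'Content-Type' => 'text/html' ],
    '<html><body>Hello</body></html>',
);

print $response->status_line, "\n";
print $response->header('Server'), "\n";
print "Succeeded\n" if $response->is_success;
```

The same header( ), status_line( ), and is_success( ) calls work on a response returned by a real request, so code written against a hand-built object carries over unchanged.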


Forms

Chapter 5 shows how to analyze and submit forms with LWP, including both GET and POST submissions. Example 1-3 makes queries of the California license plate database to see whether a personalized plate is available.

Example 1-3. Query California license plate database
#!/usr/bin/perl -w
# -  query California license plate database
use strict;
use LWP::UserAgent;
my $plate = $ARGV[0] || die "Plate to search for?\n";
$plate = uc $plate;
$plate =~ tr/O/0/;  # we use zero for letter-oh
die "$plate is invalid.\n"
 unless $plate =~ m/^[A-Z0-9]{2,7}$/
    and $plate !~ m/^\d+$/;  # no all-digit plates
my $browser = LWP::UserAgent->new;
my $response = $browser->post(
  '',  # URL that the search form submits to
  [
    'plate'  => $plate,
    'search' => 'Check Plate Availability'
  ],
);
die "Error: ", $response->status_line
 unless $response->is_success;
if($response->content =~ m/is unavailable/) {
  print "$plate is already taken.\n";
} elsif($response->content =~ m/and available/) {
  print "$plate is AVAILABLE!\n";
} else {
  print "$plate... Can't make sense of response?!\n";
}

Here’s how you might use it:

% knee
KNEE is already taken.
% ankle

We use the post( ) method on an LWP::UserAgent object to POST form parameters to a page.
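If you are curious what such a POST looks like on the wire, you can build the request with HTTP::Request::Common and print it instead of sending it. This is a sketch for inspection only: the URL is a placeholder, and the field names are the ones used in Example 1-3.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Placeholder URL; the form field names are the ones used in Example 1-3.
my $request = POST 'http://www.example.invalid/search',
    [ plate => 'KNEE', search => 'Check Plate Availability' ];

print $request->method, "\n";
print $request->header('Content-Type'), "\n";
print $request->content, "\n";   # the URL-encoded form fields
```

Because the fields are given as an array reference, they are encoded in the order listed, which occasionally matters to fussy form handlers.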

Parsing HTML

The regular expression techniques in Example 1-1 and Example 1-3 are discussed in detail in Chapter 6. Chapter 7 shows a different approach, where the HTML::TokeParser module turns a string of HTML into a stream of chunks (“start-tag,” “text,” “close-tag,” and so on). Chapter 8 is a detailed step-by-step walkthrough showing how to solve a problem using HTML::TokeParser. Example 1-4 uses HTML::TokeParser to extract the src parts of all img tags in the O’Reilly home page.

Example 1-4. Extract image locations
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TokeParser;
my $html   = get("");
my $stream = HTML::TokeParser->new(\$html);
my %image  = (  );
while (my $token = $stream->get_token) {
    if ($token->[0] eq 'S' && $token->[1] eq 'img') {
        # store src value in %image
        $image{ $token->[2]{'src'} }++;
    }
}
foreach my $pic (sort keys %image) {
    print "$pic\n";
}

The get_token( ) method on our HTML::TokeParser object returns an array reference, representing a token. If the first array element is S, it’s a token representing the start of a tag. The second array element is the type of tag, and the third array element is a hash mapping attribute to value. The %image hash holds the images we find.
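The token layout is easier to see on a small literal string than on a live page. In this sketch the HTML is made up, image_sources is a hypothetical helper, and the second img tag deliberately lacks a src attribute to show why it is worth checking that the attribute exists before using it.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Return the src value of every img start-tag that has one.
sub image_sources {
    my ($html) = @_;
    my $stream = HTML::TokeParser->new(\$html)
        or die "Couldn't create the token stream";
    my @sources;
    while (my $token = $stream->get_token) {
        # For a start-tag token: type 'S', tag name, attribute hash.
        my ($type, $tag, $attr) = @$token;
        next unless $type eq 'S' && $tag eq 'img';
        push @sources, $attr->{src} if defined $attr->{src};
    }
    return @sources;
}

# Made-up HTML; the second img has no src attribute.
my $html = '<p>Logo: <img src="/images/logo.gif" alt="Logo"> <img alt="spacer"></p>';
print "$_\n" for image_sources($html);
```

Skipping tokens whose type is not S first means the other two elements are only interpreted as a tag name and an attribute hash when that is what they really are.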

Chapter 9 and Chapter 10 show how to use tree data structures to represent HTML. The HTML::TreeBuilder module constructs such trees and provides operations for searching and manipulating them. Example 1-5 extracts image locations using a tree.

Example 1-5. Extracting image locations with a tree
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TreeBuilder;
my $html = get("");
my $root = HTML::TreeBuilder->new_from_content($html);
my %images;
foreach my $node ($root->find_by_tag_name('img')) {
    $images{ $node->attr('src') }++;
}
foreach my $pic (sort keys %images) {
    print "$pic\n";
}

We create a new tree from the HTML in the O’Reilly home page. The tree has methods to help us search, such as find_by_tag_name( ), which returns a list of nodes corresponding to those tags. We use that to find the img tags, then use the attr( ) method to get their src attributes.
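A tree also makes it easy to ask more selective questions, such as which img elements lack a particular attribute. The sketch below runs on a made-up HTML string, and images_missing_alt is a hypothetical helper. It uses look_down( ), a search method of the tree’s elements that accepts attribute criteria and code references.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Return the src of each img element that has no alt attribute.
sub images_missing_alt {
    my ($html) = @_;
    my $root = HTML::TreeBuilder->new_from_content($html);
    my @sources =
        map { defined $_->attr('src') ? $_->attr('src') : '(no src)' }
        $root->look_down(
            _tag => 'img',                          # only img elements...
            sub { !defined $_[0]->attr('alt') },    # ...with no alt attribute
        );
    $root->delete;   # release the tree when finished with it
    return @sources;
}

# Made-up HTML; only the second image is missing its alt text.
my $html = '<html><body><img src="/a.gif" alt="A"><img src="/b.gif"></body></html>';
print "Missing alt: $_\n" for images_missing_alt($html);
```

The attribute values are copied out before the tree is deleted, so nothing returned to the caller refers to nodes that no longer exist.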


Authentication

Chapter 11 talks about advanced request features such as cookies (used to identify a user between web page accesses) and authentication. Example 1-6 shows how easy it is to request a protected page with LWP.

Example 1-6. Authenticating
#!/usr/bin/perl -w
use strict;
use LWP;
my $browser = LWP::UserAgent->new(  );
$browser->credentials("", "music", "fred" => "l33t1");
my $response = $browser->get("");
# ...

The credentials( ) method on an LWP::UserAgent adds the authentication information (the host, realm, and username/password pair are the parameters). The realm identifies which username and password are expected if there are multiple protected areas on a single host. When we request a document using that LWP::UserAgent object, the authentication information is used if necessary.
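You can confirm what a user agent would send for a given URL and realm without making a request. In this sketch the host, realm, username, and password are placeholders. Note that the host argument to credentials( ) is the host name and port joined by a colon, so the port must match the URL being requested (80 is the default for http).

```perl
use strict;
use warnings;
use LWP::UserAgent;
use URI;

my $browser = LWP::UserAgent->new;

# Host, port, realm, and account are all placeholders.
$browser->credentials('www.example.invalid:80', 'music', 'fred' => 'l33t1');

# Ask which username and password would be offered for this realm and URL.
my $url = URI->new('http://www.example.invalid/music/index.html');
my ($user, $password) = $browser->get_basic_credentials('music', $url, 0);

print defined $user
    ? "Would authenticate as $user\n"
    : "No credentials stored for that host and realm\n";
```

If the lookup comes back empty, the usual culprits are a realm string that does not match the server’s exactly or a missing port in the host argument.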

