Chapter 1. Introduction to Web Automation
LWP (short for “Library for World Wide Web in Perl”) is a set of Perl modules and object-oriented classes for getting data from the Web and for extracting information from HTML. This chapter provides essential background on the LWP suite. It describes the nature and history of LWP, which platforms it runs on, and how to download and install it. This chapter ends with a quick walkthrough of several LWP programs that illustrate common tasks, such as fetching web pages, extracting information using regular expressions, and submitting forms.
The Web as Data Source
Most web sites are designed for people. User Interface gurus consult for large sums of money to build HTML code that is easy to use and displays correctly on all browsers. User Experience gurus wag their fingers and tell web designers to study their users, so they know the human foibles and desires of the ape descendants who will be viewing the web site.
Fundamentally, though, a web site is home to data and services. A stockbroker has stock prices and the value of your portfolio (data) and forms that let you buy and sell stock (services). Amazon has book ISBNs, titles, authors, reviews, prices, and rankings (data) and forms that let you order those books (services).
It’s assumed that the data and services will be accessed by people viewing the rendered HTML. But many a programmer has eyed those data sources and services on the Web and thought “I’d like to use those in a program!” For example, such a program could page you when your portfolio falls past a certain point, or it could calculate the “best” book on Perl based on the ratio of its price to its average reader review.
LWP lets you do this kind of web automation. With it, you can fetch web pages, submit forms, authenticate, and extract information from HTML. Once you’ve used it to grab news headlines or check links, you’ll never view the Web in the same way again.
As with everything in Perl, there’s more than one way to automate accessing the Web. In this book, we’ll show you everything from the basic way to access the Web (via the LWP::Simple module), through forms, all the way to the gory details of cookies, authentication, and other types of complex requests.
Screen Scraping
Once you’ve tackled the fundamentals of how to ask a web server for a particular page, you still have to find the information you want, buried in the HTML response. Most often you won’t need more than regular expressions to achieve this. Chapter 6 describes the art of extracting information from HTML using regular expressions, although you’ll see the beginnings of it as early as Chapter 2, where we query AltaVista for a word, and use a regexp to match the number in the response that says “We found [number] results.”
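As a minimal sketch of this kind of extraction, the following program pulls the number out of a response of that shape. The HTML fragment is made up for illustration (it is not AltaVista's actual markup), and count_results is a hypothetical helper name, not part of LWP.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical response fragment; a real page would come from an LWP fetch.
my $html = '<p><b>We found 1,234 results.</b></p>';

# Return the result count from the page text, or undef if the
# expected phrase isn't there.
sub count_results {
  my $page = shift;
  return undef unless $page =~ m/We found ([\d,]+) results/;
  my $count = $1;
  $count =~ tr/,//d;    # drop thousands separators
  return $count;
}

my $n = count_results($html);
print defined $n ? "Results: $n\n" : "No count found.\n";
```

Returning undef when the phrase is missing lets the caller tell "no match" apart from a legitimate count of zero.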
The more discerning LWP connoisseur, however, treats the HTML document as a stream of tokens (Chapter 7, with an extended example in Chapter 8) or as a parse tree (Chapter 9). For example, you’ll use a token view and a tree view to consider such tasks as how to catch <img...> tags that are missing some of their attributes, how to get the absolute URLs of all the headlines on the BBC News main page, and how to extract content from one web page and insert it into a different template.
In the old days of 80x24 terminals, “screen scraping” referred to the art of programmatically extracting information from the screens of interactive applications. That term has been carried over to mean the act of automatically extracting data from the output of any system that was basically designed for interactive use. That’s the term used for getting data out of HTML that was meant to be looked at in a browser, not necessarily extracted for your programs’ use.
Brittleness
In some lucky cases, your LWP-related task consists of downloading a file without requiring your program to parse it in any way. But most tasks involve extracting a piece of data from some part of the returned document, using the screen-scraping tactics mentioned earlier. An unavoidable problem is that the format of most web content can change at any time. For example, in Chapter 8, I discuss the task of extracting data from the program listings at the web site for the radio show Fresh Air. The principle I demonstrate for that specific case is true for all extraction tasks: no pattern in the data is permanent and so any data-parsing program will be “brittle.”
For example, if you want to match text in section headings, you can write your program to depend on them being inside <h2>...</h2> tags, but tomorrow the site’s template could be redesigned, and headings could then be in <h3 class='hdln'>...</h3> tags, at which point your program won’t see anything it considers a section heading. In practice, any given site’s template won’t change on a daily basis (nor even yearly, for most sites), but as you read this book and see examples of data extraction, bear in mind that each solution can’t be the solution, but is just a solution, and a temporary and brittle one at that.
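The heading scenario can be sketched in a few lines. The two HTML strings below are invented stand-ins for a site before and after a redesign, and headings is a hypothetical helper; the point is only that a pattern tied to one template silently finds nothing under the other.

```perl
#!/usr/bin/perl -w
use strict;

# Collect the text of every <h2> heading in a chunk of HTML.
sub headings {
  my $html = shift;
  my @found;
  push @found, $1 while $html =~ m{<h2>(.*?)</h2>}gis;
  return @found;
}

# Made-up markup standing in for a site before and after a redesign.
my $old = "<h2>World News</h2><p>...</p><h2>Sports</h2>";
my $new = "<h3 class='hdln'>World News</h3><p>...</p>"
        . "<h3 class='hdln'>Sports</h3>";

my @old_hits = headings($old);
my @new_hits = headings($new);
print "Old template: ", scalar(@old_hits), " headings\n";
print "New template: ", scalar(@new_hits), " headings\n";
```

Note that the program doesn't fail loudly on the new template; it just reports zero headings, which is why extraction code benefits from a sanity check on what it found.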
As somewhat of a lesson in brittleness, in this book I show you data from various web sites (Amazon.com, the BBC News web site, and many others) and show how to write programs to extract data from them. However, that code is fragile. Some sites get redesigned only every few years; Amazon.com seems to change something every few weeks. So while I’ve made every effort to provide accurate code for the web sites as they exist at the time of this writing, I hope you will consider the programs in this book valuable as learning tools even after the sites have changed beyond recognition.
Web Services
Programmers have begun to realize the great value in automating transactions over the Web. There is now a booming industry in web services, which is the buzzword for data or services offered over the Web. What differentiates web services from web sites is that web services don’t emit HTML for the ultimate reading pleasure of humans; they emit XML for programs.
This removes the need to scrape information out of HTML, neatly solving the problem of ever-changing web sites made brittle by the fickle tastes of the web-browsing public. Some web services standards (SOAP and XML-RPC) even make the remote web service appear to be a set of functions you call from within your program—if you use a SOAP or XML-RPC toolkit, you don’t even have to parse XML!
However, there will always be information on the Web that isn’t accessible as a web service. For that information, screen scraping is the only choice.
History of LWP
The following history of LWP was written by Gisle Aas, one of the creators of LWP and its current maintainer.
The libwww-perl project was started at the very first WWW conference held in Geneva in 1994. At the conference, Martijn Koster met Roy Fielding who was presenting the work he had done on MOMspider. MOMspider was a Perl program that traversed the Web looking for broken links and built an index of the documents and links discovered. Martijn suggested turning the reusable components of this program into a library. The result was the libwww-perl library for Perl 4 that Roy maintained.
Later the same year, Larry Wall made the first “stable” release of Perl 5 available. It was obvious that the module system and object-oriented features that the new version of Perl provided could make Roy’s library even better. At one point, both Martijn and myself had made our own separate modifications of libwww-perl. We joined forces, merged our designs, and made several alpha releases. Unfortunately, Martijn ended up in disagreement with his employer about the intellectual property rights of work done outside hours. To safeguard the code’s continued availability to the Perl community, he asked me to take over maintenance of it.
The LWP:: module namespace was introduced by Martijn in one of the early alpha releases. This name choice was lively discussed on the libwww mailing list. It was soon pointed out that this name could be confused with what certain implementations of threads called themselves, but no better name alternatives emerged. In the last message on this matter, Martijn concluded, “OK, so we all agree LWP stinks :-).” The name stuck and has established itself.
If you search for “LWP” on Google today, you have to go to 30th position before you find a link about threads.
In May 1996, we made the first non-beta release of libwww-perl for Perl 5. It was called release 5.00 because it was for Perl 5. This made some room for Roy to maintain libwww-perl for Perl 4, called libwww-perl-0.40. Martijn continued to contribute but was unfortunately “rolled over by the Java train.”
In 1997-98, I tried to redesign LWP around the concept of an event loop under the name LWPng. This allowed many nice things: multiple requests could be handled in parallel and on the same connection, requests could be pipelined to improve round-trip time, and HTTP/1.1 was actually supported. But the tuits to finish it up never came, so this branch must by now be regarded as dead. I still hope some brave soul shows up and decides to bring it back to life.
1998 was also the year that the HTML:: modules were unbundled from the core LWP distribution and the year after Sean M. Burke showed up and took over maintenance of the HTML-Tree distribution, actually making it handle all the real-world HTML that you will find. I had kind of given up on dealing with all the strange HTML that the web ecology had let develop. Sean had enough dedication to make sense of it.
Today LWP is in strict maintenance mode with a much slower release cycle. The code base seems to be quite solid and capable of doing what most people expect it to.
Installing LWP
LWP and the associated modules are available in various distributions free from the Comprehensive Perl Archive Network (CPAN). The main distributions are listed at the start of Appendix A, although the details of which modules are in which distributions change occasionally.
If you’re using ActivePerl for Windows or MacPerl for Mac OS 9, you already have LWP. If you’re on Unix and you don’t already have LWP installed, you’ll need to install it from CPAN using instructions given in the next section.
To test whether you already have LWP installed:
% perl -MLWP -le "print(LWP->VERSION)"
(The second character in -le is a lowercase L, not a digit one.)
If you see:
Can't locate LWP in @INC (@INC contains: ...lots of paths...).
BEGIN failed--compilation aborted.
or if you see a version number lower than 5.64, you need to install LWP on your system.
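If you would rather make the same check from inside a program, here is one possible sketch. The helper name lwp_status is hypothetical; the only LWP-specific call is the VERSION method already used in the one-liner above, which dies when given a minimum version that the installed module doesn't meet.

```perl
#!/usr/bin/perl -w
use strict;

# Report whether LWP is installed and at least the given version.
sub lwp_status {
  my $min = shift;
  # require dies if the module can't be found; eval traps that.
  return "not installed" unless eval { require LWP; 1 };
  # VERSION($min) dies if the installed version is lower than $min.
  return eval { LWP->VERSION($min); 1 }
    ? "ok (" . LWP->VERSION . ")"
    : "too old (" . LWP->VERSION . ")";
}

print "LWP: ", lwp_status(5.64), "\n";
```

Either "not installed" or "too old" means you should continue with the installation instructions that follow.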
There are two ways to install modules: using the CPAN shell or the old-fashioned manual way.
Installing LWP from the CPAN Shell
The CPAN shell is a command-line environment for automatically downloading, building, and installing modules from CPAN.
Configuring
If you have never used the CPAN shell, you will need to configure it before you can use it. It will prompt you for some information before building its configuration file.
Invoke the CPAN shell by entering the following command at a system shell prompt:
% perl -MCPAN -eshell
If you’ve never run it before, you’ll see this:
We have to reconfigure CPAN.pm due to following uninitialized parameters:
followed by a number of questions. For each question, the default answer is typically fine, but you may answer otherwise if you know that the default setting is wrong or not optimal. Once you’ve answered all the questions, a configuration file is created and you can start working with the CPAN shell.
Obtaining help
If you need help at any time, you can read the CPAN shell’s manual page by typing perldoc CPAN or by starting up the CPAN shell (with perl -MCPAN -eshell at a system shell prompt) and entering h at the cpan> prompt:
cpan> h

Display Information
 command  argument          description
 a,b,d,m  WORD or /REGEXP/  about authors, bundles, distributions, modules
 i        WORD or /REGEXP/  about anything of above
 r        NONE              reinstall recommendations
 ls       AUTHOR            about files in the author's directory

Download, Test, Make, Install...
 get                        download
 make                       make (implies get)
 test     MODULES,          make test (implies make)
 install  DISTS, BUNDLES    make install (implies test)
 clean                      make clean
 look                       open subshell in these dists' directories
 readme                     display these dists' README files

Other
 h,?           display this menu       ! perl-code   eval a perl command
 o conf [opt]  set and query options   q             quit the cpan shell
 reload cpan   load CPAN.pm again      reload index  load newer indices
 autobundle    Snapshot                force cmd     unconditionally do cmd
Installing LWP
All you have to do is enter:
cpan> install Bundle::LWP
The CPAN shell will show messages explaining what it’s up to. You may need to answer questions to configure the various modules (e.g., libnet asks for mail hosts and so on for testing purposes).
After much activity, you should then have a fresh copy of LWP on your system, with far less work than installing it manually one distribution at a time. At the time of this writing, install Bundle::LWP installs not just the libwww-perl distribution, but also URI and HTML-Parser. It does not install the HTML-Tree distribution that we’ll use in Chapter 9 and Chapter 10. To do that, enter:
cpan> install HTML::Tree
These commands do not install the HTML-Format distribution, which was also once part of the LWP distribution. I do not discuss HTML-Format in this book, but if you want to install it so that you have a complete LWP installation, enter this command:
cpan> install HTML::Format
Remember, LWP may be just about the most popular distribution in CPAN, but that’s not all there is! Look around the web-related parts of CPAN (I prefer the interface at http://search.cpan.org, but you can also try http://kobesearch.cpan.org) as there are dozens of modules, from WWW::Automate to SOAP::Lite, that can simplify your web-related tasks.
Installing LWP Manually
The normal Perl module installation procedure is summed up in the document perlmodinstall. You can read this by running perldoc perlmodinstall at a shell prompt or online at http://theoryx5.uwinnipeg.ca/CPAN/perl/pod/perlmodinstall.html.
CPAN is a network of a large collection of Perl software and documentation. See the CPAN FAQ at http://www.cpan.org/misc/cpan-faq.html for more information about CPAN and modules.
Download distributions
First, download the module distributions. LWP requires several other modules to operate successfully. You’ll need to install the distributions given in Table 1-1, in the order in which they are listed.
Distribution | CPAN directory |
MIME-Base64 | authors/id/G/GA/GAAS |
libnet | authors/id/G/GB/GBAAR |
HTML-Tagset | authors/id/S/SB/SBURKE |
HTML-Parser | authors/id/G/GA/GAAS |
URI | authors/id/G/GA/GAAS/URI |
Compress-Zlib | authors/id/P/PM/PMQS/Compress-Zlib |
Digest-MD5 | authors/id/G/GA/GAAS/Digest-MD5 |
libwww-perl | authors/id/G/GA/GAAS/libwww-perl |
HTML-Tree | authors/id/S/SB/SBURKE/HTML-Tree |
Fetch these modules from one of the FTP or web sites that form CPAN, listed at http://www.cpan.org/SITES.html and http://mirror.cpan.org. Sometimes CPAN has several versions of a module in the authors directory. Be sure to check the version number and get the latest.
For example, to install MIME-Base64, you might first fetch http://www.cpan.org/authors/id/G/GA/GAAS/ to see which versions are there, then fetch http://www.cpan.org/authors/id/G/GA/GAAS/MIME-Base64-2.12.tar.gz and install that.
Unpack and configure
The distributions are gzipped tar archives of source code. Extracting a distribution creates a directory, and in that directory is a Makefile.PL Perl program that builds a Makefile for you.
% tar xzf MIME-Base64-2.12.tar.gz
% cd MIME-Base64-2.12
% perl Makefile.PL
Checking if your kit is complete...
Looks good
Writing Makefile for MIME::Base64
Make, test, and install
Compile the code with the make command:
% make
cp Base64.pm blib/lib/MIME/Base64.pm
cp QuotedPrint.pm blib/lib/MIME/QuotedPrint.pm
/usr/bin/perl -I/opt/perl5/5.6.1/i386-freebsd -I/opt/perl5/5.6.1 /opt/perl5/5.6.1/ExtUtils/xsubpp -typemap /opt/perl5/5.6.1/ExtUtils/typemap Base64.xs > Base64.xsc && mv Base64.xsc Base64.c
cc -c -fno-strict-aliasing -I/usr/local/include -O -DVERSION=\"2.12\" -DXS_VERSION=\"2.12\" -DPIC -fpic -I/opt/perl5/5.6.1/i386-freebsd/CORE Base64.c
Running Mkbootstrap for MIME::Base64 ( )
chmod 644 Base64.bs
rm -f blib/arch/auto/MIME/Base64/Base64.so
LD_RUN_PATH="" cc -o blib/arch/auto/MIME/Base64/Base64.so -shared -L/opt Base64.o
chmod 755 blib/arch/auto/MIME/Base64/Base64.so
cp Base64.bs blib/arch/auto/MIME/Base64/Base64.bs
chmod 644 blib/arch/auto/MIME/Base64/Base64.bs
Manifying blib/man3/MIME::Base64.3
Manifying blib/man3/MIME::QuotedPrint.3
Then make sure everything works on your system with make test:
% make test
PERL_DL_NONLAZY=1 /usr/bin/perl -Iblib/arch -Iblib/lib -I/opt/perl5/5.6.1/i386-freebsd -I/opt/perl5/5.6.1 -e 'use Test::Harness qw(&runtests $verbose); $verbose=0; runtests @ARGV;' t/*.t
t/base64..........ok
t/quoted-print....ok
t/unicode.........skipped test on this platform
All tests successful, 1 test skipped.
Files=3, Tests=306, 1 wallclock secs ( 0.52 cusr + 0.06 csys = 0.58 CPU)
If it passes the tests, install it with make install (as the superuser):
# make install
Installing /opt/perl5/site_perl/5.6.1/i386-freebsd/auto/MIME/Base64/Base64.so
Installing /opt/perl5/site_perl/5.6.1/i386-freebsd/auto/MIME/Base64/Base64.bs
Files found in blib/arch: installing files in blib/lib into architecture dependent library tree
Installing /opt/perl5/site_perl/5.6.1/i386-freebsd/MIME/Base64.pm
Installing /opt/perl5/site_perl/5.6.1/i386-freebsd/MIME/QuotedPrint.pm
Installing /usr/local/man/man3/MIME::Base64.3
Installing /usr/local/man/man3/MIME::QuotedPrint.3
Writing /opt/perl5/site_perl/5.6.1/i386-freebsd/auto/MIME/Base64/.packlist
Appending installation info to /opt/perl5/5.6.1/i386-freebsd/perllocal.pod
Words of Caution
In theory, the underlying mechanisms of the Web make no difference between a browser getting data and displaying it to you, and your LWP-based program getting data and doing something else with it. However, in practice, almost all the data on the Web was put there with the assumption (sometimes implicit, sometimes explicit) that it would be looked at directly in a browser. When you write an LWP program that downloads that data, you are working against that assumption. The trick is to do this in as considerate a way as possible.
Network and Server Load
When you access a web server, you are using scarce resources. You are using your bandwidth and the web server’s bandwidth. Moreover, processing your request places a load on the remote server, particularly if the page you’re requesting has to be dynamically generated, and especially if that dynamic generation involves database access. If you’re writing a program that requests several pages from a given server but you don’t need the pages immediately, you should write delays into your program (such as sleep 60; to sleep for one minute), so that the load that you’re placing on the network and on the web server is spread unobtrusively over a longer period of time.
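One way to structure such delays is sketched below. The fetch_politely routine and its argument order are inventions for this illustration. The fetching routine is passed in as a code reference so that the sketch runs without touching the network; with LWP::Simple loaded, you could pass \&LWP::Simple::get instead. The one-second delay just keeps the demonstration short; a real program would use something closer to the sleep 60 suggested above.

```perl
#!/usr/bin/perl -w
use strict;

# Fetch each URL in turn, pausing $delay seconds between requests
# (but not after the last one). $fetch is a code reference that
# takes a URL and returns its content.
sub fetch_politely {
  my($delay, $fetch, @urls) = @_;
  my %content;
  while (defined(my $url = shift @urls)) {
    $content{$url} = $fetch->($url);
    sleep $delay if @urls;    # more to come, so wait first
  }
  return %content;
}

# A stand-in fetcher, so that no real server is contacted here.
my $fake_get = sub { "content of $_[0]" };

my %pages = fetch_politely(1, $fake_get,
  "http://www.example.com/a", "http://www.example.com/b");
print "$_ => $pages{$_}\n" for sort keys %pages;
```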
If possible, you might even want to consider having your program run in the middle of the night (modulo the relevant time zones), when network usage is low and the web server is not likely to be busy handling a lot of requests. Do this only if you know there is no risk of your program behaving unpredictably. In Chapter 12, we discuss programs with definite risk of that happening; do not let such programs run unattended until you have added appropriate safeguards and carefully checked that they behave as you expect them to.
Copyright
While the complexities of national and international copyright law can’t be covered in a page or two (or even a library or two), the short story is that just because you can get some data off the Web doesn’t mean you can do whatever you want with it. The things you do with data on the Web form a continuum, as far as their relation to copyright law. At the one end is direct use, where you sit at your browser, downloading and reading pages as the site owners clearly intended. At the other end is illegal use, where you run a program that hammers a remote server as it copies and saves copyrighted data that was not meant for free public consumption, then saves it all to your public web server, which you then encourage people to visit so that you can make money off of the ad banners you’ve put there. Between these extremes, there are many gray areas involving considerations of “fair use,” a tricky concept. The safest guide in trying to stay on the right side of copyright law is to ask, by using the data this way, could I possibly be depriving the original web site of some money that it would/could otherwise get?
For example, suppose that you set up a program that copies data every hour from the Yahoo! Weather site, for the 50 most populous towns in your state. You then copy the data directly to your public web site and encourage everyone to visit it. Even though “no one owns the weather,” even if any particular bit of weather data is in the public domain (which it may be, depending on its source), Yahoo! Weather put time and effort into making a collection of that data, presented in a certain way. And as such, the collection of data is copyrighted.
Moreover, by posting the data publicly, you are almost definitely taking viewers away from Yahoo! Weather, which means less ad revenue for them. Even if Yahoo! Weather didn’t have any ads and so wasn’t obviously making any money off of viewers, your having the data online elsewhere means that if Yahoo! Weather wanted to start having ads tomorrow, they’d be unable to make as much money at it, because there would be people in the habit of looking at your web site’s weather data instead of at theirs.
Acceptable Use
Besides the protection provided by copyright law, many web sites have “terms of use” or “acceptable use” policies, where the web site owners basically say “as a user, you may do this and this, but not that or that, and if you don’t abide by these terms, then we don’t want you using this web site.” For example, a search engine’s terms of use might stipulate that you should not make “automated queries” to their system, nor should you show the search data on another site.
Before you start pulling data off of a web site, you should put good effort into looking around for its terms of service document, and take the time to read it and reasonably interpret what it says. When in doubt, ask the web site’s administrators whether what you have in mind would bother them.
LWP in Action
Enough of why you should be careful when you automate the Web. Let’s look at the types of things you’ll be learning in this book. Chapter 2 introduces web automation and LWP, presenting straightforward functions to let you fetch web pages. Example 1-1 shows how to fetch the O’Reilly catalog page and count the number of times Perl is mentioned.
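#!/usr/bin/perl -w
use strict;
use LWP::Simple;

my $catalog = get("http://www.oreilly.com/catalog");
# get( ) returns undef on failure, so check before matching
die "Couldn't fetch the catalog page\n" unless defined $catalog;
my $count = 0;
$count++ while $catalog =~ m{Perl}gi;
print "$count\n";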
The LWP::Simple module’s get( ) function returns the document at a given URL or undef if an error occurred. A regular expression match in a loop counts the number of occurrences.
The Object-Oriented Interface
Chapter 3 goes beyond LWP::Simple to show LWP’s larger and more powerful object-oriented interface. The most useful of the features it covers are how to set headers in requests and how to check the headers of responses. Example 1-2 prints the identifying string that every server returns.
#!/usr/bin/perl -w
use strict;
use LWP;

my $browser = LWP::UserAgent->new( );
my $response = $browser->get("http://www.oreilly.com/");
print $response->header("Server"), "\n";
The two variables, $browser and $response, are references to objects. The LWP::UserAgent object $browser makes requests of a server and creates HTTP::Response objects such as $response to represent the server’s reply. In Example 1-2, we call the header( ) method on the response to check one of the HTTP header values.
Forms
Chapter 5 shows how to analyze and submit forms with LWP, including both GET and POST submissions. Example 1-3 makes queries of the California license plate database to see whether a personalized plate is available.
#!/usr/bin/perl -w
# pl8.pl - query California license plate database
use strict;
use LWP::UserAgent;

my $plate = $ARGV[0] || die "Plate to search for?\n";
$plate = uc $plate;
$plate =~ tr/O/0/;  # we use zero for letter-oh
die "$plate is invalid.\n"
  unless $plate =~ m/^[A-Z0-9]{2,7}$/
  and $plate !~ m/^\d+$/;  # no all-digit plates

my $browser = LWP::UserAgent->new;
my $response = $browser->post(
  'http://plates.ca.gov/search/search.php3',
  [
    'plate'  => $plate,
    'search' => 'Check Plate Availability'
  ],
);
die "Error: ", $response->status_line
  unless $response->is_success;

if($response->content =~ m/is unavailable/) {
  print "$plate is already taken.\n";
} elsif($response->content =~ m/and available/) {
  print "$plate is AVAILABLE!\n";
} else {
  print "$plate... Can't make sense of response?!\n";
}
exit;
Here’s how you might use it:
% pl8.pl knee
KNEE is already taken.
% pl8.pl ankle
ANKLE is AVAILABLE!
We use the post( ) method on an LWP::UserAgent object to POST form parameters to a page.
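The input checks at the top of pl8.pl can be tried without contacting the server. The sketch below lifts them into a hypothetical normalize_plate routine (that name and the test inputs are mine, not part of the program above) so you can see which inputs would be rejected before any form is submitted.

```perl
#!/usr/bin/perl -w
use strict;

# Normalize a requested plate the way pl8.pl does, returning the
# cleaned-up plate, or undef if it fails pl8.pl's validity checks.
sub normalize_plate {
  my $plate = uc($_[0]);
  $plate =~ tr/O/0/;                                  # zero for letter-oh
  return undef unless $plate =~ m/^[A-Z0-9]{2,7}$/;   # 2 to 7 letters/digits
  return undef if $plate =~ m/^\d+$/;                 # no all-digit plates
  return $plate;
}

for my $try (qw(knee foo 12345 toolongplate)) {
  my $plate = normalize_plate($try);
  print "$try: ", (defined $plate ? $plate : "invalid"), "\n";
}
```

Checking input locally like this also spares the remote server requests that could never succeed.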
Parsing HTML
The regular expression techniques in Example 1-1 and Example 1-3 are discussed in detail in Chapter 6. Chapter 7 shows a different approach, where the HTML::TokeParser module turns a string of HTML into a stream of chunks (“start-tag,” “text,” “close-tag,” and so on). Chapter 8 is a detailed step-by-step walkthrough showing how to solve a problem using HTML::TokeParser. Example 1-4 uses HTML::TokeParser to extract the src parts of all img tags in the O’Reilly home page.
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TokeParser;

my $html   = get("http://www.oreilly.com/");
my $stream = HTML::TokeParser->new(\$html);
my %image  = ( );

while (my $token = $stream->get_token) {
  if ($token->[0] eq 'S' && $token->[1] eq 'img') {
    # store src value in %image
    $image{ $token->[2]{'src'} }++;
  }
}

foreach my $pic (sort keys %image) {
  print "$pic\n";
}
The get_token( ) method on our HTML::TokeParser object returns an array reference, representing a token. If the first array element is S, it’s a token representing the start of a tag. The second array element is the type of tag, and the third array element is a reference to a hash mapping attribute names to values. The %image hash holds the images we find.
Chapter 9 and Chapter 10 show how to use tree data structures to represent HTML. The HTML::TreeBuilder module constructs such trees and provides operations for searching and manipulating them. Example 1-5 extracts image locations using a tree.
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TreeBuilder;

my $html = get("http://www.oreilly.com/");
my $root = HTML::TreeBuilder->new_from_content($html);

my %images;
foreach my $node ($root->find_by_tag_name('img')) {
  $images{ $node->attr('src') }++;
}

foreach my $pic (sort keys %images) {
  print "$pic\n";
}
We create a new tree from the HTML in the O’Reilly home page. The tree has methods to help us search, such as find_by_tag_name( ), which returns a list of nodes corresponding to those tags. We use that to find the img tags, then use the attr( ) method to get their src attributes.
Authentication
Chapter 11 talks about advanced request features such as cookies (used to identify a user between web page accesses) and authentication. Example 1-6 shows how easy it is to request a protected page with LWP.
#!/usr/bin/perl -w
use strict;
use LWP;

my $browser = LWP::UserAgent->new( );
$browser->credentials("www.example.com:80", "music", "fred" => "l33t1");
my $response = $browser->get("http://www.example.com/mp3s");
# ...
The credentials( ) method on an LWP::UserAgent adds the authentication information (the host, realm, and username/password pair are the parameters). The realm identifies which username and password are expected if there are multiple protected areas on a single host. When we request a document using that LWP::UserAgent object, the authentication information is used if necessary.