By Kevin Hemenway, Tara Calishain
Book Price: $24.95 USD
£17.50 GBP
PDF Price: $19.99
User-Agent names like Googlebot, Scooter, and MSNbot. These are all spiders, or bots, as some prefer to call them.
<title>This is the title</title>
<title> and </title> tags.
<html>
<head>
<title>
Title of the page
</title>
</head>
<body>
Body of the page
</body>
</html>
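As a rough illustration of how a spider might pull the title out of a page like the one above, here is a minimal sketch. It assumes HTML::TokeParser (from the HTML-Parser distribution) is installed; the excerpt itself doesn't name a parser at this point, and the page_title helper is a hypothetical name used only for this example.

```perl
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;

# Hypothetical helper: return the text between <title> and </title>,
# or undef if the document has no <title> tag.
sub page_title {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html);   # parse from a string
    return undef unless $parser->get_tag('title');
    return $parser->get_trimmed_text('/title');
}

my $html = <<'END_HTML';
<html>
<head>
<title>
Title of the page
</title>
</head>
<body>
Body of the page
</body>
</html>
END_HTML

my $title = page_title($html);
print defined $title ? "Title: $title\n" : "No title found\n";
```

A tokenizing parser is used here rather than a regular expression because it copes with line breaks and stray whitespace inside the tag.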
http://www.webmasterworld.com) have entire
forums devoted to identifying and discussing
spiders.
Don't think that your spider is going to get ignored
just because you're not using a thousand online
servers and spidering millions of pages a day.
work for yahoo" (weblog | blog) does nicely.
Sometimes, you can contact these people and let them know what
you're doing, and they can either pass your email to
someone who can approve it, or give you some other feedback.
You agree that you will not use any robot, spider, scraper or other automated means to access the Site for any purpose without our express written permission.
http://www.golfcourses.com, then takes the
Zip Codes of the courses returned and checks them against http://www.scorecard.org to see which have
the most (or least) polluted environment.
http://www.cpan.org) and the uncanny ability
to "do what you mean,"
it's a perfect language on which to base a spidering
hacks book.
http://www.cpan.org), test to make sure
it'll work in our environment, ensure it
doesn't require other modules that we
don't yet have, install it, and then prepare it for
general use within our own scripts.
http://www.oreilly.com/catalog/perlckbk2/) by
Tom Christiansen and Nathan Torkington
http://www.oreilly.com/catalog/pperl3/) by
Larry Wall, Tom Christiansen, and Jon Orwant
http://oreilly.com/catalog/perllwp/).
get($url)
routine, where $url is the location of the content
you're interested in.
LWP::Simple will try to fetch the content at the
end of the URL. If it's successful,
you'll be handed the content; if
there's an error of some sort, the
get function will return undef,
the undefined value. The get represents an aptly
named HTTP GET request, which reads as
"get me the content at the end of this
URL":
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
# Just an example: the URL for the most recent /Fresh Air/ show
my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';
my $content = get($url);
die "Couldn't get $url" unless defined $content;
# Do things with $content:
if ($content =~ m/jazz/i) {
    print "They're talking about jazz today on Fresh Air!\n";
} else {
    print "Fresh Air is apparently jazzless today.\n";
}
get is
getprint, useful in Perl one-liners. If it can get
the page whose URL you provide, it sends it straight to
STDOUT; otherwise, it complains to
STDERR (both are usually your screen):
% perl -MLWP::Simple -e "getprint 'http://cpan.org/RECENT'"
MIRRORED.BY
MIRRORING.FROM
RECENT
RECENT.html
SITES
SITES.html
authors/00whois.html
authors/01mailrc.txt.gz
authors/id/A/AB/ABW/CHECKSUMS
authors/id/A/AB/ABW/Pod-POM-0.17.tar.gz
...
$response = $browser->get($url), like so:
#!/usr/bin/perl -w
use strict;
use LWP 5.64; # Loads all important LWP classes, and makes
              # sure your version is reasonably recent.
my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';
my $browser = LWP::UserAgent->new;
my $response = $browser->get( $url );
die "Can't get $url -- ", $response->status_line
    unless $response->is_success;
die "Hey, I was expecting HTML, not ", $response->content_type
    unless $response->content_type eq 'text/html';
    # or whatever content-type you're dealing with.
# Otherwise, process the content somehow:
if ($response->content =~ m/jazz/i) {
    print "They're talking about jazz today on Fresh Air!\n";
} else {
    print "Fresh Air is apparently jazzless today.\n";
}
$browser, which
holds an object of the class LWP::UserAgent, and
the $response object, which is of the class
HTTP::Response. You really need only one browser
object per program; but every time you make a request, you get back a
new
$response = $browser->get($url), but in truth you can add extra
HTTP header lines to the request by
adding a list of key/value pairs after the URL, like so:
$response = $browser->get( $url, $key1, $value1, $key2, $value2, ... );
my @ns_headers = (
'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
'Accept-Charset' => 'iso-8859-1,*',
'Accept-Language' => 'en-US',
);
$response = $browser->get($url, @ns_headers);
$response = $browser->get($url,
'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
'Accept-Charset' => 'iso-8859-1,*',
'Accept-Language' => 'en-US',
);
Accept and in what order: GIFs, bitmaps, JPEGs,
PNGs, and then anything else (you'd rather have a
GIF first, but an HTML file is fine if the server
can't provide the data in your preferred formats).
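To see what those key/value pairs turn into, the following sketch builds the request without sending it and prints it out. It assumes HTTP::Request::Common (which ships alongside LWP) rather than the $browser->get call used above, and http://www.example.com/ is only a placeholder URL.

```perl
#!/usr/bin/perl -w
use strict;
use HTTP::Request::Common qw(GET);

# The same browser-like headers as above, as a list of key/value pairs.
my @ns_headers = (
    'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept'          => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
    'Accept-Charset'  => 'iso-8859-1,*',
    'Accept-Language' => 'en-US',
);

# Build (but do not send) a GET request; the URL is a placeholder.
my $request = GET('http://www.example.com/', @ns_headers);

# Print the request line and headers for inspection.
print $request->as_string;
```

A request prepared this way can later be handed to a browser object with $browser->request($request), which is handy when you want to check the headers before anything goes over the wire.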
For servers that cater to international users by offering translated
documents, the
Accept-Language
and three blind
mice. Your result URL will vary depending on the
preferences you've set, but it will look something
like this:
http://www.google.com/search?num=100&hl=en&q=%22three+blind+mice%22
&q=%22three+blind+mice%22, but why? Whenever
you send data through a form submission, that data has to be encoded
so that it can safely arrive at its destination, the server, intact.
Characters like spaces and quotes (in essence, anything not
alphanumeric) must be turned into their encoded equivalents,
like + and %22.
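The encoding step can be tried by hand. This is a minimal sketch using the uri_escape and uri_unescape functions from URI::Escape; the variable names are just for illustration. Note that uri_escape writes a space as %20, whereas the + seen in the Google URL is the form-submission convention for the same character.

```perl
#!/usr/bin/perl -w
use strict;
use URI::Escape qw(uri_escape uri_unescape);

my $phrase  = '"three blind mice"';
my $encoded = uri_escape($phrase);      # quotes become %22, spaces become %20
my $decoded = uri_unescape($encoded);   # and back to the original string

print "Original: $phrase\n";
print "Encoded:  $encoded\n";
print "Decoded:  $decoded\n";
```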
LWP will automatically handle most of this
encoding (and decoding) for you, but you can request it at will with
URI::Escape's
uri_escape and uri_unescape
functions.
num=100 refers to the number of search results to
a page, 100 in this case. Google accepts any
number from 10 to 100. Altering
the value of num in the URL and reloading the page
is a nice shortcut for altering the preferred size of your result set
without having to meander over to the Advanced Search (http://www.google.com/advanced_search?hl=en)
and rerun your query.
hl=en means that the language interface (the
language in which you use Google, reflected in the home page,
messages, and buttons) is in English.
Google's Language Tools
(http://www.google.com/language_tools?hl=en)
provide a list of language choices.
q, num, and
hl and their associated values represent a
GET form request; you can always tell when you
have one by the URL in your browser's address bar,
where you'll see the URL, then a question mark
(?), followed by key/value pairs separated by an
ampersand (&). To run the same search from
within
http://www.unicode.org/mail-arch/), namely,
username "unicode-ml" and password
"unicode".
http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html
$url->scheme, asking which host it refers to
with $url->host, and so on, as described in the
docs for the URI class). However, the methods of
most immediate interest are the query_form method [Hack #12] and the
new_abs method for taking a URL string that is
most likely relative and getting back an absolute URL, as shown here:
use URI;
my $abs = URI->new_abs($maybe_relative, $base);
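Here is a small, offline sketch of both methods at work. The search URL and the relative links are taken from examples elsewhere in this text; http://www.perl.org/ is simply an already absolute URL thrown in to show that new_abs leaves such links alone.

```perl
#!/usr/bin/perl -w
use strict;
use URI;

# Build a GET query string from key/value pairs with query_form.
my $search = URI->new('http://www.google.com/search');
$search->query_form(
    num => 100,
    hl  => 'en',
    q   => '"three blind mice"',
);
print "$search\n";

# Resolve possibly relative links against the page they came from.
my $base = 'http://www.cpan.org/RECENT.html';
foreach my $link ('MIRRORING.FROM', 'authors/00whois.html', 'http://www.perl.org/') {
    my $abs = URI->new_abs($link, $base);
    print "$link => $abs\n";
}
```

Because query_form does the escaping itself, the quoted phrase can be passed in as plain text.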
#!/usr/bin/perl -w
use strict;
use LWP 5.64;
my $browser = LWP::UserAgent->new;
my $url = 'http://www.cpan.org/RECENT.html';
my $response = $browser->get($url);
die "Can't get $url -- ", $response->status_line
unless $response->is_success;
my $html = $response->content;
while( $html =~ m/<A HREF=\"(.*?)\"/g ) {
print "$1\n";
}
% perl get_relative.pl
MIRRORING.FROM
RECENT
RECENT.html
authors/00whois.html
authors/01mailrc.txt.gz
authors/id/A/AA/AASSAD/CHECKSUMS
...
new_abs
#!/usr/bin/perl -w
use strict;
use LWP 5.64;
my $url = 'https://www.paypal.com/'; # Yes, HTTPS!
my $browser = LWP::UserAgent->new;
my $response = $browser->get($url);
die "Error at $url\n ", $response->status_line,
"\n Aborting" unless $response->is_success;
print "Whee, it worked! I got that ",
$response->content_type, " document!\n";
Error at https://www.paypal.com/ 501 Protocol scheme 'https' is not supported
$response just as you would any normal HTTP
response [Hack #10].
User-Agent or add a
Referer to get past certain server-side filters.
HTTP headers aren't always used for subversion,
though, and
If-Modified-Since
is a perfect example of one that
isn't. The following script downloads a web page and
returns the
Last-Modified HTTP
header, as reported by the server:
#!/usr/bin/perl -w
use strict;
use LWP 5.64;
use HTTP::Date;
my $url = 'http://disobey.com/amphetadesk/';
my $browser = LWP::UserAgent->new;
my $response = $browser->get( $url );
print "Got: ", $response->status_line;
print "\n". "Epoch: " . $response->last_modified . "\n";
print "English: " . time2str($response->last_modified) . "\n";
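Once you have a Last-Modified value in hand, it can be fed back to the server in an If-Modified-Since header. The sketch below only builds such a request, so it runs without a network connection. The conditional_get helper and the saved date are hypothetical, and HTTP::Request::Common is an assumption of this example rather than something the script above uses.

```perl
#!/usr/bin/perl -w
use strict;
use HTTP::Date qw(time2str str2time);
use HTTP::Request::Common qw(GET);

# Hypothetical helper: build a conditional GET that asks the server to
# send the page only if it changed after $epoch (seconds since 1970).
sub conditional_get {
    my ($url, $epoch) = @_;
    return GET($url, 'If-Modified-Since' => time2str($epoch));
}

# Pretend this is the last_modified value saved from an earlier fetch.
my $last_seen = str2time('Sat, 01 Mar 2003 12:00:00 GMT');

my $request = conditional_get('http://disobey.com/amphetadesk/', $last_seen);
print $request->as_string;

# Passing this to $browser->request($request) would send it; a server
# that honors the header answers "304 Not Modified" when nothing changed.
```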
http://www.robotstxt.org)—a
magical bit of text that you, as web developer and site owner, can
create to control the capabilities of third-party robots, agents,
scrapers, spiders, or what have you. Here is an example of a
robots.txt file that blocks any
robot's access to three specific directories:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

User-agent: *
Disallow: /
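To get a feel for how a robot reads rules like these, here is a minimal sketch that parses the three-directory example offline with WWW::RobotRules, the module LWP::RobotUA relies on for this job. The bot name and the www.example.com host are placeholders.

```perl
#!/usr/bin/perl -w
use strict;
use WWW::RobotRules;

# The bot name is a placeholder; since no record names it, the
# "User-agent: *" record is the one that applies.
my $rules = WWW::RobotRules->new('SuperBot/1.34');

my $robots_txt = <<'END_ROBOTS';
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
END_ROBOTS

# parse() takes the URL the file came from, plus its contents.
$rules->parse('http://www.example.com/robots.txt', $robots_txt);

foreach my $url ('http://www.example.com/index.html',
                 'http://www.example.com/private/notes.html') {
    my $verdict = $rules->allowed($url) ? 'allowed' : 'forbidden';
    print "$url is $verdict\n";
}
```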
http://search.cpan.org/author/GAAS/libwww-perl/lib/LWP/RobotUA.pm)
instead of LWP::UserAgent. Doing so also ensures
that your script doesn't make requests too many
times a second, saturating the site's bandwidth
unnecessarily. LWP::RobotUA is just like
LWP::UserAgent, and you can use it like so:
use LWP::RobotUA;
# Your bot's name and your email address
my $browser = LWP::RobotUA->new('SuperBot/1.34', 'you@site.com');
my $response = $browser->get($url);
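The pause between requests is adjustable through the delay method of LWP::RobotUA. The sketch below sets it to ten seconds, a figure chosen only for illustration; it makes no requests, so it can be run offline.

```perl
#!/usr/bin/perl -w
use strict;
use LWP::RobotUA;

# Your bot's name and your email address
my $browser = LWP::RobotUA->new('SuperBot/1.34', 'you@site.com');

# delay() is measured in minutes, so 10/60 means ten seconds
# between requests to the same server.
$browser->delay(10/60);

printf "Waiting %.0f seconds between requests as %s\n",
    $browser->delay * 60, $browser->agent;
```

A shorter delay fetches pages faster but leans harder on the remote server, so err on the generous side unless the site owner has told you otherwise.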
$url's server forbids you from
accessing $url, then the
$browser object (assuming it's of
the class LWP::RobotUA) won't
actually request it, but instead will give you back (in
$response) a 403 error with a message
"Forbidden by robots.txt." Trap
such an eventuality like so:
perl scriptname URL,
where URL is the online location of your
appropriately large piece of sample data:
#!/usr/bin/perl -w
#
# Progress Bar: Dots - Simple example of an LWP progress bar.
# http://disobey.com/d/code/ or contact morbus@disobey.com.
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#
use strict; $|++;
my $VERSION = "1.0";
# make sure we have the modules we need, else die peacefully.
eval("use LWP 5.6.9;"); die "[err] LWP 5.6.9 or greater required.\n" if $@;
# now, check for passed URLs for downloading.
die "[err] No URLs were passed for processing.\n" unless @ARGV;
# our downloaded data.
my $final_data = undef;
# loop through each URL.
foreach my $url (@ARGV) {
print "Downloading URL at ", substr($url, 0, 40), "... ";
# create a new useragent and download the actual URL.
# all the data gets thrown into $final_data, which
# the callback subroutine appends to.
my $ua = LWP::UserAgent->new( );
my $response = $ua->get($url, '
<head> tag is a child
of the <html> tag. The
<title> and <met