Writing a spider to spider an existing spider’s site may seem convoluted, but it can prove useful when you’re looking for location-based services. This hack walks through creating a framework for full-site spidering, including additional filters to lessen your load.
In this hack, you’ll learn how to write a spider that crawls the Yahoo! group of portals. The choice of Yahoo! was obvious; because it is one of the largest Internet portals in existence, it can serve as an ideal example of how one goes about writing a portal spider.
But before we get to the gory details of code, let’s define what exactly a portal spider is. While many may argue with such a classification, I maintain that a portal spider is a script that automatically downloads all documents from a preselected range of URLs found on the portal’s site or a group of sites, as is the case with Yahoo!. A portal spider’s main job is to walk from one document to another, extract URLs from downloaded HTML, process said URLs, and go to another document, repeating the cycle until it runs out of URLs to visit. Once you create code that describes such basic behavior, you can add additional functionality, turning your general portal spider into a specialized one.
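Stripped of anything Yahoo!-specific, that cycle fits in a dozen lines. What follows is a minimal sketch of the general idea, not the script from this hack; the starting URL, the @queue and %seen names, and the ten-page cap are placeholders for illustration. It uses the same LWP::UserAgent and HTML::LinkExtor modules the full spider relies on:

#!/usr/bin/perl -w
# Minimal crawl cycle: keep a queue of URLs, download each one,
# pull out its links, and queue any we have not seen before.
use strict;
use LWP::UserAgent;
use HTML::LinkExtor;

my @queue = ('http://www.yahoo.com');   # URLs still to visit (placeholder start).
my %seen;                               # URLs already visited.
my $browser = LWP::UserAgent->new;

while (my $url = shift @queue) {
    next if $seen{$url}++;              # skip anything we have handled before.
    my $resp = $browser->get($url);
    next unless $resp->is_success;

    # extract every <a href="..."> from the page and queue the new ones.
    HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @queue, $attr{href} if $tag eq 'a' && $attr{href};
    }, $resp->base)->parse($resp->content);

    last if keys(%seen) >= 10;          # stop early; this is only a demo.
}
print "Visited:\n", map { "  $_\n" } keys %seen;

Everything the full yspider.pl adds (per-site allow and deny rules, error tracking, end-of-run reports) hangs off this same skeleton.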
Although writing a script that walks from one Yahoo! page to another sounds simple, it isn’t, because there is no general pattern followed by all Yahoo! sites or sections within those sites. Furthermore, Yahoo! is not a single site with a nice link layout that can be described using a simple algorithm and a classic data structure. Instead, it is a collection of well over 30 thematic sites, each with its own document layout, naming conventions, and peculiarities in page design and URL patterns. For example, if you check links to the same directory section on different Yahoo! sites, you will find that some of them begin with http://www.yahoo.com/r, some begin with http://uk.yahoo.com/r/hp/dr, and others begin with http://kr.yahoo.com.
If you try to look for patterns, you will soon find yourself writing long if/elsif/else sections that are hard to maintain and need to be rewritten every time Yahoo! makes a small change to one of its sites. If you follow that route, you will soon discover that you need to write hundreds of lines of code to describe every kind of behavior you want to build into your spider.
This is particularly frustrating to programmers who expect to write code that uses elegant algorithms and nicely structured data. The hard truth about portals is that you cannot expect elegance and ease of spidering. Instead, prepare yourself for a lot of detective work and writing (and throwing away) chunks of code in a hit-and-miss fashion. Portal spiders are written in an organic, unstructured way, and the only rule you should follow is to keep things simple and add specific functionality only once you have the general behavior working.
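To make the contrast concrete, here is a small, purely illustrative comparison; neither sub appears in the hack's script, and the URLs are just the examples mentioned earlier. The first sub hard-codes one branch per site quirk, while the second keeps the quirks in a pattern list you can grow or prune without touching the code, which is the shape the %rules hash in the full script takes:

use strict;

# the hard-coded approach the text warns against: one branch per quirk.
sub allowed_hardcoded {
    my $url = shift;
    if    ($url =~ /^http:\/\/www\.yahoo\.com\/r/)        { return 1; }
    elsif ($url =~ /^http:\/\/uk\.yahoo\.com\/r\/hp\/dr/) { return 1; }
    elsif ($url =~ /^http:\/\/kr\.yahoo\.com/)            { return 1; }
    # ...and so on, forever.
    else  { return 0; }
}

# the data-driven alternative: the behavior lives in a list of patterns.
my @allow = ( 'http:\/\/www\.', 'http:\/\/uk\.', 'http:\/\/kr\.' );
sub allowed_datadriven {
    my $url = shift;
    foreach my $pattern (@allow) {
        return 1 if $url =~ /$pattern/i;
    }
    return 0;
}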
Okay—with taxonomy and general advice behind us, we can get to the gist of the matter. The spider in this hack is a relatively simple tool for crawling Yahoo! sites. It makes no assumptions about the layout of the sites; in fact, it makes almost no assumptions whatsoever and can easily be adapted to other portals or even groups of portals. You can use it as a framework for writing specialized spiders.
Save the following code to a file called yspider.pl:
#!/usr/bin/perl -w
#
# yspider.pl
#
# Yahoo! Spider--crawls Yahoo! sites, collects links from each downloaded
# HTML page, searches each downloaded page and prints a list of results
# when done.
# http://www.artymiak.com/software/ or contact jacek@artymiak.com
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.

use strict;
use Getopt::Std;     # parse command line options.
use LWP::UserAgent;  # download data from the net.
use HTML::LinkExtor; # get links inside an HTML document.
use URI::URL;        # turn relative links into absolutes.

my $help = <<"EOH";
------------------------------------------------------------------------
Yahoo! Spider.

Options: -s    list of sites you want to crawl,
               e.g. -s 'us china denmark'
         -h    print this help

Allowed values of -s are: argentina, asia, australia, brazil, canada,
catalan, china, denmark, france, germany, hongkong, india, ireland,
italy, japan, korea, mexico, newzealand, norway, singapore, spain,
sweden, taiwan, uk, us, us_chinese, us_spanish

Please, use this code responsibly. Flooding any site
with excessive queries is bad net citizenship.
------------------------------------------------------------------------
EOH

# define our arguments and
# show the help if asked.
my %args; getopts("s:h", \%args);
die $help if exists $args{h};

# The list of code names, and
# URLs, for various Yahoo! sites.
my %ys = (
    argentina  => "http://ar.yahoo.com",
    asia       => "http://asia.yahoo.com",
    australia  => "http://au.yahoo.com",
    newzealand => "http://au.yahoo.com",
    brazil     => "http://br.yahoo.com",
    canada     => "http://ca.yahoo.com",
    catalan    => "http://ct.yahoo.com",
    china      => "http://cn.yahoo.com",
    denmark    => "http://dk.yahoo.com",
    france     => "http://fr.yahoo.com",
    germany    => "http://de.yahoo.com",
    hongkong   => "http://hk.yahoo.com",
    india      => "http://in.yahoo.com",
    italy      => "http://it.yahoo.com",
    korea      => "http://kr.yahoo.com",
    mexico     => "http://mx.yahoo.com",
    norway     => "http://no.yahoo.com",
    singapore  => "http://sg.yahoo.com",
    spain      => "http://es.yahoo.com",
    sweden     => "http://se.yahoo.com",
    taiwan     => "http://tw.yahoo.com",
    uk         => "http://uk.yahoo.com",
    ireland    => "http://uk.yahoo.com",
    us         => "http://www.yahoo.com",
    japan      => "http://www.yahoo.co.jp",
    us_chinese => "http://chinese.yahoo.com",
    us_spanish => "http://espanol.yahoo.com"
);

# if the -s option was used, check to make
# sure it matches one of our existing codes
# above. if not, or no -s was passed, help.
my @sites; # which locales to spider.
if (exists $args{'s'}) {
    @sites = split(/ /, lc($args{'s'}));
    foreach my $site (@sites) {
        die "UNKNOWN: $site\n\n$help" unless $ys{$site};
    }
} else { die $help; }

# Defines global and local profiles for URLs extracted from the
# downloaded pages. These profiles are used to determine if the
# URLs extracted from each new document should be placed on the
# TODO list (%todo) or rejected (%rejects). Profiles are lists
# made of chunks of text, which are matched against found URLs.
# Any special characters, like slash (/) or dot (.) must be properly
# escaped. Remember that globals have precedence over locals.
my %rules = (
    global     => { allow => [], deny => [ 'search', '\*' ] },
    argentina  => { allow => [ 'http:\/\/ar\.' ],        deny => [] },
    asia       => { allow => [ 'http:\/\/(aa|asia)\.' ], deny => [] },
    australia  => { allow => [ 'http:\/\/au\.' ],        deny => [] },
    brazil     => { allow => [ 'http:\/\/br\.' ],        deny => [] },
    canada     => { allow => [ 'http:\/\/ca\.' ],        deny => [] },
    catalan    => { allow => [ 'http:\/\/ct\.' ],        deny => [] },
    china      => { allow => [ 'http:\/\/cn\.' ],        deny => [] },
    denmark    => { allow => [ 'http:\/\/dk\.' ],        deny => [] },
    france     => { allow => [ 'http:\/\/fr\.' ],        deny => [] },
    germany    => { allow => [ 'http:\/\/de\.' ],        deny => [] },
    hongkong   => { allow => [ 'http:\/\/hk\.' ],        deny => [] },
    india      => { allow => [ 'http:\/\/in\.' ],        deny => [] },
    ireland    => { allow => [ 'http:\/\/uk\.' ],        deny => [] },
    italy      => { allow => [ 'http:\/\/it\.' ],        deny => [] },
    japan      => { allow => [ 'yahoo\.co\.jp' ],        deny => [] },
    korea      => { allow => [ 'http:\/\/kr\.' ],        deny => [] },
    mexico     => { allow => [ 'http:\/\/mx\.' ],        deny => [] },
    norway     => { allow => [ 'http:\/\/no\.' ],        deny => [] },
    singapore  => { allow => [ 'http:\/\/sg\.' ],        deny => [] },
    spain      => { allow => [ 'http:\/\/es\.' ],        deny => [] },
    sweden     => { allow => [ 'http:\/\/se\.' ],        deny => [] },
    taiwan     => { allow => [ 'http:\/\/tw\.' ],        deny => [] },
    uk         => { allow => [ 'http:\/\/uk\.' ],        deny => [] },
    us         => { allow => [ 'http:\/\/(dir|www)\.' ], deny => [] },
    us_chinese => { allow => [ 'http:\/\/chinese\.' ],   deny => [] },
    us_spanish => { allow => [ 'http:\/\/espanol\.' ],   deny => [] },
);

my %todo    = (); # URLs to parse
my %done    = (); # parsed/finished URLs
my %errors  = (); # broken URLs with errors.
my %rejects = (); # URLs rejected by the script

# print out a "we're off!" line, then
# begin walking the site we've been told to.
print "=" x 80 . "\nStarted Yahoo! spider...\n" . "=" x 80 . "\n";
our $site;
foreach $site (@sites) {
    # for each of the sites that have been passed on the
    # command line, we make a title for them, add them to
    # the TODO list for downloading, then call walksite(),
    # which downloads the URL, looks for more URLs, etc.
    my $title = "Yahoo! " . ucfirst($site) . " front page";
    $todo{$ys{$site}} = $title;
    walksite(); # process.
}

# once we're all done with all the URLs, we print a
# report about all the information we've gone through.
print "=" x 80 . "\nURLs downloaded and parsed:\n" . "=" x 80 . "\n";
foreach my $url (keys %done) { print "$url => $done{$url}\n"; }
print "=" x 80 . "\nURLs that couldn't be downloaded:\n" . "=" x 80 . "\n";
foreach my $url (keys %errors) { print "$url => $errors{$url}\n"; }
print "=" x 80 . "\nURLs that got rejected:\n" . "=" x 80 . "\n";
foreach my $url (keys %rejects) { print "$url => $rejects{$url}\n"; }

# this routine grabs the first entry in our todo
# list, downloads the content, and looks for more URLs.
# we stay in walksite until there are no more URLs
# in our to do list, which could be a good long time.
sub walksite {
    do {
        # get first URL to do.
        my $url = (keys %todo)[0];

        # download this URL.
        print "-> trying $url...\n";
        my $browser = LWP::UserAgent->new;
        my $resp = $browser->get( $url, 'User-Agent' => 'Y!SpiderHack/1.0' );

        # check the results.
        if ($resp->is_success) {
            my $base = $resp->base || '';
            print "-> base URL: $base\n";
            my $data = $resp->content; # get the data.
            print "-> downloaded: " . length($data) . " bytes of $url\n";

            # find URLs using a link extractor. relevant ones
            # will be added to our to do list of downloadables.
            # this passes all the found links to findurls()
            # below, which determines if we should add the link
            # to our to do list, or ignore it due to filtering.
            HTML::LinkExtor->new(\&findurls, $base)->parse($data);

            ###########################################################
            # add your own processing here. perhaps you'd like to add #
            # a keyword search for the downloaded content in $data?   #
            ###########################################################
        } else {
            $errors{$url} = $resp->message();
            print "-> error: couldn't download URL: $url\n";
            delete $todo{$url};
        }

        # we're finished with this URL, so move it from
        # the to do list to the done list, and print a report.
        $done{$url} = $todo{$url};
        delete $todo{$url};
        print "-> processed legal URLs: " . (scalar keys %done) . "\n";
        print "-> remaining URLs: " . (scalar keys %todo) . "\n";
        print "-" x 80 . "\n";
    } until ((scalar keys %todo) == 0);
}

# callback routine for HTML::LinkExtor. For every
# link we find in our downloaded content, we check
# to see if we've processed it before, then run it
# through a bevy of regexp rules (see the top of
# this script) to see if it belongs in the to do.
sub findurls {
    my ($tag, %links) = @_;
    return if $tag ne 'a';
    return unless $links{href};
    print "-> found URL: $links{href}\n";

    # already seen this URL, so move on.
    if (exists $done{$links{href}} ||
        exists $errors{$links{href}} ||
        exists $rejects{$links{href}}) {
        print "--> I've seen this before: $links{href}\n";
        return;
    }

    # now, run through our filters.
    unless (exists($todo{$links{href}})) {
        my ($ga, $gd, $la, $ld); # counters.
        foreach (@{$rules{global}{'allow'}}) { $ga++ if $links{href} =~ /$_/i; }
        foreach (@{$rules{global}{'deny'}})  { $gd++ if $links{href} =~ /$_/i; }
        foreach (@{$rules{$site}{'allow'}})  { $la++ if $links{href} =~ /$_/i; }
        foreach (@{$rules{$site}{'deny'}})   { $ld++ if $links{href} =~ /$_/i; }

        # if there were denials or NO allowances, we move on.
        if ($gd or $ld)     { print "-> rejected URL: $links{href}\n"; return; }
        unless ($ga or $la) { print "-> rejected URL: $links{href}\n"; return; }

        # we passed our filters, so add it on the barby.
        print "-> added $links{href} to my TODO list\n";
        $todo{$links{href}} = $links{href};
    }
}
Before sending the spider off, you'll need to decide which parts of the Yahoo! directory you want to crawl. If you're mainly interested in the United States and United Kingdom, inform the spider of that with the -s option on the command line, like so:
% perl yspider.pl -s "us uk"
=================================================================
Started Yahoo! spider...
=================================================================
-> trying http://www.yahoo.com...
-> base URL: http://www.yahoo.com/
-> downloaded: 28376 bytes of http://www.yahoo.com
-> found URL: http://www.yahoo.com/s/92802
-> added http://www.yahoo.com/s/92802 to my TODO list
-> found URL: http://www.yahoo.com/s/92803
…etc…
-> added http://www.yahoo.com/r/pv to my TODO list
-> processed legal URLs: 1
-> remaining URLs: 244
-> trying http://www.yahoo.com/r/fr...
-> base URL: http://fr.yahoo.com/r/
-> downloaded: 32619 bytes of http://www.yahoo.com/r/fr
-> found URL: http://fr.yahoo.com/r/t/mu00
-> rejected URL: http://fr.yahoo.com/r/t/mu00
…
You can see a full list of locations available to you by asking for help:
% perl yspider.pl -h
…
Allowed values of -s are:
argentina, asia, australia, brazil, canada, catalan, china,
denmark, france, germany, hongkong, india, ireland, italy, japan, korea,
mexico, newzealand, norway, singapore, spain, sweden,
taiwan, uk, us, us_chinese, us_spanish
The section you’ll want to modify most contains the filters that determine how far the spider will go; by tweaking the allow and deny rules at the beginning of the script, you’ll be able to better grab just the content you’re interested in. If you want to make this spider even more generic, consider rewriting the configuration code so that it’ll instead read a plain-text list of code names, start URLs, and allow and deny patterns. This can turn a Yahoo! spider into a general Internet spider.
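As a rough sketch of that idea (the sites.conf filename, the whitespace-and-comma file format, and the load_config helper are inventions for illustration, not part of the script as printed), each line of a plain-text file could carry a code name, a start URL, and comma-separated allow and deny patterns:

# Hypothetical loader for a plain-text spider configuration.
# Each non-blank, non-comment line is assumed to hold four
# whitespace-separated fields; pattern lists are comma-separated,
# and a single '-' stands for an empty list, e.g.:
#
#   us  http://www.yahoo.com  http:\/\/(dir|www)\.  search,\*
#
sub load_config {
    my $file = shift;
    my (%starts, %rules);
    open(my $fh, '<', $file) or die "can't read $file: $!";
    while (<$fh>) {
        chomp;
        next if /^\s*$/ or /^\s*#/;   # skip blank lines and comments.
        my ($code, $url, $allow, $deny) = split ' ', $_, 4;
        $starts{$code}       = $url;
        $rules{$code}{allow} = [ $allow eq '-' ? () : split(/,/, $allow) ];
        $rules{$code}{deny}  = [ $deny  eq '-' ? () : split(/,/, $deny)  ];
    }
    close $fh;
    return (\%starts, \%rules);
}

# replace the hard-coded %ys and %rules with something like:
# my ($starts, $rules) = load_config('sites.conf');

The main loop would then iterate over the codes returned by the loader instead of the hard-coded %ys, and findurls() would consult the loaded rules; walksite() itself needs no changes.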
Whenever you want to add code that extends the functionality of this spider (such as searching for keywords in a document, adding the downloaded content to a database, or otherwise repurposing it for your needs), include your own logic where specified by the hashed-out comment block.
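For instance, a simple keyword search could be dropped into that block. The @keywords list and %matches hash below are hypothetical additions you would declare once near the top of the script; $data and $url are already in scope at that point in walksite():

# near the top of yspider.pl, alongside %todo and friends (hypothetical):
my @keywords = ('concerts', 'tickets');   # terms you care about.
my %matches;                              # keyword => list of matching URLs.

# ...and inside the hashed-out comment block in walksite():
foreach my $word (@keywords) {
    if ($data =~ /\Q$word\E/i) {          # case-insensitive literal match.
        push @{ $matches{$word} }, $url;
        print "-> keyword match: '$word' in $url\n";
    }
}

You could then print the contents of %matches alongside the existing end-of-run reports.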
If you’re spidering Yahoo! because you want to start your own directory, you might want to consider the Open Directory Project (http://dmoz.org/about.html) instead. Downloading the project’s freely available directory data, all several hundred megabytes of it, will give you plenty of information to play with.