Chapter 4. Gleaning Data from Databases

Hacks #43-89

In Chapter 3, you learned techniques for collecting media files. Now you’re going to take those lessons and move in a slightly different direction: gleaning data from databases and information collections.

Information collections can be as large and multifaceted as Google’s index of the World Wide Web, or as narrow and precise as King County health collections. You can scrape information as general as archives from Yahoo! Groups, or as targeted as game prices from GameStop.com. In this chapter, we’ll look at a variety of ways you can access database information, a variety of sources you might want to try (and a few hints for thinking of your own!), and ways that you can combine the power of programming with already-existing web APIs to make new and powerful tools.

Hack #43. Archiving Yahoo! Groups Messages with yahoo2mbox

Looking to keep a local archive of your favorite mailing list? With yahoo2mbox, you can import the final results into your favorite mailer.

With the popularity of Yahoo! Groups (http://groups.yahoo.com/) comes a problem. Sometimes, you want to save the archives of a Yahoo! Group, but you want to be able to access it outside the Yahoo! Groups site. Or you want to move your list somewhere else and be able to take your existing archive with you.

Vadim Zeitlin had these same concerns, which is why he wrote yahoo2mbox (http://www.lpthe.jussieu.fr/~zeitlin/yahoo2mbox.html). This hack retrieves all the messages from a mailing list archive at Yahoo! Groups and saves them to a local file in mbox format. Plenty of options make this handy to have when you’re trying to transfer information from Yahoo! Groups.

As of this writing, the program is still fairly new, so be sure to visit its URL (cited in the previous paragraph) to download the latest version. Note that you’ll need Perl and several additional modules to run this code, including Getopt::Long, HTML::Entities, HTML::HeadParser, HTML::TokeParser, and LWP::UserAgent.
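
If any of these modules are missing, you can usually install them from the CPAN shell [Hack #8]; for example, to grab HTML::TokeParser:

% perl -MCPAN -e 'install HTML::TokeParser'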

Running the Hack

Running the code looks like this:

perl yahoo2mbox.pl [options] [-o <mbox>] <groupname>

The options for running the program are as follows:

--help          give the usage message showing the program options
--version       show the program version and exit
--verbose       give verbose informational messages (default)
--quiet         be silent, only error messages are given
-o mbox         save the message to mbox instead of file named groupname
--start=n       start retrieving messages at index n instead of 1
--end=n         stop retrieving messages at index n instead of the last one
--noresume      don't resume, overwrite the existing output file if any
--user=name     login to eGroups using this username (default: guest login)
--pass=pass     the password to use for login (default: none)
--cookies=xxx   file to use to store cookies (default: none,
                'netscape' uses netscape cookies file).
--proxy=url     use the given proxy; if 'no', don't use proxy 
                at all (not even the environment variable http_proxy, 
                which is used by default), may use http://username:password\
                @full.host.name/ notation
--country=xx    use the given country code to access localized yahoo

So, this command downloads messages from Weird Al Club, starting at message 3258:

% perl yahoo2mbox.pl --start=3258 weirdalclub2
Logging in anonymously... ok.
Getting number of messages in group weirdalclub2...
Retrieving messages 3258..3287: .............................. done!
Saved 30 message(s) in weirdalclub2.

Here, the messages are saved to a file called weirdalclub2. Renaming the file weirdalclub2.mbx means that you can immediately open the messages in Eudora, as shown in Figure 4-1. Of course, you can also open the resulting files in any mail program that can import (or natively read) the mbox format.

Figure 4-1. A Yahoo! Groups archive in Eudora

Hacking the Hack

Because this is someone else’s program, there’s not too much hacking to be done. On the other hand, you might find that you don’t want to end this process with the mbox file; you might want to convert to other formats for use in other projects or archives. In that case, check out these other programs to take that mbox format a little further:

hypermail (http://sourceforge.net/projects/hypermail/)

Converts mbox format to cross-referenced HTML documents.

mb2md (http://www.gerg.ca/hacks/mb2md/)

Converts mbox format to Maildir. Requires Python and Procmail.

Mb2md.pl (http://batleth.sapienti-sat.org/projects/mb2md/)

Converts mbox format to Maildir. Uses Perl.
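
If you’d like to poke at the archive yourself before converting it, mbox is a simple format: each message begins with a line starting with "From " (note the trailing space). Here’s a minimal sketch (not part of yahoo2mbox) that counts the messages in a downloaded archive; the default filename is just the example group from earlier:

#!/usr/bin/perl -w
# countmbox.pl -- count the messages in an mbox file.
use strict;

my $mbox = shift || 'weirdalclub2';
open my $fh, '<', $mbox or die "Can't open $mbox: $!\n";

# every message in an mbox starts with a "From " separator line.
my $count = 0;
while (<$fh>) { $count++ if /^From /; }
close $fh;

print "$mbox contains $count message(s).\n";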

Hack #44. Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups

Yahoo! Groups makes it easy to run an email discussion group at no cost. Sadly, there’s no simple way to download all the messages—until now.

If you’ve ever wanted to run an email discussion group, but you didn’t want to mess around with getting your own server and administering your own software, you should consider looking into Yahoo! Groups (http://groups.yahoo.com/). The free (ad-supported) service makes it easy to run a mailing list, and if you or any other group moderator has set a list to support archiving of messages, a handy web interface to browse them is provided. Sadly, the service provides no simple way to download all the messages in one fell swoop, and nobody wants to click and Save As . . . on hundreds or thousands of links.

Iain Truskett of Canberra, Australia, wanted to keep an offline archive of his Yahoo! Groups mailing lists, so he created the WWW::Yahoo::Groups module, available on CPAN (http://search.cpan.org/dist/WWW-Yahoo-Groups/). It uses WWW::Mechanize to log into Yahoo! Groups, get a count of the messages, and download any given message by number. It even bypasses the pop-up ads and interstitial interruptions!

The Code

You’ll need the WWW::Yahoo::Groups Perl module installed to use this script. The module requires a number of other modules, but installing from the CPAN shell [Hack #8] should take care of the installation of these prerequisites for you.

Save the following code to a file called yahoogroups.pl:

#!/usr/bin/perl -w

use constant USERNAME => 'your username';
use constant PASSWORD => 'your password';

use strict;
use File::Path;
use Getopt::Long;
use WWW::Yahoo::Groups;
$SIG{PIPE} = 'IGNORE';

# define the command-line options, and 
# ensure that a group has been passed.
my ($debug, $group, $last, $first, $stats);
GetOptions(
    "debug"     => \$debug,
    "group=s"   => \$group,
    "stats"     => \$stats,
    "first=i"   => \$first,
    "last=i"    => \$last,
); (defined $group) or die "Must specify a group!\n";

# sign into Yahoo! Groups.
my $w = WWW::Yahoo::Groups->new(  );
$w->debug( $debug );
$w->login( USERNAME, PASSWORD );
$w->list( $group );
$w->agent->requests_redirectable( [] ); # no redirects now

# first and last IDs of group.
my $first_id = $w->first_msg_id(  );
my $last_id = $w->last_msg_id(  );
print "Messages in $group: $first_id to $last_id\n";
exit 0 if $stats; # they just wanted numbers.

# default our IDs to the first and last
# of the $group in question, else use the
# passed command-line options.
$first = $first_id unless $first;
$last  = $last_id  unless $last;
warn "Fetching $first to $last\n";

# get our specified messages.
for my $msgnum ($first..$last) {
    fetch_message( $w, $msgnum );
}

sub fetch_message {
    my $w = shift;
    my $msgnum = shift;

    # Put messages in directories by 100.
    my $dirname = int($msgnum/100)*100;

    # Create the dir if necessary.
    my $dir = "$group/$dirname";
    mkpath( $dir ) unless -d $dir;

    # Don't pull down the message
    # if we already have it...
    my $filename = "$dir/$msgnum";
    return if -f $filename;

    # pull down the content and check for errors.
    my $content = eval { $w->fetch_message($msgnum) };
    if ( $@ ) {
        if ( $@->isa('X::WWW::Yahoo::Groups') ) {
            warn "Could not handle message $msgnum: ",$@->error,"\n";
        } else { warn "Could not get content for message $msgnum\n"; }
    } else {
        open(FH, ">$filename") 
          or return warn "Can't create $filename: $!\n";
        print FH $content; close FH; # data has been saved.
        $w->autosleep( 5 ); # so now sleep to prevent saturation.
    }
}

Running the Hack

Before you can use the script, you’ll need to have a Yahoo! Groups account (http://edit.yahoo.com/config/eval_register) and be subscribed to at least one list that has web archives. Remember that we’re merely automating the web transactions, not getting at some secret backdoor into Yahoo! Groups. Also, modify the lines at the top of the script that set the USERNAME and PASSWORD constants. If these aren’t set, the script can’t log in as you and, consequently, you might not have access to the group’s messages.

First, find out how many messages there are. In this case, let’s check out milwpm, the discussion list for the Milwaukee Perl Mongers:

% perl yahoogroups.pl --group=milwpm --stats
Messages in milwpm: 1 to 721

Now, take a look at the last five messages in the archive:

% perl yahoogroups.pl --group=milwpm --first=717
Messages in milwpm: 1 to 721
Fetching 717 to 721

Behind the scenes, the script has created a directory called milwpm and, within that, a directory called 700 for holding all messages between 700 and 799. Each message gets its own file.

% ls -al milwpm/700
-rw-r--r--    1 andy     staff        2814 Jul 16 23:04 700
-rw-r--r--    1 andy     staff        4005 Jul 16 23:05 717
-rw-r--r--    1 andy     staff        1511 Jul 16 23:05 718
-rw-r--r--    1 andy     staff        5576 Jul 16 23:05 719
-rw-r--r--    1 andy     staff        5862 Jul 16 23:05 720
-rw-r--r--    1 andy     staff        6632 Jul 16 23:05 721

If you want to look at the starting few messages, use the --last parameter. You can also use the --debug parameter to get running notes of what the script is doing:

% perl yahoogroups.pl --group=milwpm --last=5 --debug
Fetching http://groups.yahoo.com/
Fetching http://login.yahoo.com/config/login?.intl=us&.src=ygrp&....
Fetching http://groups.yahoo.com/group/milwpm/messages/1
Messages in milwpm: 1 to 721
Fetching 1 to 5
Fetching http://groups.yahoo.com/group/milwpm/message/1?source=1&unwrap=1
Fetching http://groups.yahoo.com/group/milwpm/message/2?source=1&unwrap=1
Fetching http://groups.yahoo.com/group/milwpm/message/3?source=1&unwrap=1
Fetching http://groups.yahoo.com/group/milwpm/message/4?source=1&unwrap=1
Fetching http://groups.yahoo.com/group/milwpm/interrupt?st=2&m=1&done=%2...
Fetching /group/milwpm/message/4?source=1&unwrap=1
Fetching http://groups.yahoo.com/group/milwpm/message/5?source=1&unwrap=1

Hacking the Hack

You can easily extend this hack to manipulate the data before it gets saved to the file. The messages that are returned are in standard Internet mail format, so you can extract just the headers you want, such as To:, From:, and Subject:. The MailTools (http://search.cpan.org/dist/MailTools/) distribution has a number of modules that will help.
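
For instance, here’s a small sketch, purely for illustration, that uses Mail::Internet from MailTools to print a few headers from one of the saved message files (the filename argument is whatever file the main script wrote):

#!/usr/bin/perl -w
# headers.pl -- print a few headers from one saved message file.
use strict;
use Mail::Internet;

my $file = shift or die "usage: $0 <message file>\n";
open my $fh, '<', $file or die "Can't open $file: $!\n";
my @lines = <$fh>; close $fh;

# parse the message and pull out the headers we care about.
my $mail = Mail::Internet->new(\@lines);
foreach my $header (qw(From To Subject Date)) {
    my $value = $mail->head->get($header);
    print "$header: $value" if defined $value;
}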

As a quick example, sans MailTools, let’s say you want to see the most active threads from the messages you’re downloading. This is a rather simple modification to make. Add a hash for our new information just before the fetch_message subroutine:

# Keep track of popular subjects.
my %subjects;

sub fetch_message {
    my $w = shift;

Then, add the tracking code for each subject line:

        } else { warn "Could not get content for message $msgnum\n"; }
    } else {

        # and add one to our subject line counter.
        $content =~ /Subject: (.*)/ig; $subjects{$1}++ if $1;

        open(FH, ">$filename") 
          or return warn "Can't create $filename: $!\n";

Finally, at the end of the script, display the stats:

# now, print our totals.
my @sorted = sort { $subjects{$b} <=> $subjects{$a} } keys %subjects;
foreach (@sorted) { print "$subjects{$_}: $_\n"; }

This code can easily be tweaked to save only messages from certain authors—local copies of your own postings, for instance—or subject lines associated with especially thoughtful or useful threads.
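
For example, to keep only messages from a particular author, add one more check just before the message is written to disk (the address below is a placeholder):

        # only save messages from one particular author (placeholder address).
        return unless $content =~ /^From:.*someone\@example\.com/mi;

        open(FH, ">$filename") 
          or return warn "Can't create $filename: $!\n";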

Yahoo! Groups also has search capabilities that you can take advantage of with WWW::Mechanize. See “Downloading Images from Webshots” [Hack #36] for an example of searching web sites with WWW::Mechanize.

—Andy Lester

Hack #45. Gleaning Buzz from Yahoo!

Stay hip with the latest Yahoo! Buzz search results.

Google has a Zeitgeist page (http://www.google.com/press/zeitgeist.html) that gives you an idea of what people are searching for, but unfortunately it’s not updated very often; some parts are updated once a week, while other parts are updated only once a month. Meanwhile, Yahoo! has a Yahoo! Buzz site (http://buzz.yahoo.com/) that contains much more annotated information about what people are searching for.

We thought it would be fun to take a Buzz item from the Yahoo! Buzz site (specifically, http://buzz.yahoo.com/overall/) and then use it to initiate a search on Google. This hack is part scraping—the Yahoo! Buzz side—and part use of a web API—the Google side. As you’ll see, the two work very well together.

The Code

You’ll need a Google API developer’s key (http://api.google.com/) and a lesser-known Perl module (Time::JulianDay) to get this hack to work. Save the following code to a file called ybgoogled.pl:

#!/usr/bin/perl -w
# ybgoogled.pl
# Pull the top item from the Yahoo Buzz Index and query
# the last three day's worth of Google's index for it.
# Usage: perl ybgoogled.pl
use strict;
use SOAP::Lite;
use LWP::Simple;
use Time::JulianDay;

# Your Google API developer's key.
my $google_key='insert key here';

# Location of the GoogleSearch WSDL file.
my $google_wdsl = "./GoogleSearch.wsdl";

# Number of days back to
# go in the Google index.
my $days_back = 3;

# Grab a copy of http://buzz.yahoo.com.
my $buzz_content = get("http://buzz.yahoo.com/overall/") 
  or die "Couldn't grab the Yahoo Buzz: $!";

# Find the first item on the Buzz Index list.
$buzz_content =~ m!<b>1</b>.*?&cs=bz"><b>(.*?)</b></a>&nbsp;</font>!;
my $buzziest = $1; # assign our match as our search term.
die "Couldn't figure out the Yahoo! buzz\n" unless $buzziest;

# Figure out today's Julian date.
my $today = int local_julian_day(time);

# Build the Google query and say hi.
my $query = "\"$buzziest\" daterange:" . ($today - $days_back) . "-$today"; 
print "The buzziest item on Yahoo Buzz today is: $buzziest\n",
      "Querying Google for: $query\n", "Results:\n\n";

# Create a new SOAP::Lite instance, feeding it GoogleSearch.wsdl.
my $google_search = SOAP::Lite->service("file:$google_wdsl");

# Query Google.
my $results = $google_search->doGoogleSearch( 
                  $google_key, $query, 0, 10, "false",
                  "",  "false", "", "", ""
              );

# No results?
die "No results" unless @{$results->{resultElements}};

# Loop through the results.
foreach my $result (@{$results->{'resultElements'}}) {
    my $output = join "\n", $result->{title} || "no title",
                 $result->{URL}, $result->{snippet} || 'none',"\n";
    $output =~ s!<.+?>!!g; # drop all HTML tags sloppily.
    print $output; # woo, we're done!
}

This code works only as long as Yahoo! formats its Buzz page in the same way; we’ve had to change it multiple times. If you try this program and it doesn’t work, pull out this line:

$buzz_content =~ m!<b>1</b>.*?&cs=bz"><b>(.*?)</b></a>&nbsp;</font>!;

Take a look at the code pulled out by the variable $buzziest and see if it matches any code in the source code at http://buzz.yahoo.com/overall/. If it doesn’t, the code’s changed. Go to the HTML source view and find the first item on the Buzz list. Look at the source, find that first Buzz listing, and pull the code from around it. You want to pull enough code to get a unique line, but not so much that you can’t read it.

Running the Hack

Run this script from the command line, like so:

% perl ybgoogled.pl

The buzziest item on Yahoo Buzz today is: Gregory Hines
Querying Google for: "Gregory Hines" daterange:2452861-2452864
Results:

 Celebrities @ Hollywood.com-Featuring Gregory Hines. Celebrities ... 
 http://www.hollywood.com/celebs/detail/celeb/191902
 Gregory Hines Vital Stats: Born: February 14, 1946 Birth Place: New York,
 New York 

 Gregory Hines
 http://www.rottentomatoes.com/p/GregoryHines-1007016/
  ... Gregory Hines. CELEB QUIK BROWSER &gt; Select A Celebrity. ...

...

Hacking the Hack

As it stands, this hack returns 10 results. If you want to, you can change the code to return only one result and immediately open it instead of returning a list. This version of the program searches the last three days of indexed pages. Because there’s a slight lag in indexing news stories, I would index at least the last two days’ worth of pages, but you could extend it to seven days or even a month.
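
For instance, assuming you’re running ybgoogled.pl as a CGI script, you could replace everything from the doGoogleSearch call down (and drop the earlier print statements) with a sketch like this, which asks Google for a single result and redirects straight to it:

# ask Google for just one result...
my $results = $google_search->doGoogleSearch( 
                  $google_key, $query, 0, 1, "false",
                  "",  "false", "", "", ""
              );
die "No results" unless @{$results->{resultElements}};

# ...and bounce the browser straight to it.
my $top = $results->{resultElements}[0];
print "Location: $top->{URL}\n\n";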

If you want to abandon Google entirely, you can. Instead, you might want to go to Daypop (http://www.daypop.com), which also has a news search. Here’s a version of the script using the top item on Daypop:

#!/usr/bin/perl -w
# ybdaypopped
# Pull the top item from the Yahoo! Buzz Index and query 
# Daypop's News search engine for relevant stories
use strict;
use LWP::Simple;

# Grab a copy of http://buzz.yahoo.com.
my $buzz_content = get("http://buzz.yahoo.com/") 
  or die "Couldn't grab the Yahoo Buzz: $!";

# Find the first item on the Buzz Index list.
$buzz_content =~ m!<b>1</b>.*?&cs=bz"><b>(.*?)</b></a>&nbsp;</font>!;
my $buzziest = $1; # assign our match as our search term.
die "Couldn't figure out the Yahoo! buzz\n" unless $buzziest;

# Build a Daypop Query.
my $dpquery = "http://www.daypop.com/search?q=$buzziest&t=n"; 
print "Location: $dpquery\n\n";

This version of the program takes the first Buzz item from Yahoo! and opens a Daypop news search for that item (assuming you run this as a CGI script). But hey, maybe we should use that RSS format [Hack #94] all the kids are talking about. In that case, just put &o=rss at the end of $dpquery:

my $dpquery = "http://www.daypop.com/search?q=$buzziest&t=n&o=rss";

Now you’re using Yahoo! Buzz to generate an RSS file with Daypop. From there, you can scrape the RSS file, pass this URL to a routine that puts an RSS file up on a web page [Hack #95], and so on.
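
For instance, once $dpquery points at the RSS version, you could replace the Location: line with a rough sketch like this; it assumes each Daypop item keeps its title and link in plain <title> and <link> elements, as well-formed RSS does:

# fetch the RSS results and list each item's title and link.
my $rss = get($dpquery) or die "Couldn't grab the Daypop RSS: $!";
while ($rss =~ m!<item>(.*?)</item>!gis) {
    my $item = $1;
    my ($title) = $item =~ m!<title>(.*?)</title>!is;
    my ($link)  = $item =~ m!<link>(.*?)</link>!is;
    print "$title\n  $link\n" if $title and $link;
}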

—Tara Calishain and Rael Dornfest

Hack #46. Spidering the Yahoo! Catalog

Writing a spider to spider an existing spider’s site may seem convoluted, but it can prove useful when you’re looking for location-based services. This hack walks through creating a framework for full-site spidering, including additional filters to lessen your load.

In this hack, you’ll learn how to write a spider that crawls the Yahoo! group of portals. The choice of Yahoo! was obvious; because it is one of the largest Internet portals in existence, it can serve as an ideal example of how one goes about writing a portal spider.

But before we get to the gory details of code, let’s define what exactly a portal spider is. While many may argue with such classification, I maintain that a portal spider is a script that automatically downloads all documents from a preselected range of URLs found on the portal’s site or a group of sites, as is the case with Yahoo!. A portal spider’s main job is to walk from one document to another, extract URLs from downloaded HTML, process said URLs, and go to another document, repeating the cycle until it runs out of URLs to visit. Once you create code that describes such basic behavior, you can add additional functionality, turning your general portal spider into a specialized one.

Although writing a script that walks from one Yahoo! page to another sounds simple, it isn’t, because there is no general pattern followed by all Yahoo! sites or sections within those sites. Furthermore, Yahoo! is not a single site with a nice link layout that can be described using a simple algorithm and a classic data structure. Instead, it is a collection of well over 30 thematic sites, each with its own document layout, naming conventions, and peculiarities in page design and URL patterns. For example, if you check links to the same directory section on different Yahoo! sites, you will find that some of them begin with http://www.yahoo.com/r, some begin with http://uk.yahoo.com/r/hp/dr, and others begin with http://kr.yahoo.com.

If you try to look for patterns, you will soon find yourself writing long if/elsif/else sections that are hard to maintain and need to be rewritten every time Yahoo! makes a small change to one of its sites. If you follow that route, you will soon discover that you need to write hundreds of lines of code to describe every kind of behavior you want to build into your spider.

This is particularly frustrating to programmers who expect to write code that uses elegant algorithms and nicely structured data. The hard truth about portals is that you cannot expect elegance and ease of spidering from them. Instead, prepare yourself for a lot of detective work and writing (and throwing away) chunks of code in a hit-and-miss fashion. Portal spiders are written in an organic, unstructured way, and the only rule you should follow is to keep things simple and add specific functionality only once you have the general behavior working.

Okay, with taxonomy and general advice behind us, we can get to the gist of the matter. The spider in this hack is a relatively simple tool for crawling Yahoo! sites. It makes no assumptions about the layout of the sites; in fact, it makes almost no assumptions whatsoever and can easily be adapted to other portals or even groups of portals. You can use it as a framework for writing specialized spiders.

The Code

Save the following code to a file called yspider.pl:

#!/usr/bin/perl -w
#
# yspider.pl
#
# Yahoo! Spider - crawls Yahoo! sites, collects links from each 
# downloaded HTML page, searches each downloaded page, and prints a
# list of results when done.
# http://www.artymiak.com/software/ or contact jacek@artymiak.com
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.

use strict;
use Getopt::Std;            # parse command-line options.
use LWP::UserAgent;         # download data from the Net.
use HTML::LinkExtor;        # get links inside an HTML document.
use URI::URL;               # turn relative links into absolutes.

my $help = <<"EOH";
----------------------------------------------------------------------------
Yahoo! Spider.

Options: -s    list of sites you want to crawl,
               e.g. -s 'us china denmark'
         -h    print this help

Allowed values of -s are:

   argentina, asia, australia, brazil, canada,
   catalan, china, denmark, france, germany, hongkong,
   india, ireland, italy, japan, korea, mexico,
   newzealand, norway, singapore, spain, sweden, taiwan,
   uk, us, us_chinese, us_spanish 

Please, use this code responsibly.  Flooding any site
with excessive queries is bad net citizenship.
----------------------------------------------------------------------------
EOH

# define our arguments and
# show the help if asked.
my %args; getopts("s:h", \%args); 
die $help if exists $args{h};

# The list of code names, and
# URLs for various Yahoo! sites.
my %ys = (
   argentina => "http://ar.yahoo.com", asia => "http://asia.yahoo.com",
   australia => "http://au.yahoo.com", newzealand => "http://au.yahoo.com",
   brazil    => "http://br.yahoo.com", canada   => "http://ca.yahoo.com",
   catalan   => "http://ct.yahoo.com", china    => "http://cn.yahoo.com",
   denmark   => "http://dk.yahoo.com", france   => "http://fr.yahoo.com",
   germany   => "http://de.yahoo.com", hongkong => "http://hk.yahoo.com",
   india     => "http://in.yahoo.com", italy    => "http://it.yahoo.com",
   korea     => "http://kr.yahoo.com", mexico   => "http://mx.yahoo.com",
   norway    => "http://no.yahoo.com", singapore => "http://sg.yahoo.com",
   spain     => "http://es.yahoo.com", sweden   => "http://se.yahoo.com",
   taiwan    => "http://tw.yahoo.com", uk       => "http://uk.yahoo.com",
   ireland  => "http://uk.yahoo.com",  us       => "http://www.yahoo.com",
   japan    => "http://www.yahoo.co.jp",
   us_chinese => "http://chinese.yahoo.com",
   us_spanish => "http://espanol.yahoo.com"
);

# if the -s option was used, check to make
# sure it matches one of our existing codes
# above. if not, or if no -s was passed, help.
my @sites; # which locales to spider.
if (exists $args{'s'}) {
    @sites = split(/ /, lc($args{'s'}));
    foreach my $site (@sites) {
        die "UNKNOWN: $site\n\n$help" unless $ys{$site};
    }
} else { die $help; }

# Defines global and local profiles for URLs extracted from the
# downloaded pages. These profiles are used to determine if the
# URLs extracted from each new document should be placed on the
# TODO list (%todo) or rejected (%rejects). Profiles are lists
# made of chunks of text, which are matched against found URLs.
# Any special characters, like slash (/) or dot (.), must be properly
# escaped. Remember that globals have precedence over locals. 
my %rules = (
   global     => { allow => [], deny => [ 'search', '\*' ] },
   argentina  => { allow => [ 'http:\/\/ar\.' ], deny => [] },
   asia       => { allow => [ 'http:\/\/(aa|asia)\.' ], deny => [] },
   australia  => { allow => [ 'http:\/\/au\.' ], deny => [] },
   brazil     => { allow => [ 'http:\/\/br\.' ], deny => [] },
   canada     => { allow => [ 'http:\/\/ca\.' ], deny => [] },
   catalan    => { allow => [ 'http:\/\/ct\.' ], deny => [] },
   china      => { allow => [ 'http:\/\/cn\.' ], deny => [] },
   denmark    => { allow => [ 'http:\/\/dk\.' ], deny => [] },
   france     => { allow => [ 'http:\/\/fr\.' ], deny => [] },
   germany    => { allow => [ 'http:\/\/de\.' ], deny => [] },
   hongkong   => { allow => [ 'http:\/\/hk\.' ], deny => [] },
   india      => { allow => [ 'http:\/\/in\.' ], deny => [] },
   ireland    => { allow => [ 'http:\/\/uk\.' ], deny => [] },
   italy      => { allow => [ 'http:\/\/it\.' ], deny => [] },
   japan      => { allow => [ 'yahoo\.co\.jp' ], deny => [] },
   korea      => { allow => [ 'http:\/\/kr\.' ], deny => [] },
   mexico     => { allow => [ 'http:\/\/mx\.' ], deny => [] },
   norway     => { allow => [ 'http:\/\/no\.' ], deny => [] },
   singapore  => { allow => [ 'http:\/\/sg\.' ], deny => [] },
   spain      => { allow => [ 'http:\/\/es\.' ], deny => [] },
   sweden     => { allow => [ 'http:\/\/se\.' ], deny => [] },
   taiwan     => { allow => [ 'http:\/\/tw\.' ], deny => [] },
   uk         => { allow => [ 'http:\/\/uk\.' ], deny => [] },
   us         => { allow => [ 'http:\/\/(dir|www)\.' ], deny => [] },
   us_chinese => { allow => [ 'http:\/\/chinese\.' ], deny => [] },
   us_spanish => { allow => [ 'http:\/\/espanol\.' ], deny => [] },
);

my %todo = (  );       # URLs to parse
my %done = (  );       # parsed/finished URLs
my %errors = (  );     # broken URLs with errors
my %rejects = (  );    # URLs rejected by the script

# print out a "we're off!" line, then
# begin walking the site we've been told to.
print "=" x 80 . "\nStarted Yahoo! spider...\n" . "=" x 80 . "\n";
our $site; foreach $site (@sites) {

    # for each of the sites that have been passed on the
    # command line, we make a title for them, add them to
    # the TODO list for downloading, then call walksite(  ),
    # which downloads the URL, looks for more URLs, etc.
    my $title = "Yahoo! " . ucfirst($site) . " front page";
    $todo{$ys{$site}} = $title; walksite(  ); # process.

}

# once we're all done with all the URLs, we print a
# report about all the information we've gone through.
print "=" x 80 . "\nURLs downloaded and parsed:\n" . "=" x 80 . "\n";
foreach my $url (keys %done) { print "$url => $done{$url}\n"; }
print "=" x 80 . "\nURLs that couldn't be downloaded:\n" . "=" x 80 . "\n";
foreach my $url (keys %errors) { print "$url => $errors{$url}\n"; }
print "=" x 80 . "\nURLs that got rejected:\n" . "=" x 80 . "\n";
foreach my $url (keys %rejects) { print "$url => $rejects{$url}\n"; }

# this routine grabs the first entry in our TODO
# list, downloads the content, and looks for more URLs.
# we stay in walksite until there are no more URLs
# in our TODO list, which could be a good long time.
sub walksite {

    do {
        # get first URL to do.
        my $url = (keys %todo)[0];

        # download this URL.
        print "-> trying $url ...\n";
        my $browser = LWP::UserAgent->new;
        my $resp = $browser->get( $url, 'User-Agent' => 'Y!SpiderHack/1.0' );

        # check the results.
        if ($resp->is_success) {
            my $base = $resp->base || '';
            print "-> base URL: $base\n";
            my $data = $resp->content; # get the data.
            print "-> downloaded: " . length($data) . " bytes of $url\n";

            # find URLs using a link extorter. relevant ones
            # will be added to our TODO list of downloadables.
            # this passes all the found links to findurls(  )
            # below, which determines if we should add the link
            # to our TODO list, or ignore it due to filtering.
            HTML::LinkExtor->new(\&findurls, $base)->parse($data);

            ###########################################################
            # add your own processing here. perhaps you'd like to add #
            # a keyword search for the downloaded content in $data?   #
            ###########################################################

        } else {
            $errors{$url} = $resp->message(  );
            print "-> error: couldn't download URL: $url\n";
            delete $todo{$url};
        }

        # we're finished with this URL, so move it from
        # the TODO list to the done list, and print a report.
        $done{$url} = $todo{$url}; delete $todo{$url};
        print "-> processed legal URLs: " . (scalar keys %done) . "\n";
        print "-> remaining URLs: " . (scalar keys %todo) . "\n";
        print "-" x 80 . "\n";
    } until ((scalar keys %todo) == 0);
}

# callback routine for HTML::LinkExtor. For every
# link we find in our downloaded content, we check
# to see if we've processed it before, then run it
# through a bevy of regexp rules (see the top of
# this script) to see if it belongs in the TODO.
sub findurls {
    my($tag, %links) = @_;
    return if $tag ne 'a';
    return unless $links{href};
    print "-> found URL: $links{href}\n";

    # already seen this URL, so move on.
    if (exists $done{$links{href}} ||
        exists $errors{$links{href}} || 
        exists $rejects{$links{href}}) {
        print "--> I've seen this before: $links{href}\n"; return;
    }

    # now, run through our filters.
    unless (exists($todo{$links{href}})) {
        my ($ga, $gd, $la, $ld); # counters.
        foreach (@{$rules{global}{'allow'}}) { 
            $ga++ if $links{href} =~ /$_/i; 
        }
        foreach (@{$rules{global}{'deny'}}) { 
            $gd++ if $links{href} =~ /$_/i; 
        }
        foreach (@{$rules{$site}{'allow'}}) { 
            $la++ if $links{href} =~ /$_/i; 
        }
        foreach (@{$rules{$site}{'deny'}}) { 
            $ld++ if $links{href} =~ /$_/i; 
        }

        # if there were denials or NO allowances, we move on.
        if ($gd or $ld) { print "-> rejected: $links{href}\n"; return; }
        unless ($ga or $la) { print "-> rejected: $links{href}\n"; return; }

        # we passed our filters, so add it on the barby.
        print "-> added $links{href} to my TODO list\n";
        $todo{$links{href}} = $links{href};
    }
}

Running the Hack

Before sending the spider off, you’ll need to make a decision regarding which part of the Yahoo! directory you want to crawl. If you’re mainly interested in the United States and United Kingdom, you’ll inform the spider using the -s option on the command line, like so:

% perl yspider.pl -s "us uk"
============================================================================
Started Yahoo! spider...
============================================================================
-> trying http://www.yahoo.com ...
-> base URL: http://www.yahoo.com/
-> downloaded: 28376 bytes of http://www.yahoo.com
-> found URL: http://www.yahoo.com/s/92802
-> added http://www.yahoo.com/s/92802 to my TODO list
-> found URL: http://www.yahoo.com/s/92803
... etc ...
-> added http://www.yahoo.com/r/pv to my TODO list
-> processed legal URLs: 1
-> remaining URLs: 244
----------------------------------------------------------------------------
-> trying http://www.yahoo.com/r/fr ...
-> base URL: http://fr.yahoo.com/r/
-> downloaded: 32619 bytes of http://www.yahoo.com/r/fr
-> found URL: http://fr.yahoo.com/r/t/mu00
-> rejected: http://fr.yahoo.com/r/t/mu00
...

You can see a full list of locations available to you by asking for help:

% perl yspider.pl -h

...
Allowed values of -s are:

   argentina, asia, australia, brazil, canada,
   catalan, china, denmark, france, germany, hongkong,
   india, ireland, italy, japan, korea, mexico,
   newzealand, norway, singapore, spain, sweden, taiwan,
   uk, us, us_chinese, us_spanish

Hacking the Hack

The section you’ll want to modify most contains the filters that determine how far the spider will go; by tweaking the allow and deny rules at the beginning of the script, you’ll be able to better grab just the content you’re interested in. If you want to make this spider even more generic, consider rewriting the configuration code so that it’ll instead read a plain-text list of code names, start URLs, and allow and deny patterns. This can turn a Yahoo! spider into a general Internet spider.
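
Here’s one way you might start (a sketch only; the sites.conf filename and its whitespace-separated layout are inventions for illustration). Each line holds a code name, a start URL, and an allow pattern, and the script builds %ys and %rules from the file at startup instead of hardcoding them:

# build %ys and %rules from a plain-text file instead of hardcoding them.
# each line of sites.conf: <code name> <start URL> <allow regexp>
my (%ys, %rules);
$rules{global} = { allow => [], deny => [ 'search', '\*' ] };
open my $conf, '<', 'sites.conf' or die "Can't open sites.conf: $!\n";
while (<$conf>) {
    next if /^\s*#/ or /^\s*$/;   # skip comments and blank lines.
    my ($name, $url, $allow) = split;
    $ys{$name}    = $url;
    $rules{$name} = { allow => [ $allow ], deny => [] };
}
close $conf;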

Whenever you want to add code that extends the functionality of this spider (such as searching for keywords in a document, adding the downloaded content to a database, or otherwise repurposing it for your needs), include your own logic where specified by the hashed-out comment block.
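
As an example, here’s a possible drop-in for that comment block; it simply counts how many times a keyword turns up in each downloaded page (the keyword itself is just a stand-in; swap in whatever you’re hunting for):

# count how often a keyword appears in the page we just downloaded.
my $keyword = 'perl';   # placeholder keyword.
my $hits = (  ) = $data =~ /\Q$keyword\E/gi;
print "-> '$keyword' appears $hits time(s) in $url\n" if $hits;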

See Also

If you’re spidering Yahoo! because you want to start your own directory, you might want to consider the Open Directory Project (http://dmoz.org/about.html). Downloading their freely available directory data, all several hundred megs of it, will give you plenty of information to play with.

—Jacek Artymiak

Hack #47. Tracking Additions to Yahoo!

Keep track of the number of sites added to your favorite Yahoo! categories.

Every day, a squad of surfers at Yahoo! adds new sites to the Yahoo! index. These changes are reflected in the Yahoo! What’s New page (http://dir.yahoo.com/new/), along with the Picks of the Day.

If you’re a casual surfer, you might not care about the number of new sites added to Yahoo!. But there are several scenarios when you might have an interest:

  • You regularly glean information about new sites from Yahoo! Knowing which categories are growing and which categories are stagnant will tell you where to direct your attention.

  • You want to submit sites to Yahoo! Are you going to spend your hard-earned money adding a site to a category where new sites are added constantly (meaning your submitted site might get quickly buried)? Or will you be paying to add to a category that sees few additions (meaning your site might have a better chance of standing out)?

  • You’re interested in trend tracking. Which categories are consistently busy? Which are all but dead? By watching how Yahoo! adds sites to categories, over time you’ll get a sense of their rhythms and trends and detect when unusual activity occurs in a category.

This hack scrapes the recent counts of additions to Yahoo! categories and prints them out, providing an at-a-glance glimpse of additions to various categories. You’ll also get a tab-delimited table of how many sites have been added to each category for each day. A tab-delimited file is excellent for importing into Excel, where you can turn the count numbers into a chart.

The Code

Save the following code to a file called hoocount.pl:

#!/usr/bin/perl -w

use strict;
use Date::Manip;
use LWP::Simple;
use Getopt::Long;

$ENV{TZ} = "GMT" if $^O eq "MSWin32";

# the homepage for Yahoo!'s "What's New".
my $new_url = "http://dir.yahoo.com/new/";

# the major categories at Yahoo!; %final_counts,
# below, holds each category's counts string.
my @categories = ("Arts & Humanities",    "Business & Economy",
                  "Computers & Internet", "Education",
                  "Entertainment",        "Government",
                  "Health",               "News & Media",
                  "Recreation & Sports",  "Reference",
                  "Regional",             "Science", 
                  "Social Science",       "Society & Culture");
my %final_counts; # where we save our final readouts.

# load in our options from the command line.
my %opts; GetOptions(\%opts, "c|count=i");
die unless $opts{c}; # count sites from past $i days.

# if we've been told to count the number of new sites,
# then we'll go through each of our main categories
# for the last $i days and collate a result.

# begin the header
# for our import file.
my $header = "Category";

# from today, going backwards, get $i days.
for (my $i=1; $i <= $opts{c}; $i++) {

   # create a Date::Manip date that will
   # be used to construct the last $i days.
   my $day; # query for Yahoo! retrieval.
   if ($i == 1) { $day = "yesterday"; }
   else { $day = "$i days ago"; }
   my $date = UnixDate($day, "%Y%m%d");

   # add this date to
   # our import file.
   $header .= "\t$date";

   # and download the day.
   my $url = "$new_url$date.html";
   my $data = get($url) or die $!;

   # and loop through each of our categories.
   my $day_count; foreach my $category (sort @categories) {
       $data =~ /$category.*?(\d+)/; my $count = $1 || 0;
       $final_counts{$category} .= "\t$count"; # building our string.
   }
}

# with all our counts finished,
# print out our final file.
print $header . "\n";
foreach my $category (@categories) {
   print $category, $final_counts{$category}, "\n";
}

Running the Hack

The only argument you need to provide the script is the number of days back you’d like it to travel in search of new additions. Since Yahoo! doesn’t archive their “new pages added” indefinitely, a safe upper limit is around two weeks. Here, we’re looking at the past two days:

% perl hoocount.pl --count 2
Category        20030807        20030806
Arts & Humanities       23      23
Business & Economy      88      141
Computers & Internet    2       9
Education       0       4
Entertainment   43      29
Government      3       4
Health  2       7
News & Media    1       1
Recreation & Sports     8       27
Reference       0       0
Regional        142     114
Science 1       2
Social Science  3       0
Society & Culture       7       8

Hacking the Hack

If you’re not only a researcher but also a Yahoo! observer, you might be interested in how the number of sites added changes over time. To that end, you could run this script under cron [Hack #90], and output the results to a file. After three months or so, you’d have a pretty interesting set of counts to manipulate with a spreadsheet program like Excel. Alternatively, you could modify the script to run RRDTOOL [Hack #62] and have real-time graphs.
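
A crontab entry along these lines would do the trick (the paths are placeholders); it appends the previous day’s counts to a running log every morning at 6 a.m.:

# minute hour day-of-month month day-of-week command (paths are placeholders)
0 6 * * * perl /path/to/hoocount.pl --count 1 >> /path/to/hoocounts.txt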

Hack #48. Scattersearch with Yahoo! and Google

Sometimes, illuminating results can be found when scraping from one site and feeding the results into the API of another. With scattersearching, you can narrow down the most popular related results, as suggested by Yahoo! and Google.

We’ve combined a scrape of a Yahoo! web page with a Google search [Hack #45], blending scraped data with data generated via a web service API to good effect. In this hack, we’re doing something similar, except this time we’re taking the results of a Yahoo! search and blending it with a Google search.

Yahoo! has a “Related searches” feature, where you enter a search term and get a list of related terms under the search box, if any are available. This hack scrapes those related terms and performs a Google search for the related terms in the title. It then returns the count for those searches, along with a direct link to the results. Aside from showing how scraped and API-generated data can live together in harmony, this hack is good to use when you’re exploring concepts; for example, you might know that something called Pokemon exists, but you might not know anything about it. You’ll get Yahoo!’s related searches and an idea of how many results each of those searches generates in Google. From there, you can choose the search terms that generate the most results or look the most promising based on your limited knowledge, or you can simply pick a road that appears less traveled.

The Code

Save the following code to a file called scattersearch.pl:

#!/usr/bin/perl -w
#
# Scattersearch -- Use the search suggestions from
# Yahoo! to build a series of allintitle: searches at Google. 

use strict;

use LWP;
use SOAP::Lite;
use CGI qw/:standard/;

# get our query, else die miserably.
my $query = shift @ARGV; die unless $query;

# Your Google API developer's key.
my $google_key = 'insert key here';

# Location of the GoogleSearch WSDL file.
my $google_wdsl = "./GoogleSearch.wsdl";

# search Yahoo! for the query.
my $ua  = LWP::UserAgent->new;
my $url = URI->new('http://search.yahoo.com/search');
$url->query_form(rs => "more", p => $query);
my $yahoosearch = $ua->get($url)->content;
$yahoosearch =~ s/[\f\t\n\r]//isg;

# and determine if there were any results.
$yahoosearch =~ m!Related:(.*?)<spacer!migs; 
die "Sorry, there were no results!\n" unless $1;
my $recommended = $1;

# now, add all our results into
# an array for Google processing.
my @googlequeries;
while ($recommended =~ m!<a href=".*?">(.*?)</a>!mgis) {
    my $searchitem = $1; $searchitem =~ s/nobr|<|>|\///g;
    push (@googlequeries, $searchitem); 
}

# print our header for the results page.
print join "\n",
start_html("ScatterSearch");
     h1("Your Scattersearch Results"),
     p("Your original search term was '$query'"),
     p("That search had " . scalar(@googlequeries). " recommended terms."),
     p("Here are result numbers from a Google search"),
     CGI::start_ol(  );

# create our Google object for API searches.
my $gsrch = SOAP::Lite->service("file:$google_wdsl");

# running the actual Google queries.
foreach my $googlesearch (@googlequeries) {
    my $titlesearch = "allintitle:$googlesearch"; 
    my $count = $gsrch->doGoogleSearch($google_key, $titlesearch,
                                       0, 1, "false", "",  "false",
                                       "", "", "");
    my $url = $googlesearch; $url =~ s/ /+/g; $url =~ s/\"/%22/g;
    print li("There were $count->{estimatedTotalResultsCount} ".
             "results for the recommended search <a href=\"http://www.".
             "google.com/search?q=$url&num=100\">$googlesearch</a>");
}

print CGI::end_ol(  ), end_html;

Running the Hack

This script generates an HTML file, ready for you to upload to a publicly accessible web site. If you want to save the output of a search for "siamese" to a file called scattersearch.html in your Sites directory, run the following command:

% perl scattersearch.pl "siamese" > ~/Sites/scattersearch.html

Your final results, as rendered by your browser, will look similar to Figure 4-2.

Figure 4-2. Scattersearch results for “siamese”

You’ll have to do a little experimenting to find out which terms have related searches. Broadly speaking, very general search terms are bad; it’s better to zero in on terms that people would search for and that would be easy to group together. As of this writing, for example, "heart" has no related search terms, but "blood pressure" does.

Hacking the Hack

You have two choices: you can either hack the interaction with Yahoo! or expand it to include something in addition to or instead of Yahoo! itself. Let’s look at Yahoo! first. If you take a close look at the code, you’ll see we’re passing an unusual parameter to our Yahoo! search results page:

$url->query_form(rs => "more", p => $query);

The rs=>"more" part of the search shows the related search terms. Getting the related search this way will show up to 10 results. If you remove that portion of the code, you’ll get roughly four related searches when they’re available. That might suit you if you want only a few, but maybe you want dozens and dozens! In that case, replace more with all.
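
In other words, the query_form line becomes:

$url->query_form(rs => "all", p => $query);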

Beware, though: this can generate a lot of related searches, and it can certainly eat up your daily allowance of Google API requests. Tread carefully.
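
One simple precaution, not in the original script but easy to bolt on right after the while loop that builds @googlequeries, is to cap the number of related terms before they hit Google:

# keep at most 10 related terms, so one run can't eat the whole API quota.
splice(@googlequeries, 10) if @googlequeries > 10;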

Yahoo! isn’t the only search engine that has related search data. If you’re looking for related searches that will work with general search terms like "heart", try AltaVista’s Prisma (http://www.altavista.com/prisma/):

#!/usr/bin/perl -w
use strict; 
use LWP;

# get our query, else die miserably.
my $query = shift @ARGV; die unless $query;

# search Prisma for the query.
my $ua  = LWP::UserAgent->new;
my $url = URI->new('http://www.altavista.com/web/results');
$url->query_form('q' => $query);

my $prismasearch = $ua->get($url)->content;
$prismasearch =~ s/[\f\t\n\r]//isg;

while ($prismasearch =~ m!title="Add.*?to your.*?">(.*?)</a>!mgis) {
    my $searchitem = $1; print "$searchitem\n";
}

For clusters of related search results, in addition to similar queries, check out AlltheWeb (http://www.alltheweb.com). AlltheWeb’s related and clustered results are at the bottom of the search page, instead of at the top:

#!/usr/bin/perl -w
use strict; use LWP;

# get our query, else die miserably.
my $query = shift @ARGV; die unless $query;

# search AlltheWeb for the query.
my $ua  = LWP::UserAgent->new;
$ua->agent('Mozilla/4.76 [en] (Win98; U)');
my $url = URI->new('http://www.alltheweb.com/search');
$url->query_form('q' => $query, cat => 'web');

my $atwsearch = $ua->get($url)->content;
$atwsearch =~ s/[\f\t\n\r]//isg;

while ($atwsearch =~ m!<li>(.*?)">(.*?)</a>!mgis) {
    my ($searchlink, $searchitem) = ($1, $2);
    next if $searchlink !~ /c=web/;
    print "$searchitem\n";
}

Hack #49. Yahoo! Directory Mindshare in Google

How does link popularity compare in Yahoo!’s searchable subject index versus Google’s full-text index? Find out by calculating mindshare!

Yahoo! and Google are two very different animals. Yahoo! indexes only a site’s main URL, title, and description, while Google builds full-text indexes of entire sites. Surely there’s some interesting cross-pollination when you combine results from the two.

This hack scrapes all the URLs in a specified subcategory of the Yahoo! directory. It then takes each URL and gets its link count from Google. Each link count provides a nice snapshot of how a particular Yahoo! category and its listed sites stack up on the popularity scale.

What’s a link count? It’s simply the total number of pages in Google’s index that link to a specific URL.

There are a couple of ways you can use your knowledge of a subcategory’s link count. If you find a subcategory whose URLs have only a few links each in Google, you may have found a subcategory that isn’t getting a lot of attention from Yahoo!’s editors. Consider going elsewhere for your research. If you’re a webmaster and you’re considering paying to have Yahoo! add you to their directory, run this hack on the category in which you want to be listed. Are most of the links really popular? If they are, are you sure your site will stand out and get clicks? Maybe you should choose a different category.

We got this idea from a similar experiment Jon Udell (http://weblog.infoworld.com/udell/) did in 2001. He used AltaVista instead of Google; see http://udell.roninhouse.com/download/mindshare-script.txt. We appreciate the inspiration, Jon!

The Code

You will need a Google API account (http://api.google.com/), as well as the SOAP::Lite (http://www.soaplite.com/) and HTML::LinkExtor (http://search.cpan.org/author/GAAS/HTML-Parser/lib/HTML/LinkExtor.pm) Perl modules to run the following code:

#!/usr/bin/perl -w

use strict;
use LWP::Simple;
use HTML::LinkExtor;
use SOAP::Lite;

my $google_key  = "your API key goes here";
my $google_wdsl = "GoogleSearch.wsdl";
my $yahoo_dir   = shift || "/Computers_and_Internet/Data_Formats/XML__".
                  "eXtensible_Markup_Language_/RSS/News_Aggregators/";

# download the Yahoo! directory.
my $data = get("http://dir.yahoo.com" . $yahoo_dir) or die $!;

# create our Google object.
my $google_search = SOAP::Lite->service("file:$google_wdsl");
my %urls; # where we keep our counts and titles.

# extract all the links and parse 'em.
HTML::LinkExtor->new(\&mindshare)->parse($data);
sub mindshare { # for each link we find...

    my ($tag, %attr) = @_;

    # continue on only if the tag was a link,
    # and the URL matches Yahoo!'s redirectory.
    return if $tag ne 'a';
    return unless $attr{href} =~ /srd.yahoo/;
    return unless $attr{href} =~ /\*http/;

    # now get our real URL.
    $attr{href} =~ /\*(http.*)/; my $url = $1;

    # and process each URL through Google.
    my $results = $google_search->doGoogleSearch(
                        $google_key, "link:$url", 0, 1,
                        "true", "", "false", "", "", ""
                  ); # wheee, that was easy, guvner.
    $urls{$url} = $results->{estimatedTotalResultsCount};
}

# now sort and display.
my @sorted_urls = sort { $urls{$b} <=> $urls{$a} } keys %urls;
foreach my $url (@sorted_urls) { print "$urls{$url}: $url\n"; }

Running the Hack

The hack has its only configuration—the Yahoo! directory you’re interested in—passed as a single argument (in quotes) on the command line. If you don’t pass one of your own, a default directory will be used instead.

% perl mindshare.pl "/Entertainment/Humor/Procrastination/"

Your results show the URLs in those directories, sorted by total Google links:

340: http://www.p45.net/
246: http://www.ishouldbeworking.com/
81: http://www.india.com/
33: http://www.jlc.net/~useless/
23: http://www.geocities.com/SouthBeach/1915/
18: http://www.eskimo.com/~spban/creed.html
13: http://www.black-schaffer.org/scp/
3: http://www.angelfire.com/mi/psociety
2: http://www.geocities.com/wastingstatetime/

Hacking the Hack

Yahoo! isn’t the only searchable subject index out there, of course; there’s also the Open Directory Project (DMOZ, http://www.dmoz.org), which is the product of thousands of volunteers busily cataloging and categorizing sites on the Web—the web community’s Yahoo!, if you will. This hack works just as well on DMOZ as it does on Yahoo!; they’re very similar in structure.

Replace the default Yahoo! directory with its DMOZ equivalent:

my $dmoz_dir = shift || "/Reference/Libraries/Library_and_Information_".
               "Science/Technical_Services/Cataloguing/Metadata/RDF/".
               "Applications/RSS/News_Readers/";

You’ll also need to change the download instructions:

# download the Dmoz.org directory.
my $data = get("http://dmoz.org" . $dmoz_dir) or die $!;

Next, replace the lines that check whether a URL should be measured for mindshare. When we were scraping Yahoo! in our original script, all directory entries were always prepended with http://srd.yahoo.com/ and then the URL itself. Thus, to ensure we received a proper URL, we skipped over the link unless it matched that criteria:

return unless $attr{href} =~ /srd.yahoo/;
return unless $attr{href} =~ /\*http/;

Since DMOZ is an entirely different site, our checks for validity have to change. DMOZ doesn’t modify the outgoing URL, so our previous Yahoo! checks have no relevance here. Instead, we’ll make sure it’s a full-blooded location (i.e., it starts with http://) and it doesn’t match any of DMOZ’s internal page links. Likewise, we’ll ignore searches on other engines:

return unless $attr{href} =~ /^http/;
return if $attr{href} =~ /dmoz|google|altavista|lycos|yahoo|alltheweb/;

Our last change is to modify the bit of code that gets the real URL from Yahoo!’s modified version. Instead of “finding the URL within the URL”:

# now get our real URL.
$attr{href} =~ /\*(http.*)/; my $url = $1;

we simply assign the URL that HTML::LinkExtor has found:

# now get our real URL.
my $url = $attr{href};

Can you go even further with this? Sure! You might want to search a more specialized directory, such as the FishHoo! fishing search engine (http://www.fishhoo.com/).

You might want to return only the most linked-to URL from the directory, which is quite easy, by piping the results [Hack #28] to another common Unix utility:

% perl mindshare.pl | head -1

Alternatively, you might want to go ahead and grab the top 10 Google matches for the URL that has the most mindshare. To do so, add the following code to the bottom of the script:

print "\nMost popular URLs for the strongest mindshare:\n";
my $most_popular = shift @sorted_urls;
my $results = $google_search->doGoogleSearch(
                    $google_key, "$most_popular", 0, 10,
                    "true", "", "false", "", "", "" );

foreach my $element (@{$results->{resultElements}}) {
   next if $element->{URL} eq $most_popular;
   print " * $element->{URL}\n";
   print "   \"$element->{title}\"\n\n";
}

Then, run the script as usual (the output here uses the default hardcoded directory):

% perl mindshare.pl
27800: http://radio.userland.com/
6670: http://www.oreillynet.com/meerkat/
5460: http://www.newsisfree.com/
3280: http://ranchero.com/software/netnewswire/
1840: http://www.disobey.com/amphetadesk/
847: http://www.feedreader.com/
797: http://www.serence.com/site.php?page=prod_klipfolio
674: http://bitworking.org/Aggie.html
492: http://www.newzcrawler.com/
387: http://www.sharpreader.net/
112: http://www.awasu.com/
102: http://www.bloglines.com/
67: http://www.blueelephantsoftware.com/
57: http://www.blogtrack.com/
50: http://www.proggle.com/novobot/

Most popular URLs for the strongest mindshare:
 * http://groups.yahoo.com/group/radio-userland/
   "Yahoo! Groups : radio-userland"

 * http://groups.yahoo.com/group/radio-userland-francophone/message/76
   "Yahoo! Groupes : radio-userland-francophone Messages : Message 76 ... "

 * http://www.fuzzygroup.com/writing/radiouserland_faq.htm
   "Fuzzygroup :: Radio UserLand FAQ"
...

Hack #50. Weblog-Free Google Results

With so many weblogs being indexed by Google, you might worry about too much emphasis on the hot topic of the moment. In this hack, we’ll show you how to remove the weblog factor from your Google results.

Weblogs—those frequently updated, link-heavy personal pages—are quite the fashionable thing these days. There are at least 400,000 active weblogs across the Internet, covering almost every possible subject and interest. For humans, they’re good reading, but for search engines they are heavenly bundles of fresh content and links galore.

But some people think the search engine’s delight in weblogs is slanting their search results and giving too much emphasis to too small a group of recent rather than evergreen content. As I write, for example, I am the third most important Ben on the Internet, according to Google. This rank comes solely from my weblog’s popularity.

This hack searches Google, discarding any results coming from weblogs. It uses the Google Web Services API (http://api.google.com) and the API of Technorati (http://www.technorati.com/members), an excellent interface to David Sifry’s weblog data-tracking tool [Hack #70]. Both APIs require keys, available from the URLs mentioned.

Finally, you’ll need a simple HTML page with a form that passes a text query to the parameter q (the query that will run on Google), something like this:

<form action="googletech.cgi" method="POST">
Your query: <input type="text" name="q">
<input type="submit" name="Search!" value="Search!">
</form>

The Code

You’ll need the XML::Simple and SOAP::Lite Perl modules. Save the following code to a file called googletech.cgi:

#!/usr/bin/perl -w
# googletech.cgi
# Getting Google results
# without getting weblog results.
use strict;
use SOAP::Lite;
use XML::Simple;
use CGI qw(:standard);
use HTML::Entities (  );
use LWP::Simple qw(!head);

my $technoratikey = "your technorati key here";
my $googlekey = "your google key here";

# Set up the query term
# from the CGI input.
my $query = param("q");

# Initialize the SOAP interface and run the Google search.
my $google_wdsl = "http://api.google.com/GoogleSearch.wsdl";
my $service = SOAP::Lite->service($google_wdsl);
my $result  = $service->doGoogleSearch($googlekey, $query, 0, 10,
                                       "false", "", "false", "", "", "");

# Start returning the results page -
# do this now to prevent timeouts
my $cgi = new CGI;

print $cgi->header(  );
print $cgi->start_html(-title=>'Blog Free Google Results');
print $cgi->h1('Blog Free Results for '. "$query");
print $cgi->start_ul(  );

# Go through each of the results
foreach my $element (@{$result->{'resultElements'}}) {

    my $url = HTML::Entities::encode($element->{'URL'});

    # Request the Technorati information for each result.
    my $technorati_result = get("http://api.technorati.com/bloginfo?".
                                "url=$url&key=$technoratikey");

    # Parse this information.
    my $parser = new XML::Simple;
    my $parsed_feed = $parser->XMLin($technorati_result);

    # If Technorati considers this site to be a weblog,
    # go onto the next result. If not, display it, and then go on.
    if ($parsed_feed->{document}{result}{weblog}{name}) { next; }
    else {
        print $cgi->li('<a href="'.$url.'">'.$element->{title}.'</a>');
        print $cgi->li("$element->{snippet}");
    }
}
print $cgi -> end_ul(  );
print $cgi->end_html;

Let’s step through the meaningful bits of this code. First comes pulling the query in from the CGI form and sending it to Google. Notice the 10 in the doGoogleSearch call; this is the number of search results requested from Google. You should set this as high as Google will allow, or else you might find that searches for terms that are extremely popular in the weblogging world return no results at all, every hit having been rejected as originating from a blog.

Since we’re about to make a web services call for every one of the returned results, which might take a while, we want to start returning the results page now; this helps prevent connection timeouts. As such, we spit out a header using the CGI module, then jump into our loop.

We then get to the final part of our code: actually looping through the search results returned by Google and passing the HTML-encoded URL to the Technorati API as a get request. Technorati will then return its results as an XML document.

Be careful you do not run out of Technorati requests. As I write this, Technorati is offering 500 free requests a day, which, with this script, is around 50 searches. If you make this script available to your web site’s audience, you will soon run out of Technorati requests. One possible workaround is forcing the user to enter her own Technorati key. You can get the user’s key from the same form that accepts the query. See the “Hacking the Hack” section for a means of doing this.

Parsing this result is a matter of passing it through XML::Simple. Since Technorati returns only an XML construct containing name when the site is thought to be a weblog, we can use the presence of this construct as a marker. If the program sees the construct, it skips to the next result. If it doesn’t, the site is not thought to be a weblog by Technorati and we display a link to it, along with the title and snippet (when available) returned by Google.

Hacking the Hack

As mentioned previously, this script can burn through your Technorati allowances rather quickly under heavy use. The simplest way of solving this is to force the end user to supply his own Technorati key. First, add a new input to your HTML form for the user’s key:

Your Technorati key: <input type="text" name="key">

Then, suck in the user’s key as a replacement to your own:

# Set up the query term
# from the CGI input.
my $query = param("q");
$technoratikey = param("key");
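
If a visitor leaves the key field blank, that assignment wipes out your hardcoded key entirely. A minimal extra tweak (a sketch, not part of the original hack) keeps your own key as the fallback:

# Use the visitor's Technorati key only if one was actually supplied;
# otherwise, keep the key already hardcoded in the script.
my $user_key = param("key");
$technoratikey = $user_key if defined $user_key and $user_key =~ /\S/;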

—Ben Hammersley

Hack #51. Spidering, Google, and Multiple Domains

When you want to search a site, you tend to go straight to the site itself and use its native capabilities. But what if you could use Google to search across many similar sites, scraping the pages of most relevance?

If you’re searching for the same thing on multiple sites, it’s handy to use Google’s site: syntax, which allows you to restrict your search to just a particular domain (e.g., perl.org) or set of domains (e.g., org). For example, if you want to search several domains for the word perl, you might have a query that looks like this:

perl ( site:oreilly.com | site:perl.com | site:mit.edu | site:yahoo.com)

You can combine this search with a Perl script to do some specific searching that you can’t do with just Google and can’t do easily with just Perl.

You might wonder why you’d want to involve Google at all in this search. Why not just go ahead and search each domain separately via their search forms and LWP::Simple [Hack #9] or LWP::UserAgent [Hack #10]? There are a few reasons, the first being that each place you want to search might not have its own search engine. Second, Google might have syntaxes—such as title search, URL search, and full-word wildcard search—that the individual sites aren’t providing. Google returns its search results in an array that’s easy to manipulate. You don’t have to use regular expressions or parsing modules to get what you want. And, of course, you’ll also have all your results in one nice, regular format, independent of site-specific idiosyncrasies.

Example: Top 20 Searching on Google

Say you’re a publisher, like O’Reilly, that is interested in finding out which universities are using your books as textbooks. You could do the search at Google itself, experimenting with keywords and limiting your search to the top-level domain edu (like syllabus o'reilly site:edu, or syllabus perl "required reading" site:edu), and you’d have some success. But you’d get far more than the maximum number of results (Google returns only 1,000 matches for a given query) and you’d also get a lot of false positives—pages that include mentions about a book but don’t provide specific course information, or maybe weblogs discussing a class, or even old news stories! It’s difficult to get a list of just class results with keyword searching alone.

So, there are two overall problems to be solved: narrowing your search to edu leaves your pool of potential results too broad, and it’s extremely difficult to find just the right keywords for restricting to university course pages.

This hack tries to solve those problems. First, it uses the top 20 computer science grad schools (as ranked by U.S. News & World Report) as its site searches and puts those sites into an array. Then, it goes through the array and searches for pages from those schools five at a time using the site: syntax. Each query also searches for O'Reilly * Associates (to match both O'Reilly & Associates and O'Reilly and Associates) and the word syllabus.

The last tweak goes beyond keyword searching and makes use of Perl’s regular expressions. As each search result is returned, both the title and the URL are checked for the presence of a three-digit string. A three-digit string? Yup, a course number! This quick regular expression eliminates a lot of the false positives you’d get from a regular Google search. It is not something you can do through Google’s interface.
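
Here is a quick, standalone illustration of that check (the URLs are invented for the example; only the first contains a course-style number):

#!/usr/bin/perl -w
# A tiny demo of the three-digit course-number filter.
use strict;

my @examples = (
    'http://www.example.edu/classes/cs101/syllabus.html',
    'http://www.example.edu/library/new-books.html',
);

for my $url (@examples) {
    print "Looks like a course page: $url\n"
        if $url =~ /http:.*?\/.*?\d{3}.*?/;
}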

Search results that make it over all these hurdles are saved to a file.

The Code

This hack makes use of the SOAP-based Google Web Services API. You’ll need your own Google search key (http://api.google.com) and a copy of the SOAP::Lite (http://www.soaplite.com) Perl module installed.

Save the following code to a file called textbooks.pl:

#!/usr/bin/perl -w
# textbooks.pl
# Generates a list of O'Reilly books used
# as textbooks in the top 20 universities.
# Usage: perl textbooks.pl

use strict;
use SOAP::Lite;

# all the Google information
my $google_key  = "your google key here";
my $google_wdsl = "GoogleSearch.wsdl";
my $gsrch       = SOAP::Lite->service("file:$google_wdsl");

my @toptwenty = ("site:cmu.edu", "site:mit.edu", "site:stanford.edu",
       "site:berkeley.edu", "site:uiuc.edu","site:cornell.edu",
       "site:utexas.edu", "site:washington.edu", "site:caltech.edu",
       "site:princeton.edu", "site:wisc.edu", "site:gatech.edu",
       "site:umd.edu", "site:brown.edu", "site:ucla.edu",
       "site:umich.edu", "site:rice.edu", "site:upenn.edu",
       "site:unc.edu", "site:columbia.edu");

my $twentycount = 0;
open (OUT,'>top20.txt')
 or die "Couldn't open: $!";

while ($twentycount < 20) {

   # our five universities
   my $arrayquery =
      "( $toptwenty[$twentycount] | $toptwenty[$twentycount+1] ".
      "| $toptwenty[$twentycount+2] | $toptwenty[$twentycount+3] ".
      "| $toptwenty[$twentycount+4] )";

   # our search term.
   my $googlequery = "\"o'reilly * associates\" syllabus $arrayquery"; 
   print "Searching for $googlequery\n"; 

   # and do it, up to a maximum of 50 results.
   my $counter = 0; while ($counter < 50) {
       my $result = $gsrch->doGoogleSearch($google_key, $googlequery,
                            $counter, 10, "false", "",  "false",
                            "lang_en", "", "");
       # foreach result.
       foreach my $hit (@{$result->{'resultElements'}}){
           my $urlcheck = $hit->{'URL'};
           my $titlecheck = $hit->{'title'}; 
           my $snip = $hit->{'snippet'};

           # if the URL or title has a three-digit
           # number in it, we clean up the snippet
           # and print it out to our file.
           if ($urlcheck =~/http:.*?\/.*?\d{3}.*?/
                 or $titlecheck =~/\d{3}/) {
              $snip =~ s/<b>/ /g;
              $snip =~ s/<\/b>/ /g;
              $snip =~ s/&#39;/'/g;
              $snip =~ s/&quot;/"/g;
              $snip =~ s/&amp;/&/g;
              $snip =~ s/<br>/ /g;
              print OUT "$hit->{title}\n";
              print OUT "$hit->{URL}\n";
              print OUT "$snip\n\n";
           }
        }

        # go get 10 more
        # search results.
        $counter += 10;
   }

   # our next schools.
   $twentycount += 5; 
}

Running the Hack

Running the hack requires no switches or variables:

% perl textbooks.pl

The output file, top20.txt, looks something like this:

Programming Languages and Compilers CS 164 - Spring 2002 
http://www-inst.eecs.berkeley.edu/~cs164/home.html 
... Tentative  Syllabus  & Schedule of Assignments.  ... you might find 
useful is "Unix in  a Nutshell (System V Edition)" by Gilly, published by  O 
' Reilly   & ...

CS378 (Spring 03): Linux Kernel Programming 
http://www.cs.utexas.edu/users/ygz/378-03S/course.html 
 ...  Guide, 2nd Edition By Olaf Kirch & Terry Dawson  O ' Reilly &   
Associates, ISBN 1-56592  ...  Please  visit Spring 02 homepage for 
information on  syllabus, projects, and  ...    
 
LIS 530: Organizing Information Using the Internet 
http://courses.washington.edu/lis541/syllabus-intro.html 
Efthimis N. Efthimiadis' Site LIS-541  Syllabus  Main Page Syllabus  - Aims  
& Objectives.  ...  Jennifer Niederst.  O'Reilly   and   Associates , 1999.
 
LIS415B * Spring98 * Class Schedule 
http://alexia.lis.uiuc.edu/course/spring1998/415B/lis415.spring98.schedule.
html 
LIS415 (section B): Class Schedule. Spring 98.  Syllabus ...  In Connecting 
to the Internet:  A buyer's guide. Sebastapol, California:  O ' Reilly &   
Associates .
 
Implementation of Information Storage and Retrieval 
http://alexia.lis.uiuc.edu/~dubin/429/429.pdf 
...  In addition to this  syllabus , this course is governed by the rules 
and  ... Advanced  Perl Programming , first edition ( O'Reilly   and   
Associates , Inc.,

INET 200: HTML, Dynamic HTML, and Scripting 
http://www.outreach.washington.edu/dl/courses/inet200/ 
...  such as HTML & XHTML: the Definitive Guide, 4 th edition, O'Reilly   
and  Associates   (which I  ... are assigned, and there is one on the course
syllabus  as Appendix B  ...

Hacking the Hack

There are plenty of things to change in this hack. Since it uses a very specific array (that is, the top 20 computer science grad schools), tweaking the array to your needs should be the first place you start. You can make that array anything you want: different kinds of schools, your favorite or local schools, and so on. You can even break out schools by athletic conference and check them that way. In addition, you can change the keywords to something more befitting your tastes. Maybe you don’t want to search for textbooks, but you’d rather find everything from chemistry labs to vegetarian groups. Change your keywords appropriately (which will probably require a little experimenting in Google before you get them just right) and go to town.

And don’t forget, you’re also running a regular expression check on each keyword before you save it to a file. Maybe you don’t want to do a three-digit check on the title and URL. Maybe you want to check for the string lib, either by itself or as part of the word library:

($urlcheck =~/http:.*?\/.*?lib.*?/) or ($titlecheck =~/.*?lib.*?/)

This particular search will find materials in a school library’s web pages, for the most part, or in web pages that mention the word “library” in the title.

If you’ve read Google Hacks (http://www.oreilly.com/catalog/googlehks/), you might remember that Google offers wildcards for full-word searches, but not for stemming. In other words, you can search for three * mice and get three blind mice, three blue mice, three green mice, and so on. But you can’t plug the query moon* into Google and get moons, moonlight, moonglow, and so on. When you use Perl to perform these checks, you are expanding the kind of searching possible with Google.
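
For instance, here is a small sketch of the kind of stem-style filtering you can layer on top of Google results with a Perl regular expression (the sample titles are invented for the example):

#!/usr/bin/perl -w
# Match "moon" plus any trailing word characters: moons,
# moonlight, moonglow, and so on -- something Google's
# full-word wildcard can't do.
use strict;

my @titles = (
    'Moonlight Hiking in the Cascades',
    'Phases of the Moon Explained',
    'Monday Morning News Roundup',
);

for my $title (@titles) {
    print "$title\n" if $title =~ /\bmoon\w*/i;
}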

Hack #52. Scraping Amazon.com Product Reviews

While Amazon.com has made some reviews available through their Web Services API, most are available only at the Amazon.com web site, requiring a little screen scraping to grab them.

If you’ve written a book called Spidering Hacks and you’re interested to hear what people are saying about it, you could run off to Amazon.com each and every day to check out the reviews. Well, you certainly could, but you wouldn’t, else you’d deserve every bad comment that came your way. Here’s a way to integrate Amazon.com reviews with your web site. Unlike linking or monitoring reviews for changes, this puts the entire text of Amazon.com reviews into your own pages.

The easiest and most reliable way to access customer reviews programmatically is through Amazon.com’s Web Services API. Unfortunately, the API gives only a small window to the larger number of reviews available. An API query for the book Cluetrain Manifesto, for example, includes only three user reviews. If you visit the review page for that book, though, you’ll find 128 reviews. To dig deeper into the reviews available on Amazon.com and use all of them on your own web site, you’ll need to spelunk a bit further into scripting.

The Code

This Perl script builds a URL to the review page for a given ASIN, uses regular expressions to find the reviews, and breaks the review into its pieces: rating, title, date, reviewer, and the text of the review.

Save the following script to a file called get_reviews.pl:

#!/usr/bin/perl -w
# get_reviews.pl
#
# A script to scrape Amazon, retrieve
# reviews, and write to a file.
# Usage: perl get_reviews.pl <asin>
use strict;
use LWP::Simple;

# Take the ASIN from the command line.
my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n";

# Assemble the URL from the passed ASIN.
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";

# Set up unescape-HTML rules. Quicker than URI::Escape.
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;

# Request the URL.
my $content = get($url);
die "Could not retrieve $url" unless $content;

# Loop through the HTML, looking for matches
while ($content =~ m!<img.*?stars-(\d)-0.gif.*?>.*?<b>(.*?)</b>, (.*?)\n.*?Reviewer:\n<b>\n(.*?)</b>.*?</table>\n(.*?)<br>\n<br>!mgis) {

    my($rating,$title,$date,$reviewer,$review) = 
                      ($1||'',$2||'',$3||'',$4||'',$5||'');
    $reviewer =~ s!<.+?>!!g;   # drop all HTML tags
    $reviewer =~ s!\(.+?\)!!g; # remove anything in parenthesis
    $reviewer =~ s!\n!!g;      # remove newlines
    $review =~ s!<.+?>!!g;     # drop all HTML tags
    $review =~ s/($unescape_re)/$unescape{$1}/migs; # unescape.

    # Print the results
    print "$title\n" . "$date\n" . "by $reviewer\n" .
          "$rating stars.\n\n" . "$review\n\n";

}

Running the Hack

This script can be run from a command line, and it requires an ASIN—an Amazon.com unique ID that can be found in the Product Details of each and every product, listed as either “ISBN” or “ASIN”, as shown in Figure 4-3.

Figure 4-3. Amazon.com’s unique ID, listed as an ASIN or ISBN

The reviews are too long to read as they scroll past your screen, so it helps to send the information to a text file (in this case, reviews.txt), like so:

% perl get_reviews.pl asin > reviews.txt

—Paul Bausch

Hack #53. Receive an Email Alert for Newly Added Amazon.com Reviews

This hack keeps an eye on Amazon.com and notifies you, via email, when a new product review is posted to items you’re tracking.

There are obviously some products you care about more than others, and it’s good to be aware of how those products are perceived. Reviews give feedback to publishers, authors, and manufacturers; help customers make buying decisions; and help other retailers decide what to stock. If you want to monitor all the reviews for a product or set of products, visiting each Product Details page to see if a new review has been added is a tedious task.

Instead, you can use a script to periodically check the number of reviews for a given item, and have it send you an email when a new review is added.

The Code

This script requires you to have the XML::Simple Perl module installed, a Developer’s Token (http://www.amazon.com/gp/aws/landing.html), and a product’s unique ASIN (included in the details of every Amazon.com product).

Save the following script to a file called review_monitor.pl:

#!/usr/bin/perl -w
# review_monitor.pl
#
# Monitors products, sending email when a new review is added.
# Usage: perl review_monitor.pl <asin>
use strict;
use LWP::Simple;
use XML::Simple;

# Your Amazon developer's token.
my $dev_token='insert developer token';

# Your Amazon affiliate code. Optional.
# See http://associates.amazon.com/.
my $af_code='insert affiliate tag';

# Location of sendmail and your email.
my $sendmailpath = "insert sendmail location";
my $emailAddress = "insert your email address";

# Take the ASIN from the command line.
my $asin = shift @ARGV or die "Usage: perl review_monitor.pl <asin>\n";

# Get the number of reviews the last time this script ran.
open (ReviewCountDB, "<reviewCount_$asin.db");
my $lastReviewCount = <ReviewCountDB> || 0;
close(ReviewCountDB); # errors?! bah!

# Assemble the query URL (RESTian).
my $url = "http://xml.amazon.com/onca/xml2?t=$af_code" . 
          "&dev-t=$dev_token&type=heavy&f=xml" .
          "&AsinSearch=$asin";

# Grab the content...
my $content = get($url);
die "Could not retrieve $url" unless $content;

# And parse it with XML::Simple.
my $response = XMLin($content);

# Send email if a review has been added.
my $currentReviewCount = $response->{Details}->{Reviews}->{TotalCustomerReviews};
my $productName        = $response->{Details}->{ProductName};
if ($currentReviewCount > $lastReviewCount) {
    open (MAIL, "|$sendmailpath -t") || die "Can't open mail program!\n";
    print MAIL "To: $emailAddress\n";
    print MAIL "From: Amazon Review Monitor\n";
    print MAIL "Subject: A Review Has Been Added!\n\n";
    print MAIL "Review count for $productName is $currentReviewCount.\n";
    close (MAIL);

    # Write the current review count to a file.
    open(ReviewCountDB, ">reviewCount_$asin.db");
    print ReviewCountDB $currentReviewCount;
    close(ReviewCountDB);
}

This code performs a standard Web Services ASIN query, looking for one bit of data: the total number of customer reviews (TotalCustomerReviews). The script saves the number of reviews in a text file (reviewCount_ASIN.db) and, if the count has gone up since the last run, sends an email to let you know.

The $sendmailpath variable should point to your local copy of sendmail, the program that sends email from the server. Most ISPs have sendmail installed in some form or another (often at /usr/bin/sendmail). Check with your local administrator or Internet Service Provider (ISP) if you’re not sure where it’s located.
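
If you can’t find a sendmail binary at all, one alternative (not part of the original hack) is the CPAN module Mail::Sendmail, which talks to an SMTP server directly. Here is a minimal sketch; the addresses and mail server are placeholders, and the two variables stand in for values the main script already has:

#!/usr/bin/perl -w
# send_alert.pl -- send the notification over SMTP instead of
# piping to a local sendmail binary.
use strict;
use Mail::Sendmail;

my $productName        = 'Example Product';  # from the main script
my $currentReviewCount = 4;                   # from the main script

sendmail(
    To      => 'you@example.com',
    From    => 'Amazon Review Monitor <monitor@example.com>',
    Subject => 'A Review Has Been Added!',
    Message => "Review count for $productName is $currentReviewCount.\n",
    smtp    => 'mail.example.com',            # your outgoing mail server
) or die "Mail error: $Mail::Sendmail::error\n";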

Running the Hack

Run the script from the command line, passing it an ASIN (to find an ASIN, see Figure 4-3 in [Hack #52] for guidance):

% perl review_monitor.pl ASIN
Ideally, you want to run this script once every so often in the background, instead of manually executing this query every day. On Linux, you can set it to run as a cron job [Hack #90], like so:

0 12 * * 1-5 perl review_monitor.pl ASIN

This schedules the script to run Monday through Friday at noon on each day. Be sure to replace ASIN with a real ASIN, and add jobs as necessary for all the items you want to monitor.

On Windows, you can run the script as a Scheduled Task. From the Control Panel, choose Scheduled Tasks and then Add Scheduled Task. Follow the wizard to set your execution time, and you should be all set for review notifications!

—Paul Bausch

Hack #54. Scraping Amazon.com Customer Advice

Screen scraping can give you access to Amazon.com community features not yet implemented through Amazon.com’s public Web Services API. In this hack, we’ll implement a script to scrape customer buying advice.

Customer buying advice isn’t available through Amazon.com’s Web Services API, so if you’d like to include this information on a remote site, you’ll have to get it from Amazon.com’s site through scraping. The first step to this hack is knowing where to find all the customer advice on one page. The following URL links directly to the advice page for a given ASIN (the unique ID Amazon.com displays for each product [Hack #52]):

http://amazon.com/o/tg/detail/-/insert ASIN/?vi=advice

For example, here is the advice page for Mac OS X Hacks:

http://amazon.com/o/tg/detail/-/0596004605/?vi=advice

The Code

This Perl script splits the advice page into two variables, based on the headings “in addition to” and “instead of.” It then loops through those sections, using regular expressions to match the products’ information. The script then formats and prints the information.

Save the following script to a file called get_advice.pl:

#!/usr/bin/perl -w
# get_advice.pl
#
# A script to scrape Amazon to retrieve customer buying advice
# Usage: perl get_advice.pl <asin>
use strict; use LWP::Simple;

# Take the ASIN from the command line.
my $asin = shift @ARGV or die "Usage: perl get_advice.pl <asin>\n";

# Assemble the URL from the passed ASIN.
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=advice";

# Set up unescape-HTML rules. Quicker than URI::Escape.
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;

# Request the URL.
my $content = get($url);
die "Could not retrieve $url" unless $content;

# Get our matching data.
my ($inAddition) = (join '', $content)
    =~ m!in addition to(.*?)(instead of)?</td></tr>!mis;
my ($instead)    = (join '', $content)
    =~ m!recommendations instead of(.*?)</table>!mis;

# Look for "in addition to" advice.
if ($inAddition) { print "-- In Addition To --\n\n";
   while ($inAddition =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!mgis) {
       my ($place,$thisAsin,$title,$number) = ($1||'',$2||'',$3||'',$4||'');
       $title =~ s/($unescape_re)/$unescape{$1}/migs; #unescape HTML 
       print "$place $title ($thisAsin)\n(Recommendations: $number)\n\n";
   }
}

# Look for "instead of" advice.
if ($instead) { print "-- Instead Of --\n\n";
    while ($instead =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!mgis) {
        my ($place,$thisAsin,$title,$number)
          = ($1||'',$2||'',$3||'',$4||'');
        $title =~ s/($unescape_re)/$unescape{$1}/migs; #unescape HTML 
        print "$place $title ($thisAsin)\n(Recommendations: $number)\n\n";
    }
}

Running the Hack

You can run this script from the command line, passing in any ASIN. Here is the one for Mac OS X Hacks:

% perl get_advice.pl 0596004605
-- In Addition To --

1. Mac OS X: The Missing Manual, Second Edition (0596004508)
(Recommendations: 1)

2. Mac Upgrade and Repair Bible, Third Edition (0764525948)
(Recommendations: 1)

If the book has long lists of alternate products, send the output to a text file. This example sends all alternate product recommendations for Google Hacks to a file called advice.txt:

% perl get_advice.pl 0596004478 > advice.txt

—Paul Bausch

Hack #55. Publishing Amazon.com Associates Statistics

Share some insider knowledge, such as the most popular item sold, with your site’s audience by republishing your Amazon.com Associates sales statistics.

Your web site has a unique audience, and looking at what they purchase through your Amazon.com Associate links can tell you more about them. It can provide insights into other items you might want to sell on your site, and it can help show what’s foremost on your visitors’ minds (for better or worse). Just as Amazon.com shares its aggregated sales information in the form of purchase circles, you can create your own purchase circle list by publishing your Associates sales information.

Your readers are probably just as curious about sales trends through your site as you are. Publishing the list can build a sense of community and, don’t forget, drive more sales through Associate links.

You could save the HTML reports available through your Associates account (http://associates.amazon.com) through your browser, but it would be much easier to automate the process and integrate it into your site design with a few lines of Perl.

The Code

To run this code, you’ll need to set the email address and password you use to log into your Associates account. This script will then do the logging in for you, and download the appropriate sales report. Once the script has the report, it will reformat it as HTML.

Because this script logs into Amazon.com, it requires the use of a cookie to remind Amazon.com that you’re an authenticated user. Since this is a one-time-only request, we use an in-memory cookie (which is forgotten when the script is finished).

The code listed here intentionally logs you in under an unsecured HTTP connection, to better ensure that the script is portable across systems that don’t have the relevant SSL libraries installed. If you know you have them working properly, be sure to change http:// to https:// to gain some added protection for your login information.

Save the following script to a file called get_earnings_report.pl:

#!/usr/bin/perl -w
# get_earnings_report.pl
#
# Logs into Amazon, downloads earning report,
# and writes an HTML version for your site.
# Usage: perl get_earnings_report.pl
use strict;
use URI::Escape;
use HTTP::Cookies;
use LWP::UserAgent;

# Set your Associates account info.
my $email = 'insert email address';
my $pass = 'insert password';
my $aftag = 'insert associates tag';

# Create a user agent object
# and fake the agent string.
my $ua = LWP::UserAgent->new;
$ua->agent("(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)");
$ua->cookie_jar({}); # in-memory cookie jar.

# Request earning reports, logging in as one pass.
my $rpturl  = "http://associates.amazon.com/exec/panama/login/".
              "attempt/customer/associates/no-customer-id/25/".
              "associates/resources/reporting/earnings/";
my $rptreq  = HTTP::Request->new(POST => $rpturl);
my $rptdata = "report-type=shipments-by-item".   # get individual items
              "&date-selection=qtd".             # all earnings this quarter
              "&login_id=".uri_escape($email).   # our email address.
              "&login_password=".uri_escape($pass).  # and password.
              "&submit.download=Download my report". # get downloadble.
              "&enable-login-post=true"; # log in and post at once.
$rptreq->content_type('application/x-www-form-urlencoded');
$rptreq->content($rptdata); my $report = $ua->request($rptreq);
 
# Uncomment the following line to see
# the report if you need to debug.
# print $report->content;

# Set the report to array.
my @lines = split(/\n/, $report->content);
 
# Get the time period.
my @fromdate = split(/\t/, $lines[1]);
my @todate = split(/\t/, $lines[2]);
my $from = $fromdate[1];
my $to = $todate[1];
 
# Print header...
print "<html><body>";
print "<h2>Items Purchased Through This Site</h2>";
print "from $from to $to <br><br>\n";
print "<ul>";
 
# Loop through the rest of the report.
splice(@lines,0,5);
foreach my $line (@lines) {
    my @fields  = split(/\t/, $line);
    my $title   = $fields[1];
    my $asin    = $fields[2];
    my $edition = $fields[4];
    my $items   = $fields[8];

    # Format items as HTML for display.
    print "<li><a href=\"http://www.amazon.com/o/ASIN/$asin/ref=nosim/".
          "$aftag\">$title</a> ($items) $edition <br>\n";
}
print "</ul></body></html>";

Running the Hack

Run the hack from a command line:

% perl get_earnings_report.pl

It prints out the formatted HTML results, so you might want to pipe its output to another file, like this:

% perl get_earnings_report.pl > amazon_report.html

You could also set this to run on a regular schedule [Hack #90] so your community’s buying habits stay up-to-date.
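
For example, a crontab entry along these lines (the paths are placeholders) rebuilds the page early every morning:

0 6 * * * perl /home/you/get_earnings_report.pl > /www/htdocs/amazon_report.html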

—Paul Bausch

Hack #56. Sorting Amazon.com Recommendations by Rating

Find the highest-rated items among your Amazon.com product recommendations.

If you’ve taken the time to fine-tune your Amazon.com recommendations, you know how precise they can be. If you’ve also looked at the star rating for some of your favorite products, then you know that the rating can be a good indication of quality. The Amazon.com recommendation and the customer rating both add important information to a product, and they can help you make a decision about whether or not to buy one item over another.

To get a feel for the products Amazon.com recommends for you, you can visit your book recommendations at any time at the following URL:

http://www.amazon.com/o/tg/stores/recs/instant-recs/-/books/0/

In addition to books, you can also find recommendations in other product categories. You can replace books in the URL with any of Amazon.com’s catalogs, including music, electronics, dvd, and photo.

When you browse to your recommendations, you’ll likely find several pages of items. Wouldn’t it be great if you could add the customer review dimension by sorting the entire list by its average star rating? This hack does exactly that with a bit of screen scraping.

The Code

Because Amazon.com doesn’t offer sorting by customer rating, this script first gathers all of your Amazon.com book recommendations into one list. By providing your Amazon.com account’s email address and password, the script logs in as you and then requests the book recommendations page. It continues to request pages in a loop, picking out the details of your product recommendations with regular expressions. Once all the products and details are stored in an array, they can be sorted by star rating and printed out in any order you want—in this case, the average star rating.

Be sure to replace your email address and password in the proper places in the following code. You’ll also need to have write permission in the script’s directory so you can store Amazon.com cookies in a text file, cookies.lwp.

The code listed here intentionally logs you in under an unsecured HTTP connection, to better ensure that the script is portable across systems that don’t have the relevant SSL libraries installed. If you know you have them working properly, be sure to change http:// to https:// to gain some added protection for your login information.

Save the following script to a file called get_recommendations.pl:

#!/usr/bin/perl  -w
# get_recommendations.pl
#
# A script to log on to Amazon, retrieve
# recommendations, and sort by highest rating.
# Usage: perl get_recommendations.pl

use strict;
use HTTP::Cookies;
use LWP::UserAgent;

# Amazon email and password.
my $email = 'insert email address';
my $password = 'insert password';

# Amazon login URL for normal users.
my $logurl = "http://www.amazon.com/exec/obidos/flex-sign-in-done/";

# Now log into Amazon.
my $ua = LWP::UserAgent->new;
$ua->agent("(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)");
$ua->cookie_jar( HTTP::Cookies->new('file' => 'cookies.lwp','autosave' => 1));
my %headers = ( 'content-type' => "application/x-www-form-urlencoded" );
$ua->post($logurl, 
  [ email       => $email,
    password    => $password,
    method      => 'get', opt => 'oa',
    page        => 'recs/instant-recs-sign-in-standard.html',
    response    => "tg/recs/recs-post-login-dispatch/-/recs/pd_rw_gw_r",
    'next-page' => 'recs/instant-recs-register-standard.html',
    action      => 'sign-in checked' ], %headers);

# Set some variables to hold
# our sorted recommendations.
my (%title_list, %author_list);
my (@asins, @ratings, $done);

# We're logged in, so request the recommendations.
my $recurl = "http://www.amazon.com/exec/obidos/tg/". 
             "stores/recs/instant-recs/-/books/0/t";

# Set all Amazon recommendations in
# an array/title and author in hashes.
until ($done) {

     # Send the request for the recommendations.
     my $content = $ua->get($recurl)->content;

     # Loop through the HTML, looking for matches.
     while ($content =~ m!<td colspan=2 width=100%>.*?detail/-/(.*?)/ref.*?<b>(.*?)</b>.*?by (.*?)\n.*?Average Customer Review&#58;.*?(.*?)out of 5 stars.*?<td colspan=3><hr noshade size=1></td>!mgis) {
         my ($asin,$title,$author,$rating) = ($1||'',$2||'',$3||'',$4||'');
         $title  =~ s!<.+?>!!g; # drop all HTML tags, cheaply.
         $rating =~ s!\n!!g;    # remove newlines from the rating.
         $rating =~ s! !!g;     # remove spaces from the rating.
         $title_list{$asin} = $title;    # store the title.
         $author_list{$asin} = $author;  # and the author.
         push (@asins, $asin);           # and the ASINs.
         push (@ratings, $rating);       # and the ... OK!
     }

     # See if there are more results. If so, continue the loop.
     if ($content =~ m!<a href=(.*?instant-recs.*?)>more results.*?</a>!i) {
        $recurl = "http://www.amazon.com$1"; # reassign the URL.
     } else { $done = 1; } # nope, we're done.
}

# Sort the results by highest star rating and print!
for (sort { $ratings[$b] <=> $ratings[$a] } 0..$#ratings) {
    next unless $asins[$_]; # skip el blancos.
    print "$title_list{$asins[$_]}  ($asins[$_])\n" . 
          "by $author_list{$asins[$_]} \n" .
          "$ratings[$_] stars.\n\n";
}

Running the Hack

Run the hack from the command line and send the results to another file, like this:

% perl get_recommendations.pl > top_rated_recommendations.txt

The text file top_rated_recommendations.txt should be filled with product recommendations, with the highest-rated items on top. You can tweak the URL in $recurl to look for DVDs, CDs, or other product types, by changing the books URL to the product line you’re interested in.
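
For example, to sort your DVD recommendations instead, point $recurl at the dvd catalog; this is simply the recommendations URL shown earlier with books swapped out:

# DVD recommendations instead of books.
my $recurl = "http://www.amazon.com/exec/obidos/tg/".
             "stores/recs/instant-recs/-/dvd/0/t";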

—Paul Bausch

Hack #57. Related Amazon.com Products with Alexa

Given any URL, Alexa will return traffic data, user ratings, and even related Amazon.com products. This hack creates a cloud of related product data for any given URL.

Alexa (http://www.alexa.com), an Amazon.com property, measures a web site’s traffic, then rates it for popularity based on other sites with similar topics. Along with these Related Links, you can also read and write reviews, as well as find similar products at Amazon.com. Some interesting scripts can be created simply by following through with the various information Alexa provides via its XML exports. For example, we can create a list of products recommended not only for a given web site, but also for web sites that are related to the original. Following those related web sites and obtaining their related Amazon.com products creates a cloud of items related to the original URL. In the following section, we’ll walk you through the code for one such cloud creator.

The Code

For this script, you’ll need an Amazon.com developer token, which can be obtained from http://www.amazon.com/webservices/. Save the following code to a file called alexa.pl:

#!/usr/bin/perl -w
use strict;
use URI;
use LWP::Simple;
use Net::Amazon;
use XML::Simple;
use constant AMAZON_TOKEN => 'your token here';
use constant DEBUG => 0;

# get our arguments. the first argument is the
# URL to fetch, and the second is the output.
my $url = shift || die "$0 <url> [<output>]\n";
my $output = shift || '/www/htdocs/cloud.html';

# we'll need to fetch the Alexa XML at some point, and
# we'll do it a few different times, so we create a 
# subroutine for it. Using the URI module, we can
# correctly encode a URL with a query. In fact, you'll
# notice the majority of this function is involved with
# this, and at the end we use LWP::Simple to actually
# download and return the XML.
#####################################################
sub fetch_xml {
    my $url = shift;
    $url = "http://$url" unless $url =~ m[^http://];
    warn "Fetching Alexa data for $url\n" if DEBUG;

    my @args = (
        cli => 10,     dat => 'snba',
        ver => '7.0',  url => $url,
    );

    my $base = 'http://data.alexa.com/data';
    my $uri = URI->new( $base );
    $uri->query_form( @args );
    $uri = $uri->as_string;

    return get( $uri );
}

# raw XML is no good for us, though, as we want to extract
# particular items of interest. we use XML::Simple to turn
# the XML into Perl data structures, because it's easier
# than fiddling with event handling (as with XML::Parser
# or XML::SAX), and we know there's only a small amount of
# data. we want the list of related sites and the list of
# related products. we extract and return both.
#####################################################
sub handle_xml {
    my $page = shift;
    my $xml = XMLin( $page );
    my @related = map {
        {
            asin => $_->{ASIN},
            title => $_->{TITLE},
            href => $xml->{RLS}{PREFIX}.$_->{HREF},
        }
    } @{ $xml->{RLS}{RL} };

    my @products;
    if (ref $xml->{SD}{AMZN}{PRODUCT} eq 'ARRAY') {
        @products = map { $_->{ASIN} } @{ $xml->{SD}{AMZN}{PRODUCT} };
    } else { @products = $xml->{SD}{AMZN}{PRODUCT}{ASIN}; }

    return ( \@related, \@products );
}

# Functions done; now for the program:
warn "Start URL is $url\n" if DEBUG;
my @products; # running accumulation of product ASINs

{
    my $page = fetch_xml( $url );
    my ($related, $new_products) = handle_xml( $page );
    @products = @$new_products; # running list

    for (@$related) {
        my $xml = fetch_xml( $_->{href} );
        my ($related, $new_products) = handle_xml( $xml );
        push @products, @$new_products;
    }
}

# We now have a list of products in @products, so
# we'd best do something with them. Let's look
# them up on Amazon and see what their titles are.
my $amazon = Net::Amazon->new( token => AMAZON_TOKEN );
my %products = map { $_ => undef } @products;

for my $asin ( sort keys %products ) {
    warn "Searching for $asin...\n" if DEBUG;
    my $response = $amazon->search( asin => $asin );
    my @products = $response->properties;
    die "ASIN is not unique!?" unless @products == 1;
    my $product = $products[0];
    $products{$asin} = {
        name => $product->ProductName,
        price => $product->OurPrice,
        asin => $asin,
    };
}

# Right. We now have name, price, and
# ASIN. Let's output an HTML report:
{
    umask 022;
    warn "Writing to $output\n" if DEBUG;
    open my $fh, '>', $output or die $!;
    print $fh "<html><head><title>Cloud around $url</title></head><body>";
    if (keys %products) {
        print $fh "<table>";
        for my $asin (sort keys %products) {
            my $data = $products{$asin};
            printf $fh "<tr><td>".
                       "<a href=\"http://amazon.com/exec/obidos/ASIN/%s\">".
                       "%s</a></td> <td>%s</td></tr>",
                       @{$data}{qw( asin name price )};
        }
        print $fh "</table>";
    }
    else { print $fh "No related products found.\n"; }
    print $fh "</body></html>\n";
}

Running the Hack

Run the script on the command line, passing it the URL you’re interested in and a filename to which you’d like the results saved (you can also hardcode a default output location into the script). The following output shows an example of the script’s DEBUG output turned on:

% perl alexa.pl http://www.gamegrene.com/ testing.html
Start URL is http://www.gamegrene.com/
Fetching Alexa data for http://www.gamegrene.com/
Fetching Alexa data for http://www.elvesontricycles.com/
Fetching Alexa data for http://www.chimeramag.com/
Fetching Alexa data for http://pages.infinit.net/raymondl
Fetching Alexa data for http://www.beyond-adventure.com/
Fetching Alexa data for http://strcat.com/News
Fetching Alexa data for http://members.aol.com/stocdred
Fetching Alexa data for http://lost-souls.hk.st/
Fetching Alexa data for http://www.gamerspulse.com/
Fetching Alexa data for http://www.gignews.com/
Fetching Alexa data for http://www.gamesfirst.com/
Searching for 0070120102...
Searching for 0070213631...
Searching for 0070464081...
Searching for 0070465886...
..etc..
Searching for 1879239027...
Writing to testing.html

Figure 4-4 shows an example of the resulting file.

Figure 4-4. Amazon.com’s related products for Gamegrene.com

Hacking the Hack

As the script stands, it requires manual running or a cron script [Hack #90] to regularly place the latest information on your own pages (if that’s your intent, of course). You might want to turn this into a CGI program and let people enter web sites of their own choice. This is pretty easy to do. If you’ve created an HTML form that accepts the desired web site in an input named url, like this:

<form method="GET" action="alexa.pl">
URL: <input type="text" name="url" />
</form>

then modifying your script to accept this value means changing this:

# get our arguments. the first argument is the
# URL to fetch, and the second is the output.
my $url = shift || die "$0 <url> [<output>]\n";
my $output = shift || '/www/htdocs/cloud.html';

to this:

use LWP::Simple qw(!head);
use CGI qw/:standard/;
my $url = param('url');

and changing the output from a filename from this:

warn "Writing to $output\n" if DEBUG;
open my $fh, '>', $output or die $!;

to the waiting web browser:

my $fh = *STDOUT; # redirect.
print $fh "Content-type: text/html\n\n";

Be sure to remove the extraneous use LWP::Simple; line at the beginning of the script. Since both CGI and LWP::Simple have a function named head, you’ll get a number of warning messages about redefinitions, unless you change the way LWP::Simple has been imported. By telling it not to import its own unnecessary head function, our new code circumvents these warnings.

—Iain Truskett

Hack #58. Scraping Alexa’s Competitive Data with Java

Alexa tracks the browsing habits of its millions of users daily. This hack allows you to aggregate the traffic statistics of multiple web properties into one RSS file, with subscriptions available daily.

Alexa (http://www.alexa.com) recently launched a section of its web site, detailing the observed traffic of its millions of users on a daily basis. Using this freely available data, you can track the traffic of your site, or your competitors’ sites, over time. We’ll scrape this traffic data into an RSS file [Hack #94] for your consumption.

The Code

The hack consists of five Java classes, each designed to handle different aspects of downloading, parsing, and presenting Alexa’s traffic content. The full code can be downloaded from this book’s web site (http://www.oreilly.com/catalog/spiderhks/).

The primary class of our Java application (Report) allows you to pass a URL to Alexa’s web site for every domain you’re interested in tracking. The appropriate Alexa page is downloaded, and its content is parsed for the key bits of daily data. Once this data is organized, we will need to mark it up for presentation and, finally, write the presentable file to disk.

The first step (Website) streams the source into your computer’s memory. We eliminate everything but the body of the page, since this is where all of our data lies.

Now that we have the page’s source stored in memory, we need to identify the key data components within the myriad lines of HTML. Alexa’s pages do not conform to strict XML, so string parsing (the Parse class) is our best and quickest route of attack.

We will navigate through the page’s source code in serial, pulling the data we need and leaving a marker on our trail to speed up our search. Key phrases of text need to be identified in close vicinity to our key data so that we can consistently pull the correct data, regardless of the size of a web property.

Now that we have all our data, we need somewhere to store our findings for use across multiple classes. We create an entity bean-style data object to store each of the key pieces of data. Our code for doing so is in TrafficBean.

Finally, we present our findings to whomever might be interested through an RSS file (RSSWriter). By default, the RSS file is saved to the current user’s home directory (C:\Documents And Settings\$user on Microsoft Windows platforms, or /home/$user on most versions of Unix). It is assumed that you have sufficient write permissions within your home directory to perform this action.

Running the Hack

The only external library required is Apache’s Xerces for Java (http://xml.apache.org/xerces2-j/). Web property names should be hardcoded in the Report class to allow for consistent scheduled runs. You can pass domain strings in the format of site.tld at runtime or, if no parameters are found, the code will iterate through a previously created string array. You might also want to set yourself up with an RSS aggregator if you do not already have one. I use FeedDemon (http://www.bradsoft.com/feeddemon/index.asp) for Windows.

Hacking the Hack

Possibilities abound:

  • Set up a cron script [Hack #91] on your machine to generate a new report every evening.

  • Using the percentage numbers from the returned subdomains, calculate the total reach and views for each of the domains within the web property.

  • Hook your findings into a database for larger comparison sets over time.

  • Using the RSS file as your data source, create a time series graph [Hack #62]. Use views or ranges as your y-axis and time as your x-axis. Overlay all of your sites using different colors and save for use in reports.

—Niall Kennedy

Hack #59. Finding Album Information with FreeDB and Amazon.com

By combining identifying information from one database with related information from another, you can create powerful applications with little effort.

Although using an MP3 collection to turn your computer into a jukebox might be all the rage these days, some of us are still listening to audio CDs. And, thanks to the FreeDB project (http://www.freedb.org) and the original CDDB before it, we can identify CDs based on their contents and look up information such as artist and the names of tracks. Once we have that information, we can try looking up more from other sources.

With the help of the Amazon.com API (http://www.amazon.com/webservices/), we can find things like cover art, other albums by the same artist, and release dates of albums. If we put this all together, we can come up with a pretty decent Now Playing display for what we’re listening to.

Getting Started

So, this is what we want our script to do:

  • Calculate a disc ID for the current CD.

  • Perform a search on FreeDB for details on the CD.

  • Use the FreeDB record data to perform a search at Amazon.com.

  • Get information from the Amazon.com results for the current album.

  • Collect details on other albums from the same artist.

  • Construct an HTML page to display all the results.

To get this hack started, let’s sketch out the overall flow of our script:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;

# Settings for our Amazon developer account
our $amazon_affilate_id = "your affiliate ID, if any";
our $amazon_api_key     = "your amazon api key";

# Location of a FreeDB mirror web interface
our $freedb_url  = 'http://freedb.freedb.org/~cddb/cddb.cgi';

# Get the discid of the current CD
my $discid = get_discid(  );

# Search for the CD details on FreeDB
my $cd_info = freedb_search($discid);

# Given the artist, look for music on Amazon
my @amazon_rec = amazon_music_search($cd_info->{artist});

# Try to match the FreeDB title up
# with Amazon to find current playing.
my $curr_rec = undef;
my @other_recs = (  );
for my $rec (@amazon_rec) {
  if ( !defined $curr_rec && $cd_info->{title} eq $rec->{title} ) {
    $curr_rec = $rec;
  } else {
    push @other_recs, $rec;
  }
}

print html_template({current=>$curr_rec, others=>\@other_recs});

Note that we’ve set up a few overall configuration variables, such as our Amazon.com affiliate ID and a key for use with calls to the API. You’ll want to check out the documentation for Amazon.com Web Services and sign up for a developer token. This allows Amazon.com to identify one consumer of their services from another. Now we have the overall flow of the script, so let’s work out the implementation of the functions we’re calling.

Checking Your Disc ID

The first part of our hack is a little tricky, and it depends a lot on your operating system. To perform a search on FreeDB, we first need to identify the current CD, and that requires access to the CD device itself. This is fairly easy to do under Linux and Mac OS X; other environments will require more homework.

For Linux and Mac OS X, we can use a small program called cd-discid (http://lly.org/~rcw/cd-discid/). If you happen to be using Debian Linux, you can install the cd-discid package using apt-get. If you’re on Mac OS X and have Fink (http://fink.sourceforge.net) installed, use fink install cd-discid. If neither of these things apply to you, don’t worry, we can skip this step and use a hardcoded disc ID to see how the script works, at least.

Once the program is installed, we can use this function under Linux:

sub get_discid {
  # For Linux
  my $cd_discid = '/usr/local/bin/cd-discid';
  my $cd_dev    = '/dev/cdrom';
  return `$cd_discid $cd_dev`;
}

Basically, this calls the disc ID program using /dev/cdrom as the device containing the audio CD to be identified. You might need to adjust the path to both the program and the CD device in this function.

If you’re using Mac OS X, then this implementation should work for you:

sub get_discid {
  # For Mac OS X
  my $cd_discid = '/sw/bin/cd-discid';
  my ($cd_dev)  = '/dev/'.
    join '', map { /= "(.*?)"$/ }
      grep { /"BSD Name"/ }
        split(/\n/, `ioreg -w 0 -c IOCDMedia`);
  return `$cd_discid $cd_dev`;
}

This looks kind of tricky, but it uses a utility called ioreg, which lists I/O devices registered with the system. We check for devices in which CD media is currently inserted and do some filtering and scraping to discover the BSD Unix device name for the appropriate device. It’s dirty, but it works well.

However, if none of this works for you (either because you’re using a Windows machine, or else had installation problems with the source code), you can opt to use a canned disc ID in order to explore the rest of this hack:

sub get_discid {
  # If all else fails... use Weird Al's "Alapalooza"
  return "a60a840c+12 150 17795 37657 54225 72617 87907 106037 ".
    "125857 141985 164055 165660 185605 2694";
}

Digging Up the FreeDB Details

Once we have a disc ID, we can make a query against the FreeDB web service. From there, we should be able to get the name of the artist, as well as the album title and a list of track titles. Usage of the FreeDB web service is described at:

http://www.freedb.org/modules.php?name=Sections&sop=viewarticle&artid=28

under Addendum B, “CDDBP under HTTP.”

Let’s start implementing the FreeDB search by making a call to the web service:

sub freedb_search {
  my $discid = shift;

  # Get the discid for the current
  # CD and make a FreeDB query with it.
  $discid =~ s/ /+/g;
  my $disc_query = get("$freedb_url?cmd=cddb+query+$discid&".
                       "hello=joe_random+www.asdf.com+freebot+2.1&proto=1");
  my ($code, $cat, $id, @rest) = split(/ /, $disc_query);

The first thing we do is escape the spaces in the disc ID for use in the URL used to request a query on the FreeDB web service. Then, we request the URL. In response to the request, we get a status code, along with a category and record ID. We can use this category and record ID to look up the details for our audio CD:

  # Using the results of the discid query, look up the CD's details.
  # Create a hash from the name/value pairs in the detail response.
  # (Note that we clean up EOF characters in the data.)
  my %freedb_data =
    map { s/\r//; /(.*)=(.*)/ }
      split(/\n/,
            get("$freedb_url?cmd=cddb+read+$cat+$id&".
                "hello=deusx+www.decafbad.com+freebot+2.1&proto=1"));

The result of the FreeDB read request gives us a set of name/value pairs, one per line. So, we can split the result of the query by lines and use a regular expression on each to extract the name/value pairs and place them directly into a hash. However, as we receive it, the data is not quite as convenient to handle as it could be, so we can rearrange and restructure things before returning the results:

  # Rework the FreeDB result data into
  # a more easily handled structure.
  my %disc_info = ( );

  # Artist and title are separated by ' / ' in DTITLE.
  ($disc_info{artist}, $disc_info{title}) =
  split(/ \/ /, $freedb_data{DTITLE});

  # Extract series of tracks from
  # TTITLE0..TTITLEn; stop at
  # first empty title.
  my @tracks = (  );
  my $track_no = 0;
  while ($freedb_data{"TTITLE$track_no"}) {
    push @tracks, $freedb_data{"TTITLE$track_no"};
    $track_no++;
  }
  $disc_info{tracks} = \@tracks;

  return \%disc_info;
}

With this, we convert a flat set of cumbersome name/value pairs into a more flexible Perl data structure. Artist name and album title are accessible via artist and title keys in the structure, respectively, and track names are available as an array reference under the tracks key.
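
To make the shape of that structure concrete, here is roughly what freedb_search(  ) hands back (the values are placeholders, not real FreeDB data):

# A sketch of the data structure returned by freedb_search().
my $cd_info = {
    artist => 'Some Artist',
    title  => 'Some Album',
    tracks => [ 'Opening Track', 'Second Track', 'Closing Track' ],
};

print "Now playing: $cd_info->{title} by $cd_info->{artist}\n";
print "  $_\n" for @{ $cd_info->{tracks} };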

Rocking with Amazon.com

The next thing our script needs is the ability to search Amazon.com for products by a given artist. Luckily, Amazon.com’s Web Services produce clean XML, so it won’t be too hard to extract what we need from the data, even without using a full XML parser.

But first, we’ll need a couple of convenience functions added to our script:

sub trim_space {
  my $val = shift;
  $val=~s/^\s+//;
  $val=~s/\s+$//g;
  return $val;
}

sub clean_name {
  my $name = shift;
  $name=lc($name);
  $name=trim_space($name);
  $name=~s/[^a-z0-9 ]//g;
  $name=~s/ /_/g;
  return $name;
}

The first function trims whitespace from the ends of a string, and the second cleans up a string to ensure that it contains only lowercase alphanumeric characters and underscores. This last function is used to make fairly uniform hash keys in data structures.
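
A quick example of what these helpers do, assuming the two functions above are loaded (the input strings are arbitrary):

print clean_name('ProductName'), "\n";     # prints "productname"
print clean_name(' Release Date '), "\n";  # prints "release_date"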

Next, we can implement our Amazon.com Web Services (http://www.amazon.com/gp/aws/landing.html) searching code:

# Search for authors via the Amazon search API.
sub amazon_music_search {
  my ($artist) = @_;
  $artist =~ s/[^A-Za-z0-9 ]/ /g;

  # Construct the base URL for Amazon author searches.
  my $base_url = "http://xml.amazon.com/onca/xml3?t=$amazon_affilate_id&".
    "dev-t=$amazon_api_key&mode=music&type=lite&f=xml".
      "&ArtistSearch=$artist";

The first thing we do is take the artist name as a parameter and try to clean up all characters that aren’t alphanumeric or spaces. Then, we construct the URL to query the web service, as described in the documentation from the Amazon.com software development kit.

Next, we start to get the results of our search. Queries on Amazon.com’s Web Services return results a handful at a time across several pages; so, if we want to gather all the results, we’ll first need to figure out how many total pages there are. Luckily, this is a part of every page of results, so we can grab the first page and extract this information with a simple regular expression:

  # Get the first page of search results.
  my $content = get($base_url."&page=1");

  # Find the total number of search results pages to be processed.
  $content =~ m{<totalpages>(.*?)</totalpages>}mgis;
  my ($totalpages) = ($1||'1');

After getting the total number of pages, we can start gathering the rest of the pages into an array, starting with the first page we have already downloaded. We can do this with a quick Perl expression that maps the page numbers to page requests, the results of which are added to the array. Notice that we also sleep for a second in between requests, as per the instructions in the Amazon.com Web Services license:

  # Grab all pages of search results.
  my @search_pages = ($content);
  if ($totalpages > 1) {
    push @search_pages,
      map { sleep(1); get($base_url."&page=$_") } (2..$totalpages);
  }

Now that we have all the pages of the results, we can process them all and extract data for each album found. Details for each item are, appropriately enough, found as children of a tag named details. We can extract these children from each occurrence of the details tag using a regular expression. We can also grab the URL to the item detail page from an attribute named url:

  # Extract data for all the records
  # found in the search results.
  my @records;
  for my $content (@search_pages) {

    # Grab the content of all <details> tags
    while ($content [RETURN]
        =~ m{<details(?!s) url="(.*?)".*?>(.*?)</details>}mgis) {
      # Extract the URL attribute and tag body content.
      my($url, $details_content) = ($1||'', $2||'');

After extracting the child tags for a detail record, we can build a Perl hash from child tag names and their content values, using another relatively simple regular expression and our convenience functions:

      # Extract all the tags from the detail record, using
      # tag name as hash key and tag contents as value.
      my %record = (_type=>'amazon', url=>$url);
      while ($details_content =~ m{<(.*?)>(.*?)</\1>}mgis) {
        my ($name, $val) = ($1||'', $2||'');
        $record{clean_name($name)} = $val;
      }

However, not all of the child tags of details are flat tags. In particular, the names of artists for an album are child tags. So, with one more regular expression and a map function, we can further process these child tags into a list. We can also rename productname to title, for more intuitive use later:

      # Further process the artists list to extract author
      # names, and standardize on product name as title.
      my $artists = $record{artists} || '';
      $record{artists} =
        [ map { $_ } ( $artists =~ m{<artist>(.*?)</artist>}mgis ) ];
      $record{title} = $record{productname};

      push @records, \%record;
    }
  }
  return @records;
}

So, with a few web requests and less than a handful of regular expressions, we can search for and harvest a pile of records on albums found at Amazon.com for a given artist.

Presenting the Results

At this point, we can identify a CD, look up its details in FreeDB, and search for albums at Amazon.com. The last thing our main program does is combine all these functions, determine which Amazon.com product is the current album, and feed it and the rest of the albums to a function to prepare an HTML page with the results.
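Here is a minimal sketch of that glue, assuming the FreeDB lookup shown earlier is wrapped in a function called get_disc_info and that the page is written to nowplaying.html (both names are made up for this example); amazon_music_search, clean_name, and html_template are the functions from this hack:

# Sketch only: get_disc_info() and the output filename are assumptions.
my $disc   = get_disc_info();
my @albums = amazon_music_search($disc->{artist});

# Treat the album whose cleaned-up title matches the FreeDB title as
# "now playing"; everything else goes in the "also by this artist" list.
my ($current, @others);
for my $album (@albums) {
  if (!$current and clean_name($album->{title}) eq clean_name($disc->{title})) {
    $current = $album;
  } else {
    push @others, $album;
  }
}
# (A real script would want a fallback when no Amazon.com match is found.)

open(OUT, ">nowplaying.html") or die "Can't write page: $!\n";
print OUT html_template({ current => $current, others => \@others });
close(OUT);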

Now, we can implement the construction of that page:

sub html_template {
  my $vars = shift;

  my $out = '';

  $out .= qq^
    <html>
      <head><title>Now Playing</title></head>
      <body>
        <div align="center">
          <h1>Now playing:</h1>
  ^;
  $out .= format_album($vars->{current}, 1);
  $out .= qq^
          <h1>Also by this artist:</h1>
          <table border="1" cellspacing="0" cellpadding="8">
  ^;

This code begins an HTML page, using a function we’ll implement in a minute, which produces a display of an album with title and cover art. Next, we can put together a table that shows the rest of the related albums from this artist. We create a table showing smaller cover art, with three columns per row:

  my $col = 0;
  my $row = '';
  for my $rec (@{$vars->{others}}) {
    $row .= '<td align="center" width="33%">';
    $row .= format_album($rec, 0);
    $row .= "</td>\n";
    $col++;
    if (($col % 3) == 0) {
      $out .= "<tr>\n$row\n</tr>\n";
      $row = '';
    }
  }

  # Don't drop a partially filled final row.
  $out .= "<tr>\n$row\n</tr>\n" if $row;

Finally, we close up the table and finish off the page:

  $out .= qq^
          </table>
        </div>
    </body></html>
  ^;

  return $out;
}

The last thing we need is a function to prepare HTML to display an album:

sub format_album {
  my ($rec, $large) = @_;

  my $out = '';

  my $img = ($large) ? 'imageurllarge' : 'imageurlmedium';

  $out .= qq^<a href="$rec->{url}"><img src="$rec->{$img}"/></a><br/>^;
  $out .= qq^<b><a href="$rec->{url}">$rec->{title}</a></b><br />^;

  if (defined $rec->{releasedate}) {
    $out .= qq^Released: $rec->{releasedate}^;
  }

  if (ref($rec->{artists}) eq 'ARRAY') {
    $out .= '<br />by <b>'.join(', ', @{$rec->{artists}}).'</b>';
  }

  return $out;
}

This function produces several lines of HTML. The first line displays an album’s cover art in one of two sizes, based on the second parameter. The second line displays the album’s title, linked to its detail page at Amazon.com. The third line shows when the album was released, if we have this information, and the final line lists the artist who created the album.

Hacking the Hack

With this script and the help of FreeDB and Amazon.com, we can go from a CD in the tray to an HTML Now Playing display. This could be integrated into a CD player application and improved in any number of ways:

  • The handful of regular expressions used to parse Amazon.com’s XML is mostly adequate, but a proper XML parser, like XML::Simple, would be better; see the sketch after this list.

  • Errors and unidentified CDs are not handled very well.

  • Other web services could be pulled in to further use the harvested CD data.
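For instance, here is a rough sketch of how the details records might be pulled out with XML::Simple instead of regular expressions. The CamelCase tag names (Details, ProductName, Artists/Artist) are how they appear in the raw lite XML feed; treat this as a starting point rather than a drop-in replacement:

use XML::Simple;

# Parse one page of search results ($content, as in the loop above);
# ForceArray keeps single results from collapsing out of their arrays.
my $results = XMLin($content,
                    ForceArray => [ 'Details', 'Artist' ],
                    KeyAttr    => [ ]);

for my $details (@{ $results->{Details} || [] }) {
  my %record = ( _type   => 'amazon',
                 url     => $details->{url},
                 title   => $details->{ProductName},
                 artists => $details->{Artists}{Artist} || [ ] );
  # ...push \%record onto the list of results, as before.
}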

Maybe someday, something like this could be enhanced with the ability to automatically purchase and download music from an online store, to grab albums you don’t yet have in order to expand your jukebox even further. Powerful things happen when one simple tool can be easily chained to another.

—l.m.orchard

Hack #60. Expanding Your Musical Tastes

Looking for new music to complement your stale collection? With this script, you can pass in the names of a few of your favorite artists and get back Audioscrobbler’s recommendations.

You’ve downloaded every album by your favorite artist, even the B-sides. Maybe your playlist of 3,000 songs is starting to get stale. For whatever reason, you’ve decided it is time to find new music to fall in love with. Downloading songs off Limewire (http://www.limewire.com) with “GET THIS” in the filename, only to find out it’s the ramblings of a broken bagpipe, is hit or miss at best. Wouldn’t it be great to see what other people, who tend to like the same music you do, are listening to?

Audioscrobbler (http://www.audioscrobbler.com) has a great solution: it accepts playlist information from its users about what they listen to and how often. From there, Audioscrobbler associates artists with each other based on how often users listen to them. We are going to use a script to access the Audioscrobbler web site and retrieve a list of artists with a correlation factor—how closely related that artist is to the artists you submit.

First, we need to see how this is traditionally done at the Audioscrobbler web site. A quick check of the site reveals a Related Artists link that takes you to a form where you can type in three artists and get a listing of matches. Since this is exactly what we need, let’s take a look at the code that runs the form. Looking at the HTML source, you’ll find the form near the bottom of the page. It’s a GET request with some predefined variables and our three input boxes, named a1, a2, and a3. If you go back to the page, fill in some artists, and click Do It, you’ll get a page of results. Take a second to look at that page’s URL. The first thing to notice is that it is quite long; this is where the form’s GET method comes in, because a GET request means that all the information for the page is submitted within the URL itself. Using this knowledge, we can construct our own URLs with the artists’ names in them to retrieve the results, as in the example below.
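In other words, a hand-built request for two artists ends up looking something like this (wrapped here for readability, with spaces URL-encoded; the base URL and parameter names come straight from the form):

http://www.audioscrobbler.com/modules.php?op=modload&name=top10
    &file=scrobblersets&a1=Aphex+Twin&a2=Autechre&a3=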

Once we have the results, we need to figure out how to get the data we need. Back to the HTML source. This time, again near the bottom, we find a single, extremely long line of HTML. Searching through it, you can see there is a simple format: a link for the artist’s name, td tags, and two img tags. We’ll use the width attribute for the second img tag to find the correlation. It just so happens that the width value is always a number between 1 and 300; this value determines the length of the pretty image on the page to the right of each artist.

The Code

Save the following code to a file called audioscrobble.pl:

#!/usr/bin/perl -w
#
# AudioScrobble - Finds artists similar to those you already like.
# Comments, suggestions, contempt? Email adam@bregenzer.net.
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#

use strict; $|++;
my $VERSION = "1.0";

# make sure we have the modules we need, else die peacefully.
eval("use LWP 5.6.9;"); die "[err] LWP 5.6.9 or greater required.\n" if $@;

# base URL for all requests
my $base_url = "http://www.audioscrobbler.com/modules.php?".
               "op=modload&name=top10&file=scrobblersets";

my $counter = 0;         # counter of artists displayed
my $max_count = 10;      # maximum number of artists to display
my ($a1, $a2, $a3) = ''; # artist input variables

# Reminder: this code checks for arguments, therefore if a band
# name has multiple words make sure you put it in quotes.
# Also, Audioscrobbler accepts at most three band names so we
# will only look at the first three arguments.
$a1 = $ARGV[0] || die "No artists passed!\n";
$a2 = $ARGV[1] || ""; $a3 = $ARGV[2] || "";

# create a downloader, faking the User-Agent to get past filters.
print "Retrieving data for your matches... ";
my $ua = LWP::UserAgent->new(agent => 'Mozilla/4.76 [en] (Win98; U)');
my $data = $ua->get("$base_url&a1=$a1&a2=$a2&a3=$a3")->content;
print "done.\n";

# print up a nice header.
print "Correlation\tArtist\n";
print "-" x 76, "\n";

# match on the URL before the artist's name through to
# the width of the bar image (to determine correlation).
while ($counter < $max_count && $data =~ /href="modules\.php\[RETURN]
?op=modload&name=top10&file=artistinfo&artist=[^"]+">([^<]+)<\/a>[^<]+<\[RETURN]
/td><td[^>]+><img[^>]+\/><img[^>]+width="([0-9]+)">(.*)/) {

    # print the correlation factor and the artist's name.
    printf "%1.2f", ($2 / 300); print "\t\t" . $1 . "\n";

    # continue with the
    # data that is left.
    $data = $3; $counter++;
}

if ($counter == 0) {print "No matches.\n";}
print "-" x 76, "\n";

Running the Hack

Invoke the script on the command line, passing it up to three artists you like. Make sure you put their names in quotes; you do not need to worry about capitalization. Audioscrobbler cannot handle more than three artists at a time; in fact, you might find that three artists is too many for it (i.e., you might not get any results). In such cases, try removing the last artist and running it again. It is also important to keep in mind that you will get better results if you list artists that are similar to each other. This is a proximity search, so listing a heavy metal, country, and classical artist in the same search is unlikely to return any results.

Appropriately prepared, venture forth into the world of new music and find your next favorite artists. Here is an example in which I find artists similar to Aphex Twin and Autechre:

% perl audioscrobble.pl "Aphex Twin" "Autechre"
Retrieving data for your matches... done.
Correlation     Artist
--------------------------------------------------------------------------
1.00            Boards Of Canada
1.00            Plaid
0.83            Underworld
0.83            Radiohead
0.83            Chemical Brothers
0.83            Orbital
0.67            Mu-Ziq
0.67            Led Zeppelin
0.67            AFX
0.67            Squarepusher

Hacking the Hack

There are a few ways you can improve upon this hack.

Changing the number of results returned

You can easily change the number of results by changing the hardcoded $max_count value to a different number. However, we are looking for something more elegant. If you add the following code above the comment that starts with #Reminder, you will be able to pass an argument to the script specifying the number of results to return:

# Check for a '-c' argument first
# specifying the number of
# results to return.
if (@ARGV and $ARGV[0] eq '-c') {
    shift @ARGV;
    $max_count = shift @ARGV;
}

And here is the requisite sample output:

% perl audioscrobble.pl -c 5 "Aphex Twin" "Autechre"
Retrieving data...done.
Correlation     Artist
--------------------------------------------------------------------------
1.00            Boards Of Canada
1.00            Plaid
0.83            Underworld
0.83            Radiohead
0.83            Chemical Brothers

If you plan on adding a number of command-line arguments, you might want to use a Perl module designed for the job: Getopt::Long. You can find various examples of its use within other hacks in this book.
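As a rough sketch, the hand-rolled check above could be replaced with something like this near the top of the script, after $max_count is declared (the --count option name is our own invention):

use Getopt::Long;

# Allow "--count 5" (or "-c 5") on the command line; the artist
# names remain in @ARGV afterwards.
GetOptions("count|c=i" => \$max_count)
    or die "usage: $0 [--count n] artist [artist] [artist]\n";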

Looking up artists

Now that you have a list of new artists, the next step is to have the script research these new artists and download sample songs, customer ratings, and so on. Code away, young grasshopper.

See Also

  • There are other sites available that aggregate and associate artist playlists, the most promising of late being EchoCloud [Hack #20]. As opposed to Audioscrobbler, which is an opt-in service, EchoCloud works by spidering P2P networks, such as Soulseek, for relevant information.

—Adam Bregenzer

Hack #61. Saving Daily Horoscopes to Your iPod

You’ve got a zillion songs on your new iPod, and you’re traveling around town oblivious to the sounds of the city. Worried about getting hit by a car, finding that special someone, or knowing when to ask for that raise? Take your horoscope along with you by running this hack daily.

With Apple’s newest iPods, the functionality you can bring along with you has greatly improved. Not only can you sync up your iCal calendars or Address Book entries, you can also include little snippets of text in the new Notes feature. Limited to 4 KB per note and 1,000 notes, there’s certainly room for improvement, but the ability to add your own navigational elements (either to other Notes or to songs and playlists) and paragraph styling (via HTML’s <P> and <BR> tags) is a good start for some interesting applications.

This isn’t to say that this hack is particularly interesting or useful, but it does give an example of programmatically determining the path to the currently mounted iPod via Perl. It’s not foolproof, though; if you’re rich enough to have more than one iPod mounted at the same time, then one will be chosen at random. Dealing with more than one iPod is an exercise you can pay someone else to do.

The Code

Save the following code to a file called horopod.pl:

#!/usr/bin/perl -w
#
# HoroPod - save your daily horoscope to the iPod.
# http://disobey.com/d/code/ or contact morbus@disobey.com.
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#

use strict; $|++;
my $VERSION = "1.0";
use File::Spec::Functions;

# make sure we have the modules we need, else die peacefully.
eval("use LWP;"); die "[err] LWP is not installed.\n" if $@;

# really cheap Perl-only way of finding the path to
# the currently mounted iPod. searches the mounted
# Volumes for an iPod_Control folder and uses that.
my $ipod = glob("/Volumes/*/iPod_Control");
unless ($ipod) { die "[err] Could not find an iPod: $!\n"; }
$ipod =~ s/iPod_Control//g;  # we want one directory higher.
my $ipod_dir = catdir($ipod, "Notes", "Horoscopes");
mkdir $ipod_dir;  # no error checking by intention.

# create a downloader, faking the User-Agent to get past filters.
my $ua = LWP::UserAgent->new(agent => 'Mozilla/4.76 [en] (Win98; U)');

# now, load up our horoscopes. first, define all the
# signs - these are used throughout the forloop.
my @signs = qw( aries taurus gemini cancer leo virgo libra
                scorpio sagittarius capricorn aquarius pisces );

# loop through each sign.
foreach my $sign (@signs) {

    # make it purdier for humans.
    my $display_sign = ucfirst($sign);

    # the Yahoo! URL, specific to the current sign.
    print "Grabbing horoscope for $display_sign...\n";
    my $url = "http://astrology.yahoo.com/us/astrology/".
                "today/$sign"."dailyhoroscope.html";

    # suck down the data or die.
    my $data = $ua->get($url)->content
      or die "[err] Could not download any data: $!\n";

    # snag the date by signature, not design.
    $data =~ /(\w{3} \w{3}\.? \d{1,2}, \d{4})/; my $date = $1;

    # and get the relevance. we could use an
    # HTML parser, but this is mindlessly easier.
    my $preface = '<font face="Arial" size="-1" color=black>';
    my $anteface = '</font></TD></TR></table>'; # ante up!
    $data =~ /$preface(.*)$anteface/i; my $proverb = $1;

    # save this proverb to our file.
    my $ipod_file = catfile($ipod_dir, $display_sign);
    open(IPOD_FILE, ">$ipod_file") or die "[err] Could not open file: $!\n";
    print IPOD_FILE "$display_sign\n$date\n\n";
    print IPOD_FILE "$proverb\n"; close(IPOD_FILE);

}

Running the Hack

To run the hack, make sure your iPod is mounted as a FireWire HD (i.e., you can see it on your Desktop when it’s plugged into your Mac or docking bay), launch the Terminal application (Applications → Utilities → Terminal), and type perl horopod.pl on the command line. After a few lines of output, your newly scraped horoscopes should be on your iPod as a Horoscopes folder under the Notes feature, with one file, or note, per sign.

Hacking the Hack

There are a few ways you can tweak the script, all mindlessly simple. For one, you might not want your horoscopes under a directory called Horoscopes; to change that behavior, merely tweak my $ipod_dir = catdir($ipod, "Notes", "Horoscopes"); to your desired path.

Concerning the source data itself, Yahoo! Horoscopes has a number of different sorts of predictions available; you can get them tweaked to Music, Movies, Romance, and what have you. Be sure to check out http://astrology.yahoo.com/ to find your desired version. When tweaking the script to support another type, you’ll want to tweak the URL being used, making sure to place $sign where appropriate:

my $url = "http://astrology.yahoo.com/us/astrology/".
                "today/$sign"."dailyhoroscope.html";

Also, tweak the $preface and $anteface of the code surrounding the actual fortune:

my $preface = '<font face="Arial" size="-1" color=black>';
my $anteface = '</font></TD></TR></table>'; # ante up!

Alternatively, you could scrap the entire horoscope feature and combine in another of the data scrubbers within this book, utilizing only the iPod-related code in this hack.
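For example, the iPod-locating logic boils down to a few reusable lines; here is a sketch of it pulled into its own function (the name ipod_notes_dir is made up):

use File::Spec::Functions;

# Return the path to the Notes folder on the first mounted iPod,
# or die if none is found.
sub ipod_notes_dir {
    my ($ipod) = glob("/Volumes/*/iPod_Control")
        or die "[err] Could not find an iPod\n";
    $ipod =~ s/iPod_Control//;   # we want one directory higher.
    return catdir($ipod, "Notes");
}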

See Also

  • VersionTracker (http://www.versiontracker.com) for other iPod utilities, including Pod2Go (for weather, stocks, and more); PodNotes (for the latest news and driving directions); and VoodooPad (for Wiki-like Notes editing).

Hack #62. Graphing Data with RRDTOOL

Graphing data over time, either by itself or in comparison with another dataset, is the Holy Grail of analytical research. With the use of RRDTOOL, you’ll be able to store and display time-series data.

In this hack, we’re going to get some example data from Amazon.com and use the Round Robin Database Tool (RRDTOOL, http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/) to graph changes in Amazon.com Sales Rank over time.

Round robin is a way of storing a fixed amount of data and a pointer to the current element. This is much like a cyclic buffer with a fixed number of slots for data, where adding a new element pushes out the oldest to make space. This is a nice feature, because you never have to worry about using all your disk space or clearing out old data. The downside is that you have to decide the time period up front. This hack assumes you have RRDTOOL installed as per the online instructions.

First, let’s create a database to log an Amazon.com Sales Rank for a month:

% rrdtool create salesrank.rrd --start 1057241523  --step 86400 
              DS:rank:GAUGE:86400:1:U  RRA:AVERAGE:0.5:1:31  RRA:AVERAGE:0.5:7:10

We have now created a database called salesrank.rrd, starting when this was written, adding new data every 24 hours, and keeping two round robin datasets. There are numerous settings when creating a database, many more than we can hope to explain here. To give you a feel for it, we’ll just briefly explain the settings we used in this hack:

--start 1057241523 --step 86400

Defines when the time series starts, using Unix timestamps. Executing date +%s gives you the current time in the necessary format (number of seconds since the Epoch). Setting the number to 86400 for step defines the time in seconds between our data points. We arrive at that number with the following equation: 24 × 60 × 60 = 86400—or, 24 hours of 60 minutes each and each minute containing 60 seconds. In this case, we’re graphing one bit of data per day, every day, starting now.

DS:rank:GAUGE:86400:1:U

DS defines a data source: rank is its name, and GAUGE is used when we’re interested in the absolute number itself rather than in a rate of change. The 86400 is the heartbeat, the maximum number of seconds allowed between updates before the value is treated as unknown. We set the lower bound of the scale to 1, because we know that the best possible Sales Rank is 1, and the upper bound to unlimited (U), because we don’t know how many products Amazon.com has; therefore, we can’t know how badly ranked our book will be, and thus the need for unlimited.

RRA:AVERAGE:0.5:1:31
RRA:AVERAGE:0.5:7:10

Here, we define our two round robin databases, the first keeping daily numbers and running for a total of 31 days, the second running weekly numbers (7 days) for a total of 10 weeks.

Now that we have the database created, it is time to start filling in some numbers by using the rrdtool update command:

% rrdtool update salesrank.rrd 1057241524:3689
% rrdtool update salesrank.rrd 1057327924:3629
...etc...
% rrdtool update salesrank.rrd 1059833523:2900

The numbers are in the format of timestamp:value, which, in this case, indicates a Sales Rank of 3689 for the first entry and 3629 for the next entry 24 hours later. The rule is that every update should be at least one second after the previous entry. With a total of 31 data points (not all are shown in the example), we now have something to display. To get textual results, we can use the fetch feature of rrdtool:

% rrdtool fetch salesrank.rrd AVERAGE --start 1057241524 --end 1059833524
1057190400: nan
1057276800: 3.6290017008e+03
1057363200: 3.6094016667e+03
...etc...

It’s not very pretty to look at, but it’s essentially the same as when we entered the data with timestamp:value. These are calculated numbers, so they are not exactly the same as those we entered. But (finally!) on to where this whole hack started: drawing graphs based on time-series data:

% rrdtool graph osxhacks.png --start 1057241524 --end 1059833524
              --imgformat PNG --units-exponent 0 DEF:myrank=salesrank.rrd:rank:AVERAGE  
              LINE1:myrank#FF0000:"Mac OS X Hacks"

This code produces the graph shown in Figure 4-5.

Figure 4-5. Graph of the Amazon.com Sales Rank for Mac OS X Hacks

There’s an almost never-ending list of settings when displaying the graphs, which would be impossible to cover here. Most notable in our previous command is that we get the rank parameter out of our database and graph it in red with the legend “Mac OS X Hacks.” Other than that, we ask for files in PNG format and tell the graph not to do any scaling on the y-axis.

Doing this by hand on a regular basis would be incredibly tedious at best. cron and Perl to the rescue! First, we’ll create a Perl script that sucks down the Amazon.com product we’re interested in, and then we’ll capture the Sales Rank with a simple regular expression. This captured data, as well as the current timestamp, will be used to update our RRDTOOL database, and a new graph will be created.

The Code

Save the following code in a file called grabrank.pl:

#!/usr/bin/perl -w
#
# grabrank.pl
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as perl
#

use strict;
use LWP::Simple;
my $time=time(  );

# path to our local RRDTOOL.
my $rrd = '/usr/local/bin/rrdtool';

# Get the Amazon.com page for Mac OS X Hacks
my $data = get("http://www.amazon.com/exec/obidos/ASIN/0596004605/");
$data =~ /Amazon.com Sales Rank: <\/b> (.*) <\/span><br>/;
my $salesrank=$1; # and now the sales rank is ours! Muahh!

# Get rid of commas.
$salesrank =~ s/,//g;

# Update our rrdtool database.
`$rrd update salesrank.rrd $time:$salesrank`;

# Update our graph.
my $cmd= "$rrd graph osxhacks.png --imgformat PNG --units-exponent ".
         "0 DEF:myrank=salesrank.rrd:rank:AVERAGE LINE1:myrank#FF0000:".
         "'Mac OS X Hacks' --start ".($time-31*86400)." --end $time";
`$cmd`; # bazam! we're done.

Running the Hack

First, we need a cron job [Hack #90] to run this script once every day. On some systems, you can simply place the script in /etc/cron.daily. If you don’t have that option, then add something like this to your crontab file, which will tell cron to run our script every night at five minutes after midnight:

5 0 * * *       /path/to/your/grabrank.pl

Hacking the Hack

The graphs are not exactly pretty, so there are many possible improvements to be made, playing with intervals, colors, and so forth. If you look at the graph, you’ll see that the way it is displayed is somewhat counterintuitive, because a low figure is a sign of a higher ranking. If we knew the exact Sales Rank of the worst-selling item at Amazon.com in advance, we could simply subtract the daily rank from that number and create a graph that rises with a higher ranking. Since we don’t have that number, it takes a little more calculation, as sketched below.
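One workaround is to let RRDTOOL do the arithmetic with a CDEF, which computes a new data series from an existing one using RPN notation. The ceiling of 1,000,000 used here is an arbitrary stand-in, not a real Amazon.com figure, so treat this as a sketch:

% rrdtool graph osxhacks-flipped.png --start 1057241524 --end 1059833524
              --imgformat PNG DEF:myrank=salesrank.rrd:rank:AVERAGE
              CDEF:flipped=1000000,myrank,-
              LINE1:flipped#FF0000:"Mac OS X Hacks (flipped)"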

If you want to graph more than one Sales Rank, there’s not much to change, other than defining an extra data source when creating the database:

DS:otherrank:GAUGE:86400:1:U

And remember to add an extra DEF and LINE1 to the rrdtool graph command:

DEF:myotherrank=salesrank.rrd:rank:AVERAGE
LINE1:myotherrank#11EE11:"My other book"

Grabbing the extra data from Amazon.com is left as an exercise for the reader.

—Mads Toftum

Hack #63. Stocking Up on Financial Quotes

Keeping track of multiple stocks can be a cumbersome task, but using the Finance::Quote Perl module can greatly simplify it. And, while we’re at it, we’ll generate pretty graphs with RRDTOOL.

Collecting stock prices can be done using LWP [Hack #9] to download a financial site and regular expressions [Hack #23] to scrape the data, as well as always keeping a watchful eye for site design changes that could break things. But why go to the trouble when Finance::Quote (http://search.cpan.org/author/PJF/Finance-Quote/) provides a simple interface with numerous sources, such as Fidelity Investments, Trustnet, The Motley Fool, or Yahoo!?

Here’s a typical bit of code that uses Finance::Quote to fetch stock prices:

#!/usr/bin/perl
use Finance::Quote;
my $q = Finance::Quote->new;
my $quotes = $q->fetch("nasdaq","IBM");
print "Price range: $quotes->{'IBM','year_range'}\n";

We create a new Finance::Quote object and fetch data with $q->fetch($market,@stocks). In this case, we let the market point to nasdaq. Though @stocks is normally a list of desired stocks, we use just one (IBM). To get at the information that the module has grabbed for us, we use $quotes->{'IBM','year_range'}, which will get us the price range for the last 52 weeks:

% perl finance.pl
Price range: 54.01 - 90.404

There is much more information in addition to year_range; consult the Finance::Quote documentation for further explanation and details on which information is available from which sources. When in doubt, you can get a complete list of the available values by printing the returned $quotes structure:

use Data::Dumper;
print Dumper($quotes);

Adding these two lines to the previous code produces the following output:

$VAR1 = {
          'IBM{avg_vol' => 7264727,
          'IBM{div' => '0.64',
          'IBM{ask' => undef,
          'IBM{date' => '7/22/2003',
          'IBM{method' => 'yahoo',
          'IBM{div_yield' => '0.78',
          'IBM{low' => '81.65',
          'IBM{symbol' => 'IBM',
          'IBM{cap' => '141.2B',
          'IBM{day_range' => '81.65 - 83.06',
          'IBM{open' => '82.50',
          'IBM{bid' => undef,
          'IBM{eps' => '3.86',
          'IBM{time' => '1:40pm',
          'IBM{currency' => 'USD',
          'IBM{success' => 1,
          'IBM{volume' => 6055000,
          'IBM{last' => '81.70',
          'IBM{year_range' => '54.01 - 90.404',
          'IBM{close' => '82.50',
          'IBM{high' => '83.06',
          'IBM{net' => '-0.80',
          'IBM{p_change' => '-0.97',
          'IBM{ex_div' => 'May  7',
          'IBM{price' => '81.70',
          'IBM{pe' => '21.37',
          'IBM{name' => 'INTL BUS MACHINE',
          'IBM{div_date' => 'Jun 10'
        };

The first part of each hash key (IBM, in this case) is the stock symbol, the second part is a delimiter, and the third is the name of the data being referred to. The year_range value we printed in our first code sample appears among these entries.
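That delimiter is Perl’s doing, not Finance::Quote’s: a comma inside a hash subscript is Perl’s multidimensional hash emulation, which joins the keys with the $; separator character (the odd character showing up in the dumped keys above). These two lookups are therefore equivalent:

my $range = $quotes->{'IBM', 'year_range'};
my $same  = $quotes->{ join($;, 'IBM', 'year_range') };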

We have the data; now it’s time to start plotting it into a graph. As in [Hack #62], we use RRDTOOL (http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/) to plot our data, but this time we will use a Perl interface. RRDTOOL has two Perl interfaces; we will use only the shared module (RRDs), as it is the more flexible of the two. The Perl interface will be very familiar to those who know the command-line interface.

To add data to stocks.rrd, for example, we would normally run this command:

% rrdtool update stocks.rrd N:12345

Using the Perl interface, all we have to do is call RRDs::update, like this:

use RRDs;
RRDs::update ("stocks.rrd","N:12345");

Similarly, RRDs::create, RRDs::graph, and the others all work like their command-line counterparts. More information on the Perl bindings is available in the supplied RRDTOOL documentation.

Putting it all together in a Perl script, we use a “does this database exist?” check to see whether we should create a new database or update an existing one. Then, we get new stock figures using Finance::Quote and add them to our database using RRDs::update. To create graphs, we run once with RRDs::graph and --start -1w to create a graph for the last week, and once with -1m to graph the entire last month.

The Code

Save the following code in a file called grabstocks.pl:

#!/usr/bin/perl -w
use strict; use RRDs;
use Finance::Quote qw/asx/;

# Declare basic variables.
my @stocks       = ('IBM','MSFT','LNUX');
my @stock_prices = (0,0,0);
my $workdir      = "./stocks";
my $db           = "$workdir/stocks.rrd";
my $now          = time(  );

# if the database hasn't been created,
# do so now, or die with an error.
if (!-f $db) {
    RRDs::create ($db, "--start", $now-1,
          "DS:IBM:ABSOLUTE:900:0:U",
          "DS:MSFT:ABSOLUTE:900:0:U",
          "DS:LNUX:ABSOLUTE:900:0:U",
          "RRA:AVERAGE:0.5:1:4800",
          "RRA:AVERAGE:0.5:4:4800",
          "RRA:AVERAGE:0.5:24:3000",
    );

    if (my $ERROR = RRDs::error) { die "$ERROR\n"; }
}

# now, get the quote information
# for IBM, Microsoft, and Linux.
my $q      = Finance::Quote->new(  );
my %quotes = $q->fetch("usa",@stocks);

# for each of our stocks, check to 
# see if we got data, and if so, 
# add it to our stock prices.
my $count = 0; # array index into @stock_prices.
foreach my $code (@stocks) {
    unless ($quotes{$code, "success"}) {
        warn "$code lookup failed: ".$quotes{$code,"errormsg"}."\n";
        $count++; next; # well, that's not a good sign.
    }

    # update the stock price, and move to the next.
    $stock_prices[$count] = $quotes{$code,'last'}; $count++;
}

# we have our stock prices; update our database.
RRDs::update($db, "--template=" . join(':',@stocks),
                  "$now:" . join(':',@stock_prices));
if (my $ERROR = RRDs::error) { die "$ERROR\n"; }

# Generate weekly graph.
RRDs::graph("$workdir/stocks-weekly.png",
  "--title",     'Finance::Quote example',
  "--start",     "-1w",
  "--end",       $now+60,
  "--imgformat", "PNG",
  "--interlace", "--width=450",
  "DEF:ibm=$db:IBM:AVERAGE",
  "DEF:msft=$db:MSFT:AVERAGE",
  "DEF:lnux=$db:LNUX:AVERAGE",
  "LINE1:ibm#ff4400:ibm\\c",
  "LINE1:msft#11EE11:msft\\c",
  "LINE1:lnux#FF0000:lnux\\c"
); if (my $ERROR = RRDs::error) { die "$ERROR\n"; }

# Generate monthly graph.
RRDs::graph ("$workdir/stocks-weekly.png",
  "--title",     'Finance::Quote example',
  "--start",     "-1m",
  "--end",       $now+60,
  "--imgformat", "PNG",
  "--interlace", "--width=450",
  "DEF:ibm=$db:IBM:AVERAGE",
  "DEF:msft=$db:MSFT:AVERAGE",
  "DEF:lnux=$db:LNUX:AVERAGE",
  "LINE1:ibm#ff4400:ibm\\c",
  "LINE1:msft#11EE11:msft\\c",
  "LINE1:lnux#FF0000:lnux\\c"
); if (my $ERROR = RRDs::error) { die "$ERROR\n"; }

Running the Hack

First, we need a cron job [Hack #90] to run this script once every 15 minutes. To do that, add something like this to your crontab, telling cron to run our script four times every hour:

*/15 * * * Mon-Fri /path/to/your/grabstocks.pl

With that in place, new graphs will be generated every time the script runs.

Hacking the Hack

The first and most obvious thing is to change the code to get more data for more interesting stocks. The periods chosen in this hack might also need some updating, since getting data every 15 minutes gives a much higher resolution than we need if we’re interested in only monthly graphs. Likewise, running the script 24 hours a day doesn’t make much sense if there will be stock changes only during business hours.
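For example, a crontab entry restricted to (roughly) US market hours might look like this, assuming the machine’s clock is set to Eastern time:

*/15 9-16 * * Mon-Fri       /path/to/your/grabstocks.pl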

—Mads Toftum

Hack #64. Super Author Searching

By combining multiple sites into one powerful script, you can get aggregated data results that are more complete than just one site could give.

Have you ever obsessively tried to find everything written by a favorite author? Have you ever wanted to, but never found the time? Or have you never really wanted to, but think it would be neat to search across several web sites at once? Well, here’s your chance.

To search for authors, let’s pick a few book-related sites, such as the Library of Congress (http://www.loc.gov), Project Gutenberg (http://promo.net/pg/), and Amazon.com (http://www.amazon.com). Between these three web sites, we should be able to get a wide range of works by an author. Some will be for sale, some will be available for free download, and others will be available at a library (or the Library of Congress, at least).

Gathering Tools

Before we do anything else, let’s get some tools together. We’re going to use Perl for this hack, with the following modules: LWP::Simple [Hack #9], WWW::RobotRules, WWW::Mechanize [Hack #21], and HTML::Tree. These modules give us the means to navigate sites, grab content, and extract data from it, all while trying to be a good little robot that follows the rules ([Hack #17] offers guidance on using LWP::RobotUA to accomplish the same thing). It might seem like unnecessary effort, but taking a few extra steps to obey the Robots Exclusion Protocol (http://www.robotstxt.org) goes a long way toward keeping us out of trouble and keeping our access to the resources we want to gather.

Our script starts like so:

#!/usr/bin/perl -w
use strict;
use Data::Dumper qw(Dumper);

use LWP::Simple;
use WWW::RobotRules;
use WWW::Mechanize;
use HTML::Tree;

our $rules = WWW::RobotRules->new('AuthorSearchSpider/1.0');
our $amazon_affilate_id = "your affiliate ID here";
our $amazon_api_key     = "your key here";

my $author = $ARGV[0] || 'dumas, alexandre';

my @book_records = sort {$a->{title} cmp $b->{title}}
  (amazon_search($author), loc_gov_search($author), pg_search($author));

our %item_formats =
  (
   default => \&default_format,
   amazon  => \&amazon_format,
   loc     => \&loc_format,
   pg      => \&pg_format
  );

print html_wrapper($author,
                   join("\n", map { format_item($_) } @book_records));

So, here’s the basic structure of our script. We set up a few global resources: a robots.txt rules handler to keep our spider well behaved, and the identifiers we need to access Amazon.com Web Services. Next, we gather the aggregate results of searches on several web sites and sort the records by title. Once we have those, we set up a formatter for each type of result and produce an HTML page of the results.

Whew! Now, let’s implement all the subroutines that enable all these steps. First, in order to make a few things easier later on, we’re going to set up our robot rules handler and write a few convenience functions to use the handler and clean up bits of data we’ll be extracting:

# Get web content,
# obeying robots.txt
sub get_content {
  my $url = shift;
  return ($rules->allowed($url)) ? get($url) : undef;
}

# Get web content via WWW::
# Mechanize, obeying robots.txt
sub get_mech {
  my $url = shift;
  if ($rules->allowed($url)) {
    my $a = WWW::Mechanize->new(  );
    $a->get($url);
    return $a;
  } else { return undef }
}
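
# Note (sketch): WWW::RobotRules enforces only the rules it has actually
# seen, so each site's robots.txt should be fetched and parsed before we
# lean on $rules->allowed( ). The helper name below is our own invention.
sub prime_robot_rules {
  my $site = shift;                      # e.g., 'http://catalog.loc.gov'
  my $robots_url = "$site/robots.txt";
  my $robots_txt = get($robots_url);
  $rules->parse($robots_url, $robots_txt) if defined $robots_txt;
}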

# Remove whitespace from
# both ends of a string
sub trim_space {
  my $val = shift;
  $val=~s/^\s+//;
  $val=~s/\s+$//g;
  return $val;
}

# Clean up a string to be used
# as a field name of alphanumeric
# characters and underscores.
sub clean_name {
  my $name = shift;
  $name=lc($name);
  $name=trim_space($name);
  $name=~s/[^a-z0-9 ]//g;
  $name=~s/ /_/g;
  return $name;
}

Now that we have a start on a toolbox, let’s work on searching. The idea is to build a list of results from each of our sources that can be mixed together and presented as a unified whole.

Hacking the Library of Congress

Now, let’s visit the library. Right on the front page, we see a link inviting visitors to Search Our Catalogs, which leads us to a choice between a Basic Search and a Guided Search. For simplicity’s sake, we’ll follow the basic route.

This brings us to a simple-looking form (http://catalog.loc.gov/cgi-bin/Pwebrecon.cgi?DB=local&PAGE=First), with options for the search text, the type of search we want, and the number of records per page. Using WWW::Mechanize, we can start our subroutine to use this form like this:

sub loc_gov_search {
  my $author = shift;

  # Submit search for author's name
  my $url = 'http://catalog.loc.gov/cgi-bin/Pwebrecon.cgi?DB=local&PAGE=First';
  my $a = get_mech($url);
  $a->submit_form
    (
     form_number => 1,
     fields => { Search_Arg=>$author, Search_Code=>'NAME_', CNT=>70}
    );

The first result of this search is a list of links with which to further refine our author search. So, let’s try looking for links that contain the closest match to our author name:

  # Data structure for book data records
  my @hit_links = grep { $_->text() =~ /$author/i } $a->links(  );
  my @book_records = (  );
  for my $hit_link (@hit_links) {
    my $a = get_mech
      ('http://catalog.loc.gov/cgi-bin/Pwebrecon.cgi?DB=local&PAGE=First');
    $a->submit_form
      (
       form_number => 1,
       fields => { Search_Arg=>$author, Search_Code=>'NAME_', CNT=>70}
      );
    $a->follow_link(text=>$hit_link->text(  ));

This particular bit of code uses the link-extraction feature of WWW::Mechanize to grab link tags from the initial search results page to which we just navigated. Due to some quirk in session management in the Library of Congress search, we need to start over from the search results page, rather than simply use the back function.

Once we have each secondary author page of the search, we can extract links to publications from these pages:

    # Build a tree from the HTML
    my $tree = HTML::TreeBuilder->new(  );
    $tree->parse($a->content(  ));

    # Find the search results table: first, look for a header
    # cell containing "#", then look for the parent table tag.
    my $curr;
    ($curr) = $tree->look_down
      (_tag => 'th', sub { $_[0]->as_text(  ) eq '#' } );
    next if !$curr;
    ($curr) = $curr->look_up(_tag => 'table');
    my ($head, @rows) = $curr->look_down
      (_tag => 'tr', sub { $_[0]->parent(  ) == $curr } );

This code uses the HTML::Tree package to navigate the structure of the HTML content that makes up the search hits page. Looking at this page, we see that the actual table listing the search hits starts with a table header containing the text "#". If we look for this text, then walk back up to the containing parent, we can then extract the table’s rows to get the search hits.

Once we have the rows that contain links to details pages, let’s process them:

    # Extract and process the search
    # results from the results table.
    while (@rows) {

      # Take the results in row pairs; extract 
      # the title and year cells from the first row.
      my ($r1, $r2) = (shift @rows, shift @rows);
      my (undef, undef, undef, undef, $td_title, $td_year, undef) =
        $r1->look_down(_tag => 'td', sub { $_[0]->parent(  ) == $r1 });
    
      # Get title link from the results; extract the detail URL.
      my ($a_title) = $td_title->look_down(_tag=>'a');
      my $title_url = "http://catalog.loc.gov".$a_title->attr("href");

      # Get the book detail page; follow the link to the Full record.
      $a->follow_link(url => $title_url);
      $a->follow_link(text => "Full");

Looking at this page, we see that each publication is listed as a pair of rows. The first row in each pair lists a few details of the publication, and the second row tells where to find the publication in the library. For our purposes, we’re interested only in the title link in the first row, so we extract the cells of the first row of each pair and then extract the URL to the publication detail page from that.

From there, we follow the details link, which brings us to a brief description of the publication. But we’re interested in more details than that, so on that details page we follow a link named “Full” to a more detailed list of information on a publication.

Finally, then, we’ve reached the full details page for a publication by our author. So, let’s figure out how to extract the fields that describe this publication. Looking at the page, we see that the details table starts with a header containing the string "LC Control Number". So, we look for that header, then backtrack to the table that contains it:

      # Find table containing book detail data by looking
      # for table containing a header with text "LC Control Number".
      my $t2 = HTML::TreeBuilder->new(  );
      $t2->parse($a->content(  ));
      my ($c1) = $t2->look_down
        (_tag=>'th', sub { $_[0]->as_text(  ) =~ /LC Control Number/ }) ||
          next;
      $c1 = $c1->look_up(_tag=>"table");

After finding the table that contains the details of our publication, we can walk through the rows of the table and extract name/value pairs. First, we start building a record for this book by noting the type of the search, as well as the URL of the publication details page:

      # Now that we have the table, look
      # for the rows and extract book data.
      my %book_record = (_type => 'loc', url=>$title_url);
      my @trs = $c1->look_down(_tag=>"tr");
      for my $tr (@trs[1..$#trs]) {

        # Grab the item name and value table
        # cells; skip to next if empty.
        my ($th_name)  = $tr->look_down(_tag=>"th");
        my ($td_value) = $tr->look_down(_tag=>"td");
        next if (!$th_name) || (!$td_value);

        # Get and clean up the item name and value
        # table data; skip to next if the name is empty.
        my $name  = clean_name($th_name->as_text(  ));
        my $value = trim_space($td_value->as_text(  ));
        next if ($name eq '');
  
        $book_record{$name} = $value;
      }

Luckily, the table that contains information about our publication is fairly clean, with every name contained in a header cell and every value contained in a corresponding data cell in the same row. So, we walk through the rows of the details table, collecting data fields by using the convenience methods we wrote earlier.

Now, we can finish up our subroutine, doing a little cleanup on the publication title and adding the finished record to a list that we return when all our wandering through the library is done:

      ($book_record{title}, undef) 
         = split(/ \//, $book_record{main_title});

      push @book_records, \%book_record;

      # Back up to the search results page.
      $a->back(); $a->back(  );
    }
  }
  return @book_records;
}

To summarize, this subroutine does the following:

  1. Performs an author search on the Library of Congress web site

  2. Follows links to author search results pages

  3. Follows publication details links on author search results pages

  4. Digs further down to full-detail records on publications

  5. Harvests data fields that describe a publication

In the end, by drilling down through several layers of search hits and details pages, we have collected a slew of records that describe publications by our author. These records are stored as a list of Perl hashes, each containing name/value pairs.

Each record also contains a value that indicates which source it was harvested from (i.e., _type=>'loc'). This will become important shortly, when we mix the results of other searches together.
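To make the shape concrete, one harvested record ends up looking roughly like this (the values are placeholders; the field names match the cleaned-up table headers we just extracted):

{
  _type      => 'loc',
  url        => 'http://catalog.loc.gov/cgi-bin/Pwebrecon.cgi?...',
  title      => 'Some title',
  main_title => 'Some title / by some author.',
  # ...plus one cleaned-up key per row of the full-detail table,
  # such as isbn or lc_control_number.
}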

Perusing Project Gutenberg

Next, let’s take a look at Project Gutenberg (http://promo.net/pg/). In case you’ve never heard of it, this is an effort to make public-domain books and publications available to the public in formats usable by practically any personal computer. In the Project Gutenberg library, you can find an amazing array of materials, so our author search could benefit from a stroll through their stacks.

Wandering around the project’s site, we uncover a search form (http://www.ibiblio.org/gutenberg/cgi-bin/sdb/t9.cgi/). One of the fields in this form is Author, just what we need. Our search subroutine for this site begins like this:

# Search Project Gutenberg
# for books by an author
sub pg_search {
  my $author = shift;

  my $pg_base = 'http://www.ibiblio.org/gutenberg/cgi-bin/sdb';
  my @book_records = (  );

  # Submit an author search at Project Gutenberg
  my $a1 = get_mech("$pg_base/t9.cgi/");
  $a1->submit_form
    (
     form_number => 1,
     fields => { author => $author }
    );

As it turns out, this search results page is quite simple, with a link to every result contained in a list bullet tag. So, we can write a quick set of map expressions to find the bullets and the links within them, and extract the link URLs into a list:

  # Extract all the book details
  # pages from the search results
  my $t1 = HTML::TreeBuilder->new(  );
  $t1->parse($a1->content(  ));
  my (@hit_urls) =
    map { "$pg_base/".$_->attr('href') }
      map { $_->look_down(_tag=>'a') }
        $t1->look_down(_tag=>'li');

Now that we have a list of links to publication details pages, let’s chase each one down and collect the information for each book:

  # Process each book detail
  # page to extract book info
  for my $url (@hit_urls) {
    my $t2 = HTML::TreeBuilder->new(  );
    $t2->parse(get_content($url));

Luckily, these details pages also have a fairly simple and regular structure. So, we can quickly locate the table that contains the details by finding a table cell containing the word download and backtracking to its parent table.

     # Find the table of book data: look for a table
     # cell containing 'download' and find its parent table.
     my ($curr) = $t2->look_down
       (_tag=>"td",
        sub { $_[0]->as_text(  ) =~ /download/i });
     ($curr) = $curr->look_up(_tag=>"table");

Most rows of this table contain name/value pairs in data cells, with the name of the pair surrounded by <tt> tags. The names also end in a colon, so we can add that for good measure:

     # Find the names of book data items: look for
     # all the <tt> tags in the table that contain ':'
     my (@hdrs) = $curr->look_down 
       (_tag=>'tt',
        sub { $_[0]->as_text(  ) =~ /\:/});

After finding all the book details field names, we can visit each of them to dig out the values. For each tag that contains a name, we find its parent table row and grab the row’s second column, which contains the value of the pair. So, we can start constructing a record for this book. Again, notice that we start out by identifying which source this search result was harvested from (i.e., _type=>'pg'):

     # Extract name/value data from book details page.
     my %book_record = (_type=>'pg', url=>$url);
     for my $hdr (@hdrs) {
  
       # Name is text of <tt> tag.
       my $name = clean_name($hdr->as_text(  ));
       next if ($name eq '');
  
       # Find the field value by finding the parent
       # table row, then the child table data cell.
       my ($c2) = $hdr->look_up(_tag=>'tr');
       (undef, $c2) = $c2->look_down(_tag=>'td');

Most values are simple strings, with the exception of the publication’s download links. When we encounter this value, we go a step further and extract the URLs from those links. Otherwise, we just extract the text of the table data cell. Using what we’ve extracted, we build up the book record:

       # Extract the value. For most fields, simply use the text of the
       # table cell. For the download field, find the URLs of all links.
       my $value;
       if ($name eq 'download') {
         my (@links) = $c2->look_down
           (_tag=>"a",
            sub { $_[0]->as_text(  ) =~ /(txt|zip)/} );
         $value = [ map { $_->attr('href') } @links ];
       } else {
         $value = $c2->as_text(  );
       }

      # Store the field name and value in the record.
      $book_record{$name} = $value;
    }

Finally, we store each book record in a list and return it from our subroutine:

    push @book_records, \%book_record;
  }
  return @book_records;
}

Although simpler, this search is similar to searching the Library of Congress:

  1. Perform an author search on the Project Gutenberg web site.

  2. Follow links in the search results to find publication details pages.

  3. Harvest data fields that describe a publication.

And, like the Library of Congress search, we collect a list of Perl hashes that contain book details. Also, each record is tagged with the source of the search.

Navigating the Amazon

Our final search involves the online catalog at Amazon.com, via its Web Services API (http://www.amazon.com/webservices). This API allows developers and webmasters to integrate a wide range of Amazon.com’s features into their own applications and content. But before we can do anything with the API, we need to sign up for a developer token, which allows Amazon.com to tell one consumer of its services from another. Once we have a token, we can get started using the API. First, we download the software development kit (SDK). In the documentation, we find that, among other services, the API offers simple XML-based author searches, so we can use this service to build a search subroutine. Based on the SDK’s instructions, we can start like this:

# Search for authors via
# the Amazon search API.
sub amazon_search {
  my $author = shift;

  # Construct the base URL for Amazon author searches.
  my $base_url = "http://xml.amazon.com/onca/xml3?t=$amazon_affilate_id&".
    "dev-t=$amazon_api_key&AuthorSearch=$author&".
      "mode=books&type=lite&f=xml";

The first step is to use the XML service to submit a search query for our author. One quirk in the otherwise simple service is that results are served up only a few at a time, across a number of pages. So, we’ll grab the first page and extract the total number of pages that make up our search results:

  # Get the first page of search results.
  my $content = get_content($base_url."&page=1");

  # Find the total number of search results pages to be processed.
  $content =~ m{<totalpages>(.*?)</totalpages>}mgis;
  my ($totalpages) = ($1||'1');

Note that, in this hack, we’re going for a quick-and-dirty regular expression method for extracting information from XML. Normally, we’d want to use a proper XML parser, but this approach will work well enough to get this job done for now.

The next step, after getting the first page of search results and extracting the total number of pages, is to grab the rest of the pages for our search query. We can do this with another quick map expression in Perl to step through all the pages and store the content in a list.

One thing to note, however, is that we wait at least one second between grabbing results pages. The company may or may not enforce this restriction, but the license for using the Amazon.com Web Services API specifies that an application should make only one request per second. So, just as we make an effort to obey the Robots Exclusion Protocol, we should try to honor this as well.

Here’s how we do it:

  # Grab all pages of search results.
  my @search_pages = ($content);
  if ($totalpages > 1) {
    push @search_pages,
      map { sleep(1); get_content($base_url."&page=$_") } (2..$totalpages);
  }

Now that we have the content of all the search pages, we can extract records on the publications, just as we have in the previous two search subroutines. The biggest difference in this case, however, is that XML content is so much easier to handle than HTML tag soup. In fact, we can use some relatively simple regular expressions to process this data:

  # Extract data for all the books
  # found in the search results.
  my @book_records;
  for my $content (@search_pages) {

    # Grab the content of all <details> tags.
    while ($content=~ m{<details(?!s) url="(.*?)".*?>(.*?)</details>}mgis) {

      # Extract the URL attribute and tag body content.
      my($url, $details_content) = ($1||'', $2||'');

      # Extract all the tags from the detail record, using
      # tag name as hash key and tag contents as value.
      my %book_record = (_type=>'amazon', url=>$url);
      while ($details_content =~ m{<(.*?)>(.*?)</\1>}mgis) {
        my ($name, $val) = ($1||'', $2||'');
        $book_record{clean_name($name)} = $val;
      }

This code uses regular expressions to extract the contents of XML tags, starting with the details tag. The search results pages contain sets of these tags, and each set contains tags that describe a publication. We use a regular expression that matches on opening and closing tags, extracting the tag name and tag data as the name and value for each field. The names of these tags are described in the SDK, but we’ll just stuff them away in a book record for now.

Notice that this process is much simpler than walking through a tree built up from parsed HTML, looking for tag patterns. Things like this are usually simpler when an explicit service is provided for our use. So, we can apply a little last-minute processing—extracting lists of author subtags—finish up our book record, and wrap up our Amazon.com search subroutine:

      # Further process the authors list to extract author
      # names, and standardize on product name as title.
      my $authors = $book_record{authors} || '';
      $book_record{authors} =
        [ map { $_ } ( $authors =~ m{<author>(.*?)</author>}mgis ) ];
      $book_record{title} = $book_record{productname};

      push @book_records, \%book_record;
    }
  }

  return @book_records;
}

Compared to the previous two searches, this is the simplest of all. Since the XML provided by the Amazon.com search API is a well-defined and easily processed document, we don’t have to do any of the searching and navigation that is needed to extract records from HTML.

And, like the Library of Congress search, we collect a list of Perl hashes that contain book details. Also, each record is tagged with the source of the search.

Presenting the Results

We now have three subroutines with which to search for an author’s works. Each of them produces a similar set of results, as a list of Perl hashes that contain book details in name/value pairs. Although each site’s result records contain different sets of data, there are a few fields common to all three subroutines: _type, title, and url.

We can use these common fields to sort by title and format the results differently for each type of record. Now, we can build the parts to make the aggregate search and result formatting that we put together toward the beginning of the script. Let’s start with the wrapper HTML template:

sub html_wrapper {
  my ($author, $content) = @_;

  return qq^
    <html>
      <head><title>Search results for $author</title></head>
      <body>
        <h1>Search results for $author</h1>
        <ul>$content</ul>
      </body>
    </html>
    ^;
}

This is a simple subroutine that wraps a given bit of content with the makings of an HTML page. Next, let’s check out the basics of item formatting:

sub format_item {
  my $item = shift;
  return "<li>".((defined $item_formats{$item->{_type}})
    ? $item_formats{$item->{_type}}->($item)
    : $item_formats{default}->($item))."</li>";
}

sub default_format {
  my $rec = shift;
  return qq^<a href="$rec->{url}">$rec->{title}</a>^;
}

The first subroutine, format_item, uses the hash table of routines built earlier to apply formatting to items. The second subroutine, default_format, provides a simple implementation of an item format. Before we fill out implementations for the other record types, let’s build a quick convenience function:

sub field_layout {
  my ($rec, $fields) = @_;
  my $out = '';
  for (my $i=0; $i<scalar(@$fields); $i+=2) {
    my ($name, $val) = ($fields->[$i+1], $rec->{$fields->[$i]});
    next if !defined $val;
    $out .= qq^<tr><th align="right">$name:</th><td>$val</td></tr>^;
  }
  return $out;
}

This function takes a record and an ordered list of field name/description pairs. It returns a string containing a set of table rows, pairing each description with its value and skipping any fields the record doesn’t have. We’ll use this in the rest of the formatters to build tables quickly.
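
For instance, a call like this (the record values are invented purely for illustration) produces one row for each field the record actually has:

my %rec = (isbn => '0596005776', description => 'xxiv, 428 p. ; 24 cm.');
print field_layout(\%rec, [ 'description' => 'Description',
                            'isbn'        => 'ISBN' ]);
# <tr><th align="right">Description:</th><td>xxiv, 428 p. ; 24 cm.</td></tr>
# <tr><th align="right">ISBN:</th><td>0596005776</td></tr>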

First, we build a formatter for the Library of Congress search records. Basically, this is an incremental improvement over the default formatter. It identifies the source of this result and uses the field-layout function we just built to display a small set of common fields found in Library of Congress publication records:

sub loc_format {
  my $rec = shift;
  my $out = qq^[LoC] <a href="$rec->{url}">$rec->{title}</a><br /><br />^;
  $out .= qq^<table border="1" cellpadding="4" cellspacing="0" [RETURN]
      width="50%">^;
  $out .= field_layout
    ($rec,
      [
        'publishedcreated'  => 'Published',
        'type_of_material'  => 'Type of material',
        'description'       => 'Description',
        'dewey_class_no'    => 'Dewey class no.',
        'call_number'       => 'Call number',
        'lc_classification' => 'LoC classification',
        'lc_control_number' => 'LoC control number',
        'isbn'              => 'ISBN',
      ]
    );
  $out .= "</table><br />";
  return $out;
}

Next, we build a formatter for the Project Gutenberg records. This implementation doesn’t display as many fields, but it has a special treatment of the download field in order to present the URLs as links:

sub pg_format {
  my $rec = shift;
  my $out = qq^[PG] <a href="$rec->{url}">$rec->{title}</a><br /><br />^;
  $out .= qq^<table border="1" cellpadding="4" cellspacing="0" [RETURN]
      width="50%">^;
  $out .= field_layout($rec, ['language' => 'Language']);
  $out .= qq^
    <tr><th align="right">Download:</th>
      <td>
  ^;
  for my $link (@{$rec->{download}}) {
    $out .= qq^<a href="$link">$link</a><br />^;
  }
  $out .= qq^</td></tr></table><br />^;
  return $out;
}

Finally, we build a formatter for the Amazon.com records, which has much in common with the Library of Congress record formatter. The biggest difference is that we’ve added the display of the publication’s cover image that is available at Amazon.com:

sub amazon_format {
  my $rec = shift;
  my $out = qq^[Amazon] <a href="$rec->{url}">$rec->{title}</a>[RETURN]
<br /><br />^;
  $out .= qq^
    <table border="1" cellpadding="4" cellspacing="0" width="50%">
      <tr><th align="center" colspan="2">
        <img src="$rec->{imageurlmedium}" />
      </th></tr>
  ^;
  $out .= field_layout
    ($rec,
      [
        'releasedate'  => 'Date',
        'manufacturer' => 'Manufacturer',
        'availability' => 'Availability',
        'listprice'    => 'List price',
        'ourprice'     => "Amazon's price",
        'usedprice'    => 'Used price',
        'asin'         => 'ASIN'
      ]
    );
  $out .= "</table><br />";
  return $out;
}

Running the Hack

Now our script is complete. We have code to search for an author across several sites, we have a means of driving these searches and aggregating the results, and we have a flexible means of presenting the results of our search. The design of this script should easily lend itself to adding further sites to be searched, as well as formatters for those results. Figure 4-6 shows the default format.

Figure 4-6. Search results for “dumas, alexandre”

This script is best used from the command line, with the results saved to a file for viewing when the process is complete. Since this is a robot that spiders across many pages on several sites, expect it to take a while to finish. It also generates a fair amount of traffic on the sites it visits, so you’ll likely want to refrain from running it very often. In particular, it’s not a good idea to adapt this script as a CGI script behind a web search form.

Hacking the Hack

Exercises left for the reader include breaking up the search results into pages to make the results friendlier to browse. Also, without too much effort, this script could be modularized and turned into a fairly flexible search robot. In any case, enjoy your new powers of author searching, and good luck in building new search robots.

—l.m.orchard

Hack #65. Mapping O’Reilly Best Sellers to Library Popularity

If you’re using Google to look for books in university libraries, you’ll get better results searching for a Library of Congress call number than for a plain old ISBN.

Earlier in the book, we looked at the variety of unique identifiers that can be used on a web site [Hack #7]. A number of these unique identifiers deal with books and other media.

You may one day find yourself with one identifier for a set of data but needing another set of data that uses a different identifier. That’s where I found myself when I was wondering exactly how many O’Reilly books were in university libraries, compared to their best-selling status (O’Reilly publishes a weekly list of best sellers at http://www.oreilly.com/catalog/top25.html).

Now, I could just use the ISBN, which O’Reilly supplies, and try to find library holdings that way. The problem, though, is that searching for ISBNs on Google will lead you to lots of false positives—bookstores or just mentions of books, instead of actual library holdings. But we do have an alternative: searching for a book’s Library of Congress (LOC) call number will eliminate most of those false positives.

But how do we get the LOC call number for each book? It’s not available from O’Reilly. I found a good search interface at the Rochester Institute of Technology’s library. I used the ISBNs from O’Reilly’s site to look up the LOC call number at RIT’s library. After I had the call number, I used Google’s API to count how many times the call number appeared in Google’s database.

Since the vast majority of LOC call numbers appear in Google search results from university web sites (and specifically library pages), this is a good way to gauge how popular an O’Reilly book is in university libraries versus how it ranks on O’Reilly’s overall best-selling list. Are the results perfect? No; most of the search results find acquisitions lists, not catalog search results. But you can get some idea of which books are popular in libraries and which ones barely appear in libraries at all!

There’s another issue with this script. LOC call numbers end with the date a book was issued; for example, the call number for Mac OS X Hacks is QA76.76.O63 D67 2003. The “2003” is the year the book was published. In the case of Mac OS X Hacks, this is not a problem, since there’s only one edition of the book. But in cases of books like Learning Perl, where there are several editions available, searching for just the call number with the year of publication could miss libraries that simply have older versions of the book on their acquisitions lists.

To that end, this program actually takes two counts in Google using the LOC call number. In the first case, it searches for the entire number. In the second case, it searches for the number without the year at the end, giving two different results.

The Code

Save the following code to a file called isbn2loc.pl:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use SOAP::Lite;

# All the Google information.
my $google_key  = "your Google API key";
my $google_wdsl = "GoogleSearch.wsdl";
my $gsrch       = SOAP::Lite->service("file:$google_wdsl");
my $bestsellers = get("http://www.oreilly.com/catalog/top25.html");

# Since we're getting a list of best sellers,
# we don't have to scrape the rank. Instead
# we'll just start a counter and increment
# it every time we move to the next book. 
my $rank = 1; 
while ($bestsellers =~ m!\[<a href="(.*?)">Read it on Safari!mgis) {
   my $bookurl = $1; $bookurl =~ m!http://safari.oreilly.com/(\w+)!;
   my $oraisbn = $1; next if $oraisbn =~ /^http/;

   # Here we'll search the RIT library for the book's ISBN. Notice
   # the lovely URL that allows us to get the book information.
   my $ritdata = get("http://albert.rit.edu/search/i?SEARCH=$oraisbn"); 
   $ritdata =~ m!field C -->&nbsp;<A HREF=.*?>(.*?)</a>!mgs; 
   my $ritloc = $1; # now we've got the LOC number.

   # Might as well get the title too, eh?
   $ritdata =~ m!<STRONG>\n(.*?)</STRONG>!ms; my $booktitle = $1; 

   # Check and see if the LOC code was found for the book.
   # In a few cases it won't be. If it was, keep on going.
   if ($ritloc =~ /^Q/ or $ritloc =~ /^Z/) {

      # The first search we're doing is for the entire LOC call number. 
      my $results = $gsrch ->doGoogleSearch($google_key, "\"$ritloc\"",
                             0, 1, "false", "",  "false", "", "", "");
      my $firstcount = $results->{estimatedTotalResultsCount};

      # Now, remove the date and check for all editions.
      $ritloc =~ m!(.*?) 200\d{1}!ms; my $ritlocall = $1; 
      $results = $gsrch ->doGoogleSearch($google_key, "\"$ritlocall\"",
                          0, 1, "false", "",  "false", "", "", "");
      my $secondcount = $results->{estimatedTotalResultsCount};

      # Now we print everything out.
      print "The book's title is $booktitle. \n"; 
      print "The book's O'Reilly bestseller rank is $rank.\n"; 
      print "The book's LOC number is $ritloc. \n";
      print "Searching for $ritloc on Google gives $firstcount results. \n"; 
      print "Searching for all editions on Google ($ritlocall) gives ".
            "$secondcount results.\n \n";  
   } 
   $rank++;
}

Running the Hack

Unlike many of the hacks in this book, this hack has no command-line switches or options. You just run it from the command line. It visits the top 25 best-seller list, gets the ISBNs, uses the ISBNs to get the LOC call numbers from the library at RIT, and then searches Google for the LOC call numbers with and without the year of publication. Output looks like this:

% perl isbn2loc.pl
The book's title is Learning Perl.
The book's O'Reilly bestseller rank is 8.
The book's LOC number is QA76.73.P33 S34 2001.
Searching for QA76.73.P33 S34 2001 on Google gives 0 results.
Searching for all editions on Google (QA76.73.P33 S34) gives 9 results.

The book's title is Running Linux.
The book's O'Reilly bestseller rank is 13.
The book's LOC number is QA76.76.O63 W465 2002.
Searching for QA76.76.O63 W465 2002 on Google gives 1 results.
Searching for all editions on Google (QA76.76.O63 W465) gives 20 results.

The book's title is Programming Perl.
The book's O'Reilly bestseller rank is 14.
The book's LOC number is QA76.73.P22 W348 2000.
Searching for QA76.73.P22 W348 2000 on Google gives 1 results.
Searching for all editions on Google (QA76.73.P22 W348) gives 10 results.

Hacking the Hack

This is a very closed hack; it uses certain sources and that’s that. So, the first modification that comes to mind is using different sources. O’Reilly doesn’t have the only best-seller list out there, you know. You could use Amazon.com, Barnes & Noble, or some other online bookstore or book list. You could also reference your own text file full of ISBNs.

You could also use Google’s daterange: syntax to check by month and see when the new acquisitions pages are being indexed. (There are too few search results to try to search on a day-by-day basis.) Another idea is to output the results into comma-delimited format, allowing you to put the information into a spreadsheet and lay it out that way.
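
Here’s a minimal sketch of that last idea; it assumes you swap it in for the print statements inside the if block, and the quoting helper is just a suggestion, not part of the original script:

# Emit one CSV row per book instead of the prose report.
my @fields = ($booktitle, $rank, $ritloc, $firstcount, $ritlocall, $secondcount);
print join(",", map { my $f = $_; $f =~ s/"/""/g; qq{"$f"} } @fields), "\n";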

Hack #66. Using All Consuming to Get Book Lists

You can retrieve a list of the most-mentioned books in the weblog community, as well as personal book lists and recommendations, through either of All Consuming’s two web service APIs.

This hack could represent the future of web applications. It glues together pieces of several web service APIs and then, in turn, offers an API to its features. If someone were to create a derivative application with this API, it would represent a third layer of abstraction from Amazon.com’s service. Entire lightweight services may someday be built layer upon layer like this, with dozens of interconnected applications exchanging data freely behind the scenes.

If this is a book about scraping and spidering, why include instructions on how to use web-based APIs? Quite simply, they make scraping easier. Instead of having to worry about ever-changing HTML [Hack #32], you merely have to do some quick research to learn the provided interface. Likewise, using an API makes it easier to combine raw data from scraping with prepared data from sites like Technorati, All Consuming, Alexa, and Amazon.com. For an example, check out [Hack #59].

All Consuming (http://www.allconsuming.net) is a fairly small application, built on top of a mountain of information that has been made freely available through web services. Amazon.com’s Web Services API fuels the invaluable book information, Google’s API allows us to get related web sites for book titles, and Weblogs.com has an XML file that lets us know which web sites have been updated each hour. Combining these three services, we can create lists of books that are being talked about on the Web. It only makes sense for us to give back to this generous community by opening up SOAP and REST interfaces to All Consuming’s information, to be used for free and in any way that can be invented.

The SOAP Code

Here’s an example of how you can access All Consuming information on your own, using SOAP and Perl. Create a file called display_weekly_list_with_soap.cgi:

#!/usr/bin/perl  -w
# display_weekly_list_with_soap.cgi
use strict; 

use SOAP::Lite +autodispatch => 
    uri => 'http://www.allconsuming.net/AllConsumingAPI',
    proxy => 'http://www.allconsuming.net/soap.cgi';

# optional values for the API.
my ($hour,$day,$month,$year) = qw( 12 05 28 2003 );

my $AllConsumingObject = 
AllConsumingAPI->new(
                         $hour,  # optional
                         $day,   # optional
                         $month, # optional
                         $year   # optional
                       );

This creates a new object, $AllConsumingObject, which you can then use to retrieve a wide variety of data, as explained in the following sections.

Most-mentioned lists

Every hour, All Consuming crawls recently updated weblogs to see if any new books have been mentioned for the first time on any given site. It combines this information with Amazon.com’s Web Services API, aggregates frequently mentioned books into hourly and weekly lists, and archives them all the way back to August 2002. GetHourlyList sends you the most recent hour’s list information, GetWeeklyList sends you the most recent aggregation of all activity during the last week, and GetArchiveList returns you the hourly or weekly list that corresponds with the date that you specify when creating the object (the $hour, $day, $month, and $year variables). For example:

my $HourlyData = $AllConsumingObject->GetHourlyList;
my $WeeklyData = $AllConsumingObject->GetWeeklyList;
my $ArchivedData = $AllConsumingObject->GetArchiveList;

Personal book lists

People have created their own book lists directly through All Consuming, assigning them to categories like Currently Reading, Favorite Books, and Completed Books. Although some of these lists are available for use on other sites through methods like JavaScript includes, if someone wants to add a Favorite Books list to their site, they’ll have to use the SOAP or REST interfaces to do so:

my $CurrentlyReading = $AllConsumingObject->GetCurrentlyReadingList('insert name');
my $FavoriteBooks    = $AllConsumingObject->GetFavoriteBooksList('insert name');
my $PurchasedBooks   = $AllConsumingObject->GetPurchasedBooksList('insert name');
my $CompletedBooks   = $AllConsumingObject->GetCompletedBooksList('insert name');

Book metadata and weblog mentions

Some users have added valuable metadata about books, such as first lines and number of pages. This is mostly for fun, and it allows me to have an hourly “first line trivia” question on my homepage, to see if you can guess the book that the first line comes from. In any case, if you want to retrieve book metadata for a given book, you can do so with the following method:

my $Metadata = $AllConsumingObject->GetMetadataForBook('insert ISBN');

The argument passed in is the ISBN (International Standard Book Number) for the book you’d like to retrieve metadata from. For a list of metadata that’s currently available for use, you can check out the metadata scorecard at All Consuming (http://www.allconsuming.net/scorecard.html).

Alternatively, if you’d like to receive a list of all of the weblogs that have mentioned a particular book, you can retrieve that information using the following method:

my $WeblogMentions = $AllConsumingObject->GetWeblogMentionsForBook('insert ISBN');

Friends and recommendations

All Consuming also has friend relationships—created when people mark their favorite web sites so they can keep track of what those sites are reading—as well as book recommendations based on the sum of all those friend relationships. You can get a list of web sites that you or someone else has marked as a friend by including your weblog URL:

my $Friends = $AllConsumingObject->GetFriends('insert URL');

And to get a list of books that all of your friends are currently reading, sorted by those that are mentioned recently and the most times, you can do this:

my $Recommendations = $AllConsumingObject->GetRecommendations('insert URL');

To iterate through the results these methods return, do something like this:

# The array here may differ depending
# on the type of data being returned.
if (ref($WeeklyData->{'asins'}) eq 'ARRAY') {
    foreach my $item (@{$WeeklyData->{'asins'}}) {
        print "TITLE: $item->{'title'}\n",
        "AUTHOR: $item->{'author'}\n\n";
    }
}

Of course, in either of these examples, you can change the URL passed to any other URL. For a full list of methods you can invoke on this object, visit the instructions (http://allconsuming.net/news/000012.html) and code samples (http://allconsuming.net/soap-code-example.txt).

The REST Code

For those who think SOAP is a bit of overkill for simple applications like this, you can get the same information REST-style. Add this code to a file called display_weekly_list_with_rest.cgi:

#!/usr/bin/perl -w
# display_weekly_list_with_rest.cgi
use strict;
use LWP::Simple;
use XML::Simple;

# Any of the URLs mentioned below can replace this one.
my $URLToGet = 'http://allconsuming.net/rest.cgi?weekly=1';

# Download and parse.
my $XML = get($URLToGet);
my $ParsedXML = XMLin($XML, suppressempty => 1);

# The array here may differ depending
# on the type of data being returned.
if (ref($ParsedXML->{'asins'}) eq 'ARRAY') {
    foreach my $item (@{$ParsedXML->{'asins'}}) {
        print "TITLE: $item->{'title'}\n",
        "AUTHOR: $item->{'author'}\n\n";
    }
}

Following are the URL formats you can access via HTTP to return XML data directly.

Most-mentioned lists

Here’s the REST interface for requesting the hourly and weekly most-mentioned lists:

http://allconsuming.net/rest.cgi?hourly=1
http://allconsuming.net/rest.cgi?weekly=1

If you’d like to retrieve an archived list of most-mentioned books, you can specify the date, like so:

http://allconsuming.net/rest.cgi?archive=1&hour=12&day=12&month=5&year=2003

Personal book lists

To retrieve a list of any of your categorized books in XML format, add your username to any of the following URLs. Note the category name in the URL.

http://allconsuming.net/rest.cgi?currently_reading=1&username=insert name
http://allconsuming.net/rest.cgi?favorite_books=1&username=insert name
http://allconsuming.net/rest.cgi?purchased_books=1&username=insert name
http://allconsuming.net/rest.cgi?completed_books=1&username=insert name

Book metadata and weblog mentions

To get XML data about a specific item, include the ISBN in these URLs:

http://allconsuming.net/rest.cgi?metadata=1&isbn=insert ISBN
http://allconsuming.net/rest.cgi?weblog_mentions_for_book=1&isbn=insert ISBN

Friends and recommendations

To find XML data that includes friends or recommendations for a given weblog, you can include the weblog’s URL in the appropriate format:

http://allconsuming.net/rest.cgi?friends=1&url=insert URL
http://allconsuming.net/rest.cgi?recommendations=1&url=insert URL

Running the Hack

Running display_weekly_list_with_rest.cgi without modification shows:

% perl display_weekly_list_with_rest.cgi
TITLE: Peer-to-Peer : Harnessing the Power of Disruptive Technologies
AUTHOR: Andy Oram

TITLE: Quicksilver : Volume One of The Baroque Cycle
AUTHOR: Neal Stephenson

TITLE: A Pattern Language: Towns, Buildings, Construction
AUTHOR: Christopher Alexander, Sara Ishikawa, Murray Silverstein

TITLE: Designing With Web Standards
AUTHOR: Jeffrey Zeldman

TITLE: Slander: Liberal Lies About the American Right
AUTHOR: Ann H. Coulter

TITLE: Bias : A CBS Insider Exposes How the Media Distort the News
AUTHOR: Bernard Goldberg

TITLE: The Adventures of Charmin the Bear
AUTHOR: David McKee, Joanna Quinn

The XML Results

The returned output of both the SOAP and REST interfaces will be XML that looks something like this:

<opt>
  <header 
    lastBuildDate="Sat May 28 13:30:02 2003" 
    title="All Consuming" 
    language="en-us" 
    description="Most recent books being talked about by webloggers." 
    link="http://allconsuming.net/" 
    number_updated="172" 
  />
  <asins 
    asin="0465045669" 
    title="Metamagical Themas" 
    author="Douglas R. Hofstadter" 
    url="http://www.erikbenson.com/"
    image="http://images.amazon.com/images/P/0465045669.01.THUMBZZZ.jpg" 
    excerpt="Douglas Hoftstadter's lesser-known book, Metamagical Themas, 
has a great chapter or two on self-referential sentences like 'This sentence 
was in the past tense.'." 
    amazon_url="http://amazon.com/exec/obidos/ASIN/0465045669/"
    allconsuming_url="http://allconsuming.net/item.cgi?id=0465045669"
  />
</opt>

If multiple items are returned, there will be multiple <asins /> elements.

Hacking the Hack

Although All Consuming currently tracks only book trends, it also stores information about other types of items that are available at Amazon.com, such as CDs, DVDs, and electronics. You can’t find this information anywhere on All Consuming’s site, but if you use either of the APIs to retrieve weblog mentions for an ASIN (Amazon.com Standard Identification Number) that belongs to a product category other than books, it will still faithfully return any weblog data that it has for that item.
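
Here’s a quick REST-style sketch of that idea (the ASIN below is only a placeholder, not a real product ID):

#!/usr/bin/perl -w
use strict;
use LWP::Simple;

# Ask for weblog mentions of a non-book item; the service doesn't care
# that the "isbn" parameter is really a DVD or CD ASIN.
my $asin = 'B000000000';   # placeholder ASIN
my $xml  = get("http://allconsuming.net/rest.cgi?weblog_mentions_for_book=1&isbn=$asin")
    or die "No response from All Consuming.\n";
print $xml;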

—Erik Benson

Hack #67. Tracking Packages with FedEx

When you absolutely, positively have to know where your package is right now!

So many times when using the Web, all you need is one bit of information, especially when you’re running a specific search. You want to know when your flight is coming in. You want to know how much a book costs. You want to know when your FedEx package is going to arrive.

Spidering is ideal for grabbing this one bit of information without expending a lot of effort. This hack helps you track FedEx packages.

The Code

Save the following code as fedex_tracker.pl:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TableExtract;

# we use the Canada/English site, because its table
# of package tracking is simpler to parse than the "us".
my $url_base = "http://www.fedex.com/cgi-bin/tracking?action=track".
               "&cntry_code=ca_english&tracknumbers="; # woo hah.

# user wants to add a new tracking number.
my @tracknums; push(@tracknums, shift) if @ARGV;

# user already has some data on disk, so suck it in.
# we could technically add a grep on the readdir, but
# we have to postprocess @files anyway, so...
opendir(CWD, ".") or die $!; my @files = readdir(CWD); closedir(CWD);
foreach (@files) { /fedex_tracker_(\d+).dat/; push(@tracknums, $1) if $1; }
unless (@tracknums) { die "We have no packages to track!\n"; }
my %h; undef (@h{@tracknums}); @tracknums = keys %h; # quick unique.

# each tracking number, look it up.
foreach my $tracknum (@tracknums) {

    # suck down the data or end.
    my $data = get("$url_base$tracknum") or die $!;
    $data =~ s/&nbsp;/ /g; # sticky spaces.

    # and load our specific tracking table in.
    my $te = HTML::TableExtract->new(
           headers => ["Scan Activity","Date/Time"]);
    $te->parse($data); # alright, we've got everything loaded, hopefully.

    # now, get the new info.
    my $new_data_from_site;
    foreach my $ts ($te->table_states) {
       foreach my $row ($ts->rows) {
           $new_data_from_site .= " " . join(', ', @$row) . "\n";
       }
    }

    # if this is a broken tracking number,
    # move on and try the other ones we have.
    unless ($new_data_from_site) {
       print "No data found for package #$tracknum. Skipping.\n"; next; 
    }

    # if this package has never been tracked
    # before, then we'll create a file to
    # hold the data. this will be used for
    # comparisons on subsequent runs.
    unless (-e "fedex_tracker_$tracknum.dat") {
       open(FILE, ">fedex_tracker_$tracknum.dat") or die $!;
       print FILE $new_data_from_site; close (FILE);
       print "Adding the following data for #$tracknum:\n";
       print $new_data_from_site;
    }

    # if the datafile does exist, load it 
    # into a string, and do a simplistic
    # comparison to see if they're equal.
    # if not, assume things have changed.
    if (-e "fedex_tracker_$tracknum.dat") {
        open(FILE, "<fedex_tracker_$tracknum.dat");
        $/ = undef; my $old_data_from_file = <FILE>; close(FILE);
        if ($old_data_from_file eq $new_data_from_site) {
            print "There have been no changes for package #$tracknum.\n";
        } else {
            print "Package #$tracknum has advanced in its journey!\n";
            print $new_data_from_site; # update the user.
            open(FILE, ">fedex_tracker_$tracknum.dat");
            print FILE $new_data_from_site; close(FILE);
            # the file is updated for next compare.
        }
    }
}

Running the Hack

To use the script, pass a tracking number on the command line. The script will also check any packages you’ve entered on previous runs. When you run the script with a package number, the output will look like this:

% perl fedex_tracker.pl 047655634284503
 Adding the following data for #047655634284503:
  Departed FedEx sort facility/SACRAMENTO, CA, 08/06/2003 06:54
  Scanned at FedEx sort facility/SACRAMENTO, CA, 08/06/2003 00:14
  Scanned at FedEx origin location/SACRAMENTO, CA, 08/05/2003 23:57
  Customer-Loaded Trailer Picked Up/SACRAMENTO, CA, 08/05/2003 00:00
 There have been no changes for package #047655634284503.

Once you’ve run this search, the script will create a new file, fedex_tracker_PACKAGENUM.dat, in the same directory. In the previous example, the new file is called fedex_tracker_047655634284503.dat. On each successive run, the script will look up and update this package’s information. How do you get it to stop checking a particular package? Simply delete its .dat file.

Just because you have an existing package doesn’t mean you can’t continue to search for other packages. Say you run the previous search and then want to run another. This will work:

% perl fedex_tracker.pl 123456789
 No data found for package #123456789. Skipping.
 There have been no changes for package #047655634284503.

If no data is found for a package, a .dat file will not be created for it.

Hacking the Hack

If you want to write something similar that grabs information from Amazon.com’s order-tracking pages, there are a variety of things you could do.

First, you could run it from cron [Hack #90] so you’d have a daily update of where your package is. Alternatively, you could have another small script that periodically gathers up the names of the various .dat files you generate and sends them to you, so you could metatrack which packages you’ve tracked. Or, perhaps you want to export the package information into a comma-delimited file, in which case you’ll need a permanent record of shipping progress.
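
As a starting point for that kind of metatracking, a small helper like this (a separate sketch, not part of fedex_tracker.pl) lists every package number that currently has a datafile:

#!/usr/bin/perl -w
use strict;

# Collect the tracking numbers from the .dat files fedex_tracker.pl leaves behind.
opendir(my $dh, ".") or die $!;
my @tracked = sort grep { defined }
              map { /^fedex_tracker_(\d+)\.dat$/ ? $1 : undef } readdir($dh);
closedir($dh);

print "Currently tracking: ", (@tracked ? join(", ", @tracked) : "nothing"), "\n";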

Hack #68. Checking Blogs for New Comments

Tend to respond directly to weblog posts with a comment or three? Ever wonder about the reactions to your comments? This hack automates the process of keeping up with the conversation you started.

Blogs are the savior of independent publishing, and the ability of most to allow commenting creates an intimate collaboration between performer and audience: read the blog’s entry and any existing comments, and then add your own thoughts and opinions. What’s most annoying, however, is needing to return on a regular basis to see if anyone has added additional comments, whether to the original posting or to your own follow up.

With the RSS syndication format, you can monitor new blog entries in a standard way with any number of popular aggregators. Unfortunately, unless the site in question has provided its comments in RSS format also, there’s not really a standard way for comments to be used, repurposed, or monitored.

However, the more blogs and comments you read, the more you’ll see patterns emerge. Perhaps a comment always starts with “On DATE, PERSON said” or “posted by PERSON on DATE,” or even plain old “DATE, PERSON.” These comment signatures are the beginning of an answer to your needs: a script that uses regular expressions to check for various types of signatures can do a reasonable job of telling you when new comments have been posted.

The Code

Save this script as chkcomments.pl:

#!/usr/bin/perl -w
use strict;
use Getopt::Long;
use LWP::Simple;
my %opts; GetOptions(\%opts, 'v|verbose');

# where we find URLs. we'll also use this
# file to remember the number of comments.
my $urls_file = "chkcomments.dat";

# what follows is a list of regular expressions and assignment
# code that will be executed in search of matches, per site.
my @signatures = (
   { regex  => qr/On (.*?), <a href="(.*?)">(.*?)<\/a> said/,
     assign => '($date,$contact,$name) = ($1,$2,$3)'
   },
   { regex  => qr/&middot; (.*?) &middot; .*?<a href="(.*?)">(.*?)<\/a>/,
     assign => '($date,$contact,$name) = ($1,$2,$3)'
   },
   { regex  => qr/(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})&nbsp;(.*)/,
     assign => '($date,$name,$contact) = ($1,$2,"none")'
   },
);

# open our URL file, and suck it in.
open(URLS_FILE, "<$urls_file") or die $!;
my %urls; while (<URLS_FILE>) { chomp;
   my ($url, $count) = split(/\|%%\|/);
   $urls{$url} = $count || undef;
} close (URLS_FILE);

# foreach URL in our dat file:
foreach my $url (keys %urls) {

   next unless $url; # no URL, no cookie.
   my $old_count = $urls{$url} || undef;

   # print a little happy message.
   print "\nSearching $url...\n"; 

   # suck down the data.
   my $data = get($url) or next;

   # now, begin looping through our matchers.
   # for each regular expression and assignment
   # code, we execute it in this namespace in an
   # attempt to find matches in our loaded data.
   my $new_count; foreach my $code (@signatures) {

      # with our regular expression loaded,
      # let's see if we get any matches.
      while ($data =~ /$code->{regex}/gism) {

         # since our $code contains two Perl statements
         # (one being the regex, above, and the other
         # being the assignment code), we have to eval
         # it once more so the assignments kick in.
         my ($date, $contact, $name); eval $code->{assign};
         next unless ($date && $contact && $name);
         print "  - $date: $name ($contact)\n" if $opts{v};
         $new_count++; # increase the count.
      }

      # if we've gotten a comment count, then assume
      # our regex worked properly, spit out a message,
      # and assign our comment count for later storage.
      if ($new_count) {
         print " * We saw a total of $new_count comments".
               " (old count: ". ($old_count || "unchecked") . ").\n";
         if ($new_count > ($old_count || 0)) { # joy of joys!
             print " * Woo! There are new comments to read!\n"
         } $urls{$url} = $new_count; last; # end the loop.
      }
   }
} print "\n";

# now that our comment counts are updated,
# write it back out to our datafile.
open(URLS_FILE, ">$urls_file") or die $!;
foreach my $url (keys %urls) {
   print URLS_FILE "$url|%%|$urls{$url}\n";
} close (URLS_FILE);

Running the Hack

This script depends on being fed a file that lists URLs you’d like to monitor. These should be the URLs of the page that holds comments on the blog entry, often the same as the blog entry’s permanent link (or permalink). If you’re reading http://www.gamegrene.com, for instance, and you’ve just commented on the “The Lazy GM” article, you’ll add the following URL into a file named chkcomments.dat:

http://www.gamegrene.com/game_material/the_lazy_gm.shtml

A typical first run considers all comments new—new to you and your script:

% perl chkcomments.pl
Searching http://www.gamegrene.com/game_material/the_lazy_gm.shtml...
 * We saw a total of 5 comments (old count: unchecked).
 * Woo! There are new comments to read!

You can also show the name, date, and contact information of each individual comment, by passing the --verbose command-line option. This example shows the script checking for new comments on the same URL:

% perl chkcomments.pl --verbose
Searching http://www.gamegrene.com/game_material/the_lazy_gm.shtml...
  - July 23, 2003 01:53 AM: VMB (mailto:vesab@jippii.fi)
  - July 23, 2003 10:55 AM: Iridilate (mailto:)
  - July 29, 2003 02:46 PM: The Bebop Cow (mailto:blackcypress@yahoo.com)
... etc ...
 * We saw a total of 5 comments (old count: 5).

Since no comments were added between our first and second runs, there’s nothing new.

But how did the script know how many comments there were in the first place? The answer, as I alluded to previously, is comment signatures. In HTML, every comment on Gamegrene looks like this:

On July 23, 2003 01:53 AM, <a href="mailto:vesab@jippii.fi">VMB</a> said:

In other words, it has a signature of On DATE, <a href="CONTACT">PERSON</a> said or, if you were expressing it as a regular expression, On (.*?), <a href="(.*?)">(.*?)<\/a> said. Keen observers of the script will have noticed this regular expression appear near the top of the code:

my @signatures = (
   { regex  => qr/On (.*?), <a href="(.*?)">(.*?)<\/a> said/,
     assign => '($date,$contact,$name) = ($1,$2,$3)'
   },

What about the assign line, though? Simply enough, it takes our captured bits of data from the regular expression (the bits that look like (.*?)) and assigns them to more easily understandable variables, like $date, $contact, and $name. The number of times our regular expression matches is the number of comments we’ve seen on the page. Likewise, the information stored in our variables is the information printed out when we ask for --verbose output.
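
If the eval looks mysterious, here’s the same mechanism boiled down to a standalone demonstration; the sample line is made up, but it follows the Gamegrene signature:

#!/usr/bin/perl -w
use strict;

my $line = 'On July 23, 2003 01:53 AM, <a href="mailto:someone@example.com">Someone</a> said:';
my %sig  = (
    regex  => qr/On (.*?), <a href="(.*?)">(.*?)<\/a> said/,
    assign => '($date,$contact,$name) = ($1,$2,$3)',
);

if ($line =~ /$sig{regex}/) {
    my ($date, $contact, $name);
    eval $sig{assign};   # runs the assignment string in this lexical scope
    print "$date: $name ($contact)\n";
}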

If you refer back to the code, you’ll notice two other signatures that match the comment styles on Dive Into Mark (http://www.diveintomark.org) and the O’Reilly Network (http://www.oreillynet.com) (and possibly other sites that we don’t yet know about). Since their signatures already exist, we can add the following URLs to our chkcomments.dat file:

http://diveintomark.org/archives/2003/07/28/atom_news
http://www.oreillynet.com/pub/wlg/3593
http://macdevcenter.com/pub/a/mac/2003/08/01/cocoa_series.html?page=2

and run our script on a regular basis to check for new comments:

% perl chkcomments.pl 
Searching http://www.gamegrene.com/game_material/the_lazy_gm.shtml...
 * We saw a total of 5 comments (old count: 5).

Searching http://diveintomark.org/archives/2003/07/28/atom_news...
 * We saw a total of 11 comments (old count: unchecked).
 * Woo! There are new comments to read!

Searching http://www.oreillynet.com/pub/wlg/3593 ...
 * We saw a total of 1 comments (old count: unchecked).
 * Woo! There are new comments to read!

Searching http://macdevcenter.com/pub/a/mac/2003/08/01/cocoa_seri...
 * We saw a total of 9 comments (old count: unchecked).
 * Woo! There are new comments to read!

Hacking the Hack

The obvious way of improving the script is to add new comment signatures that match up with the sites you’re reading. Say we want to monitor new comments on Harvard Weblogs (http://blogs.law.harvard.edu/). The first thing we need is a post with comments, so that we can determine the comment signature. Once we find one, view the HTML source to see something like this:

<div class="date"><a href="http://scripting.com">
Dave Winer</a> &#0149; 7/18/03; 7:58:33 AM</div>

The comment signature for Harvard Weblogs is equivalent to <a href="CONTACT">PERSON</a> DATE, which can be stated in regular expression form as date"><a href="(.*?)">(.*?)<\/a> &#0149; (.*?)<\/div>. Once we have the signature in regular expression form, we just need to assign our matches to the variable names and add the signature to our listings at the top:

my @signatures = (
   { regex  => qr/On (.*?), <a href="(.*?)">(.*?)<\/a> said/,
     assign => '($date,$contact,$name) = ($1,$2,$3)'
   },
   { regex  => qr/&middot; (.*?) &middot; .*?<a href="(.*?)">(.*?)<\/a>/,
     assign => '($date,$contact,$name) = ($1,$2,$3)'
   },
   { regex  => qr/(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})&nbsp;(.*)/,
     assign => '($date,$name,$contact) = ($1,$2,"none")'
   },
   { regex  => qr/date"><a href="(.*)">(.*)<\/a> &#0149; (.*)<\/div>/,
     assign => '($contact,$name,$date) = ($1,$2,$3)'
   },
);

Now, just add the URL we want to monitor to our chkcomments.dat file, and run the script as usual. Here’s an output of our first check, with verbosity turned on:

Searching http://blogs.law.harvard.edu/comments?u=homeManilaWebs...
  - 7/18/03; 1:23:14 AM: James Farmer (http://radio.weblogs.com/0120501/)
  - 7/18/03; 4:06:10 AM: Phil Wolff (http://dijest.com/aka)
  - 7/18/03; 7:58:33 AM: Dave Winer (http://scripting.com)
  - 7/18/03; 6:23:14 PM: Phil Wolff (http://dijest.com/aka)
 * We saw a total of 4 comments (old count: unchecked).
 * Woo! There are new comments to read!

Hack #69. Aggregating RSS and Posting Changes

With the proliferation of individual and group weblogs, it’s typical for one person to post in multiple places. Thanks to RSS syndication, you can easily aggregate all your disparate posts into one weblog.

You might have heard of RSS. It’s an XML format that’s commonly used to syndicate headlines and content between sites. It’s also used in specialty software programs called headline aggregators or readers. Many popular weblog software packages, including Movable Type (http://www.movabletype.org) and Blogger (http://www.blogger.com), offer RSS feeds. So too do some of the content management systems—Slashcode (http://slashcode.com), PHPNuke (http://phpnuke.org), Zope (http://www.zope.org), and the like—that run some of the more popular tech news sites.

If you produce content for various people, you might find your writing and commentary scattered all over the place. Or, say you have a group of friends and all of you want to aggregate your postings into a single place without abandoning your individual efforts. This hack is a personal spider just for you; it aggregates entries from multiple RSS feeds and posts those new entries to a Movable Type blog.

The Code

You’ll need LWP::Simple, Net::Blogger, and XML::RSS to use this. Save the following code to a file named myrssmerger.pl:

#!/usr/bin/perl -w
#
# MyRSSMerger - read multiple RSS feeds, post new entries to Movable Type.
# http://disobey.com/d/code/ or contact morbus@disobey.com.
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#

use strict; $|++;
my $VERSION = "1.0";
use Getopt::Long;
my %opts;

# make sure we have the modules we need, else die peacefully.
eval("use LWP::Simple;");  die "[err] LWP::Simple not installed.\n" if $@;
eval("use Net::Blogger;"); die "[err] Net::Blogger not installed.\n" if $@;
eval("use XML::RSS;");    die "[err] XML::RSS not installed.\n" if $@;

# define our command line flags (long and short versions).
GetOptions(\%opts, 'server|s=s',      # the XML-RPC server URL to use.
                   'username|u=s',    # the Movable Type username.
                   'password|p=s',    # the Movable Type password.
                   'blogid|b=i',      # unique ID of your blog.
                   'catid|c=i',       # unique ID for posting category.
                   'showcategories',  # list categories for blog.
                   'filter|f=s',      # per item filter for posting?
);

# at the very least, we need our login information.
die "[err] XML-RPC URL missing, use --server or -s.\n" unless $opts{server};
die "[err] Username missing, use --username or -u.\n"  
    unless $opts{username};
die "[err] Password missing, use --password or -p.\n"  
    unless $opts{password};
die "[err] BlogID missing, use --blogid or -b.\n"      unless $opts{blogid};

# every request past this point requires
# a connection, so we'll go and do so.
print "-" x 76, "\n"; # visual separator.
my $mt = Net::Blogger->new(engine=>"movabletype");
$mt->Proxy($opts{server});       # the servername.
$mt->Username($opts{username});  # the username.
$mt->Password($opts{password});  # the... ok. self-
$mt->BlogId($opts{blogid});      # explanatory!

# show existing categories.
if ($opts{showcategories}) {

    # get the list of categories from the server.
    my $cats = $mt->mt()->getCategoryList(  )
      or die "[err] ", $mt->LastError(  ), "\n";

    # and print 'em.
    if (scalar(@$cats) > 0) {
        print "The following blog categories are available:\n\n";
        foreach (sort { $a->{categoryId} <=> $b->{categoryId} } @$cats) {
            print " $_->{categoryId}: $_->{categoryName}\n";
        }
    } else { print "There are no selectable categories available.\n"; }

    # done with this request, so exit.
    print "\nCategory ID's can be used for --catid or -c.\n";
    print "-" x 76, "\n"; exit; # call me again, again!

}

# now, check for passed URLs for new-item-examination.
die "[err] No RSS URLs were passed for processing.\n" unless @ARGV;

# and store today's date for comparison.
# who needs the stinkin' Date:: modules?!
my ($day, $month, $year) = ((localtime)[3, 4, 5]);
$year+=1900; $month = sprintf("%02.0d", ++$month);
$day = sprintf("%02.0d", $day);  # zero-padding.
my $today = "$year-$month-$day"; # final version.

# loop through each RSS URL.
foreach my $rss_url (@ARGV) {

    # download whatever we've got coming.
    print "Downloading RSS feed at ", substr($rss_url, 0, 40), "...\n";
    my $data = get($rss_url) or print " [err] Data not downloaded!\n";
    next unless $data; # move onto the next URL in our list, if any.

    # parse it and then
    # count the number of items.
    # move on if nothing parsed.
    my $rss = new XML::RSS; $rss->parse($data);
    my $item_count = scalar(@{$rss->{items}});
    unless ($item_count) { print " [err] No parsable items.\n"; next; }

    # sandwich our post between a preface/anteface.
    my $clink = $rss->{channel}->{"link"}; # shorter variable.
    my $ctitle = $rss->{channel}->{title}; # shorter variable.
    my $preface = "From <a href=\"$clink\">$ctitle</a>:\n\n<blockquote>";
    my $anteface = "</blockquote>\n\n"; # new items as quotes.

    # and look for items dated today.
    foreach my $item (@{$rss->{items}}) {

        # no description or date for our item? move on.
        unless ($item->{description} or $item->{dc}->{date}) {
          print " Skipping (no description/date): '$item->{title}'.\n";
          next;
        }

        # if we have a date, is it today's?
        if ($item->{dc}->{date} =~ /^$today/) {

            # shorter variable. we're lazy.
            my $creator = $item->{dc}->{creator};

            # if there's a filter, check for goodness.
            if ($opts{filter} && $item->{description} !~ /$opts{filter}/i) {
                print " Skipping (failed filter): '$item->{title}'.\n"; 
                next;
            }

            # we found an item to post, so make a
            # final description from various parts.
            my $description = "$preface$item->{description} ";
            $description   .= "($creator) " if $creator;
            $description   .= "<a href=\"$item->{link}\">Read " .
                              "more from this post.</a>$anteface";

            # now, post to the passed blog info.
            print " Publishing item: '$item->{title}'.\n";
            my $id = $mt->metaWeblog(  )->newPost(
                              title       => $item->{title},
                              description => $description,
                              publish     => 1)
                     or die "[err] ", $mt->LastError(  ), "\n";

            # set the category?
            if ($opts{catid}) {
                $mt->mt(  )->setPostCategories(
                              postid     => $id,
                              categories => [ {categoryId => $opts{catid}}])
                or die " [err] ", $mt->LastError(  ), "\n";

                # "edit" the post with no changes so
                # that our category change activates.
                $mt->metaWeblog(  )->editPost(
                              title       => $item->{title},
                              description => $description,
                              postid      => $id,
                              publish     => 1)
                     or die " [err] ", $mt->LastError(  ), "\n";
            }
        } else { 
           print " Skipping (failed date check): '$item->{title}'.\n"; 
        }
    }
    print "-" x 76, "\n"; # visual separator.
}

exit;

Running the Hack

To run the code, you’ll need a Movable Type weblog. At the very least, you need the username, password, XML-RPC URL for Movable Type, and the blog ID (normally 1 if you have only one). Here’s an example of connecting to Kevin’s Movable Type installation to show a list of categories to post to (the --showcategories switch is, strangely enough, showing the categories):

% perl myrssmerger.pl -s http://disobey.com/cgi-bin/mt/mt-xmlrpc.cgi -u 
morbus -p HAAHAHAH -b 1 --showcategories

The output looks like this:

----------------------------------------------------------------------
 The following blog categories are available:

 1: Disobey Stuff
 2: The Idiot Box
 3: CHIApet
 4: Friends O' Disobey
 5: Stalkers O' Morbus
 6: Morbus Shoots, Jesus Saves
 7: El Casho Disappearo
 8: TechnOccult
 9: Potpourri
 10: Collected Nonsensicals

Category ID's can be used for --catid or -c.
----------------------------------------------------------------------

If you have no categories, you’ll be told as such. When you’re actually posting to the blog, you can choose to post into a category or not; if you want to post into Disobey Stuff, use either -c 1 or --catid 1 when you run the program. If you want no category, specify no category.

Let’s take a look at a few examples of how to use the script. Say Kevin wants to aggregate all the data from all the places he publishes information. Every night he’ll use cron [Hack #90] to run the script for various RSS feeds. Here’s an example:

% perl myrssmerger.pl --server http://disobey.com/cgi-bin/mt/mt-xmlrpc.cgi [RETURN]
--username morbus --password HAAHAHAH --blogid 1 --catid 1 [RETURN]
http://gamegrene.com/index.xml

In this case, he’s saying, “Every night, check the Gamegrene RSS files for entries posted today. If you see any, post them to Disobey Stuff” (which is the first category, referenced with the --catid 1 switch). He can then run the script again, only for a different RSS feed with a different category switch, and so on. Let’s take a look at the output of the Gamegrene example:

----------------------------------------------------------------------
Downloading RSS feed at http://gamegrene.com/index.xml...
 Publishing item: 'RPG, For Me'.
 Skipping (failed date check): 'Just Say No To Powergamers'.
 Skipping (failed date check): 'Every Story Needs A Soundtrack'.
 Skipping (failed date check): 'The Demise of Local Game Shops'.
 Skipping (failed date check): 'Death Of A Gaming System'.
 Skipping (failed date check): 'What Do You Do With Six Million Elves?'.
----------------------------------------------------------------------

As you can see, the script checks the dates in the RSS feed to make sure they’re new before the items are added to the Movable Type weblog. Dates are determined from the <dc:date> entry in the remote RSS URL; if the feed doesn’t have them, the script won’t function correctly.
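
If you’re not sure whether a feed carries dates, a throwaway check like this (not part of myrssmerger.pl) will tell you:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use XML::RSS;

# Report how many items in a feed actually carry a dc:date element.
my $url  = shift or die "usage: $0 <rss url>\n";
my $data = get($url) or die "couldn't fetch $url\n";
my $rss  = XML::RSS->new; $rss->parse($data);
my $dated = grep { $_->{dc}->{date} } @{$rss->{items}};
printf "%d of %d items have a dc:date.\n", $dated, scalar @{$rss->{items}};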

What happens when you want to check many RSS feeds and add them all to the same category? You can do that in a single run of the script. Say you want to check three different RSS feeds, not necessarily all yours. Here’s an example of Kevin checking three feeds (including Tara’s) and posting any new items to category 4, Friends O' Disobey:

% perl myrssmerger.pl --server http://disobey.com/cgi-bin/mt/mt-xmlrpc.cgi [RETURN]
--username morbus --password HAAHAHAH --blogid 1 --catid 4 [RETURN]
http://gamegrene.com/index.xml http://researchbuzz.com/researchbuzz.rss [RETURN]
http://camworld.com/index.rdf

The shortened output looks like this:

----------------------------------------------------------------------
Downloading RSS feed at http://gamegrene.com/index.xml...
 Skipping (failed date check): 'RPG, For Me'.
 Skipping (failed date check): 'Just Say No To Powergamers'.
 Skipping (failed date check): 'Every Story Needs A Soundtrack'.
----------------------------------------------------------------------
Downloading RSS feed at http://camworld.com/index.rdf...
 Publishing item: 'Trinity's Hack from Matrix Reloaded'.
 Skipping (failed date check): 'Siberian Desktop'.
 Skipping (failed date check): 'The Sweet Hereafter'.
----------------------------------------------------------------------
Downloading RSS feed at http://researchbuzz.com/researchbuzz.rss...
 Skipping (no description/date): 'Northern Light Coming Back?'.
 Skipping (no description/date): 'This Week in LLRX'.
----------------------------------------------------------------------

Note that Tara’s feed doesn’t work with this script; she generates her RSS by hand, and her feed doesn’t include dates. Most program-generated feeds, like those of Movable Type, have dates and descriptions and will be just fine.

As you can see, we can choose a variety of feeds to use and we can post them to any of our Movable Type categories. Is there anything else this script can do? Well, actually, yes; it can filter incoming entries that match a specified keyword. To do that, use the --filter switch. As an example, this script posts only those entries whose descriptions include the string “perl”:

% perl myrssmerger.pl --server http://disobey.com/cgi-bin/mt/mt-xmlrpc.cgi [RETURN]
--username morbus --password HAAHAHAH --blogid 1 --catid 4 --filter "perl" [RETURN]
http://camworld.com/index.rdf

Hacking the Hack

Actually, this is both a “hacking the hack” and “some things to consider” section. Right now, the biggest downside is that this hack works only on Movable Type. You could dive into Net::Blogger a bit and make it usable by Blogger (http://www.blogger.com), Radio Userland (http://radio.userland.com/), or any one of the other weblogging platforms.

This script is designed to run once a day, and it does a full download of each RSS feed every time it runs. As it stands, you should stick to that daily schedule, for two reasons:

  • If you run the script more than once a day, you’ll repeatedly download full RSS feeds, which can strain bandwidth for both you and the sites you’re polling.

  • The more often you run the script, the more likely you are to post duplicate items, since anything dated today is posted again on every run.

All right, let’s talk about a couple of actual hacks. First is error checking: as is, the script doesn’t check the URLs to make sure they start with http://. That’s easily solved; just add a quick check at the top of the URL loop:

# loop through each RSS URL.
foreach my $rss_url (@ARGV) {

    # not an HTTP URL.
    next unless $rss_url =~ m!^http://!;

    # download whatever we've got coming.

Next, the preface and the anteface (i.e., the text that surrounds the posted entry) are hardcoded into the script, but we can change that via a switch on the command line. First make the preface and anteface command-line options:

GetOptions(\%opts, 'server|s=s',      # the XML-RPC server URL to use.
                   'username|u=s',    # the Movable Type username.
                   'password|p=s',    # the Movable Type password.
                   'blogid|b=i',      # unique ID of your blog.
                   'catid|c=i',       # unique ID for posting category.
                   'showcategories',  # list categories for blog.
                   'filter|f=s',      # per item filter for posting?
                   'preface|r=s',     # the preface text before a posted item.
                   'anteface|a=s',    # the text included after a posted item.
               );

You’ll then need to make a change to the preface line:

my $preface = $opts{preface} || "From <a href=\"$clink\">$ctitle</a>:\n\n<blockquote>";

and a similar change to the anteface line:

my $anteface = $opts{anteface} 
    || "</blockquote>\n\n"; # new items as quotes.
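
With those two switches in place, a run might look something like this (the preface and anteface values are only examples):

% perl myrssmerger.pl --server http://disobey.com/cgi-bin/mt/mt-xmlrpc.cgi [RETURN]
--username morbus --password HAAHAHAH --blogid 1 --catid 1 [RETURN]
--preface "As seen elsewhere: <blockquote>" --anteface "</blockquote>" [RETURN]
http://gamegrene.com/index.xml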

See Also

Hack #70. Using the Link Cosmos of Technorati

Similar to other indexing sites like Blogdex, the Link Cosmos at Technorati keeps track of an immense number of blogs, correlating popular links and topics for all to see. With the recently released API, developers can now integrate the results into their own scripts.

Technorati (http://www.technorati.com) walks, crawls, investigates, and generally mingles around weblog-style web sites and indexes them, gathering loads of information. I mean loads: it keeps track of articles on the web site, what links to it, what it links to, how popular it is, how popular the web sites that link to it are, how popular the people that read it are, and who is most likely to succeed. Well, it does most of those things.

Need Some REST?

The current version of the Technorati API is a REST (Representational State Transfer) interface. REST interfaces transfer data through ordinary GET or POST requests to a URL. We will initially use the interface to access the Technorati Cosmos data. The Cosmos is the data set that keeps track of who links to whom; essentially, it records who thinks who is interesting. Technorati allows queries of the following information via the REST interface:

Link Cosmos

Who you link to, who links to whom, and when.

Blog info

General information about a specified weblog, including the weblog name, URL, RSS URL (if one exists), how many places it links to, how many places link to it, and when it was last updated. This is the same information that is returned for each weblog in the Cosmos lookup.

Outbound blogs

A list of web sites that the specified URL links to.

We’re going to focus on the Link Cosmos information, which in my bloated opinion is the most important. The following small piece of code uses the Technorati interface to grab the current weblog listing and print the resulting XML data that is returned from the Technorati interface. You’ll need to become a member of the site to receive your developer’s API key:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;

my $key       = "your developer key";
my $searchURL = "http://www.perceive.net/";
my $restAPI   = "http://api.technorati.com/cosmos?key=$key&url=".
                "$searchURL&type=weblog&format=xml";
my $xml = get($restAPI);
print "$xml\n";

Dave Sifry, the developer of Technorati, has also made a small distinction between general web sites and weblogs. Notice type=weblog in the URL of the previous code. You can change this to type=link, and you’ll get the last 20 web sites that link to your site, rather than just the last 20 blogs. This is a small distinction, but one that could be useful.
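
In code, that amounts to a one-word change to the query URL built in the previous listing:

my $restAPI   = "http://api.technorati.com/cosmos?key=$key&url=".
                "$searchURL&type=link&format=xml";

Either way, the shape of the response is the same.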

The returned result is a chunk of XML, which resembles this:

<item>
  <weblog>
    <name>phil ringnalda dot com</name>
    <url>http://philringnalda.com</url>
    <rssurl>http://www.philringnalda.com/index.xml</rssurl>
    <inboundblogs>339</inboundblogs>
    <inboundlinks>471</inboundlinks>
    <lastupdate>2003-07-11 21:09:28 GMT</lastupdate>
  </weblog>
</item>

Many REST interfaces use XML as the format for returning data to the requestor. This allows the data to be parsed easily and used in various ways, such as creating HTML for your web site:

use XML::Simple;
my $parsed_data = XMLin($xml);
my $items = $parsed_data->{document}->{item};

print qq{<ol>\n};
for my $item (@$items) {
    my ($weblog, $url) = ($item->{weblog}->{name}, $item->{weblog}->{url});
    print qq{<li><a href="$url">$weblog</a></li>};
}
print qq{</ol>};

First, we load the XML::Simple module, which will allow us to load the data into a hash. The XMLin function does this for us and returns a hash of hashes and arrays. After XMLin has loaded the data, we get an array of weblog items and iterate through it, printing some HTML with links to the web sites. We could just as easily have printed it as a comma-delimited file or anything else we could cook up in our silly little heads.
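
One caveat worth guarding against: when the Cosmos returns only a single item, XML::Simple hands back a plain hash reference rather than an array reference, and the @$items loop above will complain. XML::Simple's ForceArray option smooths that over; a minimal tweak (not in the original listing) looks like this:

use XML::Simple;

# Force 'item' to always come back as an array reference, even when
# Technorati returns a single match; otherwise @$items would blow up.
my $parsed_data = XMLin($xml, ForceArray => ['item']);
my $items = $parsed_data->{document}{item} || [];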

The most interesting part of all of this is the transfer and use of the information; Technorati allows us to see who has created links to our web site and use that data for free. Dave obviously learned how to share in kindergarten.

A Skeleton Key for Words

In addition to the lovely Cosmos API, Technorati provides us with an interface to query for weblog posts that contain a specified keyword. For instance, say you really like Perl; you can query the API periodically to get all the recent posts that contain “Perl.” I can imagine some handy uses for that: if you have keywords attached to posts in your weblog, you could have a Related Posts link that queries Technorati for other posts containing those keywords and shows a list of articles similar to yours.

The API to retrieve this information is also a REST interface, following the lead made by the Cosmos API. We can alter the code for the Cosmos API to provide access to this data:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;

my $key        = "your developer key";
my $searchTerm = "Perl";
my $restAPI    = "http://api.technorati.com/search?key=$key".
                 "&query=$searchTerm&format=xml";
my $xml = get($restAPI);
print "$xml\n";

Searching using the Keyword API returns more information in the XML stream, which gives some context to why it returned a match for a given item:

<context>
   <excerpt>
    Ben Trott has uploaded version 0.02 of XML::FOAF to CPAN.
    This is a<b>Perl</b> module designed to make it...
   </excerpt>
   <title>New version of XML::FOAF in CPAN</title>
   <link>http://rdfweb.org/mt/foaflog/archives/000033.html</link>
</context>

The returned data consists of an excerpt of words that appear near the keyword that was searched for (the keyword is also tagged as bold in the HTML—<b>Perl</b> in this example), the title of the article it was found in, and a URL to the item. The result also contains the same information about the weblog it was found in, such as inbound and outbound links.

We can slightly modify the previous code from the Cosmos API to display these related articles in a nice, concise format:

use XML::Simple;
my $parsed_data = XMLin($xml);
my $items = $parsed_data->{document}->{item};

print qq{<dl>\n};
for my $item (@$items) {
    my ($weblog, $context, $title, $link) =
      ($item->{weblog}->{name}, $item->{context}->{excerpt},
      $item->{context}->{title}, $item->{context}->{link});
    print qq{<dt><a href="$link">$weblog : $title</a></dt>};
    print qq{<dd>$context</dd>};
}
print qq{</dl>};

The Technorati API is a useful method for retrieving information about weblogs, and it can help in the aggregation of useful data. With the attention that is paid to Technorati, I’m sure that these interfaces will become even more robust and useful as the development progresses. With the information in this hack, you are capable of using and expanding on these interfaces, creating uses of the data that are even more interesting. Further information is available at the Technorati Developer Wiki (http://developers.technorati.com/wiki/) and mailing list (http://developers.technorati.com/mailman/listinfo/api-discuss).

—Eric Vitiello

Hack #71. Finding Related RSS Feeds

If you’re a regular reader of weblogs, you know that most syndicate their content in a format called RSS. By querying aggregated RSS databases, you can find related sites you may be interested in reading.

One of the fastest growing applications on the Web is the use of RSS feeds. Although there’s some contention regarding what RSS stands for (one definition of the acronym calls it “Really Simple Syndication” and another calls it “Rich Site Summary”), RSS feeds are XML documents that provide a feed of headlines from a web site (commonly a weblog or news site) that can be processed easily by a piece of software called a news aggregator. News aggregators let you subscribe to content from a multitude of web sites, sending the program out to check for new content rather than requiring you to go looking for it yourself.

RSS feeds are like potato chips, though. Once you subscribe to one, you find yourself grabbing one after another. It would be nice if you could supply a list of feeds you already read to a robot and have it go out and find related feeds in which you might also be interested.

Filling Up the Toolbox

We’re going to need a number of tools to get this script off the ground. Also, we’ll be calling on a couple of web services, namely those at Syndic8 (http://www.syndic8.com) and Technorati (http://www.technorati.com).

Syndic8 is a catalog of feeds maintained by volunteers, and it contains quite a bit of information on each feed. It also catalogs feeds for sites created by people other than the site owners, so even if a particular site might not have a feed, Syndic8 might be able to find one anyway. Also, Syndic8 employs several categorization schemes; so, given one feed, we might be able to find others in its category. Since Syndic8 offers an XML-RPC web service, we can call upon this directory for help.

Technorati is a search engine and a spider of RSS feeds and weblogs. Among other things, it indexes links between weblogs and feeds, and it maps the relationships between sites. So, while we’re looking for feeds, Technorati can tell us which sites link to each other. Since it supports a simple URL-based API that produces XML, we can integrate this into our script fairly easily.

Let’s gather some tools and start the script:

#!/usr/bin/perl -w
use strict;
use POSIX;
use Memoize;
use LWP::Simple;
use XMLRPC::Lite;
use XML::RSS;
use HTML::RSSAutodiscovery;

use constant SYNDIC8_ID => 'syndic8_id';
use constant FEED_URL   => 'feed_url';
use constant SITE_URL   => 'site_url';

This script starts off with some standard Perl safety features. The Memoize module is a useful tool we can use to cache the results of functions so that we aren’t constantly rerequesting information from web services. LWP::Simple allows us to download content from the Web; XMLRPC::Lite allows us to call on XML-RPC web services; XML::RSS allows us to parse and extract information from RSS feeds themselves; and HTML::RSSAutodiscovery gives us a few tricks to locate a feed for a site when we don’t know its location.

The rest of this preamble consists of a few constants we’ll use later. Now, let’s do some configuration:

our $technorati_key = "your Technorati key";
our $ta_url         = 'http://api.technorati.com';
our $ta_cosmos_url  = "$ta_url/cosmos?key=$technorati_key&url=";

our $syndic8_url = 'http://www.syndic8.com/xmlrpc.php';
our $syndic8_max_results = 10;

my @feeds =
  qw(
   http://www.macslash.com/macslash.rdf
   http://www.wired.com/news_drop/netcenter/netcenter.rdf
   http://www.cert.org/channels/certcc.rdf
  );

Notice that, like many web services, the Technorati API requires you to sign up for an account and be assigned a key string in order to use it (http://www.technorati.com//members/apikey.html). You might also want to check out the informal documentation for this service (http://www.sifry.com/alerts/archives/000288.html). After we set our API key, we construct the URL we’ll be using to call upon the service.

Next, we set up the URL for the Syndic8 XML-RPC service, as well as a limit we’ll use later for restricting the number of feeds we want the robot to look for at once.

Finally, we set up a list of favorite RSS feeds to use in digging for more feeds. With configuration out of the way, we have another trick to use:

map { memoize($_) }
  qw(
     get_ta_cosmos
     get_feed_info
     get_info_from_technorati
     get_info_from_rss
    );

This little map statement sets up the Memoize module for us so that the mentioned function names will have their results cached. This means that, if any of the four functions in the statement are called with the same parameters throughout the program, the results will not be recomputed but will be pulled from a cache in memory. This should save a little time and use of web services as we work.

Next, here’s the main driving code of the script:

my $feed_records = [];
for my $feed (@feeds) {
  my %feed_record = (url=>$feed);
  $feed_record{info}    = get_feed_info(FEED_URL, $feed);
  $feed_record{similar} = collect_similar_feeds($feed_record{info});
  $feed_record{related} = collect_related_feeds($feed_record{info});
  push @$feed_records, \%feed_record;
}

print html_wrapper(join("<hr />\n",
                   map { format_feed_record($_) }
                   @$feed_records));

This loop runs through each of our favorite RSS feeds and gathers records for each one. Each record is a hash, whose primary keys are info, similar, and related. info will contain basic information about the feed itself; similar will contain records about feeds in the same category as this feed; and related will contain records about feeds that have linked to items from the current feed.

Now, let’s implement the functions that this code needs.

Getting the Dirt on Feeds

The first thing we want to do is build a way to gather information about RSS feeds, using our chosen web services and the feeds themselves:

sub get_feed_info {
  my ($type, $id) = @_;
  return {} if !$id;

  my ($rss, $s_info, $t_info, $feed_url, $site_url);

  if ($type eq SYNDIC8_ID) {
    $s_info = get_info_from_syndic8($id) || {};
    $feed_url = $s_info->{dataurl};
  } elsif ($type eq FEED_URL) {
    $feed_url = $id;
  } elsif ($type eq SITE_URL) {
    my $rss_finder = new HTML::RSSAutodiscovery(  );
    eval {
      ($feed_url) = map { $_->{href} } @{$rss_finder->locate($id)};
    };
  }

  $rss = get_info_from_rss($feed_url) || {};
  $s_info ||= get_info_from_syndic8($feed_url) || {};
  $site_url = $rss->{channel}{link} || $s_info->{dataurl};

  $t_info = get_info_from_technorati($site_url);

  return {url=>$feed_url, rss=>$rss, syndic8=>$s_info, technorati=>$t_info};
}

This function gathers basic information on a feed. It accepts several different forms of identification for a feed: the Syndic8 feed internal ID number, the URL of the RSS feed itself, and the URL of a site that might have a feed. The first parameter indicates which kind of identification the function should expect (using the constants we defined at the beginning of the script), and the second is the identification itself.

So, we must first figure out a URL to the feed from the identification given. With a Syndic8 feed ID, the function tries to grab the feed’s record via the Syndic8 web service and then get the feed URL from that record. If a feed URL is given, great; use it. Otherwise, if a site URL is given, we use the HTML::RSSAutodiscovery module to look for a feed for this site.

Once we have the feed URL, we get and parse the feed, grab information from Syndic8 if we haven’t already, and then get feed information from Technorati. All of this information is then collected into a hash and returned. You might want to check out the documentation for the Syndic8 and Technorati APIs to learn what information each service provides on a feed.

Moving on, let’s see what it takes to get information from Syndic8:

sub get_info_from_syndic8 {
  my $feed_url = shift;
  return {} if !$feed_url;

  my $result = {};
  eval {
    $result = XMLRPC::Lite->proxy($syndic8_url)
      ->call('syndic8.GetFeedInfo', $feed_url)->result(  ) || {};
  };
  return $result;
}

Here, we expect a feed URL and return empty-handed if one isn’t given. If a feed URL is given, we simply call the Syndic8 web service method syndic8.GetFeedInfo with the URL to our feed and catch the results. One thing to note is that we wrap this call in an eval block, which keeps any otherwise fatal errors in the call or in the XML parsing from exiting the script. In the case of such an error, we simply return an empty record.
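
If you would like to know why a particular lookup came back empty, the same pattern can report the trapped error before moving on. A small variation, not part of the original script, might be:

eval {
    $result = XMLRPC::Lite->proxy($syndic8_url)
      ->call('syndic8.GetFeedInfo', $feed_url)->result(  ) || {};
};
# $@ holds the error message if the eval'd block died.
warn "Syndic8 lookup for $feed_url failed: $@" if $@;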

Grabbing information from Technorati is a little more complex, if only because we’ll be parsing the XML resulting from calls without the help of a convenience package such as XMLRPC::Lite. But let’s get on with that:

sub get_info_from_technorati {
  my $site_url = shift;
  return {} if !$site_url;

  my $xml = get_ta_cosmos($site_url);

  my $info = {};
  if ($xml =~ m{<result>(.*?)</result>}mgis) {
    my $xml2 = $1;
    $info = extract_ta_bloginfo($xml2);
  }
  return ($info->{lastupdate} =~ /1970/) ? {} : $info;
}

Here, we make a request to the web service’s cosmos method with the site URL parameter. Using a regular expression, we look for the contents of a result tag in the response to our query and call upon a convenience function to extract the XML data into a hash. We also check to make sure the date doesn’t contain 1970, a value that occurs when a record isn’t found.

The implementation of our first convenience function goes like so:

sub get_ta_cosmos {
  my $url = shift;
  return get($ta_cosmos_url.$url);
}

This is just a simple wrapper around LWP::Simple’s get function, done so that we can memoize it without interfering with other modules’ use of the same function. Next, here’s how to extract a hash from the XML data:

sub extract_ta_bloginfo {
  my $xml = shift;
  my %info = (  );

  if ($xml =~ m{<weblog>(.*?)</weblog>}mgis) {
    my ($content) = ($1||'');
    while ($content =~ m{<(.*?)>(.*?)</\1>}mgis) {
      my ($name, $val) = ($1||'', $2||'');
      $info{$name} = $val;
    }
  }

  return \%info;
}

With another couple of regular expressions, we look for the weblog tag in a given stream of XML and extract all of the tags it contains into a hash. Hash keys are tag names, and the values are the contents of those tags. The resulting hash contains basic information about a weblog cited in the Technorati results. We’ll also use this in another function in a little bit.

We can extract information from both services, but how about feeds themselves? We can extract feeds with a simple function:

sub get_info_from_rss {
  my $feed_url = shift;
  return {} if !$feed_url;

  my $rss = new XML::RSS(  );
  eval {
    $rss->parse(get($feed_url));
  };
  return $rss;
}

Again, we expect a feed URL and return empty handed if one is missing. If a feed URL is given, we download the contents of that URL and use the XML::RSS module to parse the data. Notice that we use another eval statement to wrap this processing so that parsing errors do not exit our script. If everything goes well, we return an instance of XML::RSS.

Our basic feed information-gathering machinery is in place now. The next thing to tackle is gathering feeds. Let’s start with employing the Technorati API to find feeds that have referred to a given feed:

sub collect_related_feeds {
  my $feed_info = shift;
  my $site_url = $feed_info->{rss}{channel}{link} || $feed_info->{url};
  my %feeds = (  );

We start off by expecting a feed information record, as produced earlier by our get_feed_info function. From this record, we get the site URL for which the feed is a summary. We try two options: first, we look for the site link inside the parsed RSS feed itself; failing that, we fall back to the feed URL from the record, so that we at least have something to go on.

With that, we call on the Technorati API to get a list of related feeds:

  my $xml = get_ta_cosmos($site_url);
  while ($xml =~ m{<item>(.*?)</item>}mgis) {
    my $xml2 = $1;
    my $ta_info = extract_ta_bloginfo($xml2);

    my $info = ($ta_info->{rssurl} ne '') ?
      get_feed_info(FEED_URL, $ta_info->{rssurl}) :
      get_feed_info(SITE_URL, $ta_info->{url});

With our previous call to the Technorati API, we were gathering information about a feed. This time, we’re using the same call to gather information about related feeds. Thanks to Memoize, we should be able to reuse the results of a given API call for the same site URL over and over again, though we actually call upon the API only once.

So, we use a regular expression to iterate through item tags in the resulting data and extract weblog information from each result. Then, we check to see if a URL to this weblog’s RSS feed was supplied. If so, we use it to get a feed record on this site; otherwise, we use the site URL and try to guess where the feed is.

After getting the record, we grab the rest of the information in the item tag:

    $info->{technorati} = $ta_info;

    while ($xml2 =~ m{<(.*?)>(.*?)</\1>}mgis) {
      my ($name, $val) = ($1||'', $2||'');
      next if $name eq 'weblog';
      $info->{technorati}{$name} = $val;
    }

Once more, we use a regular expression to convert from tag names and contents to a hash. The hash contains information about the weblog’s relationship to the feed we’re considering, among other things.

To finish up, let’s add this record to a hash (to prevent duplicate records) and return that hash when we’re all done:

      $feeds{$info->{url}} = $info;
  }

  return \%feeds;
}

The returned hash will contain feed URLs as keys and feed records as values. Each of these feeds should be somewhat related to the original feed, if only because they linked to its content at one point.

Now, let’s go on to use the Syndic8 API to find feeds in a category:

sub collect_similar_feeds {
  my $feed_info = shift;
  my %feeds = (  );

  my $categories = $feed_info->{syndic8}->{Categories} || {};
  for my $cat_scheme (keys %{$categories}) {
    my $cat_name = $categories->{$cat_scheme};

The first thing we do is expect a feed information record and try to grab a list of categories from it. This will be a hash whose keys are codes that identify categorization schemes and whose values identify category titles. We’ll loop through each of these pairs and gather feeds in each category:

    my $feeds = XMLRPC::Lite->proxy($syndic8_url)
      ->call('syndic8.GetFeedsInCategory', $cat_scheme, $cat_name)
        ->result(  ) || [];

    # Limit the number of feeds handled in any one category
    $feeds = [ @{$feeds}[0..$syndic8_max_results] ]
      if (scalar(@$feeds) > $syndic8_max_results);

Once we have a category scheme and title, we call on the Syndic8 API web service to give us a list of feeds in this category. This call returns a list of internal Syndic8 feed ID numbers, which is why we built in the ability to use them to locate feeds earlier, in our get_feed_info function. Also, we limit the number of results used, based on the configuration variable at the beginning of the script.

Next, let’s gather information about the feeds we’ve found in this category:

    for my $feed (@$feeds) {
      my $feed_info = get_feed_info(SYNDIC8_ID, $feed);
      my $feed_url = $feed_info->{syndic8}{dataurl};
      next if !$feed_url;
      $feeds{"$cat_name ($cat_scheme)"}{$feed_url} = $feed_info;
    }
  }

  return \%feeds;
}

Using the Syndic8 feed ID returned for each feed, we get a record for each and add it to a hash whose keys are based on the category and the feed URL. This is an attempt to make sure there is a list of unique feeds for each category. Finally, we return the results of this process.

Reporting on Our Findings

At this point, we can gather information about feeds and use the Syndic8 and Technorati APIs to dig for feeds in similar categories and feeds related by linking. Now, let’s produce an HTML page for what we find for each of our favorite feeds:

sub html_wrapper {
  my $content = shift;
  return qq^
    <html>
      <head>
        <title>Digging for RSS feeds</title>
      </head>
      <body>
        $content
      </body>
    </html>
    ^;
}

We just put together a simple HTML shell here to contain our results. It wraps whatever content it is given with a simple HTML skeleton. The next step, since our basic unit of results is the feed information record, is to come up with a means of formatting one:

sub format_feed_info {
  my $info = shift;
  my ($feed_url, $feed_title, $feed_link) =
    ($info->{url}, feed_title($info), feed_link($info));
  return qq^<a href="$feed_link">$feed_title</a>
    (<a href="$feed_url">RSS</a>)^;
}

This doesn’t do much with the wealth of data contained in a feed information record, but for now we simply construct a link to the site and a link to the feed. We’ll use this to format the results of our digging for a given feed:

sub format_feed_record {
  my $record = shift;
  my $out = '';
  $out .= qq^
    <div class="record">
      ^;

  $out .= qq^<h2 class="main_feed">^.
    format_feed_info($record->{info})."</h2>\n";

The first thing we do here is open a div tag to contain these particular record results. Then, we format the record that describes the favorite feed under investigation. Next, we format the results of looking for related feeds:

  my $related = $record->{related};
  if (keys %{$related}) {
    $out .= "<h3>Feeds related by links:</h3>\n<ul>\n";
    $out .= join
      ('',
       map { "<li>".format_feed_info($related->{$_})."</li>\n" }
       sort keys %{$related})."\n\n";
    $out .= "</ul>\n";
  }

This produces a bulleted list of feeds discovered, as related by linking to our feed. Next, we include the feeds related by category:

  my $similar = $record->{similar};
  if (keys %{$similar}) {
    $out .= "<h3>Similar feeds by category:</h3>\n<ul>\n";
    for my $cat (sort keys %{$similar}) {
      $out .= "<li>$cat\n<ul>";
      $out .= join
        ('',
         map { "<li>".format_feed_info($similar->{$cat}{$_})."</li>\n" }
         sort keys %{$similar->{$cat}})."\n\n";
      $out .= "</ul>\n</li>\n";
    }
    $out .= "</ul>\n";
  }

A little bit more involved, this produces a set of nested lists, with the outer bullets describing categories and the inner bullets describing feeds belonging to the categories. Finally, let’s wrap up our results:

  $out .= qq^
    </div>
      ^;

  return $out;
}

We now have just a few loose ends to tie up. Some feed titles have a bit of extra whitespace in them, so we’ll need to tidy that:

sub trim_space {
  my $val = shift;
  $val=~s/^\s+//;
  $val=~s/\s+$//g;
  return $val;
}

And, since there’s a lot of variability in our results as to where a feed’s title is, we employ several options in grabbing it:

sub feed_title {
  my $feed_info = shift;
  return trim_space
    (
     $feed_info->{rss}{channel}{title} ||
     $feed_info->{syndic8}{sitename} ||
     $feed_info->{technorati}{name} ||
     $feed_info->{url} ||
     '(untitled)'
    );
}

As with the title, there are many places where a link to the feed can be found, so we do something similar with it:

sub feed_link {
  my $feed_info = shift;
  return trim_space
    (
     $feed_info->{rss}{channel}{link} ||
     $feed_info->{syndic8}{siteurl} ||
     $feed_info->{technorati}{url} ||
     $feed_info->{url} ||
     ''
    );
}

Figure 4-7 shows a sample of the generated HTML results.

Figure 4-7. A sampling of possibly related sites

With the use of two web services, we have a pretty powerful robot with which to dig for more interesting feeds. This hack makes quite a few calls to web services, so, although you might want to run it every now and then to find updates, you might want to go easy on it.

Hacking the Hack

A few things are left as exercises for the reader. Most notably, we don’t make much use of all the information gathered into a feed information record. In our report, we simply display a link to a site and a link to its feed. In fact, this record also contains all the most recent headlines for a feed, as well as the wealth of information provided by the Syndic8 and Technorati APIs. With some homework, this tool could be expanded even further to make use of all of this additional information.

—l.m.orchard

Hack #72. Automatically Finding Blogs of Interest

An easy way to find interesting new sites is to peruse an existing site’s blogroll: a listing of blogs they read regularly. Let’s create a spider to automate this by looking for keywords in the content of outbound links.

I enjoy reading blogs, but with the demands of the day, I find it difficult to read the dozen or so I like most, let alone discover new ones. I often have good luck when clicking through the blogrolls of writers I enjoy.

I decided to set out and automate this process, by creating a script that starts at one of my favorite sites and then visits each outbound link that site has to offer. As the script downloads each new page, it’ll look through the content for keywords I’ve defined, in hopes of finding a new daily read that matches my own interests.

The Code

Save the following script as blogfinder.pl:

#!/usr/bin/perl  -w
use strict; $|++;

use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

# where should results go?
my $result_file  = "./result.html";
my $keywords_reg = qr/pipe-delimited search terms/;
my $starter_url  = "your favorite blog here";

# open and create the result.html file.
open(RESULT, ">$result_file") or die "Couldn't create: $!\n";
print RESULT "<html><head><title>Spider Findings</title></head><body>\n";

# our workhorse for access.
my $ua = LWP::UserAgent->new;
print "\nnow spidering: $starter_url\n";

# begin our link searching. LinkExtor takes a 
# subroutine argument to handle found links,
# and then the actual data of the page. 
HTML::LinkExtor->new(
  sub {
        my ($tag, %attr) = @_;
        return if $tag ne 'a';

        # make any href relative link into
        # an absolute value, and add to an
        # internal list of links to check out.
        my @links = map { url($_, $starter_url)->abs(  ) }
                      grep { defined } @attr{qw/href/};

        # make 'em all pretty...
        foreach my $link (@links) {
           print " + $link\n"; # hello!
           my $data = $ua->get($link)->content;
           if ($data =~ m/$keywords_reg/i) {
              open(RESULT, ">>$result_file");
              print RESULT "<a href=\"$link\">$link</a><br>\n";
              close(RESULT); # one match printed, yes!
           }
        }

# and now, the actual content that
# HTML::LinkExtor goes through...
})->parse(
  do {
     my $r = $ua->get($starter_url);
     $r->content_type eq "text/html" ? $r->content : "";
  }
);

print RESULT "</body></html>";
close RESULT; exit;

Once the LWP::UserAgent [Hack #10] object is created, we drop into the workhorse of the spider: the HTML::LinkExtor object (http://search.cpan.org/~gaas/HTML-Parser-3.34/lib/HTML/LinkExtor.pm) and its callback. The seed page is fetched first, and as HTML::LinkExtor parses it, each link it extracts is passed, in turn, to the callback, which downloads the linked page, looks for the magic keywords, and makes note of any matches in the newly created result.html file.

When the spider has finished its run, you will be left with an HTML file that contains links that match your search criteria. There is, of course, room for refinement. However, one thing I enjoy about this script is the subtle entropy that seems to arise in it. Through this unintended randomness, I am able to discover blogs I would never have discovered by other means. More often than not, such a discovery is one I would rather not have made. But every now and then, a real gem can be seen gleaming at the bottom of the trash heap that is so often our beloved Internet.

Running the Hack

The first thing you should do is replace the two lines at the top of the script with your favorite blog URL and a pipe-delimited (|) list of values, like so:

my $keywords_reg = qr/foaf|perl|os x/;
my $starter_url  = "http://myfavoriteblog.com";

The pipe is the equivalent of OR, so these lines mean “Spider myfavoriteblog.com and search for foaf OR perl OR os x.” If you know regular expressions, you can modify this even further to check for word boundaries (so that perl would not match amityperl, for instance). Once these two lines are configured, run the script, like so:

% perl blogfinder.pl
now spidering: http://www.myfavoriteblog.com
 + http://myfavoriteblog.com
 + http://www.luserinterface.net/index.cgi/colophon/
 + mailto:saf@luserinterface.net
 + http://jabber.org/
 + http://sourceforge.net/projects/gaim/
 + http://scottfallin.com/hacks/popBlosx.text

Once the script is finished spidering the outbound links, you’ll have a new file in the current directory, with a list of URLs that match your keyword criteria.

Hacking the Hack

There are a few ways to modify the hack, the most interesting of which is to add another level of link crawling to begin creating "blog neighborhoods” similar to the idea of "Six Degrees of Kevin Bacon” (http://www.wired.com/news/culture/0,1284,49343,00.html; see also an implementation by Mark Pilgrim based on Google search results: http://diveintomark.org/archives/2002/06/04/who_are_the_people_in_your_neighborhood). One of the easiest additions, however, involves stopping the spider from indexing more data than necessary.

As you can see from the sample output, the spider will look at any URI that has been put into an HTML A tag, which could involve email addresses, IRC and FTP servers, and so forth. Since the spider isn’t equipped to handle those protocols, telling it to skip over them is a simple modification:

foreach my $link (@links) {
    next unless $link =~ /^http/i;
    print " + $link\n"; # hello!
    my $data = $ua->get($link)->content;

Other possibilities could restrict spidering to third-party sites only (since you’re not interested in spidering your favorite site, but rather the sites it links to) or add an upper limit to the number of sites spidered (i.e., “spider as much as you can, to a maximum of 200 sites”).
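
As a rough sketch of that last idea (the $max_sites name and its value are made up for illustration), declare a counter alongside the other configuration variables at the top of the script and bail out of the callback once the cap is reached:

# near the other configuration variables:
my $max_sites = 200;   # hypothetical upper limit on pages fetched
my $fetched   = 0;

# then, inside the callback's foreach loop:
foreach my $link (@links) {
    next unless $link =~ /^http/i;          # skip mailto:, ftp:, and friends.
    return if ++$fetched > $max_sites;      # enough already for this run.
    print " + $link\n"; # hello!
    my $data = $ua->get($link)->content;
    if ($data =~ m/$keywords_reg/i) {
        open(RESULT, ">>$result_file");
        print RESULT "<a href=\"$link\">$link</a><br>\n";
        close(RESULT); # one match printed, yes!
    }
}

HTML::LinkExtor will still finish parsing the seed page, but no further pages are fetched once the counter passes the limit.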

—Scott Fallin

Hack #73. Scraping TV Listings

Freeing yourself from flipping through a weekly publication by visiting the TV Guide Online web site might sound like a good idea, but being forced to load heavy pages that show only a couple of hours at a time, cluttered with channels you don’t care about, isn’t exactly the utopia for which you were hoping.

To grab the latest TV listings from TV Guide Online (http://www.tvguide.com), we could write an HTML scraper from scratch using HTML::TableExtract [Hack #67] and similar modules, or we could go Borg on a script called tvlisting and assimilate it into our collective consciousness. Why reinvent the wheel if you don’t have to, right? The author of tvlisting, Kurt V. Hindenburg, has extensively reverse-engineered TV Guide Online’s dynamic site and created a script that can pull down all the TV listings for a whole day and output them in several different formats, including XML.

Grab tvlisting from http://www.cherrynebula.net/projects/tvlisting/tvlisting.html and follow the terse documentation to get it running on your platform. There are tons of options you can use when running tvlisting, most of which we won’t cover for sake of brevity. So, snoop around in the tvlisting code, as well as the included sample_rc file, and check out the various options available. For our purposes, we’ll modify the sample_rc file and use command-line arguments when we call the script. Open the sample_rc file and save it as tvlisting_config; then we’ll get started. Let’s look at a small portion of our new tvlisting_config file:

## To use this script as a
## CGI; please read CGI.txt
## Choices : $TRUE, $FALSE
$options{USE_CGI} = $FALSE;

## Choices : WGET, LYNX, CURL, LWPUSERAGENT
$options{GET_METHOD} = qw(LWPUSERAGENT);

## Choices : HTML, TEXT, LATEX, XAWTV, XML
$options{OUTPUT_FORMAT} = qw(XML);

## Choices : TVGUIDE
$options{INPUT_SOURCE} = qw(TVGUIDE);

### Attributes dealing with channels.
## Should channels be run through the filter?
## Choices : $TRUE, $FALSE
$options{FILTER_CHANNELS} = $TRUE;

## Filter by NAME and/or NUMBER?
$options{FILTER_CHANNELS_BY_NAME} = $FALSE;
$options{FILTER_CHANNELS_BY_NUMBER} = $TRUE;

## List of channels to OUTPUT
$options{FILTER_CHANNELS_BY_NAME_LIST} = 
   ["WTTV", "WISH", "WTHR", "WFYI", "WXIN", "WRTV", "WNDY", "WIPX"];

$options{FILTER_CHANNELS_BY_NUMBER_LIST} = 
   [qw( 2 3 4 5 6 7 9 11 12 14 15 16 18 28 29 30 31 32
        33 34 35 36 37 38 39 49 50 53 55 71 73 74 75 78)];

## Your personal Service ID, used by
## tvguide.com to localize your listings.
$options{SERVICE_ID} = 359508;

As you can see, there are many options available (the preceding listing is about half of what you’d see in a normal configuration file). Starting from the top, I set USE_CGI to $FALSE, GET_METHOD to LWPUSERAGENT, and OUTPUT_FORMAT to XML. You may have noticed that you can output to HTML as well, but I’m not crazy about the quality of its HTML output. The FILTER_ options allow us to choose only the channels we are interested in, rather than having to weed through hundreds of useless entries to find what we’re looking for. The most important option, SERVICE_ID, is what TV Guide Online uses to specify the stations and channel numbers that are available in your area. Without this option set correctly, you’ll receive channels that do not map to the channels on your TV, and that’s no fun. The Readme.txt file has some further information on how to hunt this ID down.

After configuration, it’s simply a matter of running the script to get an output of the current hour’s listings for just the channels you’re interested in. If you specified TEXT output, you’ll see something like this (severely truncated for readability):

% bin/tvlisting
            6:30 PM             7:00 PM             7:30 PM   
           +---------+---------+---------+---------+---------+
76 WE      Felicity             Hollywood Wives
77 OXYGN   Can You Tell?        Beautiful

The XML output format produces the following snippet, which is readily parseable:

% bin/tvlisting
<Channel Name="TOON" Number="53">
  <Shows Title="Dexter's Laboratory" Sequence="1" Duration="6" />
  <Shows Title="Ed, Edd n Eddy" Sequence="2" Duration="6" />
  <Shows Title="Courage the Cowardly Dog" Sequence="3" Duration="6" />
  <Shows Title="Pokemon" Sequence="4" Duration="6" />
</Channel>
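
Since the output is plain XML, pulling the titles back out takes only a few lines. Here's a rough sketch; it assumes the full output wraps its <Channel> elements in a single root element and that you have saved the output to a file first (the showtitles.pl name is just for illustration):

#!/usr/bin/perl -w
# e.g.:  bin/tvlisting > listings.xml && perl showtitles.pl listings.xml
use strict;
use XML::Simple;

my $xml  = do { local $/; <> };   # slurp the whole file (or STDIN).
my $data = XMLin($xml, ForceArray => ['Channel', 'Shows']);

foreach my $channel (@{ $data->{Channel} }) {
    print "$channel->{Number} $channel->{Name}\n";
    print "   $_->{Title}\n" foreach @{ $channel->{Shows} };
}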

Even though you can filter by channels within tvlisting, there doesn’t seem to be a way to filter by type of program, such as all “horror” movies or anything with Mister Miyagi. For that, we’d have to build our own quick scraper.

The Code

Save the following code as tvsearch.pl:

#!/usr/bin/perl -w
use strict;
use Getopt::Long;
use LWP::Simple;
use HTML::TableExtract;
my %opts;

# our list of tvguide.com categories.
my @search_categories = ( qw/ action+%26+adventure adult Movie
                              comedy drama horror mystery+%26+suspense
                              sci-fi+%26+paranormal western Sports
                              Newscasts+%26+newsmagazines health+%26+fitness
                              science+%26+technology education Children%27s
                              talk+%26+discussion soap+opera
                              shopping+%26+classifieds music / );

# instructions for if the user doesn't
# pass a search term or category. bah.
sub show_usage {
 print "You need to pass either a search term (--search)\n";
 print "or use one of the category numbers below (--category):\n\n";
 my $i=1; foreach my $cat (@search_categories) {
    $cat =~ s/\+/ /g; $cat =~ s/%26/&/; $cat =~ s/%27/'/;
    print "  $i) ", ucfirst($cat), "\n"; $i++;
 } exit;
}

# define our command-line flags (long and short versions).
GetOptions(\%opts, 'search|s=s',      # a search term.
                   'category|c=s',    # a search category.
); unless ($opts{search} || $opts{category}) { show_usage; }

# create some variables for use at tvguide.com.
my ($day, $month) = (localtime)[3..4]; $month++;
my $start_time = "8:00";         # this time is in military format
my $time_span  = 20;             # number of hours of TV listings you want
my $start_date = "$month\/$day"; # set the current month and day
my $service_id = 61058;          # our service id (see tvlisting readme)
my $search_phrase = undef;       # final holder of what was searched for
my $html_file = undef;           # the downloaded data from tvguide.com
my $url = 'http://www.tvguide.com/listings/search/SearchResults.asp';

# search by category.
if ($opts{category}) {
   my $id = $opts{category}; # convenience.
   die "Search category must be a number!" unless $id =~ /\d+/;
   die "Category ID was invalid" unless ($id >= 1 && $id <= 19);
   $html_file = get("$url?l=$service_id&FormCategories=".
                    "$search_categories[$id-1]");
   die "get(  ) did not return as we expected.\n" unless $html_file;
   $search_phrase = $search_categories[$id-1];
}
elsif ($opts{search}) { 
   my $term = $opts{search}; # convenience.
   $html_file = get("$url?I=$service_id&FormText=$term");
   die "get(  ) did not return as we expected.\n" unless $html_file;
   $search_phrase = $term;
}

# now begin printing out our matches.
print "Search Results for '$search_phrase':\n\n";

# create a new table extract object and pass it the
# headers of the tvguide.com table in our data. 
my $table_extract =
   HTML::TableExtract->new(
        headers => ["Date","Start Time", "Title", "Ch#"],
            keep_html => 1 );
$table_extract->parse($html_file);

# now, with our extracted table, parse.
foreach my $table ($table_extract->table_states) {
    foreach my $cols ($table->rows) {

        # this is not the best way to do this...
        if(@$cols[0] =~ /Sorry your search found no matches/i)
          { print "No matches to found for your search!\n"; exit; }

        # get the date.
        my $date = @$cols[0];
        $date =~ s/<.*>//g;       $date =~ s/\s*//g;
        $date =~ /(\w*)\D(\d*)/g; $date = "$1/$2";

        # get the time.
        my $time = @$cols[1];
        $time =~ m/(\d*:\d*\s+\w+)/;
        $time = $1;

        # get the title, detail_url, detail_number, and station.
        @$cols[2] =~ /href="(.*\('\d*','(\d*)','\d*','\d*','(.*)',.*)"/i;
        my ($detail_url, $detail_num, $channel) = ($1, $2, $3);
        my $title = @$cols[2]; $title =~ s/<.*>//g;
        $title =~ /(\b(.*)\b)/g; $title = $1;

        # get channel number
        my $channel_num = @$cols[3];
        $channel_num =~ m/>\s*(\d*)\s*</;
        $channel_num = $1;

        # turn the evil Javascript URL into a normal one.
        $detail_url =~ /javascript:cu\('(\d+)','(\d+)'/;
        my $iSvcId = $1; my $iTitleId = $2;
        $detail_url = "http://www.tvguide.com/listings/".
                      "closerlook.asp?I=$iSvcId&Q=$iTitleId";

        # now, print the results.
        print " $date at $time on chan$channel_num ($channel): $title\n";
        print "    $detail_url\n\n";
    }
}

Running the Hack

A search for Farscape looks something like this:

% perl tvsearch.pl --search  farscape
Search Results for 'farscape':

 Mon/28 at 12:00 AM on chan62 (SCI-FI): Farscape: What Was Lost: Sacrifice
    http://www.tvguide.com/listings/closerlook.asp?I=61058&Q=3508575

 Mon/4 at 12:00 AM on chan62 (SCI-FI): Farscape: What Was Lost: Resurrection
    http://www.tvguide.com/listings/closerlook.asp?I=61058&Q=3508576

—William Eastler

Hack #74. What’s Your Visitor’s Weather Like?

You have a web site, as most people do, and you’re interested in getting a general idea of what your visitors’ weather is like. Want to know if you get more comments when it’s raining or sunny? With the groundwork laid in this hack, that and other nonsense will be readily available.

When you’re spidering, don’t consider only data available on the Web. Sometimes, the data is right under your nose, perhaps on your own server or even on your own hard drive [Hack #82]. This hack demonstrates the large amount of information available, even when you have only a small amount of your own data to start with. In this case, we’re looking at a web server’s log file, taking the IP address of the last few visitors’ sites, using one database to look up the geographical location of that IP address, and then using another to find the weather there. It’s a trivial example, perhaps, but it’s also quite nifty. For example, you could easily modify this code to greet visitors to your site with commiserations about the rain.

For the geographical data, we’re going to use the Perl interface to the CAIDA project (http://www.caida.org/tools/utilities/netgeo/NGAPI/index.xml); for the weather data, we’re using the Weather::Underground module, which utilizes the information at http://www.wunderground.com.

The Code

Copy this code, changing the emphasized line to reflect the path to your Apache installation’s access_log. Here, mine is in the same directory as the script:

#!/usr/bin/perl -w
#
# Ben Hammersley ben@benhammersley.com
# Looks up the real-world location of visiting IPs
# and then finds out the weather at those places
#

use strict;
use CAIDA::NetGeoClient;
use Weather::Underground;
use Geography::Countries;

my $apachelogfile = "access_log";
my $numberoflines = 10;
my $lastdomain    = "";

# Open up the logfile.
open (LOG, "<$apachelogfile") or die $!;

# Place all the lines of the logfile
# into an array, but in reverse order.
my @lines = reverse <LOG>;

# Start our HTML document.
print "<h2>Where my last few visitors came from:</h2>\n<ul>\n";

# Go through each line one
# by one, setting the variables.
my $i; foreach my $line (@lines) {
    my ($domain,$rfc931,$authuser,$TimeDate,
        $Request,$Status,$Bytes,$Referrer,$Agent) =
        $line =~ /^(\S+) (\S+) (\S+) \[([^\]\[]+)\] \"([^"]*)\" (\S+) (\S+) \"?([^"]*)\"? \"([^"]*)\"/o;

    # If this record is one we saw
    # the last time around, move on.
    next if ($domain eq $lastdomain);

    # And now get the geographical info.
    my $geo     = CAIDA::NetGeoClient->new(  );
    my $record  = $geo->getRecord($domain);
    my $city    = ucfirst(lc($record->{CITY}));
    my $region  = "";

    # Check to see if there is a record returned at all.
    unless ($record->{COUNTRY}) { $lastdomain = $domain; next; }

    # If city is in the U.S., use the state as the "region". 
    # Otherwise, use Geography::Countries to munge the two letter
    # code for the country into its actual name. (Thanks to
    # Aaron Straup Cope for this tip.)
    if ($record->{COUNTRY} eq "US") {
        $region = ucfirst(lc($record->{STATE}));
    } else { $region = country($record->{COUNTRY}); }

    # Now get the weather information.
    my $place   = "$city, $region";
    my $weather = Weather::Underground->new(place => $place);
    my $data    = $weather->getweather(  );
    next unless $data; $data = $data->[0];

    # And print it for our HTML.
    print " <li>$city, $region where it is $data->{conditions}.</li>\n";

    # Record the last domain name
    # for the repeat prevention check
    $lastdomain = $domain;

    # Check whether you're not at the limit, and if you are, finish.
    if ($i++ >= $numberoflines-1) { last; }
}

print "</ul>";

The code loads up the access_log, reverses it to put the last accesses at the top, and then goes through the resulting list, line by line. First, it runs the line through a regular expression:

my ($domain,$rfc931,$authuser,$TimeDate,$Request,$Status,$Bytes,$Referrer,$Agent) =
    $line =~ /^(\S+) (\S+) (\S+) \[([^\]\[]+)\] \"([^"]*)\" (\S+) (\S+) \"?([^"]*)\"? \"([^"]*)\"/o;

This splits the line into its different sections and is based on Apache’s combined log format. We’ll be using only the first variable (the domain itself) from these results, but, because this regular expression is so useful, I include it for your cannibalistic pleasure.

Anyhow, we take the domain and pass it to the CAIDA module, retrieving a result and checking whether that result is useful. If it’s not useful, we go to the next line in the access_log. This highlights an important point when using third-party databases: you must always check for a failed query. Indeed, it might even be a good idea to treat a successful query as the exception rather than the rule.

Assuming we have a good result, we need to detect whether the country is the U.S. If it is, we make $region the value of the U.S. state; otherwise, we pass the two-letter country code to the country function from the Geography::Countries module, which converts it into the country’s full name.

Running the Hack

Here’s a typical run of the script, invoked on the command line:

% perl weather.pl
<h2>Where my last few visitors came from:</h2>

<ul>
 <li>London, UK, where it is cloudy</li>
 <li>New York, NY, where it is sunny</li>
</ul>

Using and Hacking the Hack

I have this script installed on my weblog using an Apache server-side include. This is probably a bad idea, given the potential for slow responses from CAIDA and Weather Underground, but it does allow for completely up-to-date information. A more sensible approach might be to change the script to produce a static file and run it from cron [Hack #90] every few minutes.

If you’re sure of fast responses, and if you have a dynamically created page, it would be fun to customize that page based on the weather at the reader’s location. Pithy comments about the rain are always appreciated. Tweaking the Weather Underground response to give you the temperature instead of a descriptive string creates the possibility of dynamically selecting CSS stylesheets, so that colors change based on the temperature. Storing the weather data over a period of time gives you the possibility of creating an “average readership temperature” or the amount of rain that has fallen on your audience this week. These would be fun statistics for some and perhaps extremely useful for others.
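
As a sketch of the stylesheet idea, and assuming the hash returned by getweather(  ) carries a temperature_fahrenheit key alongside conditions (check the Weather::Underground documentation for your installed version), the selection could be as simple as:

# pick a stylesheet based on the visitor's local temperature.
my $temp       = $data->{temperature_fahrenheit};  # assumed key name.
my $stylesheet = "mild.css";
if    (defined $temp && $temp >= 85) { $stylesheet = "hot.css";  }
elsif (defined $temp && $temp <= 40) { $stylesheet = "cold.css"; }
print qq{<link rel="stylesheet" type="text/css" href="$stylesheet" />\n};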

—Ben Hammersley

Hack #75. Trendspotting with Geotargeting

Compare the relative popularity of a trend or fashion in different locations, using only Google and Directi search results.

One of the latest buzzwords on the Internet is geotargeting, which is just a fancy name for the process of matching hostnames (e.g., www.oreilly.com) to addresses (e.g., 208.201.239.36) to country names (e.g., USA). The whole thing works because there are people who compile such databases and make them readily available. This information must be compiled by hand or at least semiautomatically, because the DNS system that resolves hostnames to addresses does not store it in its distributed database.

While it is possible to add geographic location data to DNS records, it is highly impractical to do so. However, since we know which addresses have been assigned to which businesses, governments, organizations, or educational establishments, we can assume with a high probability that the geographic location of the institution matches that of its hosts, or at least most of them. For example, if the given address belongs to the range of addresses assigned to British Telecom, then it is highly probable that it is used by a host located within the territory of the United Kingdom.

Why go to such lengths when a simple DNS lookup (e.g., nslookup 208.201.239.36) gives the name of the host, and in that name we can look up the top-level domain (e.g., .pl, .de, or .uk) to find out where this particular host is located? There are four good reasons for this:

  • Not all lookups on addresses return hostnames.

  • A single address might serve more than one virtual host.

  • Some country domains are registered by foreigners and hosted on servers on the other side of the globe.

  • .com, .net, .org, .biz, or .info domains tell us nothing about the geographic location of the servers they are hosted on.

That’s where geotargeting can help.

Geotargeting is by no means perfect. For example, if an international organization like AOL gets a large chunk of addresses that it uses not only for servers in the USA, but also in Europe, the European hosts might be reported as being based in the U.S. Fortunately, such aberrations do not constitute a large percentage of addresses.

The first users of geotargeting were advertisers, who thought it would be a neat idea to serve local advertising. In other words, if a user visits the New York Times site, the ads they see depend on their physical location. Those in the U.S. might see ads for the latest Chrysler car, while those in Japan might see ads for i-mode; users from Poland might see ads for Ekstradycja (a cult Polish police TV series), and those in India might see ads for the latest Bollywood movie. While such use of geotargeting might maximize the return on the invested dollar, it also goes against the idea behind the Internet as a global network. (In other words, if you are addressing a global audience, don’t try to hide from it by compartmentalizing it.) Another problem with geotargeted ads is that they follow the viewer. Advertisers must love it, but it is annoying to the user; how would you feel if you saw the same ads for your local burger bar everywhere you went in the world?

Another application of geotargeting is to serve content in the local language. The idea is really nice, but it’s often poorly implemented and takes a lot of clicking to get to the pages in other languages. The local pages have a habit of returning out of nowhere, especially after you upgrade your web browser to a new version. A much more interesting application of geotargeting is analysis of trends, which is usually done in two ways: via analysis of server logs and via analysis of results of querying Google.

Server log analysis is used to determine the geographic location of your visitors. For example, you might discover that your company’s site is being visited by a large number of people from Japan. Perhaps that number is so significant that it will justify the rollout of a Japanese version of your site. Or it might be a signal that your company’s products are becoming popular in that country and you should spend more marketing dollars there. But if you run a server for U.S. expatriates living in Tokyo, the same information might mean that your site is growing in popularity and you need to add more information in English. This method is based on the list of addresses of hosts that connect to the server, stored in your server’s access log. You could write a script that looks up their geographic location to find out where your visitors come from. It is more accurate than looking up top-level domains, although it’s a little slower due to the number of DNS lookups that need to be done.
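
A bare-bones sketch of that kind of log analysis, reusing CAIDA::NetGeoClient from the previous hack (the access_log path and the tally bookkeeping here are only illustrative):

#!/usr/bin/perl -w
use strict;
use CAIDA::NetGeoClient;

my $geo = CAIDA::NetGeoClient->new(  );
my (%seen, %tally);

open (LOG, "<access_log") or die $!;
while (<LOG>) {
    my ($host) = split;          # first field of a combined-format log line.
    next if $seen{$host}++;      # look each host up only once.
    my $record = $geo->getRecord($host);
    next unless $record && $record->{COUNTRY};
    $tally{$record->{COUNTRY}}++;
}
close LOG;

print "$_: $tally{$_}\n"
  for sort { $tally{$b} <=> $tally{$a} } keys %tally;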

Another interesting use of geotargeting is analysis of the spread of trends. This can be done with a simple script that plugs into the Google API and the IP-to-Country database provided by Directi (http://ip-to-country.directi.com). The idea behind trend analysis is simple: perform repetitive queries using the same keywords, but change the language of results and top-level domains for each query. Compare the number of results returned for each language, and you will get a good idea of the spread of the analyzed trend across cultures. Then, compare the number of results returned for each top-level domain, and you will get a good idea of the spread of the analyzed trend across the globe. Finally, look up geographic locations of hosts to better approximate the geographic spread of the analyzed trend.
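
With the geospider.pl script shown later in this hack, that comparison boils down to a handful of runs with different -l and -d switches and a tally of how many results each one reports (the query term here is arbitrary):

% perl geospider.pl -q "weblog" -l en -d .com
% perl geospider.pl -q "weblog" -l ja -d .jp
% perl geospider.pl -q "weblog" -l de -d .de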

You might discover some interesting things this way: it could turn out that a particular .com domain serving a significant number of documents that contain the given query in Japanese is actually hosted in Germany. It might be a sign that there is a large Japanese community in Germany that uses that particular .com domain for their portal. Shouldn’t you be trying to get in touch with them?

The geospider.pl script shown in this hack is a sample implementation of this idea. It queries Google and then matches the names of hosts in returned URLs against the IP-to-Country database.

The Code

You will need the Getopt::Std and Net::Google modules for this script. You’ll also need a Google API key (http://api.google.com) and the latest ip-to-country.csv database, available from http://ip-to-country.directi.com/.

Save the following code as geospider.pl:

#!/usr/bin/perl -w
#
# geospider.pl
#
# Geotargeting spider -- queries Google through the Google API, extracts
# hostnames from returned URLs, looks up addresses of hosts, and matches
# addresses of hosts against the IP-to-Country database from Directi:
# ip-to-country.directi.com. For more information about this software:
# http://www.artymiak.com/software or contact jacek@artymiak.com
# 
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#

use strict; 
use Getopt::Std;
use Net::Google;
use constant GOOGLEKEY => 'Your Google API key here';
use Socket;

my $help = <<"EOH";
----------------------------------------------------------------------------
Geotargeting trend analysis spider
----------------------------------------------------------------------------
Options:

  -h    prints this help
  -q    query in utf8, e.g. 'Spidering Hacks'
  -l    language codes, e.g. 'en fr jp'
  -d    domains, e.g. '.com'
  -s    which result should be returned first (count starts from 0), e.g. 0
  -n    how many results should be returned, e.g. 700
----------------------------------------------------------------------------
EOH

# define our arguments and show the
# help if asked, or if missing query.
my %args; getopts("hq:l:d:s:n:", \%args);
die $help if exists $args{h};
die $help unless $args{'q'};

# create the Google object.
my $google = Net::Google->new(key=>GOOGLEKEY);
my $search = $google->search(  );

# language, defaulting to English.
$search->lr(split(/\s+/, ($args{l} || "en")));

# what search result to start at, defaulting to 0.
$search->starts_at($args{'s'} || 0);

# how many results, defaulting to 10.
$search->max_results($args{'n'} || 10);

# input and output encoding.
$search->ie(qw(utf8)); $search->oe(qw(utf8));

my $querystr; # our final string for searching.
if ($args{d}) { $querystr = "$args{q} site:$args{d}"; }
else { $querystr = $args{'q'} } # domain specific searching.

# load in our lookup list from
# http://ip-to-country.directi.com/
my $file = "ip-to-country.csv";
print STDERR "Trying to open $file... \n";
open (FILE, "<$file") or die "[error] Couldn't open $file: $!\n";

# now load the whole shebang into memory.
print STDERR "Database opened, loading... \n";
my (%ip_from, %ip_to, %code2, %code3, %country);
my $counter=0; while (<FILE>) {
    chomp; my $line = $_; $line =~ s/"//g; # strip all quotes.
    my ($ip_from, $ip_to, $code2, $code3, $country) = split(/,/, $line);

    # remove leading zeros.
    $ip_from =~ s/^0{0,10}//g; 
    $ip_to =~ s/^0{0,10}//g;

    # and assign to our permanents.
    $ip_from{$counter} = $ip_from;
    $ip_to{$counter}   = $ip_to;
    $code2{$counter}   = $code2;
    $code3{$counter}   = $code3;
    $country{$counter} = $country;
    $counter++; # move on to next line.
}

$search->query(qq($querystr));
print STDERR "Querying Google with $querystr... \n";
print STDERR "Processing results from Google... \n";

# for each result from Google, display 
# the geographic information we've found.
foreach my $result (@{$search->response(  )}) {
    print "-" x 80 . "\n";
    print " Search time: " . $result->searchTime(  ) . "s\n";
    print "       Query: $querystr\n";
    print "   Languages: " . ( $args{l} || "en" ) . "\n";
    print "      Domain: " . ( $args{d} || "" ) . "\n";
    print "    Start at: " . ( $args{'s'} || 0 ) . "\n";
    print "Return items: " . ( $args{n} || 10 ) . "\n";
    print "-" x 80 . "\n";

    map {
        print "url: " . $_->URL(  ) . "\n";
        my @addresses = get_host($_->URL(  ));
        if (defined $addresses[0]) {
            match_ip(@addresses);
        } else {
            print "address: unknown\n";
            print "country: unknown\n";
            print "code3: unknown\n";
            print "code2: unknown\n";
        } print "-" x 50 . "\n";
    } @{$result->resultElements(  )};
}

# get the IPs for 
# matching hostnames.
sub get_host {
    my ($url) = @_;

    # chop the URL down to just the hostname.
    my $name = substr($url, 7); $name =~ m/\//g;
    $name = substr($name, 0, pos($name) - 1);
    print "host: $name\n";

    # and get the matching IPs.
    my @addresses = gethostbyname($name);
    if (scalar @addresses != 0) {
        @addresses = map { inet_ntoa($_) } @addresses[4 .. $#addresses];
    } else { return undef; }
    return "@addresses";
}

# check our IP in the
# Directi list in memory.
sub match_ip {
    my (@addresses) = split(/ /, "@_");
    foreach my $address (@addresses) {
        print "address: $address\n";
        my @classes = split(/\./, $address);
        my $p; foreach my $class (@classes) {
            $p .= pack("C", int($class));
        } $p  = unpack("N", $p);
        my $counter = 0;
        foreach (keys %ip_to) {
            if ($p <= int($ip_to{$counter})) {
                print "country: " . $country{$counter} . "\n";
                print "code3: "   . $code3{$counter}   . "\n";
                print "code2: "   . $code2{$counter}   . "\n";
                last;
            } else { ++$counter; }
        } 
    }
}

Running the Hack

Here, we’re querying to see how much worldwide penetration AmphetaDesk, a popular news aggregator, has, according to Google’s top search results:

% perl geospider.pl -q "amphetadesk"
Trying to open ip-to-country.csv... 
Database opened, loading... 
Querying Google with amphetadesk... 
Processing results from Google... 
--------------------------------------------------------------
 Search time: 0.081432s
       Query: amphetadesk
   Languages: en
      Domain: 
    Start at: 0
Return items: 10
--------------------------------------------------------------
url: http://www.macupdate.com/info.php/id/9787
host: www.macupdate.com
address: 64.5.48.152
country: UNITED STATES
code3: USA
code2: US
--------------------------------------------------
url: http://allmacintosh.forthnet.gr/preview/214706.html
host: allmacintosh.forthnet.gr
address: 193.92.150.100
country: GREECE
code3: GRC
code2: GR
--------------------------------------------------
...etc...

Hacking the Hack

This script is only a simple tool; you will make it better, no doubt. The first thing you could do is implement a more efficient way to query the IP-to-Country database. Storing the data from ip-to-country.csv in a real database would shave several seconds off the script’s startup time, and individual address-to-country lookups would come back much faster as well.
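
For example, here is a rough sketch of that idea, assuming the DBI and DBD::SQLite modules are installed (the database and table names are just illustrative). Run once, it loads ip-to-country.csv into an SQLite file; from then on, each address-to-country query becomes a single SELECT instead of a scan over a hash held in memory:

#!/usr/bin/perl -w
# load_ip2country.pl -- one-time load of ip-to-country.csv into SQLite.
use strict;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=ip2country.db", "", "",
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do("CREATE TABLE IF NOT EXISTS ip2country
          (ip_from INTEGER, ip_to INTEGER,
           code2 TEXT, code3 TEXT, country TEXT)");
my $ins = $dbh->prepare("INSERT INTO ip2country VALUES (?, ?, ?, ?, ?)");

open (FILE, "<ip-to-country.csv") or die "[error] Couldn't open the csv: $!\n";
while (<FILE>) {
    chomp; s/"//g;                      # strip all quotes, as before.
    $ins->execute(split(/,/, $_, 5));   # ip_from, ip_to, code2, code3, country.
}
close FILE; $dbh->commit;

# A lookup is then a single (indexable) query, for example:
#   SELECT country, code3, code2 FROM ip2country
#   WHERE ? BETWEEN ip_from AND ip_to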

You might ask if it wouldn’t be easier to write a spider that doesn’t use the Google API and instead downloads page after page of results returned by Google at http://www.google.com. Yes, it is possible, and it is also the quickest way to get your script blacklisted for breaching Google’s user agreement. Google is not only the best search engine, it is also one of the best-monitored sites on the Internet.

—Jacek Artymiak

Hack #76. Getting the Best Travel Route by Train

A web scraper can help you find faster train connections in Europe.

If you ever visit Europe and want to travel by train, you will find the PKP (Polskie Koleje Panstwowe, or Polish State Railways) server (http://www.rozklad.pkp.pl) a handy place to find information about European train connections.

This hack queries the timetables of the PKP site and scrapes a variety of information from the results, including the time of departure and arrival, as well as the number of changes you’ll have to make along the way.

The Code

Save the following code as broute.pl:

#!/usr/bin/perl -w
#
# broute.pl
# 
# A European train timetable hack that displays available train connections
# between two cities, with dates, times, and the number of changes. You
# can limit the number of acceptable changes with -c. If there are no
# connections, try earlier/later times/dates or search again for connections
# with intermediate stops, e.g., instead of Manchester -> Roma, choose 
# Manchester -> London, London -> Paris, and Paris -> Roma.
# 
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#

use strict;
use LWP::UserAgent;
use Net::HTTP;
use Getopt::Std;

my $help = <<"EOH";
---------------------------------------------------------------------------
Best train routes in Europe

Options: -a   depart from
         -z   arrive in
         -d   date (of departure, if -s d; arrival, if -s a)
              in dd.mm.yy format (e.g. June 1, 2004 is 01.06.04)
         -t   time (of departure, if -s d; arrival, if -s a)
              in hh:mm format (e.g. 12:45)
         -s   select time point for -d and -t options, default -s d
         -c   maximum number of changes, default 0
         -h   print this help
EOH

# set out command-line options,
# requirements, and defaults.
my %args; getopts('ha:z:d:t:s:c:', \%args);
die $help if exists $args{h};
die $help unless $args{a};
die $help unless $args{z};
die $help unless $args{t};
$args{'s'} = 'depart' unless $args{'s'};
$args{'s'} = 'depart' if $args{'s'} eq 'd';
$args{'s'} = 'arrive' if $args{'s'} eq 'a';

# our requesting agent. define our URL and POST.
my $url  = 'http://www.rozklad.pkp.pl/cgi-bin/new/query.exe/en';
my $post = "protocol=http:&from=$args{a}&to=$args{z}&datesel=custom".
           "&date=$args{d}&timesel=$args{s}&time=$args{t}";

# the headers we'll send off...
my $hdrs = HTTP::Headers->new(Accept => 'text/plain',
                 'User-Agent' => 'PKPTrainTimetableLookup/1.0');

# and the final requested documents.
my $uable = HTTP::Request->new('POST', $url, $hdrs, $post);
my $ua    = LWP::UserAgent->new; my $req = $ua->request($uable);

# if a success,
# let's parse it!
die $req->message
  unless $req->is_success;
my $doc = $req->content;

$doc =~ s/[\f\t\n\r]//isg; # remove linefeeds, tabs, and carriage returns.
while ($doc =~ m/ NAME=sel[0-9]{1,2}>/isg) {
    my $begin = pos($doc);
    $doc =~ m/<TR>/isg;
    my $end = pos($doc);
    next unless $begin;
    next unless $end;

    # munch our content into columns.
    my $content = substr($doc, $begin, ($end -= 5) - $begin);
    $doc = substr($doc, $end);
    my @columns = split(/<TD/, $content); shift @columns;
    foreach my $column (@columns) {
        $column = '<TD' . $column;
        $column =~ s/<[^>]*>//g;
        $column =~ s/<[^>]*//g;
    }

    # skip schedules that have more hops than we want.
    if ($args{c} and int $args{c} < int $columns[2]) { next; }

    # and print out our data.
    print "-" x 80 . "\n";
    print "             From: $columns[0]\n";
    print "               To: $columns[1]\n";
    print "          Changes: $columns[2]\n";
    print "Date of Departure: $columns[3]\n" if $args{'s'} eq 'depart';
    print "  Date of Arrival: $columns[3]\n" if $args{'s'} eq 'arrive';
    print "   Departure Time: $columns[4]\n";
    print "     Arrival Time: $columns[5]\n";
}

Running the Hack

The script has several command-line options that are viewable in the code or by requesting its display with perl broute.pl -h.

Here are a couple of example runs. Let’s find all connections from Berlin to Szczecin with an arrival time of 8:00 A.M. on December 15, 2004 with no changes:

% perl broute.pl -a Berlin -z Szczecin -s a -d 15.12.04 -t 8:00 -c 0

How about all connections from Manchester to Rome with departure time of 8:00 A.M. on December 15, 2004 with a maximum of four changes:

% perl broute.pl -a Manchester -z Roma -s d -d 15.12.04 -t 8:00 -c 4

A typical run looks something like this:

trying http://www.rozklad.pkp.pl/cgi-bin/new/query.exe/en ...
-------------------------------------------------------------------------
             From: Berlin Ostbf
               To: Szczecin Główny
          Changes: 0
  Date of Arrival: 05.07.03
   Departure Time: 5:55
     Arrival Time: 7:41

Hacking the Hack

There are a few things you can do to expand this hack. For example, you could add subroutines that find connections within 24 hours (12 hours before and 12 hours ahead) of the given time of departure or arrival; a rough starting point for that idea follows. Another addition could be a module that displays the names of the transfer stations.
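
Here is one rough way to start on the first idea: a small helper (hypothetical, not part of broute.pl) that builds a list of query times at three-hour steps within 12 hours of the requested time. Each returned time could then be sent through the same POST request in a loop; times that would roll over into another day are simply skipped:

# generate times every $step hours within 12 hours of $time,
# staying inside the same day.
sub nearby_times {
    my ($time, $step) = @_;               # e.g. ("8:00", 3)
    my ($h, $m) = split(/:/, $time);
    my @times;
    for (my $off = -12; $off <= 12; $off += $step) {
        my $hour = $h + $off;
        next if $hour < 0 or $hour > 23;  # skip times that cross midnight.
        push @times, sprintf("%d:%02d", $hour, $m);
    }
    return @times;
}

# nearby_times("8:00", 3) gives 2:00, 5:00, 8:00, and so on up to 20:00.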

—Jacek Artymiak

Hack #77. Geographic Distance and Back Again

When you’re traveling from one place to another, it’s usually handy to know exactly how many miles you’re going to be on the road. One of the best ways to get the most accurate result is to use latitude and longitude.

Dr. Seuss once wrote, “From here to there, from near to far, funny things are everywhere.” But just how far apart are those funny things, anyway?

Given the latitude and longitude of two terrestrial objects, and assuming the earth to be a perfect sphere with a smooth surface, the “great circle” calculation to find the shortest surface distance between those two objects is a simple bit of trigonometry. Even though the earth is neither smooth nor a perfect sphere, the calculation is surprisingly accurate. I found the position—i.e., the latitude and longitude—of my home and the home of a friend who lives a short distance away. Using a town map and a ruler, I calculated the distance at 7.49 miles. Using the positions and trigonometry, the calculated distance came out at 7.43 miles.
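
If you’re curious about the math, here’s a minimal sketch of the great-circle calculation using the standard Math::Trig module, assuming a perfectly spherical earth with a radius of about 3,956 miles (the script later in this hack leaves this job to Geo::Distance instead):

#!/usr/bin/perl -w
use strict;
use Math::Trig;

# great-circle distance in miles between two lat/lon pairs (decimal degrees).
sub great_circle_miles {
    my ($lat1, $lon1, $lat2, $lon2) = @_;
    # Math::Trig wants spherical coordinates: theta is the longitude and
    # phi is 90 degrees minus the latitude, both in radians.
    my @from = (deg2rad($lon1), deg2rad(90 - $lat1));
    my @to   = (deg2rad($lon2), deg2rad(90 - $lat2));
    return great_circle_distance(@from, @to, 3956);
}

# Los Angeles to New York, roughly 2,450 miles.
printf "%.1f miles\n",
    great_circle_miles(34.05466, -118.24150, 40.71012, -74.00657);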

That was good enough for me, so I set about to create a program that would accept two addresses and return the distance between them. Initially, I thought I’d have the program done in about 30 minutes. Ultimately, it required a few hours of research and a creative hack of MapPoint. The tough part? Getting the true latitude and longitude for an address, something I mistakenly thought would be trivial on our little high-tech planet—not so!

The Latitude/Longitude Question

The difficulty associated with this hack can be demonstrated through a very simple exercise: right now, before you read any further, using any online resource that you like, go find the latitude and longitude of your house—not just of your Zip Code, but of your actual house.

Not so easy, is it? In fact, I was surprised by the difficulty this problem presented. I found several resources—the easiest to use being the U.S. Census web site (http://www.census.gov)—that could turn Zip Codes into positions, presumably somewhere near the center of the Zip Code’s geographic region, but virtually nothing that would give me the position of an actual address. In the past, I used a mapping service called MapBlast! (http://www.mapblast.com), and I thought I recalled that this service would give me map positions. However, a trip to MapBlast! now lands you at MapPoint, Microsoft’s mapping service; Microsoft apparently acquired MapBlast! in the not-too-distant past.

At this point, I’ll spare you the details of my research and cut to the chase:

  • If you want the position for a Zip Code, it’s easy; there are lots of sites and even some Perl packages that will do this automatically for you.

  • The major mapping services will take a position and present you with a map, but they won’t give you a position if you give them an address.

  • Microsoft has a nice set of web service APIs in addition to MapPoint, and they can be used to find the position of an address. Unfortunately, it’s a subscription service.

  • Pay services (search http://www.geocode.com to find a few) can turn an address into a position.

  • Whether intentional or not, MapPoint does publish the position for an address in its publicly accessible web interface. It’s not published on the page; it’s published in the URL.

I found that last item in the list most intriguing. I discovered it quite by accident. I had mapped my address and by chance took a look at the URL. I recognized some numbers that looked suspiciously like my latitude and longitude. I played around a bit and found the behavior was consistent; MapPoint returns a latitude/longitude position in its URL whenever it maps an address. Try it. Go to http://mappoint.msn.com/ and map an address. Then, look closely at the URL for the parameter whose name is C. It’s the latitude and longitude of the address you just looked up. Now, all I needed to do was find a way to make MapPoint give that data up to a Perl script!

Hacking the Latitude Out of MapPoint

Getting MapPoint to respond to a Perl script as it would to a browser was a bit more difficult than a straightforward GET or POST. My first few quick attempts earned me return data that contained messages like “Function not allowed,” “ROBOT-NOINDEX,” and “The page you are looking for does not exist.” In the end, I grabbed my trusty packet analyzer and monitored the traffic between IE and MapPoint, ultimately learning what it would take to make MapPoint think it was talking to a browser and not to a script. Here’s what happens:

  1. The first GET request to http://mappoint.msn.com/ earns you a Location: HTTP header in return. The new location redirects to home.aspx, prefixed by a pathname that includes a long string of arbitrary characters, presumably a session ID or some other form of tracking information.

  2. A GET on the new location retrieves the “Find a Map” form. Among the obvious fields—street, city, and so on—are some hidden ones. In particular, one hidden field named __VIEWSTATE contains about 1 KB of encoded data. It turns out that returning the exact __VIEWSTATE is important when sending the address query.

  3. Next, we do a POST to send MapPoint the address we want mapped. In addition to the address information and the __VIEWSTATE field, there are a few other hidden fields to send. In the present code, we send a request specifically for an address in the United States. MapPoint supports other countries, as well as “Place” queries for the entire world, and it wouldn’t be too much work to extend the program to handle these as well.

  4. In response to the POST, we get another Location: HTTP header, this time redirecting to map.aspx. The URL contains several arguments, and among them is the latitude/longitude data that we want.

  5. If you perform a GET on the new location, now you get the map. Our script doesn’t do this last GET, however, because the data we want is in the URL, not on the result page.

The Code

If you take a look at the GetPosition function in the code, you’ll see that it follows the five steps in the previous section exactly. The code also includes a simple routine to parse an address—to make the thing user-friendly, not because we had to—and a mainline to glue it all together and report the results. I used a nice package named Geo::Distance to perform the actual distance calculations. Time for some Perl!

Save the following code as geodist.pl:

#!/usr/bin/perl -w

# Usage: geodist.pl --from="fromaddr" --to="toaddr" [--unit="unit"]
# See ParseAddress(  ) below for the format of addresses. Default unit is
# "mile". Other units are yard, foot, inch, kilometer, meter, centimeter.

use strict;
use Getopt::Long;
use Geo::Distance;
use HTTP::Request::Common;
use LWP::UserAgent;

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

my $_ADDRESS_REGEX = q<(((([^\,]+),\s*)?([^\,]+),\s*)?([A-Z]{2}))?> .
  q<(\s*(\d{5}(-\d{4})?))?>;

sub ParseAddress {

  # Moderately robust regex parse of an address of the form:
  #   Street Address, City, ST ZIP
  # Assumes that a city implies a state, and a street address implies a
  # city; otherwise, all fields are optional. Does a good job so long as
  # there are no commas in street address or city fields.
  
  my $AddrIn = shift;
  my $ComponentsOut = shift;
  $AddrIn =~ /$_ADDRESS_REGEX/;
  $ComponentsOut->{Address} = $4 if $4;
  $ComponentsOut->{City} = $5 if $5;
  $ComponentsOut->{State} = $6 if $6;
  $ComponentsOut->{Zip} = $8 if $8;
}

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sub GetPosition {

  # Hack mappoint.msn.com to obtain the longitude and latitude of an
  # address. MapPoint doesn't actually return lon/lat as user data, but
  # it can be found in a Location header when a successful map request is
  # made. Testing has shown this to be a robust hack. Biggest caveat
  # presently is failure when MapPoint returns multiple address matches.

  my $AddressIn = shift;
  my $LatitudeOut = shift;
  my $LongitudeOut = shift;

  # Create a user agent for HTTP requests.
  my $ua = LWP::UserAgent->new;

  # First do a simple request to get the redirect that MapPoint sends us.
  my $req = GET( 'http://mappoint.msn.com/' );
  my $res = $ua->simple_request( $req );

  # Save the redirect URI and then grab the full page.
  my $uri = $res->headers->{location};
  $req = GET( 'http://mappoint.msn.com' . $uri );
  $res = $ua->request( $req );

  # Get the __VIEWSTATE hidden input from the result.
  my ( $__VIEWSTATE ) =
    $res->content =~ /name="__VIEWSTATE" value="([^\"]*)"/s;

  # Construct the form fields expected by the mapper.
  $req = POST( 'http://mappoint.msn.com' . $uri,
    [ 'FndControl:SearchType' => 'Address',
      'FndControl:ARegionSelect' => '12',
      'FndControl:StreetText' => $AddressIn->{Address},
      'FndControl:CityText' => $AddressIn->{City},
      'FndControl:StateText' => $AddressIn->{State},
      'FndControl:ZipText' => $AddressIn->{Zip},
      'FndControl:isRegionChange' => '0',
      'FndControl:resultOffSet' => '0',
      'FndControl:BkARegion' => '12',
      'FndControl:BkPRegion' => '15',
      'FndControl:hiddenSearchType' => '',
      '__VIEWSTATE' => $__VIEWSTATE
    ] );

  # Works without referer, but we include it for good measure.
  $req->push_header( 'Referer' => 'http://mappoint.msn.com' . $uri );

  # Do a simple request because all we care about is the redirect URI.
  $res = $ua->simple_request( $req );

  # Extract and return the latitude/longitude from the redirect URI.
  ( $$LatitudeOut, $$LongitudeOut ) = $res->headers->{location} =~
    /C=(-?[0-9]+\.[0-9]+)...(-?[0-9]+\.[0-9]+)/;
}

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sub main {

  # Get the command-line options.
  my ( $FromOpt, %FromAddress, $ToOpt, %ToAddress );
  my $UnitOpt = 'mile';
  GetOptions( "from=s" => \$FromOpt,
              "to=s"   => \$ToOpt,
              "unit=s" => \$UnitOpt );

  # Parse the addresses.
  ParseAddress( $FromOpt, \%FromAddress );
  ParseAddress( $ToOpt, \%ToAddress );

  # Get latitude/longitude for the addresses.
  my ( $FromLat, $FromLon, $ToLat, $ToLon );
  GetPosition( \%FromAddress, \$FromLat, \$FromLon );
  GetPosition( \%ToAddress, \$ToLat, \$ToLon );

  # If we at least got some numbers, then find the distance.
  if ( $FromLat && $FromLon && $ToLat && $ToLon ) {
    print "($FromLat,$FromLon) to ($ToLat,$ToLon) is ";
    my $geo = new Geo::Distance;
    print $geo->distance_calc( $UnitOpt, $FromLon,
                               $FromLat, $ToLon, $ToLat );
    if ( $UnitOpt eq 'inch' ) { print " inches\n"; }
    elsif ( $UnitOpt eq 'foot' ) { print " feet\n"; }
    else { print " ", $UnitOpt, "s\n"; }
  }
  else {
    print "Latitude/Longitude lookup failed for FROM address\n"
      if !( $FromLat && $FromLon );
    print "Latitude/Longitude lookup failed for TO address\n"
      if !( $ToLat && $ToLon );
  }
}

main(  );

Running the Hack

A couple of quick examples will show how the hack would work:

% perl geodist.pl --from="Los Angeles, CA" --to="New York, NY"
(34.05466,-118.24150) to (40.71012,-74.00657) is 2448.15742500315 miles

% perl geodist.pl --from="14 Horseshoe Drive, Brookfield, CT" \
                  --to="5 Mountain Orchard, Bethel, CT"
(41.46380,-73.42021) to (41.35659,-73.41078) is 7.43209675476431 miles

% perl geodist.pl --from=06804 --to=06801
(41.47364,-73.38575) to (41.36418,-73.39262) is 7.57999735385486 miles

If something goes wrong with a position lookup—either because MapPoint didn’t find the address or because it found multiple addresses—the script simply indicates which address had a problem:

% perl geodist.pl --from="Los Angeles, CA" --to="New York"
Latitude/Longitude lookup failed for TO address

In this case, "New York" is too general and needs to be refined further.

Hacking the Hack

The most obvious enhancement is to address the two shortcomings of the existing hack: it works only with addresses within the U.S., and it fails if MapPoint returns multiple address matches. Addressing the first issue is a matter of adding some options to the command line and then changing the fields sent in the POST query. Addressing the second issue is a bit more difficult; it’s easy to parse the list that comes back, but the question is what to do with it. Do you just take the first address in the list? This may or may not be what the user wants. A true solution would probably have to present the list to the user and allow him to choose.

—Ron Pacheco

Hack #78. Super Word Lookup

Working on a paper, book, or thesis and need a nerdy definition of one word, and alternatives to another?

You’re writing a paper and getting sick of constantly looking up words in your dictionary and thesaurus. As most of the hacks in this book have done, you can scratch your itch with a little bit of Perl. This script uses the dict protocol (http://www.dict.org) and Thesaurus.com (http://www.thesaurus.com) to find all you need to know about a word.

By using the dict protocol, DICT.org and several other dictionary sites make our task easier, since we do not need to filter through HTML code to get what we are looking for. A quick look through CPAN (http://www.cpan.org) reveals that the dict protocol has already been implemented as a Perl module (http://search.cpan.org/author/NEILB/Net-Dict/lib/Net/Dict.pod). Reading through the documentation, you will find it is well-written and easy to implement; with just a few lines, you have more definitions than you can shake a stick at. Next problem.

Unfortunately, the thesaurus part of our program will not be as simple. However, there is a great online thesaurus (http://www.thesaurus.com) that we will use to get the information we need. The main page of the site offers a form to look up a word, and the results take us to exactly what we want. A quick look at the URL shows this will be an easy hurdle to overcome—using LWP, we can grab the page we want and need to worry only about parsing through it.

Since some words have multiple forms (noun, verb, etc.), there might be more than one entry for a word; this needs to be kept in mind. Looking at the HTML source, you can see that each row of the data is on its own line, starting with some table tags, then the header for the line (Concept, Function, etc.), followed by the content. The easiest way to handle this is to go through each section individually, grabbing from Entry to Source, and then parse out what’s between. Since we want only synonyms for the exact word we searched for, we will grab only sections where the content for the entry line contains only the word we are looking for and is between the highlighting tag used by the site. Once we have this, we can strip out those highlighting tags and proceed to finding the synonym and antonym lines, which might not be available for every section. The easiest thing to do here is to throw it all in an array; this makes it easier to sort, remove duplicate words, and display it. In cases in which you are parsing through long HTML, you might find it easier to put the common HTML strings in variables and use them in the regular expressions; it makes the code easier to read. With a long list of all the words, we use the Sort::Array module to get an alphabetical, and unique, listing of results.

The Code

Save the following code as dict.pl:

#!/usr/bin/perl -w
#
# Dict - looks up definitions, synonyms and antonyms of words.
# Comments, suggestions, contempt? Email adam@bregenzer.net.
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#

use strict; $|++;
use LWP;
use Net::Dict;
use Sort::Array "Discard_Duplicates";
use URI::Escape;

my $word = $ARGV[0]; # the word to look-up
die "You didn't pass a word!\n" unless $word;
print "Definitions for word '$word':\n";

# get the dict.org results.
my $dict = Net::Dict->new('dict.org');
my $defs = $dict->define($word);
foreach my $def (@{$defs}) {
    my ($db, $definition) = @{$def};
    print $definition . "\n";
}

# base URL for thesaurus.com requests
# as well as the surrounding HTML of
# the data we want. cleaner regexps.
my $base_url       = "http://thesaurus.reference.com/search?q=";
my $middle_html    = ":</b>&nbsp;&nbsp;</td><td>";
my $end_html       = "</td></tr>";
my $highlight_html = "<b style=\"background: #ffffaa\">";

# grab the thesaurus results.
my $ua = LWP::UserAgent->new(agent => 'Mozilla/4.76 [en] (Win98; U)');
my $data = $ua->get("$base_url" . uri_escape($word))->content;

# holders for matches.
my (@synonyms, @antonyms);

# and now loop through them all.
while ($data =~ /Entry(.*?)<b>Source:<\/b>(.*)/) {
    my $match = $1; $data = $2;

    # strip out the bold marks around the matched word.
    $match =~ s/${highlight_html}([^<]+)<\/b>/$1/;

    # push our results into our various arrays.
    if ($match =~ /Synonyms${middle_html}([^<]*)${end_html}/) {
        push @synonyms, (split /, /, $1);
    }
    elsif ($match =~ /Antonyms${middle_html}([^<]*)${end_html}/) {
        push @antonyms, (split /, /, $1);
    }
}

# sort them with sort::array,
# and return unique matches.
if (@synonyms) {
    @synonyms = Discard_Duplicates(
        sorting      => 'ascending',
        empty_fields => 'delete',
        data         => \@synonyms,
    );

    print "Synonyms for $word:\n";
    my $quotes = ''; # purtier.
    foreach my $nym (@synonyms) {
        print $quotes . $nym;
        $quotes = ', ';
    } print "\n\n";
}

# same thing as above.
if (@antonyms) {
    @antonyms = Discard_Duplicates(
        sorting      => 'ascending',
        empty_fields => 'delete',
        data         => \@antonyms,
    );

    print "Antonyms for $word:\n";
    my $quotes = ''; # purtier.
    foreach my $nym (@antonyms) {
        print $quotes . $nym;
        $quotes = ', ';
    } print "\n";
}

Running the Hack

Invoke the script on the command line, passing it one word at a time. As far as I know, these sites work with English words only. This script has a tendency to generate a lot of output, so you might want to pipe it to less or redirect it to a file.

Here is an example where I look up the word “hack”:

% perl dict.pl "hack"
Definitions for word 'hack':
<snip>
hack
 
   <jargon> 1. Originally, a quick job that produces what is
   needed, but not well.
 
   2.  An incredibly good, and perhaps very time-consuming, piece
   of work that produces exactly what is needed.

<snip>
 
   See also {neat hack}, {real hack}.
 
   [{Jargon File}]
 
   (1996-08-26)
 
Synonyms for hack:
be at, block out, bother, bug, bum, carve, chip, chisel, chop, cleave, 
crack, cut, dissect, dissever, disunite, divide, divorce, dog, drudge, 
engrave, etch, exasperate, fashion, form, gall, get, get to, grate, grave, 
greasy grind, grind, grub, grubber, grubstreet, hack, hew, hireling, incise, 
indent, insculp, irk, irritate, lackey, machine, mercenary, model, mold, 
mould, nag, needle, nettle, old pro, open, part, pattern, peeve, pester, 
pick on, pierce, pique, plodder, potboiler, pro, provoke, rend, rip, rive, 
rough-hew, sculpt, sculpture, separate, servant, sever, shape, slash, slave, 
slice, stab, stipple, sunder, tear asunder, tease, tool, trim, vex, whittle, 
wig, workhorse
 
Antonyms for hack:
appease, aristocratic, attach, calm, cultured, gladden, high-class, humor, 
join, make happy, meld, mollify, pacify, refined, sophisticated, superior, 
unite

Hacking the Hack

There are a few ways you can improve upon this hack.

Using specific dictionaries

You can either use a different dict server or you can use only certain dictionaries within the dict server. The DICT.org server uses 13 dictionaries; you can limit it to use only the 1913 edition of Webster’s Revised Unabridged Dictionary by changing the $dict->define line to:

my $defs = $dict->define($word, 'web1913');

The $dict->dbs method will get you a list of dictionaries available.
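
For example, this little snippet (using the same dict.org server) prints the name and description of each available dictionary, so you can decide which ones to pass to define:

#!/usr/bin/perl -w
use strict;
use Net::Dict;

# list the databases (dictionaries) the server makes available.
my $dict = Net::Dict->new('dict.org');
my %dbs  = $dict->dbs(  );
foreach my $db (sort keys %dbs) {
    print "$db\t$dbs{$db}\n";
}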

Clarifying the thesaurus

For brevity, the thesaurus section prints all the synonyms and antonyms for a particular word. It would be more useful if it separated them according to the function of the word and possibly the definition.

—Adam Bregenzer

Hack #79. Word Associations with Lexical Freenet

There will come a time when you want a little more than simple word definitions, synonyms, or etymologies. Lexical Freenet takes you beyond these simple results, providing associative data, or “paths,” from your word to others.

Lexical Freenet (http://www.lexfn.com) allows you to search for word relationships like puns, rhymes, concepts, relevant people, antonyms, and so much more. For example, a simple search for the word disease returns a long listing of word paths, each associated with other words by different types of connecting arrows: disease triggers both aids and cancer; comprises symptoms; and bio triggers such relevant persons as janet elaine adkins, james parkinson, alois alzheimer, and so on. This is but a small sampling of the available and verbose output.

In combination with “Super Word Lookup” [Hack #78], a command-line interface to Lexical Freenet’s functionality would bring immense lookup capabilities to writers, librarians, and researchers. This hack shows you how to create said interface, with the ability to customize which relationships you’d like to see, as well as turn the visual connections into text.

The Code

Save the following code as lexfn.pl:

#!/usr/bin/perl -w
#
# Hack to query and report from www.lexfn.com
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#
# by rik - ora@rikrose.net
#

######################
# support stage      #
######################

use strict;
use Getopt::Std qw(getopts);
use LWP::Simple qw(get);
use URI::Escape qw(uri_escape uri_unescape);
use HTML::TokeParser;

sub usage (  ) { print "
usage: lexfn [options] word1 [word2]
options available:
 -s Synonymous     -a Antonym        -b Birth Year
 -t Triggers       -r Rhymes         -d Death Year
 -g Generalizes    -l Sounds like    -T Bio Triggers
 -S Specialises    -A Anagram of     -k Also Known As
 -c Comprises      -o Occupation of
 -p Part of        -n Nationality

 or -x for all

word1 is mandatory, but some searches require word2\n\n"
}

######################
# parse stage        #
######################

# grab arguments, and put them into %args hash, leaving nonarguments
# in @ARGV for us to process later (where word1 and word2 would be)
# if we don't have at least one argument, we die with our usage.
my %args; getopts('stgScparlAonbdTkx', \%args);
if (@ARGV > 2 || @ARGV == 0) { usage(  ); exit 0; }

# turn both our words into queries.
$ARGV[0] =~ s/ /\+/g; $ARGV[1] ||= "";
if ($ARGV[1]) { $ARGV[1] =~ s/ /\+/g; }

# begin our URL construction with the keywords.
my $URL = "http://www.lexfn.com/l/lexfn-cuff.cgi?sWord=$ARGV[0]".
          "&tWord=$ARGV[1]&query=show&maxReach=2";

# now, let's figure out our command-line arguments. each
# argument is associated with a relevant search at LexFN,
# so we'll first create a mapping to and fro.
my %keynames = (
 s => 'ASYN', t => 'ATRG', g => 'AGEN', S => 'ASPC', c => 'ACOM', 
 p => 'APAR', a => 'AANT', r => 'ARHY', l => 'ASIM', A => 'AANA', 
 o => 'ABOX', n => 'ABNX', b => 'ABBX', d => 'ABDX', T => 'ABTR', 
 k => 'ABAK'
);

# if we want everything (all matches),
# then add them to our arguments hash,
# in preparation for our URL.
if (defined($args{'x'}) && $args{'x'} == 1) {
   foreach my $arg (qw/s t g S c p a r l A o n b d T k/){
       $args{$arg} = 1; # in preparation for URL.
   } delete $args{'x'}; # x means nothing to LexFN.
}

# build the URL from the flags we want.
foreach my $arg (keys %args) { $URL .= '&' . $keynames{$arg} . '=on'; }

######################
# request stage      #
######################

# and download it all for parsing.
my $content = get($URL) or die $!;

######################
# extract stage      #
######################

# with the data sucked down, pass it off to the parser.
my $stream = HTML::TokeParser->new( \$content ) or die $!;

# skip the form on the page, then it's the first <b>
# after the form that we start extracting data from
my $tag = $stream->get_tag("/form");
while ($tag = $stream->get_tag("b")) {
    print $stream->get_trimmed_text("/b") . " ";
    $tag = $stream->get_tag("img");
    print $tag->[1]{alt} . " ";
    $tag = $stream->get_tag("a");
    print $stream->get_trimmed_text("/a") . "\n";
}

exit 0;

The code is split into four basic stages:

Support code

Such as includes and any subroutines you will need

The parsing stage

Where we work out what the user actually wants and build a URL to perform the request

The request stage itself

Where we retrieve the results

The extract stage

Where we recover the data

In this case, the Lexical Freenet site is basic enough that the request is a single URL. A typical Freenet URL looks something like this:

http://www.lexfn.com/l/lexfn-cuff.cgi?fromresub=on&
ASYN=on&ATRG=on&AGEN=on&ASPC=on&ACOM=on&APAR=on&AANT=on&
ARHY=on&ASIM=on&AANA=on&ABOX=on&ABNX=on&ABBX=on&ABDX=on&
ABTR=on&ABAK=on&sWord=lee+harvey+oswald&tWord=disobey&query=SHOW

The data we wish to extract is formed by repeatedly pulling the information from a standard and repetitive chunk of HTML in the search results. This allows us to use the simple HTML::TokeParser module [Hack #20] to retrieve chunks of data easily by parsing the HTML tags, allowing us to query their attributes and retrieve the surrounding text. As you can tell from the previous code, this is not too difficult.

Running the Hack

As you can see from the code, the hack has several switches available for you to decide which kind of word results you want. In this case, we’ll run a search for everything related to disease:

% perl lexfn.pl -x disease
disease triggers aids
disease triggers cancer
disease triggers patients
disease triggers virus
disease triggers doctor
...
disease is more general than blood disorder
disease is more general than boutonneuse fever
disease is more general than cat scratch disease
...
disease rhymes with breeze
disease rhymes with briese
disease rhymes with cheese
disease rhymes with crees
...

Or perhaps a person’s name is more to your liking:

% perl lexfn.pl -bdonT "lee harvey oswald"
lee harvey oswald was born in 1939
lee harvey oswald died in 1963
lee harvey oswald has the nationality american
lee harvey oswald has the occupation assassin
lee harvey oswald triggers 1956-1959
lee harvey oswald triggers 1959
lee harvey oswald triggers 1962
lee harvey oswald triggers attempted
lee harvey oswald triggers become
lee harvey oswald triggers book
lee harvey oswald triggers citizen
lee harvey oswald triggers communist
...

—Richard Rose

Hack #80. Reformatting Bugtraq Reports

Since Bugtraq is such an important part of a security administrator’s watch list, it’s only a matter of time before you’ll want to integrate it more closely with your daily habits.

In this hack, we will write some code to extract the latest Bugtraq reports from http://www.security-focus.com and then output the simplified results for your viewing pleasure. Bugtraq, if you’re not familiar with it, is a moderated discussion list devoted to security issues. Discussions are detailed accounts of new security issues and vulnerabilities, both how they’re exploited and how they can be fixed. Let’s start by examining the web page where the Bugtraq report is located: http://www.security-focus.com/archive/1.

One nice thing to notice about this page is that the data is formatted in a table, complete with column headers. We can use those headers to simplify the data-scraping process by using a handy Perl module called HTML::TableExtract (http://search.cpan.org/author/MSISK/HTML-TableExtract/). TableExtract allows us to scrape the data from the web page without tying our code to a particular layout (at least, not too much). It accomplishes this feat by using those nice column headers. As long as those column headers stay the same, the script should continue to work, even if SecurityFocus gives the page a facelift. In addition to that nice feature, TableExtract takes all the hard work out of parsing the HTML for the data we’re after. Let’s get started.

In the end, this script will use runtime options to allow the user to choose from a number of output formats and locations. I’m not a big fan of those one-letter flags sent to scripts to choose options, so we’ll be using short words instead.

The Code

You’ll need the HTML::TableExtract and LWP::Simple modules to grab the Bugtraq page. As we add more features, you’ll also need XML::RSS, Net::AIM, and Net::SMTP. You could use other modules like URI::URL or HTML::Element to simplify this hack even further.

There are a couple of things to note about this code. We start by retrieving the arguments passed to the script that will be used to determine the output formats; we’ll discuss those later. Next, the data scraped from the Bugtraq page is stuck into a custom data structure to make accessing it easier for later additions to this hack. Also, a subroutine is added to format the data contained in the data structure to ensure minimal code duplication once we have to format for multiple types of output.

Save the following code to a file called bugtraq_hack.pl:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Net::SMTP;
use Net::AIM;
use XML::RSS;

# get params for later use.
my $RUN_STATE = shift(@ARGV);

# the base URL of the site we are scraping and
# the URL of the page where the bugtraq list is located.
my $base_url = "http://www.security-focus.com";
my $url      = "http://www.security-focus.com/archive/1";

# get our data.
my $html_file = get($url) or die "$!\n";

# create an ISO date (localtime's month is 0-based, so bump it).
my ($day, $month, $year) = (localtime)[3..5];
$year += 1900; $month += 1;
my $date = sprintf("%04d-%02d-%02d", $year, $month, $day);

# since the data we are interested in is contained in a table,
# and the table has headers, then we can specify the headers and
# use TableExtract to grab all the data below the headers in one
# fell swoop. We want to keep the HTML code intact so that we
# can use the links in our output formats. start the parse:
my $table_extract =
   HTML::TableExtract->new(
     headers   => [qw(Date Subject Author)],
     keep_html => 1 );
$table_extract->parse($html_file);

# parse out the desired info and
# stuff into a data structure.
my @parsed_rows; my $ctr = 0;
foreach my $table ($table_extract->table_states) {
   foreach my $cols ($table->rows) {
      @$cols[0] =~ m|(\d+/\d+/\d+)|;
      my %parsed_cols = ( "date" => $1 );

      # since the subject links are in the 2nd column, parse unwanted HTML
      # and grab the anchor tags. Also, the subject links are relative, so
      # we have to expand them. I could have used URI::URL, HTML::Element,
      # HTML::Parse, etc. to do most of this as well.
      @$cols[1] =~ s/ class="[\w\s]*"//;
      @$cols[1] =~ m|(<a href="(.*)">(.*)</a>)|;
      $parsed_cols{"subject_html"} = "<a href=\"$base_url$2\">$3</a>";
      $parsed_cols{"subject_url"}  = "$base_url$2";
      $parsed_cols{"subject"}      = $3;

      # the author links are in the 3rd
      # col, so do the same thing.
      @$cols[2] =~ s/ class="[\w\s]*"//;
      @$cols[2] =~ m|(<a href="mailto:(.*@.*)">(.*)</a>)|;
      $parsed_cols{"author_html"}  = $1;
      $parsed_cols{"author_email"} = $2;
      $parsed_cols{"author"}       = $3;

      # put all the information into an
      # array of hashes for easy access.
      $parsed_rows[$ctr++] = \%parsed_cols;
   }
}
 
# if no params were passed, then
# simply output to stdout.
unless ($RUN_STATE) { print &format_my_data(  ); }

# formats the actual
# common data, per format.
sub format_my_data(  ) {
   my $data = "";

   foreach my $cols (@parsed_rows)  {
      unless ($RUN_STATE) { $data .= "$cols->{'date'} $cols->{'subject'}\n"; }
   }

   return $data;
}

Running the Hack

Invoke the script on the command line to view the latest Bugtraq listings:

% perl bugtraq_hack.pl
07/11/2003 Invision Power Board v1.1.2
07/11/2003 LeapFTP remote buffer overflow exploit
07/11/2003 TSLSA-2003-0025 - apache
07/11/2003 W-Agora 4.1.5
...etc...

Okay, that was easy, but what if you want it in HTML, RSS, email, or sent to your AIM account? No problem.

Hacking the Hack

Before we get to the code that handles the different outputs, let’s start with the format_my_data( ) subroutine. This will be used to decide what format we want our data to be presented in, tweak the display based on that decision, and then return the results. We’ll use the $RUN_STATE variable to decide what action format_my_data( ) will take. Normally, I would try to keep the code and variables used inside a subroutine as black-boxed as possible, but in this case, to keep things simple and compact, we’ll be accessing the dreaded global variables directly. Here’s the new code:

sub format_my_data(  ) {
   my $data = "";

   foreach my $cols (@parsed_rows)  {
      if (!$RUN_STATE or $RUN_STATE eq 'file') {
         $data .= "$cols->{date} $cols->{subject}\n"; 
      }
      elsif ($RUN_STATE eq 'html') {
         $data .= "<tr>\n<td>$cols->{date}</td>\n".
                  "<td>$cols->{subject_html}</td>\n".
                  "<td>$cols->{author_html}</td>\n</tr>\n";
      }
      elsif ($RUN_STATE eq 'email') {
         $data .= "$cols->{date} $cols->{subject}\n".
                  "link: $cols->{subject_url}\n";
      }
      elsif ($RUN_STATE eq 'aim') {
         $data .= "$cols->{date} $cols->{subject} $cols->{subject_url}\n";
      }
   }

   return $data;
}

Now, let’s implement the different runtime options. We’ll set up similar conditional code from the format_my_data( ) function in the main body of the script so that the script can handle all of the various output tasks. Here’s the code for outputting to email, file, RSS, HTML, and AIM. The AIM networking code is similar to [Hack #99], so, in the interest of brevity, I’ve declined to show it here:

unless ($RUN_STATE) { print &format_my_data(  ); }
elsif ($RUN_STATE eq 'html') {
   my $html  = "<html><head><title>Bugtraq $date</title></head><body>\n";
   $html    .= "<h1>Bugtraq listings for: $date</h1><table border=0>\n";
   $html    .= "<tr><th>Date</th><th>Subject</th><th>Author</th></tr>\n";
   $html    .= &format_my_data(  ) . "</table></body></html>\n";
   print $html;
}

elsif ($RUN_STATE eq 'email') {
   my $mailer = Net::SMTP->new('your mail server here');
   $mailer->mail('your sending email address');
   $mailer->to('your receiving email address');
   $mailer->data(  );
   $mailer->datasend("Subject: Bugtraq Report for $date\n\n");
   $mailer->datasend( format_my_data );
   $mailer->dataend(  );
   $mailer->quit;
}

elsif ($RUN_STATE eq 'rss') {
   my $rss = XML::RSS->new(version => '0.91');
   $rss->channel(title           => 'SecurityFocus Bugtraq',
                  link            => $url,
                 language        => 'en',
                 description     => 'Latest Bugtraq listings' );
   
   # add items to the RSS object.
   foreach my $cols (@parsed_rows) {
      $rss->add_item(title       => $cols->{date},
                     link        => $cols->{subject_url},
                     description => $cols->{subject} );
   } print $rss->as_string;
}

elsif ($RUN_STATE eq 'aim') {
  # AIM-related code goes here.
}

So what else could you do to enhance this hack? How about adding support for other instant messengers or allowing multiple command-line options at once? Alternatively, what about having the AIM bot email the Bugtraq report upon request, or make it a CGI script and output the RSS to an RSS aggregator like AmphetaDesk (http://www.disobey.com/amphetadesk/) or NetNewsWire (http://ranchero.com/netnewswire)?
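
For the CGI idea, the only genuinely new piece is sending an appropriate Content-Type header before the feed. Something along these lines (a bare-bones, hypothetical wrapper; in practice you would build the feed from @parsed_rows exactly as in the rss branch above) would let AmphetaDesk or NetNewsWire subscribe directly:

#!/usr/bin/perl -w
# bugtraq_rss.cgi -- minimal CGI wrapper sketch.
use strict;
use XML::RSS;

my $rss = XML::RSS->new(version => '0.91');
$rss->channel(title       => 'SecurityFocus Bugtraq',
              link        => 'http://www.security-focus.com/archive/1',
              description => 'Latest Bugtraq listings');
# ...add_item(  ) calls for the scraped rows would go here...

# the CGI-specific part: a Content-Type header, then the XML itself.
print "Content-Type: application/rss+xml\n\n";
print $rss->as_string;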

—William Eastler

Hack #81. Keeping Tabs on the Web via Email

If you find yourself checking your email more than cruising the Web, you might appreciate a little Perl work to bring the Web to your mailbox.

If you’re an info-junky, you have a growing list of sites that you visit daily, maybe hourly. But sometimes, no matter how many times you refresh the page, some sites just don’t update soon enough. It would be better if there were a way to be notified when the site changes, so that you could spend your browsing time better.

Some sites offer a service like this, and others offer syndication feeds that programs can monitor, but there are many sites with which you’re out of luck in this regard. In this case, you’re going to need your own robot.

Planning for Change

For this hack, we’ll choose email as the method of notification, since that seems to be the simplest yet most flexible. We can use some common Perl modules to handle email and download web pages. This just leaves us with figuring out how to determine whether a web page has changed.

Actually, it would be more useful if we could figure out how much a web page has changed. Many web pages change constantly, since some might display the current time, others might show updated comment counts on news stories, and others might include a random quote on the page or feature different headlines for each request. If we’re just interested in major differences, such as a brand new front-page story on a news site, we’d like some relative measure.

While there are likely smarter ways of doing this, one quick way is to use the GNU diff utility to compare downloads of a web page across time. Further, it would be useful if we compared only the text of pages, not the HTML, since we’re more interested in content than layout or markup changes. For this, we can employ the venerable text-based web browser lynx. lynx is commonly found with many Linux distributions and is easily acquired on most other Unix operating systems. This browser already works to format web pages for a plain text display and, with the use of a command-line option, it can redirect this text to a file.

So, given lynx and diff, we can boil web pages down to their text content and compare changes in content. As an added benefit, we can include the text version of web pages in emails we send as an alternative to HTML.

With all this in mind, let’s start our script:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTTP::Status;
use MIME::Lite;

# Locate the utility programs needed
our $lynx = '/usr/bin/lynx';
our $diff = '/usr/bin/diff';

# Define a location to store datafiles,
# and an address for notification
my $data_path = "$ENV{HOME}/.pagediff";
my $email = 'your_email@here.com';

So far, we’ve set up some safety features and loaded up our tool modules. We’ve also located our utility programs, given the script a place to store data, and chosen an email address for notifications. Next, let’s make a list of sites to visit:

my %sites =
  (
   'slashdot'     => ['http://slashdot.org/index.html', 500],
   'penny_arcade' => ['http://www.penny-arcade.com/view.php3', 20],
  );

This is a hash that consists of nicknames for sites and, for each site, a list that consists of a URL and a change threshold. This number is very fuzzy and will require some tweaking to get the right frequency of notification. Higher numbers require more changes before an email goes out. We’ll see how this works in just a minute.

Next, let’s handle each of our favorite sites:

for my $site (keys %sites) {
  my ($url, $threshold) = @{$sites{$site}};

  # Build filenames for storing the HTML content, text
  # content, as well as content from the previous notification.
  my $html_fn = "$data_path/$site.html";
  my $new_fn  = "$data_path/$site.txt";
  my $old_fn  = "$data_path/$site-old.txt";

  # Download a new copy of the HTML.
  getstore($url, $html_fn);

  # Get text content from the new HTML.
  html_to_text($html_fn, $new_fn);

  # Check out by how much the page has changed since last notification.
  my $change = measure_change($new_fn, $old_fn);

  # If the page has changed enough,
  # send off a notification.
  if ($change > $threshold) {
    send_change_notification
      ($email,
       {
        site      => $site,
        url       => $url,
        change    => $change,
        threshold => $threshold,
        html_fn   => $html_fn,
        new_fn    => $new_fn,
        old_fn    => $old_fn
       }
      );

    # Rotate the old text content for the new.
    unlink $old_fn if (-e $old_fn);
    rename $new_fn, $old_fn;
  }
}

The main loop of our script is quite simple. For each site, it does the following:

  • Downloads a new copy of the web page.

  • Saves a copy of the page’s text contents.

  • Measures the amount of change detected between this latest download and content saved from the last time an email was sent. If the change is greater than the threshold for this site, it sends an email summarizing the change and rotates out the previously saved content for the new download.

Calling In Outside Help

Now that we have the backbone of the script started, let’s work on the functions that the script uses. In particular, these first functions will make use of our external tools, diff and lynx:

sub html_to_text {
  my ($html_fn, $txt_fn) = @_;
  open(FOUT, ">$txt_fn");
  print FOUT `$lynx -dump $html_fn`;
  close(FOUT);
}

This function, by way of lynx, extracts the text content from one HTML file and writes it to another file. It just executes the lynx browser with the -dump command-line option and saves that output.

Next, let’s use diff to examine changes between text files:

sub get_changes {
  my ($fn1, $fn2) = @_;
  return `$diff $fn1 $fn2`;
}

Again, this simple function executes the diff program on two files and returns the output of that program. Now, let’s measure the amount of change between two files using this function:

sub measure_change {
  my ($fn1, $fn2) = @_;
  return 0 if ( (!-e $fn1) || (!-e $fn2) );
  my @lines = split(/\n/, get_changes($fn1, $fn2));
  return scalar(@lines);
}

If one of the files to compare doesn’t exist, this function returns no change. But if the files exist, the function calls the get_changes function on two files and counts the number of lines of output returned. This is a dirty way to measure change, but it does work. The more two versions of a file differ, the more lines of output diff will produce. This measure says nothing about the nature of the changes themselves, but it can still be effective if you supply a little human judgment and fudging.

Keep this in mind when you adjust the change thresholds defined at the beginning of this script. You might need to adjust things a few times per site to figure out how much change is important for a particular site. Compared with the complexity of more intelligent means of change detection, this method seems best for a quick script.

Send Out the News

Now that all the tools for extracting content and measuring change are working, we need to work out the payoff for all of this: sending out change notification messages. With the MIME::Lite Perl module (http://search.cpan.org/author/YVES/MIME-Lite/), we can send multipart email messages with both HTML and plain text sections. So, let’s construct and send an email message that includes the original HTML of the updated web page, the text content, and a summary of changes found since the last update.

First, create the empty email and set up the basic headers:

sub send_change_notification {
  my ($email, $vars) = @_;

  # Start constructing the email message
  my $msg = MIME::Lite->new
    (
     Subject => "$vars->{site} has changed.".
       "($vars->{change} > $vars->{threshold})",
     To      => $email,
     Type    => 'multipart/alternative',
    );

  # Create a separator line of '='
  my $sep = ("=" x 75);

Note that we indicate how much the page has changed with respect to the threshold in the subject, and we create a separator line for formatting the text email portion of the message.

Next, let’s build the text itself:

  # Start the text part of email
  # by dumping out the page text.
  my $out = '';
  $out .= "The page at $vars->{url} has changed. ";
  $out .= "($vars->{change} > $vars->{threshold})\n\n";
  $out .= "\n$sep\nNew page text follows:\n$sep\n";

  open(FIN, $vars->{new_fn});
  local $/; undef $/;
  $out .= <FIN>;
  close(FIN);

  # Follow with a diff summary of page changes.
  $out .= "$sep\nSummary of changes follows:\n$sep\n\n";
  $out .= get_changes($vars->{new_fn}, $vars->{old_fn})."\n";

Here, we dump the text contents of the changed web page, courtesy of lynx, followed by the output of the diff utility. It’s a little bit of Perl obscura, but we do some finessing of Perl’s file handling to simplify reading the whole text file into a variable. The variable $/ defines what Perl uses as an end-of-line character, normally set to some sort of carriage return or linefeed combination. By using undef to clear this setting, Perl considers the entire contents of the file as one long line without endings and slurps it all down into the variable.

Now that we have the text of the email, let’s add it to our message:

  # Add the text part to the email.
  my $part1 = MIME::Lite->new
    (
     Type => 'text/plain',
     Data => $out
    );
  $msg->attach($part1);

This bit of code creates a message part containing our text, gives it a header describing its contents as plain text, and adds it to the email message. Having taken care of the text, let’s add the HTML part of the email:

  # Create and add the HTML part of the email, making sure to add a
  # header indicating the base URL used for relative URLs.
  my $part2 = MIME::Lite->new
    (
     Type => 'text/html',
     Path => $vars->{html_fn}
    );
  $part2->attr('Content-Location' => $vars->{url});
  $msg->attach($part2);

  # Send off the email
  $msg->send(  );
}

This code creates an HTML part for our email, including the HTML content we last downloaded and setting the appropriate header to describe it as HTML. We also define another header that lets mail readers know the base URL for the HTML in order to resolve relative URLs. We set this to the original URL of the page so that images and links resolve properly.

Finally, we send off the message.

Hacking the Hack

You’ll probably want to use this script in conjunction with cron [Hack #90] or some other scheduler, to check pages for changes on a periodic basis. Just be polite and don’t run it too often. Checking every hour or so should be often enough for most sites.

As for the script itself, we’re cheating a little, since external tools do most of the work. But when we’re writing hacks, it’s best to be lazy and take advantage of other smart people’s work as much as possible. In working out the amount of change between notifications, we’re pretty inexact and fuzzy, but the method works. An exercise for the reader might be to find better means for measuring change, possibly methods that also can tell what kind of changes happened, to help you make better decisions on when to send notifications.

Also note that, though this hack uses both the diff and lynx programs directly, there are more cross-platform and pure Perl solutions for finding differences between files, such as the Text::Diff (http://search.cpan.org/author/RBS/Text-Diff/) or HTML::Diff (http://search.cpan.org/author/EZRAKILTY/html-diff/) modules on CPAN. And, with a bit of work, use of lynx could be replaced as well.
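
For instance, a pure Perl replacement for the measure_change subroutine, assuming Text::Diff is installed, could look something like this:

use Text::Diff;

# same fuzzy measure as before, but without the external diff program.
sub measure_change {
  my ($fn1, $fn2) = @_;
  return 0 if ( (!-e $fn1) || (!-e $fn2) );
  my @lines = split(/\n/, diff($fn1, $fn2, { STYLE => 'OldStyle' }));
  return scalar(@lines);
}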

—l.m.orchard

Hack #82. Publish IE’s Favorites to Your Web Site

You’re surfing at a friend’s house and think, “What is that URL? I have a link to it in my favorites. I wish I were home.” How about making your favorites available no matter where you go?

You can’t take them with you—your Internet Explorer bookmarks, I mean. They live on a particular machine, accessible to you only when you’re at that machine. Yes, there are some online bookmarking services, but the ones worth using have started making their users ante up or live through pop-up advertising hell. Of course, we Perl hackers don’t have to settle for either.

This hack publishes the contents of your IE Favorites to any server that you can access via FTP, setting you up with a nice little navigable menu frame on the left to hold your favorites and a content area on the right to display the sites you click on. Yes, this hack is a bit Windows- and IE-specific, but before you complain too much, it’s easily extensible to process any form of bookmark data that’s stored in a tree structure, and the output is templated. The template shown here generates just a simple HTML menu system, but templates for PHP, ASP, raw data—anything you like—should be a breeze!

IE’s Favorites

Let’s start by taking a quick look at IE’s Favorites folder. If you use Windows, you probably know that this folder is now used by more than IE, but most people I know, myself included, still use it mainly in the context of web browsing. On Windows NT, 2000, and XP running IE4 or later, the Favorites folder is nothing more than a directory stored within your user profile tree. The easiest and most consistent way to locate the folder is through the USERPROFILE environment variable; you’ll note that, at the top of the script, the configurable global identifying the root of the Favorites tree uses precisely this environment variable by default.

The structure of the Favorites tree itself is simple. It’s a directory tree that contains folders and links. It is possible to put things other than URL links into your Favorites; since we’re interested in publishing web bookmarks, we’ll ignore everything except directories and links (in this context, links are defined as files with a .url extension). A link document contains a bit of data in addition to the actual URL; fortunately, it’s easy to ignore, because the one thing that every link document has is a line that starts with URL= and then specifies the location in question. In our hack, we’ll simply extract this one line with a regular expression.
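
For reference, a link file is just a small INI-style text file. A typical one looks something like this (the URL is only an example; Windows may add other fields, such as icon settings, which the script ignores):

[InternetShortcut]
URL=http://www.oreilly.com/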

What It Does and How It Works

The script goes through three processes:

  1. Parse the Favorites tree and load the structure.

  2. Generate the output documents.

  3. Upload the documents via FTP.

We’ll take a quick look at each and then get right to the code.

Parsing the Favorites tree is handled by walking through the tree recursively using Perl’s system-independent opendir, readdir, and closedir routines. We use File::Spec routines for filename handling, to make enhancing and porting to other systems easier. The structure itself is read into a hash of hashes, one of the basic Perl techniques for creating a tree. For each hash in the tree, subdirectories map to another hash and links map to a scalar with the link URL. Reading the entire Favorites tree into an internal data structure isn’t strictly necessary, but it simplifies and decouples the later processes, and it also provides a great deal of flexibility for enhancements to the script.
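
To make that concrete, here’s a hypothetical snapshot of the kind of structure LoadFavorites builds; the folder names and URLs are invented:

# Folders become nested hash references; links become plain
# scalars holding the link's URL.
my %Favorites = (
    'Favorites' => {
        'News' => {
            "O'Reilly" => 'http://www.oreilly.com/',
            'Tech'     => {
                'Slashdot' => 'http://slashdot.org/',
            },
        },
        'Weather' => 'http://www.weather.com/',
    },
);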

Generating the output based on the Favorites data is done with a template so that the script doesn’t lock its user into any one type of output. When you’re using Perl, Text::Template is always an excellent choice—since Perl itself is the templating language—so we use it here. The template in this hack outputs HTML, defining a simple menu based on the folders and links and using HTML anchors to open the link targets in a named frame. It is expected that the entire set of documents, one document per Favorites directory, will be published to a single output directory, so filenames are generated using each directory’s relative path from the main Favorites directory, each path component being separated by a period. The documents themselves are generated in a temp directory, which the script attempts to remove upon completion.
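
To make the naming scheme concrete, a hypothetical Favorites\News\Tech folder would map to output filenames like this under the default settings:

Favorites            ->  favorites.html
Favorites\News       ->  favorites.news.html
Favorites\News\Tech  ->  favorites.news.tech.html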

The upload code is straightforward and not especially robust. Upload is via FTP, and the published script requires that the FTP parameters be coded into the configuration globals at the top of the file. If anything other than an individual put fails, the code gives up. If a put itself fails, a warning is issued and we move on to the next file.

The Code

You need three files. PublishFavorites.pl is the Perl code that does the work. The template for our example is favorites.tmpl.html. Finally, a simple index.html, which defines the frameset for our menus, will need to be uploaded manually just once.

First, here’s PublishFavorites.pl:

#!/usr/bin/perl -w
use strict;
use File::Spec;
use File::Temp;
use Net::FTP;
use Text::Template;

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

## Configurable Globals

## $FAV_ROOT = Location of the root of the Favorites folder
my $FAV_ROOT = File::Spec->join( $ENV{USERPROFILE}, 'Favorites' );

## $FAV_NAME = Top level name to use in favorites folder tree
my $FAV_NAME = 'Favorites';

## $FAV_TMPL = Text::Template file; output files will use same extension
my $FAV_TMPL = 'favorites.tmpl.html';

## Host data for publishing favorites via ftp
my $FAV_HOST = 'myserver.net';
my $FAV_PATH = 'favorites';
my $FAV_USER = 'username';
my $FAV_PASS = 'password';

## End of Configurable Globals

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

my $_FAV_TEMPDIR = File::Temp->tempdir( 'XXXXXXXX', CLEANUP => 1 );

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sub LoadFavorites {

  # Recursively load the structure of an IE
  # Favorites directory tree into a tree of hashes.

  my $FolderIn = shift;      # Folder to process
  my $FavoritesOut = shift;  # Hashref to load with this folder's entries

  # Do a readdir into an array for a
  # quick load of the directory entries.
  opendir( FOLDER, $FolderIn ) ||
    die "Could not open favorites folder '$FolderIn'";
  my @FolderEntries = readdir( FOLDER );
  closedir( FOLDER );

  # Process each entry in the directory.
  foreach my $FolderEntry ( @FolderEntries ) {

    # Skip special names . and ..
    next if $FolderEntry eq '.' || $FolderEntry eq '..';

    # Construct the full path to the current entry.
    my $FileSpec = File::Spec->join( $FolderIn, $FolderEntry );

    # Call LoadFavorites recursively if we're processing a directory.
    if ( -d $FileSpec && !( -l $FileSpec ) ) {
      $FavoritesOut->{$FolderEntry} = {};
      LoadFavorites( $FileSpec, $FavoritesOut->{$FolderEntry} );
    }

    # If it's not a directory, check for a filename that ends with '.url'.
    # When we find a link file, extract the URL and map the favorite to it.
    elsif ( $FolderEntry =~ /^.*\.url$/i ) {
      my ( $FavoriteId ) = $FolderEntry =~ /^(.*)\.url$/i;
      next if !open( FAVORITE, $FileSpec );
      ( $FavoritesOut->{$FavoriteId} ) =
           join( '', <FAVORITE> ) =~ /^URL=([^\n]*)\n/m;
      close( FAVORITE );
    }
  }
}

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sub MakeDocName {

  # Quick hack to generate a safe filename for a favorites entry. Replaces
  # all whitespace and special characters with underscores, concatenates
  # parent spec with the new spec, and postfixes the whole thing with
  # the same file extension as the globally named template document.

  my $FavoriteIn = shift;        # Label of new favorites entry
  my $ParentFilenameIn = shift;  # MakeDocName of the parent level

  my ( $FileType ) = $FAV_TMPL =~ /\.([^\.]+)$/;
  $FavoriteIn =~ s/(\s+|\W)/_/g;
  $ParentFilenameIn =~ s/$FileType$//;
  return lc( $ParentFilenameIn . $FavoriteIn . '.' . $FileType );
}

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sub GenerateFavorites {

  # Recurse through a tree of Favorites entries and generate a document for
  # each level based on the globally named template document.

  my $FavoritesIn = shift;       # Hashref to current tree level
  my $FolderNameIn = shift;      # Name of the current folder
  my $ParentFilenameIn = shift;  # MakeDocName of the parent level

  # Create shortcut identifiers for things that get reused a lot.
  my $Folder = $FavoritesIn->{$FolderNameIn};
  my $FolderFilename = MakeDocName( $FolderNameIn, $ParentFilenameIn );

  # Separate the entries in the current folder into folders and links.
  # Folders can be identified because they are hash references, whereas
  # links are mapped to simple scalars (the URL of the link).
  my (%Folders,%Links);
  foreach my $Favorite ( keys( %{$Folder} ) ) {
    if ( ref( $Folder->{$Favorite} ) eq 'HASH' ) {
      $Folders{$Favorite} = { label => $Favorite,
        document => MakeDocName( $Favorite, $FolderFilename ) };
    }
    else {
      $Links{$Favorite}={label => $Favorite, href => $Folder->{$Favorite} };
    }
  }

  # Set up Text::Template variables, fill in the template with the folders
  # and links at this level of the favorites tree, and then output the
  # processed document to our temporary folder.
  my $Template = Text::Template->new( TYPE => 'FILE',
    DELIMITERS => [ '<{', '}>' ], SOURCE => $FAV_TMPL );
  my %Vars = (
    FAV_Name => $FAV_NAME,
    FAV_Home => MakeDocName( $FAV_NAME, '' ),
    FAV_Folder => $FolderNameIn,
    FAV_Parent => $ParentFilenameIn,
    FAV_Folders => \%Folders,
    FAV_Links => \%Links
  );
  my $Document = $Template->fill_in( HASH => \%Vars );
  my $DocumentFile = File::Spec->join( $_FAV_TEMPDIR, $FolderFilename );
  if ( open( FAVORITES, ">$DocumentFile" ) ) {
    print( FAVORITES $Document );
    close( FAVORITES );
  }

  # Generate Favorites recursively for each of this folder's subfolders.
  foreach my $Subfolder ( keys( %Folders ) ) {
    GenerateFavorites( $Folder, $Subfolder, $FolderFilename );
  }
}

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sub PublishFavorites {

  # Publish the generated documents via FTP. Pretty
  # much just gives up if something goes wrong.

  my $ftp = Net::FTP->new( $FAV_HOST ) ||
    die( "Cannot connect to '$FAV_HOST'" );
  $ftp->login( $FAV_USER, $FAV_PASS ) ||
    die( "Authorization for user '$FAV_USER' failed" );
  $ftp->cwd( $FAV_PATH ) ||
    die( "Could not CWD to '$FAV_PATH'" );
  opendir( FOLDER, $_FAV_TEMPDIR ) ||
    die( "Cannot open working directory '$_FAV_TEMPDIR'" );
  my @FolderEntries = readdir( FOLDER );
  closedir( FOLDER );
  foreach my $FolderEntry ( @FolderEntries ) {
    next if $FolderEntry eq '.' || $FolderEntry eq '..';
    $ftp->put( File::Spec->join( $_FAV_TEMPDIR, $FolderEntry ) ) ||
      warn( "Could not upload '$FolderEntry'...skipped" );
  }
  $ftp->quit;
}

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sub main {
  my %Favorites;
  $Favorites{$FAV_NAME} = {};
  LoadFavorites( $FAV_ROOT, $Favorites{$FAV_NAME} );
  GenerateFavorites( \%Favorites, $FAV_NAME, '' );
  PublishFavorites(  );
}

main(  );

Here’s our example template, favorites.tmpl.html:

<html>
<body>
  <h1><a href="<{$FAV_Home}>"><{$FAV_Name}></a></h1>
  <select onChange="location.replace(this[this.selectedIndex].value)">
    <{
      $OUT .= '<option selected>' . $FAV_Folder . '</option>' . "\n";
      if ( $FAV_Parent ne '' ) {
        $OUT .= '<option value="' . $FAV_Parent . '">..</option>' . "\n";
      }
      foreach my $folder ( sort( keys( %FAV_Folders ) ) ) {
        $OUT .= '<option value="' . $FAV_Folders{$folder}->{document} .
          '">&gt;' . $FAV_Folders{$folder}->{label} . '</option>' . "\n";
      }
    }>
  </select>
  <table>
    <{
      foreach my $link ( sort( keys( %FAV_Links ) ) ) {
        $OUT .= '<tr><td><a target="net" href="' .
          $FAV_Links{$link}->{href} . '">' .
          $FAV_Links{$link}->{label} . '</a></td></tr>' . "\n";
      }
    }>
  </table>
</body>
</html>

And, finally, here’s the simple index.html:

<html>
<head>
  <title>Favorites</title>
</head>
<frameset cols="250,*">
  <frame name="nav" scrolling="yes" src="favorites.html" />
  <frame name="net" src="http://refdesk.com"/>
</frameset>
</html>

Running the Hack

Before you run the code, you need to take care of a few configuration items.

First, let’s make sure that your Favorites directory is where the script thinks it will be. At a command prompt, execute the following:

dir "%USERPROFILE%"\Favorites

If you get a directory listing with lots of names that appear to match things in your IE Favorites, then you’re good to go. If this directory doesn’t exist or if its contents don’t appear to be your Favorites, then you’ll have to find out where on your disk your Favorites are really stored and then change the $FAV_ROOT variable at the top of the script to match.

Second, you need to define your FTP information through the $FAV_HOST, $FAV_PATH, $FAV_USER, and $FAV_PASS variables at the top of the script.

Third, just once, you need to manually upload the index.html document to the directory on your server where you’re going to publish your Favorites. Of course, you are free to rename this document and publish your Favorites to a directory that already contains other files, but we suggest setting aside a separate directory. You are also welcome to change the default page that the index.html file initially shows in the net frame.

Okay, now simply run the script as follows:

% perl PublishFavorites.pl

The script runs quietly unless it encounters a problem. For most problems it might encounter, it just gives up and outputs an error message.

That’s it. Suppose you publish to the Favorites directory on http://www.myserver.net. Just point your browser to http://www.myserver.net/favorites, and you should have a web-accessible menu of all your IE Favorites! An example is available at http://www.ronpacheco.net/favorites/.

Hacking the Hack

There’s a ton of room for enhancement and modification to this hack. Most changes will probably fall into one of the hack’s three major processing tasks: loading the bookmark data, generating output, and publishing.

First, you can make it read something other than the IE Favorites tree. Maybe you want to read Mozilla bookmarks, or suck links off a web site, or read your own tree or bookmarks—whatever. If you can read it into the simple tree structure that the script already uses, you’ll have a plug-and-play subroutine.

Second, you can change the output. You can pretty up the existing HTML template, you can write new templates for things beyond simple HTML, or you can completely rip out the output section and replace it with something new. The framework for the code to traverse the bookmark tree is already in place. You can use the templating tools as is, or you can use the framework to build something new.

Finally, you can get more sophisticated about publishing. If someone were to ask me if, in practice, I’d really hardcode my username and password into a script and then use that script to publish stuff via an unsecured FTP session, I’d probably have to say no. I’m fairly comfortable putting the access information in the script, as long as I have good control over the system where it’s located—I’ve been doing it for a couple decades now without any incidents—but I would be reluctant to use cleartext FTP. In fact, I use FTP to my servers all the time, including a variation of this script, but I tunnel all the connections through SSH. For more sophistication, you could add SSH support directly to the script, and you could consider methods of publication other than FTP.

Like I said, there’s a ton of possibilities, limited only by the imagination of the hacker!

—Ron Pacheco

Hack #83. Spidering GameStop.com Game Prices

Looking to get notification when “Army Men: Quest for Some Semblance of Quality” goes on sale at $5.99? With this hack, you’ll be able to keep an eye on your most desired (or derided) video game titles.

All work and no play makes Jack a dull geek. Of course, having to hunt down game prices to figure out what he can afford to play on his PlayStation 2 makes Jack even duller. It’s so much better to get a spider to do it for him.

We like GameStop.com (http://www.gamestop.com), a retail site for console and PC video games, so we came up with a simple spider that gathers information about games for a particular platform—as written, the script covers XBox games—but, as you’ll see, it’s easy to adapt to other uses.

The Code

Save the following code as gamestop.pl:

#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;
use LWP::Simple;

# the magical URL.
my $url = "http://www.gamestop.com/search.asp?keyword=&platform=26".
          "&lookin=title&range=all&genre=0&searchtype=adv&sortby=title";

# the magical data.
my $data = get($url) or die $!;

# the magical parser.
my $p = HTML::TokeParser->new(\$data);

# now, find every table that's 510 and 75.
while (my $token = $p->get_tag("table")) {
    next unless defined($token->[1]{height});
    next unless defined($token->[1]{width});
    next unless $token->[1]{height} == 75;
    next unless $token->[1]{width} == 510;

    # get our title.
    $p->get_tag("font"); $p->get_tag("a");
    my $title = $p->get_trimmed_text;

    # and our price.
    $p->get_tag("font"); $p->get_tag("/b");
    my $ptoken = $p->get_token;
    my $price = $ptoken->[1];
    $price =~ s/\$//;

    # comma spliced.
    print "\"$title\",$price\n";
}

Running the Hack

The hack is simple enough. It gathers information about XBox games, sorted by title, and puts that information into a comma-delimited file, as per the following output:

% perl gamestop.pl 
"4x4 Evolution 2 - Preowned",16.99
"Aggressive Inline - Preowned",16.99
"Air Force Delta Storm - Preowned",27.99
"Alias",49.99
...etc...

It’s very basic right now, but there’s some fun stuff we can build in.

Hacking the Hack

Let’s start by making the request keyword-based instead of platform-based; maybe you’re interested in racing games and don’t care about the platform.

GameStop by keyword

Add these two lines to the top of the script, after the use statements:

# get our query, else die miserably.
my $query = shift @ARGV; die unless $query;

Then, change your magical URL, like this:

 # the magical URL.
  my $url = "http://www.gamestop.com/search.asp?keyword=$query&platform=".
            "&lookin=title&range=all&genre=0&searchtype=adv&sortby=title";

This’ll give you the first page of results for your keyword. For example:

% perl gamestop.pl racing
"All Star Racing",7.99
"Andretti Racing - Preowned",9.99
"Andretti Racing - Preowned",7.99
"Antz Extreme Racing - Preowned",16.99
"Antz Racing",4.99
"Antz Racing - Preowned",29.99
"ATV Quad Power Racing 2 - Preowned",24.99
"ATV Quad Power Racing 2 - Preowned",17.99
"ATV Quad Power Racing 2 - Preowned",17.99
"ATV: Quad Power Racing 2",19.99
"Batman: Gotham City Racer - Preowned",27.99
"Beetle Adventure Racing - Preowned",29.99

Putting the results in a different format

Of course, getting the results in a comma-delimited format might not be what you want. How about sorting results by price and saving them to an RSS file, so you can have an RSS feed of the cheapest games that match a keyword? (Unabashed capitalist hackers could even add an affiliate code to the link URL.)

Here’s how to do it. The first thing you want to do is add use XML::RSS to the use lines at the top of the script. Then, as in the first example, you can take the query word from the command line, or you can hardcode it into the URL. In this example, I hardcode it, with the idea that you can put this on your server and run it periodically as a cron job:

  # the magical URL.
  my $url = "http://www.gamestop.com/search.asp?".
            "keyword=your search keyword here&platform=".
            "&lookin=title&range=all&genre=0&searchtype=adv&sortby=title";

Now, you want to change the output from a comma-delimited file to an RSS feed. Remove these lines:

 # comma spliced.
 print "\"$title\",$price\n";

and add these lines above the magical URL line:

# start the RSS feed.
my $rss = XML::RSS->new(version => '0.91');
$rss->channel(
    'link'       => "http://www.gamestop.com",
    title        => "Game Prices from GameStop",
    description  => "Great Games and Stuff!"
);

Then, in place of the lines you just removed, add the code that adds each result to the RSS feed:

# add this item
# to our RSS feed.
$rss->add_item(
   title       => "$title, $price...", 
   'link'      => "http://www.gamestop.com/search.asp?keyword=$title".
                  "&platform=0&lookin=title&range=all&genre=0&sortby=title"
);

Finally, add these lines at the end of the script, to save your output as a feed:

# and save our RSS.
$rss->save("gamestop.rdf");

There are several minor hacks you can try with this script. GameStop.com offers several different search options; try experimenting with the different searches and see how they affect the result URLs. Experimenting with the URL options in the magical URL lines can get you lots of different results. Likewise, as written, the script reports only on the first page of results; walking the entire listing of search results can be done with WWW::Mechanize [Hack #21] or a manual loop, as in the sketch that follows.
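
Here’s a very rough sketch of the WWW::Mechanize approach; the assumption that the results pages carry a “next”-style link would need to be checked against the real site:

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;

my $start = "http://www.gamestop.com/search.asp?keyword=racing&platform=".
            "&lookin=title&range=all&genre=0&searchtype=adv&sortby=title";

my $mech = WWW::Mechanize->new( agent => 'Mozilla/5.0' );
$mech->get($start);

while (1) {
    my $data = $mech->content;
    # ... feed $data to the HTML::TokeParser loop shown earlier ...

    # Stop when there's nothing that looks like a "next page" link.
    my $next = $mech->find_link( text_regex => qr/next/i ) or last;
    $mech->get( $next->url );
}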

Hack #84. Bargain Hunting with PHP

If you’re always on the lookout for the best deals, coupons, and contests, a little bit of PHP-scraping code can help you stay up-to-date.

Scraping content is a task that can be handled by most programming languages. PHP is quickly becoming one of the most popular scripting languages, and it is particularly well suited to scraping work. With a moderate grasp of PHP, programmers can write scrapers in a matter of minutes. In this section, we’ll work through some of the basic code and concepts for scraping with PHP.

There are a handful of useful functions that most scraping tasks will need, which will make writing customized scrapers almost painless. For the sake of simplicity, we won’t use regular expressions here, but the more agile programmers will quickly note where regular expressions might make these functions work better.

The first function that we want uses PHP’s fopen( ) function to fetch individual pages from a web server. For more sophisticated scrapers, a direct socket connection is probably more desirable, but that’s another matter. For now, we’ll go the simple way:

function getURL( $pURL ) {
   $_data = null;
   if( $_http = fopen( $pURL, "r" ) ) {
      while( !feof( $_http ) ) {
         $_data .= fgets( $_http, 1024 );
      }
      fclose( $_http );
   }
   return( $_data );
}

Calling this function is done simply, like this:

$_rawData = getURL( "http://www.example.com/" );

If $_rawData is null, the function wasn’t able to fetch the page. If $_rawData contains a string, we’re ready for the next step.

Because every author codes her HTML slightly differently, it’s useful to normalize the raw HTML data that getURL( ) returns. We can do this with the cleanString( ) function, which simply removes newline, carriage return, tab, and extra space characters. Regular expressions could simplify this function a bit, if you are comfortable with them.

function cleanString( $pString ) {
   $_data = str_replace( array( chr(10), chr(13), chr(9) ), chr(32), [RETURN]
$pString );
   while( strpos( $_data, str_repeat( chr(32), 2 ), 0 ) !== false ) {
      $_data = str_replace( str_repeat( chr(32), 2 ), chr(32), $_data );
   }
   return( trim( $_data ) );
}

We’ll clean up the raw HTML source with the following code:

$_rawData = cleanString( $_rawData );

Now, we have some data that is easy to parse. Two other useful functions will parse out particular pieces of the source and get data from individual HTML tags:

function getBlock( $pStart, $pStop, $pSource, $pPrefix = true ) {
   $_data = null;
   $_start = strpos( strtolower( $pSource ), strtolower( $pStart ), 0 );
   $_start = ( $pPrefix == false ) ? $_start + strlen( $pStart ) : $_start;
   $_stop = strpos( strtolower( $pSource ), strtolower( $pStop ), $_start );
   if( $_start !== false && $_stop > $_start ) {
      $_data = trim( substr( $pSource, $_start, $_stop - $_start ) );
   }
   return( $_data );
}

function getElement( $pElement, $pSource ) {
   $_data = null;
   $pElement = strtolower( $pElement );
   $_start = strpos( strtolower( $pSource ), chr(60) . $pElement, 0 );
   $_start = strpos( $pSource, chr(62), $_start ) + 1;
   $_stop = strpos( strtolower( $pSource ), "</" . $pElement . [RETURN]
   chr(62), $_start );
   if( $_start > strlen( $pElement ) && $_stop > $_start ) {
      $_data = trim( substr( $pSource, $_start, $_stop - $_start ) );
   }
   return( $_data );
}

We can use each of these functions with the following code:

$_rawData = getBlock( start_string, end_string, raw_source, [RETURN]
include_start_string );
$_rawData = getElement( html_tag, raw_source );

Let’s assume for a moment that we have source code that contains the string "Total of 13 results", and we want just the number of results. We can use getBlock( ) to get that number with this code:

$_count = getBlock( "Total of", "results", $_rawData, false );

This returns "13“. If we set $pPrefix to true, $_count will be "Total of 13“. Sometimes, you might want the start_string included, and other times, as in this case, you won’t.

The getElement( ) function works basically the same way, but it is specifically designed for parsing HTML-style tags instead of dynamic strings. Let’s say our example string is "Total of <b>13</b> results". In this case, it’s easier to parse out the bold element:

$_count = getElement( "b", $_rawData );

This returns "13" as well.

It’s handy to put the scraping functions into an includable script, because it keeps you from having to copy/paste them into all your scraping scripts. In the next example, we save the previous code into scrape_func.php.

Now that we have the basics covered, let’s scrape a real page and see it in action. For this example, we’ll scrape the latest deals list from TechDeals.net (http://www.techdeals.net).

The Code

Save the following code as bargains.php:

/* include the scraping functions script:  */
include( "scrape_func.php" ); 

/* Next, we'll get the raw source code of
   the page using our getURL(  ) function:  */
$_rawData = getURL( "http://www.techdeals.net/" ); 

/* And clean up the raw source for easier parsing:  */
$_rawData = cleanString( $_rawData ); 

/* The next step is a little more complex. Because we've already
   looked at the HTML source, we know that the items start and
   end with two particular strings. We'll use these strings to
   get the main data portion of the page:*/
$_rawData = getBlock( "<div class=\"NewsHeader\">",
                      "</div> <div id=\"MenuContainer\">", $_rawData ); 

/* We now have the particular data that we want to parse into
   an itemized list. We do that by breaking the code into an
   array so we can loop through each item: */
$_rawData = explode( "<div class=\"NewsHeader\">", $_rawData ); 

/* While iterating through each value, we
   parse out the individual item portions:  */
foreach( $_rawData as $_rawBlock ) {
   $_item = array(  );
   $_rawBlock = trim( $_rawBlock );
   if( strlen( $_rawBlock ) > 0 ) {

      /*   The title of the item can be found in <h2> ... </h2> tags   */
      $_item[ "title" ] = strip_tags( getElement( "h2", $_rawBlock ) );

      /*   The link URL is found between
           http://www.techdeals.net/rd/go.php?id= and "   */
      $_item[ "link" ] = getBlock( "http://www.techdeals.net/rd/go.php?id=",
                                   chr(34), $_rawBlock );

      /*   Posting info is in <span> ... </span> tags   */
      $_item[ "post" ] = strip_tags( getElement( "span", $_rawBlock ) );

      /*   The description is found between an </div> and a <img tag   */
      $_item[ "desc" ] = cleanString( strip_tags( getBlock( "</div>",
                                      "<img", $_rawBlock ) ) );

      /*   Some descriptions are slightly different,
           so we need to clean them up a bit   */
      if( strpos( $_item[ "desc" ], "Click here for the techdeal", 0 ) [RETURN]
      > 0 ) {
         $_marker = strpos( $_item[ "desc" ], "Click here for the techdeal", [RETURN]
         0 );
         $_item[ "desc" ] = trim( substr( $_item[ "desc" ], 0, $_marker ) );
      }

      /*   Print out the scraped data   */
      print( implode( chr(10), $_item ) . chr(10) . chr(10) );

      /*   Save the data as a string (used in the mail example below)   */
      $_text .= implode( chr(10), $_item ) . chr(10) . chr(10);
   }
}

Running the Hack

Invoke the script from the command line, like so:

% php -q bargains.php

Values on Video
http://www.techdeals.net/rd/go.php?id=28
Posted 08/06/03 by david
TigerDirect has got the eVGA Geforce FX5200 Ultra 128MB video card
with TV-Out & DVI for only $124.99+S/H after a $20 rebate. 

Potent Portable
http://www.techdeals.net/rd/go.php?id=30
Posted 08/06/03 by david
Best Buy has got the VPR Matrix 220A5 2.2Ghz Notebook for just
$1049.99 with free shipping after $250 in rebates.

...etc...

Hacking the Hack

This output could be emailed easily, or you could even put it into an RSS feed. If you want to email it, you can use PHP’s mail( ) function:

mail( "me@foo.com", "Latest Tech Deals", $_text );

But how do you output RSS in PHP? While there are many ways to go about it, we’ll use the simplest to keep everything concise. Creating an RSS 0.91 feed is a matter of three small sections of code—the channel metadata, the item block, and the closing channel tags:

<rss version="0.91">
   <channel>
      <title><?= htmlentities( $_feedTitle ) ?></title>
      <link><?= htmlentities( $_feedLink ) ?></link>
      <description><?= htmlentities( $_feedDescription ) ?></description>
      <language>en-us</language> 

      <item>
         <title><?= htmlentities( $_itemTitle ) ?></title>
         <link><?= htmlentities( $_itemLink ) ?></link>
         <description><?= htmlentities( $_itemDescription ) ?></description>
      </item> 

   </channel>
</rss>

By putting together these three simple blocks, we can quickly output a full RSS feed. For example, let’s use our scraper and output RSS instead of plain text:

<rss version="0.91">
   <channel>
      <title>TechDeals: Latest Deals</title>
      <link>http://www.techdeals.net/</link>
      <description>Latest deals from TechDeals.net (scraped)</description>
      <language>en-us</language>
<?
   include( "scrape_func.php" );
   $_rawData = getURL( "http://www.techdeals.net/" );
   $_rawData = cleanString( $_rawData );
   $_rawData = getBlock( "<div class=\"NewsHeader\">",
                         "</div> <div id=\"MenuContainer\">", $_rawData );
   $_rawData = explode( "<div class=\"NewsHeader\">", $_rawData );
   foreach( $_rawData as $_rawBlock ) {
      $_item = array(  );
      $_rawBlock = trim( $_rawBlock );
      if( strlen( $_rawBlock ) > 0 ) {
         $_item[ "title" ] = strip_tags( getElement( "h2", $_rawBlock ) );
         $_item[ "link" ] 
         = getBlock( "http://www.techdeals.net/rd/go.php?id=", 
         chr(34), $_rawBlock );
         $_item[ "post" ] = strip_tags( getElement( "span", $_rawBlock ) );
         $_item[ "desc" ] = cleanString( strip_tags( getBlock( "</div>",
                                      "<img", $_rawBlock ) ) );
         if( strpos($_item[ "desc" ], "Click for the techdeal", 0 ) > 0 ) {
            $_marker = strpos($_item[ "desc" ], "Click for the techdeal",0 );
            $_item[ "desc" ] = trim(substr( $_item[ "desc" ], 0, $_marker) );
         }
?>
      <item>
         <title><?= $_item ["title" ] ?></title>
         <link><?=  $_item[ "link" ] ?></link>
         <description>
            <?= $_item[ "desc" ] . " (" . $_item[ "post" ] . ")" ?>
         </description>
      </item>
<?
      }
   }
?>
   </channel>
</rss>

Keep in mind that this is the quick-and-dirty way to create RSS. If you plan on generating a lot of RSS, look into RSS 1.0 and build yourself a PHP class for the RSS-generating code.

As you can see, a few simple functions and a few lines of code are all that is needed to make a usable scraper in PHP. Customizing the script and the output are a matter of personal whim. In this particular example, you could also parse out information about the comments that are included in the items, or you could merge in other bargain sites, like AbleShoppers (http://www.ableshopper.com) or Ben’s Bargains (http://www.bensbargains.net).

—James Linden

Hack #85. Aggregating Multiple Search Engine Results

Even though Google may solve all your searching needs on a daily basis, there may come a time when you need a “super search”—something that queries multiple search engines or databases at once.

Google is still the gold standard for search engines and still arguably the most popular search spot on the Web. But after years of stagnation, the search engine wars are firing up again. AlltheWeb.com (http://www.alltheweb.com) in particular is working hard to offer new search syntax, a larger web index (over 3.2 billion URLs at the time of this writing), and additional interface options. If you want to keep up with searching on the Web, it behooves you to try search engines other than Google, if only to get an idea of how the other engines are evolving.

This hack builds a meta-search engine, querying several search engines in turn and displaying the aggregated results. Actually, it can query more than just search engines; it can request data from anything to which you can submit a search request. It does so by using a set of plug-ins—each of which knows the details of a particular search engine or site’s search request syntax and the format of its results—that perform the search and return the results. The main script, then, does nothing more than farm out the request to these plug-ins and let them perform their magic. This is an exercise in hacking together a client/server protocol. The protocol I use is simple: each plug-in needs to return URL and text pairs. How do we delimit one from the other? By finding a character that’s illegal in URLs, such as the common tab, and using that to separate our data.

The protocol runs as follows:

  1. The server starts up a plug-in as an executable program, with the search terms as command-line parameters.

  2. The client responds by printing one result per new line, in the format of URL, tab, then text.

  3. The server receives the data, formats it a little before printing, and then moves on to the next available plug-in.

Note that because we have a simple call and response pattern, the plug-ins can query anything, including your own local databases with Perl’s DBI, Python scripts that grok FTP servers, or PHP concoctions that do reverse lookups on phone numbers. As long as the plug-in returns the data in URL-tab-text format, what it does and how it’s programmed don’t matter.
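
To make the protocol concrete, here’s about the smallest plug-in imaginable. It ignores the Web entirely and greps a hypothetical bookmarks.txt file in which each line is already a URL, a tab, and a description:

#!/usr/bin/perl -w
use strict;

# Take the search terms from the command line, print one
# "URL<tab>text" pair per matching line, and exit.
my $query = join " ", @ARGV;
open my $fh, '<', 'bookmarks.txt' or exit;   # no file, no results.
while (my $line = <$fh>) {
    chomp $line;
    my ($url, $text) = split /\t/, $line, 2;
    next unless defined $text and $text =~ /\Q$query\E/i;
    print "$url\t$text\n";
}
close $fh;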

The Code

The following short piece of code demonstrates the server portion, which searches for a ./plugins directory and executes all the code within:

#!/usr/bin/perl -w

# aggsearch - aggregate searching engine
#
# This file is distributed under the same licence as Perl itself.
#
# by rik - ora@rikrose.net

######################
# support stage      #
######################

use strict;

# change this, if necessary.
my $pluginDir = "plugins";

# if the user didn't enter any search terms, yell at 'em.
unless (@ARGV) { print 'usage: aggsearch "search terms"', "\n"; exit; }

# this routine actually executes the current
# plug-in, receives the tabbed data, and sticks
# it into a result array for future printing.
sub query {
    my ($plugin, $args, @results) = (shift, shift);
    my $command = $pluginDir . "/" . $plugin . " " . (join " ", @$args);
    open RESULTS, "$command |" or die "Plugin $plugin failed!\n";
    while (<RESULTS>) {
        chomp; # remove new line.
        my ($url, $name) = split /\t/;
        push @results, [$name, $url];
    } close RESULTS;

    return @results;
}

######################
# find plug-ins stage #
######################

opendir PLUGINS, $pluginDir
   or die "Plugin directory \"$pluginDir\"".
     "not found! Please create, and populate\n";
my @plugins = grep {
    stat $pluginDir . "/$_"; -x _ && ! -d _ && ! /\~$/;
} readdir PLUGINS; closedir PLUGINS;


######################
# query stage        #
######################

for my $plugin (@plugins){
    print "$plugin results:\n";
    my @results = query $plugin, \@ARGV;
    for my $listref (@results){
        print " $listref->[0] : $listref->[1] \n"
    } print "\n";
}

exit 0;

The plug-ins themselves are even smaller than the server code, since their only purpose is to return a tab-delimited set of results. Our first sample looks through the freshmeat.net (http://freshmeat.net) software site:

#!/usr/bin/perl -w

# Example freshmeat searching plug-in
#
# This file is distributed under the same licence as Perl itself.
#
# by rik - ora@rikrose.net

use strict;
use LWP::UserAgent;
use HTML::TokeParser;

# create the URL from our incoming query.
my $url = "http://freshmeat.net/search-xml?q=" . join "+", @ARGV;

# download the data.
my $ua = LWP::UserAgent->new(  );
$ua->agent('Mozilla/5.0');
my $response = $ua->get($url);
die $response->status_line . "\n"
  unless $response->is_success;

my $stream = HTML::TokeParser->new (\$response->content) or die "\n";
while (my $tag = $stream->get_tag("match")){
    $tag = $stream->get_tag("projectname_full");
    my $name = $stream->get_trimmed_text("/projectname_full");
    $tag = $stream->get_tag("url_homepage");
    my $url = $stream->get_trimmed_text("/url_homepage");
    print "$url\t$name\n";
}

Our second sample uses the Google API:

#!/usr/bin/perl -w

# Example Google searching plug-in

use strict;
use warnings;
use SOAP::Lite;

# all the Google information
my $google_key  = "your API key here";
my $google_wdsl = "GoogleSearch.wsdl";
my $gsrch       = SOAP::Lite->service("file:$google_wdsl");
my $query       = join "+", @ARGV;

# do the search...
my $result = $gsrch->doGoogleSearch($google_key, $query,
                          1, 10, "false", "",  "false",
                          "lang_en", "", "");

# and print the results.
foreach my $hit (@{$result->{'resultElements'}}){
   print "$hit->{URL}\t$hit->{title}\n";
}

Our last example covers AlltheWeb.com:

#!/usr/bin/perl -w

# Example alltheweb searching plug-in
#
# This file is distributed under the same licence as Perl itself.
#
# by rik - ora@rikrose.net

use strict;
use LWP::UserAgent;
use HTML::TokeParser;

# create the URL from our incoming query.
my $url = "http://www.alltheweb.com/search?cat=web&cs=iso-8859-1" .
          "&q=" . (join "+", @ARGV) . "&_sb_lang=en";

# download the data.
my $ua = LWP::UserAgent->new(  );
$ua->agent('Mozilla/5.0');
my $response = $ua->get($url);
die $response->status_line . "\n"
  unless $response->is_success;

my $stream = HTML::TokeParser->new (\$response->content) or die "\n";
while (my $tag = $stream->get_tag("p")){
    $tag = $stream->get_tag("a");
    my $name = $stream->get_trimmed_text("/a");
    last if $name eq "last 10 queries";
    my $url = $tag->[1]{href};
    print "$url\t$name\n";
}

Running the Hack

Invoke the script from the command line, like so:

% perl aggsearch.pl spidering 
alltheweb results:
 Google is now better at spidering dynamic sites. : [long url here] 
 Submitting sites to search engines : [long url here]
 WebcamCrawler.com  : [long url here]
 ...etc...

freshmeat results:
 HouseSpider : http://freshmeat.net/redir/housespider/28546/url_homepage/ 
 PhpDig : http://freshmeat.net/redir/phpdig/15340/url_homepage/
 ...etc...

google results:
 What is Spidering? : http://www.1afm.com/optimization/spidering.html
 SWISH-Enhanced Manual: Spidering : http://swish-e.org/Manual/spidering.html
 ...etc...

The power of combining data from many sources gives you more scope for working out trends in the information, a technique commonly known as data mining.

—Richard Rose

Hack #86. Robot Karaoke

Who says people get to have all the fun? With this hack, you can let your computer do a little singing, by scraping the LyricsFreak.com web site and sending the results to a text-to-speech translator.

There are things that are text-only and things that are multimedia. Then there’s this hack, which turns boring old text into multimedia—specifically, a .wav file.

This hack, as it stands, is actually pretty silly. It searches the lyric collections at LyricsFreak.com for the keywords you specify, then sends the matching lyrics to yet another site (http://naturalvoices.com) that turns them into a .wav file. If you’re running a Win32 system, the code will then automatically play the .wav file (“sing” the lyrics, for some narrow definition of sing) via the Win32::Sound module (http://search.cpan.org/author/ACALPINI/Win32-Sound/).

Listening to your computer’s rendition of the Spider-Man theme song can be detrimental to your health.

As you’re playing with this code, you might want to think of more sublime and less ridiculous implementations. Do you have a site read by low-vision people? Are there short bits of text, such as a local weather forecast, that would be useful for them to have read aloud? Would it be helpful to have a button that would convert a story summary to a .wav file for later download?

The Code

One of the modules used with this code, Win32::Sound, is for Win32 machines only. Since it’s used to play back the generated .wav file, you will not get a “singing” robot if you’re on a non-Win32 machine; you’ll just get a .wav file, suitable for playing through your preferred music player.
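
If you’re not on Win32 and still want automatic playback, one option is simply to shell out to whatever command-line player you have handy. Here’s a minimal sketch; the play utility (from the SoX package) and the filename are assumptions about your system:

#!/usr/bin/perl -w
use strict;

# Hand a saved .wav to an external player.
my $wav_file = shift || 'output.wav';
system('play', $wav_file) == 0
    or warn "Couldn't play $wav_file: $?\n";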

Save this script as robotkaroake.pl:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use URI::Escape;
use Win32::Sound;
use SOAP::Lite;

# use your own Google API key here!
my $google_key  = "your Google key here";
my $google_wdsl = "GoogleSearch.wsdl";

# load in our lyrics phrase from the command line.
my $lyrics_phrase = shift or die "Usage: robotkaroake.pl <phrase>\n";

# and perform the search on Google.
my $google_search_term = "intitle:\"$lyrics_phrase\" site:lyricsfreak.com";
my $googleSearch = SOAP::Lite->service("file:$google_wdsl");
my $result = $googleSearch->doGoogleSearch(
                      $google_key, $google_search_term,
                      0, 10, "false", "", "false",
                      "", "", "");

# if there are no matches, then say so and die.
die "No LyricsFreak matches were found for '$lyrics_phrase'.\n"
          if $result->{estimatedTotalResultsCount} == 0;
  
# and take the first Google result as
# the most likely location on LyricsFreak.com.
my @results         = @{$result->{'resultElements'}};
my $first_result    = $results[0];
my $lyricsfreak_url = $first_result->{'URL'};
print "Downloading lyrics from:\n $lyricsfreak_url\n";

# and download the data from LyricsFreak.com.
my $content = get($lyricsfreak_url) or die $!;
print "Connection to LyricsFreak was successful.\n";

# we have the data, so let's parse it.
# all lyrics are stored in a pre tag,
# so we delete everything before and after.
$content =~ s/.*<pre><b>.*<\/b><br>//mgis;
$content =~ s/<\/pre>.*//mgis;
my @lyrics_lines = split("\x0d", $content);

# AT&T's demo TTS service takes a maximum of 30 words,
# so we'll create a mini chunk of the lyrics to send off.
# each of these chunks will be sent to the TTS server
# then saved separately as multiple mini-wav files.
my (@lyrics_chunks, $current_lyrics_chunk); my $line_counter = 0;
for (my $i = 0; $i <= scalar(@lyrics_lines) - 1; ++$i) {
    next if $lyrics_lines[$i] =~ /^\s*$/;
    $current_lyrics_chunk .= $lyrics_lines[$i] . "\n";

    if (($line_counter == 5) || ($i == scalar(@lyrics_lines) - 1) ) {
        push(@lyrics_chunks, $current_lyrics_chunk);
        $current_lyrics_chunk = ''; $line_counter = 0;
    } $line_counter++;
}

# now, we'll go through each chunk,
# and send it off to our TTS server.
my @temporary_wav_files;
foreach my $lyrics_chunk (@lyrics_chunks) {

    # and download the data.
    my $url = 'http://morrissey.naturalvoices.com/tts/cgi-bin/nph-talk';
    my $req = HTTP::Request->new('POST', $url); # almost there!
    $req->content('txt=' . uri_escape($lyrics_chunk) .
                  '&voice=crystal&speakButton=SPEAK');
    $req->content_type('application/x-www-form-urlencoded');
    my $res = LWP::UserAgent->new->simple_request($req);

    # incorrect server response? then die.
    unless ($res->is_success || $res->code == 301) {
       die "Error connecting to TTS server: " . $res->status_line . ".\n"; }

    # didn't get the response we wanted? die.
    if ($res->content !~ /can be found <A HREF=([^>]*)>here<\/A>/i) {
       die "Response from TTS server not understood. Odd.\n"; }

    # side effect of error checking above is to set $1 to
    # the actual wav file that was generated. this is good.
    my $wav_url  = "http://morrissey.naturalvoices.com$1";
    my $wav_file = $1; # for use in saving to disk.
    $wav_file =~ s/.*?\/(\w+.wav)/$1/;
    getstore($wav_url, "$wav_file") or
     die "Download of $wav_file failed: $!";
    push(@temporary_wav_files, $wav_file);
}

# with all our files downloaded, play them in
# order with the Win32::Sound module. else, they
# just sit there in hopes of the user playing them.
print  "Playing downloaded wav files...\n";
foreach my $temporary_wav_file (@temporary_wav_files) {
    print " Now Playing: $temporary_wav_file\n";
    Win32::Sound::Play("$temporary_wav_file");

}

Running the Hack

Invoke the script on the command line, passing it the phrase you’re interested in; the script will search for that phrase in the titles of pages on LyricsFreak.com. If it doesn’t find the phrase, it’ll just stop:

% perl robotkaroake.pl "fish heads"
No LyricsFreak matches were found for 'fish heads'.

If it does find the phrase, it’ll download the lyrics and generate the .wav file:

% perl robotkaroake.pl "born never asked"
Downloading lyrics from:
 http://www.lyricsfreak.com/l/laurie-anderson/81556.html
Connection to LyricsFreak was successful.
Playing downloaded wav files...
 Now Playing: 7a0c0093f2f531ac98691152d1f74367.wav

The previous example shows the output of a rather short entry. Longer songs will result in more .wav files saved to the current directory, each representing a small chunk (a single chunk representing one request to the TTS server):

% perl robotkaroake.pl "under the moon"
Downloading lyrics from:
 http://www.lyricsfreak.com/i/insane-clown-posse/67657.html
Connection to LyricsFreak was successful.
Playing downloaded wav files...
 Now Playing: fe34e081ab8a3abaeecdb1e50b030209.wav
 Now Playing: 80709499765f9bfe75d3c7234c435a79.wav
 Now Playing: f1ca99233f9cdc6a78f311db887914f1.wav
 Now Playing: fd6b61421f3fc56510cf4b9e0d3a0e12.wav
 Now Playing: b954f58f906d53ec312bbcc6579ebe12.wav
 Now Playing: 407415e685260754174cf45338ba4d10.wav
 Now Playing: 8a2ade6e7f8fe950ddcb58747d241694.wav
 Now Playing: 22ed038190b9ed0fb4e3077655503422.wav

Hack #87. Searching the Better Business Bureau

Is that new company offering to build your house, deliver your groceries, and walk your dog legit and free of complaint? Find out with an automated query of the Better Business Bureau’s web site.

If you’re a citizen of the United States, you’re probably aware of the Better Business Bureau (http://www.bbb.org), a nonprofit organization that acts as a neutral party in resolving complaints between businesses and consumers. There are over 125 local Better Business Bureaus across the country.

The Better Business Bureau (BBB) company database is searchable by URL. This hack runs a BBB search by URL and provides information on a business if one is found. Further, the hack searches PlanetFeedback.com for any additional online feedback about that company.

Links to feedback and basic company information are provided, but a tally of customer complaints from the BBB is not. Why? Each of the 125 local bureaus provides varying amounts of data and formats that data in slightly different ways; adding the code to handle them all would be, we suspect, a monumental undertaking. So, we are not going to provide that here; instead, we’ll stick to basic company information only.

The Code

Save this script as bbbcheck.pl:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use URI::Escape;

# $MAX_BBB_SEARCH_RETRIES is the number of times that the
# script will attempt to look up the URL on the BBB web site. 
# (Experimentally, the BBB web site appeared to give "database
# unavailable" error messages about 30% of the time.)
my $MAX_BBB_SEARCH_RETRIES = 3;

# $MAX_BBB_REFERRAL_PAGE_RETRIES is the number of times the
# script will attempt to download the company information
# from the URL provided in the search results.
my $MAX_BBB_REFERRAL_PAGE_RETRIES = 3;

# suck in our business URL, and append it to the BBB URL.
my $business_url = shift || die "You didn't pass a URL for checking!\n";
my $search_url   = "http://search.bbb.org/results.html?tabletouse=".
                   "url_search&url=" . $business_url;
my %company; # place we keep company info.

# look for the results until requested.
for (my $i = 1; $i <= $MAX_BBB_SEARCH_RETRIES; ++$i) {
    my $data = get($search_url); # gotcha, bugaboo!

    # did we have a problem? pause if so.
    if (!defined($data) or $data =~ /apologize.*delay/) {
       print "Connection to BBB failed. Waiting 5 seconds to retry.\n";
       sleep(5); next; # let's try this again, shall we?
    }

    # die if there's no data to yank.
    die "There were no companies found for this URL.\n"
         if $data =~ /There are no companies/i;

    # get the company name, address, and redirect.
    if ($data =~ /<!-- n -->.*?href="(.*?)">(.*)<!--  -->.*?">(.*)<\/f/i) {
       $company{redir}   = "http://search.bbb.org/$1";
       $company{name}    = $2; $company{address} = $3;
       $company{address} =~ s/<br>/\n/g;
       print "\nCompany name and address:\n";
       print "$company{name}\n$company{address}\n\n";
    }

    # if there was no redirect, then we can't
    # move on to the local BBB site, so we die.
    unless ($company{redir}) {
      die "Unable to process the results returned. You can inspect ".
          "the results manually at the following url: $search_url\n"; }

    last if $data;
}

# now that we have the redirect for the local BBB site,
# we'll try to download its contents and parse them.
for (my $i = 1; $i <= $MAX_BBB_REFERRAL_PAGE_RETRIES; ++$i) {
    my $data = get($company{redir}); 

    # did we have a problem? pause if so.
    unless (defined $data) {
       print "Connection to BBB failed. Waiting 5 seconds to retry.\n";
       sleep(5); next; # let's try this again, shall we?
    }
    
    $data =~ s/\n|\f|\r//g; # strip newlines so the regexes below can match.
    if ($data=~/Date:<\/b>.*?<td.*?>(.*?)<\/td>/i){$company{start_date}=$1;}
    if ($data=~/Entity:<\/b>.*?<td.*?>(.*?)<\/td>/i){$company{entity}=$1;}
    if ($data=~/l ?:<\/b>.*?<td.*?>(.*?)<\/td>/i){$company{principal}=$1;}
    if ($data=~/Phone.*?:<\/b>.*?<td.*?>(.*?)<\/td>/i){$company{phone}=$1;}
    if ($data=~/Fax.*?:<\/b>.*?<td.*?>(.*?)<\/td>/){$company{fax}=$1;}
    if ($data=~/Status:<\/b>.*?<td.*?>(.*?)<\/td>/){$company{mbr}=$1;}
    if ($data=~/BBB:<\/b>.*?<td.*?>(.*?)<\/td>/){$company{joined}=$1;}
    if ($data=~/sification:<\/b>.*?<td.*?>(.*?)<\/td>/){$company{type}=$1;}
    last if $data;
}

# print out the extra data we've found.
print "Further information (if any):\n";
foreach (qw/start_date entity principal phone fax mbr joined type/) {
   next unless $company{$_}; # skip blanks.
   print " Start Date: " if $_ eq "start_date";
   print " Type of Entity: " if $_ eq "entity";
   print " Principal: " if $_ eq "principal";
   print " Phone Number: " if $_ eq "phone";
   print " Fax Number: " if $_ eq "fax";
   print " Membership Status: " if $_ eq "mbr";
   print " Date Joined BBB: " if $_ eq "joined";
   print " Business Classification: " if $_ eq "type";
   print "$company{$_}\n";
} print "\n";

# alright. we have all our magic data that we can get from the 
# BBB, so let's see if there's anything on PlanetFeedback.com to display.
my $planetfeedback_url = "http://www.planetfeedback.com/sharedLetters".
                         "Results/1,2933,,00.html?frmCompany=".
                         uri_escape($company{name})."&frmFeedbackType".
                         "One=0&frmIndustry=0&frmFeedbackTypeTwo=0".
                         "&frmMaxValue=20&buttonClicked=submit1".
                         "&frmEventType=0";
my $data = get($planetfeedback_url) or # go, speed
  die "Error downloading from PlanetFeedback: $!"; # racer, go!

# did we get anything worth showing?
if ($data =~ /not posted any Shared Letters/i) {
   print "No feedback found for company '$company{name}'\n";
} else { print "Feedback available at $planetfeedback_url\n"; }

Running the Hack

Invoke the script on the command line with the URL of a business site you’d like to check. If there’s no match at the BBB—a distinct possibility, since it doesn’t contain every known business URL—the script will stop:

% perl bbbcheck.pl http://www.oreilly.com
There were no companies found for this URL.

If there is a match, it’ll give you some information about the company, then check PlanetFeedback.com for additional data. If they’ve received any comments on the business at hand, you’ll be provided a URL for further reading.

Let’s do a little checking up on Microsoft, shall we?

% perl bbbcheck.pl http://www.microsoft.com
Company name and address:
MICROSOFT CORPORATION
9255 Towne Center Dr 4th Fl
SAN DIEGO, CA

Further information (if any):
 Start Date: January 1975
 Type of Entity: Corporation

 Principal: Ms Shaina Houston FMS
 Phone Number: January 1975
 Fax Number: (858) 909-3838
 Membership Status: Yes
 Date Joined BBB: May 2003
 Business Classification: Computer Sales & Service

Feedback available at http://www.planetfeedback.com/sharedLettersResults/
1,2933,,00.html?frmCompany=MICROSOFT%20CORPORATION&frmFeedbackTypeOne=0& 
frmIndustry=0&frmFeedbackTypeTwo=0&frmMaxValue=20&buttonClicked=submit1& 
frmEventType=0

Hacking the Hack

The script here is extensive in what it does. After all, it visits two sites and provides you with a fair amount of information. But despite that, it’s still pretty bare-bones. The output is sent only to the screen, and the amount of information it scrapes is limited because of the multiple formats of the various local BBBs.

So, when you’re planning on improving the script, focus on two different things. First, think about how you might scrape more information if it were presented in a more standard format. For example, say you want to search only businesses in San Francisco. The BBB search site allows for that, though you’ll have to search by business name instead of URL (see the first search option at http://search.bbb.org/search.html). If you search for businesses only in San Francisco, you’ll get results only from the Golden Gate BBB. With one data format, you can access more information, including any complaint numbers and the company’s standing in the BBB.

The second thing you’ll want to improve is output. Currently, this hack sends out only plain text, but, as you saw previously, the PlanetFeedback.com URL is extensive. To fix this, you might want to spit out HTML instead, allowing you to simply click a link instead of copying and pasting. For that matter, you could set up an array with several business URLs and send all their results to the same file.
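
For instance, here’s a rough sketch of the batch idea: it loops over a hypothetical list of business URLs, runs the existing bbbcheck.pl for each, turns any URLs in the output into clickable links, and writes everything to a single HTML report:

#!/usr/bin/perl -w
use strict;

my @businesses = qw(
    http://www.microsoft.com
    http://www.example.com
);

open my $html, '>', 'bbbreport.html' or die "Can't write report: $!";
print $html "<html><body>\n";
for my $biz (@businesses) {
    my $report = `perl bbbcheck.pl $biz`;                # reuse the script.
    $report =~ s{(https?://\S+)}{<a href="$1">$1</a>}g;  # linkify URLs.
    print $html "<h2>$biz</h2>\n<pre>$report</pre>\n";
}
print $html "</body></html>\n";
close $html;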

Hack #88. Searching for Health Inspections

How healthy are the restaurants in your neighborhood? And when you find a good one, how do you get there? By combining databases with maps!

You don’t have to scrape a site to build a URL that leads to their resources! This hack searches Seattle’s King County database of restaurant inspections (http://www.decadeonline.com/main.phtml?agency=skc), which can be queried with a complete restaurant name or just a single word. The script returns a list of the restaurants found, links to the restaurant’s health inspection information, and also adds a direct link to a MapQuest map of the restaurant’s location.

What? Isn’t scraping MapQuest against its TOS? Yes, but this program doesn’t touch the MapQuest site; instead, it builds a direct link to a relevant MapQuest map. So, while a user might access a MapQuest page based on this program’s output, we never programmatically access the site and thus never violate the TOS.

The Code

Save this script as kcrestaurants.pl:

#!/usr/bin/perl -w
use strict;
use HTML::TableExtract;
use LWP::Simple;
use URI::Escape;

# get our restaurant name from the command line.
my $name = shift || die "Usage: kcrestaurants.pl <string>\n";

# and our constructed URL to the health database.
my $url = "http://www.decadeonline.com/results.phtml?agency=skc".
          "&forceresults=1&offset=0&businessname=" . uri_escape($name) .
          "&businessstreet=&city=&zip=&soundslike=&sort=FACILITY_NAME";

# download our health data.
my $data = get($url) or die $!;
die "No restaurants matched your search query.\n"
    if $data =~ /no results were found/;
 
# and suck in the returned matches.
my $te = HTML::TableExtract->new(keep_html => 1, count => 1);
$te->parse($data) or die $!; # yum, yum, i love second table!

# and now loop through the data.
foreach my $ts ($te->table_states) {
  foreach my $row ($ts->rows) {
     next if $row->[1] =~ /Site Address/; # skip if this is our header.
     foreach ( qw/ 0 1 / ) { # remove googly poofs.
        $row->[$_] =~ s/^\s+|\s+|\s+$/ /g; # remove whitespace.
        $row->[$_] =~ s/\n|\f|\r/ /g; # remove newlines.
     } 

     # determine name/addresses.
     my ($url, $name, $address, $mp_url); 
     if ($row->[0] =~ /href="(.*?)">.*?2">(.*?)<\/font>/) {
         ($url, $name) = ($1, $2); # almost there.
     } if ($row->[1] =~ /2">(.*?)<\/font>/) { $address = $1; }

     # and the MapQuest URL.
     if ($address =~ /(.*), ([^,]*)/) {
         my $street = $1; my $city = $2;
         $mp_url = "http://www.mapquest.com/maps/map.adp?".
                   "country=US&address=" . uri_escape($street) .
                   "&city=" . $city . "&state=WA&zipcode=";
     }

     print "Company name: $name\n";
     print "Company address: $address\n";
     print "Results of past inspections:\n ".
           "http://www.decadeonline.com/$url\n";
     print "MapQuest URL: $mp_url\n\n";
  }
}

Running the Hack

To run the hack, just specify the restaurant name or keyword you want to search for. If there’s no restaurant found based on your query, it’ll say as much:

% perl kcrestaurants.pl perlfood
No restaurants matched your search query.

A matching search returns health inspection and MapQuest links:

% perl kcrestaurants.pl "restaurant le gourmand"
Company name: RESTAURANT LE GOURMAND
Company address: 425 NW MARKET ST , Seattle
Results of past inspections:
 http://www.decadeonline.com/fac.phtml?
   agency=skc&forceresults=1&facid=FA0003608
MapQuest URL: http://www.mapquest.com/maps/map.adp?country=US&address
   =425%20NW%20MARKET%20ST%20&city=Seattle&state=WA&zipcode=

Or, if there are a number of results, it returns a complete list:

% perl kcrestaurants.pl restaurant
Company name: RESTAURANT EL TAPATIO
Company address: 3720 FACTORIA BL , Bellevue
Results of past inspections:
 http://www.decadeonline.com/fac.phtml?
   agency=skc&forceresults=1&facid=FA0003259
MapQuest URL: http://www.mapquest.com/maps/map.adp?country=US&address
   =3720%20FACTORIA%20BL%20&city=Bellevue&state=WA&zipcode=

Company name: RESTAURANT ICHIBAN
Company address: 601 S MAIN ST , Seattle
Results of past inspections:
 http://www.decadeonline.com/fac.phtml?
   agency=skc&forceresults=1&facid=FA0001743
MapQuest URL: http://www.mapquest.com/maps/map.adp?country=US&address
   =601%20S%20MAIN%20ST%20&city=Seattle&state=WA&zipcode=

...

Hacking the Hack

If you don’t live in Seattle, you might not personally have much use for this particular example. But if you live anywhere within the United States, the code can be adapted to suit you. Many counties in the United States have posted their restaurant inspection scores online. Go to your state or county’s official web site (the county site is better if you know what it is) and search for restaurant inspections. From there, you should be able to find restaurant scores from which you can build a script like this. Bear in mind that different counties have different levels of information.

You don’t have to use MapQuest either. If you have the name, city, and state of a restaurant, you can build a URL to get the phone number from Google. (However, you can’t use the Google API to perform this search, because it does not yet support the phonebook: syntax.)

Let’s take our previous example, Restaurant Le Gourmand in Seattle, Washington. The Google syntax for a business phonebook query would be:

bphonebook:Restaurant Le Gourmand Seattle WA

And the URL to lead to the result would look like this:

http://www.google.com/search?q=bphonebook:Restaurant+Le+Gourmand+Seattle+WA

You might want to use that instead of, or in addition to, a link to MapQuest.
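
If you’d like kcrestaurants.pl to print that link for each result, here’s a minimal sketch. It assumes the $name and $address variables from the loop in the code above, and it pulls the city out of the address the same way the MapQuest section does (with the state hardcoded to WA, as before):

     # build a Google business-phonebook URL (sketch; assumes the
     # city follows the last comma in $address, as above).
     if ($address =~ /(.*), ([^,]*)/) {
         my $city = $2;
         my $phone_url = "http://www.google.com/search?q=" .
                         uri_escape("bphonebook:$name $city WA");
         print "Google phonebook URL: $phone_url\n";
     }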

Hack #89. Filtering for the Naughties

Use search engines to construct your own parental control ratings for sites.

As we’ve attempted to show several times in this book, your scripts don’t have to start and end with simple Perl spidering. You can also incorporate various web APIs (such as Technorati [Hack #66]). In this hack, we’re going to add some Google API magic to see whether the domains in a list pulled off a page contain prurient (i.e., naughty) content, as determined by Google’s SafeSearch filtering mechanism.

As the hack is implemented, a list of domains is pulled off Fark (http://www.fark.com), a site known for its odd selection of daily links. For each domain, up to 50 of its URLs (gathered via a Google search) are put into an array, and each array item is then checked to see if it appears in a Google search with SafeSearch enabled. If it does, it’s considered a good URL. If it doesn’t, it’s put under suspicion of being a not-so-good URL. The idea is to get a sense of how much of an entire domain is being filtered, instead of just one URL.

Filtering mechanisms are not perfect. Sometimes they filter things that aren’t bad at all, while sometimes they miss objectionable content. While the goal of this script is to give you a good and general idea of a domain’s content on the naughtiness scale, it won’t be perfect.

The Code

Save the following code as purity.pl:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use SOAP::Lite;

# fill in your google.com API information here.
my $google_key  = "your Google API key here";
my $google_wdsl = "GoogleSearch.wsdl";
my $gsrch       = SOAP::Lite->service("file:$google_wdsl");

# get our data from Fark's "friends".
my $fark = get("http://www.fark.com/") or die $!;
$fark =~ m!Friends:</td></tr>(.*?)<tr><td class=\"lmhead\">Fun Games:!migs; 
my $farklinks = $1; # all our relevances are in here.

# and now loop through each entry.
while ($farklinks =~ m!href="(.*?)"!gism) {
   my $farkurl = $1; next unless $farkurl;
   my @checklist; # urls to check for safety.
   print "\n\nChecking $farkurl.\n";

   # getting the full result count for this URL.
   my $count = $gsrch->doGoogleSearch($google_key, $farkurl,
                        0, 1, "false", "",  "false", "", "", "");
   my $firstresult = $count->{estimatedTotalResultsCount};
   print "$firstresult matching results were found.\n";
   if ($firstresult > 50) { $firstresult = 50; }

   # now, get a maximum of 50 results, with no safe search.
   my $counter = 0; while ($counter < $firstresult) {

       my $urls = $gsrch->doGoogleSearch($google_key, $farkurl,
                           $counter, 10, "false", "",  "false", "", "", "");

       foreach my $hit (@{$urls->{resultElements}}) {
           push (@checklist, $hit->{URL}); 
       } $counter = $counter +10; 
   }

   # and now check each of the matching URLs.
   my (@goodurls, @badurls); # storage.
   foreach my $urltocheck (@checklist) {
       $urltocheck =~ s/http:\/\///;

       my $firstcheck = $gsrch->doGoogleSearch($google_key, $urltocheck,
                                 0, 1, "true", "",  "true", "", "", "");

       # check our results. if no matches, it's naughty.
       my $firstnumber = $firstcheck->{estimatedTotalResultsCount} || 0;
       if ($firstnumber == 0) { push @badurls, $urltocheck; }
       else { push @goodurls, $urltocheck; }
   }

   # and spit out some results.
   my ($goodcount, $badcount) = (scalar(@goodurls), scalar(@badurls));
   print "There are $goodcount good URLs and $badcount ".
         "possibly impure URLs.\n"; # wheeEeeeEE!

   # display the bad URLs if there are only a few.
   unless ( $badcount >= 10 || $badcount == 0) {
       print "The bad URLs are\n";
       foreach (@badurls) {
          print " http://$_\n"; 
       }
    }

   # happy percentage display.
   my $percent = $goodcount * 2; my $total = $goodcount+$badcount;
   if ($total==50) { print "This URL is $percent% pure!"; }

}

Running the Hack

The hack requires no command-line arguments. Simply run it from the command line as you would any Perl script, and it’ll return a list of domains and each domain’s purity percentage (as determined by Google’s SafeSearch):

% perl purity.pl

Checking http://www.aprilwinchell.com/.
161 matching results were found.
There are 36 good URLs and 14 possibly impure URLs.
This URL is 72% pure!

Checking http://www.badjocks.com/.
47 matching results were found.
There are 36 good URLs and 9 possibly impure URLs.
The bad URLs are
 http://www.thepunchline.com/cgi-bin/links/bad_link.cgi?ID=4052&d=1
 http://www.ilovebacon.com/020502/i.shtml
 http://www.ilovebacon.com/022803/l.shtml
...

Hacking the Hack

You might find something else you want to scrape, such as the links on your site’s front page. Are you linking to something naughty by mistake? How about performing due diligence on a site you’re thinking about linking to; will you inadvertently be leading readers to sites of a questionable nature via a seemingly innocent intermediary? Perhaps you’d like to check entries from a specific portion of the Yahoo! or DMOZ directories [Hack #47]? Anything that generates a list of links is fair game for this script.
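
For instance, to run the check against the links on your own front page instead of Fark’s friends list, you could replace the Fark-specific scraping in purity.pl with something like this minimal sketch (http://www.example.com/ is a placeholder for your own site, and the body of the existing while loop becomes the body of the foreach):

# pull every outbound link from your own front page instead of
# Fark's "Friends" list (http://www.example.com/ is a placeholder).
my $page = get("http://www.example.com/") or die $!;
my @checkme;
while ($page =~ m!href="(http.*?)"!gis) { push @checkme, $1; }

foreach my $farkurl (@checkme) {
   # ...the rest of the existing while loop's body goes here...
}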

As it stands, the script checks a maximum of 50 URLs per domain. While this makes for a pretty thorough check, it also makes for a long wait, especially if you have a fair number of domains to check. You may decide that checking 10 URLs per domain is a far better thing to do. In that case, change the line that caps the count at 50 so that it reads:

if ($firstresult > 10) { $firstresult = 10; }
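
One thing to watch: the purity-percentage math at the bottom of the script assumes exactly 50 checked URLs (it multiplies the good count by 2 and prints only when the total is 50). If you lower the cap, you’ll also want to generalize that calculation, perhaps along these lines:

   # compute the purity percentage for any number of checked URLs.
   my $total = $goodcount + $badcount;
   if ($total > 0) {
       my $percent = int(100 * $goodcount / $total);
       print "This URL is $percent% pure!";
   }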

When Tara originally wrote the code, she was a little concerned that it might be used to parse naughty sites and generate lists of naughty URLs for porn peddling. So, she chose not to display the list of naughty URLs unless they made up only a small fraction of the final results (currently, fewer than 10 of the 50 URLs). You might want to change that, especially if you’re using this script to check links from your own site and want an idea of the kind of content you might be linking to. In that case, you’ll need to change just one line:

unless ( $badcount >= 50 || $badcount == 0) {

By increasing the count to 50, you’ll be informed of all the bad sites associated with the current domain. Just be forewarned: certain domains may return nothing but the naughties, and even the individual words that make up the returned URLs can be downright disturbing.
