Chapter 4. Gleaning Data from Databases

Hacks #43-89

In Chapter 3, you learned techniques for collecting media files. Now you’re going to take those lessons and move in a slightly different direction: gleaning data from databases and information collections.

Information collections can be as large and multifaceted as Google’s index of the World Wide Web, or as narrow and precise as King County health collections. You can scrape information as general as archives from Yahoo! Groups, or as targeted as game prices from GameStop.com. In this chapter, we’ll look at a variety of ways you can access database information, a variety of sources you might want to try (and a few hints for thinking of your own!), and ways that you can combine the power of programming with already-existing web APIs to make new and powerful tools.

Hack #43. Archiving Yahoo! Groups Messages with yahoo2mbox

Looking to keep a local archive of your favorite mailing list? With yahoo2mbox, you can import the final results into your favorite mailer.

With the popularity of Yahoo! Groups (http://groups.yahoo.com/) comes a problem. Sometimes, you want to save the archives of a Yahoo! Group, but you want to be able to access it outside the Yahoo! Groups site. Or you want to move your list somewhere else and be able to take your existing archive with you.

Vadim Zeitlin had these same concerns, which is why he wrote yahoo2mbox (http://www.lpthe.jussieu.fr/~zeitlin/yahoo2mbox.html). This hack retrieves all the messages from a mailing list archive at Yahoo! Groups and saves them to a local file in mbox format. Plenty of options make this handy to have when you’re trying to transfer information from Yahoo! Groups.

As of this writing, the program is still fairly new, so be sure to visit its URL (cited in the previous paragraph) to download the latest version. Note that you’ll need Perl and several additional modules to run this code, including Getopt::Long, HTML::Entities, HTML::HeadParser, HTML::TokeParser, and LWP::UserAgent.
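
If any of these modules are missing, you can usually install them from the CPAN shell [Hack #8]; for example, to grab HTML::TokeParser:

% perl -MCPAN -e 'install HTML::TokeParser'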

Running the Hack

Running the code looks like this:

perl yahoo2mbox.pl [options] [-o <mbox>] <groupname>

The options for running the program are as follows:

--help          give the usage message showing the program options
--version       show the program version and exit
--verbose       give verbose informational messages (default)
--quiet         be silent, only error messages are given
-o mbox         save the message to mbox instead of file named groupname
--start=n       start retrieving messages at index n instead of 1
--end=n         stop retrieving messages at index n instead of the last one
--noresume      don't resume, overwrite the existing output file if any
--user=name     login to eGroups using this username (default: guest login)
--pass=pass     the password to use for login (default: none)
--cookies=xxx   file to use to store cookies (default: none,
                'netscape' uses netscape cookies file).
--proxy=url     use the given proxy; if 'no', don't use proxy 
                at all (not even the environment variable http_proxy, 
                which is used by default), may use http://username:password\
                @full.host.name/ notation
--country=xx    use the given country code to access localized yahoo

So, this command downloads messages from Weird Al Club, starting at message 3258:

% perl yahoo2mbox.pl --start=3258 weirdalclub2
Logging in anonymously... ok.
Getting number of messages in group weirdalclub2...
Retrieving messages 3258..3287: .............................. done!
Saved 30 message(s) in weirdalclub2.

Here, the messages are saved to a file called weirdalclub2. Renaming the file weirdalclub2.mbx means that you can immediately open the messages in Eudora, as shown in Figure 4-1. Of course, you can also open the resulting files in any mail program that can import (or natively read) the mbox format.

Figure 4-1. A Yahoo! Groups archive in Eudora

Hacking the Hack

Because this is someone else’s program, there’s not too much hacking to be done. On the other hand, you might find that you don’t want to end this process with the mbox file; you might want to convert to other formats for use in other projects or archives. In that case, check out these other programs to take that mbox format a little further:

hypermail (http://sourceforge.net/projects/hypermail/)

Converts mbox format to cross-referenced HTML documents.

mb2md (http://www.gerg.ca/hacks/mb2md/)

Converts mbox format to Maildir. Requires Python and Procmail.

Mb2md.pl (http://batleth.sapienti-sat.org/projects/mb2md/)

Converts mbox format to Maildir. Uses Perl.
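
If you’d like to poke at the archive yourself before converting it, mbox is a simple format: each message begins with a line starting with "From " (note the trailing space). Here’s a minimal sketch (not part of yahoo2mbox) that counts the messages in a downloaded archive; the default filename is just the example group from earlier:

#!/usr/bin/perl -w
# countmbox.pl -- count the messages in an mbox file.
use strict;

my $mbox = shift || 'weirdalclub2';
open my $fh, '<', $mbox or die "Can't open $mbox: $!\n";

# every message in an mbox starts with a "From " separator line.
my $count = 0;
while (<$fh>) { $count++ if /^From /; }
close $fh;

print "$mbox contains $count message(s).\n";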

Hack #44. Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups

Yahoo! Groups makes it easy to run an email discussion group at no cost. Sadly, there’s no simple way to download all the messages—until now.

If you’ve ever wanted to run an email discussion group, but you didn’t want to mess around with getting your own server and administering your own software, you should consider looking into Yahoo! Groups (http://groups.yahoo.com/). The free (ad-supported) service makes it easy to run a mailing list, and if you or any other group moderator has set a list to support archiving of messages, a handy web interface to browse them is provided. Sadly, the service provides no simple way to download all the messages in one fell swoop, and nobody wants to click and Save As . . . on hundreds or thousands of links.

Iain Truskett of Canberra, Australia, wanted to keep an offline archive of his Yahoo! Groups mailing lists, so he created the WWW::Yahoo::Groups module, available on CPAN (http://search.cpan.org/dist/WWW-Yahoo-Groups/). It uses WWW::Mechanize to log into Yahoo! Groups, get a count of the messages, and download any given message by number. It even bypasses the pop-up ads and interstitial interruptions!

The Code

You’ll need the WWW::Yahoo::Groups Perl module installed to use this script. The module requires a number of other modules, but installing from the CPAN shell [Hack #8] should take care of the installation of these prerequisites for you.

Save the following code to a file called yahoogroups.pl:

#!/usr/bin/perl -w

use constant USERNAME => 'your username';
use constant PASSWORD => 'your password';

use strict;
use File::Path;
use Getopt::Long;
use WWW::Yahoo::Groups;
$SIG{PIPE} = 'IGNORE';

# define the command-line options, and 
# ensure that a group has been passed.
my ($debug, $group, $last, $first, $stats);
GetOptions(
    "debug"     => \$debug,
    "group=s"   => \$group,
    "stats"     => \$stats,
    "first=i"   => \$first,
    "last=i"    => \$last,
); (defined $group) or die "Must specify a group!\n";

# sign into Yahoo! Groups.
my $w = WWW::Yahoo::Groups->new(  );
$w->debug( $debug );
$w->login( USERNAME, PASSWORD );
$w->list( $group );
$w->agent->requests_redirectable( [] ); # no redirects now

# first and last IDs of group.
my $first_id = $w->first_msg_id(  );
my $last_id = $w->last_msg_id(  );
print "Messages in $group: $first_id to $last_id\n";
exit 0 if $stats; # they just wanted numbers.

# default our IDs to the first and last
# of the $group in question, else use the
# passed command-line options.
$first = $first_id unless $first;
$last  = $last_id  unless $last;
warn "Fetching $first to $last\n";

# get our specified messages.
for my $msgnum ($first..$last) {
    fetch_message( $w, $msgnum );
}

sub fetch_message {
    my $w = shift;
    my $msgnum = shift;

    # Put messages in directories by 100.
    my $dirname = int($msgnum/100)*100;

    # Create the dir if necessary.
    my $dir = "$group/$dirname";
    mkpath( $dir ) unless -d $dir;

    # Don't pull down the message
    # if we already have it...
    my $filename = "$dir/$msgnum";
    return if -f $filename;

    # pull down the content and check for errors.
    my $content = eval { $w->fetch_message($msgnum) };
    if ( $@ ) {
        if ( $@->isa('X::WWW::Yahoo::Groups') ) {
            warn "Could not handle message $msgnum: ",$@->error,"\n";
        } else { warn "Could not get content for message $msgnum\n"; }
    } else {
        open(FH, ">$filename") 
          or return warn "Can't create $filename: $!\n";
        print FH $content; close FH; # data has been saved.
        $w->autosleep( 5 ); # so now sleep to prevent saturation.
    }
}

Running the Hack

Before you can use the script, you’ll need to have a Yahoo! Groups account (http://edit.yahoo.com/config/eval_register) and be subscribed to at least one list that has web archives. Remember that we’re merely automating the web transactions, not getting at some secret backdoor into Yahoo! Groups. Also, modify the lines at the top of the script that set the USERNAME and PASSWORD constants. If these aren’t set, the script can’t log in as you and, consequently, you might not have access to the group’s messages.

First, find out how many messages there are. In this case, let’s check out milwpm, the discussion list for the Milwaukee Perl Mongers:

% perl yahoogroups.pl --group=milwpm --stats
Messages in milwpm: 1 to 721

Now, take a look at the last five messages in the archive:

% perl yahoogroups.pl --group=milwpm --first=717
Messages in milwpm: 1 to 721
Fetching 717 to 721

Behind the scenes, the script has created a directory called milwpm and, within that, a directory called 700 for holding all messages between 700 and 799. Each message gets its own file.

% ls -al milwpm/700
-rw-r--r--    1 andy     staff        2814 Jul 16 23:04 700
-rw-r--r--    1 andy     staff        4005 Jul 16 23:05 717
-rw-r--r--    1 andy     staff        1511 Jul 16 23:05 718
-rw-r--r--    1 andy     staff        5576 Jul 16 23:05 719
-rw-r--r--    1 andy     staff        5862 Jul 16 23:05 720
-rw-r--r--    1 andy     staff        6632 Jul 16 23:05 721

If you want to look at the starting few messages, use the --last parameter. You can also use the --debug parameter to get running notes of what the script is doing:

% perl yahoogroups.pl --group=milwpm --last=5 --debug
Fetching http://groups.yahoo.com/
Fetching http://login.yahoo.com/config/login?.intl=us&.src=ygrp&....
Fetching http://groups.yahoo.com/group/milwpm/messages/1
Messages in milwpm: 1 to 721
Fetching 1 to 5
Fetching http://groups.yahoo.com/group/milwpm/message/1?source=1&unwrap=1
Fetching http://groups.yahoo.com/group/milwpm/message/2?source=1&unwrap=1
Fetching http://groups.yahoo.com/group/milwpm/message/3?source=1&unwrap=1
Fetching http://groups.yahoo.com/group/milwpm/message/4?source=1&unwrap=1
Fetching http://groups.yahoo.com/group/milwpm/interrupt?st=2&m=1&done=%2...
Fetching /group/milwpm/message/4?source=1&unwrap=1
Fetching http://groups.yahoo.com/group/milwpm/message/5?source=1&unwrap=1

Hacking the Hack

You can easily extend this hack to manipulate the data before it gets saved to the file. The messages that are returned are in standard Internet mail format, so you can extract just the headers you want, such as To:, From:, and Subject:. The MailTools (http://search.cpan.org/dist/MailTools/) distribution has a number of modules that will help.
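
For instance, here’s a small sketch, purely for illustration, that uses Mail::Internet from MailTools to print a few headers from one of the saved message files (the filename argument is whatever file the main script wrote):

#!/usr/bin/perl -w
# headers.pl -- print a few headers from one saved message file.
use strict;
use Mail::Internet;

my $file = shift or die "usage: $0 <message file>\n";
open my $fh, '<', $file or die "Can't open $file: $!\n";
my @lines = <$fh>; close $fh;

# parse the message and pull out the headers we care about.
my $mail = Mail::Internet->new(\@lines);
foreach my $header (qw(From To Subject Date)) {
    my $value = $mail->head->get($header);
    print "$header: $value" if defined $value;
}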

As a quick example, sans MailTools, let’s say you want to see the most active threads from the messages you’re downloading. This is a rather simple modification to make. Add a hash for our new information just before the fetch_message subroutine:

# Keep track of popular subjects.
my %subjects;

sub fetch_message {
    my $w = shift;

Then, add the tracking code for each subject line:

        } else { warn "Could not get content for message $msgnum\n"; }
    } else {

        # and add one to our subject line counter.
        $content =~ /Subject: (.*)/ig; $subjects{$1}++ if $1;

        open(FH, ">$filename") 
          or return warn "Can't create $filename: $!\n";

Finally, at the end of the script, display the stats:

# now, print our totals.
my @sorted = sort { $subjects{$b} <=> $subjects{$a} } keys %subjects;
foreach (@sorted) { print "$subjects{$_}: $_\n"; }

This code can easily be tweaked to save only messages from certain authors—local copies of your own postings, for instance—or subject lines associated with especially thoughtful or useful threads.
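
For example, to keep only messages from a particular author, add one more check just before the message is written to disk (the address below is a placeholder):

        # only save messages from one particular author (placeholder address).
        return unless $content =~ /^From:.*someone\@example\.com/mi;

        open(FH, ">$filename") 
          or return warn "Can't create $filename: $!\n";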

Yahoo! Groups also has search capabilities that you can take advantage of with WWW::Mechanize. See “Downloading Images from Webshots” [Hack #36] for an example of searching web sites with WWW::Mechanize.

—Andy Lester

Hack #45. Gleaning Buzz from Yahoo!

Stay hip with the latest Yahoo! Buzz search results.

Google has a Zeitgeist page (http://www.google.com/press/zeitgeist.html) that gives you an idea of what people are searching for, but unfortunately it’s not updated very often; some parts are updated once a week, while other parts are updated only once a month. Meanwhile, Yahoo! has a Yahoo! Buzz site (http://buzz.yahoo.com/) that contains much more annotated information about what people are searching for.

We thought it would be fun to take a Buzz item from the Yahoo! Buzz site (specifically, http://buzz.yahoo.com/overall/) and then use it to initiate a search on Google. This hack is part scraping—the Yahoo! Buzz side—and part use of a web API—the Google side. As you’ll see, the two work very well together.

The Code

You’ll need a Google API developer’s key (http://api.google.com/) and a lesser-known Perl module (Time::JulianDay) to get this hack to work. Save the following code to a file called ybgoogled.pl:

#!/usr/bin/perl -w
# ybgoogled.pl
# Pull the top item from the Yahoo Buzz Index and query
# the last three day's worth of Google's index for it.
# Usage: perl ybgoogled.pl
use strict;
use SOAP::Lite;
use LWP::Simple;
use Time::JulianDay;

# Your Google API developer's key.
my $google_key='insert key here';

# Location of the GoogleSearch WSDL file.
my $google_wdsl = "./GoogleSearch.wsdl";

# Number of days back to
# go in the Google index.
my $days_back = 3;

# Grab a copy of http://buzz.yahoo.com.
my $buzz_content = get("http://buzz.yahoo.com/overall/") 
  or die "Couldn't grab the Yahoo Buzz: $!";

# Find the first item on the Buzz Index list.
$buzz_content =~ m!<b>1</b>.*?&cs=bz"><b>(.*?)</b></a>&nbsp;</font>!;
my $buzziest = $1; # assign our match as our search term.
die "Couldn't figure out the Yahoo! buzz\n" unless $buzziest;

# Figure out today's Julian date.
my $today = int local_julian_day(time);

# Build the Google query and say hi.
my $query = "\"$buzziest\" daterange:" . ($today - $days_back) . "-$today"; 
print "The buzziest item on Yahoo Buzz today is: $buzziest\n",
      "Querying Google for: $query\n", "Results:\n\n";

# Create a new SOAP::Lite instance, feeding it GoogleSearch.wsdl.
my $google_search = SOAP::Lite->service("file:$google_wdsl");

# Query Google.
my $results = $google_search->doGoogleSearch( 
                  $google_key, $query, 0, 10, "false",
                  "",  "false", "", "", ""
              );

# No results?
die "No results" unless @{$results->{resultElements}};

# Loop through the results.
foreach my $result (@{$results->{'resultElements'}}) {
    my $output = join "\n", $result->{title} || "no title",
                 $result->{URL}, $result->{snippet} || 'none',"\n";
    $output =~ s!<.+?>!!g; # drop all HTML tags sloppily.
    print $output; # woo, we're done!
}

This code works only as long as Yahoo! formats its Buzz page in the same way; we’ve had to change it multiple times. If you try this program and it doesn’t work, pull out this line:

$buzz_content =~ m!<b>1</b>.*?&cs=bz"><b>(.*?)</b></a>&nbsp;</font>!;

Take a look at the code pulled out by the variable $buzziest and see if it matches any code in the source code at http://buzz.yahoo.com/overall/. If it doesn’t, the code’s changed. Go to the HTML source view and find the first item on the Buzz list. Look at the source, find that first Buzz listing, and pull the code from around it. You want to pull enough code to get a unique line, but not so much that you can’t read it.

Running the Hack

Run this script from the command line, like so:

% perl ybgoogled.pl

The buzziest item on Yahoo Buzz today is: Gregory Hines
Querying Google for: "Gregory Hines" daterange:2452861-2452864
Results:

 Celebrities @ Hollywood.com-Featuring Gregory Hines. Celebrities ... 
 http://www.hollywood.com/celebs/detail/celeb/191902
 Gregory Hines Vital Stats: Born: February 14, 1946 Birth Place: New York,
 New York 

 Gregory Hines
 http://www.rottentomatoes.com/p/GregoryHines-1007016/
  ... Gregory Hines. CELEB QUIK BROWSER &gt; Select A Celebrity. ...

...

Hacking the Hack

As it stands, this hack returns 10 results. If you want to, you can change the code to return only one result and immediately open it instead of returning a list. This version of the program searches the last three days of indexed pages. Because there’s a slight lag in indexing news stories, I would index at least the last two days’ worth of pages, but you could extend it to seven days or even a month.
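
For instance, assuming you’re running ybgoogled.pl as a CGI script, you could replace everything from the doGoogleSearch call down (and drop the earlier print statements) with a sketch like this, which asks Google for a single result and redirects straight to it:

# ask Google for just one result...
my $results = $google_search->doGoogleSearch( 
                  $google_key, $query, 0, 1, "false",
                  "",  "false", "", "", ""
              );
die "No results" unless @{$results->{resultElements}};

# ...and bounce the browser straight to it.
my $top = $results->{resultElements}[0];
print "Location: $top->{URL}\n\n";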

If you want to abandon Google entirely, you can. Instead, you might want to go to Daypop (http://www.daypop.com), which also has a news search. Here’s a version of the script using the top item on Daypop:

#!/usr/bin/perl -w
# ybdaypopped
# Pull the top item from the Yahoo! Buzz Index and query 
# Daypop's News search engine for relevant stories
use strict;
use LWP::Simple;

# Grab a copy of http://buzz.yahoo.com.
my $buzz_content = get("http://buzz.yahoo.com/") 
  or die "Couldn't grab the Yahoo Buzz: $!";

# Find the first item on the Buzz Index list.
$buzz_content =~ m!<b>1</b>.*?&cs=bz"><b>(.*?)</b></a>&nbsp;</font>!;
my $buzziest = $1; # assign our match as our search term.
die "Couldn't figure out the Yahoo! buzz\n" unless $buzziest;

# Build a Daypop Query.
my $dpquery = "http://www.daypop.com/search?q=$buzziest&t=n"; 
print "Location: $dpquery\n\n";

This version of the program takes the first Buzz item from Yahoo! and opens a Daypop news search for that item (assuming you run this as a CGI script). But hey, maybe we should use that RSS format [Hack #94] all the kids are talking about. In that case, just put &o=rss at the end of $dpquery:

my $dpquery = "http://www.daypop.com/search?q=$buzziest&t=n&o=rss";

Now you’re using Yahoo! Buzz to generate an RSS file with Daypop. From there, you can scrape the RSS file, pass this URL to a routine that puts an RSS file up on a web page [Hack #95], and so on.
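
For instance, once $dpquery points at the RSS version, you could replace the Location: line with a rough sketch like this; it assumes each Daypop item keeps its title and link in plain <title> and <link> elements, as well-formed RSS does:

# fetch the RSS results and list each item's title and link.
my $rss = get($dpquery) or die "Couldn't grab the Daypop RSS: $!";
while ($rss =~ m!<item>(.*?)</item>!gis) {
    my $item = $1;
    my ($title) = $item =~ m!<title>(.*?)</title>!is;
    my ($link)  = $item =~ m!<link>(.*?)</link>!is;
    print "$title\n  $link\n" if $title and $link;
}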

—Tara Calishain and Rael Dornfest

Hack #46. Spidering the Yahoo! Catalog

Writing a spider to spider an existing spider’s site may seem convoluted, but it can prove useful when you’re looking for location-based services. This hack walks through creating a framework for full-site spidering, including additional filters to lessen your load.

In this hack, you’ll learn how to write a spider that crawls the Yahoo! group of portals. The choice of Yahoo! was obvious; because it is one of the largest Internet portals in existence, it can serve as an ideal example of how one goes about writing a portal spider.

But before we get to the gory details of code, let’s define what exactly a portal spider is. While many may argue with such classification, I maintain that a portal spider is a script that automatically downloads all documents from a preselected range of URLs found on the portal’s site or a group of sites, as is the case with Yahoo!. A portal spider’s main job is to walk from one document to another, extract URLs from downloaded HTML, process said URLs, and go to another document, repeating the cycle until it runs out of URLs to visit. Once you create code that describes such basic behavior, you can add additional functionality, turning your general portal spider into a specialized one.

Although writing a script that walks from one Yahoo! page to another sounds simple, it isn’t, because there is no general pattern followed by all Yahoo! sites or sections within those sites. Furthermore, Yahoo! is not a single site with a nice link layout that can be described using a simple algorithm and a classic data structure. Instead, it is a collection of well over 30 thematic sites, each with its own document layout, naming conventions, and peculiarities in page design and URL patterns. For example, if you check links to the same directory section on different Yahoo! sites, you will find that some of them begin with http://www.yahoo.com/r, some begin with http://uk.yahoo.com/r/hp/dr, and others begin with http://kr.yahoo.com.

If you try to look for patterns, you will soon find yourself writing long if/elsif/else sections that are hard to maintain and need to be rewritten every time Yahoo! makes a small change to one of its sites. If you follow that route, you will soon discover that you need to write hundreds of lines of code to describe every kind of behavior you want to build into your spider.

This is particularly frustrating to programmers who expect to write code that uses elegant algorithms and nicely structured data. The hard truth about portals is that you cannot expect elegance and ease of spidering from them. Instead, prepare yourself for a lot of detective work and writing (and throwing away) chunks of code in a hit-and-miss fashion. Portal spiders are written in an organic, unstructured way, and the only rule you should follow is to keep things simple and add specific functionality only once you have the general behavior working.

Okay, with taxonomy and general advice behind us, we can get to the gist of the matter. The spider in this hack is a relatively simple tool for crawling Yahoo! sites. It makes no assumptions about the layout of the sites; in fact, it makes almost no assumptions whatsoever and can easily be adapted to other portals or even groups of portals. You can use it as a framework for writing specialized spiders.

The Code

Save the following code to a file called yspider.pl:

#!/usr/bin/perl -w
#
# yspider.pl
#
# Yahoo! Spider - crawls Yahoo! sites, collects links from each 
# downloaded HTML page, searches each downloaded page, and prints a
# list of results when done.
# http://www.artymiak.com/software/ or contact jacek@artymiak.com
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.

use strict;
use Getopt::Std;            # parse command-line options.
use LWP::UserAgent;         # download data from the Net.
use HTML::LinkExtor;        # get links inside an HTML document.
use URI::URL;               # turn relative links into absolutes.

my $help = <<"EOH";
----------------------------------------------------------------------------
Yahoo! Spider.

Options: -s    list of sites you want to crawl,
               e.g. -s 'us china denmark'
         -h    print this help

Allowed values of -s are:

   argentina, asia, australia, brazil, canada,
   catalan, china, denmark, france, germany, hongkong,
   india, ireland, italy, japan, korea, mexico,
   newzealand, norway, singapore, spain, sweden, taiwan,
   uk, us, us_chinese, us_spanish 

Please, use this code responsibly.  Flooding any site
with excessive queries is bad net citizenship.
----------------------------------------------------------------------------
EOH

# define our arguments and
# show the help if asked.
my %args; getopts("s:h", \%args); 
die $help if exists $args{h};

# The list of code names, and
# URLs for various Yahoo! sites.
my %ys = (
   argentina => "http://ar.yahoo.com", asia => "http://asia.yahoo.com",
   australia => "http://au.yahoo.com", newzealand => "http://au.yahoo.com",
   brazil    => "http://br.yahoo.com", canada   => "http://ca.yahoo.com",
   catalan   => "http://ct.yahoo.com", china    => "http://cn.yahoo.com",
   denmark   => "http://dk.yahoo.com", france   => "http://fr.yahoo.com",
   germany   => "http://de.yahoo.com", hongkong => "http://hk.yahoo.com",
   india     => "http://in.yahoo.com", italy    => "http://it.yahoo.com",
   korea     => "http://kr.yahoo.com", mexico   => "http://mx.yahoo.com",
   norway    => "http://no.yahoo.com", singapore => "http://sg.yahoo.com",
   spain     => "http://es.yahoo.com", sweden   => "http://se.yahoo.com",
   taiwan    => "http://tw.yahoo.com", uk       => "http://uk.yahoo.com",
   ireland  => "http://uk.yahoo.com",  us       => "http://www.yahoo.com",
   japan    => "http://www.yahoo.co.jp",
   us_chinese => "http://chinese.yahoo.com",
   us_spanish => "http://espanol.yahoo.com"
);

# if the -s option was used, check to make
# sure it matches one of our existing codes
# above. if not, or if no -s was passed, help.
my @sites; # which locales to spider.
if (exists $args{'s'}) {
    @sites = split(/ /, lc($args{'s'}));
    foreach my $site (@sites) {
        die "UNKNOWN: $site\n\n$help" unless $ys{$site};
    }
} else { die $help; }

# Defines global and local profiles for URLs extracted from the
# downloaded pages. These profiles are used to determine if the
# URLs extracted from each new document should be placed on the
# TODO list (%todo) or rejected (%rejects). Profiles are lists
# made of chunks of text, which are matched against found URLs.
# Any special characters, like slash (/) or dot (.), must be properly
# escaped. Remember that globals have precedence over locals. 
my %rules = (
   global     => { allow => [], deny => [ 'search', '\*' ] },
   argentina  => { allow => [ 'http:\/\/ar\.' ], deny => [] },
   asia       => { allow => [ 'http:\/\/(aa|asia)\.' ], deny => [] },
   australia  => { allow => [ 'http:\/\/au\.' ], deny => [] },
   brazil     => { allow => [ 'http:\/\/br\.' ], deny => [] },
   canada     => { allow => [ 'http:\/\/ca\.' ], deny => [] },
   catalan    => { allow => [ 'http:\/\/ct\.' ], deny => [] },
   china      => { allow => [ 'http:\/\/cn\.' ], deny => [] },
   denmark    => { allow => [ 'http:\/\/dk\.' ], deny => [] },
   france     => { allow => [ 'http:\/\/fr\.' ], deny => [] },
   germany    => { allow => [ 'http:\/\/de\.' ], deny => [] },
   hongkong   => { allow => [ 'http:\/\/hk\.' ], deny => [] },
   india      => { allow => [ 'http:\/\/in\.' ], deny => [] },
   ireland    => { allow => [ 'http:\/\/uk\.' ], deny => [] },
   italy      => { allow => [ 'http:\/\/it\.' ], deny => [] },
   japan      => { allow => [ 'yahoo\.co\.jp' ], deny => [] },
   korea      => { allow => [ 'http:\/\/kr\.' ], deny => [] },
   mexico     => { allow => [ 'http:\/\/mx\.' ], deny => [] },
   norway     => { allow => [ 'http:\/\/no\.' ], deny => [] },
   singapore  => { allow => [ 'http:\/\/sg\.' ], deny => [] },
   spain      => { allow => [ 'http:\/\/es\.' ], deny => [] },
   sweden     => { allow => [ 'http:\/\/se\.' ], deny => [] },
   taiwan     => { allow => [ 'http:\/\/tw\.' ], deny => [] },
   uk         => { allow => [ 'http:\/\/uk\.' ], deny => [] },
   us         => { allow => [ 'http:\/\/(dir|www)\.' ], deny => [] },
   us_chinese => { allow => [ 'http:\/\/chinese\.' ], deny => [] },
   us_spanish => { allow => [ 'http:\/\/espanol\.' ], deny => [] },
);

my %todo = (  );       # URLs to parse
my %done = (  );       # parsed/finished URLs
my %errors = (  );     # broken URLs with errors
my %rejects = (  );    # URLs rejected by the script

# print out a "we're off!" line, then
# begin walking the site we've been told to.
print "=" x 80 . "\nStarted Yahoo! spider...\n" . "=" x 80 . "\n";
our $site; foreach $site (@sites) {

    # for each of the sites that have been passed on the
    # command line, we make a title for them, add them to
    # the TODO list for downloading, then call walksite(  ),
    # which downloads the URL, looks for more URLs, etc.
    my $title = "Yahoo! " . ucfirst($site) . " front page";
    $todo{$ys{$site}} = $title; walksite(  ); # process.

}

# once we're all done with all the URLs, we print a
# report about all the information we've gone through.
print "=" x 80 . "\nURLs downloaded and parsed:\n" . "=" x 80 . "\n";
foreach my $url (keys %done) { print "$url => $done{$url}\n"; }
print "=" x 80 . "\nURLs that couldn't be downloaded:\n" . "=" x 80 . "\n";
foreach my $url (keys %errors) { print "$url => $errors{$url}\n"; }
print "=" x 80 . "\nURLs that got rejected:\n" . "=" x 80 . "\n";
foreach my $url (keys %rejects) { print "$url => $rejects{$url}\n"; }

# this routine grabs the first entry in our TODO
# list, downloads the content, and looks for more URLs.
# we stay in walksite until there are no more URLs
# in our TODO list, which could be a good long time.
sub walksite {

    do {
        # get first URL to do.
        my $url = (keys %todo)[0];

        # download this URL.
        print "-> trying $url ...\n";
        my $browser = LWP::UserAgent->new;
        my $resp = $browser->get( $url, 'User-Agent' => 'Y!SpiderHack/1.0' );

        # check the results.
        if ($resp->is_success) {
            my $base = $resp->base || '';
            print "-> base URL: $base\n";
            my $data = $resp->content; # get the data.
            print "-> downloaded: " . length($data) . " bytes of $url\n";

            # find URLs using a link extorter. relevant ones
            # will be added to our TODO list of downloadables.
            # this passes all the found links to findurls(  )
            # below, which determines if we should add the link
            # to our TODO list, or ignore it due to filtering.
            HTML::LinkExtor->new(\&findurls, $base)->parse($data);

            ###########################################################
            # add your own processing here. perhaps you'd like to add #
            # a keyword search for the downloaded content in $data?   #
            ###########################################################

        } else {
            $errors{$url} = $resp->message(  );
            print "-> error: couldn't download URL: $url\n";
            delete $todo{$url};
        }

        # we're finished with this URL, so move it from
        # the TODO list to the done list, and print a report.
        $done{$url} = $todo{$url}; delete $todo{$url};
        print "-> processed legal URLs: " . (scalar keys %done) . "\n";
        print "-> remaining URLs: " . (scalar keys %todo) . "\n";
        print "-" x 80 . "\n";
    } until ((scalar keys %todo) == 0);
}

# callback routine for HTML::LinkExtor. For every
# link we find in our downloaded content, we check
# to see if we've processed it before, then run it
# through a bevy of regexp rules (see the top of
# this script) to see if it belongs in the TODO.
sub findurls {
    my($tag, %links) = @_;
    return if $tag ne 'a';
    return unless $links{href};
    print "-> found URL: $links{href}\n";

    # already seen this URL, so move on.
    if (exists $done{$links{href}} ||
        exists $errors{$links{href}} || 
        exists $rejects{$links{href}}) {
        print "--> I've seen this before: $links{href}\n"; return;
    }

    # now, run through our filters.
    unless (exists($todo{$links{href}})) {
        my ($ga, $gd, $la, $ld); # counters.
        foreach (@{$rules{global}{'allow'}}) { 
            $ga++ if $links{href} =~ /$_/i; 
        }
        foreach (@{$rules{global}{'deny'}}) { 
            $gd++ if $links{href} =~ /$_/i; 
        }
        foreach (@{$rules{$site}{'allow'}}) { 
            $la++ if $links{href} =~ /$_/i; 
        }
        foreach (@{$rules{$site}{'deny'}}) { 
            $ld++ if $links{href} =~ /$_/i; 
        }

        # if there were denials or NO allowances, we move on.
        if ($gd or $ld) { print "-> rejected: $links{href}\n"; return; }
        unless ($ga or $la) { print "-> rejected: $links{href}\n"; return; }

        # we passed our filters, so add it on the barby.
        print "-> added $links{href} to my TODO list\n";
        $todo{$links{href}} = $links{href};
    }
}

Running the Hack

Before sending the spider off, you’ll need to make a decision regarding which part of the Yahoo! directory you want to crawl. If you’re mainly interested in the United States and United Kingdom, you’ll inform the spider using the -s option on the command line, like so:

% perl yspider.pl -s "us uk"
============================================================================
Started Yahoo! spider...
============================================================================
-> trying http://www.yahoo.com ...
-> base URL: http://www.yahoo.com/
-> downloaded: 28376 bytes of http://www.yahoo.com
-> found URL: http://www.yahoo.com/s/92802
-> added http://www.yahoo.com/s/92802 to my TODO list
-> found URL: http://www.yahoo.com/s/92803
... etc ...
-> added http://www.yahoo.com/r/pv to my TODO list
-> processed legal URLs: 1
-> remaining URLs: 244
----------------------------------------------------------------------------
-> trying http://www.yahoo.com/r/fr ...
-> base URL: http://fr.yahoo.com/r/
-> downloaded: 32619 bytes of http://www.yahoo.com/r/fr
-> found URL: http://fr.yahoo.com/r/t/mu00
-> rejected: http://fr.yahoo.com/r/t/mu00
...

You can see a full list of locations available to you by asking for help:

% perl yspider.pl -h

...
Allowed values of -s are:

   argentina, asia, australia, brazil, canada,
   catalan, china, denmark, france, germany, hongkong,
   india, ireland, italy, japan, korea, mexico,
   newzealand, norway, singapore, spain, sweden, taiwan,
   uk, us, us_chinese, us_spanish

Hacking the Hack

The section you’ll want to modify most contains the filters that determine how far the spider will go; by tweaking the allow and deny rules at the beginning of the script, you’ll be able to better grab just the content you’re interested in. If you want to make this spider even more generic, consider rewriting the configuration code so that it’ll instead read a plain-text list of code names, start URLs, and allow and deny patterns. This can turn a Yahoo! spider into a general Internet spider.
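
Here’s one way you might start (a sketch only; the sites.conf filename and its whitespace-separated layout are inventions for illustration). Each line holds a code name, a start URL, and an allow pattern, and the script builds %ys and %rules from the file at startup instead of hardcoding them:

# build %ys and %rules from a plain-text file instead of hardcoding them.
# each line of sites.conf: <code name> <start URL> <allow regexp>
my (%ys, %rules);
$rules{global} = { allow => [], deny => [ 'search', '\*' ] };
open my $conf, '<', 'sites.conf' or die "Can't open sites.conf: $!\n";
while (<$conf>) {
    next if /^\s*#/ or /^\s*$/;   # skip comments and blank lines.
    my ($name, $url, $allow) = split;
    $ys{$name}    = $url;
    $rules{$name} = { allow => [ $allow ], deny => [] };
}
close $conf;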

Whenever you want to add code that extends the functionality of this spider (such as searching for keywords in a document, adding the downloaded content to a database, or otherwise repurposing it for your needs), include your own logic where specified by the hashed-out comment block.
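
As an example, here’s a possible drop-in for that comment block; it simply counts how many times a keyword turns up in each downloaded page (the keyword itself is just a stand-in; swap in whatever you’re hunting for):

# count how often a keyword appears in the page we just downloaded.
my $keyword = 'perl';   # placeholder keyword.
my $hits = (  ) = $data =~ /\Q$keyword\E/gi;
print "-> '$keyword' appears $hits time(s) in $url\n" if $hits;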

See Also

If you’re spidering Yahoo! because you want to start your own directory, you might want to consider the Open Directory Project (http://dmoz.org/about.html). Downloading their freely available directory data, all several hundred megs of it, will give you plenty of information to play with.

—Jacek Artymiak

Hack #47. Tracking Additions to Yahoo!

Keep track of the number of sites added to your favorite Yahoo! categories.

Every day, a squad of surfers at Yahoo! adds new sites to the Yahoo! index. These changes are reflected in the Yahoo! What’s New page (http://dir.yahoo.com/new/), along with the Picks of the Day.

If you’re a casual surfer, you might not care about the number of new sites added to Yahoo!. But there are several scenarios when you might have an interest:

  • You regularly glean information about new sites from Yahoo! Knowing which categories are growing and which categories are stagnant will tell you where to direct your attention.

  • You want to submit sites to Yahoo! Are you going to spend your hard-earned money adding a site to a category where new sites are added constantly (meaning your submitted site might get quickly buried)? Or will you be paying to add to a category that sees few additions (meaning your site might have a better chance of standing out)?

  • You’re interested in trend tracking. Which categories are consistently busy? Which are all but dead? By watching how Yahoo! adds sites to categories, over time you’ll get a sense of their rhythms and trends and detect when unusual activity occurs in a category.

This hack scrapes the recent counts of additions to Yahoo! categories and prints them out, providing an at-a-glance glimpse of additions to various categories. You’ll also get a tab-delimited table of how many sites have been added to each category for each day. A tab-delimited file is excellent for importing into Excel, where you can turn the count numbers into a chart.

The Code

Save the following code to a file called hoocount.pl:

#!/usr/bin/perl -w

use strict;
use Date::Manip;
use LWP::Simple;
use Getopt::Long;

$ENV{TZ} = "GMT" if $^O eq "MSWin32";

# the homepage for Yahoo!'s "What's New".
my $new_url = "http://dir.yahoo.com/new/";

# the major categories at Yahoo!; %final_counts,
# below, holds each category's counts string.
my @categories = ("Arts & Humanities",    "Business & Economy",
                  "Computers & Internet", "Education",
                  "Entertainment",        "Government",
                  "Health",               "News & Media",
                  "Recreation & Sports",  "Reference",
                  "Regional",             "Science", 
                  "Social Science",       "Society & Culture");
my %final_counts; # where we save our final readouts.

# load in our options from the command line.
my %opts; GetOptions(\%opts, "c|count=i");
die unless $opts{c}; # count sites from past $i days.

# if we've been told to count the number of new sites,
# then we'll go through each of our main categories
# for the last $i days and collate a result.

# begin the header
# for our import file.
my $header = "Category";

# from today, going backwards, get $i days.
for (my $i=1; $i <= $opts{c}; $i++) {

   # create a Date::Manip date that will
   # be used to construct the last $i days.
   my $day; # query for Yahoo! retrieval.
   if ($i == 1) { $day = "yesterday"; }
   else { $day = "$i days ago"; }
   my $date = UnixDate($day, "%Y%m%d");

   # add this date to
   # our import file.
   $header .= "\t$date";

   # and download the day.
   my $url = "$new_url$date.html";
   my $data = get($url) or die $!;

   # and loop through each of our categories.
   my $day_count; foreach my $category (sort @categories) {
       $data =~ /$category.*?(\d+)/; my $count = $1 || 0;
       $final_counts{$category} .= "\t$count"; # building our string.
   }
}

# with all our counts finished,
# print out our final file.
print $header . "\n";
foreach my $category (@categories) {
   print $category, $final_counts{$category}, "\n";
}

Running the Hack

The only argument you need to provide the script is the number of days back you’d like it to travel in search of new additions. Since Yahoo! doesn’t archive their “new pages added” indefinitely, a safe upper limit is around two weeks. Here, we’re looking at the past two days:

% perl hoocount.pl --count 2
Category        20030807        20030806
Arts & Humanities       23      23
Business & Economy      88      141
Computers & Internet    2       9
Education       0       4
Entertainment   43      29
Government      3       4
Health  2       7
News & Media    1       1
Recreation & Sports     8       27
Reference       0       0
Regional        142     114
Science 1       2
Social Science  3       0
Society & Culture       7       8

Hacking the Hack

If you’re not only a researcher but also a Yahoo! observer, you might be interested in how the number of sites added changes over time. To that end, you could run this script under cron [Hack #90], and output the results to a file. After three months or so, you’d have a pretty interesting set of counts to manipulate with a spreadsheet program like Excel. Alternatively, you could modify the script to run RRDTOOL [Hack #62] and have real-time graphs.
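
A crontab entry along these lines would do the trick (the paths are placeholders); it appends the previous day’s counts to a running log every morning at 6 a.m.:

# minute hour day-of-month month day-of-week command (paths are placeholders)
0 6 * * * perl /path/to/hoocount.pl --count 1 >> /path/to/hoocounts.txt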

Hack #48. Scattersearch with Yahoo! and Google

Sometimes, illuminating results can be found when scraping from one site and feeding the results into the API of another. With scattersearching, you can narrow down the most popular related results, as suggested by Yahoo! and Google.

We’ve combined a scrape of a Yahoo! web page with a Google search [Hack #45], blending scraped data with data generated via a web service API to good effect. In this hack, we’re doing something similar, except this time we’re taking the results of a Yahoo! search and blending it with a Google search.

Yahoo! has a “Related searches” feature, where you enter a search term and get a list of related terms under the search box, if any are available. This hack scrapes those related terms and performs a Google search for the related terms in the title. It then returns the count for those searches, along with a direct link to the results. Aside from showing how scraped and API-generated data can live together in harmony, this hack is good to use when you’re exploring concepts; for example, you might know that something called Pokemon exists, but you might not know anything about it. You’ll get Yahoo!’s related searches and an idea of how many results each of those searches generates in Google. From there, you can choose the search terms that generate the most results or look the most promising based on your limited knowledge, or you can simply pick a road that appears less traveled.

The Code

Save the following code to a file called scattersearch.pl:

#!/usr/bin/perl -w
#
# Scattersearch -- Use the search suggestions from
# Yahoo! to build a series of allintitle: searches at Google. 

use strict;

use LWP;
use SOAP::Lite;
use CGI qw/:standard/;

# get our query, else die miserably.
my $query = shift @ARGV; die unless $query;

# Your Google API developer's key.
my $google_key = 'insert key here';

# Location of the GoogleSearch WSDL file.
my $google_wdsl = "./GoogleSearch.wsdl";

# search Yahoo! for the query.
my $ua  = LWP::UserAgent->new;
my $url = URI->new('http://search.yahoo.com/search');
$url->query_form(rs => "more", p => $query);
my $yahoosearch = $ua->get($url)->content;
$yahoosearch =~ s/[\f\t\n\r]//isg;

# and determine if there were any results.
$yahoosearch =~ m!Related:(.*?)<spacer!migs; 
die "Sorry, there were no results!\n" unless $1;
my $recommended = $1;

# now, add all our results into
# an array for Google processing.
my @googlequeries;
while ($recommended =~ m!<a href=".*?">(.*?)</a>!mgis) {
    my $searchitem = $1; $searchitem =~ s/nobr|<|>|\///g;
    push (@googlequeries, $searchitem); 
}

# print our header for the results page.
print join "\n",
start_html("ScatterSearch");
     h1("Your Scattersearch Results"),
     p("Your original search term was '$query'"),
     p("That search had " . scalar(@googlequeries). " recommended terms."),
     p("Here are result numbers from a Google search"),
     CGI::start_ol(  );

# create our Google object for API searches.
my $gsrch = SOAP::Lite->service("file:$google_wdsl");

# running the actual Google queries.
foreach my $googlesearch (@googlequeries) {
    my $titlesearch = "allintitle:$googlesearch"; 
    my $count = $gsrch->doGoogleSearch($google_key, $titlesearch,
                                       0, 1, "false", "",  "false",
                                       "", "", "");
    my $url = $googlesearch; $url =~ s/ /+/g; $url =~ s/\"/%22/g;
    print li("There were $count->{estimatedTotalResultsCount} ".
             "results for the recommended search <a href=\"http://www.".
             "google.com/search?q=$url&num=100\">$googlesearch</a>");
}

print CGI::end_ol(  ), end_html;

Running the Hack

This script generates an HTML file, ready for you to upload to a publicly accessible web site. If you want to save the output of a search for "siamese" to a file called scattersearch.html in your Sites directory, run the following command:

% perl scattersearch.pl "siamese" > ~/Sites/scattersearch.html

Your final results, as rendered by your browser, will look similar to Figure 4-2.

Figure 4-2. Scattersearch results for “siamese”

You’ll have to do a little experimenting to find out which terms have related searches. Broadly speaking, very general search terms are bad; it’s better to zero in on terms that people would search for and that would be easy to group together. As of this writing, for example, "heart" has no related search terms, but "blood pressure" does.

Hacking the Hack

You have two choices: you can either hack the interaction with Yahoo! or expand it to include something in addition to or instead of Yahoo! itself. Let’s look at Yahoo! first. If you take a close look at the code, you’ll see we’re passing an unusual parameter to our Yahoo! search results page:

$url->query_form(rs => "more", p => $query);

The rs=>"more" part of the search shows the related search terms. Getting the related search this way will show up to 10 results. If you remove that portion of the code, you’ll get roughly four related searches when they’re available. That might suit you if you want only a few, but maybe you want dozens and dozens! In that case, replace more with all.
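
In other words, the query_form line becomes:

$url->query_form(rs => "all", p => $query);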

Beware, though: this can generate a lot of related searches, and it can certainly eat up your daily allowance of Google API requests. Tread carefully.
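
One simple precaution, not in the original script but easy to bolt on right after the while loop that builds @googlequeries, is to cap the number of related terms before they hit Google:

# keep at most 10 related terms, so one run can't eat the whole API quota.
splice(@googlequeries, 10) if @googlequeries > 10;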

Yahoo! isn’t the only search engine that has related search data. If you’re looking for related searches that will work with general search terms like "heart", try AltaVista’s Prisma (http://www.altavista.com/prisma/):

#!/usr/bin/perl -w
use strict; 
use LWP;

# get our query, else die miserably.
my $query = shift @ARGV; die unless $query;

# search Prisma for the query.
my $ua  = LWP::UserAgent->new;
my $url = URI->new('http://www.altavista.com/web/results');
$url->query_form('q' => $query);

my $prismasearch = $ua->get($url)->content;
$prismasearch =~ s/[\f\t\n\r]//isg;

while ($prismasearch =~ m!title="Add.*?to your.*?">(.*?)</a>!mgis) {
    my $searchitem = $1; print "$searchitem\n";
}

For clusters of related search results, in addition to similar queries, check out AlltheWeb (http://www.alltheweb.com). AlltheWeb’s related and clustered results are at the bottom of the search page, instead of at the top:

#!/usr/bin/perl -w
use strict; use LWP;

# get our query, else die miserably.
my $query = shift @ARGV; die unless $query;

# search AlltheWeb for the query.
my $ua  = LWP::UserAgent->new;
$ua->agent('Mozilla/4.76 [en] (Win98; U)');
my $url = URI->new('http://www.alltheweb.com/search');
$url->query_form('q' => $query, cat => 'web');

my $atwsearch = $ua->get($url)->content;
$atwsearch =~ s/[\f\t\n\r]//isg;

while ($atwsearch =~ m!<li>(.*?)">(.*?)</a>!mgis) {
    my ($searchlink, $searchitem) = ($1, $2);
    next if $searchlink !~ /c=web/;
    print "$searchitem\n";
}

Hack #49. Yahoo! Directory Mindshare in Google

How does link popularity compare in Yahoo!’s searchable subject index versus Google’s full-text index? Find out by calculating mindshare!

Yahoo! and Google are two very different animals. Yahoo! indexes only a site’s main URL, title, and description, while Google builds full-text indexes of entire sites. Surely there’s some interesting cross-pollination when you combine results from the two.

This hack scrapes all the URLs in a specified subcategory of the Yahoo! directory. It then takes each URL and gets its link count from Google. Each link count provides a nice snapshot of how a particular Yahoo! category and its listed sites stack up on the popularity scale.

What’s a link count? It’s simply the total number of pages in Google’s index that link to a specific URL.

There are a couple of ways you can use your knowledge of a subcategory’s link count. If you find a subcategory whose URLs have only a few links each in Google, you may have found a subcategory that isn’t getting a lot of attention from Yahoo!’s editors. Consider going elsewhere for your research. If you’re a webmaster and you’re considering paying to have Yahoo! add you to their directory, run this hack on the category in which you want to be listed. Are most of the links really popular? If they are, are you sure your site will stand out and get clicks? Maybe you should choose a different category.

We got this idea from a similar experiment Jon Udell (http://weblog.infoworld.com/udell/) did in 2001. He used AltaVista instead of Google; see http://udell.roninhouse.com/download/mindshare-script.txt. We appreciate the inspiration, Jon!

The Code

You will need a Google API account (http://api.google.com/), as well as the SOAP::Lite (http://www.soaplite.com/) and HTML::LinkExtor (http://search.cpan.org/author/GAAS/HTML-Parser/lib/HTML/LinkExtor.pm) Perl modules to run the following code:

#!/usr/bin/perl -w

use strict;
use LWP::Simple;
use HTML::LinkExtor;
use SOAP::Lite;

my $google_key  = "your API key goes here";
my $google_wdsl = "GoogleSearch.wsdl";
my $yahoo_dir   = shift || "/Computers_and_Internet/Data_Formats/XML__".
                  "eXtensible_Markup_Language_/RSS/News_Aggregators/";

# download the Yahoo! directory.
my $data = get("http://dir.yahoo.com" . $yahoo_dir) or die $!;

# create our Google object.
my $google_search = SOAP::Lite->service("file:$google_wdsl");
my %urls; # where we keep our counts and titles.

# extract all the links and parse 'em.
HTML::LinkExtor->new(\&mindshare)->parse($data);
sub mindshare { # for each link we find...

    my ($tag, %attr) = @_;

    # continue on only if the tag was a link,
    # and the URL matches Yahoo!'s redirectory.
    return if $tag ne 'a';
    return unless $attr{href} =~ /srd.yahoo/;
    return unless $attr{href} =~ /\*http/;

    # now get our real URL.
    $attr{href} =~ /\*(http.*)/; my $url = $1;

    # and process each URL through Google.
    my $results = $google_search->doGoogleSearch(
                        $google_key, "link:$url", 0, 1,
                        "true", "", "false", "", "", ""
                  ); # wheee, that was easy, guvner.
    $urls{$url} = $results->{estimatedTotalResultsCount};
}

# now sort and display.
my @sorted_urls = sort { $urls{$b} <=> $urls{$a} } keys %urls;
foreach my $url (@sorted_urls) { print "$urls{$url}: $url\n"; }

Running the Hack

The hack has its only configuration—the Yahoo! directory you’re interested in—passed as a single argument (in quotes) on the command line. If you don’t pass one of your own, a default directory will be used instead.

% perl mindshare.pl "/Entertainment/Humor/Procrastination/"

Your results show the URLs in those directories, sorted by total Google links:

340: http://www.p45.net/
246: http://www.ishouldbeworking.com/
81: http://www.india.com/
33: http://www.jlc.net/~useless/
23: http://www.geocities.com/SouthBeach/1915/
18: http://www.eskimo.com/~spban/creed.html
13: http://www.black-schaffer.org/scp/
3: http://www.angelfire.com/mi/psociety
2: http://www.geocities.com/wastingstatetime/

Hacking the Hack

Yahoo! isn’t the only searchable subject index out there, of course; there’s also the Open Directory Project (DMOZ, http://www.dmoz.org), which is the product of thousands of volunteers busily cataloging and categorizing sites on the Web—the web community’s Yahoo!, if you will. This hack works just as well on DMOZ as it does on Yahoo!; they’re very similar in structure.

Replace the default Yahoo! directory with its DMOZ equivalent:

my $dmoz_dir = shift || "/Reference/Libraries/Library_and_Information_".
               "Science/Technical_Services/Cataloguing/Metadata/RDF/".
               "Applications/RSS/News_Readers/";

You’ll also need to change the download instructions:

# download the Dmoz.org directory.
my $data = get("http://dmoz.org" . $dmoz_dir) or die $!;

Next, replace the lines that check whether a URL should be measured for mindshare. When we were scraping Yahoo! in our original script, all directory entries were always prepended with http://srd.yahoo.com/ and then the URL itself. Thus, to ensure we received a proper URL, we skipped over the link unless it matched that criteria:

return unless $attr{href} =~ /srd.yahoo/;
return unless $attr{href} =~ /\*http/;

Since DMOZ is an entirely different site, our checks for validity have to change. DMOZ doesn’t modify the outgoing URL, so our previous Yahoo! checks have no relevance here. Instead, we’ll make sure it’s a full-blooded location (i.e., it starts with http://) and it doesn’t match any of DMOZ’s internal page links. Likewise, we’ll ignore searches on other engines:

return unless $attr{href} =~ /^http/;
return if $attr{href} =~ /dmoz|google|altavista|lycos|yahoo|alltheweb/;

Our last change is to modify the bit of code that gets the real URL from Yahoo!’s modified version. Instead of “finding the URL within the URL”:

# now get our real URL.
$attr{href} =~ /\*(http.*)/; my $url = $1;

we simply assign the URL that HTML::LinkExtor has found:

# now get our real URL.
my $url = $attr{href};

Can you go even further with this? Sure! You might want to search a more specialized directory, such as the FishHoo! fishing search engine (http://www.fishhoo.com/).

You might want to return only the most linked-to URL from the directory, which is quite easy, by piping the results [Hack #28] to another common Unix utility:

% perl mindshare.pl | head -1

Alternatively, you might want to go ahead and grab the top 10 Google matches for the URL that has the most mindshare. To do so, add the following code to the bottom of the script:

print "\nMost popular URLs for the strongest mindshare:\n";
my $most_popular = shift @sorted_urls;
my $results = $google_search->doGoogleSearch(
                    $google_key, "$most_popular", 0, 10,
                    "true", "", "false", "", "", "" );

foreach my $element (@{$results->{resultElements}}) {
   next if $element->{URL} eq $most_popular;
   print " * $element->{URL}\n";
   print "   \"$element->{title}\"\n\n";
}

Then, run the script as usual (the output here uses the default hardcoded directory):

% perl mindshare.pl
27800: http://radio.userland.com/
6670: http://www.oreillynet.com/meerkat/
5460: http://www.newsisfree.com/
3280: http://ranchero.com/software/netnewswire/
1840: http://www.disobey.com/amphetadesk/
847: http://www.feedreader.com/
797: http://www.serence.com/site.php?page=prod_klipfolio
674: http://bitworking.org/Aggie.html
492: http://www.newzcrawler.com/
387: http://www.sharpreader.net/
112: http://www.awasu.com/
102: http://www.bloglines.com/
67: http://www.blueelephantsoftware.com/
57: http://www.blogtrack.com/
50: http://www.proggle.com/novobot/

Most popular URLs for the strongest mindshare:
 * http://groups.yahoo.com/group/radio-userland/
   "Yahoo! Groups : radio-userland"

 * http://groups.yahoo.com/group/radio-userland-francophone/message/76
   "Yahoo! Groupes : radio-userland-francophone Messages : Message 76 ... "

 * http://www.fuzzygroup.com/writing/radiouserland_faq.htm
   "Fuzzygroup :: Radio UserLand FAQ"
...

Hack #50. Weblog-Free Google Results

With so many weblogs being indexed by Google, you might worry about too much emphasis on the hot topic of the moment. In this hack, we’ll show you how to remove the weblog factor from your Google results.

Weblogs—those frequently updated, link-heavy personal pages—are quite the fashionable thing these days. There are at least 400,000 active weblogs across the Internet, covering almost every possible subject and interest. For humans, they’re good reading, but for search engines they are heavenly bundles of fresh content and links galore.

But some people think the search engine’s delight in weblogs is slanting their search results and giving too much emphasis to too small a group of recent rather than evergreen content. As I write, for example, I am the third most important Ben on the Internet, according to Google. This rank comes solely from my weblog’s popularity.

This hack searches Google, discarding any results coming from weblogs. It uses the Google Web Services API (http://api.google.com) and the API of Technorati (http://www.technorati.com/members), an excellent interface to David Sifry’s weblog data-tracking tool [Hack #70]. Both APIs require keys, available from the URLs mentioned.

Finally, you’ll need a simple HTML page with a form that passes a text query to the parameter q (the query that will run on Google), something like this:

<form action="googletech.cgi" method="POST">
Your query: <input type="text" name="q">
<input type="submit" name="Search!" value="Search!">
</form>

The Code

You’ll need the XML::Simple and SOAP::Lite Perl modules. Save the following code to a file called googletech.cgi:

#!/usr/bin/perl -w
# googletech.cgi
# Getting Google results
# without getting weblog results.
use strict;
use SOAP::Lite;
use XML::Simple;
use CGI qw(:standard);
use HTML::Entities (  );
use LWP::Simple qw(!head);

my $technoratikey = "your technorati key here";
my $googlekey = "your google key here";

# Set up the query term
# from the CGI input.
my $query = param("q");

# Initialize the SOAP interface and run the Google search.
my $google_wdsl = "http://api.google.com/GoogleSearch.wsdl";
my $service = SOAP::Lite->service($google_wdsl);
my $result  = $service->doGoogleSearch($googlekey, $query, 0, 10,
                                       "false", "", "false", "", "", "");

# Start returning the results page -
# do this now to prevent timeouts
my $cgi = new CGI;

print $cgi->header(  );
print $cgi->start_html(-title=>'Blog Free Google Results');
print $cgi->h1('Blog Free Results for '. "$query");
print $cgi->start_ul(  );

# Go through each of the results
foreach my $element (@{$result->{'resultElements'}}) {

    my $url = HTML::Entities::encode($element->{'URL'});

    # Request the Technorati information for each result.
    my $technorati_result = get("http://api.technorati.com/bloginfo?".
                                "url=$url&key=$technoratikey");

    # Parse this information.
    my $parser = new XML::Simple;
    my $parsed_feed = $parser->XMLin($technorati_result);

    # If Technorati considers this site to be a weblog,
    # go onto the next result. If not, display it, and then go on.
    if ($parsed_feed->{document}{result}{weblog}{name}) { next; }
    else {
        print $cgi->li('<a href="'.$url.'">'.$element->{title}.'</a>');
        print $cgi->li("$element->{snippet}");
    }
}
print $cgi -> end_ul(  );
print $cgi->end_html;

Let’s step through the meaningful bits of this code. First comes pulling the query in from the CGI form and sending it to Google. Notice the 10 in the doGoogleSearch call; this is the number of search results requested from Google. You should set this as high as Google will allow, or else you might find that searches for terms that are extremely popular in the weblogging world return no results at all, every hit having been rejected as originating from a blog.

Since we’re about to make a web services call for every one of the returned results, which might take a while, we want to start returning the results page now; this helps prevent connection timeouts. As such, we spit out a header using the CGI module, then jump into our loop.

We then get to the final part of our code: actually looping through the search results returned by Google and passing the HTML-encoded URL to the Technorati API as a get request. Technorati will then return its results as an XML document.

Be careful you do not run out of Technorati requests. As I write this, Technorati is offering 500 free requests a day, which, with this script, is around 50 searches. If you make this script available to your web site’s audience, you will soon run out of Technorati requests. One possible workaround is forcing the user to enter her own Technorati key. You can get the user’s key from the same form that accepts the query. See the “Hacking the Hack” section for a means of doing this.

Parsing this result is a matter of passing it through XML::Simple. Since Technorati returns only an XML construct containing name when the site is thought to be a weblog, we can use the presence of this construct as a marker. If the program sees the construct, it skips to the next result. If it doesn’t, the site is not thought to be a weblog by Technorati and we display a link to it, along with the title and snippet (when available) returned by Google.

Hacking the Hack

As mentioned previously, this script can burn through your Technorati allowances rather quickly under heavy use. The simplest way of solving this is to force the end user to supply his own Technorati key. First, add a new input to your HTML form for the user’s key:

Your Technorati key: <input type="text" name="key">

Then, suck in the user’s key as a replacement to your own:

# Set up the query term
# from the CGI input.
my $query = param("q");
$technoratikey = param("key");
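
If a visitor leaves the key field blank, that assignment wipes out your hardcoded key entirely. A minimal extra tweak (a sketch, not part of the original hack) keeps your own key as the fallback:

# Use the visitor's Technorati key only if one was actually supplied;
# otherwise, keep the key already hardcoded in the script.
my $user_key = param("key");
$technoratikey = $user_key if defined $user_key and $user_key =~ /\S/;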

—Ben Hammersley

Hack #51. Spidering, Google, and Multiple Domains

When you want to search a site, you tend to go straight to the site itself and use its native capabilities. But what if you could use Google to search across many similar sites, scraping the pages of most relevance?

If you’re searching for the same thing on multiple sites, it’s handy to use Google’s site: syntax, which allows you to restrict your search to just a particular domain (e.g., perl.org) or set of domains (e.g., org). For example, if you want to search several domains for the word perl, you might have a query that looks like this:

perl ( site:oreilly.com | site:perl.com | site:mit.edu | site:yahoo.com)

You can combine this search with a Perl script to do some specific searching that you can’t do with just Google and can’t do easily with just Perl.

You might wonder why you’d want to involve Google at all in this search. Why not just go ahead and search each domain separately via their search forms and LWP::Simple [Hack #9] or LWP::UserAgent [Hack #10]? There are a few reasons, the first being that each place you want to search might not have its own search engine. Second, Google might have syntaxes—such as title search, URL search, and full-word wildcard search—that the individual sites aren’t providing. Google returns its search results in an array that’s easy to manipulate. You don’t have to use regular expressions or parsing modules to get what you want. And, of course, you’ll also have all your results in one nice, regular format, independent of site-specific idiosyncrasies.

Example: Top 20 Searching on Google

Say you’re a publisher, like O’Reilly, that is interested in finding out which universities are using your books as textbooks. You could do the search at Google itself, experimenting with keywords and limiting your search to the top-level domain edu (like syllabus o'reilly site:edu, or syllabus perl "required reading" site:edu), and you’d have some success. But you’d get far more than the maximum number of results (Google returns only 1,000 matches for a given query) and you’d also get a lot of false positives—pages that include mentions about a book but don’t provide specific course information, or maybe weblogs discussing a class, or even old news stories! It’s difficult to get a list of just class results with keyword searching alone.

So, there are two overall problems to be solved: narrowing your search to edu leaves your pool of potential results too broad, and it’s extremely difficult to find just the right keywords for restricting to university course pages.

This hack tries to solve those problems. First, it uses the top 20 computer science grad schools (as ranked by U.S. News & World Report) as its site searches and puts those sites into an array. Then, it goes through the array and searches for pages from those schools five at a time using the site: syntax. Each query also searches for O'Reilly * Associates (to match both O'Reilly & Associates and O'Reilly and Associates) and the word syllabus.

The last tweak goes beyond keyword searching and makes use of Perl’s regular expressions. As each search result is returned, both the title and the URL are checked for the presence of a three-digit string. A three-digit string? Yup, a course number! This quick regular expression eliminates a lot of the false positives you’d get from a regular Google search. It is not something you can do through Google’s interface.
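
Here is a quick, standalone illustration of that check (the URLs are invented for the example; only the first contains a course-style number):

#!/usr/bin/perl -w
# A tiny demo of the three-digit course-number filter.
use strict;

my @examples = (
    'http://www.example.edu/classes/cs101/syllabus.html',
    'http://www.example.edu/library/new-books.html',
);

for my $url (@examples) {
    print "Looks like a course page: $url\n"
        if $url =~ /http:.*?\/.*?\d{3}.*?/;
}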

Search results that make it over all these hurdles are saved to a file.

The Code

This hack makes use of the SOAP-based Google Web Services API. You’ll need your own Google search key (http://api.google.com) and a copy of the SOAP::Lite (http://www.soaplite.com) Perl module installed.

Save the following code to a file called textbooks.pl:

#!/usr/bin/perl -w
# textbooks.pl
# Generates a list of O'Reilly books used
# as textbooks in the top 20 universities.
# Usage: perl textbooks.pl

use strict;
use SOAP::Lite;

# all the Google information
my $google_key  = "your google key here";
my $google_wdsl = "GoogleSearch.wsdl";
my $gsrch       = SOAP::Lite->service("file:$google_wdsl");

my @toptwenty = ("site:cmu.edu", "site:mit.edu", "site:stanford.edu",
       "site:berkeley.edu", "site:uiuc.edu","site:cornell.edu",
       "site:utexas.edu", "site:washington.edu", "site:caltech.edu",
       "site:princeton.edu", "site:wisc.edu", "site:gatech.edu",
       "site:umd.edu", "site:brown.edu", "site:ucla.edu",
       "site:umich.edu", "site:rice.edu", "site:upenn.edu",
       "site:unc.edu", "site:columbia.edu");

my $twentycount = 0;
open (OUT,'>top20.txt')
 or die "Couldn't open: $!";

while ($twentycount < 20) {

   # our five universities
   my $arrayquery =
      "( $toptwenty[$twentycount] | $toptwenty[$twentycount+1] ".
      "| $toptwenty[$twentycount+2] | $toptwenty[$twentycount+3] ".
      "| $toptwenty[$twentycount+4] )";

   # our search term.
   my $googlequery = "\"o'reilly * associates\" syllabus $arrayquery"; 
   print "Searching for $googlequery\n"; 

   # and do it, up to a maximum of 50 results.
   my $counter = 0; while ($counter < 50) {
       my $result = $gsrch->doGoogleSearch($google_key, $googlequery,
                            $counter, 10, "false", "",  "false",
                            "lang_en", "", "");
       # foreach result.
       foreach my $hit (@{$result->{'resultElements'}}){
           my $urlcheck = $hit->{'URL'};
           my $titlecheck = $hit->{'title'}; 
           my $snip = $hit->{'snippet'};

           # if the URL or title has a three-digit
           # number in it, we clean up the snippet
           # and print it out to our file.
           if ($urlcheck =~/http:.*?\/.*?\d{3}.*?/
                 or $titlecheck =~/\d{3}/) {
              $snip =~ s/<b>/ /g;
              $snip =~ s/<\/b>/ /g;
              $snip =~ s/&#39;/'/g;
              $snip =~ s/&quot;/"/g;
              $snip =~ s/&amp;/&/g;
              $snip =~ s/<br>/ /g;
              print OUT "$hit->{title}\n";
              print OUT "$hit->{URL}\n";
              print OUT "$snip\n\n";
           }
        }

        # go get 10 more
        # search results.
        $counter += 10;
   }

   # our next schools.
   $twentycount += 5; 
}

Running the Hack

Running the hack requires no switches or variables:

% perl textbooks.pl

The output file, top20.txt, looks something like this:

Programming Languages and Compilers CS 164 - Spring 2002 
http://www-inst.eecs.berkeley.edu/~cs164/home.html 
... Tentative  Syllabus  & Schedule of Assignments.  ... you might find 
useful is "Unix in  a Nutshell (System V Edition)" by Gilly, published by  O 
' Reilly   & ...

CS378 (Spring 03): Linux Kernel Programming 
http://www.cs.utexas.edu/users/ygz/378-03S/course.html 
 ...  Guide, 2nd Edition By Olaf Kirch & Terry Dawson  O ' Reilly &   
Associates, ISBN 1-56592  ...  Please  visit Spring 02 homepage for 
information on  syllabus, projects, and  ...    
 
LIS 530: Organizing Information Using the Internet 
http://courses.washington.edu/lis541/syllabus-intro.html 
Efthimis N. Efthimiadis' Site LIS-541  Syllabus  Main Page Syllabus  - Aims  
& Objectives.  ...  Jennifer Niederst.  O'Reilly   and   Associates , 1999.
 
LIS415B * Spring98 * Class Schedule 
http://alexia.lis.uiuc.edu/course/spring1998/415B/lis415.spring98.schedule.
html 
LIS415 (section B): Class Schedule. Spring 98.  Syllabus ...  In Connecting 
to the Internet:  A buyer's guide. Sebastapol, California:  O ' Reilly &   
Associates .
 
Implementation of Information Storage and Retrieval 
http://alexia.lis.uiuc.edu/~dubin/429/429.pdf 
...  In addition to this  syllabus , this course is governed by the rules 
and  ... Advanced  Perl Programming , first edition ( O'Reilly   and   
Associates , Inc.,

INET 200: HTML, Dynamic HTML, and Scripting 
http://www.outreach.washington.edu/dl/courses/inet200/ 
...  such as HTML & XHTML: the Definitive Guide, 4 th edition, O'Reilly   
and  Associates   (which I  ... are assigned, and there is one on the course
syllabus  as Appendix B  ...

Hacking the Hack

There are plenty of things to change in this hack. Since it uses a very specific array (that is, the top 20 computer science grad schools), tweaking the array to your needs should be the first place you start. You can make that array anything you want: different kinds of schools, your favorite or local schools, and so on. You can even break out schools by athletic conference and check them that way. In addition, you can change the keywords to something more befitting your tastes. Maybe you don’t want to search for textbooks, but you’d rather find everything from chemistry labs to vegetarian groups. Change your keywords appropriately (which will probably require a little experimenting in Google before you get them just right) and go to town.

And don’t forget, you’re also running a regular expression check on each keyword before you save it to a file. Maybe you don’t want to do a three-digit check on the title and URL. Maybe you want to check for the string lib, either by itself or as part of the word library:

($urlcheck =~/http:.*?\/.*?lib.*?/) or ($titlecheck =~/.*?lib.*?/)

This particular search will find materials in a school library’s web pages, for the most part, or in web pages that mention the word “library” in the title.

If you’ve read Google Hacks (http://www.oreilly.com/catalog/googlehks/), you might remember that Google offers wildcards for full-word searches, but not for stemming. In other words, you can search for three * mice and get three blind mice, three blue mice, three green mice, and so on. But you can’t plug the query moon* into Google and get moons, moonlight, moonglow, and so on. When you use Perl to perform these checks, you are expanding the kind of searching possible with Google.
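
For instance, here is a small sketch of the kind of stem-style filtering you can layer on top of Google results with a Perl regular expression (the sample titles are invented for the example):

#!/usr/bin/perl -w
# Match "moon" plus any trailing word characters: moons,
# moonlight, moonglow, and so on -- something Google's
# full-word wildcard can't do.
use strict;

my @titles = (
    'Moonlight Hiking in the Cascades',
    'Phases of the Moon Explained',
    'Monday Morning News Roundup',
);

for my $title (@titles) {
    print "$title\n" if $title =~ /\bmoon\w*/i;
}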

Hack #52. Scraping Amazon.com Product Reviews

While Amazon.com has made some reviews available through their Web Services API, most are available only at the Amazon.com web site, requiring a little screen scraping to grab them.

If you’ve written a book called Spidering Hacks and you’re interested to hear what people are saying about it, you could run off to Amazon.com each and every day to check out the reviews. Well, you certainly could, but you wouldn’t, else you’d deserve every bad comment that came your way. Here’s a way to integrate Amazon.com reviews with your web site. Unlike linking or monitoring reviews for changes, this puts the entire text of Amazon.com reviews into your own pages.

The easiest and most reliable way to access customer reviews programmatically is through Amazon.com’s Web Services API. Unfortunately, the API gives only a small window to the larger number of reviews available. An API query for the book Cluetrain Manifesto, for example, includes only three user reviews. If you visit the review page for that book, though, you’ll find 128 reviews. To dig deeper into the reviews available on Amazon.com and use all of them on your own web site, you’ll need to spelunk a bit further into scripting.

The Code

This Perl script builds a URL to the review page for a given ASIN, uses regular expressions to find the reviews, and breaks the review into its pieces: rating, title, date, reviewer, and the text of the review.

Save the following script to a file called get_reviews.pl:

#!/usr/bin/perl -w
# get_reviews.pl
#
# A script to scrape Amazon, retrieve
# reviews, and write to a file.
# Usage: perl get_reviews.pl <asin>
use strict;
use LWP::Simple;

# Take the ASIN from the command line.
my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n";

# Assemble the URL from the passed ASIN.
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";

# Set up unescape-HTML rules. Quicker than URI::Escape.
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;

# Request the URL.
my $content = get($url);
die "Could not retrieve $url" unless $content;

# Loop through the HTML, looking for matches
while ($content =~ m!<img.*?stars-(\d)-0.gif.*?>.*?<b>(.*?)</b>, (.*?)\n.*?Reviewer:\n<b>\n(.*?)</b>.*?</table>\n(.*?)<br>\n<br>!mgis) {

    my($rating,$title,$date,$reviewer,$review) = 
                      ($1||'',$2||'',$3||'',$4||'',$5||'');
    $reviewer =~ s!<.+?>!!g;   # drop all HTML tags
    $reviewer =~ s!\(.+?\)!!g; # remove anything in parenthesis
    $reviewer =~ s!\n!!g;      # remove newlines
    $review =~ s!<.+?>!!g;     # drop all HTML tags
    $review =~ s/($unescape_re)/$unescape{$1}/migs; # unescape.

    # Print the results
    print "$title\n" . "$date\n" . "by $reviewer\n" .
          "$rating stars.\n\n" . "$review\n\n";

}

Running the Hack

This script can be run from a command line, and it requires an ASIN—an Amazon.com unique ID that can be found in the Product Details of each and every product, listed as either “ISBN” or “ASIN”, as shown in Figure 4-3.

Figure 4-3. Amazon.com’s unique ID, listed as an ASIN or ISBN

The reviews are too long to read as they scroll past your screen, so it helps to send the information to a text file (in this case, reviews.txt), like so:

% perl get_reviews.pl asin > reviews.txt

—Paul Bausch

Hack #53. Receive an Email Alert for Newly Added Amazon.com Reviews

This hack keeps an eye on Amazon.com and notifies you, via email, when a new product review is posted to items you’re tracking.

There are obviously some products you care about more than others, and it’s good to be aware of how those products are perceived. Reviews give feedback to publishers, authors, and manufacturers; help customers make buying decisions; and help other retailers decide what to stock. If you want to monitor all the reviews for a product or set of products, visiting each Product Details page to see if a new review has been added is a tedious task.

Instead, you can use a script to periodically check the number of reviews for a given item, and have it send you an email when a new review is added.

The Code

This script requires you to have the XML::Simple Perl module installed, a Developer’s Token (http://www.amazon.com/gp/aws/landing.html), and a product’s unique ASIN (included in the details of every Amazon.com product).

Save the following script to a file called review_monitor.pl:

#!/usr/bin/perl -w
# review_monitor.pl
#
# Monitors products, sending email when a new review is added.
# Usage: perl review_monitor.pl <asin>
use strict;
use LWP::Simple;
use XML::Simple;

# Your Amazon developer's token.
my $dev_token='insert developer token';

# Your Amazon affiliate code. Optional.
# See http://associates.amazon.com/.
my $af_code='insert affiliate tag';

# Location of sendmail and your email.
my $sendmailpath = "insert sendmail location";
my $emailAddress = "insert your email address";

# Take the ASIN from the command line.
my $asin = shift @ARGV or die "Usage: perl review_monitor.pl <asin>\n";

# Get the number of reviews the last time this script ran.
open (ReviewCountDB, "<reviewCount_$asin.db");
my $lastReviewCount = <ReviewCountDB> || 0;
close(ReviewCountDB); # errors?! bah!

# Assemble the query URL (RESTian).
my $url = "http://xml.amazon.com/onca/xml2?t=$af_code" . 
          "&dev-t=$dev_token&type=heavy&f=xml" .
          "&AsinSearch=$asin";

# Grab the content...
my $content = get($url);
die "Could not retrieve $url" unless $content;

# And parse it with XML::Simple.
my $response = XMLin($content);

# Send email if a review has been added.
my $currentReviewCount = $response->{Details}->{Reviews}->{TotalCustomerReviews};
my $productName        = $response->{Details}->{ProductName};
if ($currentReviewCount > $lastReviewCount) {
    open (MAIL, "|$sendmailpath -t") || die "Can't open mail program!\n";
    print MAIL "To: $emailAddress\n";
    print MAIL "From: Amazon Review Monitor\n";
    print MAIL "Subject: A Review Has Been Added!\n\n";
    print MAIL "Review count for $productName is $currentReviewCount.\n";
    close (MAIL);

    # Write the current review count to a file.
    open(ReviewCountDB, ">reviewCount_$asin.db");
    print ReviewCountDB $currentReviewCount;
    close(ReviewCountDB);
}

This code performs a standard Web Services ASIN query, looking for one bit of data: the total number of customer reviews (TotalCustomerReviews). The script saves the number of reviews in a text file (reviewCount_ASIN.db) and, if the count has gone up since the last run, sends an email to let you know.

The $sendmailpath variable should point to your local copy of sendmail, the program that sends email from the server. Most ISPs have sendmail installed in some form or another (often at /usr/bin/sendmail). Check with your local administrator or Internet Service Provider (ISP) if you’re not sure where it’s located.
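
If you can’t find a sendmail binary at all, one alternative (not part of the original hack) is the CPAN module Mail::Sendmail, which talks to an SMTP server directly. Here is a minimal sketch; the addresses and mail server are placeholders, and the two variables stand in for values the main script already has:

#!/usr/bin/perl -w
# send_alert.pl -- send the notification over SMTP instead of
# piping to a local sendmail binary.
use strict;
use Mail::Sendmail;

my $productName        = 'Example Product';  # from the main script
my $currentReviewCount = 4;                   # from the main script

sendmail(
    To      => 'you@example.com',
    From    => 'Amazon Review Monitor <monitor@example.com>',
    Subject => 'A Review Has Been Added!',
    Message => "Review count for $productName is $currentReviewCount.\n",
    smtp    => 'mail.example.com',            # your outgoing mail server
) or die "Mail error: $Mail::Sendmail::error\n";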

Running the Hack

Run the script from the command line, passing it an ASIN (to find an ASIN, see Figure 4-3 in [Hack #52] for guidance):

% perl review_monitor.pl ASIN
Ideally, you want to run this script once every so often in the background, instead of manually executing this query every day. On Linux, you can set it to run as a cron job [Hack #90], like so:

0 12 * * 1-5 perl review_monitor.pl ASIN

This schedules the script to run Monday through Friday at noon on each day. Be sure to replace ASIN with a real ASIN, and add jobs as necessary for all the items you want to monitor.

On Windows, you can run the script as a Scheduled Task. From the Control Panel, choose Scheduled Tasks and then Add Scheduled Task. Follow the wizard to set your execution time, and you should be all set for review notifications!

—Paul Bausch

Hack #54. Scraping Amazon.com Customer Advice

Screen scraping can give you access to Amazon.com community features not yet implemented through Amazon.com’s public Web Services API. In this hack, we’ll implement a script to scrape customer buying advice.

Customer buying advice isn’t available through Amazon.com’s Web Services API, so if you’d like to include this information on a remote site, you’ll have to get it from Amazon.com’s site through scraping. The first step to this hack is knowing where to find all the customer advice on one page. The following URL links directly to the advice page for a given ASIN (the unique ID Amazon.com displays for each product [Hack #52]):

http://amazon.com/o/tg/detail/-/insert ASIN/?vi=advice

For example, here is the advice page for Mac OS X Hacks:

http://amazon.com/o/tg/detail/-/0596004605/?vi=advice

The Code

This Perl script splits the advice page into two variables, based on the headings “in addition to” and “instead of.” It then loops through those sections, using regular expressions to match the products’ information. The script then formats and prints the information.

Save the following script to a file called get_advice.pl:

#!/usr/bin/perl -w
# get_advice.pl
#
# A script to scrape Amazon to retrieve customer buying advice
# Usage: perl get_advice.pl <asin>
use strict; use LWP::Simple;

# Take the ASIN from the command line.
my $asin = shift @ARGV or die "Usage: perl get_advice.pl <asin>\n";

# Assemble the URL from the passed ASIN.
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=advice";

# Set up unescape-HTML rules. Quicker than URI::Escape.
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;

# Request the URL.
my $content = get($url);
die "Could not retrieve $url" unless $content;

# Get our matching data.
my ($inAddition) = (join '', $content)
    =~ m!in addition to(.*?)(instead of)?</td></tr>!mis;
my ($instead)    = (join '', $content)
    =~ m!recommendations instead of(.*?)</table>!mis;

# Look for "in addition to" advice.
if ($inAddition) { print "-- In Addition To --\n\n";
   while ($inAddition =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!mgis) {
       my ($place,$thisAsin,$title,$number) = ($1||'',$2||'',$3||'',$4||'');
       $title =~ s/($unescape_re)/$unescape{$1}/migs; #unescape HTML 
       print "$place $title ($thisAsin)\n(Recommendations: $number)\n\n";
   }
}

# Look for "instead of" advice.
if ($instead) { print "-- Instead Of --\n\n";
    while ($instead =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!mgis) {
        my ($place,$thisAsin,$title,$number)
          = ($1||'',$2||'',$3||'',$4||'');
        $title =~ s/($unescape_re)/$unescape{$1}/migs; #unescape HTML 
        print "$place $title ($thisAsin)\n(Recommendations: $number)\n\n";
    }
}

Running the Hack

You can run this script from the command line, passing in any ASIN. Here is the one for Mac OS X Hacks:

% perl get_advice.pl 0596004605
-- In Addition To --

1. Mac OS X: The Missing Manual, Second Edition (0596004508)
(Recommendations: 1)

2. Mac Upgrade and Repair Bible, Third Edition (0764525948)
(Recommendations: 1)

If the book has long lists of alternate products, send the output to a text file. This example sends all alternate product recommendations for Google Hacks to a file called advice.txt:

% perl get_advice.pl 0596004478 > advice.txt

—Paul Bausch

Hack #55. Publishing Amazon.com Associates Statistics

Share some insider knowledge, such as the most popular item sold, with your site’s audience by republishing your Amazon.com Associates sales statistics.

Your web site has a unique audience, and looking at what they purchase through your Amazon.com Associate links can tell you more about them. It can provide insights into other items you might want to sell on your site, and it can help show what’s foremost on your visitors’ minds (for better or worse). Just as Amazon.com shares its aggregated sales information in the form of purchase circles, you can create your own purchase circle list by publishing your Associates sales information.

Your readers are probably just as curious about sales trends through your site as you are. Publishing the list can build a sense of community and, don’t forget, drive more sales through Associate links.

You could save the HTML reports available through your Associates account (http://associates.amazon.com) through your browser, but it would be much easier to automate the process and integrate it into your site design with a few lines of Perl.

The Code

To run this code, you’ll need to set the email address and password you use to log into your Associates account. This script will then do the logging in for you, and download the appropriate sales report. Once the script has the report, it will reformat it as HTML.

Because this script logs into Amazon.com, it requires the use of a cookie to remind Amazon.com that you’re an authenticated user. Since this is a one-time-only request, we use an in-memory cookie (which is forgotten when the script is finished).

The code listed here intentionally logs you in under an unsecured HTTP connection, to better ensure that the script is portable across systems that don’t have the relevant SSL libraries installed. If you know you have them working properly, be sure to change http:// to https:// to gain some added protection for your login information.

Save the following script to a file called get_earnings_report.pl:

#!/usr/bin/perl -w
# get_earnings_report.pl
#
# Logs into Amazon, downloads earning report,
# and writes an HTML version for your site.
# Usage: perl get_earnings_report.pl
use strict;
use URI::Escape;
use HTTP::Cookies;
use LWP::UserAgent;

# Set your Associates account info.
my $email = 'insert email address';
my $pass = 'insert password';
my $aftag = 'insert associates tag';

# Create a user agent object
# and fake the agent string.
my $ua = LWP::UserAgent->new;
$ua->agent("(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)");
$ua->cookie_jar({}); # in-memory cookie jar.

# Request earning reports, logging in as one pass.
my $rpturl  = "http://associates.amazon.com/exec/panama/login/".
              "attempt/customer/associates/no-customer-id/25/".
              "associates/resources/reporting/earnings/";
my $rptreq  = HTTP::Request->new(POST => $rpturl);
my $rptdata = "report-type=shipments-by-item".   # get individual items
              "&date-selection=qtd".             # all earnings this quarter
              "&login_id=".uri_escape($email).   # our email address.
              "&login_password=".uri_escape($pass).  # and password.
              "&submit.download=Download my report". # get downloadble.
              "&enable-login-post=true"; # log in and post at once.
$rptreq->content_type('application/x-www-form-urlencoded');
$rptreq->content($rptdata); my $report = $ua->request($rptreq);
 
# Uncomment the following line to see
# the report if you need to debug.
# print $report->content;

# Set the report to array.
my @lines = split(/\n/, $report->content);
 
# Get the time period.
my @fromdate = split(/\t/, $lines[1]);
my @todate = split(/\t/, $lines[2]);
my $from = $fromdate[1];
my $to = $todate[1];
 
# Print header...
print "<html><body>";
print "<h2>Items Purchased Through This Site</h2>";
print "from $from to $to <br><br>\n";
print "<ul>";
 
# Loop through the rest of the report.
splice(@lines,0,5);
foreach my $line (@lines) {
    my @fields  = split(/\t/, $line);
    my $title   = $fields[1];
    my $asin    = $fields[2];
    my $edition = $fields[4];
    my $items   = $fields[8];

    # Format items as HTML for display.
    print "<li><a href=\"http://www.amazon.com/o/ASIN/$asin/ref=nosim/".
          "$aftag\">$title</a> ($items) $edition <br>\n";
}
print "</ul></body></html>";

Running the Hack

Run the hack from a command line:

% perl get_earnings_report.pl

It prints out the formatted HTML results, so you might want to pipe its output to another file, like this:

% perl get_earnings_report.pl > amazon_report.html

You could also set this to run on a regular schedule [Hack #90] so your community’s buying habits stay up-to-date.
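
For example, a crontab entry along these lines (the paths are placeholders) rebuilds the page early every morning:

0 6 * * * perl /home/you/get_earnings_report.pl > /www/htdocs/amazon_report.html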

—Paul Bausch

Hack #56. Sorting Amazon.com Recommendations by Rating

Find the highest-rated items among your Amazon.com product recommendations.

If you’ve taken the time to fine-tune your Amazon.com recommendations, you know how precise they can be. If you’ve also looked at the star rating for some of your favorite products, then you know that the rating can be a good indication of quality. The Amazon.com recommendation and the customer rating both add important information to a product, and they can help you make a decision about whether or not to buy one item over another.

To get a feel for the products Amazon.com recommends for you, you can visit your book recommendations at any time at the following URL:

http://www.amazon.com/o/tg/stores/recs/instant-recs/-/books/0/

In addition to books, you can also find recommendations in other product categories. You can replace books in the URL with any of Amazon.com’s catalogs, including music, electronics, dvd, and photo.

When you browse to your recommendations, you’ll likely find several pages of items. Wouldn’t it be great if you could add the customer review dimension by sorting the entire list by its average star rating? This hack does exactly that with a bit of screen scraping.

The Code

Because Amazon.com doesn’t offer sorting by customer rating, this script first gathers all of your Amazon.com book recommendations into one list. By providing your Amazon.com account’s email address and password, the script logs in as you and then requests the book recommendations page. It continues to request pages in a loop, picking out the details of your product recommendations with regular expressions. Once all the products and details are stored in an array, they can be sorted by star rating and printed out in any order you want—in this case, the average star rating.

Be sure to replace your email address and password in the proper places in the following code. You’ll also need to have write permission in the script’s directory so you can store Amazon.com cookies in a text file, cookies.lwp.

The code listed here intentionally logs you in under an unsecured HTTP connection, to better ensure that the script is portable across systems that don’t have the relevant SSL libraries installed. If you know you have them working properly, be sure to change http:// to https:// to gain some added protection for your login information.

Save the following script to a file called get_recommendations.pl:

#!/usr/bin/perl  -w
# get_recommendations.pl
#
# A script to log on to Amazon, retrieve
# recommendations, and sort by highest rating.
# Usage: perl get_recommendations.pl

use strict;
use HTTP::Cookies;
use LWP::UserAgent;

# Amazon email and password.
my $email = 'insert email address';
my $password = 'insert password';

# Amazon login URL for normal users.
my $logurl = "http://www.amazon.com/exec/obidos/flex-sign-in-done/";

# Now log into Amazon.
my $ua = LWP::UserAgent->new;
$ua->agent("(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)");
$ua->cookie_jar( HTTP::Cookies->new('file' => 'cookies.lwp','autosave' => 1));
my %headers = ( 'content-type' => "application/x-www-form-urlencoded" );
$ua->post($logurl, 
  [ email       => $email,
    password    => $password,
    method      => 'get', opt => 'oa',
    page        => 'recs/instant-recs-sign-in-standard.html',
    response    => "tg/recs/recs-post-login-dispatch/-/recs/pd_rw_gw_r",
    'next-page' => 'recs/instant-recs-register-standard.html',
    action      => 'sign-in checked' ], %headers);

# Set some variables to hold
# our sorted recommendations.
my (%title_list, %author_list);
my (@asins, @ratings, $done);

# We're logged in, so request the recommendations.
my $recurl = "http://www.amazon.com/exec/obidos/tg/". 
             "stores/recs/instant-recs/-/books/0/t";

# Set all Amazon recommendations in
# an array/title and author in hashes.
until ($done) {

     # Send the request for the recommendations.
     my $content = $ua->get($recurl)->content;

     # Loop through the HTML, looking for matches.
     while ($content =~ m!<td colspan=2 width=100%>.*?detail/-/(.*?)/ref.*?<b>(.*?)</b>.*?by (.*?)\n.*?Average Customer Review&#58;.*?(.*?)out of 5 stars.*?<td colspan=3><hr noshade size=1></td>!mgis) {
         my ($asin,$title,$author,$rating) = ($1||'',$2||'',$3||'',$4||'');
         $title  =~ s!<.+?>!!g; # drop all HTML tags, cheaply.
         $rating =~ s!\n!!g;    # remove newlines from the rating.
         $rating =~ s! !!g;     # remove spaces from the rating.
         $title_list{$asin} = $title;    # store the title.
         $author_list{$asin} = $author;  # and the author.
         push (@asins, $asin);           # and the ASINs.
         push (@ratings, $rating);       # and the ... OK!
     }

     # See if there are more results. If so, continue the loop.
     if ($content =~ m!<a href=(.*?instant-recs.*?)>more results.*?</a>!i) {
        $recurl = "http://www.amazon.com$1"; # reassign the URL.
     } else { $done = 1; } # nope, we're done.
}

# Sort the results by highest star rating and print!
for (sort { $ratings[$b] <=> $ratings[$a] } 0..$#ratings) {
    next unless $asins[$_]; # skip el blancos.
    print "$title_list{$asins[$_]}  ($asins[$_])\n" . 
          "by $author_list{$asins[$_]} \n" .
          "$ratings[$_] stars.\n\n";
}

Running the Hack

Run the hack from the command line and send the results to another file, like this:

% perl get_recommendations.pl > top_rated_recommendations.txt

The text file top_rated_recommendations.txt should be filled with product recommendations, with the highest-rated items on top. You can tweak the URL in $recurl to look for DVDs, CDs, or other product types, by changing the books URL to the product line you’re interested in.
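
For example, to sort your DVD recommendations instead, point $recurl at the dvd catalog; this is simply the recommendations URL shown earlier with books swapped out:

# DVD recommendations instead of books.
my $recurl = "http://www.amazon.com/exec/obidos/tg/".
             "stores/recs/instant-recs/-/dvd/0/t";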

—Paul Bausch

Hack #57. Related Amazon.com Products with Alexa

Given any URL, Alexa will return traffic data, user ratings, and even related Amazon.com products. This hack creates a cloud of related product data for any given URL.

Alexa (http://www.alexa.com), an Amazon.com property, measures a web site’s traffic, then rates it for popularity based on other sites with similar topics. Along with these Related Links, you can also read and write reviews, as well as find similar products at Amazon.com. Some interesting scripts can be created simply by following through with the various information Alexa provides via its XML exports. For example, we can create a list of products recommended not only for a given web site, but also for web sites that are related to the original. Following those related web sites and obtaining their related Amazon.com products creates a cloud of items related to the original URL. In the following section, we’ll walk you through the code for one such cloud creator.

The Code

For this script, you’ll need an Amazon.com developer token, which can be obtained from http://www.amazon.com/webservices/. Save the following code to a file called alexa.pl:

#!/usr/bin/perl -w
use strict;
use URI;
use LWP::Simple;
use Net::Amazon;
use XML::Simple;
use constant AMAZON_TOKEN => 'your token here';
use constant DEBUG => 0;

# get our arguments. the first argument is the
# URL to fetch, and the second is the output.
my $url = shift || die "$0 <url> [<output>]\n";
my $output = shift || '/www/htdocs/cloud.html';

# we'll need to fetch the Alexa XML at some point, and
# we'll do it a few different times, so we create a 
# subroutine for it. Using the URI module, we can
# correctly encode a URL with a query. In fact, you'll
# notice the majority of this function is involved with
# this, and at the end we use LWP::Simple to actually
# download and return the XML.
#####################################################
sub fetch_xml {
    my $url = shift;
    $url = "http://$url" unless $url =~ m[^http://];
    warn "Fetching Alexa data for $url\n" if DEBUG;

    my @args = (
        cli => 10,     dat => 'snba',
        ver => '7.0',  url => $url,
    );

    my $base = 'http://data.alexa.com/data';
    my $uri = URI->new( $base );
    $uri->query_form( @args );
    $uri = $uri->as_string;

    return get( $uri );
}

# raw XML is no good for us, though, as we want to extract
# particular items of interest. we use XML::Simple to turn
# the XML into Perl data structures, because it's easier
# than fiddling with event handling (as with XML::Parser
# or XML::SAX), and we know there's only a small amount of
# data. we want the list of related sites and the list of
# related products. we extract and return both.
#####################################################
sub handle_xml {
    my $page = shift;
    my $xml = XMLin( $page );
    my @related = map {
        {
            asin => $_->{ASIN},
            title => $_->{TITLE},
            href => $xml->{RLS}{PREFIX}.$_->{HREF},
        }
    } @{ $xml->{RLS}{RL} };

    my @products;
    if (ref $xml->{SD}{AMZN}{PRODUCT} eq 'ARRAY') {
        @products = map { $_->{ASIN} } @{ $xml->{SD}{AMZN}{PRODUCT} };
    } else { @products = $xml->{SD}{AMZN}{PRODUCT}{ASIN}; }

    return ( \@related, \@products );
}

# Functions done; now for the program:
warn "Start URL is $url\n" if DEBUG;
my @products; # running accumulation of product ASINs

{
    my $page = fetch_xml( $url );
    my ($related, $new_products) = handle_xml( $page );
    @products = @$new_products; # running list

    for (@$related) {
        my $xml = fetch_xml( $_->{href} );
        my ($related, $new_products) = handle_xml( $xml );
        push @products, @$new_products;
    }
}

# We now have a list of products in @products, so
# we'd best do something with them. Let's look
# them up on Amazon and see what their titles are.
my $amazon = Net::Amazon->new( token => AMAZON_TOKEN );
my %products = map { $_ => undef } @products;

for my $asin ( sort keys %products ) {
    warn "Searching for $asin...\n" if DEBUG;
    my $response = $amazon->search( asin => $asin );
    my @products = $response->properties;
    die "ASIN is not unique!?" unless @products == 1;
    my $product = $products[0];
    $products{$asin} = {
        name => $product->ProductName,
        price => $product->OurPrice,
        asin => $asin,
    };
}

# Right. We now have name, price, and
# ASIN. Let's output an HTML report:
{
    umask 022;
    warn "Writing to $output\n" if DEBUG;
    open my $fh, '>', $output or die $!;
    print $fh "<html><head><title>Cloud around $url</title></head><body>";
    if (keys %products) {
        print $fh "<table>";
        for my $asin (sort keys %products) {
            my $data = $products{$asin};
            printf $fh "<tr><td>".
                       "<a href=\"http://amazon.com/exec/obidos/ASIN/%s\">".
                       "%s</a></td> <td>%s</td></tr>",
                       @{$data}{qw( asin name price )};
        }
        print $fh "</table>";
    }
    else { print $fh "No related products found.\n"; }
    print $fh "</body></html>\n";
}

Running the Hack

Run the script on the command line, passing it the URL you’re interested in and a filename to which you’d like the results saved (you can also hardcode a default output location into the script). The following output shows an example of the script’s DEBUG output turned on:

% perl alexa.pl http://www.gamegrene.com/ testing.html
Start URL is http://www.gamegrene.com/
Fetching Alexa data for http://www.gamegrene.com/
Fetching Alexa data for http://www.elvesontricycles.com/
Fetching Alexa data for http://www.chimeramag.com/
Fetching Alexa data for http://pages.infinit.net/raymondl
Fetching Alexa data for http://www.beyond-adventure.com/
Fetching Alexa data for http://strcat.com/News
Fetching Alexa data for http://members.aol.com/stocdred
Fetching Alexa data for http://lost-souls.hk.st/
Fetching Alexa data for http://www.gamerspulse.com/
Fetching Alexa data for http://www.gignews.com/
Fetching Alexa data for http://www.gamesfirst.com/
Searching for 0070120102...
Searching for 0070213631...
Searching for 0070464081...
Searching for 0070465886...
..etc..
Searching for 1879239027...
Writing to testing.html

Figure 4-4 shows an example of the resulting file.

Figure 4-4. Amazon.com’s related products for Gamegrene.com

Hacking the Hack

As the script stands, it requires manual running or a cron script [Hack #90] to regularly place the latest information on your own pages (if that’s your intent, of course). You might want to turn this into a CGI program and let people enter web sites of their own choice. This is pretty easy to do. If you’ve created an HTML form that accepts the desired web site in an input named url, like this:

<form method="GET" action="alexa.pl">
URL: <input type="text" name="url" />
</form>

then modifying your script to accept this value means changing this:

# get our arguments. the first argument is the
# URL to fetch, and the second is the output.
my $url = shift || die "$0 <url> [<output>]\n";
my $output = shift || '/www/htdocs/cloud.html';

to this:

use LWP::Simple qw(!head);
use CGI qw/:standard/;
my $url = param('url');

and changing the output from a filename from this:

warn "Writing to $output\n" if DEBUG;
open my $fh, '>', $output or die $!;

to the waiting web browser:

my $fh = *STDOUT; # redirect.
print $fh "Content-type: text/html\n\n";

Be sure to remove the extraneous use LWP::Simple; line at the beginning of the script. Since both CGI and LWP::Simple have a function named head, you’ll get a number of warning messages about redefinitions, unless you change the way LWP::Simple has been imported. By telling it not to import its own unnecessary head function, our new code circumvents these warnings.

—Iain Truskett

Hack #58. Scraping Alexa’s Competitive Data with Java

Alexa tracks the browsing habits of its millions of users daily. This hack allows you to aggregate the traffic statistics of multiple web properties into one RSS file, with subscriptions available daily.

Alexa (http://www.alexa.com) recently launched a section of its web site, detailing the observed traffic of its millions of users on a daily basis. Using this freely available data, you can track the traffic of your site, or your competitors’ sites, over time. We’ll scrape this traffic data into an RSS file [Hack #94] for your consumption.

The Code

The hack consists of five Java classes, each designed to handle different aspects of downloading, parsing, and presenting Alexa’s traffic content. The full code can be downloaded from this book’s web site (http://www.oreilly.com/catalog/spiderhks/).

The primary class of our Java application (Report) allows you to pass a URL to Alexa’s web site for every domain you’re interested in tracking. The appropriate Alexa page is downloaded, and its content is parsed for the key bits of daily data. Once this data is organized, we will need to mark it up for presentation and, finally, write the presentable file to disk.

The first step (Website) streams the source into your computer’s memory. We eliminate everything but the body of the page, since this is where all of our data lies.

Now that we have the page’s source stored in memory, we need to identify the key data components within the myriad lines of HTML. Alexa’s pages do not conform to strict XML, so string parsing (the Parse class) is our best and quickest route of attack.

We will navigate through the page’s source code in serial, pulling the data we need and leaving a marker on our trail to speed up our search. Key phrases of text need to be identified in close vicinity to our key data so that we can consistently pull the correct data, regardless of the size of a web property.

Now that we have all our data, we need somewhere to store our findings for use across multiple classes. We create an entity bean-style data object to store each of the key pieces of data. Our code for doing so is in TrafficBean.

Finally, we present our findings to whomever might be interested through an RSS file (RSSWriter). By default, the RSS file is saved to the current user’s home directory (C:\Documents And Settings\$user on Microsoft Windows platforms, or /home/$user on most versions of Unix). It is assumed that you have sufficient write permissions within your home directory to perform this action.

Running the Hack

The only external library required is Apache’s Xerces for Java (http://xml.apache.org/xerces2-j/). Web property names should be hardcoded in the Report class to allow for consistent scheduled runs. You can pass domain strings in the format of site.tld at runtime or, if no parameters are found, the code will iterate through a previously created string array. You might also want to set yourself up with an RSS aggregator if you do not already have one. I use FeedDemon (http://www.bradsoft.com/feeddemon/index.asp) for Windows.

Hacking the Hack

Possibilities abound:

  • Set up a cron script [Hack #91] on your machine to generate a new report every evening.

  • Using the percentage numbers from the returned subdomains, calculate the total reach and views for each of the domains within the web property.

  • Hook your findings into a database for larger comparison sets over time.

  • Using the RSS file as your data source, create a time series graph [Hack #62]. Use views or ranges as your y-axis and time as your x-axis. Overlay all of your sites using different colors and save for use in reports.

—Niall Kennedy

Hack #59. Finding Album Information with FreeDB and Amazon.com

By combining identifying information from one database with related information from another, you can create powerful applications with little effort.

Although using an MP3 collection to turn your computer into a jukebox might be all the rage these days, some of us are still listening to audio CDs. And, thanks to the FreeDB project (http://www.freedb.org) and the original CDDB before it, we can identify CDs based on their contents and look up information such as artist and the names of tracks. Once we have that information, we can try looking up more from other sources.

With the help of the Amazon.com API (http://www.amazon.com/webservices/), we can find things like cover art, other albums by the same artist, and release dates of albums. If we put this all together, we can come up with a pretty decent Now Playing display for what we’re listening to.

Getting Started

So, this is what we want our script to do:

  • Calculate a disc ID for the current CD.

  • Perform a search on FreeDB for details on the CD.

  • Use the FreeDB record data to perform a search at Amazon.com.

  • Get information from the Amazon.com results for the current album.

  • Collect details on other albums from the same artist.

  • Construct an HTML page to display all the results.

To get this hack started, let’s sketch out the overall flow of our script:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;

# Settings for our Amazon developer account
our $amazon_affilate_id = "your affiliate ID, if any";
our $amazon_api_key     = "your amazon api key";

# Location of a FreeDB mirror web interface
our $freedb_url  = 'http://freedb.freedb.org/~cddb/cddb.cgi';

# Get the discid of the current CD
my $discid = get_discid(  );

# Search for the CD details on FreeDB
my $cd_info = freedb_search($discid);

# Given the artist, look for music on Amazon
my @amazon_rec = amazon_music_search($cd_info->{artist});

# Try to match the FreeDB title up
# with Amazon to find current playing.
my $curr_rec = undef;
my @other_recs = (  );
for my $rec (@amazon_rec) {
  if ( !defined $curr_rec && $cd_info->{title} eq $rec->{title} ) {
    $curr_rec = $rec;
  } else {
    push @other_recs, $rec;
  }
}

print html_template({current=>$curr_rec, others=>\@other_recs});

Note that we’ve set up a few overall configuration variables, such as our Amazon.com affiliate ID and a key for use with calls to the API. You’ll want to check out the documentation for Amazon.com Web Services and sign up for a developer token. This allows Amazon.com to identify one consumer of their services from another. Now we have the overall flow of the script, so let’s work out the implementation of the functions we’re calling.

Checking Your Disc ID

The first part of our hack is a little tricky, and it depends a lot on your operating system. To perform a search on FreeDB, we first need to identify the current CD, and that requires access to the CD device itself. This is fairly easy to do under Linux and Mac OS X; other environments will require more homework.

For Linux and Mac OS X, we can use a small program called cd-discid (http://lly.org/~rcw/cd-discid/). If you happen to be using Debian Linux, you can install the cd-discid package using apt-get. If you’re on Mac OS X and have Fink (http://fink.sourceforge.net) installed, use fink install cd-discid. If neither of these things apply to you, don’t worry, we can skip this step and use a hardcoded disc ID to see how the script works, at least.

Once the program is installed, we can use this function under Linux:

sub get_discid {
  # For Linux
  my $cd_discid = '/usr/local/bin/cd-discid';
  my $cd_dev    = '/dev/cdrom';
  return `$cd_discid $cd_dev`;
}

Basically, this calls the disc ID program using /dev/cdrom as the device containing the audio CD to be identified. You might need to adjust the path to both the program and the CD device in this function.

If you’re using Mac OS X, then this implementation should work for you:

sub get_discid {
  # For Mac OS X
  my $cd_discid = '/sw/bin/cd-discid';
  my ($cd_dev)  = '/dev/'.
    join '', map { /= "(.*?)"$/ }
      grep { /"BSD Name"/ }
        split(/\n/, `ioreg -w 0 -c IOCDMedia`);
  return `$cd_discid $cd_dev`;
}

This looks kind of tricky, but it uses a utility called ioreg, which lists I/O devices registered with the system. We check for devices in which CD media is currently inserted and do some filtering and scraping to discover the BSD Unix device name for the appropriate device. It’s dirty, but it works well.

However, if none of this works for you (either because you’re using a Windows machine, or else had installation problems with the source code), you can opt to use a canned disc ID in order to explore the rest of this hack:

sub get_discid {
  # If all else fails... use Weird Al's "Alapalooza"
  return "a60a840c+12 150 17795 37657 54225 72617 87907 106037 ".
    "125857 141985 164055 165660 185605 2694";
}

Digging Up the FreeDB Details

Once we have a disc ID, we can make a query against the FreeDB web service. From there, we should be able to get the name of the artist, as well as the album title and a list of track titles. Usage of the FreeDB web service is described at:

http://www.freedb.org/modules.php?name=Sections&sop=viewarticle&artid=28

under Addendum B, “CDDBP under HTTP.”

Let’s start implementing the FreeDB search by making a call to the web service:

sub freedb_search {
  my $discid = shift;

  # Get the discid for the current
  # CD and make a FreeDB query with it.
  $discid =~ s/ /+/g;
  my $disc_query = get("$freedb_url?cmd=cddb+query+$discid&".
                       "hello=joe_random+www.asdf.com+freebot+2.1&proto=1");
  my ($code, $cat, $id, @rest) = split(/ /, $disc_query);

The first thing we do is escape the spaces in the disc ID for use in the URL used to request a query on the FreeDB web service. Then, we request the URL. In response to the request, we get a status code, along with a category and record ID. We can use this category and record ID to look up the details for our audio CD:

  # Using the results of the discid query, look up the CD's details.
  # Create a hash from the name/value pairs in the detail response.
  # (Note that we clean up EOF characters in the data.)
  my %freedb_data =
    map { s/\r//; /(.*)=(.*)/ }
      split(/\n/,
            get("$freedb_url?cmd=cddb+read+$cat+$id&".
                "hello=deusx+www.decafbad.com+freebot+2.1&proto=1"));

The result of the FreeDB read request gives us a set of name/value pairs, one per line. So, we can split the result of the query by lines and use a regular expression on each to extract the name/value pairs and place them directly into a hash. However, as we receive it, the data is not quite as convenient to handle as it could be, so we can rearrange and restructure things before returning the results:

  # Rework the FreeDB result data into
  # a more easily handled structure.
  my %disc_info = ( );

  # Artist and title are separated by ' / ' in DTITLE.
  ($disc_info{artist}, $disc_info{title}) =
  split(/ \/ /, $freedb_data{DTITLE});

  # Extract series of tracks from
  # TTITLE0..TTITLEn; stop at
  # first empty title.
  my @tracks = (  );
  my $track_no = 0;
  while ($freedb_data{"TTITLE$track_no"}) {
    push @tracks, $freedb_data{"TTITLE$track_no"};
    $track_no++;
  }
  $disc_info{tracks} = \@tracks;

  return \%disc_info;
}

With this, we convert a flat set of cumbersome name/value pairs into a more flexible Perl data structure. Artist name and album title are accessible via artist and title keys in the structure, respectively, and track names are available as an array reference under the tracks key.
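
To make the shape of that structure concrete, here is roughly what freedb_search(  ) hands back (the values are placeholders, not real FreeDB data):

# A sketch of the data structure returned by freedb_search().
my $cd_info = {
    artist => 'Some Artist',
    title  => 'Some Album',
    tracks => [ 'Opening Track', 'Second Track', 'Closing Track' ],
};

print "Now playing: $cd_info->{title} by $cd_info->{artist}\n";
print "  $_\n" for @{ $cd_info->{tracks} };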

Rocking with Amazon.com

The next thing our script needs is the ability to search Amazon.com for products by a given artist. Luckily, Amazon.com’s Web Services produce clean XML, so it won’t be too hard to extract what we need from the data, even without using a full XML parser.

But first, we’ll need a couple of convenience functions added to our script:

sub trim_space {
  my $val = shift;
  $val=~s/^\s+//;
  $val=~s/\s+$//g;
  return $val;
}

sub clean_name {
  my $name = shift;
  $name=lc($name);
  $name=trim_space($name);
  $name=~s/[^a-z0-9 ]//g;
  $name=~s/ /_/g;
  return $name;
}

The first function trims whitespace from the ends of a string, and the second cleans up a string to ensure that it contains only lowercase alphanumeric characters and underscores. This last function is used to make fairly uniform hash keys in data structures.
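
A quick example of what these helpers do, assuming the two functions above are loaded (the input strings are arbitrary):

print clean_name('ProductName'), "\n";     # prints "productname"
print clean_name(' Release Date '), "\n";  # prints "release_date"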

Next, we can implement our Amazon.com Web Services (http://www.amazon.com/gp/aws/landing.html) searching code:

# Search for authors via the Amazon search API.
sub amazon_music_search {
  my ($artist) = @_;
  $artist =~ s/[^A-Za-z0-9 ]/ /g;

  # Construct the base URL for Amazon author searches.
  my $base_url = "http://xml.amazon.com/onca/xml3?t=$amazon_affilate_id&".
    "dev-t=$amazon_api_key&mode=music&type=lite&f=xml".
      "&ArtistSearch=$artist";

The first thing we do is take the artist name as a parameter and try to clean up all characters that aren’t alphanumeric or spaces. Then, we construct the URL to query the web service, as described in the documentation from the Amazon.com software development kit.

Next, we start to get the results of our search. Queries on Amazon.com’s Web Services return results a handful at a time across several pages; so, if we want to gather all the results, we’ll first need to figure out how many total pages there are. Luckily, this is a part of every page of results, so we can grab the first page and extract this information with a simple regular expression:

  # Get the first page of search results.
  my $content = get($base_url."&page=1");

  # Find the total number of search results pages to be processed.
  $content =~ m{<totalpages>(.*?)</totalpages>}mgis;
  my ($totalpages) = ($1||'1');

After getting the total number of pages, we can start gathering the rest of the pages into an array, starting with the first page we have already downloaded. We can do this with a quick Perl expression that maps the page numbers to page requests, the results of which are added to the array. Notice that we also sleep for a second in between requests, as per the instructions in the Amazon.com Web Services license:

  # Grab all pages of search results.
  my @search_pages = ($content);
  if ($totalpages > 1) {
    push @search_pages,
      map { sleep(1); get($base_url."&page=$_") } (2..$totalpages);
  }

Now that we have all the pages of the results, we can process them all and extract data for each album found. Details for each item are, appropriately enough, found as children of a tag named details. We can extract these children from each occurrence of the details tag using a regular expression. We can also grab the URL to the item detail page from an attribute named url:

  # Extract data for all the records
  # found in the search results.
  my @records;
  for my $content (@search_pages) {

    # Grab the content of all <details> tags
    while ($content [RETURN]
        =~ m{<details(?!s) url="(.*?)".*?>(.*?)</details>}mgis) {
      # Extract the URL attribute and tag body content.
      my($url, $details_content) = ($1||'', $2||'');

After extracting the child tags for a detail record, we can build a Perl hash from child tag names and their content values, using another relatively simple regular expression and our convenience functions:

      # Extract all the tags from the detail record, using
      # tag name as hash key and tag contents as value.
      my %record = (_type=>'amazon', url=>$url);
      while ($details_content =~ m{<(.*?)>(.*?)</\1>}mgis) {
        my ($name, $val) = ($1||'', $2||'');
        $record{clean_name($name)} = $val;
      }

However, not all of the child tags of details are flat tags. In particular, the names of artists for an album are child tags. So, with one more regular expression and a map function, we can further process these child tags into a list. We can also rename productname to title, for more intuitive use later:

      # Further process the artists list to extract author
      # names, and standardize on product name as title.
      my $artists = $record{artists} || '';
      $record{artists} =
        [ map { $_ } ( $artists =~ m{<artist>(.*?)</artist>}mgis ) ];
      $record{title} = $record{productname};

      push @records, \%record;
    }
  }
  return @records;
}

So, with a few web requests and less than a handful of regular expressions, we can search for and harvest a pile of records on albums found at Amazon.com for a given artist.

Presenting the Results

At this point, we can identify a CD, look up its details in FreeDB, and search for albums at Amazon.com. The last thing our main program does is combine all these functions, determine which Amazon.com product is the current album, and feed it and the rest of the albums to a function to prepare an HTML page with the results.
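Here is a minimal sketch of that glue, assuming the FreeDB lookup shown earlier is wrapped in a function called get_disc_info and that the page is written to nowplaying.html (both names are made up for this example); amazon_music_search, clean_name, and html_template are the functions from this hack:

# Sketch only: get_disc_info() and the output filename are assumptions.
my $disc   = get_disc_info();
my @albums = amazon_music_search($disc->{artist});

# Treat the album whose cleaned-up title matches the FreeDB title as
# "now playing"; everything else goes in the "also by this artist" list.
my ($current, @others);
for my $album (@albums) {
  if (!$current and clean_name($album->{title}) eq clean_name($disc->{title})) {
    $current = $album;
  } else {
    push @others, $album;
  }
}
# (A real script would want a fallback when no Amazon.com match is found.)

open(OUT, ">nowplaying.html") or die "Can't write page: $!\n";
print OUT html_template({ current => $current, others => \@others });
close(OUT);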

Now, we can implement the construction of that page:

sub html_template {
  my $vars = shift;

  my $out = '';

  $out .= qq^
    <html>
      <head><title>Now Playing</title></head>
      <body>
        <div align="center">
          <h1>Now playing:</h1>
  ^;
  $out .= format_album($vars->{current}, 1);
  $out .= qq^
          <h1>Also by this artist:</h1>
          <table border="1" cellspacing="0" cellpadding="8">
  ^;

This code begins an HTML page, using a function we’ll implement in a minute, which produces a display of an album with title and cover art. Next, we can put together a table that shows the rest of the related albums from this artist. We create a table showing smaller cover art, with three columns per row:

  my $col = 0;
  my $row = '';
  for my $rec (@{$vars->{others}}) {
    $row .= '<td align="center" width="33%">';
    $row .= format_album($rec, 0);
    $row .= "</td>\n";
    $col++;
    if (($col % 3) == 0) {
      $out .= "<tr>\n$row\n</tr>\n";
      $row = '';
    }
  }

  # Don't drop a partially filled final row.
  $out .= "<tr>\n$row\n</tr>\n" if $row;

Finally, we close up the table and finish off the page:

  $out .= qq^
          </table>
        </div>
    </body></html>
  ^;

  return $out;
}

The last thing we need is a function to prepare HTML to display an album:

sub format_album {
  my ($rec, $large) = @_;

  my $out = '';

  my $img = ($large) ? 'imageurllarge' : 'imageurlmedium';

  $out .= qq^<a href="$rec->{url}"><img src="$rec->{$img}"/></a><br/>^;
  $out .= qq^<b><a href="$rec->{url}">$rec->{title}</a></b><br />^;

  if (defined $rec->{releasedate}) {
    $out .= qq^Released: $rec->{releasedate}^;
  }

  if (ref($rec->{artists}) eq 'ARRAY') {
    $out .= '<br />by <b>'.join(', ', @{$rec->{artists}}).'</b>';
  }

  return $out;
}

This function produces several lines of HTML. The first line displays an album’s cover art in one of two sizes, based on the second parameter. The second line displays the album’s title, linked to its detail page at Amazon.com. The third line shows when the album was released, if we have this information, and the final line lists the artist who created the album.

Hacking the Hack

With this script and the help of FreeDB and Amazon.com, we can go from a CD in the tray to an HTML Now Playing display. This could be integrated into a CD player application and improved in any number of ways:

  • The handful of regular expressions used to parse Amazon.com’s XML is mostly adequate, but a proper XML parser, like XML::Simple, would be better; see the sketch after this list.

  • Errors and unidentified CDs are not handled very well.

  • Other web services could be pulled in to further use the harvested CD data.
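For instance, here is a rough sketch of how the details records might be pulled out with XML::Simple instead of regular expressions. The CamelCase tag names (Details, ProductName, Artists/Artist) are how they appear in the raw lite XML feed; treat this as a starting point rather than a drop-in replacement:

use XML::Simple;

# Parse one page of search results ($content, as in the loop above);
# ForceArray keeps single results from collapsing out of their arrays.
my $results = XMLin($content,
                    ForceArray => [ 'Details', 'Artist' ],
                    KeyAttr    => [ ]);

for my $details (@{ $results->{Details} || [] }) {
  my %record = ( _type   => 'amazon',
                 url     => $details->{url},
                 title   => $details->{ProductName},
                 artists => $details->{Artists}{Artist} || [ ] );
  # ...push \%record onto the list of results, as before.
}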

Maybe someday, something like this could be enhanced with the ability to automatically purchase and download music from an online store, to grab albums you don’t yet have in order to expand your jukebox even further. Powerful things happen when one simple tool can be easily chained to another.

—l.m.orchard

Hack #60. Expanding Your Musical Tastes

Looking for new music to complement your stale collection? With this script, you can pass in the names of a few of your favorite artists and get back Audioscrobbler’s recommendations.

You’ve downloaded every album by your favorite artist, even the B-sides. Maybe your playlist of 3,000 songs is starting to get stale. For whatever reason, you’ve decided it is time to find new music to fall in love with. Downloading songs off Limewire (http://www.limewire.com) with “GET THIS” in the filename, only to find out it’s the ramblings of a broken bagpipe, is hit or miss at best. Wouldn’t it be great to see what other people, who tend to like the same music you do, are listening to?

Audioscrobbler (http://www.audioscrobbler.com) has a great solution: it accepts playlist information from its users about what they listen to and how often. From there, Audioscrobbler associates artists with each other based on how often users listen to them. We are going to use a script to access the Audioscrobbler web site and retrieve a list of artists with a correlation factor—how closely related that artist is to the artists you submit.

First, we need to see how this is traditionally done at the Audioscrobbler web site. A quick check of the site reveals a Related Artists link that takes you to a form where you can type in three artists and get a listing of matches. Since this is exactly what we need, let’s take a look at the code that runs the form. Looking at the HTML source, you’ll find the form near the bottom of the page. It’s a GET request with some predefined variables and our three input boxes, named a1, a2, and a3. If you go back to the page, fill in some artists, and click Do It, you’ll get a page of results. Take a second to look at that page’s URL. The first thing to notice is that it is quite long; this is where the form’s GET method comes in, because a GET request means that all the information for the page is submitted within the URL itself. Using this knowledge, we can construct our own URLs with the artists’ names in them to retrieve the results, as in the example below.
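In other words, a hand-built request for two artists ends up looking something like this (wrapped here for readability, with spaces URL-encoded; the base URL and parameter names come straight from the form):

http://www.audioscrobbler.com/modules.php?op=modload&name=top10
    &file=scrobblersets&a1=Aphex+Twin&a2=Autechre&a3=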

Once we have the results, we need to figure out how to get the data we need. Back to the HTML source. This time, again near the bottom, we find a single, extremely long line of HTML. Searching through it, you can see there is a simple format: a link for the artist’s name, td tags, and two img tags. We’ll use the width attribute for the second img tag to find the correlation. It just so happens that the width value is always a number between 1 and 300; this value determines the length of the pretty image on the page to the right of each artist.

The Code

Save the following code to a file called audioscrobble.pl:

#!/usr/bin/perl -w
#
# AudioScrobble - Finds artists similar to those you already like.
# Comments, suggestions, contempt? Email adam@bregenzer.net.
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#

use strict; $|++;
my $VERSION = "1.0";

# make sure we have the modules we need, else die peacefully.
eval("use LWP 5.6.9;"); die "[err] LWP 5.6.9 or greater required.\n" if $@;

# base URL for all requests
my $base_url = "http://www.audioscrobbler.com/modules.php?".
               "op=modload&name=top10&file=scrobblersets";

my $counter = 0;         # counter of artists displayed
my $max_count = 10;      # maximum number of artists to display
my ($a1, $a2, $a3) = ''; # artist input variables

# Reminder: this code checks for arguments, therefore if a band
# name has multiple words make sure you put it in quotes.
# Also, Audioscrobbler accepts at most three band names so we
# will only look at the first three arguments.
$a1 = $ARGV[0] || die "No artists passed!\n";
$a2 = $ARGV[1] || ""; $a3 = $ARGV[2] || "";

# create a downloader, faking the User-Agent to get past filters.
print "Retrieving data for your matches... ";
my $ua = LWP::UserAgent->new(agent => 'Mozilla/4.76 [en] (Win98; U)');
my $data = $ua->get("$base_url&a1=$a1&a2=$a2&a3=$a3")->content;
print "done.\n";

# print up a nice header.
print "Correlation\tArtist\n";
print "-" x 76, "\n";

# match on the URL before the artist's name through to
# the width of the bar image (to determine correlation).
while ($counter < $max_count && $data =~ /href="modules\.php\[RETURN]
?op=modload&name=top10&file=artistinfo&artist=[^"]+">([^<]+)<\/a>[^<]+<\[RETURN]
/td><td[^>]+><img[^>]+\/><img[^>]+width="([0-9]+)">(.*)/) {

    # print the correlation factor and the artist's name.
    printf "%1.2f", ($2 / 300); print "\t\t" . $1 . "\n";

    # continue with the
    # data that is left.
    $data = $3; $counter++;
}

if ($counter == 0) {print "No matches.\n";}
print "-" x 76, "\n";

Running the Hack

Invoke the script on the command line, passing it up to three artists you like. Make sure you put their names in quotes; you do not need to worry about capitalization. Audioscrobbler cannot handle more than three artists at a time; in fact, you might find that three artists is too many for it (i.e., you might not get any results). In such cases, try removing the last artist and running it again. It is also important to keep in mind that you will get better results if you list artists that are similar to each other. This is a proximity search, so listing a heavy metal, country, and classical artist in the same search is unlikely to return any results.

Appropriately prepared, venture forth into the world of new music and find your next favorite artists. Here is an example in which I find artists similar to Aphex Twin and Autechre:

% perl audioscrobble.pl "Aphex Twin" "Autechre"
Retrieving data for your matches... done.
Correlation     Artist
--------------------------------------------------------------------------
1.00            Boards Of Canada
1.00            Plaid
0.83            Underworld
0.83            Radiohead
0.83            Chemical Brothers
0.83            Orbital
0.67            Mu-Ziq
0.67            Led Zeppelin
0.67            AFX
0.67            Squarepusher

Hacking the Hack

There are a few ways you can improve upon this hack.

Changing the number of results returned

You can easily change the number of results by changing the hardcoded $max_count value to a different number. However, we are looking for something more elegant. If you add the following code above the comment that starts with #Reminder, you will be able to pass an argument to the script specifying the number of results to return:

# Check for a '-c' argument first
# specifying the number of
# results to return.
if (@ARGV and $ARGV[0] eq '-c') {
    shift @ARGV;
    $max_count = shift @ARGV;
}

And here is the requisite sample output:

% perl audioscrobble.pl -c 5 "Aphex Twin" "Autechre"
Retrieving data...done.
Correlation     Artist
--------------------------------------------------------------------------
1.00            Boards Of Canada
1.00            Plaid
0.83            Underworld
0.83            Radiohead
0.83            Chemical Brothers

If you plan on adding a number of command-line arguments, you might want to use a Perl module designed for the job: Getopt::Long. You can find various examples of its use within other hacks in this book.
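As a rough sketch, the hand-rolled check above could be replaced with something like this near the top of the script, after $max_count is declared (the --count option name is our own invention):

use Getopt::Long;

# Allow "--count 5" (or "-c 5") on the command line; the artist
# names remain in @ARGV afterwards.
GetOptions("count|c=i" => \$max_count)
    or die "usage: $0 [--count n] artist [artist] [artist]\n";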

Looking up artists

Now that you have a list of new artists, the next step is to have the script research these new artists and download sample songs, customer ratings, and so on. Code away, young grasshopper.

See Also

  • There are other sites available that aggregate and associate artist playlists, the most promising of late being EchoCloud [Hack #20]. As opposed to Audioscrobbler, which is an opt-in service, EchoCloud works by spidering P2P networks, such as Soulseek, for relevant information.

—Adam Bregenzer

Hack #61. Saving Daily Horoscopes to Your iPod

You’ve got a zillion songs on your new iPod, and you’re traveling around town oblivious to the sounds of the city. Worried about getting hit by a car, finding that special someone, or knowing when to ask for that raise? Take your horoscope along with you by running this hack daily.

With Apple’s newest iPods, the functionality you can bring along with you has greatly improved. Not only can you sync up your iCal calendars or Address Book entries, you can also include little snippets of text in the new Notes feature. Limited to 4 KB per note and 1,000 notes, there’s certainly room for improvement, but the ability to add your own navigational elements (either to other Notes or to songs and playlists) and paragraph styling (via HTML’s <P> and <BR> tags) is a good start for some interesting applications.

This isn’t to say that this hack is particularly interesting or useful, but it does give an example of programmatically determining the path to the currently mounted iPod via Perl. It’s not foolproof, though; if you’re rich enough to have more than one iPod mounted at the same time, then one will be chosen at random. Dealing with more than one iPod is an exercise you can pay someone else to do.

The Code

Save the following code to a file called horopod.pl:

#!/usr/bin/perl -w
#
# HoroPod - save your daily horoscope to the iPod.
# http://disobey.com/d/code/ or contact morbus@disobey.com.
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#

use strict; $|++;
my $VERSION = "1.0";
use File::Spec::Functions;

# make sure we have the modules we need, else die peacefully.
eval("use LWP;"); die "[err] LWP is not installed.\n" if $@;

# really cheap Perl-only way of finding the path to
# the currently mounted iPod. searches the mounted
# Volumes for an iPod_Control folder and uses that.
my $ipod = glob("/Volumes/*/iPod_Control");
unless ($ipod) { die "[err] Could not find an iPod: $!\n"; }
$ipod =~ s/iPod_Control//g;  # we want one directory higher.
my $ipod_dir = catdir($ipod, "Notes", "Horoscopes");
mkdir $ipod_dir;  # no error checking by intention.

# create a downloader, faking the User-Agent to get past filters.
my $ua = LWP::UserAgent->new(agent => 'Mozilla/4.76 [en] (Win98; U)');

# now, load up our horoscopes. first, define all the
# signs - these are used throughout the forloop.
my @signs = qw( aries taurus gemini cancer leo virgo libra
                scorpio sagittarius capricorn aquarius pisces );

# loop through each sign.
foreach my $sign (@signs) {

    # make it purdier for humans.
    my $display_sign = ucfirst($sign);

    # the Yahoo! URL, specific to the current sign.
    print "Grabbing horoscope for $display_sign...\n";
    my $url = "http://astrology.yahoo.com/us/astrology/".
                "today/$sign"."dailyhoroscope.html";

    # suck down the data or die.
    my $data = $ua->get($url)->content
      or die "[err] Could not download any data: $!\n";

    # snag the date by signature, not design.
    $data =~ /(\w{3} \w{3}\.? \d{1,2}, \d{4})/; my $date = $1;

    # and get the relevance. we could use an
    # HTML parser, but this is mindlessly easier.
    my $preface = '<font face="Arial" size="-1" color=black>';
    my $anteface = '</font></TD></TR></table>'; # ante up!
    $data =~ /$preface(.*)$anteface/i; my $proverb = $1;

    # save this proverb to our file.
    my $ipod_file = catfile($ipod_dir, $display_sign);
    open(IPOD_FILE, ">$ipod_file") or die "[err] Could not open file: $!\n";
    print IPOD_FILE "$display_sign\n$date\n\n";
    print IPOD_FILE "$proverb\n"; close(IPOD_FILE);

}

Running the Hack

To run the hack, make sure your iPod is mounted as a FireWire HD (i.e., you can see it on your Desktop when it’s plugged into your Mac or docking bay), launch the Terminal application (Applications → Utilities → Terminal), and type perl horopod.pl on the command line. After a few lines of output, your newly scraped horoscopes should be on your iPod as a Horoscopes folder under the Notes feature, with one file, or note, per sign.

Hacking the Hack

There are a few ways you can tweak the script, all mindlessly simple. For one, you might not want your horoscopes under a directory called Horoscopes; to change that behavior, merely tweak my $ipod_dir = catdir($ipod, "Notes", "Horoscopes"); to your desired path.

Concerning the source data itself, Yahoo! Horoscopes has a number of different sorts of predictions available; you can get them tweaked to Music, Movies, Romance, and what have you. Be sure to check out http://astrology.yahoo.com/ to find your desired version. When tweaking the script to support another type, you’ll want to tweak the URL being used, making sure to place $sign where appropriate:

my $url = "http://astrology.yahoo.com/us/astrology/".
                "today/$sign"."dailyhoroscope.html";

Also, tweak the $preface and $anteface of the code surrounding the actual fortune:

my $preface = '<font face="Arial" size="-1" color=black>';
my $anteface = '</font></TD></TR></table>'; # ante up!

Alternatively, you could scrap the entire horoscope feature and combine in another of the data scrubbers within this book, utilizing only the iPod-related code in this hack.
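For example, the iPod-locating logic boils down to a few reusable lines; here is a sketch of it pulled into its own function (the name ipod_notes_dir is made up):

use File::Spec::Functions;

# Return the path to the Notes folder on the first mounted iPod,
# or die if none is found.
sub ipod_notes_dir {
    my ($ipod) = glob("/Volumes/*/iPod_Control")
        or die "[err] Could not find an iPod\n";
    $ipod =~ s/iPod_Control//;   # we want one directory higher.
    return catdir($ipod, "Notes");
}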

See Also

  • VersionTracker (http://www.versiontracker.com) for other iPod utilities, including Pod2Go (for weather, stocks, and more); PodNotes (for the latest news and driving directions); and VoodooPad (for Wiki-like Notes editing).

Hack #62. Graphing Data with RRDTOOL

Graphing data over time, either by itself or in comparison with another dataset, is the Holy Grail of analytical research. With the use of RRDTOOL, you’ll be able to store and display time-series data.

In this hack, we’re going to get some example data from Amazon.com and use the Round Robin Database Tool (RRDTOOL, http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/) to graph changes in Amazon.com Sales Rank over time.

Round robin is a way of storing a fixed amount of data and a pointer to the current element. This is much like a cyclic buffer with a fixed number of slots for data, where adding a new element pushes out the oldest to make space. This is a nice feature, because you never have to worry about using all your disk space or clearing out old data. The downside is that you have to decide the time period up front. This hack assumes you have RRDTOOL installed as per the online instructions.

First, let’s create a database to log an Amazon.com Sales Rank for a month:

% rrdtool create salesrank.rrd --start 1057241523  --step 86400 
              DS:rank:GAUGE:86400:1:U  RRA:AVERAGE:0.5:1:31  RRA:AVERAGE:0.5:7:10

We have now created a database called salesrank.rrd, starting when this was written, adding new data every 24 hours, and keeping two round robin datasets. There are numerous settings when creating a database, many more than we can hope to explain here. To give you a feel for it, we’ll just briefly explain the settings we used in this hack:

--start 1057241523 --step 86400

Defines when the time series starts, using Unix timestamps. Executing date +%s gives you the current time in the necessary format (number of seconds since the Epoch). Setting the number to 86400 for step defines the time in seconds between our data points. We arrive at that number with the following equation: 24 × 60 × 60 = 86400—or, 24 hours of 60 minutes each and each minute containing 60 seconds. In this case, we’re graphing one bit of data per day, every day, starting now.

DS:rank:GAUGE:86400:1:U

DS defines a data source: rank is its name, and GAUGE is used when we’re interested in the absolute number itself rather than in a rate of change. The 86400 is the heartbeat, the maximum number of seconds allowed between updates before the value is treated as unknown. We set the lower bound of the scale to 1, because we know that the best possible Sales Rank is 1, and the upper bound to unlimited (U), because we don’t know how many products Amazon.com has; therefore, we can’t know how badly ranked our book will be, and thus the need for unlimited.

RRA:AVERAGE:0.5:1:31
RRA:AVERAGE:0.5:7:10

Here, we define our two round robin databases, the first keeping daily numbers and running for a total of 31 days, the second running weekly numbers (7 days) for a total of 10 weeks.

Now that we have the database created, it is time to start filling in some numbers by using the rrdtool update command:

% rrdtool update salesrank.rrd 1057241524:3689
% rrdtool update salesrank.rrd 1057327924:3629
...etc...
% rrdtool update salesrank.rrd 1059833523:2900

The numbers are in the format of timestamp:value, which, in this case, indicates a Sales Rank of 3689 for the first entry and 3629 for the next entry 24 hours later. The rule is that every update should be at least one second after the previous entry. With a total of 31 data points (not all are shown in the example), we now have something to display. To get textual results, we can use the fetch feature of rrdtool:

% rrdtool fetch salesrank.rrd AVERAGE --start 1057241524 --end 1059833524
1057190400: nan
1057276800: 3.6290017008e+03
1057363200: 3.6094016667e+03
...etc...

It’s not very pretty to look at, but it’s essentially the same as when we entered the data with timestamp:value. These are calculated numbers, so they are not exactly the same as those we entered. But (finally!) on to where this whole hack started: drawing graphs based on time-series data:

% rrdtool graph osxhacks.png --start 1057241524 --end 1059833524
              --imgformat PNG --units-exponent 0 DEF:myrank=salesrank.rrd:rank:AVERAGE  
              LINE1:myrank#FF0000:"Mac OS X Hacks"

This code produces the graph shown in Figure 4-5.

Figure 4-5. Graph of the Amazon.com Sales Rank for Mac OS X Hacks

There’s an almost never-ending list of settings when displaying the graphs, which would be impossible to cover here. Most notable in our previous command is that we get the rank parameter out of our database and graph it in red with the legend “Mac OS X Hacks.” Other than that, we ask for files in PNG format and tell the graph not to do any scaling on the y-axis.

Doing this by hand on a regular basis would be incredibly tedious at best. cron and Perl to the rescue! First, we’ll create a Perl script that sucks down the Amazon.com product we’re interested in, and then we’ll capture the Sales Rank with a simple regular expression. This captured data, as well as the current timestamp, will be used to update our RRDTOOL database, and a new graph will be created.

The Code

Save the following code in a file called grabrank.pl:

#!/usr/bin/perl -w
#
# grabrank.pl
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as perl
#

use strict;
use LWP::Simple;
my $time=time(  );

# path to our local RRDTOOL.
my $rrd = '/usr/local/bin/rrdtool';

# Get the Amazon.com page for Mac OS X Hacks
my $data = get("http://www.amazon.com/exec/obidos/ASIN/0596004605/");
$data =~ /Amazon.com Sales Rank: <\/b> (.*) <\/span><br>/;
my $salesrank=$1; # and now the sales rank is ours! Muahh!

# Get rid of commas.
$salesrank =~ s/,//g;

# Update our rrdtool database.
`$rrd update salesrank.rrd $time:$salesrank`;

# Update our graph.
my $cmd= "$rrd graph osxhacks.png --imgformat PNG --units-exponent ".
         "0 DEF:myrank=salesrank.rrd:rank:AVERAGE LINE1:myrank#FF0000:".
         "'Mac OS X Hacks' --start ".($time-31*86400)." --end $time";
`$cmd`; # bazam! we're done.

Running the Hack

First, we need a cron job [Hack #90] to run this script once every day. On some systems, you can simply place the script in /etc/cron.daily. If you don’t have that option, then add something like this to your crontab file, which will tell cron to run our script every night at five minutes after midnight:

5 0 * * *       /path/to/your/grabrank.pl

Hacking the Hack

The graphs are not exactly pretty, so there are many possible improvements to be made, playing with intervals, colors, and so forth. If you look at the graph, you’ll see that the way it is displayed is somewhat counterintuitive, because a low figure is a sign of a higher ranking. If we knew the exact Sales Rank of the worst-selling item at Amazon.com in advance, we could simply subtract the daily rank from that number and create a graph that rises with a higher ranking. Since we don’t have that number, it takes a little more calculation, as sketched below.
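One workaround is to let RRDTOOL do the arithmetic with a CDEF, which computes a new data series from an existing one using RPN notation. The ceiling of 1,000,000 used here is an arbitrary stand-in, not a real Amazon.com figure, so treat this as a sketch:

% rrdtool graph osxhacks-flipped.png --start 1057241524 --end 1059833524
              --imgformat PNG DEF:myrank=salesrank.rrd:rank:AVERAGE
              CDEF:flipped=1000000,myrank,-
              LINE1:flipped#FF0000:"Mac OS X Hacks (flipped)"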

If you want to graph more than one Sales Rank, there’s not much to change, other than defining an extra data source when creating the database:

DS:otherrank:GAUGE:86400:1:U

And remember to add an extra DEF and LINE1 to the rrdtool graph command:

DEF:myotherrank=salesrank.rrd:rank:AVERAGE
LINE1:myotherrank#11EE11:"My other book"

Grabbing the extra data from Amazon.com is left as an exercise for the reader.

—Mads Toftum

Hack #63. Stocking Up on Financial Quotes

Keeping track of multiple stocks can be a cumbersome task, but using the Finance::Quote Perl module can greatly simplify it. And, while we’re at it, we’ll generate pretty graphs with RRDTOOL.

Collecting stock prices can be done using LWP [Hack #9] to download a financial site and regular expressions [Hack #23] to scrape the data, as well as always keeping a watchful eye for site design changes that could break things. But why go to the trouble when Finance::Quote (http://search.cpan.org/author/PJF/Finance-Quote/) provides a simple interface with numerous sources, such as Fidelity Investments, Trustnet, The Motley Fool, or Yahoo!?

Here’s a typical bit of code that uses Finance::Quote to fetch stock prices:

#!/usr/bin/perl
use Finance::Quote;
my $q = Finance::Quote->new;
my $quotes = $q->fetch("nasdaq","IBM");
print "Price range: $quotes->{'IBM','year_range'}\n";

We create a new Finance::Quote object and fetch data with $q->fetch($market,@stocks). In this case, we let the market point to nasdaq. Though @stocks is normally a list of desired stocks, we use just one (IBM). To get at the information that the module has grabbed for us, we use $quotes->{'IBM','year_range'}, which will get us the price range for the last 52 weeks:

% perl finance.pl
Price range: 54.01 - 90.404

There is much more information in addition to year_range; consult the Finance::Quote documentation for further explanation and details on which information is available from which sources. When in doubt, you can get a complete list of the available values by printing the returned $quotes structure:

use Data::Dumper;
print Dumper($quotes);

Adding these two lines to the previous code produces the following output:

$VAR1 = {
          'IBM{avg_vol' => 7264727,
          'IBM{div' => '0.64',
          'IBM{ask' => undef,
          'IBM{date' => '7/22/2003',
          'IBM{method' => 'yahoo',
          'IBM{div_yield' => '0.78',
          'IBM{low' => '81.65',
          'IBM{symbol' => 'IBM',
          'IBM{cap' => '141.2B',
          'IBM{day_range' => '81.65 - 83.06',
          'IBM{open' => '82.50',
          'IBM{bid' => undef,
          'IBM{eps' => '3.86',
          'IBM{time' => '1:40pm',
          'IBM{currency' => 'USD',
          'IBM{success' => 1,
          'IBM{volume' => 6055000,
          'IBM{last' => '81.70',
          'IBM{year_range' => '54.01 - 90.404',
          'IBM{close' => '82.50',
          'IBM{high' => '83.06',
          'IBM{net' => '-0.80',
          'IBM{p_change' => '-0.97',
          'IBM{ex_div' => 'May  7',
          'IBM{price' => '81.70',
          'IBM{pe' => '21.37',
          'IBM{name' => 'INTL BUS MACHINE',
          'IBM{div_date' => 'Jun 10'
        };

The first part of each hash key (IBM, in this case) is the stock symbol, the second part is a delimiter, and the third is the name of the data being referred to. The year_range value we printed in our first code sample appears among these entries.
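That delimiter is Perl’s doing, not Finance::Quote’s: a comma inside a hash subscript is Perl’s multidimensional hash emulation, which joins the keys with the $; separator character (the odd character showing up in the dumped keys above). These two lookups are therefore equivalent:

my $range = $quotes->{'IBM', 'year_range'};
my $same  = $quotes->{ join($;, 'IBM', 'year_range') };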

We have the data; now it’s time to start plotting it into a graph. As in [Hack #62], we use RRDTOOL (http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/) to plot our data, but this time we will use a Perl interface. RRDTOOL has two Perl interfaces; we will use only the shared module (RRDs), as it is the more flexible of the two. The Perl interface will be very familiar to those who know the command-line interface.

To add data to stocks.rrd, for example, we would normally run this command:

% rrdtool update stocks.rrd N:12345

Using the Perl interface, all we have to do is call RRDs::update, like this:

use RRDs;
RRDs::update ("stocks.rrd","N:12345");

Similarly, RRDs::create, RRDs::graph, and the others all work like their command-line counterparts. More information on the Perl bindings is available in the supplied RRDTOOL documentation.

Putting it all together in a Perl script, we use a “does this database exist?” check to see whether we should create a new database or update an existing one. Then, we get new stock figures using Finance::Quote and add them to our database using RRDs::update. To create graphs, we run once with RRDs::graph and --start -1w to create a graph for the last week, and once with -1m to graph the entire last month.

The Code

Save the following code in a file called grabstocks.pl:

#!/usr/bin/perl -w
use strict; use RRDs;
use Finance::Quote qw/asx/;

# Declare basic variables.
my @stocks       = ('IBM','MSFT','LNUX');
my @stock_prices = (0,0,0);
my $workdir      = "./stocks";
my $db           = "$workdir/stocks.rrd";
my $now          = time(  );

# if the database hasn't been created,
# do so now, or die with an error.
if (!-f $db) {
    RRDs::create ($db, "--start", $now-1,
          "DS:IBM:ABSOLUTE:900:0:U",
          "DS:MSFT:ABSOLUTE:900:0:U",
          "DS:LNUX:ABSOLUTE:900:0:U",
          "RRA:AVERAGE:0.5:1:4800",
          "RRA:AVERAGE:0.5:4:4800",
          "RRA:AVERAGE:0.5:24:3000",
    );

    if (my $ERROR = RRDs::error) { die "$ERROR\n"; }
}

# now, get the quote information
# for IBM, Microsoft, and Linux.
my $q      = Finance::Quote->new(  );
my %quotes = $q->fetch("usa",@stocks);

# for each of our stocks, check to 
# see if we got data, and if so, 
# add it to our stock prices.
my $count = 0; # array index into @stock_prices.
foreach my $code (@stocks) {
    unless ($quotes{$code, "success"}) {
        warn "$code lookup failed: ".$quotes{$code,"errormsg"}."\n";
        $count++; next; # well, that's not a good sign.
    }

    # update the stock price, and move to the next.
    $stock_prices[$count] = $quotes{$code,'last'}; $count++;
}

# we have our stock prices; update our database.
RRDs::update($db, "--template=" . join(':',@stocks),
                  "$now:" . join(':',@stock_prices));
if (my $ERROR = RRDs::error) { die "$ERROR\n"; }

# Generate weekly graph.
RRDs::graph("$workdir/stocks-weekly.png",
  "--title",     'Finance::Quote example',
  "--start",     "-1w",
  "--end",       $now+60,
  "--imgformat", "PNG",
  "--interlace", "--width=450",
  "DEF:ibm=$db:IBM:AVERAGE",
  "DEF:msft=$db:MSFT:AVERAGE",
  "DEF:lnux=$db:LNUX:AVERAGE",
  "LINE1:ibm#ff4400:ibm\\c",
  "LINE1:msft#11EE11:msft\\c",
  "LINE1:lnux#FF0000:lnux\\c"
); if (my $ERROR = RRDs::error) { die "$ERROR\n"; }

# Generate monthly graph.
RRDs::graph ("$workdir/stocks-weekly.png",
  "--title",     'Finance::Quote example',
  "--start",     "-1m",
  "--end",       $now+60,
  "--imgformat", "PNG",
  "--interlace", "--width=450",
  "DEF:ibm=$db:IBM:AVERAGE",
  "DEF:msft=$db:MSFT:AVERAGE",
  "DEF:lnux=$db:LNUX:AVERAGE",
  "LINE1:ibm#ff4400:ibm\\c",
  "LINE1:msft#11EE11:msft\\c",
  "LINE1:lnux#FF0000:lnux\\c"
); if (my $ERROR = RRDs::error) { die "$ERROR\n"; }

Running the Hack

First, we need a cron job [Hack #90] to run this script once every 15 minutes. To do that, add something like this to your crontab, telling cron to run our script four times every hour:

*/15 * * * Mon-Fri /path/to/your/grabstocks.pl

With that in place, new graphs will be generated every time the script runs.

Hacking the Hack

The first and most obvious thing is to change the code to get more data for more interesting stocks. The periods chosen in this hack might also need some updating, since getting data every 15 minutes gives a much higher resolution than we need if we’re interested in only monthly graphs. Likewise, running the script 24 hours a day doesn’t make much sense if there will be stock changes only during business hours.
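For example, a crontab entry restricted to (roughly) US market hours might look like this, assuming the machine’s clock is set to Eastern time:

*/15 9-16 * * Mon-Fri       /path/to/your/grabstocks.pl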

—Mads Toftum

Hack #64. Super Author Searching

By combining multiple sites into one powerful script, you can get aggregated data results that are more complete than just one site could give.

Have you ever obsessively tried to find everything written by a favorite author? Have you ever wanted to, but never found the time? Or have you never really wanted to, but think it would be neat to search across several web sites at once? Well, here’s your chance.

To search for authors, let’s pick a few book-related sites, such as the Library of Congress (http://www.loc.gov), Project Gutenberg (http://promo.net/pg/), and Amazon.com (http://www.amazon.com). Between these three web sites, we should be able to get a wide range of works by an author. Some will be for sale, some will be available for free download, and others will be available at a library (or the Library of Congress, at least).

Gathering Tools

Before we do anything else, let’s get some tools together. We’re going to use Perl for this hack, with the following modules: LWP::Simple [Hack #9], WWW::RobotRules, WWW::Mechanize [Hack #21], and HTML::Tree. These modules give us the means to navigate sites, grab content, and extract data from it, all while trying to be a good little robot that follows the rules ([Hack #17] offers guidance on using LWP::RobotUA to accomplish the same thing). It might seem like unnecessary effort, but taking a few extra steps to obey the Robots Exclusion Protocol (http://www.robotstxt.org) goes a long way toward keeping us out of trouble and keeping our access to the resources we want to gather.

Our script starts like so:

#!/usr/bin/perl -w
use strict;
use Data::Dumper qw(Dumper);

use LWP::Simple;
use WWW::RobotRules;
use WWW::Mechanize;
use HTML::Tree;

our $rules = WWW::RobotRules->new('AuthorSearchSpider/1.0');
our $amazon_affilate_id = "your affiliate ID here";
our $amazon_api_key     = "your key here";

my $author = $ARGV[0] || 'dumas, alexandre';

my @book_records = sort {$a->{title} cmp $b->{title}}
  (amazon_search($author), loc_gov_search($author), pg_search($author));

our %item_formats =
  (
   default => \&default_format,
   amazon  => \&amazon_format,
   loc     => \&loc_format,
   pg      => \&pg_format
  );

print html_wrapper($author,
                   join("\n", map { format_item($_) } @book_records));

So, here’s the basic structure of our script. We set up a few global resources: a robots.txt rules handler to keep our spider well behaved, and the identifiers we need to access Amazon.com Web Services. Next, we gather the aggregate results of searches on several web sites and sort the records by title. Once we have those, we set up a formatter for each type of result and produce an HTML page of the results.

Whew! Now, let’s implement all the subroutines that enable all these steps. First, in order to make a few things easier later on, we’re going to set up our robot rules handler and write a few convenience functions to use the handler and clean up bits of data we’ll be extracting:

# Get web content,
# obeying robots.txt
sub get_content {
  my $url = shift;
  return ($rules->allowed($url)) ? get($url) : undef;
}

# Get web content via WWW::
# Mechanize, obeying robots.txt
sub get_mech {
  my $url = shift;
  if ($rules->allowed($url)) {
    my $a = WWW::Mechanize->new(  );
    $a->get($url);
    return $a;
  } else { return undef }
}
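
# Note (sketch): WWW::RobotRules enforces only the rules it has actually
# seen, so each site's robots.txt should be fetched and parsed before we
# lean on $rules->allowed( ). The helper name below is our own invention.
sub prime_robot_rules {
  my $site = shift;                      # e.g., 'http://catalog.loc.gov'
  my $robots_url = "$site/robots.txt";
  my $robots_txt = get($robots_url);
  $rules->parse($robots_url, $robots_txt) if defined $robots_txt;
}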

# Remove whitespace from
# both ends of a string
sub trim_space {
  my $val = shift;
  $val=~s/^\s+//;
  $val=~s/\s+$//g;
  return $val;
}

# Clean up a string to be used
# as a field name of alphanumeric
# characters and underscores.
sub clean_name {
  my $name = shift;
  $name=lc($name);
  $name=trim_space($name);
  $name=~s/[^a-z0-9 ]//g;
  $name=~s/ /_/g;
  return $name;
}

Now that we have a start on a toolbox, let’s work on searching. The idea is to build a list of results from each of our sources that can be mixed together and presented as a unified whole.

Hacking the Library of Congress

Now, let’s visit the library. Right on the front page, we see a link inviting visitors to Search Our Catalogs, which leads us to a choice between a Basic Search and a Guided Search. For simplicity’s sake, we’ll follow the basic route.

This brings us to a simple-looking form (http://catalog.loc.gov/cgi-bin/Pwebrecon.cgi?DB=local&PAGE=First), with options for the search text, the type of search we want, and the number of records per page. Using WWW::Mechanize, we can start our subroutine to use this form like this:

sub loc_gov_search {
  my $author = shift;

  # Submit search for author's name
  my $url = 'http://catalog.loc.gov/cgi-bin/Pwebrecon.cgi?DB=local&PAGE=First';
  my $a = get_mech($url);
  $a->submit_form
    (
     form_number => 1,
     fields => { Search_Arg=>$author, Search_Code=>'NAME_', CNT=>70}
    );

The first result of this search is a list of links with which to further refine our author search. So, let’s try looking for links that contain the closest match to our author name:

  # Data structure for book data records
  my @hit_links = grep { $_->text() =~ /$author/i } $a->links(  );
  my @book_records = (  );
  for my $hit_link (@hit_links) {
    my $a = get_mech
      ('http://catalog.loc.gov/cgi-bin/Pwebrecon.cgi?DB=local&PAGE=First');
    $a->submit_form
      (
       form_number => 1,
       fields => { Search_Arg=>$author, Search_Code=>'NAME_', CNT=>70}
      );
    $a->follow_link(text=>$hit_link->text(  ));

This particular bit of code uses the link-extraction feature of WWW::Mechanize to grab link tags from the initial search results page to which we just navigated. Due to some quirk in session management in the Library of Congress search, we need to start over from the search results page, rather than simply use the back function.

Once we have each secondary author page of the search, we can extract links to publications from these pages:

    # Build a tree from the HTML
    my $tree = HTML::TreeBuilder->new(  );
    $tree->parse($a->content(  ));

    # Find the search results table: first, look for a header
    # cell containing "#", then look for the parent table tag.
    my $curr;
    ($curr) = $tree->look_down
      (_tag => 'th', sub { $_[0]->as_text(  ) eq '#' } );
    next if !$curr;
    ($curr) = $curr->look_up(_tag => 'table');
    my ($head, @rows) = $curr->look_down
      (_tag => 'tr', sub { $_[0]->parent(  ) == $curr } );

This code uses the HTML::Tree package to navigate the structure of the HTML content that makes up the search hits page. Looking at this page, we see that the actual table listing the search hits starts with a table header containing the text "#". If we look for this text, then walk back up to the containing parent, we can then extract the table’s rows to get the search hits.

Once we have the rows that contain links to details pages, let’s process them:

    # Extract and process the search
    # results from the results table.
    while (@rows) {

      # Take the results in row pairs; extract 
      # the title and year cells from the first row.
      my ($r1, $r2) = (shift @rows, shift @rows);
      my (undef, undef, undef, undef, $td_title, $td_year, undef) =
        $r1->look_down(_tag => 'td', sub { $_[0]->parent(  ) == $r1 });
    
      # Get title link from the results; extract the detail URL.
      my ($a_title) = $td_title->look_down(_tag=>'a');
      my $title_url = "http://catalog.loc.gov".$a_title->attr("href");

      # Get the book detail page; follow the link to the Full record.
      $a->follow_link(url => $title_url);
      $a->follow_link(text => "Full");

Looking at this page, we see that each publication is listed as a pair of rows. The first row in each pair lists a few details of the publication, and the second row tells where to find the publication in the library. For our purposes, we’re interested only in the title link in the first row, so we extract the cells of the first row of each pair and then extract the URL to the publication detail page from that.

From there, we follow the details link, which brings us to a brief description of the publication. But we’re interested in more details than that, so on that details page we follow a link named “Full” to a more detailed list of information on a publication.

Finally, then, we’ve reached the full details page for a publication by our author. So, let’s figure out how to extract the fields that describe this publication. Looking at the page, we see that the details table starts with a header containing the string "LC Control Number". So, we look for that header, then backtrack to the table that contains it:

      # Find table containing book detail data by looking
      # for table containing a header with text "LC Control Number".
      my $t2 = HTML::TreeBuilder->new(  );
      $t2->parse($a->content(  ));
      my ($c1) = $t2->look_down
        (_tag=>'th', sub { $_[0]->as_text(  ) =~ /LC Control Number/ }) ||
          next;
      $c1 = $c1->look_up(_tag=>"table");

After finding the table that contains the details of our publication, we can walk through the rows of the table and extract name/value pairs. First, we start building a record for this book by noting the type of the search, as well as the URL of the publication details page:

      # Now that we have the table, look
      # for the rows and extract book data.
      my %book_record = (_type => 'loc', url=>$title_url);
      my @trs = $c1->look_down(_tag=>"tr");
      for my $tr (@trs[1..$#trs]) {

        # Grab the item name and value table
        # cells; skip to next if empty.
        my ($th_name)  = $tr->look_down(_tag=>"th");
        my ($td_value) = $tr->look_down(_tag=>"td");
        next if (!$th_name) || (!$td_value);

        # Get and clean up the item name and value
        # table data; skip to next if the name is empty.
        my $name  = clean_name($th_name->as_text(  ));
        my $value = trim_space($td_value->as_text(  ));
        next if ($name eq '');
  
        $book_record{$name} = $value;
      }

Luckily, the table that contains information about our publication is fairly clean, with every name contained in a header cell and every value contained in a corresponding data cell in the same row. So, we walk through the rows of the details table, collecting data fields by using the convenience methods we wrote earlier.

Now, we can finish up our subroutine, doing a little cleanup on the publication title and adding the finished record to a list that we return when all our wandering through the library is done:

      ($book_record{title}, undef) 
         = split(/ \//, $book_record{main_title});

      push @book_records, \%book_record;

      # Back up to the search results page.
      $a->back(); $a->back(  );
    }
  }
  return @book_records;
}

To summarize, this subroutine does the following:

  1. Performs an author search on the Library of Congress web site

  2. Follows links to author search results pages

  3. Follows publication details links on author search results pages

  4. Digs further down to full-detail records on publications

  5. Harvests data fields that describe a publication

In the end, by drilling down through several layers of search hits and details pages, we have collected a slew of records that describe publications by our author. These records are stored as a list of Perl hashes, each containing name/value pairs.

Each record also contains a value that indicates which source it was harvested from (i.e., _type=>'loc'). This will become important shortly, when we mix the results of other searches together.
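To make the shape concrete, one harvested record ends up looking roughly like this (the values are placeholders; the field names match the cleaned-up table headers we just extracted):

{
  _type      => 'loc',
  url        => 'http://catalog.loc.gov/cgi-bin/Pwebrecon.cgi?...',
  title      => 'Some title',
  main_title => 'Some title / by some author.',
  # ...plus one cleaned-up key per row of the full-detail table,
  # such as isbn or lc_control_number.
}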

Perusing Project Gutenberg

Next, let’s take a look at Project Gutenberg (http://promo.net/pg/). In case you’ve never heard of it, this is an effort to make public-domain books and publications available to the public in formats usable by practically any personal computer. In the Project Gutenberg library, you can find an amazing array of materials, so our author search could benefit from a stroll through their stacks.

Wandering around the project’s site, we uncover a search form (http://www.ibiblio.org/gutenberg/cgi-bin/sdb/t9.cgi/). One of the fields in this form is Author, just what we need. Our search subroutine for this site begins like this:

# Search Project Gutenberg
# for books by an author
sub pg_search {
  my $author = shift;

  my $pg_base = 'http://www.ibiblio.org/gutenberg/cgi-bin/sdb';
  my @book_records = (  );

  # Submit an author search at Project Gutenberg
  my $a1 = get_mech("$pg_base/t9.cgi/");
  $a1->submit_form
    (
     form_number => 1,
     fields => { author => $author }
    );

As it turns out, this search results page is quite simple, with a link to every result contained in a list bullet tag. So, we can write a quick set of map expressions to find the bullets and the links within them, and extract the link URLs into a list:

  # Extract all the book details
  # pages from the search results
  my $t1 = HTML::TreeBuilder->new(  );
  $t1->parse($a1->content(  ));
  my (@hit_urls) =
    map { "$pg_base/".$_->attr('href') }
      map { $_->look_down(_tag=>'a') }
        $t1->look_down(_tag=>'li');

Now that we have a list of links to publication details pages, let’s chase each one down and collect the information for each book:

  # Process each book detail
  # page to extract book info
  for my $url (@hit_urls) {
    my $t2 = HTML::TreeBuilder->new(  );
    $t2->parse(get_content($url));

Luckily, these details pages also have a fairly simple and regular structure. So, we can quickly locate the table that contains the details by finding a table cell containing the word download and backtracking to its parent table.

     # Find the table of book data: look for a table
     # cell containing 'download' and find its parent table.
     my ($curr) = $t2->look_down
       (_tag=>"td",
        sub { $_[0]->as_text(  ) =~ /download/i });
     ($curr) = $curr->look_up(_tag=>"table");

Most rows of this table contain name/value pairs in data cells, with the name of the pair surrounded by <tt> tags. The names also end in a colon, so we can add that for good measure:

     # Find the names of book data items: look for
     # all the <tt> tags in the table that contain ':'
     my (@hdrs) = $curr->look_down 
       (_tag=>'tt',
        sub { $_[0]->as_text(  ) =~ /\:/});

After finding all the book details field names, we can visit each of them to dig out the values. For each tag that contains a name, we find its parent table row and grab the row’s second column, which contains the value of the pair. So, we can start constructing a record for this book. Again, notice that we start out by identifying which source this search result was harvested from (i.e., _type=>'pg'):

     # Extract name/value data from book details page.
     my %book_record = (_type=>'pg', url=>$url);
     for my $hdr (@hdrs) {
  
       # Name is text of <tt> tag.
       my $name = clean_name($hdr->as_text(  ));
       next if ($name eq '');
  
       # Find the field value by finding the parent
       # table row, then the child table data cell.
       my ($c2) = $hdr->look_up(_tag=>'tr');
       (undef, $c2) = $c2->look_down(_tag=>'td');

Most values are simple strings, with the exception of the publication’s download links. When we encounter this value, we go a step further and extract the URLs from those links. Otherwise, we just extract the text of the table data cell. Using what we’ve extracted, we build up the book record:

       # Extract the value. For most fields, simply use the text of the
       # table cell. For the download field, find the URLs of all links.
       my $value;
       if ($name eq 'download') {
         my (@links) = $c2->look_down
           (_tag=>"a",
            sub { $_[0]->as_text(  ) =~ /(txt|zip)/} );
         $value = [ map { $_->attr('href') } @links ];
       } else {
         $value = $c2->as_text(  );
       }

      # Store the field name and value in the record.
      $book_record{$name} = $value;
    }

Finally, we store each book record in a list and return it from our subroutine:

    push @book_records, \%book_record;
  }
  return @book_records;
}

Although simpler, this search is similar to searching the Library of Congress:

  1. Perform an author search on the Project Gutenberg web site.

  2. Follow links in the search results to find publication details pages.

  3. Harvest data fields that describe a publication.

And, like the Library of Congress search, we collect a list of Perl hashes that contain book details. Also, each record is tagged with the source of the search.

Navigating the Amazon

Our final search involves the online catalog at Amazon.com, via its Web Services API (http://www.amazon.com/webservices). This API allows developers and webmasters to integrate a wide range of Amazon.com’s features into their own applications and content. But before we can do anything with the API, we need to sign up for a developer token, which allows Amazon.com to tell one consumer of its services from another. Once we have a token, we can get started using the API. First, we download the software development kit (SDK). In the documentation, we find that, among other services, the API offers simple XML-based author searches, so we can use this service to build a search subroutine. Based on the SDK’s instructions, we can start like this:

# Search for authors via
# the Amazon search API.
sub amazon_search {
  my $author = shift;

  # Construct the base URL for Amazon author searches.
  my $base_url = "http://xml.amazon.com/onca/xml3?t=$amazon_affilate_id&".
    "dev-t=$amazon_api_key&AuthorSearch=$author&".
      "mode=books&type=lite&f=xml";

The first step is to use the XML service to submit a search query for our author. One quirk in the otherwise simple service is that results are served up only a few at a time, across a number of pages. So, we’ll grab the first page and extract the total number of pages that make up our search results:

  # Get the first page of search results.
  my $content = get_content($base_url."&page=1");

  # Find the total number of search results pages to be processed.
  $content =~ m{<totalpages>(.*?)</totalpages>}mgis;
  my ($totalpages) = ($1||'1');

Note that, in this hack, we’re going for a quick-and-dirty regular expression method for extracting information from XML. Normally, we’d want to use a proper XML parser, but this approach will work well enough to get this job done for now.

The next step, after getting the first page of search results and extracting the total number of pages, is to grab the rest of the pages for our search query. We can do this with another quick map expression in Perl to step through all the pages and store the content in a list.

One thing to note, however, is that we wait at least one second between grabbing results pages. The company may or may not enforce this restriction, but the license for using the Amazon.com Web Services API specifies that an application should make only one request per second. So, just as we make an effort to obey the Robots Exclusion Protocol, we should try to honor this as well.

Here’s how we do it:

  # Grab all pages of search results.
  my @search_pages = ($content);
  if ($totalpages > 1) {
    push @search_pages,
      map { sleep(1); get_content($base_url."&page=$_") } (2..$totalpages);
  }

Now that we have the content of all the search pages, we can extract records on the publications, just as we have in the previous two search subroutines. The biggest difference in this case, however, is that XML content is so much easier to handle than HTML tag soup. In fact, we can use some relatively simple regular expressions to process this data:

  # Extract data for all the books
  # found in the search results.
  my @book_records;
  for my $content (@search_pages) {

    # Grab the content of all <details> tags.
    while ($content=~ m{<details(?!s) url="(.*?)".*?>(.*?)</details>}mgis) {

      # Extract the URL attribute and tag body content.
      my($url, $details_content) = ($1||'', $2||'');

      # Extract all the tags from the detail record, using
      # tag name as hash key and tag contents as value.
      my %book_record = (_type=>'amazon', url=>$url);
      while ($details_content =~ m{<(.*?)>(.*?)</\1>}mgis) {
        my ($name, $val) = ($1||'', $2||'');
        $book_record{clean_name($name)} = $val;
      }

This code uses regular expressions to extract the contents of XML tags, starting with the details tag. The search results pages contain sets of these tags, and each set contains tags that describe a publication. We use a regular expression that matches on opening and closing tags, extracting the tag name and tag data as the name and value for each field. The names of these tags are described in the SDK, but we’ll just stuff them away in a book record for now.

Notice that this process is much simpler than walking through a tree built up from parsed HTML, looking for tag patterns. Things like this are usually simpler when an explicit service is provided for our use. So, we can apply a little last-minute processing—extracting lists of author subtags—finish up our book record, and wrap up our Amazon.com search subroutine:

      # Further process the authors list to extract author
      # names, and standardize on product name as title.
      my $authors = $book_record{authors} || '';
      $book_record{authors} =
        [ map { $_ } ( $authors =~ m{<author>(.*?)</author>}mgis ) ];
      $book_record{title} = $book_record{productname};

      push @book_records, \%book_record;
    }
  }

  return @book_records;
}

Compared to the previous two searches, this is the simplest of all. Since the XML provided by the Amazon.com search API is a well-defined and easily processed document, we don’t have to do any of the searching and navigation that is needed to extract records from HTML.

And, like the Library of Congress search, we collect a list of Perl hashes that contain book details. Also, each record is tagged with the source of the search.

Presenting the Results

We now have three subroutines with which to search for an author’s works. Each of them produces a similar set of results, as a list of Perl hashes that contain book details in name/value pairs. Although each site’s result records contain different sets of data, there are a few fields common to all three subroutines: _type, title, and url.

We can use these common fields to sort by title and format the results differently for each type of record. Now, we can build the parts to make the aggregate search and result formatting that we put together toward the beginning of the script. Let’s start with the wrapper HTML template:

sub html_wrapper {
  my ($author, $content) = @_;

  return qq^
    <html>
      <head><title>Search results for $author</title></head>
      <body>
        <h1>Search results for $author</h1>
        <ul>$content</ul>
      </body>
    </html>
    ^;
}

This is a simple subroutine that wraps a given bit of content with the makings of an HTML page. Next, let’s check out the basics of item formatting:

sub format_item {
  my $item = shift;
  return "<li>".((defined $item_formats{$item->{_type}})
    ? $item_formats{$item->{_type}}->($item)
    : $item_formats{default}->($item))."</li>";
}

sub default_format {
  my $rec = shift;
  return qq^<a href="$rec->{url}">$rec->{title}</a>^;
}

The first subroutine, format_item, uses the hash table of routines built earlier to apply formatting to items. The second subroutine, default_format, provides a simple implementation of an item format. Before we fill out implementations for the other record types, let’s build a quick convenience function:

sub field_layout {
  my ($rec, $fields) = @_;
  my $out = '';
  for (my $i=0; $i<scalar(@$fields); $i+=2) {
    my ($name, $val) = ($fields->[$i+1], $rec->{$fields->[$i]});
    next if !defined $val;
    $out .= qq^<tr><th align="right">$name:</th><td>$val</td></tr>^;
  }
  return $out;
}

This function takes a record and an ordered list of field name/description pairs. It returns a string containing a set of table rows, pairing each description with its value and skipping any fields the record doesn’t have. We’ll use this in the rest of the formatters to build tables quickly.
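
For instance, a call like this (the record values are invented purely for illustration) produces one row for each field the record actually has:

my %rec = (isbn => '0596005776', description => 'xxiv, 428 p. ; 24 cm.');
print field_layout(\%rec, [ 'description' => 'Description',
                            'isbn'        => 'ISBN' ]);
# <tr><th align="right">Description:</th><td>xxiv, 428 p. ; 24 cm.</td></tr>
# <tr><th align="right">ISBN:</th><td>0596005776</td></tr>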

First, we build a formatter for the Library of Congress search records. Basically, this is an incremental improvement over the default formatter. It identifies the source of this result and uses the field-layout function we just built to display a small set of common fields found in Library of Congress publication records:

sub loc_format {
  my $rec = shift;
  my $out = qq^[LoC] <a href="$rec->{url}">$rec->{title}</a><br /><br />^;
  $out .= qq^<table border="1" cellpadding="4" cellspacing="0" [RETURN]
      width="50%">^;
  $out .= field_layout
    ($rec,
      [
        'publishedcreated'  => 'Published',
        'type_of_material'  => 'Type of material',
        'description'       => 'Description',
        'dewey_class_no'    => 'Dewey class no.',
        'call_number'       => 'Call number',
        'lc_classification' => 'LoC classification',
        'lc_control_number' => 'LoC control number',
        'isbn'              => 'ISBN',
      ]
    );
  $out .= "</table><br />";
  return $out;
}

Next, we build a formatter for the Project Gutenberg records. This implementation doesn’t display as many fields, but it has a special treatment of the download field in order to present the URLs as links:

sub pg_format {
  my $rec = shift;
  my $out = qq^[PG] <a href="$rec->{url}">$rec->{title}</a><br /><br />^;
  $out .= qq^<table border="1" cellpadding="4" cellspacing="0" [RETURN]
      width="50%">^;
  $out .= field_layout($rec, ['language' => 'Language']);
  $out .= qq^
    <tr><th align="right">Download:</th>
      <td>
  ^;
  for my $link (@{$rec->{download}}) {
    $out .= qq^<a href="$link">$link</a><br />^;
  }
  $out .= qq^</td></tr></table><br />^;
  return $out;
}

Finally, we build a formatter for the Amazon.com records, which has much in common with the Library of Congress record formatter. The biggest difference is that we’ve added the display of the publication’s cover image that is available at Amazon.com:

sub amazon_format {
  my $rec = shift;
  my $out = qq^[Amazon] <a href="$rec->{url}">$rec->{title}</a>[RETURN]
<br /><br />^;
  $out .= qq^
    <table border="1" cellpadding="4" cellspacing="0" width="50%">
      <tr><th align="center" colspan="2">
        <img src="$rec->{imageurlmedium}" />
      </th></tr>
  ^;
  $out .= field_layout
    ($rec,
      [
        'releasedate'  => 'Date',
        'manufacturer' => 'Manufacturer',
        'availability' => 'Availability',
        'listprice'    => 'List price',
        'ourprice'     => "Amazon's price",
        'usedprice'    => 'Used price',
        'asin'         => 'ASIN'
      ]
    );
  $out .= "</table><br />";
  return $out;
}

Running the Hack

Now our script is complete. We have code to search for an author across several sites, we have a means of driving these searches and aggregating the results, and we have a flexible means of presenting the results of our search. The design of this script should easily lend itself to adding further sites to be searched, as well as formatters for those results. Figure 4-6 shows the default format.

Figure 4-6. Search results for “dumas, alexandre”

This script is best used from the command line, with the results saved to a file for viewing when the process is complete. Since this is a robot that spiders across many pages on several sites, expect it to take a while to finish. It also generates a fair amount of traffic on the sites it visits, so you’ll likely want to refrain from running it very often. In particular, it’s not a good idea to adapt this script as a CGI script behind a web search form.

Hacking the Hack

Exercises left for the reader include breaking up the search results into pages to make the results friendlier to browse. Also, without too much effort, this script could be modularized and turned into a fairly flexible search robot. In any case, enjoy your new powers of author searching, and good luck in building new search robots.

—l.m.orchard

Hack #65. Mapping O’Reilly Best Sellers to Library Popularity

If you’re using Google to look for books in university libraries, you’ll get better results searching for a Library of Congress call number than for a plain old ISBN.

Earlier in the book, we looked at the variety of unique identifiers that can be used on a web site [Hack #7]. A number of these unique identifiers deal with books and other media.

You may one day find yourself with one identifier for a set of data but needing another set of data that uses a different identifier. That’s where I found myself when I was wondering exactly how many O’Reilly books were in university libraries, compared to their best-selling status (O’Reilly publishes a weekly list of best sellers at http://www.oreilly.com/catalog/top25.html).

Now, I could just use the ISBN, which O’Reilly supplies, and try to find library holdings that way. The problem, though, is that searching for ISBNs on Google will lead you to lots of false positives—bookstores or just mentions of books, instead of actual library holdings. But we do have an alternative: searching for a book’s Library of Congress (LOC) call number will eliminate most of those false positives.

But how do we get the LOC call number for each book? It’s not available from O’Reilly. I found a good search interface at the Rochester Institute of Technology’s library. I used the ISBNs from O’Reilly’s site to look up the LOC call number at RIT’s library. After I had the call number, I used Google’s API to count how many times the call number appeared in Google’s database.

Since the vast majority of LOC call numbers appear in Google search results from university web sites (and specifically library pages), this is a good way to gauge how popular an O’Reilly book is in university libraries versus how it ranks on O’Reilly’s overall best-selling list. Are the results perfect? No; most of the search results find acquisitions lists, not catalog search results. But you can get some idea of which books are popular in libraries and which ones barely appear in libraries at all!

There’s another issue with this script. LOC call numbers end with the date a book was issued; for example, the call number for Mac OS X Hacks is QA76.76.O63 D67 2003. The “2003” is the year the book was published. In the case of Mac OS X Hacks, this is not a problem, since there’s only one edition of the book. But in cases of books like Learning Perl, where there are several editions available, searching for just the call number with the year of publication could miss libraries that simply have older versions of the book on their acquisitions lists.

To that end, this program actually takes two counts in Google using the LOC call number. In the first case, it searches for the entire number. In the second case, it searches for the number without the year at the end, giving two different results.

The Code

Save the following code to a file called isbn2loc.pl:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use SOAP::Lite;

# All the Google information.
my $google_key  = "your Google API key";
my $google_wdsl = "GoogleSearch.wsdl";
my $gsrch       = SOAP::Lite->service("file:$google_wdsl");
my $bestsellers = get("http://www.oreilly.com/catalog/top25.html");

# Since we're getting a list of best sellers,
# we don't have to scrape the rank. Instead
# we'll just start a counter and increment
# it every time we move to the next book. 
my $rank = 1; 
while ($bestsellers =~ m!\[<a href="(.*?)">Read it on Safari!mgis) {
   my $bookurl = $1; $bookurl =~ m!http://safari.oreilly.com/(\w+)!;
   my $oraisbn = $1; next if $oraisbn =~ /^http/;

   # Here we'll search the RIT library for the book's ISBN. Notice
   # the lovely URL that allows us to get the book information.
   my $ritdata = get("http://albert.rit.edu/search/i?SEARCH=$oraisbn"); 
   $ritdata =~ m!field C -->&nbsp;<A HREF=.*?>(.*?)</a>!mgs; 
   my $ritloc = $1; # now we've got the LOC number.

   # Might as well get the title too, eh?
   $ritdata =~ m!<STRONG>\n(.*?)</STRONG>!ms; my $booktitle = $1; 

   # Check and see if the LOC code was found for the book.
   # In a few cases it won't be. If it was, keep on going.
   if ($ritloc =~ /^Q/ or $ritloc =~ /^Z/) {

      # The first search we're doing is for the entire LOC call number. 
      my $results = $gsrch ->doGoogleSearch($google_key, "\"$ritloc\"",
                             0, 1, "false", "",  "false", "", "", "");
      my $firstcount = $results->{estimatedTotalResultsCount};

      # Now, remove the date and check for all editions.
      $ritloc =~ m!(.*?) 200\d{1}!ms; my $ritlocall = $1; 
      $results = $gsrch ->doGoogleSearch($google_key, "\"$ritlocall\"",
                          0, 1, "false", "",  "false", "", "", "");
      my $secondcount = $results->{estimatedTotalResultsCount};

      # Now we print everything out.
      print "The book's title is $booktitle. \n"; 
      print "The book's O'Reilly bestseller rank is $rank.\n"; 
      print "The book's LOC number is $ritloc. \n";
      print "Searching for $ritloc on Google gives $firstcount results. \n"; 
      print "Searching for all editions on Google ($ritlocall) gives ".
            "$secondcount results.\n \n";  
   } 
   $rank++;
}

Running the Hack

Unlike many of the hacks in this book, this hack has no command-line switches or options. You just run it from the command line. It visits the top 25 best-seller list, gets the ISBNs, uses the ISBNs to get the LOC call numbers from the library at RIT, and then searches Google for the LOC call numbers with and without the year of publication. Output looks like this:

% perl isbn2loc.pl
The book's title is Learning Perl.
The book's O'Reilly bestseller rank is 8.
The book's LOC number is QA76.73.P33 S34 2001.
Searching for QA76.73.P33 S34 2001 on Google gives 0 results.
Searching for all editions on Google (QA76.73.P33 S34) gives 9 results.

The book's title is Running Linux.
The book's O'Reilly bestseller rank is 13.
The book's LOC number is QA76.76.O63 W465 2002.
Searching for QA76.76.O63 W465 2002 on Google gives 1 results.
Searching for all editions on Google (QA76.76.O63 W465) gives 20 results.

The book's title is Programming Perl.
The book's O'Reilly bestseller rank is 14.
The book's LOC number is QA76.73.P22 W348 2000.
Searching for QA76.73.P22 W348 2000 on Google gives 1 results.
Searching for all editions on Google (QA76.73.P22 W348) gives 10 results.

Hacking the Hack

This is a very closed hack; it uses certain sources and that’s that. So, the first modification that comes to mind is using different sources. O’Reilly doesn’t have the only best-seller list out there, you know. You could use Amazon.com, Barnes & Noble, or some other online bookstore or book list. You could also reference your own text file full of ISBNs.

You could also use Google’s daterange: syntax to check by month and see when the new acquisitions pages are being indexed. (There are too few search results to try to search on a day-by-day basis.) Another idea is to output the results into comma-delimited format, allowing you to put the information into a spreadsheet and lay it out that way.
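
Here’s a minimal sketch of that last idea; it assumes you swap it in for the print statements inside the if block, and the quoting helper is just a suggestion, not part of the original script:

# Emit one CSV row per book instead of the prose report.
my @fields = ($booktitle, $rank, $ritloc, $firstcount, $ritlocall, $secondcount);
print join(",", map { my $f = $_; $f =~ s/"/""/g; qq{"$f"} } @fields), "\n";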

Hack #66. Using All Consuming to Get Book Lists

You can retrieve a list of the most-mentioned books in the weblog community, as well as personal book lists and recommendations, through either of All Consuming’s two web service APIs.

This hack could represent the future of web applications. It glues together pieces of several web service APIs and then, in turn, offers an API to its features. If someone were to create a derivative application with this API, it would represent a third layer of abstraction from Amazon.com’s service. Entire lightweight services may someday be built layer upon layer like this, with dozens of interconnected applications exchanging data freely behind the scenes.

If this is a book about scraping and spidering, why include instructions on how to use web-based APIs? Quite simply, they make scraping easier. Instead of having to worry about ever-changing HTML [Hack #32], you merely have to do some quick research to learn the provided interface. Likewise, using an API makes it easier to combine raw data from scraping with prepared data from sites like Technorati, All Consuming, Alexa, and Amazon.com. For an example, check out [Hack #59].

All Consuming (http://www.allconsuming.net) is a fairly small application, built on top of a mountain of information that has been made freely available through web services. Amazon.com’s Web Services API fuels the invaluable book information, Google’s API allows us to get related web sites for book titles, and Weblogs.com has an XML file that lets us know which web sites have been updated each hour. Combining these three services, we can create lists of books that are being talked about on the Web. It only makes sense for us to give back to this generous community by opening up SOAP and REST interfaces to All Consuming’s information, to be used for free and in any way that can be invented.

The SOAP Code

Here’s an example of how you can access All Consuming information on your own, using SOAP and Perl. Create a file called display_weekly_list_with_soap.cgi:

#!/usr/bin/perl  -w
# display_weekly_list_with_soap.cgi
use strict; 

use SOAP::Lite +autodispatch => 
    uri => 'http://www.allconsuming.net/AllConsumingAPI',
    proxy => 'http://www.allconsuming.net/soap.cgi';

# optional values for the API.
my ($hour,$day,$month,$year) = qw( 12 05 28 2003 );

my $AllConsumingObject = 
AllConsumingAPI->new(
                         $hour,  # optional
                         $day,   # optional
                         $month, # optional
                         $year   # optional
                       );

This creates a new object, $AllConsumingObject, which you can then use to retrieve a wide variety of data, as explained in the following sections.

Most-mentioned lists

Every hour, All Consuming crawls recently updated weblogs to see if any new books have been mentioned for the first time on any given site. It combines this information with Amazon.com’s Web Services API, aggregates frequently mentioned books into hourly and weekly lists, and archives them all the way back to August 2002. GetHourlyList sends you the most recent hour’s list information, GetWeeklyList sends you the most recent aggregation of all activity during the last week, and GetArchiveList returns you the hourly or weekly list that corresponds with the date that you specify when creating the object (the $hour, $day, $month, and $year variables). For example:

my $HourlyData = $AllConsumingObject->GetHourlyList;
my $WeeklyData = $AllConsumingObject->GetWeeklyList;
my $ArchivedData = $AllConsumingObject->GetArchiveList;

Personal book lists

People have created their own book lists directly through All Consuming, assigning them to categories like Currently Reading, Favorite Books, and Completed Books. Although some of these lists are available for use on other sites through methods like JavaScript includes, if someone wants to add a Favorite Books list to their site, they’ll have to use the SOAP or REST interfaces to do so:

my $CurrentlyReading = $AllConsumingObject->GetCurrentlyReadingList('insert name');
my $FavoriteBooks    = $AllConsumingObject->GetFavoriteBooksList('insert name');
my $PurchasedBooks   = $AllConsumingObject->GetPurchasedBooksList('insert name');
my $CompletedBooks   = $AllConsumingObject->GetCompletedBooksList('insert name');

Book metadata and weblog mentions

Some users have added valuable metadata about books, such as first lines and number of pages. This is mostly for fun, and it allows me to have an hourly “first line trivia” question on my homepage, to see if you can guess the book that the first line comes from. In any case, if you want to retrieve book metadata for a given book, you can do so with the following method:

my $Metadata = $AllConsumingObject->GetMetadataForBook('insert ISBN');

The argument passed in is the ISBN (International Standard Book Number) for the book you’d like to retrieve metadata from. For a list of metadata that’s currently available for use, you can check out the metadata scorecard at All Consuming (http://www.allconsuming.net/scorecard.html).

Alternatively, if you’d like to receive a list of all of the weblogs that have mentioned a particular book, you can retrieve that information using the following method:

my $WeblogMentions = $AllConsumingObject->GetWeblogMentionsForBook('insert ISBN');

Friends and recommendations

All Consuming also has friend relationships—created when people mark their favorite web sites so they can keep track of what those sites are reading—as well as book recommendations based on the sum of all those friend relationships. You can get a list of web sites that you or someone else has marked as a friend by including your weblog URL:

my $Friends = $AllConsumingObject->GetFriends('insert URL');

And to get a list of books that all of your friends are currently reading, sorted by those that are mentioned recently and the most times, you can do this:

my $Recommendations = $AllConsumingObject->GetRecommendations('insert URL');

To iterate through the results these methods return, do something like this:

# The array here may differ depending
# on the type of data being returned.
if (ref($WeeklyData->{'asins'}) eq 'ARRAY') {
    foreach my $item (@{$WeeklyData->{'asins'}}) {
        print "TITLE: $item->{'title'}\n",
        "AUTHOR: $item->{'author'}\n\n";
    }
}

Of course, in either of these examples, you can change the URL passed to any other URL. For a full list of methods you can invoke on this object, visit the instructions (http://allconsuming.net/news/000012.html) and code samples (http://allconsuming.net/soap-code-example.txt).

The REST Code

For those who think SOAP is a bit of overkill for simple applications like this, you can get the same information REST-style. Add this code to a file called display_weekly_list_with_rest.cgi:

#!/usr/bin/perl -w
# display_weekly_list_with_rest.cgi
use strict;
use LWP::Simple;
use XML::Simple;

# Any of the URLs mentioned below can replace this one.
my $URLToGet = 'http://allconsuming.net/rest.cgi?weekly=1';

# Download and parse.
my $XML = get($URLToGet);
my $ParsedXML = XMLin($XML, suppressempty => 1);

# The array here may differ depending
# on the type of data being returned.
if (ref($ParsedXML->{'asins'}) eq 'ARRAY') {
    foreach my $item (@{$ParsedXML->{'asins'}}) {
        print "TITLE: $item->{'title'}\n",
        "AUTHOR: $item->{'author'}\n\n";
    }
}

Following are the URL formats you can access via HTTP to return XML data directly.

Most-mentioned lists

Here’s the REST interface for requesting the hourly and weekly most-mentioned lists:

http://allconsuming.net/rest.cgi?hourly=1
http://allconsuming.net/rest.cgi?weekly=1

If you’d like to retrieve an archived list of most-mentioned books, you can specify the date, like so:

http://allconsuming.net/rest.cgi?archive=1&hour=12&day=12&month=5&year=2003

Personal book lists

To retrieve a list of any of your categorized books in XML format, add your username to any of the following URLs. Note the category name in the URL.

http://allconsuming.net/rest.cgi?currently_reading=1&username=insert name
http://allconsuming.net/rest.cgi?favorite_books=1&username=insert name
http://allconsuming.net/rest.cgi?purchased_books=1&username=insert name
http://allconsuming.net/rest.cgi?completed_books=1&username=insert name

Book metadata and weblog mentions

To get XML data about a specific item, include the ISBN in these URLs:

http://allconsuming.net/rest.cgi?metadata=1&isbn=insert ISBN
http://allconsuming.net/rest.cgi?weblog_mentions_for_book=1&isbn=insert ISBN

Friends and recommendations

To find XML data that includes friends or recommendations for a given weblog, you can include the weblog’s URL in the appropriate format:

http://allconsuming.net/rest.cgi?friends=1&url=insert URL
http://allconsuming.net/rest.cgi?recommendations=1&url=insert URL

Running the Hack

Running display_weekly_list_with_rest.cgi without modification shows:

% perl display_weekly_list_with_rest.cgi
TITLE: Peer-to-Peer : Harnessing the Power of Disruptive Technologies
AUTHOR: Andy Oram

TITLE: Quicksilver : Volume One of The Baroque Cycle
AUTHOR: Neal Stephenson

TITLE: A Pattern Language: Towns, Buildings, Construction
AUTHOR: Christopher Alexander, Sara Ishikawa, Murray Silverstein

TITLE: Designing With Web Standards
AUTHOR: Jeffrey Zeldman

TITLE: Slander: Liberal Lies About the American Right
AUTHOR: Ann H. Coulter

TITLE: Bias : A CBS Insider Exposes How the Media Distort the News
AUTHOR: Bernard Goldberg

TITLE: The Adventures of Charmin the Bear
AUTHOR: David McKee, Joanna Quinn

The XML Results

The returned output of both the SOAP and REST interfaces will be XML that looks something like this:

<opt>
  <header 
    lastBuildDate="Sat May 28 13:30:02 2003" 
    title="All Consuming" 
    language="en-us" 
    description="Most recent books being talked about by webloggers." 
    link="http://allconsuming.net/" 
    number_updated="172" 
  />
  <asins 
    asin="0465045669" 
    title="Metamagical Themas" 
    author="Douglas R. Hofstadter" 
    url="http://www.erikbenson.com/"
    image="http://images.amazon.com/images/P/0465045669.01.THUMBZZZ.jpg" 
    excerpt="Douglas Hoftstadter's lesser-known book, Metamagical Themas, 
has a great chapter or two on self-referential sentences like 'This sentence 
was in the past tense.'." 
    amazon_url="http://amazon.com/exec/obidos/ASIN/0465045669/"
    allconsuming_url="http://allconsuming.net/item.cgi?id=0465045669"
  />
</opt>

If multiple items are returned, there will be multiple <asins /> elements.

Hacking the Hack

Although All Consuming currently tracks only book trends, it also stores information about other types of items that are available at Amazon.com, such as CDs, DVDs, and electronics. You can’t find this information anywhere on All Consuming’s site, but if you use either of the APIs to retrieve weblog mentions for an ASIN (Amazon.com Standard Identification Number) that belongs to a product category other than books, it will still faithfully return any weblog data that it has for that item.
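
Here’s a quick REST-style sketch of that idea (the ASIN below is only a placeholder, not a real product ID):

#!/usr/bin/perl -w
use strict;
use LWP::Simple;

# Ask for weblog mentions of a non-book item; the service doesn't care
# that the "isbn" parameter is really a DVD or CD ASIN.
my $asin = 'B000000000';   # placeholder ASIN
my $xml  = get("http://allconsuming.net/rest.cgi?weblog_mentions_for_book=1&isbn=$asin")
    or die "No response from All Consuming.\n";
print $xml;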

—Erik Benson

Hack #67. Tracking Packages with FedEx

When you absolutely, positively have to know where your package is right now!

So many times when using the Web, all you need is one bit of information, especially when you’re running a specific search. You want to know when your flight is coming in. You want to know how much a book costs. You want to know when your FedEx package is going to arrive.

Spidering is ideal for grabbing this one bit of information without expending a lot of effort. This hack helps you track FedEx packages.

The Code

Save the following code as fedex_tracker.pl:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TableExtract;

# we use the Canada/English site, because its table
# of package tracking is simpler to parse than the "us".
my $url_base = "http://www.fedex.com/cgi-bin/tracking?action=track".
               "&cntry_code=ca_english&tracknumbers="; # woo hah.

# user wants to add a new tracking number.
my @tracknums; push(@tracknums, shift) if @ARGV;

# user already has some data on disk, so suck it in.
# we could technically add a grep on the readdir, but
# we have to postprocess @files anyway, so...
opendir(CWD, ".") or die $!; my @files = readdir(CWD); closedir(CWD);
foreach (@files) { /fedex_tracker_(\d+).dat/; push(@tracknums, $1) if $1; }
unless (@tracknums) { die "We have no packages to track!\n"; }
my %h; undef (@h{@tracknums}); @tracknums = keys %h; # quick unique.

# each tracking number, look it up.
foreach my $tracknum (@tracknums) {

    # suck down the data or end.
    my $data = get("$url_base$tracknum") or die $!;
    $data =~ s/&nbsp;/ /g; # sticky spaces.

    # and load our specific tracking table in.
    my $te = HTML::TableExtract->new(
           headers => ["Scan Activity","Date/Time"]);
    $te->parse($data); # alright, we've got everything loaded, hopefully.

    # now, get the new info.
    my $new_data_from_site;
    foreach my $ts ($te->table_states) {
       foreach my $row ($ts->rows) {
           $new_data_from_site .= " " . join(', ', @$row) . "\n";
       }
    }

    # if this is a broken tracking number,
    # move on and try the other ones we have.
    unless ($new_data_from_site) {
       print "No data found for package #$tracknum. Skipping.\n"; next; 
    }

    # if this package has never been tracked
    # before, then we'll create a file to
    # hold the data. this will be used for
    # comparisons on subsequent runs.
    unless (-e "fedex_tracker_$tracknum.dat") {
       open(FILE, ">fedex_tracker_$tracknum.dat") or die $!;
       print FILE $new_data_from_site; close (FILE);
       print "Adding the following data for #$tracknum:\n";
       print $new_data_from_site;
    }

    # if the datafile does exist, load it 
    # into a string, and do a simplistic
    # comparison to see if they're equal.
    # if not, assume things have changed.
    if (-e "fedex_tracker_$tracknum.dat") {
        open(FILE, "<fedex_tracker_$tracknum.dat");
        $/ = undef; my $old_data_from_file = <FILE>; close(FILE);
        if ($old_data_from_file eq $new_data_from_site) {
            print "There have been no changes for package #$tracknum.\n";
        } else {
            print "Package #$tracknum has advanced in its journey!\n";
            print $new_data_from_site; # update the user.
            open(FILE, ">fedex_tracker_$tracknum.dat");
            print FILE $new_data_from_site; close(FILE);
            # the file is updated for next compare.
        }
    }
}

Running the Hack

To use the script, pass a tracking number on the command line. The script will also check any packages you’ve entered on previous runs. When you run the script with a package number, the output will look like this:

% perl fedex_tracker.pl 047655634284503
 Adding the following data for #047655634284503:
  Departed FedEx sort facility/SACRAMENTO, CA, 08/06/2003 06:54
  Scanned at FedEx sort facility/SACRAMENTO, CA, 08/06/2003 00:14
  Scanned at FedEx origin location/SACRAMENTO, CA, 08/05/2003 23:57
  Customer-Loaded Trailer Picked Up/SACRAMENTO, CA, 08/05/2003 00:00
 There have been no changes for package #047655634284503.

Once you’ve run this search, the script will create a new file, fedex_tracker_PACKAGENUM.dat, in the same directory. In the previous example, the new file is called fedex_tracker_047655634284503.dat. On each successive run, the script will look up and update this package’s information. How do you get it to stop checking a particular package? Simply delete its .dat file.

Just because you have an existing package doesn’t mean you can’t continue to search for other packages. Say you run the previous search and then want to run another. This will work:

% perl fedex_tracker.pl 123456789
 No data found for package #123456789. Skipping.
 There have been no changes for package #047655634284503.

If no data is found for a package, a .dat file will not be created for it.

Hacking the Hack

If you want to write something similar that grabs information from Amazon.com’s order-tracking pages, there are a variety of things you could do.

First, you could run it from cron [Hack #90] so you’d have a daily update of where your package is. Alternatively, you could have another small script that periodically gathers up the names of the various .dat files you generate and sends them to you, so you could metatrack which packages you’ve tracked. Or, perhaps you want to export the package information into a comma-delimited file, in which case you’ll need a permanent record of shipping progress.
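
As a starting point for that kind of metatracking, a small helper like this (a separate sketch, not part of fedex_tracker.pl) lists every package number that currently has a datafile:

#!/usr/bin/perl -w
use strict;

# Collect the tracking numbers from the .dat files fedex_tracker.pl leaves behind.
opendir(my $dh, ".") or die $!;
my @tracked = sort grep { defined }
              map { /^fedex_tracker_(\d+)\.dat$/ ? $1 : undef } readdir($dh);
closedir($dh);

print "Currently tracking: ", (@tracked ? join(", ", @tracked) : "nothing"), "\n";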

Hack #68. Checking Blogs for New Comments

Tend to respond directly to weblog posts with a comment or three? Ever wonder about the reactions to your comments? This hack automates the process of keeping up with the conversation you started.

Blogs are the savior of independent publishing, and the ability of most to allow commenting creates an intimate collaboration between performer and audience: read the blog’s entry and any existing comments, and then add your own thoughts and opinions. What’s most annoying, however, is needing to return on a regular basis to see if anyone has added additional comments, whether to the original posting or to your own follow up.

With the RSS syndication format, you can monitor new blog entries in a standard way with any number of popular aggregators. Unfortunately, unless the site in question has provided its comments in RSS format also, there’s not really a standard way for comments to be used, repurposed, or monitored.

However, the more blogs and comments you read, the more you’ll see patterns emerge. Perhaps a comment always starts with “On DATE, PERSON said” or “posted by PERSON on DATE,” or even plain old “DATE, PERSON.” These comment signatures are the beginning of an answer to your needs: a script that uses regular expressions to check for various types of signatures can do a reasonable job of telling you when new comments have been posted.

The Code

Save this script as chkcomments.pl:

#!/usr/bin/perl -w
use strict;
use Getopt::Long;
use LWP::Simple;
my %opts; GetOptions(\%opts, 'v|verbose');

# where we find URLs. we'll also use this
# file to remember the number of comments.
my $urls_file = "chkcomments.dat";

# what follows is a list of regular expressions and assignment
# code that will be executed in search of matches, per site.
my @signatures = (
   { regex  => qr/On (.*?), <a href="(.*?)">(.*?)<\/a> said/,
     assign => '($date,$contact,$name) = ($1,$2,$3)'
   },
   { regex  => qr/&middot; (.*?) &middot; .*?<a href="(.*?)">(.*?)<\/a>/,
     assign => '($date,$contact,$name) = ($1,$2,$3)'
   },
   { regex  => qr/(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})&nbsp;(.*)/,
     assign => '($date,$name,$contact) = ($1,$2,"none")'
   },
);

# open our URL file, and suck it in.
open(URLS_FILE, "<$urls_file") or die $!;
my %urls; while (<URLS_FILE>) { chomp;
   my ($url, $count) = split(/\|%%\|/);
   $urls{$url} = $count || undef;
} close (URLS_FILE);

# foreach URL in our dat file:
foreach my $url (keys %urls) {

   next unless $url; # no URL, no cookie.
   my $old_count = $urls{$url} || undef;

   # print a little happy message.
   print "\nSearching $url...\n"; 

   # suck down the data.
   my $data = get($url) or next;

   # now, begin looping through our matchers.
   # for each regular expression and assignment
   # code, we execute it in this namespace in an
   # attempt to find matches in our loaded data.
   my $new_count; foreach my $code (@signatures) {

      # with our regular expression loaded,
      # let's see if we get any matches.
      while ($data =~ /$code->{regex}/gism) {

         # since our $code contains two Perl statements
         # (one being the regex, above, and the other
         # being the assignment code), we have to eval
         # it once more so the assignments kick in.
         my ($date, $contact, $name); eval $code->{assign};
         next unless ($date && $contact && $name);
         print "  - $date: $name ($contact)\n" if $opts{v};
         $new_count++; # increase the count.
      }

      # if we've gotten a comment count, then assume
      # our regex worked properly, spit out a message,
      # and assign our comment count for later storage.
      if ($new_count) {
         print " * We saw a total of $new_count comments".
               " (old count: ". ($old_count || "unchecked") . ").\n";
         if ($new_count > ($old_count || 0)) { # joy of joys!
             print " * Woo! There are new comments to read!\n"
         } $urls{$url} = $new_count; last; # end the loop.
      }
   }
} print "\n";

# now that our comment counts are updated,
# write it back out to our datafile.
open(URLS_FILE, ">$urls_file") or die $!;
foreach my $url (keys %urls) {
   print URLS_FILE "$url|%%|$urls{$url}\n";
} close (URLS_FILE);

Running the Hack

This script depends on being fed a file that lists URLs you’d like to monitor. These should be the URLs of the page that holds comments on the blog entry, often the same as the blog entry’s permanent link (or permalink). If you’re reading http://www.gamegrene.com, for instance, and you’ve just commented on the “The Lazy GM” article, you’ll add the following URL into a file named chkcomments.dat:

http://www.gamegrene.com/game_material/the_lazy_gm.shtml

A typical first run considers all comments new—new to you and your script:

% perl chkcomments.pl
Searching http://www.gamegrene.com/game_material/the_lazy_gm.shtml...
 * We saw a total of 5 comments (old count: unchecked).
 * Woo! There are new comments to read!

You can also show the name, date, and contact information of each individual comment, by passing the --verbose command-line option. This example shows the script checking for new comments on the same URL:

% perl chkcomments.pl --verbose
Searching http://www.gamegrene.com/game_material/the_lazy_gm.shtml...
  - July 23, 2003 01:53 AM: VMB (mailto:vesab@jippii.fi)
  - July 23, 2003 10:55 AM: Iridilate (mailto:)
  - July 29, 2003 02:46 PM: The Bebop Cow (mailto:blackcypress@yahoo.com)
... etc ...
 * We saw a total of 5 comments (old count: 5).

Since no comments were added between our first and second runs, there’s nothing new.

But how did the script know how many comments there were in the first place? The answer, as I alluded to previously, is comment signatures. In HTML, every comment on Gamegrene looks like this:

On July 23, 2003 01:53 AM, <a href="mailto:vesab@jippii.fi">VMB</a> said:

In other words, it has a signature of On DATE, <a href="CONTACT">PERSON</a> said or, if you were expressing it as a regular expression, On (.*?), <a href="(.*?)">(.*?)<\/a> said. Keen observers of the script will have noticed this regular expression appear near the top of the code:

my @signatures = (
   { regex  => qr/On (.*?), <a href="(.*?)">(.*?)<\/a> said/,
     assign => '($date,$contact,$name) = ($1,$2,$3)'
   },

What about the assign line, though? Simply enough, it takes our captured bits of data from the regular expression (the bits that look like (.*?)) and assigns them to more easily understandable variables, like $date, $contact, and $name. The number of times our regular expression matches is the number of comments we’ve seen on the page. Likewise, the information stored in our variables is the information printed out when we ask for --verbose output.
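
If the eval looks mysterious, here’s the same mechanism boiled down to a standalone demonstration; the sample line is made up, but it follows the Gamegrene signature:

#!/usr/bin/perl -w
use strict;

my $line = 'On July 23, 2003 01:53 AM, <a href="mailto:someone@example.com">Someone</a> said:';
my %sig  = (
    regex  => qr/On (.*?), <a href="(.*?)">(.*?)<\/a> said/,
    assign => '($date,$contact,$name) = ($1,$2,$3)',
);

if ($line =~ /$sig{regex}/) {
    my ($date, $contact, $name);
    eval $sig{assign};   # runs the assignment string in this lexical scope
    print "$date: $name ($contact)\n";
}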

If you refer back to the code, you’ll notice two other signatures that match the comment styles on Dive Into Mark (http://www.diveintomark.org) and the O’Reilly Network (http://www.oreillynet.com) (and possibly other sites that we don’t yet know about). Since their signatures already exist, we can add the following URLs to our chkcomments.dat file:

http://diveintomark.org/archives/2003/07/28/atom_news
http://www.oreillynet.com/pub/wlg/3593
http://macdevcenter.com/pub/a/mac/2003/08/01/cocoa_series.html?page=2

and run our script on a regular basis to check for new comments:

% perl chkcomments.pl 
Searching http://www.gamegrene.com/game_material/the_lazy_gm.shtml...
 * We saw a total of 5 comments (old count: 5).

Searching http://diveintomark.org/archives/2003/07/28/atom_news...
 * We saw a total of 11 comments (old count: unchecked).
 * Woo! There are new comments to read!

Searching http://www.oreillynet.com/pub/wlg/3593 ...
 * We saw a total of 1 comments (old count: unchecked).
 * Woo! There are new comments to read!

Searching http://macdevcenter.com/pub/a/mac/2003/08/01/cocoa_seri...
 * We saw a total of 9 comments (old count: unchecked).
 * Woo! There are new comments to read!

Hacking the Hack

The obvious way of improving the script is to add new comment signatures that match up with the sites you’re reading. Say we want to monitor new comments on Harvard Weblogs (http://blogs.law.harvard.edu/). The first thing we need is a post with comments, so that we can determine the comment signature. Once we find one, view the HTML source to see something like this:

<div class="date"><a href="http://scripting.com">
Dave Winer</a> &#0149; 7/18/03; 7:58:33 AM</div>

The comment signature for Harvard Weblogs is equivalent to <a href="CONTACT">PERSON</a> DATE, which can be stated in regular expression form as date"><a href="(.*?)">(.*?)<\/a> &#0149; (.*?)<\/div>. Once we have the signature in regular expression form, we just need to assign our matches to the variable names and add the signature to our listings at the top:

my @signatures = (
   { regex  => qr/On (.*?), <a href="(.*?)">(.*?)<\/a> said/,
     assign => '($date,$contact,$name) = ($1,$2,$3)'
   },
   { regex  => qr/&middot; (.*?) &middot; .*?<a href="(.*?)">(.*?)<\/a>/,
     assign => '($date,$contact,$name) = ($1,$2,$3)'
   },
   { regex  => qr/(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})&nbsp;(.*)/,
     assign => '($date,$name,$contact) = ($1,$2,"none")'
   },
   { regex  => qr/date"><a href="(.*)">(.*)<\/a> &#0149; (.*)<\/div>/,
     assign => '($contact,$name,$date) = ($1,$2,$3)'
   },
);

Now, just add the URL we want to monitor to our chkcomments.dat file, and run the script as usual. Here’s an output of our first check, with verbosity turned on:

Searching http://blogs.law.harvard.edu/comments?u=homeManilaWebs...
  - 7/18/03; 1:23:14 AM: James Farmer (http://radio.weblogs.com/0120501/)
  - 7/18/03; 4:06:10 AM: Phil Wolff (http://dijest.com/aka)
  - 7/18/03; 7:58:33 AM: Dave Winer (http://scripting.com)
  - 7/18/03; 6:23:14 PM: Phil Wolff (http://dijest.com/aka)
 * We saw a total of 4 comments (old count: unchecked).
 * Woo! There are new comments to read!

Hack #69. Aggregating RSS and Posting Changes

With the proliferation of individual and group weblogs, it’s typical for one person to post in multiple places. Thanks to RSS syndication, you can easily aggregate all your disparate posts into one weblog.

You might have heard of RSS. It’s an XML format that’s commonly used to syndicate headlines and content between sites. It’s also used in specialty software programs called headline aggregators or readers. Many popular weblog software packages, including Movable Type (http://www.movabletype.org) and Blogger (http://www.blogger.com), offer RSS feeds. So too do some of the content management systems—Slashcode (http://slashcode.com), PHPNuke (http://phpnuke.org), Zope (http://www.zope.org), and the like—that run some of the more popular tech news sites.

If you produce content for various people, you might find your writing and commentary scattered all over the place. Or, say you have a group of friends and all of you want to aggregate your postings into a single place without abandoning your individual efforts. This hack is a personal spider just for you; it aggregates entries from multiple RSS feeds and posts those new entries to a Movable Type blog.

The Code

You’ll need LWP::Simple, Net::Blogger, and XML::RSS to use this. Save the following code to a file named myrssmerger.pl:

#!/usr/bin/perl -w
#
# MyRSSMerger - read multiple RSS feeds, post new entries to Movable Type.
# http://disobey.com/d/code/ or contact morbus@disobey.com.
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#

use strict; $|++;
my $VERSION = "1.0";
use Getopt::Long;
my %opts;

# make sure we have the modules we need, else die peacefully.
eval("use LWP::Simple;");  die "[err] LWP::Simple not installed.\n" if $@;
eval("use Net::Blogger;"); die "[err] Net::Blogger not installed.\n" if $@;
eval("use XML::RSS;");    die "[err] XML::RSS not installed.\n" if $@;

# define our command line flags (long and short versions).
GetOptions(\%opts, 'server|s=s',      # the XML-RPC server URL to use.
                   'username|u=s',    # the Movable Type username.
                   'password|p=s',    # the Movable Type password.
                   'blogid|b=i',      # unique ID of your blog.
                   'catid|c=i',       # unique ID for posting category.
                   'showcategories',  # list categories for blog.
                   'filter|f=s',      # per item filter for posting?
);

# at the very least, we need our login information.
die "[err] XML-RPC URL missing, use --server or -s.\n" unless $opts{server};
die "[err] Username missing, use --username or -u.\n"  
    unless $opts{username};
die "[err] Password missing, use --password or -p.\n"  
    unless $opts{password};
die "[err] BlogID missing, use --blogid or -b.\n"      unless $opts{blogid};

# every request past this point requires
# a connection, so we'll go and do so.
print "-" x 76, "\n"; # visual separator.
my $mt = Net::Blogger->new(engine=>"movabletype");
$mt->Proxy($opts{server});       # the servername.
$mt->Username($opts{username});  # the username.
$mt->Password($opts{password});  # the... ok. self-
$mt->BlogId($opts{blogid});      # explanatory!

# show existing categories.
if ($opts{showcategories}) {

    # get the list of categories from the server.
    my $cats = $mt->mt()->getCategoryList(  )
      or die "[err] ", $mt->LastError(  ), "\n";

    # and print 'em.
    if (scalar(@$cats) > 0) {
        print "The following blog categories are available:\n\n";
        foreach (sort { $a->{categoryId} <=> $b->{categoryId} } @$cats) {
            print " $_->{categoryId}: $_->{categoryName}\n";
        }
    } else { print "There are no selectable categories available.\n"; }

    # done with this request, so exit.
    print "\nCategory ID's can be used for --catid or -c.\n";
    print "-" x 76, "\n"; exit; # call me again, again!

}

# now, check for passed URLs for new-item-examination.
die "[err] No RSS URLs were passed for processing.\n" unless @ARGV;

# and store today's date for comparison.
# who needs the stinkin' Date:: modules?!
my ($day, $month, $year) = ((localtime)[3, 4, 5]);
$year+=1900; $month = sprintf("%02.0d", ++$month);
$day = sprintf("%02.0d", $day);  # zero-padding.
my $today = "$year-$month-$day"; # final version.

# loop through each RSS URL.
foreach my $rss_url (@ARGV) {

    # download whatever we've got coming.
    print "Downloading RSS feed at ", substr($rss_url, 0, 40), "...\n";
    my $data = get($rss_url) or print " [err] Data not downloaded!\n";
    next unless $data; # move onto the next URL in our list, if any.

    # parse it and then
    # count the number of items.
    # move on if nothing parsed.
    my $rss = new XML::RSS; $rss->parse($data);
    my $item_count = scalar(@{$rss->{items}});
    unless ($item_count) { print " [err] No parsable items.\n"; next; }

    # sandwich our post between a preface/anteface.
    my $clink = $rss->{channel}->{"link"}; # shorter variable.
    my $ctitle = $rss->{channel}->{title}; # shorter variable.
    my $preface = "From <a href=\"$clink\">$ctitle</a>:\n\n<blockquote>";
    my $anteface = "</blockquote>\n\n"; # new items as quotes.

    # and look for items dated today.
    foreach my $item (@{$rss->{items}}) {

        # no description or date for our item? move on.
        unless ($item->{description} or $item->{dc}->{date}) {
          print " Skipping (no description/date): '$item->{title}'.\n";
          next;
        }

        # if we have a date, is it today's?
        if ($item->{dc}->{date} =~ /^$today/) {

            # shorter variable. we're lazy.
            my $creator = $item->{dc}->{creator};

            # if there's a filter, check for goodness.
            if ($opts{filter} && $item->{description} !~ /$opts{filter}/i) {
                print " Skipping (failed filter): '$item->{title}'.\n"; 
                next;
            }

            # we found an item to post, so make a
            # final description from various parts.
            my $description = "$preface$item->{description} ";
            $description   .= "($creator) " if $creator;
            $description   .= "<a href=\"$item->{link}\">Read " .
                              "more from this post.</a>$anteface";

            # now, post to the passed blog info.
            print " Publishing item: '$item->{title}'.\n";
            my $id = $mt->metaWeblog(  )->newPost(
                              title       => $item->{title},
                              description => $description,
                              publish     => 1)
                     or die "[err] ", $mt->LastError(  ), "\n";

            # set the category?
            if ($opts{catid}) {
                $mt->mt(  )->setPostCategories(
                              postid     => $id,
                              categories => [ {categoryId => $opts{catid}}])
                or die " [err] ", $mt->LastError(  ), "\n";

                # "edit" the post with no changes so
                # that our category change activates.
                $mt->metaWeblog(  )->editPost(
                              title       => $item->{title},
                              description => $description,
                              postid      => $id,
                              publish     => 1)
                     or die " [err] ", $mt->LastError(  ), "\n";
            }
        } else { 
           print " Skipping (failed date check): '$item->{title}'.\n"; 
        }
    }
    print "-" x 76, "\n"; # visual separator.
}

exit;

Running the Hack

To run the code, you’ll need a Movable Type weblog. At the very least, you need the username, password, XML-RPC URL for Movable Type, and the blog ID (normally 1 if you have only one). Here’s an example of connecting to Kevin’s Movable Type installation to show a list of categories to post to (the --showcategories switch is, strangely enough, showing the categories):

% perl myrssmerger.pl -s http://disobey.com/cgi-bin/mt/mt-xmlrpc.cgi -u 
morbus -p HAAHAHAH -b 1 --showcategories

The output looks like this:

----------------------------------------------------------------------
 The following blog categories are available:

 1: Disobey Stuff
 2: The Idiot Box
 3: CHIApet
 4: Friends O' Disobey
 5: Stalkers O' Morbus
 6: Morbus Shoots, Jesus Saves
 7: El Casho Disappearo
 8: TechnOccult
 9: Potpourri
 10: Collected Nonsensicals

Category ID's can be used for --catid or -c.
----------------------------------------------------------------------

If you have no categories, you’ll be told as such. When you’re actually posting to the blog, you can choose to post into a category or not; if you want to post into Disobey Stuff, use either -c 1 or --catid 1 when you run the program. If you want no category, specify no category.

Let’s take a look at a few examples of how to use the script. Say Kevin wants to aggregate all the data from all the places he publishes information. Every night he’ll use cron [Hack #90] to run the script for various RSS feeds. Here’s an example:

% perl myrssmerger.pl --server http://disobey.com/cgi-bin/mt/mt-xmlrpc.cgi [RETURN]
--username morbus --password HAAHAHAH --blogid 1 --catid 1 [RETURN]
http://gamegrene.com/index.xml

In this case, he’s saying, “Every night, check the Gamegrene RSS files for entries posted today. If you see any, post them to Disobey Stuff” (which is the first category, referenced with the --catid 1 switch). He can then run the script again, only for a different RSS feed with a different category switch, and so on. Let’s take a look at the output of the Gamegrene example:

----------------------------------------------------------------------
Downloading RSS feed at http://gamegrene.com/index.xml...
 Publishing item: 'RPG, For Me'.
 Skipping (failed date check): 'Just Say No To Powergamers'.
 Skipping (failed date check): 'Every Story Needs A Soundtrack'.
 Skipping (failed date check): 'The Demise of Local Game Shops'.
 Skipping (failed date check): 'Death Of A Gaming System'.
 Skipping (failed date check): 'What Do You Do With Six Million Elves?'.
----------------------------------------------------------------------

As you can see, the script checks the dates in the RSS feed to make sure they’re new before the items are added to the Movable Type weblog. Dates are determined from the <dc:date> entry in the remote RSS URL; if the feed doesn’t have them, the script won’t function correctly.
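
If you’re not sure whether a feed carries dates, a throwaway check like this (not part of myrssmerger.pl) will tell you:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use XML::RSS;

# Report how many items in a feed actually carry a dc:date element.
my $url  = shift or die "usage: $0 <rss url>\n";
my $data = get($url) or die "couldn't fetch $url\n";
my $rss  = XML::RSS->new; $rss->parse($data);
my $dated = grep { $_->{dc}->{date} } @{$rss->{items}};
printf "%d of %d items have a dc:date.\n", $dated, scalar @{$rss->{items}};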

What happens when you want to check many RSS feeds and add them all to the same category? You can do that in a single run of the script. Say you want to check three different RSS feeds, not necessarily all yours. Here’s an example of Kevin checking three feeds (including Tara’s) and posting any new items to category 4, Friends O' Disobey:

% perl myrssmerger.pl --server http://disobey.com/cgi-bin/mt/mt-xmlrpc.cgi [RETURN]
--username morbus --password HAAHAHAH --blogid 1 --catid 4 [RETURN]
http://gamegrene.com/index.xml http://researchbuzz.com/researchbuzz.rss [RETURN]
http://camworld.com/index.rdf

The shortened output looks like this:

----------------------------------------------------------------------
Downloading RSS feed at http://gamegrene.com/index.xml...
 Skipping (failed date check): 'RPG, For Me'.
 Skipping (failed date check): 'Just Say No To Powergamers'.
 Skipping (failed date check): 'Every Story Needs A Soundtrack'.
----------------------------------------------------------------------
Downloading RSS feed at http://camworld.com/index.rdf...
 Publishing item: 'Trinity's Hack from Matrix Reloaded'.
 Skipping (failed date check): 'Siberian Desktop'.
 Skipping (failed date check): 'The Sweet Hereafter'.
----------------------------------------------------------------------
Downloading RSS feed at http://researchbuzz.com/researchbuzz.rss...
 Skipping (no description/date): 'Northern Light Coming Back?'.
 Skipping (no description/date): 'This Week in LLRX'.
----------------------------------------------------------------------

Note that Tara’s feed doesn’t work with this script; she generates her RSS by hand, and her feed doesn’t include dates. Most program-generated feeds, like those of Movable Type, have dates and descriptions and will be just fine.

As you can see, we can choose a variety of feeds to use and we can post them to any of our Movable Type categories. Is there anything else this script can do? Well, actually, yes; it can filter incoming entries that match a specified keyword. To do that, use the --filter switch. As an example, this script posts only those entries whose descriptions include the string “perl”:

% perl myrssmerger.pl --server http://disobey.com/cgi-bin/mt/mt-xmlrpc.cgi [RETURN]
--username morbus --password HAAHAHAH --blogid 1 --catid 4 --filter "perl" [RETURN]
http://camworld.com/index.rdf

Hacking the Hack

Actually, this is both a “hacking the hack” and “some things to consider” section. Right now, the biggest downside is that this hack works only on Movable Type. You could dive into Net::Blogger a bit and make it usable by Blogger (http://www.blogger.com), Radio Userland (http://radio.userland.com/), or any one of the other weblogging platforms.

This script is designed to run once a day, and it does a full download of each RSS feed every time it runs. As it stands, you should stick to that daily schedule, for two reasons:

  • If you run the script more than once a day, you’ll repeatedly download full RSS feeds, which can strain bandwidth for both you and the sites you’re polling.

  • The more often you run the script, the more likely you are to post duplicate items, since anything dated today is posted again on every run.

All right, let’s talk about a couple of actual hacks. First is error checking: as is, the script doesn’t check the URLs to make sure they start with http://. That’s easily solved; just add a quick check at the top of the URL loop:

# loop through each RSS URL.
foreach my $rss_url (@ARGV) {

    # not an HTTP URL.
    next unless $rss_url =~ m!^http://!;

    # download whatever we've got coming.

Next, the preface and the anteface (i.e., the text that surrounds the posted entry) are hardcoded into the script, but we can change that via a switch on the command line. First make the preface and anteface command-line options:

GetOptions(\%opts, 'server|s=s',      # the XML-RPC server URL to use.
                   'username|u=s',    # the Movable Type username.
                   'password|p=s',    # the Movable Type password.
                   'blogid|b=i',      # unique ID of your blog.
                   'catid|c=i',       # unique ID for posting category.
                   'showcategories',  # list categories for blog.
                   'filter|f=s',      # per item filter for posting?
                   'preface|r=s',     # the preface text before a posted item.
                   'anteface|a=s',    # the text included after a posted item.
               );

You’ll then need to make a change to the preface line:

my $preface = $opts{preface} || "From <a href=\"$clink\">$ctitle</a>:\n\n<blockquote>";

and a similar change to the anteface line:

my $anteface = $opts{anteface} 
    || "</blockquote>\n\n"; # new items as quotes.
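
With those two switches in place, a run might look something like this (the preface and anteface values are only examples):

% perl myrssmerger.pl --server http://disobey.com/cgi-bin/mt/mt-xmlrpc.cgi [RETURN]
--username morbus --password HAAHAHAH --blogid 1 --catid 1 [RETURN]
--preface "As seen elsewhere: <blockquote>" --anteface "</blockquote>" [RETURN]
http://gamegrene.com/index.xml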

See Also

Hack #70. Using the Link Cosmos of Technorati

Similar to other indexing sites like Blogdex, the Link Cosmos at Technorati keeps track of an immense number of blogs, correlating popular links and topics for all to see. With the recently released API, developers can now integrate the results into their own scripts.

Technorati (http://www.technorati.com) walks, crawls, investigates, and generally mingles around weblog-style web sites and indexes them, gathering loads of information. I mean loads: it keeps track of articles on the web site, what links to it, what it links to, how popular it is, how popular the web sites that link to it are, how popular the people that read it are, and who is most likely to succeed. Well, it does most of those things.

Need Some REST?

The current version of the Technorati API is a REST (Representational State Transfer) interface. REST interfaces transfer data through ordinary GET or POST requests to a URL. We will initially use the interface to access the Technorati Cosmos data. The Cosmos is the data set that keeps track of who links to whom; essentially, it records who thinks who is interesting. Technorati allows queries of the following information via the REST interface:

Link Cosmos

Who you link to, who links to whom, and when.

Blog info

General information about a specified weblog, including the weblog name, URL, RSS URL (if one exists), how many places it links to, how many places link to it, and when it was last updated. This is the same information that is returned for each weblog in the Cosmos lookup.

Outbound blogs

A list of web sites that the specified URL links to.

We’re going to focus on the Link Cosmos information, which in my bloated opinion is the most important. The following small piece of code uses the Technorati interface to grab the current weblog listing and print the resulting XML data that is returned from the Technorati interface. You’ll need to become a member of the site to receive your developer’s API key:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;

my $key       = "your developer key";
my $searchURL = "http://www.perceive.net/";
my $restAPI   = "http://api.technorati.com/cosmos?key=$key&url=".
                "$searchURL&type=weblog&format=xml";
my $xml = get($restAPI);
print "$xml\n";

Dave Sifry, the developer of Technorati, has also made a small distinction between general web sites and weblogs. Notice type=weblog in the URL of the previous code. You can change this to type=link, and you’ll get the last 20 web sites that link to your site, rather than just the last 20 blogs. This is a small distinction, but one that could be useful.
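
In code, that amounts to a one-word change to the query URL built in the previous listing:

my $restAPI   = "http://api.technorati.com/cosmos?key=$key&url=".
                "$searchURL&type=link&format=xml";

Either way, the shape of the response is the same.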

The returned result is a chunk of XML, which resembles this:

<item>
  <weblog>
    <name>phil ringnalda dot com</name>
    <url>http://philringnalda.com</url>
    <rssurl>http://www.philringnalda.com/index.xml</rssurl>
    <inboundblogs>339</inboundblogs>
    <inboundlinks>471</inboundlinks>
    <lastupdate>2003-07-11 21:09:28 GMT</lastupdate>
  </weblog>
</item>

Many REST interfaces use XML as the format for returning data to the requestor. This allows the data to be parsed easily and used in various ways, such as creating HTML for your web site:

use XML::Simple;
my $parsed_data = XMLin($xml);
my $items = $parsed_data->{document}->{item};

print qq{<ol>\n};
for my $item (@$items) {
    my ($weblog, $url) = ($item->{weblog}->{name}, $item->{weblog}->{url});
    print qq{<li><a href="$url">$weblog</a></li>};
}
print qq{</ol>};

First, we load the XML::Simple module, which will allow us to load the data into a hash. The XMLin function does this for us and returns a hash of hashes and arrays. After XMLin has loaded the data, we get an array of weblog items and iterate through it, printing some HTML with links to the web sites. We could just as easily have printed it as a comma-delimited file or anything else we could cook up in our silly little heads.
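
One caveat worth guarding against: when the Cosmos returns only a single item, XML::Simple hands back a plain hash reference rather than an array reference, and the @$items loop above will complain. XML::Simple's ForceArray option smooths that over; a minimal tweak (not in the original listing) looks like this:

use XML::Simple;

# Force 'item' to always come back as an array reference, even when
# Technorati returns a single match; otherwise @$items would blow up.
my $parsed_data = XMLin($xml, ForceArray => ['item']);
my $items = $parsed_data->{document}{item} || [];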

The most interesting part of all of this is the transfer and use of the information; Technorati allows us to see who has created links to our web site and use that data for free. Dave obviously learned how to share in kindergarten.

A Skeleton Key for Words

In addition to the lovely Cosmos API, Technorati provides us with an interface to query for weblog posts that contain a specified keyword. For instance, say you really like Perl; you can query the API periodically to get all the recent posts that contain “Perl.” I can imagine some handy uses for that: if you have keywords attached to posts in your weblog, you could have a Related Posts link that queries Technorati for other posts containing those keywords and shows a list of articles similar to yours.

The API to retrieve this information is also a REST interface, following the lead made by the Cosmos API. We can alter the code for the Cosmos API to provide access to this data:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;

my $key        = "your developer key";
my $searchTerm = "Perl";
my $restAPI    = "http://api.technorati.com/search?key=$key".
                 "&query=$searchTerm&format=xml";
my $xml = get($restAPI);
print "$xml\n";

Searching using the Keyword API returns more information in the XML stream, which gives some context to why it returned a match for a given item:

<context>
   <excerpt>
    Ben Trott has uploaded version 0.02 of XML::FOAF to CPAN.
    This is a<b>Perl</b> module designed to make it...
   </excerpt>
   <title>New version of XML::FOAF in CPAN</title>
   <link>http://rdfweb.org/mt/foaflog/archives/000033.html</link>
</context>

The returned data consists of an excerpt of words that appear near the keyword that was searched for (the keyword is also tagged as bold in the HTML—<b>Perl</b> in this example), the title of the article it was found in, and a URL to the item. The result also contains the same information about the weblog it was found in, such as inbound and outbound links.

We can slightly modify the previous code from the Cosmos API to display these related articles in a nice, concise format:

use XML::Simple;
my $parsed_data = XMLin($xml);
my $items = $parsed_data->{document}->{item};

print qq{<dl>\n};
for my $item (@$items) {
    my ($weblog, $context, $title, $link) =
      ($item->{weblog}->{name}, $item->{context}->{excerpt},
      $item->{context}->{title}, $item->{context}->{link});
    print qq{<dt><a href="$link">$weblog : $title</a></dt>};
    print qq{<dd>$context</dd>};
}
print qq{</dl>};

The Technorati API is a useful method for retrieving information about weblogs, and it can help in the aggregation of useful data. With the attention that is paid to Technorati, I’m sure that these interfaces will become even more robust and useful as the development progresses. With the information in this hack, you are capable of using and expanding on these interfaces, creating uses of the data that are even more interesting. Further information is available at the Technorati Developer Wiki (http://developers.technorati.com/wiki/) and mailing list (http://developers.technorati.com/mailman/listinfo/api-discuss).

—Eric Vitiello

Hack #71. Finding Related RSS Feeds

If you’re a regular reader of weblogs, you know that most syndicate their content in a format called RSS. By querying aggregated RSS databases, you can find related sites you may be interested in reading.

One of the fastest growing applications on the Web is the use of RSS feeds. Although there’s some contention regarding what RSS stands for (one definition of the acronym calls it “Really Simple Syndication” and another calls it “Rich Site Summary”), RSS feeds are XML documents that provide a feed of headlines from a web site (commonly a weblog or news site) that can be processed easily by a piece of software called a news aggregator. News aggregators let you subscribe to content from a multitude of web sites, sending the program out to check for new content rather than requiring you to go looking for it yourself.

RSS feeds are like potato chips, though. Once you subscribe to one, you find yourself grabbing one after another. It would be nice if you could supply a list of feeds you already read to a robot and have it go out and find related feeds in which you might also be interested.

Filling Up the Toolbox

We’re going to need a number of tools to get this script off the ground. Also, we’ll be calling on a couple of web services, namely those at Syndic8 (http://www.syndic8.com) and Technorati (http://www.technorati.com).

Syndic8 is a catalog of feeds maintained by volunteers, and it contains quite a bit of information on each feed. It also catalogs feeds for sites created by people other than the site owners, so even if a particular site might not have a feed, Syndic8 might be able to find one anyway. Also, Syndic8 employs several categorization schemes; so, given one feed, we might be able to find others in its category. Since Syndic8 offers an XML-RPC web service, we can call upon this directory for help.

Technorati is a search engine and a spider of RSS feeds and weblogs. Among other things, it indexes links between weblogs and feeds, and it maps the relationships between sites. So, while we’re looking for feeds, Technorati can tell us which sites link to each other. Since it supports a simple URL-based API that produces XML, we can integrate this into our script fairly easily.

Let’s gather some tools and start the script:

#!/usr/bin/perl -w
use strict;
use POSIX;
use Memoize;
use LWP::Simple;
use XMLRPC::Lite;
use XML::RSS;
use HTML::RSSAutodiscovery;

use constant SYNDIC8_ID => 'syndic8_id';
use constant FEED_URL   => 'feed_url';
use constant SITE_URL   => 'site_url';

This script starts off with some standard Perl safety features. The Memoize module is a useful tool we can use to cache the results of functions so that we aren’t constantly rerequesting information from web services. LWP::Simple allows us to download content from the Web; XMLRPC::Lite allows us to call on XML-RPC web services; XML::RSS allows us to parse and extract information from RSS feeds themselves; and HTML::RSSAutodiscovery gives us a few tricks to locate a feed for a site when we don’t know its location.

The rest of this preamble consists of a few constants we’ll use later. Now, let’s do some configuration:

our $technorati_key = "your Technorati key";
our $ta_url         = 'http://api.technorati.com';
our $ta_cosmos_url  = "$ta_url/cosmos?key=$technorati_key&url=";

our $syndic8_url = 'http://www.syndic8.com/xmlrpc.php';
our $syndic8_max_results = 10;

my @feeds =
  qw(
   http://www.macslash.com/macslash.rdf
   http://www.wired.com/news_drop/netcenter/netcenter.rdf
   http://www.cert.org/channels/certcc.rdf
  );

Notice that, like many web services, the Technorati API requires you to sign up for an account and be assigned a key string in order to use it (http://www.technorati.com//members/apikey.html). You might also want to check out the informal documentation for this service (http://www.sifry.com/alerts/archives/000288.html). After we set our API key, we construct the URL we’ll be using to call upon the service.

Next, we set up the URL for the Syndic8 XML-RPC service, as well as a limit we’ll use later for restricting the number of feeds we want the robot to look for at once.

Finally, we set up a list of favorite RSS feeds to use in digging for more feeds. With configuration out of the way, we have another trick to use:

map { memoize($_) }
  qw(
     get_ta_cosmos
     get_feed_info
     get_info_from_technorati
     get_info_from_rss
    );

This little map statement sets up the Memoize module for us so that the mentioned function names will have their results cached. This means that, if any of the four functions in the statement are called with the same parameters throughout the program, the results will not be recomputed but will be pulled from a cache in memory. This should save a little time and use of web services as we work.

Next, here’s the main driving code of the script:

my $feed_records = [];
for my $feed (@feeds) {
  my %feed_record = (url=>$feed);
  $feed_record{info}    = get_feed_info(FEED_URL, $feed);
  $feed_record{similar} = collect_similar_feeds($feed_record{info});
  $feed_record{related} = collect_related_feeds($feed_record{info});
  push @$feed_records, \%feed_record;
}

print html_wrapper(join("<hr />\n",
                   map { format_feed_record($_) }
                   @$feed_records));

This loop runs through each of our favorite RSS feeds and gathers records for each one. Each record is a hash, whose primary keys are info, similar, and related. info will contain basic information about the feed itself; similar will contain records about feeds in the same category as this feed; and related will contain records about feeds that have linked to items from the current feed.

Now, let’s implement the functions that this code needs.

Getting the Dirt on Feeds

The first thing we want to do is build a way to gather information about RSS feeds, using our chosen web services and the feeds themselves:

sub get_feed_info {
  my ($type, $id) = @_;
  return {} if !$id;

  my ($rss, $s_info, $t_info, $feed_url, $site_url);

  if ($type eq SYNDIC8_ID) {
    $s_info = get_info_from_syndic8($id) || {};
    $feed_url = $s_info->{dataurl};
  } elsif ($type eq FEED_URL) {
    $feed_url = $id;
  } elsif ($type eq SITE_URL) {
    my $rss_finder = new HTML::RSSAutodiscovery(  );
    eval {
      ($feed_url) = map { $_->{href} } @{$rss_finder->locate($id)};
    };
  }

  $rss = get_info_from_rss($feed_url) || {};
  $s_info ||= get_info_from_syndic8($feed_url) || {};
  $site_url = $rss->{channel}{link} || $s_info->{dataurl};

  $t_info = get_info_from_technorati($site_url);

  return {url=>$feed_url, rss=>$rss, syndic8=>$s_info, technorati=>$t_info};
}

This function gathers basic information on a feed. It accepts several different forms of identification for a feed: the Syndic8 feed internal ID number, the URL of the RSS feed itself, and the URL of a site that might have a feed. The first parameter indicates which kind of identification the function should expect (using the constants we defined at the beginning of the script), and the second is the identification itself.

So, we must first figure out a URL to the feed from the identification given. With a Syndic8 feed ID, the function tries to grab the feed’s record via the Syndic8 web service and then get the feed URL from that record. If a feed URL is given, great; use it. Otherwise, if a site URL is given, we use the HTML::RSSAutodiscovery module to look for a feed for this site.

Once we have the feed URL, we get and parse the feed, grab information from Syndic8 if we haven’t already, and then get feed information from Technorati. All of this information is then collected into a hash and returned. You might want to check out the documentation for the Syndic8 and Technorati APIs to learn what information each service provides on a feed.

Moving on, let’s see what it takes to get information from Syndic8:

sub get_info_from_syndic8 {
  my $feed_url = shift;
  return {} if !$feed_url;

  my $result = {};
  eval {
    $result = XMLRPC::Lite->proxy($syndic8_url)
      ->call('syndic8.GetFeedInfo', $feed_url)->result(  ) || {};
  };
  return $result;
}

Here, we expect a feed URL and return empty-handed if one isn’t given. If a feed URL is given, we simply call the Syndic8 web service method syndic8.GetFeedInfo with the URL to our feed and catch the results. One thing to note is that we wrap this call in an eval block, which keeps any otherwise fatal errors in the call or in the XML parsing from exiting the script. In the case of such an error, we simply return an empty record.
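
If you would like to know why a particular lookup came back empty, the same pattern can report the trapped error before moving on. A small variation, not part of the original script, might be:

eval {
    $result = XMLRPC::Lite->proxy($syndic8_url)
      ->call('syndic8.GetFeedInfo', $feed_url)->result(  ) || {};
};
# $@ holds the error message if the eval'd block died.
warn "Syndic8 lookup for $feed_url failed: $@" if $@;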

Grabbing information from Technorati is a little more complex, if only because we’ll be parsing the XML resulting from calls without the help of a convenience package such as XMLRPC::Lite. But let’s get on with that:

sub get_info_from_technorati {
  my $site_url = shift;
  return {} if !$site_url;

  my $xml = get_ta_cosmos($site_url);

  my $info = {};
  if ($xml =~ m{<result>(.*?)</result>}mgis) {
    my $xml2 = $1;
    $info = extract_ta_bloginfo($xml2);
  }
  return ($info->{lastupdate} =~ /1970/) ? {} : $info;
}

Here, we make a request to the web service’s cosmos method with the site URL parameter. Using a regular expression, we look for the contents of a result tag in the response to our query and call upon a convenience function to extract the XML data into a hash. We also check to make sure the date doesn’t contain 1970, a value that occurs when a record isn’t found.

The implementation of our first convenience function goes like so:

sub get_ta_cosmos {
  my $url = shift;
  return get($ta_cosmos_url.$url);
}

This is just a simple wrapper around LWP::Simple’s get function, done so that we can memoize it without interfering with other modules’ use of the same function. Next, here’s how to extract a hash from the XML data:

sub extract_ta_bloginfo {
  my $xml = shift;
  my %info = (  );

  if ($xml =~ m{<weblog>(.*?)</weblog>}mgis) {
    my ($content) = ($1||'');
    while ($content =~ m{<(.*?)>(.*?)</\1>}mgis) {
      my ($name, $val) = ($1||'', $2||'');
      $info{$name} = $val;
    }
  }

  return \%info;
}

With another couple of regular expressions, we look for the weblog tag in a given stream of XML and extract all of the tags it contains into a hash. Hash keys are tag names, and the values are the contents of those tags. The resulting hash contains basic information about a weblog cited in the Technorati results. We’ll also use this in another function in a little bit.

We can extract information from both services, but how about feeds themselves? We can extract feeds with a simple function:

sub get_info_from_rss {
  my $feed_url = shift;
  return {} if !$feed_url;

  my $rss = new XML::RSS(  );
  eval {
    $rss->parse(get($feed_url));
  };
  return $rss;
}

Again, we expect a feed URL and return empty handed if one is missing. If a feed URL is given, we download the contents of that URL and use the XML::RSS module to parse the data. Notice that we use another eval statement to wrap this processing so that parsing errors do not exit our script. If everything goes well, we return an instance of XML::RSS.

Our basic feed information-gathering machinery is in place now. The next thing to tackle is gathering feeds. Let’s start with employing the Technorati API to find feeds that have referred to a given feed:

sub collect_related_feeds {
  my $feed_info = shift;
  my $site_url = $feed_info->{rss}{channel}{link} || $feed_info->{url};
  my %feeds = (  );

We start off by expecting a feed information record, as produced earlier by our get_feed_info function. From this record, we get the site URL for which the feed is a summary. We try two options: first, we look for the site link inside the parsed RSS feed itself; failing that, we fall back to the feed URL from the record, so that we at least have something to go on.

With that, we call on the Technorati API to get a list of related feeds:

  my $xml = get_ta_cosmos($site_url);
  while ($xml =~ m{<item>(.*?)</item>}mgis) {
    my $xml2 = $1;
    my $ta_info = extract_ta_bloginfo($xml2);

    my $info = ($ta_info->{rssurl} ne '') ?
      get_feed_info(FEED_URL, $ta_info->{rssurl}) :
      get_feed_info(SITE_URL, $ta_info->{url});

With our previous call to the Technorati API, we were gathering information about a feed. This time, we’re using the same call to gather information about related feeds. Thanks to Memoize, we should be able to reuse the results of a given API call for the same site URL over and over again, though we actually call upon the API only once.

So, we use a regular expression to iterate through item tags in the resulting data and extract weblog information from each result. Then, we check to see if a URL to this weblog’s RSS feed was supplied. If so, we use it to get a feed record on this site; otherwise, we use the site URL and try to guess where the feed is.

After getting the record, we grab the rest of the information in the item tag:

    $info->{technorati} = $ta_info;

    while ($xml2 =~ m{<(.*?)>(.*?)</\1>}mgis) {
      my ($name, $val) = ($1||'', $2||'');
      next if $name eq 'weblog';
      $info->{technorati}{$name} = $val;
    }

Once more, we use a regular expression to convert from tag names and contents to a hash. The hash contains information about the weblog’s relationship to the feed we’re considering, among other things.

To finish up, let’s add this record to a hash (to prevent duplicate records) and return that hash when we’re all done:

      $feeds{$info->{url}} = $info;
  }

  return \%feeds;
}

The returned hash will contain feed URLs as keys and feed records as values. Each of these feeds should be somewhat related to the original feed, if only because they linked to its content at one point.

Now, let’s go on to use the Syndic8 API to find feeds in a category:

sub collect_similar_feeds {
  my $feed_info = shift;
  my %feeds = (  );

  my $categories = $feed_info->{syndic8}->{Categories} || {};
  for my $cat_scheme (keys %{$categories}) {
    my $cat_name = $categories->{$cat_scheme};

The first thing we do is expect a feed information record and try to grab a list of categories from it. This will be a hash whose keys are codes that identify categorization schemes and whose values identify category titles. We’ll loop through each of these pairs and gather feeds in each category:

    my $feeds = XMLRPC::Lite->proxy($syndic8_url)
      ->call('syndic8.GetFeedsInCategory', $cat_scheme, $cat_name)
        ->result(  ) || [];

    # Limit the number of feeds handled in any one category
    $feeds = [ @{$feeds}[0..$syndic8_max_results] ]
      if (scalar(@$feeds) > $syndic8_max_results);

Once we have a category scheme and title, we call on the Syndic8 API web service to give us a list of feeds in this category. This call returns a list of internal Syndic8 feed ID numbers, which is why we built in the ability to use them to locate feeds earlier, in our get_feed_info function. Also, we limit the number of results used, based on the configuration variable at the beginning of the script.

Next, let’s gather information about the feeds we’ve found in this category:

    for my $feed (@$feeds) {
      my $feed_info = get_feed_info(SYNDIC8_ID, $feed);
      my $feed_url = $feed_info->{syndic8}{dataurl};
      next if !$feed_url;
      $feeds{"$cat_name ($cat_scheme)"}{$feed_url} = $feed_info;
    }
  }

  return \%feeds;
}

Using the Syndic8 feed ID returned for each feed, we get a record for each and add it to a hash whose keys are based on the category and the feed URL. This is an attempt to make sure there is a list of unique feeds for each category. Finally, we return the results of this process.

Reporting on Our Findings

At this point, we can gather information about feeds and use the Syndic8 and Technorati APIs to dig for feeds in similar categories and feeds related by linking. Now, let’s produce an HTML page for what we find for each of our favorite feeds:

sub html_wrapper {
  my $content = shift;
  return qq^
    <html>
      <head>
        <title>Digging for RSS feeds</title>
      </head>
      <body>
        $content
      </body>
    </html>
    ^;
}

We just put together a simple HTML shell here to contain our results. It wraps whatever content it is given with a simple HTML skeleton. The next step, since our basic unit of results is the feed information record, is to come up with a means of formatting one:

sub format_feed_info {
  my $info = shift;
  my ($feed_url, $feed_title, $feed_link) =
    ($info->{url}, feed_title($info), feed_link($info));
  return qq^<a href="$feed_link">$feed_title</a>
    (<a href="$feed_url">RSS</a>)^;
}

This doesn’t do much with the wealth of data contained in a feed information record, but for now we simply construct a link to the site and a link to the feed. We’ll use this to format the results of our digging for a given feed:

sub format_feed_record {
  my $record = shift;
  my $out = '';
  $out .= qq^
    <div class="record">
      ^;

  $out .= qq^<h2 class="main_feed">^.
    format_feed_info($record->{info})."</h2>\n";

The first thing we do here is open a div tag to contain these particular record results. Then, we format the record that describes the favorite feed under investigation. Next, we format the results of looking for related feeds:

  my $related = $record->{related};
  if (keys %{$related}) {
    $out .= "<h3>Feeds related by links:</h3>\n<ul>\n";
    $out .= join
      ('',
       map { "<li>".format_feed_info($related->{$_})."</li>\n" }
       sort keys %{$related})."\n\n";
    $out .= "</ul>\n";
  }

This produces a bulleted list of feeds discovered, as related by linking to our feed. Next, we include the feeds related by category:

  my $similar = $record->{similar};
  if (keys %{$similar}) {
    $out .= "<h3>Similar feeds by category:</h3>\n<ul>\n";
    for my $cat (sort keys %{$similar}) {
      $out .= "<li>$cat\n<ul>";
      $out .= join
        ('',
         map { "<li>".format_feed_info($similar->{$cat}{$_})."</li>\n" }
         sort keys %{$similar->{$cat}})."\n\n";
      $out .= "</ul>\n</li>\n";
    }
    $out .= "</ul>\n";
  }

A little bit more involved, this produces a set of nested lists, with the outer bullets describing categories and the inner bullets describing feeds belonging to the categories. Finally, let’s wrap up our results:

  $out .= qq^
    </div>
      ^;

  return $out;
}

We now have just a few loose ends to tie up. Some feed titles have a bit of extra whitespace in them, so we’ll need to tidy that:

sub trim_space {
  my $val = shift;
  $val=~s/^\s+//;
  $val=~s/\s+$//g;
  return $val;
}

And, since there’s a lot of variability in our results as to where a feed’s title is, we employ several options in grabbing it:

sub feed_title {
  my $feed_info = shift;
  return trim_space
    (
     $feed_info->{rss}{channel}{title} ||
     $feed_info->{syndic8}{sitename} ||
     $feed_info->{technorati}{name} ||
     $feed_info->{url} ||
     '(untitled)'
    );
}

As with the title, there are many places where a link to the feed can be found, so we do something similar with it:

sub feed_link {
  my $feed_info = shift;
  return trim_space
    (
     $feed_info->{rss}{channel}{link} ||
     $feed_info->{syndic8}{siteurl} ||
     $feed_info->{technorati}{url} ||
     $feed_info->{url} ||
     ''
    );
}

Figure 4-7 shows a sample of the generated HTML results.

Figure 4-7. A sampling of possibly related sites

With the use of two web services, we have a pretty powerful robot with which to dig for more interesting feeds. This hack makes quite a few calls to web services, so, although you might want to run it every now and then to find updates, you might want to go easy on it.

Hacking the Hack

A few things are left as exercises for the reader. Most notably, we don’t make much use of all the information gathered into a feed information record. In our report, we simply display a link to a site and a link to its feed. In fact, this record also contains all the most recent headlines for a feed, as well as the wealth of information provided by the Syndic8 and Technorati APIs. With some homework, this tool could be expanded even further to make use of all of this additional information.

—l.m.orchard

Hack #72. Automatically Finding Blogs of Interest

An easy way to find interesting new sites is to peruse an existing site’s blogroll: a listing of blogs they read regularly. Let’s create a spider to automate this by looking for keywords in the content of outbound links.

I enjoy reading blogs, but with the demands of the day, I find it difficult to read the dozen or so I like most, let alone discover new ones. I often have good luck when clicking through the blogrolls of writers I enjoy.

I decided to set out and automate this process, by creating a script that starts at one of my favorite sites and then visits each outbound link that site has to offer. As the script downloads each new page, it’ll look through the content for keywords I’ve defined, in hopes of finding a new daily read that matches my own interests.

The Code

Save the following script as blogfinder.pl:

#!/usr/bin/perl  -w
use strict; $|++;

use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

# where should results go?
my $result_file  = "./result.html";
my $keywords_reg = qr/pipe-delimited search terms/;
my $starter_url  = "your favorite blog here";

# open and create the result.html file.
open(RESULT, ">$result_file") or die "Couldn't create: $!\n";
print RESULT "<html><head><title>Spider Findings</title></head><body>\n";

# our workhorse for access.
my $ua = LWP::UserAgent->new;
print "\nnow spidering: $starter_url\n";

# begin our link searching. LinkExtor takes a 
# subroutine argument to handle found links,
# and then the actual data of the page. 
HTML::LinkExtor->new(
  sub {
        my ($tag, %attr) = @_;
        return if $tag ne 'a';

        # make any href relative link into
        # an absolute value, and add to an
        # internal list of links to check out.
        my @links = map { url($_, $starter_url)->abs(  ) }
                      grep { defined } @attr{qw/href/};

        # make 'em all pretty...
        foreach my $link (@links) {
           print " + $link\n"; # hello!
           my $data = $ua->get($link)->content;
           if ($data =~ m/$keywords_reg/i) {
              open(RESULT, ">>$result_file");
              print RESULT "<a href=\"$link\">$link</a><br>\n";
              close(RESULT); # one match printed, yes!
           }
        }

# and now, the actual content that
# HTML::LinkExtor goes through...
})->parse(
  do {
     my $r = $ua->get($starter_url);
     $r->content_type eq "text/html" ? $r->content : "";
  }
);

print RESULT "</body></html>";
close RESULT; exit;

Once the LWP::UserAgent [Hack #10] object is created, we drop into the workhorse of the spider: the HTML::LinkExtor object (http://search.cpan.org/~gaas/HTML-Parser-3.34/lib/HTML/LinkExtor.pm) and its callback. The seed page is fetched first, and as HTML::LinkExtor parses it, each link it extracts is passed, in turn, to the callback, which downloads the linked page, looks for the magic keywords, and makes note of any matches in the newly created result.html file.

When the spider has finished its run, you will be left with an HTML file that contains links that match your search criteria. There is, of course, room for refinement. However, one thing I enjoy about this script is the subtle entropy that seems to arise in it. Through this unintended randomness, I am able to discover blogs I would never have discovered by other means. More often than not, such a discovery is one I would rather not have made. But every now and then, a real gem can be seen gleaming at the bottom of the trash heap that is so often our beloved Internet.

Running the Hack

The first thing you should do is replace the two lines at the top of the script with your favorite blog URL and a pipe-delimited (|) list of values, like so:

my $keywords_reg = qr/foaf|perl|os x/;
my $starter_url  = "http://myfavoriteblog.com";

The pipe is the equivalent of OR, so these lines mean “Spider myfavoriteblog.com and search for foaf OR perl OR os x.” If you know regular expressions, you can modify this even further to check for word boundaries (so that perl would not match amityperl, for instance). Once these two lines are configured, run the script, like so:

% perl blogfinder.pl
now spidering: http://www.myfavoriteblog.com
 + http://myfavoriteblog.com
 + http://www.luserinterface.net/index.cgi/colophon/
 + mailto:saf@luserinterface.net
 + http://jabber.org/
 + http://sourceforge.net/projects/gaim/
 + http://scottfallin.com/hacks/popBlosx.text

Once the script is finished spidering the outbound links, you’ll have a new file in the current directory, with a list of URLs that match your keyword criteria.

Hacking the Hack

There are a few ways to modify the hack, the most interesting of which is to add another level of link crawling to begin creating "blog neighborhoods” similar to the idea of "Six Degrees of Kevin Bacon” (http://www.wired.com/news/culture/0,1284,49343,00.html; see also an implementation by Mark Pilgrim based on Google search results: http://diveintomark.org/archives/2002/06/04/who_are_the_people_in_your_neighborhood). One of the easiest additions, however, involves stopping the spider from indexing more data than necessary.

As you can see from the sample output, the spider will look at any URI that has been put into an HTML A tag, which could involve email addresses, IRC and FTP servers, and so forth. Since the spider isn’t equipped to handle those protocols, telling it to skip over them is a simple modification:

foreach my $link (@links) {
    next unless $link =~ /^http/i;
    print " + $link\n"; # hello!
    my $data = $ua->get($link)->content;

Other possibilities could restrict spidering to third-party sites only (since you’re not interested in spidering your favorite site, but rather the sites it links to) or add an upper limit to the number of sites spidered (i.e., “spider as much as you can, to a maximum of 200 sites”).
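
As a rough sketch of that last idea (the $max_sites name and its value are made up for illustration), declare a counter alongside the other configuration variables at the top of the script and bail out of the callback once the cap is reached:

# near the other configuration variables:
my $max_sites = 200;   # hypothetical upper limit on pages fetched
my $fetched   = 0;

# then, inside the callback's foreach loop:
foreach my $link (@links) {
    next unless $link =~ /^http/i;          # skip mailto:, ftp:, and friends.
    return if ++$fetched > $max_sites;      # enough already for this run.
    print " + $link\n"; # hello!
    my $data = $ua->get($link)->content;
    if ($data =~ m/$keywords_reg/i) {
        open(RESULT, ">>$result_file");
        print RESULT "<a href=\"$link\">$link</a><br>\n";
        close(RESULT); # one match printed, yes!
    }
}

HTML::LinkExtor will still finish parsing the seed page, but no further pages are fetched once the counter passes the limit.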

—Scott Fallin

Hack #73. Scraping TV Listings

Freeing yourself from flipping through a weekly publication by visiting the TV Guide Online web site might sound like a good idea, but being forced to load heavy pages that show only a couple of hours at a time, cluttered with channels you don’t care about, isn’t exactly the utopia for which you were hoping.

To grab the latest TV listings from TV Guide Online (http://www.tvguide.com), we could write an HTML scraper from scratch using HTML::TableExtract [Hack #67] and similar modules, or we could go Borg on a script called tvlisting and assimilate it into our collective consciousness. Why reinvent the wheel if you don’t have to, right? The author of tvlisting, Kurt V. Hindenburg, has extensively reverse-engineered TV Guide Online’s dynamic site and created a script that can pull down all the TV listings for a whole day and output them in several different formats, including XML.

Grab tvlisting from http://www.cherrynebula.net/projects/tvlisting/tvlisting.html and follow the terse documentation to get it running on your platform. There are tons of options you can use when running tvlisting, most of which we won’t cover for sake of brevity. So, snoop around in the tvlisting code, as well as the included sample_rc file, and check out the various options available. For our purposes, we’ll modify the sample_rc file and use command-line arguments when we call the script. Open the sample_rc file and save it as tvlisting_config; then we’ll get started. Let’s look at a small portion of our new tvlisting_config file:

## To use this script as a
## CGI; please read CGI.txt
## Choices : $TRUE, $FALSE
$options{USE_CGI} = $FALSE;

## Choices : WGET, LYNX, CURL, LWPUSERAGENT
$options{GET_METHOD} = qw(LWPUSERAGENT);

## Choices : HTML, TEXT, LATEX, XAWTV, XML
$options{OUTPUT_FORMAT} = qw(XML);

## Choices : TVGUIDE
$options{INPUT_SOURCE} = qw(TVGUIDE);

### Attributes dealing with channels.
## Should channels be run through the filter?
## Choices : $TRUE, $FALSE
$options{FILTER_CHANNELS} = $TRUE;

## Filter by NAME and/or NUMBER?
$options{FILTER_CHANNELS_BY_NAME} = $FALSE;
$options{FILTER_CHANNELS_BY_NUMBER} = $TRUE;

## List of channels to OUTPUT
$options{FILTER_CHANNELS_BY_NAME_LIST} = 
   ["WTTV", "WISH", "WTHR", "WFYI", "WXIN", "WRTV", "WNDY", "WIPX"];

$options{FILTER_CHANNELS_BY_NUMBER_LIST} = 
   [qw( 2 3 4 5 6 7 9 11 12 14 15 16 18 28 29 30 31 32
        33 34 35 36 37 38 39 49 50 53 55 71 73 74 75 78)];

## Your personal Service ID, used by
## tvguide.com to localize your listings.
$options{SERVICE_ID} = 359508;

As you can see, there are many options available (the preceding listing is about half of what you’d see in a normal configuration file). Starting from the top, I set USE_CGI to $FALSE, GET_METHOD to LWPUSERAGENT, and OUTPUT_FORMAT to XML. You may have noticed that you can output to HTML as well, but I’m not crazy about the quality of its HTML output. The FILTER_ options allow us to choose only the channels we are interested in, rather than having to weed through hundreds of useless entries to find what we’re looking for. The most important option, SERVICE_ID, is what TV Guide Online uses to specify the stations and channel numbers that are available in your area. Without this option set correctly, you’ll receive channels that do not map to the channels on your TV, and that’s no fun. The Readme.txt file has some further information on how to hunt this ID down.

After configuration, it’s simply a matter of running the script to get an output of the current hour’s listings for just the channels you’re interested in. If you specified TEXT output, you’ll see something like this (severely truncated for readability):

% bin/tvlisting
            6:30 PM             7:00 PM             7:30 PM   
           +---------+---------+---------+---------+---------+
76 WE      Felicity             Hollywood Wives
77 OXYGN   Can You Tell?        Beautiful

The XML output format produces the following snippet, which is readily parseable:

% bin/tvlisting
<Channel Name="TOON" Number="53">
  <Shows Title="Dexter's Laboratory" Sequence="1" Duration="6" />
  <Shows Title="Ed, Edd n Eddy" Sequence="2" Duration="6" />
  <Shows Title="Courage the Cowardly Dog" Sequence="3" Duration="6" />
  <Shows Title="Pokemon" Sequence="4" Duration="6" />
</Channel>
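
Since the output is plain XML, pulling the titles back out takes only a few lines. Here's a rough sketch; it assumes the full output wraps its <Channel> elements in a single root element and that you have saved the output to a file first (the showtitles.pl name is just for illustration):

#!/usr/bin/perl -w
# e.g.:  bin/tvlisting > listings.xml && perl showtitles.pl listings.xml
use strict;
use XML::Simple;

my $xml  = do { local $/; <> };   # slurp the whole file (or STDIN).
my $data = XMLin($xml, ForceArray => ['Channel', 'Shows']);

foreach my $channel (@{ $data->{Channel} }) {
    print "$channel->{Number} $channel->{Name}\n";
    print "   $_->{Title}\n" foreach @{ $channel->{Shows} };
}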

Even though you can filter by channels within tvlisting, there doesn’t seem to be a way to filter by type of program, such as all “horror” movies or anything with Mister Miyagi. For that, we’d have to build our own quick scraper.

The Code

Save the following code as tvsearch.pl:

#!/usr/bin/perl -w
use strict;
use Getopt::Long;
use LWP::Simple;
use HTML::TableExtract;
my %opts;

# our list of tvguide.com categories.
my @search_categories = ( qw/ action+%26+adventure adult Movie
                              comedy drama horror mystery+%26+suspense
                              sci-fi+%26+paranormal western Sports
                              Newscasts+%26+newsmagazines health+%26+fitness
                              science+%26+technology education Children%27s
                              talk+%26+discussion soap+opera
                              shopping+%26+classifieds music / );

# instructions for if the user doesn't
# pass a search term or category. bah.
sub show_usage {
 print "You need to pass either a search term (--search)\n";
 print "or use one of the category numbers below (--category):\n\n";
 my $i=1; foreach my $cat (@search_categories) {
    $cat =~ s/\+/ /g; $cat =~ s/%26/&/; $cat =~ s/%27/'/;
    print "  $i) ", ucfirst($cat), "\n"; $i++;
 } exit;
}

# define our command-line flags (long and short versions).
GetOptions(\%opts, 'search|s=s',      # a search term.
                   'category|c=s',    # a search category.
); unless ($opts{search} || $opts{category}) { show_usage; }

# create some variables for use at tvguide.com.
my ($day, $month) = (localtime)[3..4]; $month++;
my $start_time = "8:00";         # this time is in military format
my $time_span  = 20;             # number of hours of TV listings you want
my $start_date = "$month\/$day"; # set the current month and day
my $service_id = 61058;          # our service id (see tvlisting readme)
my $search_phrase = undef;       # final holder of what was searched for
my $html_file = undef;           # the downloaded data from tvguide.com
my $url = 'http://www.tvguide.com/listings/search/SearchResults.asp';

# search by category.
if ($opts{category}) {
   my $id = $opts{category}; # convenience.
   die "Search category must be a number!" unless $id =~ /\d+/;
   die "Category ID was invalid" unless ($id >= 1 && $id <= 19);
   $html_file = get("$url?l=$service_id&FormCategories=".
                    "$search_categories[$id-1]");
   die "get(  ) did not return as we expected.\n" unless $html_file;
   $search_phrase = $search_categories[$id-1];
}
elsif ($opts{search}) { 
   my $term = $opts{search}; # convenience.
   $html_file = get("$url?I=$service_id&FormText=$term");
   die "get(  ) did not return as we expected.\n" unless $html_file;
   $search_phrase = $term;
}

# now begin printing out our matches.
print "Search Results for '$search_phrase':\n\n";

# create a new table extract object and pass it the
# headers of the tvguide.com table in our data. 
my $table_extract =
   HTML::TableExtract->new(
        headers => ["Date","Start Time", "Title", "Ch#"],
            keep_html => 1 );
$table_extract->parse($html_file);

# now, with our extracted table, parse.
foreach my $table ($table_extract->table_states) {
    foreach my $cols ($table->rows) {

        # this is not the best way to do this...
        if(@$cols[0] =~ /Sorry your search found no matches/i)
          { print "No matches to found for your search!\n"; exit; }

        # get the date.
        my $date = @$cols[0];
        $date =~ s/<.*>//g;       $date =~ s/\s*//g;
        $date =~ /(\w*)\D(\d*)/g; $date = "$1/$2";

        # get the time.
        my $time = @$cols[1];
        $time =~ m/(\d*:\d*\s+\w+)/;
        $time = $1;

        # get the title, detail_url, detail_number, and station.
        @$cols[2] =~ /href="(.*\('\d*','(\d*)','\d*','\d*','(.*)',.*)"/i;
        my ($detail_url, $detail_num, $channel) = ($1, $2, $3);
        my $title = @$cols[2]; $title =~ s/<.*>//g;
        $title =~ /(\b(.*)\b)/g; $title = $1;

        # get channel number
        my $channel_num = @$cols[3];
        $channel_num =~ m/>\s*(\d*)\s*</;
        $channel_num = $1;

        # turn the evil Javascript URL into a normal one.
        $detail_url =~ /javascript:cu\('(\d+)','(\d+)'/;
        my $iSvcId = $1; my $iTitleId = $2;
        $detail_url = "http://www.tvguide.com/listings/".
                      "closerlook.asp?I=$iSvcId&Q=$iTitleId";

        # now, print the results.
        print " $date at $time on chan$channel_num ($channel): $title\n";
        print "    $detail_url\n\n";
    }
}

Running the Hack

A search for Farscape looks something like this:

% perl tvsearch.pl --search  farscape
Search Results for 'farscape':

 Mon/28 at 12:00 AM on chan62 (SCI-FI): Farscape: What Was Lost: Sacrifice
    http://www.tvguide.com/listings/closerlook.asp?I=61058&Q=3508575

 Mon/4 at 12:00 AM on chan62 (SCI-FI): Farscape: What Was Lost: Resurrection
    http://www.tvguide.com/listings/closerlook.asp?I=61058&Q=3508576

—William Eastler

Hack #74. What’s Your Visitor’s Weather Like?

You have a web site, as most people do, and you’re interested in getting a general idea of what your visitors’ weather is like. Want to know if you get more comments when it’s raining or sunny? With the groundwork laid in this hack, that and other nonsense will be readily available.

When you’re spidering, don’t consider only data available on the Web. Sometimes, the data is right under your nose, perhaps on your own server or even on your own hard drive [Hack #82]. This hack demonstrates the large amount of information available, even when you have only a small amount of your own data to start with. In this case, we’re looking at a web server’s log file, taking the IP address of the last few visitors’ sites, using one database to look up the geographical location of that IP address, and then using another to find the weather there. It’s a trivial example, perhaps, but it’s also quite nifty. For example, you could easily modify this code to greet visitors to your site with commiserations about the rain.

For the geographical data, we’re going to use the Perl interface to the CAIDA project (http://www.caida.org/tools/utilities/netgeo/NGAPI/index.xml); for the weather data, we’re using the Weather::Underground module, which utilizes the information at http://www.wunderground.com.

The Code

Copy this code, changing the emphasized line to reflect the path to your Apache installation’s access_log. Here, mine is in the same directory as the script:

#!/usr/bin/perl -w
#
# Ben Hammersley ben@benhammersley.com
# Looks up the real-world location of visiting IPs
# and then finds out the weather at those places
#

use strict;
use CAIDA::NetGeoClient;
use Weather::Underground;
use Geography::Countries;

my $apachelogfile = "access_log";
my $numberoflines = 10;
my $lastdomain    = "";

# Open up the logfile.
open (LOG, "<$apachelogfile") or die $!;

# Place all the lines of the logfile
# into an array, but in reverse order.
my @lines = reverse <LOG>;

# Start our HTML document.
print "<h2>Where my last few visitors came from:</h2>\n<ul>\n";

# Go through each line one
# by one, setting the variables.
my $i; foreach my $line (@lines) {
    my ($domain,$rfc931,$authuser,$TimeDate,
        $Request,$Status,$Bytes,$Referrer,$Agent) =
        $line =~ /^(\S+) (\S+) (\S+) \[([^\]\[]+)\] \"([^"]*)\" (\S+) (\S+) \"?([^"]*)\"? \"([^"]*)\"/o;

    # If this record is one we saw
    # the last time around, move on.
    next if ($domain eq $lastdomain);

    # And now get the geographical info.
    my $geo     = CAIDA::NetGeoClient->new(  );
    my $record  = $geo->getRecord($domain);
    my $city    = ucfirst(lc($record->{CITY}));
    my $region  = "";

    # Check to see if there is a record returned at all.
    unless ($record->{COUNTRY}) { $lastdomain = $domain; next; }

    # If city is in the U.S., use the state as the "region". 
    # Otherwise, use Geography::Countries to munge the two letter
    # code for the country into its actual name. (Thanks to
    # Aaron Straup Cope for this tip.)
    if ($record->{COUNTRY} eq "US") {
        $region = ucfirst(lc($record->{STATE}));
    } else { $region = country($record->{COUNTRY}); }

    # Now get the weather information.
    my $place   = "$city, $region";
    my $weather = Weather::Underground->new(place => $place);
    my $data    = $weather->getweather(  );
    next unless $data; $data = $data->[0];

    # And print it for our HTML.
    print " <li>$city, $region where it is $data->{conditions}.</li>\n";

    # Record the last domain name
    # for the repeat prevention check
    $lastdomain = $domain;

    # Check whether you're not at the limit, and if you are, finish.
    if ($i++ >= $numberoflines-1) { last; }
}

print "</ul>";

The code loads up the access_log, reverses it to put the last accesses at the top, and then goes through the resulting list, line by line. First, it runs the line through a regular expression:

my ($domain,$rfc931,$authuser,$TimeDate,$Request,$Status,$Bytes,$Referrer,$Agent) =
    $line =~ /^(\S+) (\S+) (\S+) \[([^\]\[]+)\] \"([^"]*)\" (\S+) (\S+) \"?([^"]*)\"? \"([^"]*)\"/o;

This splits the line into its different sections and is based on Apache’s combined log format. We’ll be using only the first variable (the domain itself) from these results, but, because this regular expression is so useful, I include it for your cannibalistic pleasure.

Anyhow, we take the domain and pass it to the CAIDA module, retrieving a result and checking whether that result is useful. If it’s not useful, we go to the next line in the access_log. This highlights an important point when using third-party databases: you must always check for a failed query. Indeed, it might even be a good idea to treat a successful query as the exception rather than the rule.

Assuming we have a good result, we need to detect whether the country is the U.S. If it is, we make $region the value of the U.S. state; otherwise, we pass the two-letter country code to the country function from the Geography::Countries module, which converts it into the country’s full name.

Running the Hack

Here’s a typical run of the script, invoked on the command line:

% perl weather.pl
<h2>Where my last few visitors came from:</h2>

<ul>
 <li>London, UK, where it is cloudy</li>
 <li>New York, NY, where it is sunny</li>
</ul>

Using and Hacking the Hack

I have this script installed on my weblog using an Apache server-side include. This is probably a bad idea, given the potential for slow responses from CAIDA and Weather Underground, but it does allow for completely up-to-date information. A more sensible approach might be to change the script to produce a static file and run it from cron [Hack #90] every few minutes.

If you’re sure of fast responses, and if you have a dynamically created page, it would be fun to customize that page based on the weather at the reader’s location. Pithy comments about the rain are always appreciated. Tweaking the Weather Underground response to give you the temperature instead of a descriptive string creates the possibility of dynamically selecting CSS stylesheets, so that colors change based on the temperature. Storing the weather data over a period of time gives you the possibility of creating an “average readership temperature” or the amount of rain that has fallen on your audience this week. These would be fun statistics for some and perhaps extremely useful for others.
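
As a sketch of the stylesheet idea, and assuming the hash returned by getweather(  ) carries a temperature_fahrenheit key alongside conditions (check the Weather::Underground documentation for your installed version), the selection could be as simple as:

# pick a stylesheet based on the visitor's local temperature.
my $temp       = $data->{temperature_fahrenheit};  # assumed key name.
my $stylesheet = "mild.css";
if    (defined $temp && $temp >= 85) { $stylesheet = "hot.css";  }
elsif (defined $temp && $temp <= 40) { $stylesheet = "cold.css"; }
print qq{<link rel="stylesheet" type="text/css" href="$stylesheet" />\n};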

—Ben Hammersley

Hack #75. Trendspotting with Geotargeting

Compare the relative popularity of a trend or fashion in different locations, using only Google and Directi search results.

One of the latest buzzwords on the Internet is geotargeting, which is just a fancy name for the process of matching hostnames (e.g., www.oreilly.com) to addresses (e.g., 208.201.239.36) to country names (e.g., USA). The whole thing works because there are people who compile such databases and make them readily available. This information must be compiled by hand or at least semiautomatically, because the DNS system that resolves hostnames to addresses does not store it in its distributed database.

While it is possible to add geographic location data to DNS records, it is highly impractical to do so. However, since we know which addresses have been assigned to which businesses, governments, organizations, or educational establishments, we can assume with a high probability that the geographic location of the institution matches that of its hosts, or at least most of them. For example, if the given address belongs to the range of addresses assigned to British Telecom, then it is highly probable that it is used by a host located within the territory of the United Kingdom.

Why go to such lengths when a simple DNS lookup (e.g., nslookup 208.201.239.36) gives the name of the host, and in that name we can look up the top-level domain (e.g., .pl, .de, or .uk) to find out where this particular host is located? There are four good reasons for this:

  • Not all lookups on addresses return hostnames.

  • A single address might serve more than one virtual host.

  • Some country domains are registered by foreigners and hosted on servers on the other side of the globe.

  • .com, .net, .org, .biz, or .info domains tell us nothing about the geographic location of the servers they are hosted on.

That’s where geotargeting can help.

Geotargeting is by no means perfect. For example, if an international organization like AOL gets a large chunk of addresses that it uses not only for servers in the USA, but also in Europe, the European hosts might be reported as being based in the U.S. Fortunately, such aberrations do not constitute a large percentage of addresses.

The first users of geotargeting were advertisers, who thought it would be a neat idea to serve local advertising. In other words, if a user visits the New York Times site, the ads they see depend on their physical location. Those in the U.S. might see ads for the latest Chrysler car, while those in Japan might see ads for i-mode; users from Poland might see ads for Ekstradycja (a cult Polish police TV series), and those in India might see ads for the latest Bollywood movie. While such use of geotargeting might maximize the return on the invested dollar, it also goes against the idea behind the Internet as a global network. (In other words, if you are addressing a global audience, don’t try to hide from it by compartmentalizing it.) Another problem with geotargeted ads is that they follow the viewer. Advertisers must love it, but it is annoying to the user; how would you feel if you saw the same ads for your local burger bar everywhere you went in the world?

Another application of geotargeting is to serve content in the local language. The idea is really nice, but it’s often poorly implemented and takes a lot of clicking to get to the pages in other languages. The local pages have a habit of returning out of nowhere, especially after you upgrade your web browser to a new version. A much more interesting application of geotargeting is analysis of trends, which is usually done in two ways: via analysis of server logs and via analysis of results of querying Google.

Server log analysis is used to determine the geographic location of your visitors. For example, you might discover that your company’s site is being visited by a large number of people from Japan. Perhaps that number is so significant that it will justify the rollout of a Japanese version of your site. Or it might be a signal that your company’s products are becoming popular in that country and you should spend more marketing dollars there. But if you run a server for U.S. expatriates living in Tokyo, the same information might mean that your site is growing in popularity and you need to add more information in English. This method is based on the list of addresses of hosts that connect to the server, stored in your server’s access log. You could write a script that looks up their geographic location to find out where your visitors come from. It is more accurate than looking up top-level domains, although it’s a little slower due to the number of DNS lookups that need to be done.
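
A bare-bones sketch of that kind of log analysis, reusing CAIDA::NetGeoClient from the previous hack (the access_log path and the tally bookkeeping here are only illustrative):

#!/usr/bin/perl -w
use strict;
use CAIDA::NetGeoClient;

my $geo = CAIDA::NetGeoClient->new(  );
my (%seen, %tally);

open (LOG, "<access_log") or die $!;
while (<LOG>) {
    my ($host) = split;          # first field of a combined-format log line.
    next if $seen{$host}++;      # look each host up only once.
    my $record = $geo->getRecord($host);
    next unless $record && $record->{COUNTRY};
    $tally{$record->{COUNTRY}}++;
}
close LOG;

print "$_: $tally{$_}\n"
  for sort { $tally{$b} <=> $tally{$a} } keys %tally;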

Another interesting use of geotargeting is analysis of the spread of trends. This can be done with a simple script that plugs into the Google API and the IP-to-Country database provided by Directi (http://ip-to-country.directi.com). The idea behind trend analysis is simple: perform repetitive queries using the same keywords, but change the language of results and top-level domains for each query. Compare the number of results returned for each language, and you will get a good idea of the spread of the analyzed trend across cultures. Then, compare the number of results returned for each top-level domain, and you will get a good idea of the spread of the analyzed trend across the globe. Finally, look up geographic locations of hosts to better approximate the geographic spread of the analyzed trend.
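
With the geospider.pl script shown later in this hack, that comparison boils down to a handful of runs with different -l and -d switches and a tally of how many results each one reports (the query term here is arbitrary):

% perl geospider.pl -q "weblog" -l en -d .com
% perl geospider.pl -q "weblog" -l ja -d .jp
% perl geospider.pl -q "weblog" -l de -d .de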

You might discover some interesting things this way: it could turn out that a particular .com domain serving a significant number of documents that contain the given query in Japanese is actually hosted in Germany. It might be a sign that there is a large Japanese community in Germany that uses that particular .com domain for their portal. Shouldn’t you be trying to get in touch with them?

The geospider.pl script shown in this hack is a sample implementation of this idea. It queries Google and then matches the names of hosts in returned URLs against the IP-to-Country database.

The Code

You will need the Getopt::Std and Net::Google modules for this script. You’ll also need a Google API key (http://api.google.com) and the latest ip-to-country.csv database, available from http://ip-to-country.directi.com/.

Save the following code as geospider.pl:

#!/usr/bin/perl -w
#
# geospider.pl
#
# Geotargeting spider -- queries Google through the Google API, extracts
# hostnames from returned URLs, looks up addresses of hosts, and matches
# addresses of hosts against the IP-to-Country database from Directi:
# ip-to-country.directi.com. For more information about this software:
# http://www.artymiak.com/software or contact jacek@artymiak.com
# 
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#

use strict; 
use Getopt::Std;
use Net::Google;
use constant GOOGLEKEY => 'Your Google API key here';
use Socket;

my $help = <<"EOH";
----------------------------------------------------------------------------
Geotargeting trend analysis spider
----------------------------------------------------------------------------
Options:

  -h    prints this help
  -q    query in utf8, e.g. 'Spidering Hacks'
  -l    language codes, e.g. 'en fr jp'
  -d    domains, e.g. '.com'
  -s    which result should be returned first (count starts from 0), e.g. 0
  -n    how many results should be returned, e.g. 700
----------------------------------------------------------------------------
EOH

# define our arguments and show the
# help if asked, or if missing query.
my %args; getopts("hq:l:d:s:n:", \%args);
die $help if exists $args{h};
die $help unless $args{'q'};

# create the Google object.
my $google = Net::Google->new(key=>GOOGLEKEY);
my $search = $google->search(  );

# language, defaulting to English.
$search->lr(split(/\s+/, ($args{l} || "en")));

# what search result to start at, defaulting to 0.
$search->starts_at($args{'s'} || 0);

# how many results, defaulting to 10.
$search->max_results($args{'n'} || 10);

# input and output encoding.
$search->ie(qw(utf8)); $search->oe(qw(utf8));

my $querystr; # our final string for searching.
if ($args{d}) { $querystr = "$args{q} site:$args{d}"; }
else { $querystr = $args{'q'} } # domain specific searching.

# load in our lookup list from
# http://ip-to-country.directi.com/
my $file = "ip-to-country.csv";
print STDERR "Trying to open $file... \n";
open (FILE, "<$file") or die "[error] Couldn't open $file: $!\n";

# now load the whole shebang into memory.
print STDERR "Database opened, loading... \n";
my (%ip_from, %ip_to, %code2, %code3, %country);
my $counter=0; while (<FILE>) {
    chomp; my $line = $_; $line =~ s/"//g; # strip all quotes.
    my ($ip_from, $ip_to, $code2, $code3, $country) = split(/,/, $line);

    # remove leading zeros.
    $ip_from =~ s/^0{0,10}//g; 
    $ip_to =~ s/^0{0,10}//g;

    # and assign to our permanents.
    $ip_from{$counter} = $ip_from;
    $ip_to{$counter}   = $ip_to;
    $code2{$counter}   = $code2;
    $code3{$counter}   = $code3;
    $country{$counter} = $country;
    $counter++; # move on to next line.
}

$search->query(qq($querystr));
print STDERR "Querying Google with $querystr... \n";
print STDERR "Processing results from Google... \n";

# for each result from Google, display 
# the geographic information we've found.
foreach my $result (@{$search->response(  )}) {
    print "-" x 80 . "\n";
    print " Search time: " . $result->searchTime(  ) . "s\n";
    print "       Query: $querystr\n";
    print "   Languages: " . ( $args{l} || "en" ) . "\n";
    print "      Domain: " . ( $args{d} || "" ) . "\n";
    print "    Start at: " . ( $args{'s'} || 0 ) . "\n";
    print "Return items: " . ( $args{n} || 10 ) . "\n";
    print "-" x 80 . "\n";

    map {
        print "url: " . $_->URL(  ) . "\n";
        my @addresses = get_host($_->URL(  ));
        if (defined $addresses[0]) {
            match_ip(@addresses);
        } else {
            print "address: unknown\n";
            print "country: unknown\n";
            print "code3: unknown\n";
            print "code2: unknown\n";
        } print "-" x 50 . "\n";
    } @{$result->resultElements(  )};
}

# get the IPs for 
# matching hostnames.
sub get_host {
    my ($url) = @_;

    # chop the URL down to just the hostname.
    my $name = substr($url, 7); $name =~ m/\//g;
    $name = substr($name, 0, pos($name) - 1);
    print "host: $name\n";

    # and get the matching IPs.
    my @addresses = gethostbyname($name);
    if (scalar @addresses != 0) {
        @addresses = map { inet_ntoa($_) } @addresses[4 .. $#addresses];
    } else { return undef; }
    return "@addresses";
}

# check our IP in the
# Directi list in memory.
sub match_ip {
    my (@addresses) = split(/ /, "@_");
    foreach my $address (@addresses) {
        print "address: $address\n";
        my @classes = split(/\./, $address);
        my $p; foreach my $class (@classes) {
            $p .= pack("C", int($class));
        } $p  = unpack("N", $p);
        my $counter = 0;
        foreach (keys %ip_to) {
            if ($p <= int($ip_to{$counter})) {
                print "country: " . $country{$counter} . "\n";
                print "code3: "   . $code3{$counter}   . "\n";
                print "code2: "   . $code2{$counter}   . "\n";
                last;
            } else { ++$counter; }
        } 
    }
}

Running the Hack

Here, we’re querying to see how much worldwide penetration AmphetaDesk, a popular news aggregator, has, according to Google’s top search results:

% perl geospider.pl -q "amphetadesk"
Trying to open ip-to-country.csv... 
Database opened, loading... 
Querying Google with amphetadesk... 
Processing results from Google... 
--------------------------------------------------------------
 Search time: 0.081432s
       Query: amphetadesk
   Languages: en
      Domain: 
    Start at: 0
Return items: 10
--------------------------------------------------------------
url: http://www.macupdate.com/info.php/id/9787
host: www.macupdate.com
address: 64.5.48.152
country: UNITED STATES
code3: USA
code2: US
--------------------------------------------------
url: http://allmacintosh.forthnet.gr/preview/214706.html
host: allmacintosh.forthnet.gr
address: 193.92.150.100
country: GREECE
code3: GRC
code2: GR
--------------------------------------------------
...etc...

Hacking the Hack

This script is only a simple tool; you will make it better, no doubt. The first thing you could do is implement a more efficient way to query the IP-to-Country database. Storing the data from ip-to-country.csv in a real database would shave several seconds off the script’s startup time, and individual address-to-country lookups would come back much faster as well.
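
For example, here is a rough sketch of that idea, assuming the DBI and DBD::SQLite modules are installed (the database and table names are just illustrative). Run once, it loads ip-to-country.csv into an SQLite file; from then on, each address-to-country query becomes a single SELECT instead of a scan over a hash held in memory:

#!/usr/bin/perl -w
# load_ip2country.pl -- one-time load of ip-to-country.csv into SQLite.
use strict;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=ip2country.db", "", "",
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do("CREATE TABLE IF NOT EXISTS ip2country
          (ip_from INTEGER, ip_to INTEGER,
           code2 TEXT, code3 TEXT, country TEXT)");
my $ins = $dbh->prepare("INSERT INTO ip2country VALUES (?, ?, ?, ?, ?)");

open (FILE, "<ip-to-country.csv") or die "[error] Couldn't open the csv: $!\n";
while (<FILE>) {
    chomp; s/"//g;                      # strip all quotes, as before.
    $ins->execute(split(/,/, $_, 5));   # ip_from, ip_to, code2, code3, country.
}
close FILE; $dbh->commit;

# A lookup is then a single (indexable) query, for example:
#   SELECT country, code3, code2 FROM ip2country
#   WHERE ? BETWEEN ip_from AND ip_to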

You might ask if it wouldn’t be easier to write a spider that doesn’t use the Google API and instead downloads page after page of results returned by Google at http://www.google.com. Yes, it is possible, and it is also the quickest way to get your script blacklisted for breaching Google’s user agreement. Google is not only the best search engine, it is also one of the best-monitored sites on the Internet.

—Jacek Artymiak

Hack #76. Getting the Best Travel Route by Train

A web scraper can help you find faster train connections in Europe.

If you ever visit Europe and want to travel by train, you will find the PKP (Polskie Koleje Panstwowe, or Polish State Railways) server (http://www.rozklad.pkp.pl) a handy place to find information about European train connections.

This hack queries the timetables of the PKP site and scrapes a variety of information from the results, including the time of departure and arrival, as well as the number of changes you’ll have to make along the way.

The Code

Save the following code as broute.pl:

#!/usr/bin/perl -w
#
# broute.pl
# 
# A European train timetable hack that displays available train connections
# between two cities, with dates, times, and the number of changes. You
# can limit the number of acceptable changes with -c. If there are no
# connections, try earlier/later times/dates or search again for connections
# with intermediate stops, e.g., instead of Manchester -> Roma, choose 
# Manchester -> London, London -> Paris, and Paris -> Roma.
# 
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#

use strict;
use LWP::UserAgent;
use Net::HTTP;
use Getopt::Std;

my $help = <<"EOH";
---------------------------------------------------------------------------
Best train routes in Europe

Options: -a   depart from
         -z   arrive in
         -d   date (of departure, if -s d; arrival, if -s a)
              in dd.mm.yy format (e.g. June 1, 2004 is 01.06.04)
         -t   time (of departure, if -s d; arrival, if -s a)
              in hh:mm format (e.g. 12:45)
         -s   select time point for -d and -t options, default -s d
         -c   maximum number of changes, default 0
         -h   print this help
EOH

# set out command-line options,
# requirements, and defaults.
my %args; getopts('ha:z:d:t:s:c:', \%args);
die $help if exists $args{h};
die $help unless $args{a};
die $help unless $args{z};
die $help unless $args{t};
$args{'s'} = 'depart' unless $args{'s'};
$args{'s'} = 'depart' if $args{'s'} eq 'd';
$args{'s'} = 'arrive' if $args{'s'} eq 'a';

# our requesting agent. define our URL and POST.
my $url  = 'http://www.rozklad.pkp.pl/cgi-bin/new/query.exe/en';
my $post = "protocol=http:&from=$args{a}&to=$args{z}&datesel=custom".
           "&date=$args{d}&timesel=$args{s}&time=$args{t}";

# the headers we'll send off...
my $hdrs = HTTP::Headers->new(Accept => 'text/plain',
                 'User-Agent' => 'PKPTrainTimetableLookup/1.0');

# and the final requested documents.
my $uable = HTTP::Request->new('POST', $url, $hdrs, $post);
my $ua    = LWP::UserAgent->new; my $req = $ua->request($uable);

# if a success,
# let's parse it!
die $req->message
  unless $req->is_success;
my $doc = $req->content;

$doc =~ s/[\f\t\n\r]//isg; # remove linefeeds, tabs, and carriage returns.
while ($doc =~ m/ NAME=sel[0-9]{1,2}>/isg) {
    my $begin = pos($doc);
    $doc =~ m/<TR>/isg;
    my $end = pos($doc);
    next unless $begin;
    next unless $end;

    # munch our content into columns.
    my $content = substr($doc, $begin, ($end -= 5) - $begin);
    $doc = substr($doc, $end);
    my @columns = split(/<TD/, $content); shift @columns;
    foreach my $column (@columns) {
        $column = '<TD' . $column;
        $column =~ s/<[^>]*>//g;
        $column =~ s/<[^>]*//g;
    }

    # skip schedules that have more hops than we want.
    if ($args{c} and int $args{c} < int $columns[2]) { next; }

    # and print out our data.
    print "-" x 80 . "\n";
    print "             From: $columns[0]\n";
    print "               To: $columns[1]\n";
    print "          Changes: $columns[2]\n";
    print "Date of Departure: $columns[3]\n" if $args{'s'} eq 'depart';
    print "  Date of Arrival: $columns[3]\n" if $args{'s'} eq 'arrive';
    print "   Departure Time: $columns[4]\n";
    print "     Arrival Time: $columns[5]\n";
}

Running the Hack

The script has several command-line options that are viewable in the code or by requesting its display with perl broute.pl -h.

Here are a couple of example runs. Let’s find all connections from Berlin to Szczecin with an arrival time of 8:00 A.M. on December 15, 2004 with no changes:

% perl broute.pl -a Berlin -z Szczecin -s a -d 15.12.04 -t 8:00 -c 0

How about all connections from Manchester to Rome with departure time of 8:00 A.M. on December 15, 2004 with a maximum of four changes:

% perl broute.pl -a Manchester -z Roma -s d -d 15.12.04 -t 8:00 -c 4

A typical run looks something like this:

trying http://www.rozklad.pkp.pl/cgi-bin/new/query.exe/en ...
-------------------------------------------------------------------------
             From: Berlin Ostbf
               To: Szczecin Główny
          Changes: 0
  Date of Arrival: 05.07.03
   Departure Time: 5:55
     Arrival Time: 7:41

Hacking the Hack

There are a few things you can do to expand this hack. For example, you could add subroutines that find connections within 24 hours (12 hours before and 12 hours ahead) of the given time of departure or arrival; a rough starting point for that idea follows. Another addition could be a module that displays the names of the transfer stations.
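
Here is one rough way to start on the first idea: a small helper (hypothetical, not part of broute.pl) that builds a list of query times at three-hour steps within 12 hours of the requested time. Each returned time could then be sent through the same POST request in a loop; times that would roll over into another day are simply skipped:

# generate times every $step hours within 12 hours of $time,
# staying inside the same day.
sub nearby_times {
    my ($time, $step) = @_;               # e.g. ("8:00", 3)
    my ($h, $m) = split(/:/, $time);
    my @times;
    for (my $off = -12; $off <= 12; $off += $step) {
        my $hour = $h + $off;
        next if $hour < 0 or $hour > 23;  # skip times that cross midnight.
        push @times, sprintf("%d:%02d", $hour, $m);
    }
    return @times;
}

# nearby_times("8:00", 3) gives 2:00, 5:00, 8:00, and so on up to 20:00.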

—Jacek Artymiak

Hack #77. Geographic Distance and Back Again

When you’re traveling from one place to another, it’s usually handy to know exactly how many miles you’re going to be on the road. One of the best ways to get the most accurate result is to use latitude and longitude.

Dr. Seuss once wrote, “From here to there, from near to far, funny things are everywhere.” But just how far apart are those funny things, anyway?

Given the latitude and longitude of two terrestrial objects, and assuming the earth to be a perfect sphere with a smooth surface, the “great circle” calculation to find the shortest surface distance between those two objects is a simple bit of trigonometry. Even though the earth is neither smooth nor a perfect sphere, the calculation is surprisingly accurate. I found the position—i.e., the latitude and longitude—of my home and the home of a friend who lives a short distance away. Using a town map and a ruler, I calculated the distance at 7.49 miles. Using the positions and trigonometry, the calculated distance came out at 7.43 miles.
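
If you’re curious about the math, here’s a minimal sketch of the great-circle calculation using the standard Math::Trig module, assuming a perfectly spherical earth with a radius of about 3,956 miles (the script later in this hack leaves this job to Geo::Distance instead):

#!/usr/bin/perl -w
use strict;
use Math::Trig;

# great-circle distance in miles between two lat/lon pairs (decimal degrees).
sub great_circle_miles {
    my ($lat1, $lon1, $lat2, $lon2) = @_;
    # Math::Trig wants spherical coordinates: theta is the longitude and
    # phi is 90 degrees minus the latitude, both in radians.
    my @from = (deg2rad($lon1), deg2rad(90 - $lat1));
    my @to   = (deg2rad($lon2), deg2rad(90 - $lat2));
    return great_circle_distance(@from, @to, 3956);
}

# Los Angeles to New York, roughly 2,450 miles.
printf "%.1f miles\n",
    great_circle_miles(34.05466, -118.24150, 40.71012, -74.00657);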

That was good enough for me, so I set about to create a program that would accept two addresses and return the distance between them. Initially, I thought I’d have the program done in about 30 minutes. Ultimately, it required a few hours of research and a creative hack of MapPoint. The tough part? Getting the true latitude and longitude for an address, something I mistakenly thought would be trivial on our little high-tech planet—not so!

The Latitude/Longitude Question

The difficulty associated with this hack can be demonstrated through a very simple exercise: right now, before you read any further, using any online resource that you like, go find the latitude and longitude of your house—not just of your Zip Code, but of your actual house.

Not so easy, is it? In fact, I was surprised by the difficulty this problem presented. I found several resources—the easiest to use being the U.S. Census web site (http://www.census.gov)—that could turn Zip Codes into positions, presumably somewhere near the center of the Zip Code’s geographic region, but virtually nothing that would give me the position of an actual address. In the past, I used a mapping service called MapBlast! (http://www.mapblast.com), and I thought I recalled that this service would give me map positions. However, a trip to MapBlast! now lands you at MapPoint, Microsoft’s mapping service; Microsoft apparently acquired MapBlast! in the not-too-distant past.

At this point, I’ll spare you the details of my research and cut to the chase:

  • If you want the position for a Zip Code, it’s easy; there are lots of sites and even some Perl packages that will do this automatically for you.

  • The major mapping services will take a position and present you with a map, but they won’t give you a position if you give them an address.

  • Microsoft has a nice set of web service APIs in addition to MapPoint, and they can be used to find the position of an address. Unfortunately, it’s a subscription service.

  • Pay services (search http://www.geocode.com to find a few) can turn an address into a position.

  • Whether intentional or not, MapPoint does publish the position for an address in its publicly accessible web interface. It’s not published on the page; it’s published in the URL.

I found that last item in the list most intriguing. I discovered it quite by accident. I had mapped my address and by chance took a look at the URL. I recognized some numbers that looked suspiciously like my latitude and longitude. I played around a bit and found the behavior was consistent; MapPoint returns a latitude/longitude position in its URL whenever it maps an address. Try it. Go to http://mappoint.msn.com/ and map an address. Then, look closely at the URL for the parameter whose name is C. It’s the latitude and longitude of the address you just looked up. Now, all I needed to do was find a way to make MapPoint give that data up to a Perl script!

Hacking the Latitude Out of MapPoint

Getting MapPoint to respond to a Perl script as it would to a browser was a bit more difficult than a straightforward GET or POST. My first few quick attempts earned me return data that contained messages like “Function not allowed,” “ROBOT-NOINDEX,” and “The page you are looking for does not exist.” In the end, I grabbed my trusty packet analyzer and monitored the traffic between IE and MapPoint, ultimately learning what it would take to make MapPoint think it was talking to a browser and not to a script. Here’s what happens:

  1. The first GET request to http://mappoint.msn.com/ earns you a Location: HTTP header in return. The new location redirects to home.aspx, prefixed by a pathname that includes a long string of arbitrary characters, presumably a session ID or some other form of tracking information.

  2. A GET on the new location retrieves the “Find a Map” form. Among the obvious fields—street, city, and so on—are some hidden ones. In particular, one hidden field named __VIEWSTATE contains about 1 KB of encoded data. It turns out that returning the exact __VIEWSTATE is important when sending the address query.

  3. Next, we do a POST to send MapPoint the address we want mapped. In addition to the address information and the __VIEWSTATE field, there are a few other hidden fields to send. In the present code, we send a request specifically for an address in the United States. MapPoint supports other countries, as well as “Place” queries for the entire world, and it wouldn’t be too much work to extend the program to handle these as well.

  4. In response to the POST, we get another Location: HTTP header, this time redirecting to map.aspx. The URL contains several arguments, and among them is the latitude/longitude data that we want.

  5. If you perform a GET on the new location, now you get the map. Our script doesn’t do this last GET, however, because the data we want is in the URL, not on the result page.

The Code

If you take a look at the GetPosition function in the code, you’ll see that it follows the five steps in the previous section exactly. The code also includes a simple routine to parse an address—to make the thing user-friendly, not because we had to—and a mainline to glue it all together and report the results. I used a nice package named Geo::Distance to perform the actual distance calculations. Time for some Perl!

Save the following code as geodist.pl:

#!/usr/bin/perl -w

# Usage: geodist.pl --from="fromaddr" --to="toaddr" [--unit="unit"]
# See ParseAddress(  ) below for the format of addresses. Default unit is
# "mile". Other units are yard, foot, inch, kilometer, meter, centimeter.

use strict;
use Getopt::Long;
use Geo::Distance;
use HTTP::Request::Common;
use LWP::UserAgent;

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

my $_ADDRESS_REGEX = q<(((([^\,]+),\s*)?([^\,]+),\s*)?([A-Z]{2}))?> .
  q<(\s*(\d{5}(-\d{4})?))?>;

sub ParseAddress {

  # Moderately robust regex parse of an address of the form:
  #   Street Address, City, ST ZIP
  # Assumes that a city implies a state, and a street address implies a
  # city; otherwise, all fields are optional. Does a good job so long as
  # there are no commas in street address or city fields.
  
  my $AddrIn = shift;
  my $ComponentsOut = shift;
  $AddrIn =~ /$_ADDRESS_REGEX/;
  $ComponentsOut->{Address} = $4 if $4;
  $ComponentsOut->{City} = $5 if $5;
  $ComponentsOut->{State} = $6 if $6;
  $ComponentsOut->{Zip} = $8 if $8;
}

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sub GetPosition {

  # Hack mappoint.msn.com to obtain the longitude and latitude of an
  # address. MapPoint doesn't actually return lon/lat as user data, but
  # it can be found in a Location header when a successful map request is
  # made. Testing has shown this to be a robust hack. Biggest caveat
  # presently is failure when MapPoint returns multiple address matches.

  my $AddressIn = shift;
  my $LatitudeOut = shift;
  my $LongitudeOut = shift;

  # Create a user agent for HTTP requests.
  my $ua = LWP::UserAgent->new;

  # First do a simple request to get the redirect that MapPoint sends us.
  my $req = GET( 'http://mappoint.msn.com/' );
  my $res = $ua->simple_request( $req );

  # Save the redirect URI and then grab the full page.
  my $uri = $res->headers->{location};
  $req = GET( 'http://mappoint.msn.com' . $uri );
  $res = $ua->request( $req );

  # Get the __VIEWSTATE hidden input from the result.
  my ( $__VIEWSTATE ) =
    $res->content =~ /name="__VIEWSTATE" value="([^\"]*)"/s;

  # Construct the form fields expected by the mapper.
  $req = POST( 'http://mappoint.msn.com' . $uri,
    [ 'FndControl:SearchType' => 'Address',
      'FndControl:ARegionSelect' => '12',
      'FndControl:StreetText' => $AddressIn->{Address},
      'FndControl:CityText' => $AddressIn->{City},
      'FndControl:StateText' => $AddressIn->{State},
      'FndControl:ZipText' => $AddressIn->{Zip},
      'FndControl:isRegionChange' => '0',
      'FndControl:resultOffSet' => '0',
      'FndControl:BkARegion' => '12',
      'FndControl:BkPRegion' => '15',
      'FndControl:hiddenSearchType' => '',
      '__VIEWSTATE' => $__VIEWSTATE
    ] );

  # Works without referer, but we include it for good measure.
  $req->push_header( 'Referer' => 'http://mappoint.msn.com' . $uri );

  # Do a simple request because all we care about is the redirect URI.
  $res = $ua->simple_request( $req );

  # Extract and return the latitude/longitude from the redirect URI.
  ( $$LatitudeOut, $$LongitudeOut ) = $res->headers->{location} =~
    /C=(-?[0-9]+\.[0-9]+)...(-?[0-9]+\.[0-9]+)/;
}

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sub main {

  # Get the command-line options.
  my ( $FromOpt, %FromAddress, $ToOpt, %ToAddress );
  my $UnitOpt = 'mile';
  GetOptions( "from=s" => \$FromOpt,
              "to=s"   => \$ToOpt,
              "unit=s" => \$UnitOpt );

  # Parse the addresses.
  ParseAddress( $FromOpt, \%FromAddress );
  ParseAddress( $ToOpt, \%ToAddress );

  # Get latitude/longitude for the addresses.
  my ( $FromLat, $FromLon, $ToLat, $ToLon );
  GetPosition( \%FromAddress, \$FromLat, \$FromLon );
  GetPosition( \%ToAddress, \$ToLat, \$ToLon );

  # If we at least got some numbers, then find the distance.
  if ( $FromLat && $FromLon && $ToLat && $ToLon ) {
    print "($FromLat,$FromLon) to ($ToLat,$ToLon) is ";
    my $geo = new Geo::Distance;
    print $geo->distance_calc( $UnitOpt, $FromLon,
                               $FromLat, $ToLon, $ToLat );
    if ( $UnitOpt eq 'inch' ) { print " inches\n"; }
    elsif ( $UnitOpt eq 'foot' ) { print " feet\n"; }
    else { print " ", $UnitOpt, "s\n"; }
  }
  else {
    print "Latitude/Longitude lookup failed for FROM address\n"
      if !( $FromLat && $FromLon );
    print "Latitude/Longitude lookup failed for TO address\n"
      if !( $ToLat && $ToLon );
  }
}

main(  );

Running the Hack

A couple of quick examples will show how the hack would work:

% perl geodist.pl --from="Los Angeles, CA" --to="New York, NY"
(34.05466,-118.24150) to (40.71012,-74.00657) is 2448.15742500315 miles

% perl geodist.pl --from="14 Horseshoe Drive, Brookfield, CT" \
                  --to="5 Mountain Orchard, Bethel, CT"
(41.46380,-73.42021) to (41.35659,-73.41078) is 7.43209675476431 miles

% perl geodist.pl --from=06804 --to=06801
(41.47364,-73.38575) to (41.36418,-73.39262) is 7.57999735385486 miles

If something goes wrong with a position lookup—either because MapPoint didn’t find the address or because it found multiple addresses—the script simply indicates which address had a problem:

% perl geodist.pl --from="Los Angeles, CA" --to="New York"
Latitude/Longitude lookup failed for TO address

In this case, "New York" is too general and needs to be refined further.

Hacking the Hack

The most obvious enhancement is to address the two shortcomings of the existing hack: it works only with addresses within the U.S., and it fails if MapPoint returns multiple address matches. Addressing the first issue is a matter of adding some options to the command line and then changing the fields sent in the POST query. Addressing the second issue is a bit more difficult; it’s easy to parse the list that comes back, but the question is what to do with it. Do you just take the first address in the list? This may or may not be what the user wants. A true solution would probably have to present the list to the user and allow him to choose.

—Ron Pacheco

Hack #78. Super Word Lookup

Working on a paper, book, or thesis and need a nerdy definition of one word, and alternatives to another?

You’re writing a paper and getting sick of constantly looking up words in your dictionary and thesaurus. As most of the hacks in this book have done, you can scratch your itch with a little bit of Perl. This script uses the dict protocol (http://www.dict.org) and Thesaurus.com (http://www.thesaurus.com) to find all you need to know about a word.

By using the dict protocol, DICT.org and several other dictionary sites make our task easier, since we do not need to filter through HTML code to get what we are looking for. A quick look through CPAN (http://www.cpan.org) reveals that the dict protocol has already been implemented as a Perl module (http://search.cpan.org/author/NEILB/Net-Dict/lib/Net/Dict.pod). Reading through the documentation, you will find it is well-written and easy to implement; with just a few lines, you have more definitions than you can shake a stick at. Next problem.

Unfortunately, the thesaurus part of our program will not be as simple. However, there is a great online thesaurus (http://www.thesaurus.com) that we will use to get the information we need. The main page of the site offers a form to look up a word, and the results take us to exactly what we want. A quick look at the URL shows this will be an easy hurdle to overcome—using LWP, we can grab the page we want and need to worry only about parsing through it.

Since some words have multiple forms (noun, verb, etc.), there might be more than one entry for a word; this needs to be kept in mind. Looking at the HTML source, you can see that each row of the data is on its own line, starting with some table tags, then the header for the line (Concept, Function, etc.), followed by the content. The easiest way to handle this is to go through each section individually, grabbing from Entry to Source, and then parse out what’s between. Since we want only synonyms for the exact word we searched for, we will grab only sections where the content for the entry line contains only the word we are looking for and is between the highlighting tag used by the site. Once we have this, we can strip out those highlighting tags and proceed to finding the synonym and antonym lines, which might not be available for every section. The easiest thing to do here is to throw it all in an array; this makes it easier to sort, remove duplicate words, and display it. In cases in which you are parsing through long HTML, you might find it easier to put the common HTML strings in variables and use them in the regular expressions; it makes the code easier to read. With a long list of all the words, we use the Sort::Array module to get an alphabetical, and unique, listing of results.

The Code

Save the following code as dict.pl:

#!/usr/bin/perl -w
#
# Dict - looks up definitions, synonyms and antonyms of words.
# Comments, suggestions, contempt? Email adam@bregenzer.net.
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#

use strict; $|++;
use LWP;
use Net::Dict;
use Sort::Array "Discard_Duplicates";
use URI::Escape;

my $word = $ARGV[0]; # the word to look-up
die "You didn't pass a word!\n" unless $word;
print "Definitions for word '$word':\n";

# get the dict.org results.
my $dict = Net::Dict->new('dict.org');
my $defs = $dict->define($word);
foreach my $def (@{$defs}) {
    my ($db, $definition) = @{$def};
    print $definition . "\n";
}

# base URL for thesaurus.com requests
# as well as the surrounding HTML of
# the data we want. cleaner regexps.
my $base_url       = "http://thesaurus.reference.com/search?q=";
my $middle_html    = ":</b>&nbsp;&nbsp;</td><td>";
my $end_html       = "</td></tr>";
my $highlight_html = "<b style=\"background: #ffffaa\">";

# grab the thesaurus results.
my $ua = LWP::UserAgent->new(agent => 'Mozilla/4.76 [en] (Win98; U)');
my $data = $ua->get("$base_url" . uri_escape($word))->content;

# holders for matches.
my (@synonyms, @antonyms);

# and now loop through them all.
while ($data =~ /Entry(.*?)<b>Source:<\/b>(.*)/) {
    my $match = $1; $data = $2;

    # strip out the bold marks around the matched word.
    $match =~ s/${highlight_html}([^<]+)<\/b>/$1/;

    # push our results into our various arrays.
    if ($match =~ /Synonyms${middle_html}([^<]*)${end_html}/) {
        push @synonyms, (split /, /, $1);
    }
    elsif ($match =~ /Antonyms${middle_html}([^<]*)${end_html}/) {
        push @antonyms, (split /, /, $1);
    }
}

# sort them with sort::array,
# and return unique matches.
if (@synonyms) {
    @synonyms = Discard_Duplicates(
        sorting      => 'ascending',
        empty_fields => 'delete',
        data         => \@synonyms,
    );

    print "Synonyms for $word:\n";
    my $quotes = ''; # purtier.
    foreach my $nym (@synonyms) {
        print $quotes . $nym;
        $quotes = ', ';
    } print "\n\n";
}

# same thing as above.
if (@antonyms) {
    @antonyms = Discard_Duplicates(
        sorting      => 'ascending',
        empty_fields => 'delete',
        data         => \@antonyms,
    );

    print "Antonyms for $word:\n";
    my $quotes = ''; # purtier.
    foreach my $nym (@antonyms) {
        print $quotes . $nym;
        $quotes = ', ';
    } print "\n";
}

Running the Hack

Invoke the script on the command line, passing it one word at a time. As far as I know, these sites work with English words only. This script has a tendency to generate a lot of output, so you might want to pipe it to less or redirect it to a file.

Here is an example where I look up the word “hack”:

% perl dict.pl "hack"
Definitions for word 'hack':
<snip>
hack
 
   <jargon> 1. Originally, a quick job that produces what is
   needed, but not well.
 
   2.  An incredibly good, and perhaps very time-consuming, piece
   of work that produces exactly what is needed.

<snip>
 
   See also {neat hack}, {real hack}.
 
   [{Jargon File}]
 
   (1996-08-26)
 
Synonyms for hack:
be at, block out, bother, bug, bum, carve, chip, chisel, chop, cleave, 
crack, cut, dissect, dissever, disunite, divide, divorce, dog, drudge, 
engrave, etch, exasperate, fashion, form, gall, get, get to, grate, grave, 
greasy grind, grind, grub, grubber, grubstreet, hack, hew, hireling, incise, 
indent, insculp, irk, irritate, lackey, machine, mercenary, model, mold, 
mould, nag, needle, nettle, old pro, open, part, pattern, peeve, pester, 
pick on, pierce, pique, plodder, potboiler, pro, provoke, rend, rip, rive, 
rough-hew, sculpt, sculpture, separate, servant, sever, shape, slash, slave, 
slice, stab, stipple, sunder, tear asunder, tease, tool, trim, vex, whittle, 
wig, workhorse
 
Antonyms for hack:
appease, aristocratic, attach, calm, cultured, gladden, high-class, humor, 
join, make happy, meld, mollify, pacify, refined, sophisticated, superior, 
unite

Hacking the Hack

There are a few ways you can improve upon this hack.

Using specific dictionaries

You can either use a different dict server or you can use only certain dictionaries within the dict server. The DICT.org server uses 13 dictionaries; you can limit it to use only the 1913 edition of Webster’s Revised Unabridged Dictionary by changing the $dict->define line to:

my $defs = $dict->define($word, 'web1913');

The $dict->dbs method will get you a list of dictionaries available.
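
For example, this little snippet (using the same dict.org server) prints the name and description of each available dictionary, so you can decide which ones to pass to define:

#!/usr/bin/perl -w
use strict;
use Net::Dict;

# list the databases (dictionaries) the server makes available.
my $dict = Net::Dict->new('dict.org');
my %dbs  = $dict->dbs(  );
foreach my $db (sort keys %dbs) {
    print "$db\t$dbs{$db}\n";
}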

Clarifying the thesaurus

For brevity, the thesaurus section prints all the synonyms and antonyms for a particular word. It would be more useful if it separated them according to the function of the word and possibly the definition.

—Adam Bregenzer

Hack #79. Word Associations with Lexical Freenet

There will come a time when you want a little more than simple word definitions, synonyms, or etymologies. Lexical Freenet takes you beyond these simple results, providing associative data, or “paths,” from your word to others.

Lexical Freenet (http://www.lexfn.com) allows you to search for word relationships like puns, rhymes, concepts, relevant people, antonyms, and so much more. For example, a simple search for the word disease returns a long listing of word paths, each associated with other words by different types of connecting arrows: disease triggers both aids and cancer; comprises symptoms; and bio triggers such relevant persons as janet elaine adkins, james parkinson, alois alzheimer, and so on. This is but a small sampling of the available and verbose output.

In combination with “Super Word Lookup” [Hack #78], a command-line interface to Lexical Freenet’s functionality would bring immense lookup capabilities to writers, librarians, and researchers. This hack shows you how to create said interface, with the ability to customize which relationships you’d like to see, as well as turn the visual connections into text.

The Code

Save the following code as lexfn.pl:

#!/usr/bin/perl -w
#
# Hack to query and report from www.lexfn.com
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#
# by rik - ora@rikrose.net
#

######################
# support stage      #
######################

use strict;
use Getopt::Std qw(getopts);
use LWP::Simple qw(get);
use URI::Escape qw(uri_escape uri_unescape);
use HTML::TokeParser;

sub usage (  ) { print "
usage: lexfn [options] word1 [word2]
options available:
 -s Synonymous     -a Antonym        -b Birth Year
 -t Triggers       -r Rhymes         -d Death Year
 -g Generalizes    -l Sounds like    -T Bio Triggers
 -S Specialises    -A Anagram of     -k Also Known As
 -c Comprises      -o Occupation of
 -p Part of        -n Nationality

 or -x for all

word1 is mandatory, but some searches require word2\n\n"
}

######################
# parse stage        #
######################

# grab arguments, and put them into %args hash, leaving nonarguments
# in @ARGV for us to process later (where word1 and word2 would be)
# if we don't have at least one argument, we die with our usage.
my %args; getopts('stgScparlAonbdTkx', \%args);
if (@ARGV > 2 || @ARGV == 0) { usage(  ); exit 0; }

# turn both our words into queries.
$ARGV[0] =~ s/ /\+/g; $ARGV[1] ||= "";
if ($ARGV[1]) { $ARGV[1] =~ s/ /\+/g; }

# begin our URL construction with the keywords.
my $URL = "http://www.lexfn.com/l/lexfn-cuff.cgi?sWord=$ARGV[0]".
          "&tWord=$ARGV[1]&query=show&maxReach=2";

# now, let's figure out our command-line arguments. each
# argument is associated with a relevant search at LexFN,
# so we'll first create a mapping to and fro.
my %keynames = (
 s => 'ASYN', t => 'ATRG', g => 'AGEN', S => 'ASPC', c => 'ACOM', 
 p => 'APAR', a => 'AANT', r => 'ARHY', l => 'ASIM', A => 'AANA', 
 o => 'ABOX', n => 'ABNX', b => 'ABBX', d => 'ABDX', T => 'ABTR', 
 k => 'ABAK'
);

# if we want everything (all matches),
# then add them to our arguments hash,
# in preparation for our URL.
if (defined($args{'x'}) && $args{'x'} == 1) {
   foreach my $arg (qw/s t g S c p a r l A o n b d T k/){
       $args{$arg} = 1; # in preparation for URL.
   } delete $args{'x'}; # x means nothing to LexFN.
}

# build the URL from the flags we want.
foreach my $arg (keys %args) { $URL .= '&' . $keynames{$arg} . '=on'; }

######################
# request stage      #
######################

# and download it all for parsing.
my $content = get($URL) or die $!;

######################
# extract stage      #
######################

# with the data sucked down, pass it off to the parser.
my $stream = HTML::TokeParser->new( \$content ) or die $!;

# skip the form on the page, then it's the first <b>
# after the form that we start extracting data from
my $tag = $stream->get_tag("/form");
while ($tag = $stream->get_tag("b")) {
    print $stream->get_trimmed_text("/b") . " ";
    $tag = $stream->get_tag("img");
    print $tag->[1]{alt} . " ";
    $tag = $stream->get_tag("a");
    print $stream->get_trimmed_text("/a") . "\n";
}

exit 0;

The code is split into four basic stages:

Support code

Such as includes and any subroutines you will need

The parsing stage

Where we work out what the user actually wants and build a URL to perform the request

The request stage itself

Where we retrieve the results

The extract stage

Where we recover the data

In this case, the Lexical Freenet site is basic enough that the request is a single URL. A typical Freenet URL looks something like this:

http://www.lexfn.com/l/lexfn-cuff.cgi?fromresub=on&
ASYN=on&ATRG=on&AGEN=on&ASPC=on&ACOM=on&APAR=on&AANT=on&
ARHY=on&ASIM=on&AANA=on&ABOX=on&ABNX=on&ABBX=on&ABDX=on&
ABTR=on&ABAK=on&sWord=lee+harvey+oswald&tWord=disobey&query=SHOW

The data we wish to extract is formed by repeatedly pulling the information from a standard and repetitive chunk of HTML in the search results. This allows us to use the simple HTML::TokeParser module [Hack #20] to retrieve chunks of data easily by parsing the HTML tags, allowing us to query their attributes and retrieve the surrounding text. As you can tell from the previous code, this is not too difficult.

Running the Hack

As you can see from the code, the hack has several switches available for you to decide which kind of word results you want. In this case, we’ll run a search for everything related to disease:

% perl lexfn.pl -x disease
disease triggers aids
disease triggers cancer
disease triggers patients
disease triggers virus
disease triggers doctor
...
disease is more general than blood disorder
disease is more general than boutonneuse fever
disease is more general than cat scratch disease
...
disease rhymes with breeze
disease rhymes with briese
disease rhymes with cheese
disease rhymes with crees
...

Or perhaps a person’s name is more to your liking:

% perl lexfn.pl -bdonT "lee harvey oswald"
lee harvey oswald was born in 1939
lee harvey oswald died in 1963
lee harvey oswald has the nationality american
lee harvey oswald has the occupation assassin
lee harvey oswald triggers 1956-1959
lee harvey oswald triggers 1959
lee harvey oswald triggers 1962
lee harvey oswald triggers attempted
lee harvey oswald triggers become
lee harvey oswald triggers book
lee harvey oswald triggers citizen
lee harvey oswald triggers communist
...

—Richard Rose

Hack #80. Reformatting Bugtraq Reports

Since Bugtraq is such an important part of a security administrator’s watch list, it’s only a matter of time before you’ll want to integrate it more closely with your daily habits.

In this hack, we will write some code to extract the latest Bugtraq reports from http://www.security-focus.com and then output the simplified results for your viewing pleasure. Bugtraq, if you’re not familiar with it, is a moderated discussion list devoted to security issues. Discussions are detailed accounts of new security issues and vulnerabilities, both how they’re exploited and how they can be fixed. Let’s start by examining the web page where the Bugtraq report is located: http://www.security-focus.com/archive/1.

One nice thing to notice about this page is that the data is formatted in a table, complete with column headers. We can use those headers to simplify the data-scraping process by using a handy Perl module called HTML::TableExtract (http://search.cpan.org/author/MSISK/HTML-TableExtract/). TableExtract allows us to scrape the data from the web page without tying our code to a particular layout (at least, not too much). It accomplishes this feat by using those nice column headers. As long as those column headers stay the same, the script should continue to work, even if SecurityFocus gives the page a facelift. In addition to that nice feature, TableExtract takes all the hard work out of parsing the HTML for the data we’re after. Let’s get started.

In the end, this script will use runtime options to allow the user to choose from a number of output formats and locations. I’m not a big fan of those one-letter flags sent to scripts to choose options, so we’ll be using short words instead.

The Code

You’ll need the HTML::TableExtract and LWP::Simple modules to grab the Bugtraq page. As we add more features, you’ll also need XML::RSS, Net::AIM, and Net::SMTP. You could use other modules like URI::URL or HTML::Element to simplify this hack even further.

There are a couple of things to note about this code. We start by retrieving the arguments passed to the script that will be used to determine the output formats; we’ll discuss those later. Next, the data scraped from the Bugtraq page is stuck into a custom data structure to make accessing it easier for later additions to this hack. Also, a subroutine is added to format the data contained in the data structure to ensure minimal code duplication once we have to format for multiple types of output.

Save the following code to a file called bugtraq_hack.pl:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Net::SMTP;
use Net::AIM;
use XML::RSS;

# get params for later use.
my $RUN_STATE = shift(@ARGV);

# the base URL of the site we are scraping and
# the URL of the page where the bugtraq list is located.
my $base_url = "http://www.security-focus.com";
my $url      = "http://www.security-focus.com/archive/1";

# get our data.
my $html_file = get($url) or die "$!\n";

# create an ISO date (localtime's month is 0-based, so bump it).
my ($day, $month, $year) = (localtime)[3..5];
$year += 1900; $month += 1;
my $date = sprintf("%04d-%02d-%02d", $year, $month, $day);

# since the data we are interested in is contained in a table,
# and the table has headers, then we can specify the headers and
# use TableExtract to grab all the data below the headers in one
# fell swoop. We want to keep the HTML code intact so that we
# can use the links in our output formats. start the parse:
my $table_extract =
   HTML::TableExtract->new(
     headers   => [qw(Date Subject Author)],
     keep_html => 1 );
$table_extract->parse($html_file);

# parse out the desired info and
# stuff into a data structure.
my @parsed_rows; my $ctr = 0;
foreach my $table ($table_extract->table_states) {
   foreach my $cols ($table->rows) {
      @$cols[0] =~ m|(\d+/\d+/\d+)|;
      my %parsed_cols = ( "date" => $1 );

      # since the subject links are in the 2nd column, parse unwanted HTML
      # and grab the anchor tags. Also, the subject links are relative, so
      # we have to expand them. I could have used URI::URL, HTML::Element,
      # HTML::Parse, etc. to do most of this as well.
      @$cols[1] =~ s/ class="[\w\s]*"//;
      @$cols[1] =~ m|(<a href="(.*)">(.*)</a>)|;
      $parsed_cols{"subject_html"} = "<a href=\"$base_url$2\">$3</a>";
      $parsed_cols{"subject_url"}  = "$base_url$2";
      $parsed_cols{"subject"}      = $3;

      # the author links are in the 3rd
      # col, so do the same thing.
      @$cols[2] =~ s/ class="[\w\s]*"//;
      @$cols[2] =~ m|(<a href="mailto:(.*@.*)">(.*)</a>)|;
      $parsed_cols{"author_html"}  = $1;
      $parsed_cols{"author_email"} = $2;
      $parsed_cols{"author"}       = $3;

      # put all the information into an
      # array of hashes for easy access.
      $parsed_rows[$ctr++] = \%parsed_cols;
   }
}
 
# if no params were passed, then
# simply output to stdout.
unless ($RUN_STATE) { print &format_my_data(  ); }

# formats the actual
# common data, per format.
sub format_my_data(  ) {
   my $data = "";

   foreach my $cols (@parsed_rows)  {
      unless ($RUN_STATE) { $data .= "$cols->{'date'} $cols->{'subject'}\n"; }
   }

   return $data;
}

Running the Hack

Invoke the script on the command line to view the latest Bugtraq listings:

% perl bugtraq_hack.pl
07/11/2003 Invision Power Board v1.1.2
07/11/2003 LeapFTP remote buffer overflow exploit
07/11/2003 TSLSA-2003-0025 - apache
07/11/2003 W-Agora 4.1.5
...etc...

Okay, that was easy, but what if you want it in HTML, RSS, email, or sent to your AIM account? No problem.

Hacking the Hack

Before we get to the code that handles the different outputs, let’s start with the format_my_data( ) subroutine. This will be used to decide what format we want our data to be presented in, tweak the display based on that decision, and then return the results. We’ll use the $RUN_STATE variable to decide what action format_my_data( ) will take. Normally, I would try to keep the code and variables used inside a subroutine as black-boxed as possible, but in this case, to keep things simple and compact, we’ll be accessing the dreaded global variables directly. Here’s the new code:

sub format_my_data(  ) {
   my $data = "";

   foreach my $cols (@parsed_rows)  {
      if (!$RUN_STATE or $RUN_STATE eq 'file') {
         $data .= "$cols->{date} $cols->{subject}\n"; 
      }
      elsif ($RUN_STATE eq 'html') {
         $data .= "<tr>\n<td>$cols->{date}</td>\n".
                  "<td>$cols->{subject_html}</td>\n".
                  "<td>$cols->{author_html}</td>\n</tr>\n";
      }
      elsif ($RUN_STATE eq 'email') {
         $data .= "$cols->{date} $cols->{subject}\n".
                  "link: $cols->{subject_url}\n";
      }
      elsif ($RUN_STATE eq 'aim') {
         $data .= "$cols->{date} $cols->{subject} $cols->{subject_url}\n";
      }
   }

   return $data;
}

Now, let’s implement the different runtime options. We’ll set up similar conditional code from the format_my_data( ) function in the main body of the script so that the script can handle all of the various output tasks. Here’s the code for outputting to email, file, RSS, HTML, and AIM. The AIM networking code is similar to [Hack #99], so, in the interest of brevity, I’ve declined to show it here:

unless ($RUN_STATE) { print &format_my_data(  ); }
elsif ($RUN_STATE eq 'html') {
   my $html  = "<html><head><title>Bugtraq $date</title></head><body>\n";
   $html    .= "<h1>Bugtraq listings for: $date</h1><table border=0>\n";
   $html    .= "<tr><th>Date</th><th>Subject</th><th>Author</th></tr>\n";
   $html    .= &format_my_data(  ) . "</table></body></html>\n";
   print $html;
}

elsif ($RUN_STATE eq 'email') {
   my $mailer = Net::SMTP->new('your mail server here');
   $mailer->mail('your sending email address');
   $mailer->to('your receiving email address');
   $mailer->data(  );
   $mailer->datasend("Subject: Bugtraq Report for $date\n\n");
   $mailer->datasend( format_my_data );
   $mailer->dataend(  );
   $mailer->quit;
}

elsif ($RUN_STATE eq 'rss') {
   my $rss = XML::RSS->new(version => '0.91');
   $rss->channel(title           => 'SecurityFocus Bugtraq',
                  link            => $url,
                 language        => 'en',
                 description     => 'Latest Bugtraq listings' );
   
   # add items to the RSS object.
   foreach my $cols (@parsed_rows) {
      $rss->add_item(title       => $cols->{date},
                     link        => $cols->{subject_url},
                     description => $cols->{subject} );
   } print $rss->as_string;
}

elsif ($RUN_STATE eq 'aim') {
  # AIM-related code goes here.
}

So what else could you do to enhance this hack? How about adding support for other instant messengers or allowing multiple command-line options at once? Alternatively, what about having the AIM bot email the Bugtraq report upon request, or make it a CGI script and output the RSS to an RSS aggregator like AmphetaDesk (http://www.disobey.com/amphetadesk/) or NetNewsWire (http://ranchero.com/netnewswire)?
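
For the CGI idea, the only genuinely new piece is sending an appropriate Content-Type header before the feed. Something along these lines (a bare-bones, hypothetical wrapper; in practice you would build the feed from @parsed_rows exactly as in the rss branch above) would let AmphetaDesk or NetNewsWire subscribe directly:

#!/usr/bin/perl -w
# bugtraq_rss.cgi -- minimal CGI wrapper sketch.
use strict;
use XML::RSS;

my $rss = XML::RSS->new(version => '0.91');
$rss->channel(title       => 'SecurityFocus Bugtraq',
              link        => 'http://www.security-focus.com/archive/1',
              description => 'Latest Bugtraq listings');
# ...add_item(  ) calls for the scraped rows would go here...

# the CGI-specific part: a Content-Type header, then the XML itself.
print "Content-Type: application/rss+xml\n\n";
print $rss->as_string;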

—William Eastler

Hack #81. Keeping Tabs on the Web via Email

If you find yourself checking your email more than cruising the Web, you might appreciate a little Perl work to bring the Web to your mailbox.

If you’re an info-junky, you have a growing list of sites that you visit daily, maybe hourly. But sometimes, no matter how many times you refresh the page, some sites just don’t update soon enough. It would be better if there were a way to be notified when the site changes, so that you could spend your browsing time better.

Some sites offer a service like this, and others offer syndication feeds that programs can monitor, but there are many sites with which you’re out of luck in this regard. In this case, you’re going to need your own robot.

Planning for Change

For this hack, we’ll choose email as the method of notification, since that seems to be the simplest yet most flexible. We can use some common Perl modules to handle email and download web pages. This just leaves us with figuring out how to determine whether a web page has changed.

Actually, it would be more useful if we could figure out how much a web page has changed. Many web pages change constantly, since some might display the current time, others might show updated comment counts on news stories, and others might include a random quote on the page or feature different headlines for each request. If we’re just interested in major differences, such as a brand new front-page story on a news site, we’d like some relative measure.

While there are likely smarter ways of doing this, one quick way is to use the GNU diff utility to compare downloads of a web page across time. Further, it would be useful if we compared only the text of pages, not the HTML, since we’re more interested in content than layout or markup changes. For this, we can employ the venerable text-based web browser lynx. lynx is commonly found with many Linux distributions and is easily acquired on most other Unix operating systems. This browser already works to format web pages for a plain text display and, with the use of a command-line option, it can redirect this text to a file.

So, given lynx and diff, we can boil web pages down to their text content and compare changes in content. As an added benefit, we can include the text version of web pages in emails we send as an alternative to HTML.

With all this in mind, let’s start our script:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTTP::Status;
use MIME::Lite;

# Locate the utility programs needed
our $lynx = '/usr/bin/lynx';
our $diff = '/usr/bin/diff';

# Define a location to store datafiles,
# and an address for notification
my $data_path = "$ENV{HOME}/.pagediff";
my $email = 'your_email@here.com';

So far, we’ve set up some safety features and loaded up our tool modules. We’ve also located our utility programs, given the script a place to store data, and chosen an email address for notifications. Next, let’s make a list of sites to visit:

my %sites =
  (
   'slashdot'     => ['http://slashdot.org/index.html', 500],
   'penny_arcade' => ['http://www.penny-arcade.com/view.php3', 20],
  );

This is a hash that consists of nicknames for sites and, for each site, a list that consists of a URL and a change threshold. This number is very fuzzy and will require some tweaking to get the right frequency of notification. Higher numbers require more changes before an email goes out. We’ll see how this works in just a minute.

Next, let’s handle each of our favorite sites:

for my $site (keys %sites) {
  my ($url, $threshold) = @{$sites{$site}};

  # Build filenames for storing the HTML content, text
  # content, as well as content from the previous notification.
  my $html_fn = "$data_path/$site.html";
  my $new_fn  = "$data_path/$site.txt";
  my $old_fn  = "$data_path/$site-old.txt";

  # Download a new copy of the HTML.
  getstore($url, $html_fn);

  # Get text content from the new HTML.
  html_to_text($html_fn, $new_fn);

  # Check out by how much the page has changed since last notification.
  my $change = measure_change($new_fn, $old_fn);

  # If the page has changed enough,
  # send off a notification.
  if ($change > $threshold) {
    send_change_notification
      ($email,
       {
        site      => $site,
        url       => $url,
        change    => $change,
        threshold => $threshold,
        html_fn   => $html_fn,
        new_fn    => $new_fn,
        old_fn    => $old_fn
       }
      );

    # Rotate the old text content for the new.
    unlink $old_fn if (-e $old_fn);
    rename $new_fn, $old_fn;
  }
}

The main loop of our script is quite simple. For each site, it does the following:

  • Downloads a new copy of the web page.

  • Saves a copy of the page’s text contents.

  • Measures the amount of change detected between this latest download and content saved from the last time an email was sent. If the change is greater than the threshold for this site, it sends an email summarizing the change and rotates out the previously saved content for the new download.

Calling In Outside Help

Now that we have the backbone of the script started, let’s work on the functions that the script uses. In particular, these first functions will make use of our external tools, diff and lynx:

sub html_to_text {
  my ($html_fn, $txt_fn) = @_;
  open(FOUT, ">$txt_fn");
  print FOUT `$lynx -dump $html_fn`;
  close(FOUT);
}

This function, by way of lynx, extracts the text content from one HTML file and writes it to another file. It just executes the lynx browser with the -dump command-line option and saves that output.

Next, let’s use diff to examine changes between text files:

sub get_changes {
  my ($fn1, $fn2) = @_;
  return `$diff $fn1 $fn2`;
}

Again, this simple function executes the diff program on two files and returns the output of that program. Now, let’s measure the amount of change between two files using this function:

sub measure_change {
  my ($fn1, $fn2) = @_;
  return 0 if ( (!-e $fn1) || (!-e $fn2) );
  my @lines = split(/\n/, get_changes($fn1, $fn2));
  return scalar(@lines);
}

If one of the files to compare doesn’t exist, this function returns no change. But if the files exist, the function calls the get_changes function on two files and counts the number of lines of output returned. This is a dirty way to measure change, but it does work. The more two versions of a file differ, the more lines of output diff will produce. This measure says nothing about the nature of the changes themselves, but it can still be effective if you supply a little human judgment and fudging.

Keep this in mind when you adjust the change thresholds defined at the beginning of this script. You might need to adjust things a few times per site to figure out how much change is important for a particular site. Compared with the complexity of more intelligent means of change detection, this method seems best for a quick script.

Send Out the News

Now that all the tools for extracting content and measuring change are working, we need to work out the payoff for all of this: sending out change notification messages. With the MIME::Lite Perl module (http://search.cpan.org/author/YVES/MIME-Lite/), we can send multipart email messages with both HTML and plain text sections. So, let’s construct and send an email message that includes the original HTML of the updated web page, the text content, and a summary of changes found since the last update.

First, create the empty email and set up the basic headers:

sub send_change_notification {
  my ($email, $vars) = @_;

  # Start constructing the email message
  my $msg = MIME::Lite->new
    (
     Subject => "$vars->{site} has changed.".
       "($vars->{change} > $vars->{threshold})",
     To      => $email,
     Type    => 'multipart/alternative',
    );

  # Create a separator line of '='
  my $sep = ("=" x 75);

Note that we indicate how much the page has changed with respect to the threshold in the subject, and we create a separator line for formatting the text email portion of the message.

Next, let’s build the text itself:

  # Start the text part of email
  # by dumping out the page text.
  my $out = '';
  $out .= "The page at $vars->{url} has changed. ";
  $out .= "($vars->{change} > $vars->{threshold})\n\n";
  $out .= "\n$sep\nNew page text follows:\n$sep\n";

  open(FIN, $vars->{new_fn});
  local $/; undef $/;
  $out .= <FIN>;
  close(FIN);

  # Follow with a diff summary of page changes.
  $out .= "$sep\nSummary of changes follows:\n$sep\n\n";
  $out .= get_changes($vars->{new_fn}, $vars->{old_fn})."\n";

Here, we dump the text contents of the changed web page, courtesy of lynx, followed by the output of the diff utility. It’s a little bit of Perl obscura, but we do some finessing of Perl’s file handling to simplify reading the whole text file into a variable. The variable $/ defines what Perl uses as an end-of-line character, normally set to some sort of carriage return or linefeed combination. By using undef to clear this setting, Perl considers the entire contents of the file as one long line without endings and slurps it all down into the variable.

Now that we have the text of the email, let’s add it to our message:

  # Add the text part to the email.
  my $part1 = MIME::Lite->new
    (
     Type => 'text/plain',
     Data => $out
    );
  $msg->attach($part1);

This bit of code creates a message part containing our text, gives it a header describing its contents as plain text, and adds it to the email message. Having taken care of the text, let’s add the HTML part of the email:

  # Create and add the HTML part of the email, making sure to add a
  # header indicating the base URL used for relative URLs.
  my $part2 = MIME::Lite->new
    (
     Type => 'text/html',
     Path => $vars->{html_fn}
    );
  $part2->attr('Content-Location' => $vars->{url});
  $msg->attach($part2);

  # Send off the email
  $msg->send(  );
}

This code creates an HTML part for our email, including the HTML content we last downloaded and setting the appropriate header to describe it as HTML. We also define another header that lets mail readers know the base URL for the HTML in order to resolve relative URLs. We set this to the original URL of the page so that images and links resolve properly.

Finally, we send off the message.

Hacking the Hack

You’ll probably want to use this script in conjunction with cron [Hack #90] or some other scheduler, to check pages for changes on a periodic basis. Just be polite and don’t run it too often. Checking every hour or so should be often enough for most sites.

As for the script itself, we’re cheating a little, since external tools do most of the work. But when we’re writing hacks, it’s best to be lazy and take advantage of other smart people’s work as much as possible. In working out the amount of change between notifications, we’re pretty inexact and fuzzy, but the method works. An exercise for the reader might be to find better means for measuring change, possibly methods that also can tell what kind of changes happened, to help you make better decisions on when to send notifications.

Also note that, though this hack uses both the diff and lynx programs directly, there are more cross-platform and pure Perl solutions for finding differences between files, such as the Text::Diff (http://search.cpan.org/author/RBS/Text-Diff/) or HTML::Diff (http://search.cpan.org/author/EZRAKILTY/html-diff/) modules on CPAN. And, with a bit of work, use of lynx could be replaced as well.
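
For instance, a pure Perl replacement for the measure_change subroutine, assuming Text::Diff is installed, could look something like this:

use Text::Diff;

# same fuzzy measure as before, but without the external diff program.
sub measure_change {
  my ($fn1, $fn2) = @_;
  return 0 if ( (!-e $fn1) || (!-e $fn2) );
  my @lines = split(/\n/, diff($fn1, $fn2, { STYLE => 'OldStyle' }));
  return scalar(@lines);
}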

—l.m.orchard

Hack #82. Publish IE’s Favorites to Your Web Site

You’re surfing at a friend’s house and think, “What is that URL? I have a link to it in my favorites. I wish I were home.” How about making your favorites available no matter where you go?

You can’t take them with you—your Internet Explorer bookmarks, I mean. They live on a particular machine, accessible to you only when you’re at that machine. Yes, there are some online bookmarking services, but the ones worth using have started making their users ante up or live through pop-up advertising hell. Of course, we Perl hackers don’t have to settle for either.

This hack publishes the contents of your IE Favorites to any server that you can access via FTP, setting you up with a nice little navigable menu frame on the left to hold your favorites and a content area on the right to display the sites you click on. Yes, this hack is a bit Windows- and IE-specific, but before you complain too much, it’s easily extensible to process any form of bookmark data that’s stored in a tree structure, and the output is templated. The template shown here generates just a simple HTML menu system, but templates for PHP, ASP, raw data—anything you like—should be a breeze!

IE’s Favorites

Let’s start by taking a quick look at IE’s Favorites folder. If you use Windows, you probably know that this folder is now used by more than IE, but most people I know, myself included, still use it mainly in the context of web browsing. On Windows NT, 2000, and XP running IE4 or later, the Favorites folder is nothing more than a directory stored within your user profile tree. The easiest and most consistent way to locate the folder is through the USERPROFILE environment variable; you’ll note that, at the top of the script, the configurable global identifying the root of the Favorites tree uses precisely this environment variable by default.

The structure of the Favorites tree itself is simple. It’s a directory tree that contains folders and links. It is possible to put things other than URL links into your Favorites; since we’re interested in publishing web bookmarks, we’ll ignore everything except directories and links (in this context, links are defined as files with a .url extension). A link document contains a bit of data in addition to the actual URL; fortunately, it’s easy to ignore, because the one thing that every link document has is a line that starts with URL= and then specifies the location in question. In our hack, we’ll simply extract this one line with a regular expression.
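
For reference, a link file is just a small INI-style text file. A typical one looks something like this (the URL is only an example; Windows may add other fields, such as icon settings, which the script ignores):

[InternetShortcut]
URL=http://www.oreilly.com/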

What It Does and How It Works

The script goes through three processes:

  1. Parse the Favorites tree and load the structure.

  2. Generate the output documents.

  3. Upload the documents via FTP.

We’ll take a quick look at each and then get right to the code.

Parsing the Favorites tree is handled by walking through the tree recursively using Perl’s system-independent opendir, readdir, and closedir routines. We use File::Spec routines for filename handling, to make enhancing and porting to other systems easier. The structure itself is read into a hash of hashes, one of the basic Perl techniques for creating a tree. For each hash in the tree, subdirectories map to another hash and links map to a scalar with the link URL. Reading the entire Favorites tree into an internal data structure isn’t strictly necessary, but it simplifies and decouples the later processes, and it also provides a great deal of flexibility for enhancements to the script.
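
To make that concrete, here’s a hypothetical snapshot of the kind of structure LoadFavorites builds; the folder names and URLs are invented:

# Folders become nested hash references; links become plain
# scalars holding the link's URL.
my %Favorites = (
    'Favorites' => {
        'News' => {
            "O'Reilly" => 'http://www.oreilly.com/',
            'Tech'     => {
                'Slashdot' => 'http://slashdot.org/',
            },
        },
        'Weather' => 'http://www.weather.com/',
    },
);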

Generating the output based on the Favorites data is done with a template so that the script doesn’t lock its user into any one type of output. When you’re using Perl, Text::Template is always an excellent choice—since Perl itself is the templating language—so we use it here. The template in this hack outputs HTML, defining a simple menu based on the folders and links and using HTML anchors to open the link targets in a named frame. It is expected that the entire set of documents, one document per Favorites directory, will be published to a single output directory, so filenames are generated using each directory’s relative path from the main Favorites directory, each path component being separated by a period. The documents themselves are generated in a temp directory, which the script attempts to remove upon completion.
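
To make the naming scheme concrete, a hypothetical Favorites\News\Tech folder would map to output filenames like this under the default settings:

Favorites            ->  favorites.html
Favorites\News       ->  favorites.news.html
Favorites\News\Tech  ->  favorites.news.tech.html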

The upload code is straightforward and not especially robust. Upload is via FTP, and the published script requires that the FTP parameters be coded into the configuration globals at the top of the file. If anything other than an individual put fails, the code gives up. If a put itself fails, a warning is issued and we move on to the next file.

The Code

You need three files. PublishFavorites.pl is the Perl code that does the work. The template for our example is favorites.tmpl.html. Finally, a simple index.html, which defines the frameset for our menus, will need to be uploaded manually just once.

First, here’s PublishFavorites.pl:

#!/usr/bin/perl -w
use strict;
use File::Spec;
use File::Temp;
use Net::FTP;
use Text::Template;

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

## Configurable Globals

## $FAV_ROOT = Location of the root of the Favorites folder
my $FAV_ROOT = File::Spec->join( $ENV{USERPROFILE}, 'Favorites' );

## $FAV_NAME = Top level name to use in favorites folder tree
my $FAV_NAME = 'Favorites';

## $FAV_TMPL = Text::Template file; output files will use same extension
my $FAV_TMPL = 'favorites.tmpl.html';

## Host data for publishing favorites via ftp
my $FAV_HOST = 'myserver.net';
my $FAV_PATH = 'favorites';
my $FAV_USER = 'username';
my $FAV_PASS = 'password';

## End of Configurable Globals

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

my $_FAV_TEMPDIR = File::Temp->tempdir( 'XXXXXXXX', CLEANUP => 1 );

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sub LoadFavorites {

  # Recursively load the structure of an IE
  # Favorites directory tree into a tree of hashes.

  my $FolderIn = shift;      # Folder to process
  my $FavoritesOut = shift;  # Hashref to load with this folder's entries

  # Do a readdir into an array for a
  # quick load of the directory entries.
  opendir( FOLDER, $FolderIn ) ||
    die "Could not open favorites folder '$FolderIn'";
  my @FolderEntries = readdir( FOLDER );
  closedir( FOLDER );

  # Process each entry in the directory.
  foreach my $FolderEntry ( @FolderEntries ) {

    # Skip special names . and ..
    next if $FolderEntry eq '.' || $FolderEntry eq '..';

    # Construct the full path to the current entry.
    my $FileSpec = File::Spec->join( $FolderIn, $FolderEntry );

    # Call LoadFavorites recursively if we're processing a directory.
    if ( -d $FileSpec && !( -l $FileSpec ) ) {
      $FavoritesOut->{$FolderEntry} = {};
      LoadFavorites( $FileSpec, $FavoritesOut->{$FolderEntry} );
    }

    # If it's not a directory, check for a filename that ends with '.url'.
    # When we find a link file, extract the URL and map the favorite to it.
    elsif ( $FolderEntry =~ /^.*\.url$/i ) {
      my ( $FavoriteId ) = $FolderEntry =~ /^(.*)\.url$/i;
      next if !open( FAVORITE, $FileSpec );
      ( $FavoritesOut->{$FavoriteId} ) =
           join( '', <FAVORITE> ) =~ /^URL=([^\n]*)\n/m;
      close( FAVORITE );
    }
  }
}

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sub MakeDocName {

  # Quick hack to generate a safe filename for a favorites entry. Replaces
  # all whitespace and special characters with underscores, concatenates
  # parent spec with the new spec, and postfixes the whole thing with
  # the same file extension as the globally named template document.

  my $FavoriteIn = shift;        # Label of new favorites entry
  my $ParentFilenameIn = shift;  # MakeDocName of the parent level

  my ( $FileType ) = $FAV_TMPL =~ /\.([^\.]+)$/;
  $FavoriteIn =~ s/(\s+|\W)/_/g;
  $ParentFilenameIn =~ s/$FileType$//;
  return lc( $ParentFilenameIn . $FavoriteIn . '.' . $FileType );
}

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sub GenerateFavorites {

  # Recurse through a tree of Favorites entries and generate a document for
  # each level based on the globally named template document.

  my $FavoritesIn = shift;       # Hashref to current tree level
  my $FolderNameIn = shift;      # Name of the current folder
  my $ParentFilenameIn = shift;  # MakeDocName of the parent level

  # Create shortcut identifiers for things that get reused a lot.
  my $Folder = $FavoritesIn->{$FolderNameIn};
  my $FolderFilename = MakeDocName( $FolderNameIn, $ParentFilenameIn );

  # Separate the entries in the current folder into folders and links.
  # Folders can be identified because they are hash references, whereas
  # links are mapped to simple scalars (the URL of the link).
  my (%Folders,%Links);
  foreach my $Favorite ( keys( %{$Folder} ) ) {
    if ( ref( $Folder->{$Favorite} ) eq 'HASH' ) {
      $Folders{$Favorite} = { label => $Favorite,
        document => MakeDocName( $Favorite, $FolderFilename ) };
    }
    else {
      $Links{$Favorite}={label => $Favorite, href => $Folder->{$Favorite} };
    }
  }

  # Set up Text::Template variables, fill in the template with the folders
  # and links at this level of the favorites tree, and then output the
  # processed document to our temporary folder.
  my $Template = Text::Template->new( TYPE => 'FILE',
    DELIMITERS => [ '<{', '}>' ], SOURCE => $FAV_TMPL );
  my %Vars = (
    FAV_Name => $FAV_NAME,
    FAV_Home => MakeDocName( $FAV_NAME, '' ),
    FAV_Folder => $FolderNameIn,
    FAV_Parent => $ParentFilenameIn,
    FAV_Folders => \%Folders,
    FAV_Links => \%Links
  );
  my $Document = $Template->fill_in( HASH => \%Vars );
  my $DocumentFile = File::Spec->join( $_FAV_TEMPDIR, $FolderFilename );
  if ( open( FAVORITES, ">$DocumentFile" ) ) {
    print( FAVORITES $Document );
    close( FAVORITES );
  }

  # Generate Favorites recursively for each of this folder's subfolders.
  foreach my $Subfolder ( keys( %Folders ) ) {
    GenerateFavorites( $Folder, $Subfolder, $FolderFilename );
  }
}

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sub PublishFavorites {

  # Publish the generated documents via FTP. Pretty
  # much just gives up if something goes wrong.

  my $ftp = Net::FTP->new( $FAV_HOST ) ||
    die( "Cannot connect to '$FAV_HOST'" );
  $ftp->login( $FAV_USER, $FAV_PASS ) ||
    die( "Authorization for user '$FAV_USER' failed" );
  $ftp->cwd( $FAV_PATH ) ||
    die( "Could not CWD to '$FAV_PATH'" );
  opendir( FOLDER, $_FAV_TEMPDIR ) ||
    die( "Cannot open working directory '$_FAV_TEMPDIR'" );
  my @FolderEntries = readdir( FOLDER );
  closedir( FOLDER );
  foreach my $FolderEntry ( @FolderEntries ) {
    next if $FolderEntry eq '.' || $FolderEntry eq '..';
    $ftp->put( File::Spec->join( $_FAV_TEMPDIR, $FolderEntry ) ) ||
      warn( "Could not upload '$FolderEntry'...skipped" );
  }
  $ftp->quit;
}

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

sub main {
  my %Favorites;
  $Favorites{$FAV_NAME} = {};
  LoadFavorites( $FAV_ROOT, $Favorites{$FAV_NAME} );
  GenerateFavorites( \%Favorites, $FAV_NAME, '' );
  PublishFavorites(  );
}

main(  );

Here’s our example template, favorites.tmpl.html:

<html>
<body>
  <h1><a href="<{$FAV_Home}>"><{$FAV_Name}></a></h1>
  <select onChange="location.replace(this[this.selectedIndex].value)">
    <{
      $OUT .= '<option selected>' . $FAV_Folder . '</option>' . "\n";
      if ( $FAV_Parent ne '' ) {
        $OUT .= '<option value="' . $FAV_Parent . '">..</option>' . "\n";
      }
      foreach my $folder ( sort( keys( %FAV_Folders ) ) ) {
        $OUT .= '<option value="' . $FAV_Folders{$folder}->{document} .
          '">&gt;' . $FAV_Folders{$folder}->{label} . '</option>' . "\n";
      }
    }>
  </select>
  <table>
    <{
      foreach my $link ( sort( keys( %FAV_Links ) ) ) {
        $OUT .= '<tr><td><a target="net" href="' .
          $FAV_Links{$link}->{href} . '">' .
          $FAV_Links{$link}->{label} . '</a></td></tr>' . "\n";
      }
    }>
  </table>
</body>
</html>

And, finally, here’s the simple index.html:

<html>
<head>
  <title>Favorites</title>
</head>
<frameset cols="250,*">
  <frame name="nav" scrolling="yes" src="favorites.html" />
  <frame name="net" src="http://refdesk.com"/>
</frameset>
</html>

Running the Hack

Before you run the code, you need to take care of a few configuration items.

First, let’s make sure that your Favorites directory is where the script thinks it will be. At a command prompt, execute the following:

dir "%USERPROFILE%"\Favorites

If you get a directory listing with lots of names that appear to match things in your IE Favorites, then you’re good to go. If this directory doesn’t exist or if its contents don’t appear to be your Favorites, then you’ll have to find out where on your disk your Favorites are really stored and then change the $FAV_ROOT variable at the top of the script to match.

Second, you need to define your FTP information through the $FAV_HOST, $FAV_PATH, $FAV_USER, and $FAV_PASS variables at the top of the script.

Third, just once, you need to manually upload the index.html document to the directory on your server where you’re going to publish your Favorites. Of course, you are free to rename this document and publish your Favorites to a directory that already contains other files, but we suggest setting aside a separate directory. You are also welcome to change the default page that the index.html file initially shows in the net frame.

Okay, now simply run the script as follows:

% perl PublishFavorites.pl

The script runs quietly unless it encounters a problem. For most problems it might encounter, it just gives up and outputs an error message.

That’s it. Suppose you publish to the Favorites directory on http://www.myserver.net. Just point your browser to http://www.myserver.net/favorites, and you should have a web-accessible menu of all your IE Favorites! An example is available at http://www.ronpacheco.net/favorites/.

Hacking the Hack

There’s a ton of room for enhancement and modification to this hack. Most changes will probably fall into one of the hack’s three major processing tasks: loading the bookmark data, generating output, and publishing.

First, you can make it read something other than the IE Favorites tree. Maybe you want to read Mozilla bookmarks, or suck links off a web site, or read your own tree or bookmarks—whatever. If you can read it into the simple tree structure that the script already uses, you’ll have a plug-and-play subroutine.

Second, you can change the output. You can pretty up the existing HTML template, you can write new templates for things beyond simple HTML, or you can completely rip out the output section and replace it with something new. The framework for the code to traverse the bookmark tree is already in place. You can use the templating tools as is, or you can use the framework to build something new.

Finally, you can get more sophisticated about publishing. If someone were to ask me if, in practice, I’d really hardcode my username and password into a script and then use that script to publish stuff via an unsecured FTP session, I’d probably have to say no. I’m fairly comfortable putting the access information in the script, as long as I have good control over the system where it’s located—I’ve been doing it for a couple decades now without any incidents—but I would be reluctant to use cleartext FTP. In fact, I use FTP to my servers all the time, including a variation of this script, but I tunnel all the connections through SSH. For more sophistication, you could add SSH support directly to the script, and you could consider methods of publication other than FTP.

Like I said, there’s a ton of possibilities, limited only by the imagination of the hacker!

—Ron Pacheco

Hack #83. Spidering GameStop.com Game Prices

Looking to get notification when “Army Men: Quest for Some Semblance of Quality” goes on sale at $5.99? With this hack, you’ll be able to keep an eye on your most desired (or derided) video game titles.

All work and no play makes Jack a dull geek. Of course, having to hunt down game prices to figure out what he can afford to play on his PlayStation 2 makes Jack even duller. It’s so much better to get a spider to do it for him.

We like GameStop.com (http://www.gamestop.com), a retail site for console and PC video games, so we came up with a simple spider that gathers information about games for a particular platform—as written, the script covers XBox games—but, as you’ll see, it’s easy to adapt to other uses.

The Code

Save the following code as gamestop.pl:

#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;
use LWP::Simple;

# the magical URL.
my $url = "http://www.gamestop.com/search.asp?keyword=&platform=26".
          "&lookin=title&range=all&genre=0&searchtype=adv&sortby=title";

# the magical data.
my $data = get($url) or die $!;

# the magical parser.
my $p = HTML::TokeParser->new(\$data);

# now, find every table that's 510 and 75.
while (my $token = $p->get_tag("table")) {
    next unless defined($token->[1]{height});
    next unless defined($token->[1]{width});
    next unless $token->[1]{height} == 75;
    next unless $token->[1]{width} == 510;

    # get our title.
    $p->get_tag("font"); $p->get_tag("a");
    my $title = $p->get_trimmed_text;

    # and our price.
    $p->get_tag("font"); $p->get_tag("/b");
    my $ptoken = $p->get_token;
    my $price = $ptoken->[1];
    $price =~ s/\$//;

    # comma spliced.
    print "\"$title\",$price\n";
}

Running the Hack

The hack is simple enough. It gathers information about XBox games, sorted by title, and puts that information into a comma-delimited file, as per the following output:

% perl gamestop.pl 
"4x4 Evolution 2 - Preowned",16.99
"Aggressive Inline - Preowned",16.99
"Air Force Delta Storm - Preowned",27.99
"Alias",49.99
...etc...

It’s very basic right now, but there’s some fun stuff we can build in.

Hacking the Hack

Let’s start by making the request keyword-based instead of platform-based; maybe you’re interested in racing games and don’t care about the platform.

GameStop by keyword

Add these two lines to the top of the script, after the use statements:

# get our query, else die miserably.
my $query = shift @ARGV; die unless $query;

Then, change your magical URL, like this:

 # the magical URL.
  my $url = "http://www.gamestop.com/search.asp?keyword=$query&platform=".
            "&lookin=title&range=all&genre=0&searchtype=adv&sortby=title";

This’ll give you the first page of results for your keyword. For example:

% perl gamestop.pl racing
"All Star Racing",7.99
"Andretti Racing - Preowned",9.99
"Andretti Racing - Preowned",7.99
"Antz Extreme Racing - Preowned",16.99
"Antz Racing",4.99
"Antz Racing - Preowned",29.99
"ATV Quad Power Racing 2 - Preowned",24.99
"ATV Quad Power Racing 2 - Preowned",17.99
"ATV Quad Power Racing 2 - Preowned",17.99
"ATV: Quad Power Racing 2",19.99
"Batman: Gotham City Racer - Preowned",27.99
"Beetle Adventure Racing - Preowned",29.99

Putting the results in a different format

Of course, getting the results in a comma-delimited format might not be what you want. How about sorting results by price and saving them to an RSS file, so you can have an RSS feed of the cheapest games that match a keyword? (Unabashed capitalist hackers could even add an affiliate code to the link URL.)

Here’s how to do it. The first thing you want to do is add use XML::RSS to the use lines at the top of the script. Then, as in the first example, you can take the query word from the command line, or you can hardcode it into the URL. In this example, I hardcode it, with the idea that you can put this on your server and run it periodically as a cron job:

  # the magical URL.
  my $url = "http://www.gamestop.com/search.asp?".
            "keyword=your search keyword here&platform=".
            "&lookin=title&range=all&genre=0&searchtype=adv&sortby=title";

Now, you want to change the output from a comma-delimited file to an RSS feed. Remove these lines:

 # comma spliced.
 print "\"$title\",$price\n";

and add these lines above the magical URL line:

# start the RSS feed.
my $rss = XML::RSS->new(version => '0.91');
$rss->channel(
    'link'       => "http://www.gamestop.com",
    title        => "Game Prices from GameStop",
    description  => "Great Games and Stuff!"
);

Then, in place of the lines you just removed, add the code that adds each result to the RSS feed:

# add this item
# to our RSS feed.
$rss->add_item(
   title       => "$title, $price...", 
   'link'      => "http://www.gamestop.com/search.asp?keyword=$title".
                  "&platform=0&lookin=title&range=all&genre=0&sortby=title"
);

Finally, add these lines at the end of the script, to save your output as a feed:

# and save our RSS.
$rss->save("gamestop.rdf");

There are several minor hacks you can try with this script. GameStop.com offers several different search options; try experimenting with the different searches and see how they affect the result URLs. Experimenting with the URL options in the magical URL lines can get you lots of different results. Likewise, as written, the script reports only on the first page of results; walking the entire listing of search results can be done with WWW::Mechanize [Hack #21] or a manual loop, as in the sketch that follows.
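
Here’s a very rough sketch of the WWW::Mechanize approach; the assumption that the results pages carry a “next”-style link would need to be checked against the real site:

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;

my $start = "http://www.gamestop.com/search.asp?keyword=racing&platform=".
            "&lookin=title&range=all&genre=0&searchtype=adv&sortby=title";

my $mech = WWW::Mechanize->new( agent => 'Mozilla/5.0' );
$mech->get($start);

while (1) {
    my $data = $mech->content;
    # ... feed $data to the HTML::TokeParser loop shown earlier ...

    # Stop when there's nothing that looks like a "next page" link.
    my $next = $mech->find_link( text_regex => qr/next/i ) or last;
    $mech->get( $next->url );
}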

Hack #84. Bargain Hunting with PHP

If you’re always on the lookout for the best deals, coupons, and contests, a little bit of PHP-scraping code can help you stay up-to-date.

Scraping content is a task that can be handled by most programming languages. PHP is quickly becoming one of the most popular scripting languages, and it is particularly well suited to scraping work. With a moderate grasp of PHP, programmers can write scrapers in a matter of minutes. In this section, we’ll work through some of the basic code and concepts for scraping with PHP.

There are a handful of useful functions that most scraping tasks will need, which will make writing customized scrapers almost painless. For the sake of simplicity, we won’t use regular expressions here, but the more agile programmers will quickly note where regular expressions might make these functions work better.

The first function that we want uses PHP’s fopen( ) function to fetch individual pages from a web server. For more sophisticated scrapers, a direct socket connection is probably more desirable, but that’s another matter. For now, we’ll go the simple way:

function getURL( $pURL ) {
   $_data = null;
   if( $_http = fopen( $pURL, "r" ) ) {
      while( !feof( $_http ) ) {
         $_data .= fgets( $_http, 1024 );
      }
      fclose( $_http );
   }
   return( $_data );
}

Calling this function is done simply, like this:

$_rawData = getURL( "http://www.example.com/" );

If $_rawData is null, the function wasn’t able to fetch the page. If $_rawData contains a string, we’re ready for the next step.

Because every author codes her HTML slightly differently, it’s useful to normalize the raw HTML data that getURL( ) returns. We can do this with the cleanString( ) function, which simply removes newline, carriage return, tab, and extra space characters. Regular expressions could simplify this function a bit, if you are comfortable with them.

function cleanString( $pString ) {
   $_data = str_replace( array( chr(10), chr(13), chr(9) ), chr(32), [RETURN]
$pString );
   while( strpos( $_data, str_repeat( chr(32), 2 ), 0 ) !== false ) {
      $_data = str_replace( str_repeat( chr(32), 2 ), chr(32), $_data );
   }
   return( trim( $_data ) );
}

We’ll clean up the raw HTML source with the following code:

$_rawData = cleanString( $_rawData );

Now, we have some data that is easy to parse. Two other useful functions will parse out particular pieces of the source and get data from individual HTML tags:

function getBlock( $pStart, $pStop, $pSource, $pPrefix = true ) {
   $_data = null;
   $_start = strpos( strtolower( $pSource ), strtolower( $pStart ), 0 );
   $_start = ( $pPrefix == false ) ? $_start + strlen( $pStart ) : $_start;
   $_stop = strpos( strtolower( $pSource ), strtolower( $pStop ), $_start );
   if( $_start !== false && $_stop > $_start ) {
      $_data = trim( substr( $pSource, $_start, $_stop - $_start ) );
   }
   return( $_data );
}

function getElement( $pElement, $pSource ) {
   $_data = null;
   $pElement = strtolower( $pElement );
   $_start = strpos( strtolower( $pSource ), chr(60) . $pElement, 0 );
   $_start = strpos( $pSource, chr(62), $_start ) + 1;
   $_stop = strpos( strtolower( $pSource ), "</" . $pElement . [RETURN]
   chr(62), $_start );
   if( $_start > strlen( $pElement ) && $_stop > $_start ) {
      $_data = trim( substr( $pSource, $_start, $_stop - $_start ) );
   }
   return( $_data );
}

We can use each of these functions with the following code:

$_rawData = getBlock( start_string, end_string, raw_source, [RETURN]
include_start_string );
$_rawData = getElement( html_tag, raw_source );

Let’s assume for a moment that we have source code that contains the string "Total of 13 results", and we want just the number of results. We can use getBlock( ) to get that number with this code:

$_count = getBlock( "Total of", "results", $_rawData, false );

This returns "13“. If we set $pPrefix to true, $_count will be "Total of 13“. Sometimes, you might want the start_string included, and other times, as in this case, you won’t.

The getElement( ) function works basically the same way, but it is specifically designed for parsing HTML-style tags instead of dynamic strings. Let’s say our example string is "Total of <b>13</b> results". In this case, it’s easier to parse out the bold element:

$_count = getElement( "b", $_rawData );

This returns "13" as well.

It’s handy to put the scraping functions into an includable script, because it keeps you from having to copy/paste them into all your scraping scripts. In the next example, we save the previous code into scrape_func.php.

Now that we have the basics covered, let’s scrape a real page and see it in action. For this example, we’ll scrape the latest deals list from TechDeals.net (http://www.techdeals.net).

The Code

Save the following code as bargains.php:

/* include the scraping functions script:  */
include( "scrape_func.php" ); 

/* Next, we'll get the raw source code of
   the page using our getURL(  ) function:  */
$_rawData = getURL( "http://www.techdeals.net/" ); 

/* And clean up the raw source for easier parsing:  */
$_rawData = cleanString( $_rawData ); 

/* The next step is a little more complex. Because we've already
   looked at the HTML source, we know that the items start and
   end with two particular strings. We'll use these strings to
   get the main data portion of the page:*/
$_rawData = getBlock( "<div class=\"NewsHeader\">",
                      "</div> <div id=\"MenuContainer\">", $_rawData ); 

/* We now have the particular data that we want to parse into
   an itemized list. We do that by breaking the code into an
   array so we can loop through each item: */
$_rawData = explode( "<div class=\"NewsHeader\">", $_rawData ); 

/* While iterating through each value, we
   parse out the individual item portions:  */
foreach( $_rawData as $_rawBlock ) {
   $_item = array(  );
   $_rawBlock = trim( $_rawBlock );
   if( strlen( $_rawBlock ) > 0 ) {

      /*   The title of the item can be found in <h2> ... </h2> tags   */
      $_item[ "title" ] = strip_tags( getElement( "h2", $_rawBlock ) );

      /*   The link URL is found between
           http://www.techdeals.net/rd/go.php?id= and "   */
      $_item[ "link" ] = getBlock( "http://www.techdeals.net/rd/go.php?id=",
                                   chr(34), $_rawBlock );

      /*   Posting info is in <span> ... </span> tags   */
      $_item[ "post" ] = strip_tags( getElement( "span", $_rawBlock ) );

      /*   The description is found between an </div> and a <img tag   */
      $_item[ "desc" ] = cleanString( strip_tags( getBlock( "</div>",
                                      "<img", $_rawBlock ) ) );

      /*   Some descriptions are slightly different,
           so we need to clean them up a bit   */
      if( strpos( $_item[ "desc" ], "Click here for the techdeal", 0 ) [RETURN]
      > 0 ) {
         $_marker = strpos( $_item[ "desc" ], "Click here for the techdeal", [RETURN]
         0 );
         $_item[ "desc" ] = trim( substr( $_item[ "desc" ], 0, $_marker ) );
      }

      /*   Print out the scraped data   */
      print( implode( chr(10), $_item ) . chr(10) . chr(10) );

      /*   Save the data as a string (used in the mail example below)   */
      $_text .= implode( chr(10), $_item ) . chr(10) . chr(10);
   }
}

Running the Hack

Invoke the script from the command line, like so:

% php -q bargains.php

Values on Video
http://www.techdeals.net/rd/go.php?id=28
Posted 08/06/03 by david
TigerDirect has got the eVGA Geforce FX5200 Ultra 128MB video card
with TV-Out & DVI for only $124.99+S/H after a $20 rebate. 

Potent Portable
http://www.techdeals.net/rd/go.php?id=30
Posted 08/06/03 by david
Best Buy has got the VPR Matrix 220A5 2.2Ghz Notebook for just
$1049.99 with free shipping after $250 in rebates.

...etc...

Hacking the Hack

This output could be emailed easily, or you could even put it into an RSS feed. If you want to email it, you can use PHP’s mail( ) function:

mail( "me@foo.com", "Latest Tech Deals", $_text );

But how do you output RSS in PHP? While there are many ways to go about it, we’ll use the simplest to keep everything concise. Creating an RSS 0.91 feed is a matter of three small sections of code—the channel metadata, the item block, and the closing channel tags:

<rss version="0.91">
   <channel>
      <title><?= htmlentities( $_feedTitle ) ?></title>
      <link><?= htmlentities( $_feedLink ) ?></link>
      <description><?= htmlentities( $_feedDescription ) ?></description>
      <language>en-us</language> 

      <item>
         <title><?= htmlentities( $_itemTitle ) ?></title>
         <link><?= htmlentities( $_itemLink ) ?></link>
         <description><?= htmlentities( $_itemDescription ) ?></description>
      </item> 

   </channel>
</rss>

By putting together these three simple blocks, we can quickly output a full RSS feed. For example, let’s use our scraper and output RSS instead of plain text:

<rss version="0.91">
   <channel>
      <title>TechDeals: Latest Deals</title>
      <link>http://www.techdeals.net/</link>
      <description>Latest deals from TechDeals.net (scraped)</description>
      <language>en-us</language>
<?
   include( "scrape_func.php" );
   $_rawData = getURL( "http://www.techdeals.net/" );
   $_rawData = cleanString( $_rawData );
   $_rawData = getBlock( "<div class=\"NewsHeader\">",
                         "</div> <div id=\"MenuContainer\">", $_rawData );
   $_rawData = explode( "<div class=\"NewsHeader\">", $_rawData );
   foreach( $_rawData as $_rawBlock ) {
      $_item = array(  );
      $_rawBlock = trim( $_rawBlock );
      if( strlen( $_rawBlock ) > 0 ) {
         $_item[ "title" ] = strip_tags( getElement( "h2", $_rawBlock ) );
         $_item[ "link" ] 
         = getBlock( "http://www.techdeals.net/rd/go.php?id=", 
         chr(34), $_rawBlock );
         $_item[ "post" ] = strip_tags( getElement( "span", $_rawBlock ) );
         $_item[ "desc" ] = cleanString( strip_tags( getBlock( "</div>",
                                      "<img", $_rawBlock ) ) );
         if( strpos($_item[ "desc" ], "Click for the techdeal", 0 ) > 0 ) {
            $_marker = strpos($_item[ "desc" ], "Click for the techdeal",0 );
            $_item[ "desc" ] = trim(substr( $_item[ "desc" ], 0, $_marker) );
         }
?>
      <item>
         <title><?= $_item ["title" ] ?></title>
         <link><?=  $_item[ "link" ] ?></link>
         <description>
            <?= $_item[ "desc" ] . " (" . $_item[ "post" ] . ")" ?>
         </description>
      </item>
<?
      }
   }
?>
   </channel>
</rss>

Keep in mind that this is the quick-and-dirty way to create RSS. If you plan on generating a lot of RSS, look into RSS 1.0 and build yourself a PHP class for the RSS-generating code.

As you can see, a few simple functions and a few lines of code are all that is needed to make a usable scraper in PHP. Customizing the script and the output are a matter of personal whim. In this particular example, you could also parse out information about the comments that are included in the items, or you could merge in other bargain sites, like AbleShoppers (http://www.ableshopper.com) or Ben’s Bargains (http://www.bensbargains.net).

—James Linden

Hack #85. Aggregating Multiple Search Engine Results

Even though Google may solve all your searching needs on a daily basis, there may come a time when you need a “super search”—something that queries multiple search engines or databases at once.

Google is still the gold standard for search engines and still arguably the most popular search spot on the Web. But after years of stagnation, the search engine wars are firing up again. AlltheWeb.com (http://www.alltheweb.com) in particular is working hard to offer new search syntax, a larger web index (over 3.2 billion URLs at the time of this writing), and additional interface options. If you want to keep up with searching on the Web, it behooves you to try search engines other than Google, if only to get an idea of how the other engines are evolving.

This hack builds a meta-search engine, querying several search engines in turn and displaying the aggregated results. Actually, it can query more than just search engines; it can request data from anything to which you can submit a search request. It does so by using a set of plug-ins—each of which knows the details of a particular search engine or site’s search request syntax and the format of its results—that perform the search and return the results. The main script, then, does nothing more than farm out the request to these plug-ins and let them perform their magic. This is an exercise in hacking together a client/server protocol. The protocol I use is simple: each plug-in needs to return URL and text pairs. How do we delimit one from the other? By finding a character that’s illegal in URLs, such as the common tab, and using that to separate our data.

The protocol runs as follows:

  1. The server starts up a plug-in as an executable program, with the search terms as command-line parameters.

  2. The client responds by printing one result per new line, in the format of URL, tab, then text.

  3. The server receives the data, formats it a little before printing, and then moves on to the next available plug-in.

Note that because we have a simple call and response pattern, the plug-ins can query anything, including your own local databases with Perl’s DBI, Python scripts that grok FTP servers, or PHP concoctions that do reverse lookups on phone numbers. As long as the plug-in returns the data in URL-tab-text format, what it does and how it’s programmed don’t matter.
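
To make the protocol concrete, here’s about the smallest plug-in imaginable. It ignores the Web entirely and greps a hypothetical bookmarks.txt file in which each line is already a URL, a tab, and a description:

#!/usr/bin/perl -w
use strict;

# Take the search terms from the command line, print one
# "URL<tab>text" pair per matching line, and exit.
my $query = join " ", @ARGV;
open my $fh, '<', 'bookmarks.txt' or exit;   # no file, no results.
while (my $line = <$fh>) {
    chomp $line;
    my ($url, $text) = split /\t/, $line, 2;
    next unless defined $text and $text =~ /\Q$query\E/i;
    print "$url\t$text\n";
}
close $fh;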

The Code

The following short piece of code demonstrates the server portion, which searches for a ./plugins directory and executes all the code within:

#!/usr/bin/perl -w

# aggsearch - aggregate searching engine
#
# This file is distributed under the same licence as Perl itself.
#
# by rik - ora@rikrose.net

######################
# support stage      #
######################

use strict;

# change this, if necessary.
my $pluginDir = "plugins";

# if the user didn't enter any search terms, yell at 'em.
unless (@ARGV) { print 'usage: aggsearch "search terms"', "\n"; exit; }

# this routine actually executes the current
# plug-in, receives the tabbed data, and sticks
# it into a result array for future printing.
sub query {
    my ($plugin, $args, @results) = (shift, shift);
    my $command = $pluginDir . "/" . $plugin . " " . (join " ", @$args);
    open RESULTS, "$command |" or die "Plugin $plugin failed!\n";
    while (<RESULTS>) {
        chomp; # remove new line.
        my ($url, $name) = split /\t/;
        push @results, [$name, $url];
    } close RESULTS;

    return @results;
}

######################
# find plug-ins stage #
######################

opendir PLUGINS, $pluginDir
   or die "Plugin directory \"$pluginDir\"".
     "not found! Please create, and populate\n";
my @plugins = grep {
    stat $pluginDir . "/$_"; -x _ && ! -d _ && ! /\~$/;
} readdir PLUGINS; closedir PLUGINS;


######################
# query stage        #
######################

for my $plugin (@plugins){
    print "$plugin results:\n";
    my @results = query $plugin, \@ARGV;
    for my $listref (@results){
        print " $listref->[0] : $listref->[1] \n"
    } print "\n";
}

exit 0;

The plug-ins themselves are even smaller than the server code, since their only purpose is to return a tab-delimited set of results. Our first sample looks through the freshmeat.net (http://freshmeat.net) software site:

#!/usr/bin/perl -w

# Example freshmeat searching plug-in
#
# This file is distributed under the same licence as Perl itself.
#
# by rik - ora@rikrose.net

use strict;
use LWP::UserAgent;
use HTML::TokeParser;

# create the URL from our incoming query.
my $url = "http://freshmeat.net/search-xml?q=" . join "+", @ARGV;

# download the data.
my $ua = LWP::UserAgent->new(  );
$ua->agent('Mozilla/5.0');
my $response = $ua->get($url);
die $response->status_line . "\n"
  unless $response->is_success;

my $stream = HTML::TokeParser->new (\$response->content) or die "\n";
while (my $tag = $stream->get_tag("match")){
    $tag = $stream->get_tag("projectname_full");
    my $name = $stream->get_trimmed_text("/projectname_full");
    $tag = $stream->get_tag("url_homepage");
    my $url = $stream->get_trimmed_text("/url_homepage");
    print "$url\t$name\n";
}

Our second sample uses the Google API:

#!/usr/bin/perl -w

# Example Google searching plug-in

use strict;
use warnings;
use SOAP::Lite;

# all the Google information
my $google_key  = "your API key here";
my $google_wdsl = "GoogleSearch.wsdl";
my $gsrch       = SOAP::Lite->service("file:$google_wdsl");
my $query       = join "+", @ARGV;

# do the search...
my $result = $gsrch->doGoogleSearch($google_key, $query,
                          1, 10, "false", "",  "false",
                          "lang_en", "", "");

# and print the results.
foreach my $hit (@{$result->{'resultElements'}}){
   print "$hit->{URL}\t$hit->{title}\n";
}

Our last example covers AlltheWeb.com:

#!/usr/bin/perl -w

# Example alltheweb searching plug-in
#
# This file is distributed under the same licence as Perl itself.
#
# by rik - ora@rikrose.net

use strict;
use LWP::UserAgent;
use HTML::TokeParser;

# create the URL from our incoming query.
my $url = "http://www.alltheweb.com/search?cat=web&cs=iso-8859-1" .
          "&q=" . (join "+", @ARGV) . "&_sb_lang=en";

# download the data.
my $ua = LWP::UserAgent->new(  );
$ua->agent('Mozilla/5.0');
my $response = $ua->get($url);
die $response->status_line . "\n"
  unless $response->is_success;

my $stream = HTML::TokeParser->new (\$response->content) or die "\n";
while (my $tag = $stream->get_tag("p")){
    $tag = $stream->get_tag("a");
    my $name = $stream->get_trimmed_text("/a");
    last if $name eq "last 10 queries";
    my $url = $tag->[1]{href};
    print "$url\t$name\n";
}

Running the Hack

Invoke the script from the command line, like so:

% perl aggsearch.pl spidering 
alltheweb results:
 Google is now better at spidering dynamic sites. : [long url here] 
 Submitting sites to search engines : [long url here]
 WebcamCrawler.com  : [long url here]
 ...etc...

freshmeat results:
 HouseSpider : http://freshmeat.net/redir/housespider/28546/url_homepage/ 
 PhpDig : http://freshmeat.net/redir/phpdig/15340/url_homepage/
 ...etc...

google results:
 What is Spidering? : http://www.1afm.com/optimization/spidering.html
 SWISH-Enhanced Manual: Spidering : http://swish-e.org/Manual/spidering.html
 ...etc...

The power of combining data from many sources gives you more scope for working out trends in the information, a technique commonly known as data mining.

—Richard Rose

Hack #86. Robot Karaoke

Who says people get to have all the fun? With this hack, you can let your computer do a little singing, by scraping the LyricsFreak.com web site and sending the results to a text-to-speech translator.

There are things that are text-only and things that are multimedia. Then there’s this hack, which turns boring old text into multimedia—specifically, a .wav file.

This hack, as it stands, is actually pretty silly. It searches the lyric collections at LyricsFreak.com for the keywords you specify, then sends the matching lyrics to yet another site (http://naturalvoices.com) that turns them into a .wav file. If you’re running a Win32 system, the code will then automatically play the .wav file (“sing” the lyrics, for some narrow definition of sing) via the Win32::Sound module (http://search.cpan.org/author/ACALPINI/Win32-Sound/).

Listening to your computer’s rendition of the Spider-Man theme song can be detrimental to your health.

As you’re playing with this code, you might want to think of more sublime and less ridiculous implementations. Do you have a site read by low-vision people? Are there short bits of text, such as a local weather forecast, that would be useful for them to have read aloud? Would it be helpful to have a button that would convert a story summary to a .wav file for later download?

The Code

One of the modules used with this code, Win32::Sound, is for Win32 machines only. Since it’s used to play back the generated .wav file, you will not get a “singing” robot if you’re on a non-Win32 machine; you’ll just get a .wav file, suitable for playing through your preferred music player.
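
If you’re not on Win32 and still want automatic playback, one option is simply to shell out to whatever command-line player you have handy. Here’s a minimal sketch; the play utility (from the SoX package) and the filename are assumptions about your system:

#!/usr/bin/perl -w
use strict;

# Hand a saved .wav to an external player.
my $wav_file = shift || 'output.wav';
system('play', $wav_file) == 0
    or warn "Couldn't play $wav_file: $?\n";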

Save this script as robotkaroake.pl:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use URI::Escape;
use Win32::Sound;
use SOAP::Lite;

# use your own Google API key here!
my $google_key  = "your Google key here";
my $google_wdsl = "GoogleSearch.wsdl";

# load in our lyrics phrase from the command line.
my $lyrics_phrase = shift or die "Usage: robotkaroake.pl <phrase>\n";

# and perform the search on Google.
my $google_search_term = "intitle:\"$lyrics_phrase\" site:lyricsfreak.com";
my $googleSearch = SOAP::Lite->service("file:$google_wdsl");
my $result = $googleSearch->doGoogleSearch(
                      $google_key, $google_search_term,
                      0, 10, "false", "", "false",
                      "", "", "");

# if there are no matches, then say so and die.
die "No LyricsFreak matches were found for '$lyrics_phrase'.\n"
          if $result->{estimatedTotalResultsCount} == 0;
  
# and take the first Google result as
# the most likely location on LyricsFreak.com.
my @results         = @{$result->{'resultElements'}};
my $first_result    = $results[0];
my $lyricsfreak_url = $first_result->{'URL'};
print "Downloading lyrics from:\n $lyricsfreak_url\n";

# and download the data from LyricsFreak.com.
my $content = get($lyricsfreak_url) or die $!;
print "Connection to LyricsFreak was successful.\n";

# we have the data, so let's parse it.
# all lyrics are stored in a pre tag,
# so we delete everything before and after.
$content =~ s/.*<pre><b>.*<\/b><br>//mgis;
$content =~ s/<\/pre>.*//mgis;
my @lyrics_lines = split("\x0d", $content);

# AT&T's demo TTS service takes a maximum of 30 words,
# so we'll create a mini chunk of the lyrics to send off.
# each of these chunks will be sent to the TTS server
# then saved separately as multiple mini-wav files.
my (@lyrics_chunks, $current_lyrics_chunk); my $line_counter = 0;
for (my $i = 0; $i <= scalar(@lyrics_lines) - 1; ++$i) {
    next if $lyrics_lines[$i] =~ /^\s*$/;
    $current_lyrics_chunk .= $lyrics_lines[$i] . "\n";

    if (($line_counter == 5) || ($i == scalar(@lyrics_lines) - 1) ) {
        push(@lyrics_chunks, $current_lyrics_chunk);
        $current_lyrics_chunk = ''; $line_counter = 0;
    } $line_counter++;
}

# now, we'll go through each chunk,
# and send it off to our TTS server.
my @temporary_wav_files;
foreach my $lyrics_chunk (@lyrics_chunks) {

    # and download the data.
    my $url = 'http://morrissey.naturalvoices.com/tts/cgi-bin/nph-talk';
    my $req = HTTP::Request->new('POST', $url); # almost there!
    $req->content('txt=' . uri_escape($lyrics_chunk) .
                  '&voice=crystal&speakButton=SPEAK');
    $req->content_type('application/x-www-form-urlencoded');
    my $res = LWP::UserAgent->new->simple_request($req);

    # incorrect server response? then die.
    unless ($res->is_success || $res->code == 301) {
       die "Error connecting to TTS server: " . $res->status_line . ".\n"; }

    # didn't get the response we wanted? die.
    if ($res->content !~ /can be found <A HREF=([^>]*)>here<\/A>/i) {
       die "Response from TTS server not understood. Odd.\n"; }

    # side effect of error checking above is to set $1 to
    # the actual wav file that was generated. this is good.
    my $wav_url  = "http://morrissey.naturalvoices.com$1";
    my $wav_file = $1; # for use in saving to disk.
    $wav_file =~ s/.*?\/(\w+.wav)/$1/;
    getstore($wav_url, "$wav_file") or
     die "Download of $wav_file failed: $!";
    push(@temporary_wav_files, $wav_file);
}

# with all our files downloaded, play them in
# order with the Win32::Sound module. else, they
# just sit there in hopes of the user playing them.
print  "Playing downloaded wav files...\n";
foreach my $temporary_wav_file (@temporary_wav_files) {
    print " Now Playing: $temporary_wav_file\n";
    Win32::Sound::Play("$temporary_wav_file");

}

Running the Hack

Invoke the script on the command line, passing it the phrase you’re interested in; the script will search for that phrase in the titles of pages on LyricsFreak.com. If it doesn’t find the phrase, it’ll just stop:

% perl robotkaroake.pl "fish heads"
No LyricsFreak matches were found for 'fish heads'.

If it does find the phrase, it’ll download the lyrics and generate the .wav file:

% perl robotkaroake.pl "born never asked"
Downloading lyrics from:
 http://www.lyricsfreak.com/l/laurie-anderson/81556.html
Connection to LyricsFreak was successful.
Playing downloaded wav files...
 Now Playing: 7a0c0093f2f531ac98691152d1f74367.wav

The previous example shows the output of a rather short entry. Longer songs will result in more .wav files saved to the current directory, each representing a small chunk (a single chunk representing one request to the TTS server):

% perl robotkaroake.pl "under the moon"
Downloading lyrics from:
 http://www.lyricsfreak.com/i/insane-clown-posse/67657.html
Connection to LyricsFreak was successful.
Playing downloaded wav files...
 Now Playing: fe34e081ab8a3abaeecdb1e50b030209.wav
 Now Playing: 80709499765f9bfe75d3c7234c435a79.wav
 Now Playing: f1ca99233f9cdc6a78f311db887914f1.wav
 Now Playing: fd6b61421f3fc56510cf4b9e0d3a0e12.wav
 Now Playing: b954f58f906d53ec312bbcc6579ebe12.wav
 Now Playing: 407415e685260754174cf45338ba4d10.wav
 Now Playing: 8a2ade6e7f8fe950ddcb58747d241694.wav
 Now Playing: 22ed038190b9ed0fb4e3077655503422.wav

Hack #87. Searching the Better Business Bureau

Is that new company offering to build your house, deliver your groceries, and walk your dog legit and free of complaint? Find out with an automated query of the Better Business Bureau’s web site.

If you’re a citizen of the United States, you’re probably aware of the Better Business Bureau (http://www.bbb.org), a nonprofit organization that acts as a neutral party in resolving complaints between businesses and consumers. There are over 125 local Better Business Bureaus across the country.

The Better Business Bureau (BBB) company database is searchable by URL. This hack runs a BBB search by URL and provides information on a business if one is found. Further, the hack searches PlanetFeedback.com for any additional online feedback about that company.

Links to feedback and basic company information are provided, but a tally of customer complaints from the BBB is not. Why? Each of the 125 local bureaus provides varying amounts of data and formats that data in slightly different ways; adding the code to handle them all would be, we suspect, a monumental undertaking. So, we are not going to provide that here; instead, we’ll stick to basic company information only.

The Code

Save this script as bbbcheck.pl:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use URI::Escape;

# $MAX_BBB_SEARCH_RETRIES is the number of times that the
# script will attempt to look up the URL on the BBB web site. 
# (Experimentally, the BBB web site appeared to give "database
# unavailable" error messages about 30% of the time.)
my $MAX_BBB_SEARCH_RETRIES = 3;

# $MAX_BBB_REFERRAL_PAGE_RETRIES is the number of times the
# script will attempt to download the company information
# from the URL provided in the search results.
my $MAX_BBB_REFERRAL_PAGE_RETRIES = 3;

# suck in our business URL, and append it to the BBB URL.
my $business_url = shift || die "You didn't pass a URL for checking!\n";
my $search_url   = "http://search.bbb.org/results.html?tabletouse=".
                   "url_search&url=" . $business_url;
my %company; # place we keep company info.

# look for the results until requested.
for (my $i = 1; $i <= $MAX_BBB_SEARCH_RETRIES; ++$i) {
    my $data = get($search_url); # gotcha, bugaboo!

    # did we have a problem? pause if so.
    if (!defined($data) or $data =~ /apologize.*delay/) {
       print "Connection to BBB failed. Waiting 5 seconds to retry.\n";
       sleep(5); next; # let's try this again, shall we?
    }

    # die if there's no data to yank.
    die "There were no companies found for this URL.\n"
         if $data =~ /There are no companies/i;

    # get the company name, address, and redirect.
    if ($data =~ /<!-- n -->.*?href="(.*?)">(.*)<!--  -->.*?">(.*)<\/f/i) {
       $company{redir}   = "http://search.bbb.org/$1";
       $company{name}    = $2; $company{address} = $3;
       $company{address} =~ s/<br>/\n/g;
       print "\nCompany name and address:\n";
       print "$company{name}\n$company{address}\n\n";
    }

    # if there was no redirect, then we can't
    # move on to the local BBB site, so we die.
    unless ($company{redir}) {
      die "Unable to process the results returned. You can inspect ".
          "the results manually at the following url: $search_url\n"; }

    last if $data;
}

# now that we have the redirect for the local BBB site,
# we'll try to download its contents and parse them.
for (my $i = 1; $i <= $MAX_BBB_REFERRAL_PAGE_RETRIES; ++$i) {
    my $data = get($company{redir}); 

    # did we have a problem? pause if so.
    unless (defined $data) {
       print "Connection to BBB failed. Waiting 5 seconds to retry.\n";
       sleep(5); next; # let's try this again, shall we?
    }
    
    $data =~ s/\n|\f|\r//g; # strip newlines so the regexes below can match.
    if ($data=~/Date:<\/b>.*?<td.*?>(.*?)<\/td>/i){$company{start_date}=$1;}
    if ($data=~/Entity:<\/b>.*?<td.*?>(.*?)<\/td>/i){$company{entity}=$1;}
    if ($data=~/l ?:<\/b>.*?<td.*?>(.*?)<\/td>/i){$company{principal}=$1;}
    if ($data=~/Phone.*?:<\/b>.*?<td.*?>(.*?)<\/td>/i){$company{phone}=$1;}
    if ($data=~/Fax.*?:<\/b>.*?<td.*?>(.*?)<\/td>/){$company{fax}=$1;}
    if ($data=~/Status:<\/b>.*?<td.*?>(.*?)<\/td>/){$company{mbr}=$1;}
    if ($data=~/BBB:<\/b>.*?<td.*?>(.*?)<\/td>/){$company{joined}=$1;}
    if ($data=~/sification:<\/b>.*?<td.*?>(.*?)<\/td>/){$company{type}=$1;}
    last if $data;
}

# print out the extra data we've found.
print "Further information (if any):\n";
foreach (qw/start_date entity principal phone fax mbr joined type/) {
   next unless $company{$_}; # skip blanks.
   print " Start Date: " if $_ eq "start_date";
   print " Type of Entity: " if $_ eq "entity";
   print " Principal: " if $_ eq "principal";
   print " Phone Number: " if $_ eq "phone";
   print " Fax Number: " if $_ eq "fax";
   print " Membership Status: " if $_ eq "mbr";
   print " Date Joined BBB: " if $_ eq "joined";
   print " Business Classification: " if $_ eq "type";
   print "$company{$_}\n";
} print "\n";

# alright. we have all our magic data that we can get from the 
# BBB, so let's see if there's anything on PlanetFeedback.com to display.
my $planetfeedback_url = "http://www.planetfeedback.com/sharedLetters".
                         "Results/1,2933,,00.html?frmCompany=".
                         uri_escape($company{name})."&frmFeedbackType".
                         "One=0&frmIndustry=0&frmFeedbackTypeTwo=0".
                         "&frmMaxValue=20&buttonClicked=submit1".
                         "&frmEventType=0";
my $data = get($planetfeedback_url) or # go, speed
  die "Error downloading from PlanetFeedback: $!"; # racer, go!

# did we get anything worth showing?
if ($data =~ /not posted any Shared Letters/i) {
   print "No feedback found for company '$company{name}'\n";
} else { print "Feedback available at $planetfeedback_url\n"; }

Running the Hack

Invoke the script on the command line with the URL of a business site you’d like to check. If there’s no match at the BBB—a distinct possibility, since it doesn’t contain every known business URL—the script will stop:

% perl bbbcheck.pl http://www.oreilly.com
There were no companies found for this URL.

If there is a match, it’ll give you some information about the company, then check PlanetFeedback.com for additional data. If they’ve received any comments on the business at hand, you’ll be provided a URL for further reading.

Let’s do a little checking up on Microsoft, shall we?

% perl bbbcheck.pl http://www.microsoft.com
Company name and address:
MICROSOFT CORPORATION
9255 Towne Center Dr 4th Fl
SAN DIEGO, CA

Further information (if any):
 Start Date: January 1975
 Type of Entity: Corporation

 Principal: Ms Shaina Houston FMS
 Phone Number: January 1975
 Fax Number: (858) 909-3838
 Membership Status: Yes
 Date Joined BBB: May 2003
 Business Classification: Computer Sales & Service

Feedback available at http://www.planetfeedback.com/sharedLettersResults/
1,2933,,00.html?frmCompany=MICROSOFT%20CORPORATION&frmFeedbackTypeOne=0& 
frmIndustry=0&frmFeedbackTypeTwo=0&frmMaxValue=20&buttonClicked=submit1& 
frmEventType=0

Hacking the Hack

The script here is extensive in what it does. After all, it visits two sites and provides you with a fair amount of information. But despite that, it’s still pretty bare-bones. The output is sent only to the screen, and the amount of information it scrapes is limited because of the multiple formats of the various local BBBs.

So, when you’re planning on improving the script, focus on two different things. First, think about how you might scrape more information if it were presented in a more standard format. For example, say you want to search only businesses in San Francisco. The BBB search site allows for that, though you’ll have to search by business name instead of URL (see the first search option at http://search.bbb.org/search.html). If you search for businesses only in San Francisco, you’ll get results only from the Golden Gate BBB. With one data format, you can access more information, including any complaint numbers and the company’s standing in the BBB.

The second thing you’ll want to improve is output. Currently, this hack sends out only plain text, but, as you saw previously, the PlanetFeedback.com URL is extensive. To fix this, you might want to spit out HTML instead, allowing you to simply click a link instead of copying and pasting. For that matter, you could set up an array with several business URLs and send all their results to the same file.
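
For instance, here’s a rough sketch of the batch idea: it loops over a hypothetical list of business URLs, runs the existing bbbcheck.pl for each, turns any URLs in the output into clickable links, and writes everything to a single HTML report:

#!/usr/bin/perl -w
use strict;

my @businesses = qw(
    http://www.microsoft.com
    http://www.example.com
);

open my $html, '>', 'bbbreport.html' or die "Can't write report: $!";
print $html "<html><body>\n";
for my $biz (@businesses) {
    my $report = `perl bbbcheck.pl $biz`;                # reuse the script.
    $report =~ s{(https?://\S+)}{<a href="$1">$1</a>}g;  # linkify URLs.
    print $html "<h2>$biz</h2>\n<pre>$report</pre>\n";
}
print $html "</body></html>\n";
close $html;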

Hack #88. Searching for Health Inspections

How healthy are the restaurants in your neighborhood? And when you find a good one, how do you get there? By combining databases with maps!

You don’t have to scrape a site to build a URL that leads to their resources! This hack searches Seattle’s King County database of restaurant inspections (http://www.decadeonline.com/main.phtml?agency=skc), which can be queried with a complete restaurant name or just a single word. The script returns a list of the restaurants found, links to the restaurant’s health inspection information, and also adds a direct link to a MapQuest map of the restaurant’s location.

What? Isn’t scraping MapQuest against its TOS? Yes, but this program doesn’t touch the MapQuest site; instead, it builds a direct link to a relevant MapQuest map. So, while a user might access a MapQuest page based on this program’s output, we never programmatically access the site and thus never violate the TOS.

The Code

Save this script as kcrestaurants.pl:

#!/usr/bin/perl -w
use strict;
use HTML::TableExtract;
use LWP::Simple;
use URI::Escape;

# get our restaurant name from the command line.
my $name = shift || die "Usage: kcrestaurants.pl <string>\n";

# and our constructed URL to the health database.
my $url = "http://www.decadeonline.com/results.phtml?agency=skc".
          "&forceresults=1&offset=0&businessname=" . uri_escape($name) .
          "&businessstreet=&city=&zip=&soundslike=&sort=FACILITY_NAME";

# download our health data.
my $data = get($url) or die $!;
die "No restaurants matched your search query.\n"
    if $data =~ /no results were found/;
 
# and suck in the returned matches.
my $te = HTML::TableExtract->new(keep_html => 1, count => 1);
$te->parse($data) or die $!; # yum, yum, i love second table!

# and now loop through the data.
foreach my $ts ($te->table_states) {
  foreach my $row ($ts->rows) {
     next if $row->[1] =~ /Site Address/; # skip if this is our header.
     foreach ( qw/ 0 1 / ) { # remove googly poofs.
        $row->[$_] =~ s/^\s+|\s+|\s+$/ /g; # remove whitespace.
        $row->[$_] =~ s/\n|\f|\r/ /g; # remove newlines.
     } 

     # determine name/addresses.
     my ($url, $name, $address, $mp_url); 
     if ($row->[0] =~ /href="(.*?)">.*?2">(.*?)<\/font>/) {
         ($url, $name) = ($1, $2); # almost there.
     } if ($row->[1] =~ /2">(.*?)<\/font>/) { $address = $1; }

     # and the MapQuest URL.
     if ($address =~ /(.*), ([^,]*)/) {
         my $street = $1; my $city = $2;
         $mp_url = "http://www.mapquest.com/maps/map.adp?".
                   "country=US&address=" . uri_escape($street) .
                   "&city=" . $city . "&state=WA&zipcode=";
     }

     print "Company name: $name\n";
     print "Company address: $address\n";
     print "Results of past inspections:\n ".
           "http://www.decadeonline.com/$url\n";
     print "MapQuest URL: $mp_url\n\n";
  }
}

Running the Hack

To run the hack, just specify the restaurant name or keyword you want to search for. If there’s no restaurant found based on your query, it’ll say as much:

% perl kcrestaurants.pl perlfood
No restaurants matched your search query.

A matching search returns health inspection and MapQuest links:

% perl kcrestaurants.pl "restaurant le gourmand"
Company name: RESTAURANT LE GOURMAND
Company address: 425 NW MARKET ST , Seattle
Results of past inspections:
 http://www.decadeonline.com/fac.phtml?
   agency=skc&forceresults=1&facid=FA0003608
MapQuest URL: http://www.mapquest.com/maps/map.adp?country=US&address
   =425%20NW%20MARKET%20ST%20&city=Seattle&state=WA&zipcode=

Or, if there are a number of results, it returns a complete list:

% perl kcrestaurants.pl restaurant
Company name: RESTAURANT EL TAPATIO
Company address: 3720 FACTORIA BL , Bellevue
Results of past inspections:
 http://www.decadeonline.com/fac.phtml?
   agency=skc&forceresults=1&facid=FA0003259
MapQuest URL: http://www.mapquest.com/maps/map.adp?country=US&address
   =3720%20FACTORIA%20BL%20&city=Bellevue&state=WA&zipcode=

Company name: RESTAURANT ICHIBAN
Company address: 601 S MAIN ST , Seattle
Results of past inspections:
 http://www.decadeonline.com/fac.phtml?
   agency=skc&forceresults=1&facid=FA0001743
MapQuest URL: http://www.mapquest.com/maps/map.adp?country=US&address
   =601%20S%20MAIN%20ST%20&city=Seattle&state=WA&zipcode=

...

Hacking the Hack

If you don’t live in Seattle, you might not personally have much use for this particular example. But if you live anywhere within the United States, the code can be adapted to suit you. Many counties in the United States have posted their restaurant inspection scores online. Go to your state or county’s official web site (the county site is better if you know what it is) and search for restaurant inspections. From there, you should be able to find restaurant scores from which you can build a script like this. Bear in mind that different counties have different levels of information.

You don’t have to use MapQuest either. If you have the name, city, and state of a restaurant, you can build a URL to get the phone number from Google. (However, you can’t use the Google API to perform this search, because it does not yet support the phonebook: syntax.)

Let’s take our previous example, Restaurant Le Gourmand in Seattle, Washington. The Google syntax for a business phonebook query would be:

bphonebook:Restaurant Le Gourmand Seattle WA

And the URL to lead to the result would look like this:

http://www.google.com/search?q=bphonebook:Restaurant+Le+Gourmand+Seattle+WA

You might want to use that instead of, or in addition to, a link to MapQuest.
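
If you’d like kcrestaurants.pl to print that link for each result, here’s a minimal sketch. It assumes the $name and $address variables from the loop in the code above, and it pulls the city out of the address the same way the MapQuest section does (with the state hardcoded to WA, as before):

     # build a Google business-phonebook URL (sketch; assumes the
     # city follows the last comma in $address, as above).
     if ($address =~ /(.*), ([^,]*)/) {
         my $city = $2;
         my $phone_url = "http://www.google.com/search?q=" .
                         uri_escape("bphonebook:$name $city WA");
         print "Google phonebook URL: $phone_url\n";
     }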

Hack #89. Filtering for the Naughties

Use search engines to construct your own parental control ratings for sites.

As we’ve attempted to show several times in this book, your scripts don’t have to start and end with simple Perl spidering. You can also incorporate various web APIs (such as Technorati [Hack #66]). In this hack, we’re going to add some Google API magic to see whether the domains in a list pulled off a page contain prurient (i.e., naughty) content, as determined by Google’s SafeSearch filtering mechanism.

As the hack is implemented, a list of domains is pulled off Fark (http://www.fark.com), a site known for its odd selection of daily links. For each domain, up to 50 of its URLs (gathered via a Google search) are put into an array, and each array item is then checked to see if it appears in a Google search with SafeSearch enabled. If it does, it’s considered a good URL. If it doesn’t, it’s put under suspicion of being a not-so-good URL. The idea is to get a sense of how much of an entire domain is being filtered, instead of just one URL.

Filtering mechanisms are not perfect. Sometimes they filter things that aren’t bad at all, while sometimes they miss objectionable content. While the goal of this script is to give you a good and general idea of a domain’s content on the naughtiness scale, it won’t be perfect.

The Code

Save the following code as purity.pl:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use SOAP::Lite;

# fill in your google.com API information here.
my $google_key  = "your Google API key here";
my $google_wdsl = "GoogleSearch.wsdl";
my $gsrch       = SOAP::Lite->service("file:$google_wdsl");

# get our data from Fark's "friends".
my $fark = get("http://www.fark.com/") or die $!;
$fark =~ m!Friends:</td></tr>(.*?)<tr><td class=\"lmhead\">Fun Games:!migs; 
my $farklinks = $1; # all our relevances are in here.

# and now loop through each entry.
while ($farklinks =~ m!href="(.*?)"!gism) {
   my $farkurl = $1; next unless $farkurl;
   my @checklist; # urls to check for safety.
   print "\n\nChecking $farkurl.\n";

   # getting the full result count for this URL.
   my $count = $gsrch->doGoogleSearch($google_key, $farkurl,
                        0, 1, "false", "",  "false", "", "", "");
   my $firstresult = $count->{estimatedTotalResultsCount};
   print "$firstresult matching results were found.\n";
   if ($firstresult > 50) { $firstresult = 50; }

   # now, get a maximum of 50 results, with no safe search.
   my $counter = 0; while ($counter < $firstresult) {

       my $urls = $gsrch->doGoogleSearch($google_key, $farkurl,
                           $counter, 10, "false", "",  "false", "", "", "");

       foreach my $hit (@{$urls->{resultElements}}) {
           push (@checklist, $hit->{URL}); 
       } $counter = $counter +10; 
   }

   # and now check each of the matching URLs.
   my (@goodurls, @badurls); # storage.
   foreach my $urltocheck (@checklist) {
       $urltocheck =~ s/http:\/\///;

       my $firstcheck = $gsrch->doGoogleSearch($google_key, $urltocheck,
                                 0, 1, "true", "",  "true", "", "", "");

       # check our results. if no matches, it's naughty.
       my $firstnumber = $firstcheck->{estimatedTotalResultsCount} || 0;
       if ($firstnumber == 0) { push @badurls, $urltocheck; }
       else { push @goodurls, $urltocheck; }
   }

   # and spit out some results.
   my ($goodcount, $badcount) = (scalar(@goodurls), scalar(@badurls));
   print "There are $goodcount good URLs and $badcount ".
         "possibly impure URLs.\n"; # wheeEeeeEE!

   # display the bad URLs if there are only a few.
   unless ( $badcount >= 10 || $badcount == 0) {
       print "The bad URLs are\n";
       foreach (@badurls) {
          print " http://$_\n"; 
       }
    }

   # happy percentage display.
   my $percent = $goodcount * 2; my $total = $goodcount+$badcount;
   if ($total==50) { print "This URL is $percent% pure!"; }

}

Running the Hack

The hack requires no command-line arguments. Simply run it from the command line as you would any Perl script, and it’ll return a list of domains and each domain’s purity percentage (as determined by Google’s SafeSearch):

% perl purity.pl

Checking http://www.aprilwinchell.com/.
161 matching results were found.
There are 36 good URLs and 14 possibly impure URLs.
This URL is 72% pure!

Checking http://www.badjocks.com/.
47 matching results were found.
There are 36 good URLs and 9 possibly impure URLs.
The bad URLs are
 http://www.thepunchline.com/cgi-bin/links/bad_link.cgi?ID=4052&d=1
 http://www.ilovebacon.com/020502/i.shtml
 http://www.ilovebacon.com/022803/l.shtml
...

Hacking the Hack

You might find something else you want to scrape, such as the links on your site’s front page. Are you linking to something naughty by mistake? How about performing due diligence on a site you’re thinking about linking to; will you inadvertently be leading readers to sites of a questionable nature via a seemingly innocent intermediary? Perhaps you’d like to check entries from a specific portion of the Yahoo! or DMOZ directories [Hack #47]? Anything that generates a list of links is fair game for this script.
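
For instance, to run the check against the links on your own front page instead of Fark’s friends list, you could replace the Fark-specific scraping in purity.pl with something like this minimal sketch (http://www.example.com/ is a placeholder for your own site, and the body of the existing while loop becomes the body of the foreach):

# pull every outbound link from your own front page instead of
# Fark's "Friends" list (http://www.example.com/ is a placeholder).
my $page = get("http://www.example.com/") or die $!;
my @checkme;
while ($page =~ m!href="(http.*?)"!gis) { push @checkme, $1; }

foreach my $farkurl (@checkme) {
   # ...the rest of the existing while loop's body goes here...
}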

As it stands, the script checks a maximum of 50 URLs per domain. While this makes for a pretty thorough check, it also makes for a long wait, especially if you have a fair number of domains to check. You may decide that checking 10 URLs per domain is a far better thing to do. In that case, change the line that caps the count at 50 so that it reads:

if ($firstresult > 10) { $firstresult = 10; }
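
One thing to watch: the purity-percentage math at the bottom of the script assumes exactly 50 checked URLs (it multiplies the good count by 2 and prints only when the total is 50). If you lower the cap, you’ll also want to generalize that calculation, perhaps along these lines:

   # compute the purity percentage for any number of checked URLs.
   my $total = $goodcount + $badcount;
   if ($total > 0) {
       my $percent = int(100 * $goodcount / $total);
       print "This URL is $percent% pure!";
   }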

When Tara originally wrote the code, she was a little concerned that it might be used to parse naughty sites and generate lists of naughty URLs for porn peddling. So, she chose not to display the list of naughty URLs unless they made up only a small fraction of the final results (currently, fewer than 10 of the 50 URLs). You might want to change that, especially if you’re using this script to check links from your own site and want an idea of the kind of content you might be linking to. In that case, you’ll need to change just one line:

unless ( $badcount >= 50 || $badcount == 0) {

By increasing the count to 50, you’ll be informed of all the bad sites associated with the current domain. Just be forewarned: certain domains may return nothing but the naughties, and even the individual words that make up the returned URLs can be downright disturbing.
