Build Your Own Podcatcher

Using Perl, you can quickly build a command-line podcatcher for yourself.

Rolling your own command-line podcatcher, like the one shown here, gives you ultimate flexibility in what podcasts you download and when you fetch them. You can also hook up this script to a cron job or to a Windows batch file and download new podcasts overnight.

The Code

Save this code as spc.pl:

   #!/usr/bin/perl -w
   use Storable qw ( store retrieve );
   use FileHandle;
   use LWP::Simple qw( get );
   use strict;

   # The path to the history file that remembers
   # what we have downloaded.

   use constant HISTORY_FILE => ".history";

   # The file that includes the URLs of all of the feeds

   use constant FEEDS_FILE => "feeds.txt";

   # The directory to use for output of the enclosure files

   use constant OUTPUT_DIR => "enclosures";

   # Loads all of the feeds from the feeds file and returns
   # a reference to an array of feed URLs.

   sub feeds_load()
   {
     my $feeds = [];
     my $fh = new FileHandle( FEEDS_FILE )
       or die "Cannot open ".FEEDS_FILE.": $!\n";
     while( <$fh> )
     {
       chomp;
       # Skip blank lines so they are not treated as feed URLs
       next unless ( /\S/ );
       push @$feeds, $_;
     }
     $fh->close();
     return $feeds;
   }

   # Returns the filename from a URL

   sub parse_filename($)
   {
     my ( $fname ) = @_;

     # Remove the arguments portion of the URL
     $fname =~ s/\?.*$//;
     # Trim anything up to the final slash
     $fname =~ s/.*\///;

     return $fname;
   }
   
   # Parses a feed and finds the title of the feed and the
   # URLs for all of the enclosures

   sub parse_feed($)
   {
     my ( $rss ) = @_;

     my $info = {};
     my $urls = [];

     while( $rss =~ /(\<item\>.*?\<\/item\>)/sg )
     {
       my $item = $1;
       # The enclosure tag may span several lines, hence /s
       if ( $item =~ /(\<enclosure.*?\>)/s )
       {
         my $enc = $1;
         # Accept either double or single quotes around the URL
         if ( $enc =~ /url=[\"\'](.*?)[\"\']/i )
         {
           push @$urls, {
             url => $1,
             filename => parse_filename( $1 )
           };
         }
       }
     }
     $info->{enclosures} = $urls;

     # Remove the items so the first remaining title is the feed's own
     $rss =~ s/\<item\>.*?\<\/item\>//sg;
     my $title = "";
     if ( $rss =~ /\<title\>(.*?)\<\/title\>/s )
     {
       $title = $1;
       # Strip leading and trailing whitespace
       $title =~ s/^\s+//g;
       $title =~ s/\s+$//g;
       # Strip out the returns and line feeds
       $title =~ s/\n|\r//g;
       # Strip out any HTML entities
       $title =~ s/\&.*?;//g;
       # Strip out any slashes
       $title =~ s/\///g;
     }
     $info->{title} = $title;

     return $info;
   }

   # Grabs and parses a feed. Then adds the enclosures
   # referenced in the feed to the queue.

   sub feed_read($$)
   {
     my ( $queue, $rss_url ) = @_;

     print "Reading feed $rss_url\n";

     my $rss = get $rss_url;
     # get returns undef when the feed cannot be fetched
     unless ( defined $rss )
     {
       print "\tCould not fetch feed\n";
       return;
     }
     my $info = parse_feed( $rss );

     foreach my $item ( @{$info->{enclosures}} )
     {
       push @$queue, {
         url => $item->{url},
         filename => $item->{filename},
         feed => $info->{title}
       };
     }

     print "\tFound ".scalar(@{$info->{enclosures}})." enclosures\n";
   }

   # Reads all of the feeds in the feed lists and creates
   # a queue of enclosures to retrieve.

   sub feeds_read($)
   {
     my ( $feeds ) = @_;
     my $queue = [];
     foreach my $feed ( @$feeds )
     {
       feed_read( $queue, $feed );
     }
     return $queue;
   }

   # Loads the history file and returns its contents as a hash
   # table. If a URL is in the hash table then we have already
   # downloaded it.

   sub history_load()
   {
     my $history = {};
     $history = retrieve( HISTORY_FILE ) if ( -e HISTORY_FILE );
     $history = {} unless ( $history );
     return $history;
   }

   # Saves the history out to the history file.

   sub history_save($)
   {
     my ( $history ) = @_;
     store( $history, HISTORY_FILE );
   }

   # Checks if a URL is in the history, and thus has already been
   # downloaded.

   sub history_check($$)
   {
     my ( $history, $url ) = @_;
     return ( exists $history->{ $url } );
   }

   # Adds a URL to the history

   sub history_add($$)
   {
     my ( $history, $url ) = @_;
     $history->{$url} = 1;
   }

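   # Downloads an enclosure and saves it out to a file in a subdirectory
   # of the output directory. There is one subdirectory for each
   # feed. Returns 1 on success and 0 on failure.

   sub download_enclosure($$$)
   {
     my ( $url, $filename, $feed ) = @_;

     my $dirname = OUTPUT_DIR."/".$feed;
     my $fullpath = $dirname."/".$filename;

     mkdir( $dirname );

     print "Getting $url...\n";

     my $data = get $url;
     # get returns undef on failure; report it so the URL stays
     # out of the history and is retried on the next run
     unless ( defined $data )
     {
       print "\tfailed\n";
       return 0;
     }

     my $fh = new FileHandle( $fullpath, "w" );
     unless ( $fh )
     {
       print "\tcould not write $fullpath: $!\n";
       return 0;
     }
     binmode($fh);
     print $fh $data;
     $fh->close();

     print "\tdone\n";

     return 1;
   }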

   # Downloads all of the items in the queue that are not in the history
   # already.

   sub download_queue($$)
   {
     my ( $history, $queue ) = @_;
     foreach my $item ( @$queue )
     {
       next if ( history_check( $history, $item->{url} ) );
       if ( download_enclosure( $item->{url},
                                $item->{filename},
                                $item->{feed} ) )
       {
         history_add( $history, $item->{url} );
         history_save( $history );
       }
     }
   }

   # Create the output directory

   mkdir( OUTPUT_DIR );

   # Read the feeds, build the queue, get the history and download
   # enclosures that haven't already been downloaded.

   my $feeds = feeds_load();
   my $queue = feeds_read( $feeds );
   my $history = history_load();
   download_queue( $history, $queue );

The script starts at the bottom (after all the subroutines have been set up). The first thing it does is load the feed list from the feeds.txt file. Here is an example feeds.txt file:

   http://www.curry.com/xml/rss.xml
   http://www.boundcast.com/index.xml

After that, the feeds_read subroutine calls feed_read on each feed URL to download the RSS and parse it, looking for enclosures.

The fun part of feed_read is the parse_feed subroutine, which looks for the <item> and <enclosure> tags and builds an array of hash tables to store what it finds. All the enclosures from all the feeds go into one big array called the queue. The queue is a to-do list of podcasts the script plans to download.
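To make the parsing step concrete, here is a minimal, self-contained sketch that runs the same kind of regular expressions over a made-up feed. The feed text, the example.com URL, and the variable names are illustrative assumptions and are not part of spc.pl.

```perl
#!/usr/bin/perl -w
use strict;

# A made-up feed for illustration; only the first item has an enclosure.
my $rss = <<'END_RSS';
<rss version="2.0">
<channel>
<title> Example Cast </title>
<item>
<title>Episode 1</title>
<enclosure url="http://example.com/audio/ep1.mp3?session=42" length="1234" type="audio/mpeg" />
</item>
<item>
<title>Text-only post</title>
</item>
</channel>
</rss>
END_RSS

my @enclosures;
while ( $rss =~ /(<item>.*?<\/item>)/sg )
{
  my $item = $1;

  # Find the enclosure tag first, then the url attribute inside it
  next unless ( $item =~ /(<enclosure.*?>)/s );
  my $enc = $1;
  next unless ( $enc =~ /url=["'](.*?)["']/i );
  my $url = $1;

  # Same two steps as parse_filename: drop the query string,
  # then everything up to the final slash.
  ( my $filename = $url ) =~ s/\?.*$//;
  $filename =~ s/.*\///;

  push @enclosures, { url => $url, filename => $filename };
}

# The feed title is the first <title> left once the items are removed.
( my $channel = $rss ) =~ s/<item>.*?<\/item>//sg;
my ( $title ) = $channel =~ /<title>\s*(.*?)\s*<\/title>/s;

foreach my $enc ( @enclosures )
{
  print "$title: $enc->{filename} <- $enc->{url}\n";
}
```

Each element of @enclosures here has the same url and filename keys that spc.pl stores in its queue. Regular expressions are quick to write but brittle on unusual feeds, so treat this as a sketch of the approach and not as a general RSS parser.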

The script keeps a history of the files it has downloaded in a .history file that is loaded into a hash table. Both the queue and the history are passed to the download_queue function, which downloads only the items that are not in the history. I could use the existence of the file on disk to decide whether to download it. However, I would then have to keep the enclosure files in the output directory to avoid downloading them twice, and that would limit what I can do with the files after I download them. Therefore, I keep a separate history.
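The history mechanism can be tried out in isolation. The following is a rough sketch, assuming a temporary directory in place of the real .history file and using made-up example.com URLs; the "download" is only a stand-in so that nothing is fetched from the network.

```perl
#!/usr/bin/perl -w
use strict;
use Storable qw( store retrieve );
use File::Temp qw( tempdir );

# Use a throwaway directory so the sketch never touches a real .history
my $dir = tempdir( CLEANUP => 1 );
my $history_file = "$dir/.history";

# First run: no history file exists yet, so start with an empty hash
my $history = ( -e $history_file ) ? retrieve( $history_file ) : {};

my @queue = (
  "http://example.com/ep1.mp3",
  "http://example.com/ep2.mp3",
);

my @fetched;
foreach my $url ( @queue )
{
  next if ( exists $history->{$url} );
  push @fetched, $url;               # stand-in for the real download
  $history->{$url} = 1;
  store( $history, $history_file );  # save after every item
}

# Second run: reload the history from disk. The files themselves could
# have been moved or deleted; only the history decides what is skipped.
my $reloaded = retrieve( $history_file );
my @pending = grep { !exists $reloaded->{$_} }
              @queue, "http://example.com/ep3.mp3";

print "Fetched first time: ", scalar( @fetched ), "\n";
print "Still to fetch: @pending\n";
```

Saving the history after each item, as spc.pl does, means an interrupted run loses at most the download that was in progress.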

Running the Hack

On Windows machines, use ActivePerl (http://activestate.com/) or Cygwin’s Perl (http://www.cygwin.com/) to run this Perl script. Perl is included on Macintoshes, though you might need to download the LWP::Simple module from CPAN [Hack #7] if you haven’t installed it already.
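Before running spc.pl, it can help to confirm that the modules it loads are available. This is a small sketch of one way to check; the script and its variable names are illustrative and are not part of the hack itself.

```perl
#!/usr/bin/perl -w
use strict;

# Modules that spc.pl loads with "use"
my @needed = qw( Storable FileHandle LWP::Simple );

my %status;
foreach my $module ( @needed )
{
  # String eval so a missing module is reported instead of
  # stopping the script
  $status{$module} = eval( "require $module; 1" ) ? "ok" : "missing";
  print "$module: $status{$module}\n";
}
```

Any module reported as missing needs to be installed from CPAN before spc.pl will run.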

I recommend using this script in conjunction with [Hack #4] to take the downloaded files and import them into iTunes. That script can also update your iPod automatically after all the files have been imported.

Podcasting Hacks by Jack D. Herrington
