Chapter 15. A Web Spider in One Line
One day, someone on the IRC #perl channel was asking some confused questions. We finally managed to figure out that he was trying to write a web robot, or “spider,” in Perl. Which is a grand idea, except that:
Perfectly good spiders have already been written and are freely available at http://info.webcrawler.com/mak/projects/robots/robots.html.
A Perl-based web spider is probably not an ideal project for novice Perl programmers. They should work their way up to it.
Having said that, I immediately pictured a one-line Perl robot. It wouldn’t do much, but it would be amusing. After a few abortive attempts, I ended up with this monster, which requires Perl 5.005. I’ve split it onto separate lines for easier reading.
    perl -MLWP::UserAgent -MHTML::LinkExtor -MURI::URL -lwe '
        $ua = LWP::UserAgent->new;
        while (my $link = shift @ARGV) {
            print STDERR "working on $link";
            HTML::LinkExtor->new(
              sub {
                my ($t, %a) = @_;
                my @links = map { url($_, $link)->abs() }
                            grep { defined } @a{qw/href img/};
                print STDERR "+ $_" foreach @links;
                push @ARGV, @links;
              } ) -> parse(
               do {
                   my $r = $ua->simple_request
                     (HTTP::Request->new("GET", $link));
                   $r->content_type eq "text/html" ? $r->content : "";
               }
            )
        }' http://slinky.scrye.com/~tkil/
I actually edited this on a single line; I use shell-mode inside of Emacs, so it wasn’t that much of a terror. Here’s the one-line version.
perl -MLWP::UserAgent -MHTML::LinkExtor -MURI::URL -lwe '$ua = LWP::UserAgent->new; while (my $link = shift ...
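For readers who want to poke at the idea without hitting the network, here is a rough sketch of the same loop as an ordinary script. It is not the one-liner itself: the subroutine names extract_links and crawl, the %seen hash that skips pages already queued, the page limit, and the in-memory %pages table standing in for real HTTP fetches are all my own assumptions for illustration. It also reads the href and src attributes (src being the attribute that img tags carry) and uses URI->new_abs in place of url(...)->abs.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;
use URI;

# Return the absolute URLs found in the href/src attributes of $html,
# resolved against the string $base.
sub extract_links {
    my ($html, $base) = @_;
    my @found;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;    # tag name, then link attribute pairs
        push @found,
            map  { URI->new_abs($_, $base)->as_string }
            grep { defined } @attr{qw/href src/};
    });
    $parser->parse($html);
    $parser->eof;
    return @found;
}

# Queue-driven crawl. $fetch is a code ref that takes a URL and returns
# the page's HTML, or undef if there is nothing to parse (an assumption
# made here so the sketch can run offline).
sub crawl {
    my ($start, $fetch, $max_pages) = @_;
    my @queue = ($start);
    my %seen  = ($start => 1);    # not in the one-liner: avoid requeueing
    my @visited;
    while (defined(my $url = shift @queue)) {
        last if @visited >= $max_pages;
        push @visited, $url;
        my $html = $fetch->($url);
        next unless defined $html;
        for my $link (extract_links($html, $url)) {
            push @queue, $link unless $seen{$link}++;
        }
    }
    return @visited;
}

# Hypothetical pages, keyed by URL, used instead of real HTTP requests.
my %pages = (
    'http://www.example.com/' =>
        '<a href="a.html">A</a> <a href="b.html">B</a>',
    'http://www.example.com/a.html' =>
        '<a href="/">home</a> <img src="logo.gif">',
);

print "$_\n" for crawl('http://www.example.com/', sub { $pages{ $_[0] } }, 10);
```

To crawl for real, the fetch code ref could wrap an LWP::UserAgent request and return the content only when the content type is text/html, much as the one-liner does.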