Press Release: July 29, 2002
Fetching Web Pages, Parsing HTML, Writing Spiders, and More: O'Reilly Releases "Perl & LWP"
Sebastopol, CA--The Swiss Army Knife of programming languages, Perl turns up in diverse and sundry applications. Its flexibility makes it a favorite of coders, and with its multi-purpose modules--like the tools and gadgets on a pocketknife--there are very few tasks that Perl is not applied to. One of Perl's handiest and most practical tools is LWP (Library for WWW in Perl), a suite of modules for fetching and processing web pages. There is a wealth of information on the Web: news, weather, government info, shopping, discussion groups, product info, reviews, games, and other entertainment, and LWP can help automate access to all of it. In his book, "Perl & LWP" (O'Reilly, US $34.95), author Sean Burke shows how to use the powerful LWP library and its related HTML tools to build web client applications that automate tasks on the Web.
LWP is the most frequently downloaded Perl distribution in all of CPAN (Comprehensive Perl Archive Network). It enables programmers to write "spiders" to automatically fetch web pages, extract information from HTML pages, submit forms, and write homegrown servers. With LWP, programmers can dispense with graphical web browsers such as Netscape Navigator and interact with web servers directly, making it ideal for repetitive tasks that would be cumbersome to perform with a browser.
"As people deal more and more with the Web, there are more tasks that we routinely carry out over the Web that could be automated using LWP or the HTML-parsing modules," says Burke. "For example, I'm a fan of C-SPAN2's weekend programming, Book TV, but sometimes they'll have an interesting author on at 5 a.m. on Saturday morning, when I definitely would not be awake and flipping channels. If I want to catch these things, I have to program my VCR in advance. However, that means I have to remember to look at Book TV's web site on Friday night, and remembering is not one of my strong points. So, I wrote a simple LWP program that emails me the web page from the Book TV web site, and then I scheduled crontab to run that program every Friday afternoon. So, what used to be a matter of often missing really good programs is now convenient: I get an email message every Friday night, skim it for interesting authors or subjects, and program the VCR accordingly."
"Perl & LWP" includes many step-by-step examples that show readers how to apply the various techniques to their own needs. Programs to extract information from the web sites of BBC News, AltaVista, ABEBooks.com, and Weather Underground, as well as others, are explained in detail. The book also covers:
- Understanding LWP and its design
- Fetching and analyzing web pages
- Extracting information from HTML using regular expressions, tokens, and trees
- Setting and inspecting HTTP headers and response codes
- Accessing information that requires authentication or cookies
- Extracting links
- Cooperating with proxy caches
- Writing web spiders (a.k.a. robots) in a safe fashion
Says Burke, "Readers will realize that they can make their life simpler by using what they've learned in this book to write a few little LWP programs to automate two or three of their most common tasks that involve the Web. That needn't be something like getting TV listings off the Web; it could be a program that checks the server status page on a dozen different servers and shows them all on a single page, for the convenience of the server administrator."
Perl programmers who want to automate and mine the Web can pick up this book and be immediately productive. Written by a contributor to LWP, with a foreword by one of LWP's creators, "Perl & LWP" is the authoritative guide to this powerful and popular toolkit.
"Perl & LWP" is also available on Safari Books Online.
Visit O'Reilly's perl.com for news, articles, and other resources.
Chapter 7, "HTML Processing with Tokens," is available free online.
O’Reilly Media spreads the knowledge of innovators through its books, online services, magazines, and conferences. Since 1978, O’Reilly Media has been a chronicler and catalyst of cutting-edge development, homing in on the technology trends that really matter and spurring their adoption by amplifying “faint signals” from the alpha geeks who are creating the future. An active participant in the technology community, the company has a long history of advocacy, meme-making, and evangelism.