Perl soared to popularity as a language for creating and managing web content. Perl is equally adept at consuming information on the Web. Most web sites are created for people, but quite often you want to automate tasks that involve accessing a web site in a repetitive way. Such tasks could be as simple as saying “here’s a list of URLs; I want to be emailed if any of them stop working,” or they could involve more complex processing of any number of pages. This book is about using LWP (the Library for World Wide Web in Perl) and Perl to fetch and process web pages.
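To give a taste of the "tell me if any of these URLs stop working" task, here is a minimal sketch using LWP::UserAgent. The subroutine name `broken_urls` and the 15-second timeout are my own assumptions for illustration, and the sketch prints a report instead of sending email. It also relies on HEAD requests, which some servers refuse, so treat it as a starting point and not a finished monitor.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: return [url, status line] pairs for every URL
# that did not answer a HEAD request with a success status.
sub broken_urls {
    my ($ua, @urls) = @_;
    my @broken;
    for my $url (@urls) {
        my $response = $ua->head($url);
        push @broken, [ $url, $response->status_line ]
            unless $response->is_success;
    }
    return @broken;
}

# URLs to check are taken from the command line.
my $ua = LWP::UserAgent->new(timeout => 15);   # timeout value is an assumption
for my $bad (broken_urls($ua, @ARGV)) {
    print "$bad->[0]: $bad->[1]\n";
}
```

Run it as, say, `perl checkurls.pl http://example.com/ http://example.org/` (the script name is hypothetical); it prints one line per URL that failed and nothing if all of them respond.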
For example, if you want to compare the prices of all O’Reilly books on Amazon.com and bn.com, you could look at each page yourself and keep track of the prices. Or you could write an LWP program to fetch the product pages, extract the prices, and generate a report. O’Reilly has a lot of books in print, and after reading this one, you’ll be able to write and run the program much more quickly than you could visit every catalog page.
Consider also a situation in which a particular page has links to several dozen files (images, music, and so on) that you want to download. You could download each individually, by monotonously selecting each link in your browser and choosing Save as..., or you could dash off a short LWP program that scans for URLs in that page and downloads each, unattended.
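As a rough sketch of that kind of program, the following uses HTML::LinkExtor to collect links, resolves them against the page's own URL, and saves each one with LWP::Simple's `getstore`. The subroutine name `extract_links`, the list of file extensions, and the naive choice of filename (the last path segment) are all assumptions made for illustration; a real program would want to be more careful about filenames and about which links it follows.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get getstore is_success);
use HTML::LinkExtor;

# Hypothetical helper: return the absolute URLs of all <a href> and
# <img src> links found in $html, resolved against $base.
sub extract_links {
    my ($html, $base) = @_;
    my @urls;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @urls, $attr{href} if $tag eq 'a'   && defined $attr{href};
        push @urls, $attr{src}  if $tag eq 'img' && defined $attr{src};
    }, $base);
    $parser->parse($html);
    $parser->eof;
    return map { "$_" } @urls;   # stringify the URI objects
}

# Only touch the network when a page URL is given on the command line.
if (my $page = shift @ARGV) {
    my $html = get($page);
    die "Couldn't fetch $page\n" unless defined $html;
    for my $url (extract_links($html, $page)) {
        # Assumed filter: keep only a few media file types.
        next unless $url =~ m{/([^/]+\.(?:jpe?g|png|gif|mp3))$}i;
        my $file   = $1;                      # last path segment as filename
        my $status = getstore($url, $file);
        print is_success($status) ? "saved $file\n"
                                  : "failed ($status): $url\n";
    }
}
```

Passing the base URL to HTML::LinkExtor is what turns relative links such as `song.mp3` into full URLs that `getstore` can fetch.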
Besides extracting data from web pages, you can also automate submitting data through web forms. Whether this is a matter of uploading 50 image files through your company’s intranet interface, or searching the local library’s online card catalog every week for any new books with “Navajo” in the title, it’s worth the time and peace of mind to automate repetitive processes by writing LWP programs to submit data into forms and scan the resulting data.
Audience for This Book
This book is aimed at someone who already knows Perl and HTML, but I don’t assume you’re an expert at either. I give quick refreshers on some of the quirkier aspects of HTML (e.g., forms), but in general, I assume you know what each of the HTML tags means. If you know basic regular expressions and are familiar with references and maybe even objects, you have all the Perl skills you need to use this book.
If you’re new to Perl, consider reading Learning Perl (O’Reilly) and maybe also The Perl Cookbook (O’Reilly). If your HTML is shaky, try the HTML Pocket Reference or HTML: The Definitive Guide (O’Reilly). If you don’t feel comfortable using objects in Perl, reading Appendix G in this book should be enough to bring you up to speed.
Structure of This Book
The book is divided into 12 chapters and 7 appendixes, as follows:
Chapter 1 covers in general terms what LWP does, the alternatives to using LWP, and when you shouldn’t use LWP.
Chapter 2 explains how the Web works and some easy-to-use yet limited functions for accessing it.
Chapter 3 covers the more powerful interface to the Web.
Chapter 4 shows how to parse URLs with the URI class, and how to convert between relative and absolute URLs.
Chapter 5 describes how to submit GET and POST forms.
Chapter 6 shows how to extract information from HTML using regular expressions.
Chapter 7 provides an alternative approach to extracting data from HTML using the HTML::TokeParser module.
Chapter 8 is a case study of data extraction using tokens.
Chapter 9 shows how to extract data from HTML using the HTML::TreeBuilder module.
Chapter 10 covers the use of HTML::TreeBuilder to modify HTML files.
Chapter 11 deals with the tougher parts of requests.
Chapter 12 explores the technological issues involved in automating the download of more than one page from a site.
Appendix A is a complete list of the LWP modules.
Appendix B is a list of HTTP codes, what they mean, and whether LWP considers them error or success.
Appendix C contains the most common MIME types and what they mean.
Appendix D lists the most common language tags and their meanings (e.g., “zh-cn” means Mainland Chinese, while “sv” is Swedish).
Appendix E is a list of the most common character encodings (character sets) and the tags that identify them.
Appendix F is a table to help you make sense of the most common Unicode characters. It shows each character, its numeric code (in decimal, octal, and hex), and any HTML escapes there may be for it.
Appendix G is an introduction to the use of Perl’s object-oriented programming features.
Order of Chapters
The chapters in this book are arranged so that if you read them in order, you will face a minimum of cases where I have to say “you won’t understand this part of the code, because we won’t cover that topic until two chapters later.” However, only some of what each chapter introduces is used in later chapters. For example, Chapter 3 lists all sorts of LWP methods that you are likely to use eventually, but the typical task will use only a few of those, and only a few will show up in later chapters. In cases where you can’t infer the meaning of a method from its name, you can always refer back to the earlier chapters or use perldoc to see the applicable module’s online documentation.
Important Standards Documents
The basic protocols and data formats of the Web are specified in a number of Internet RFCs. The most important are:
- RFC 2616: HTTP 1.1
- RFC 2965: HTTP Cookies Specification
- RFC 2617: HTTP Authentication: Basic and Digest Access Authentication
- RFC 2396: Uniform Resource Identifiers: Generic Syntax
- HTML 4.01 specification
- HTML 4.01 Forms specification
- Character sets
- Country codes
- Unicode specifications
- RFC 2279: Encoding Unicode as UTF-8
- Request For Comments documents
- IANA protocol assignments
Conventions Used in This Book
The following typographic conventions are used in this book:
Italic
Used for file and directory names, email addresses, and URLs, as well as for new terms where they are defined.
Constant width
Used for code listings and for keywords, variables, function names, command options, parameters, and bits of HTML source where they appear in the text.
Constant width bold
Used to highlight key fragments of larger code examples, or to show the output of a piece of code.
Constant width italic
Used as a general placeholder to indicate terms that should be replaced by actual values in your own programs.
Comments & Questions
Please address comments and questions concerning this book to the publisher:
O’Reilly & Associates, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
(800) 998-9938 (in the United States or Canada)
(707) 829-0515 (international/local)
(707) 829-0104 (fax)
There is a web page for this book, which lists errata, examples, or any additional information. You can access this page at:
To comment or ask technical questions about this book, send email to:
For more information about books, conferences, Resource Centers, and the O’Reilly Network, see the O’Reilly web site at:
Acknowledgments
It takes a mere village to raise a puny human child, but it took a whole globe-girdling Perl cabal to get this book done! These are the readers who, as a personal favor to me, took the time to read and greatly improve my first sketchy manuscript, each in their own particular, helpful, and careful ways: Gisle Aas, David H. Adler, Tim Allwine, Elaine Ashton, Gene Boggs, Gavin Estey, Scott Francis, Joe Johnston, Kevin Healy, Conrad Heiney, David Huggins-Daines, Samy Kamkar, Joe Kline, Yossef Mendelssohn, Abhijit Menon-Sen, Brad Murray, David Ondrik, Clinton Pierce, Robert Spier, Andrew Stanley, Dennis Taylor, Martin Thurn, and Glenn Wood.
I’m also especially thankful to Elaine Ashton for doing a last-minute review not just of this manuscript’s prose, but of all the code blocks. If not for her eagle eye, you’d be scratching your head over variables and subroutines magically renaming themselves all over the place!
I am grateful to Conrad Heiney for suggesting the California Department of Motor Vehicles as an example for Chapter 5. Thanks also to Mark-Jason Dominus for suggesting the ABEBooks web site as an example in that same chapter. Many thanks to Gisle Aas, Michael A. Chase, and Martijn Koster for making LWP such a reliable and indispensable addition to every programmer’s toolkit.
And last but not least, thanks to the people at O’Reilly who intrepidly pushed for this book to get done when I really just wanted to stay in bed and play Tetris. The chief author-wrangler is my editor, Nat Torkington, but I’m much obliged also to the many other under-appreciated O’Reilly people who conspired to get this book from my hands to yours: Jon Orwant (of Perl Journal fame even before he got to O’Reilly), Neil Walls (who slaved over Appendix F so you can see what a ⊥ looks like!), sage editor Linda Mui, Betsy Waliszewski in marketing, and in the production department, Linley Dolby, the book’s production editor and copyeditor, and Rob Romano, the book’s illustrator.