
Perl & LWP

Name: Perl & LWP
Author: Sean M. Burke
ISBN: 9780596001780


June 2002

Intermediate to advanced

260 pages

6h 51m

English

O'Reilly Media, Inc.


A Note Regarding Supplemental Files
Foreword
Preface
    Audience for This Book
    Structure of This Book
    Order of Chapters
    Important Standards Documents
    Conventions Used in This Book
    Comments & Questions
    Acknowledgments
1. Introduction to Web Automation
    1.1. The Web as Data Source
    1.1.1. Screen Scraping
    1.1.2. Brittleness
    1.1.3. Web Services
    1.2. History of LWP
    1.3. Installing LWP
    1.3.1. Installing LWP from the CPAN Shell
    1.3.1.1. Configuring
    1.3.1.2. Obtaining help
    1.3.1.3. Installing LWP
    1.3.2. Installing LWP Manually
    1.3.2.1. Download distributions
    1.3.2.2. Unpack and configure
    1.3.2.3. Make, test, and install
    1.4. Words of Caution
    1.4.1. Network and Server Load
    1.4.2. Copyright
    1.4.3. Acceptable Use
    1.5. LWP in Action
    1.5.1. The Object-Oriented Interface
    1.5.2. Forms
    1.5.3. Parsing HTML
    1.5.4. Authentication
2. Web Basics
    2.1. URLs
    2.2. An HTTP Transaction
    2.2.1. Request
    2.2.2. Response
    2.3. LWP::Simple
    2.3.1. Basic Document Fetch
    2.3.2. Fetch and Store
    2.3.3. Fetch and Print
    2.3.4. Previewing with HEAD
    2.4. Fetching Documents Without LWP::Simple
    2.5. Example: AltaVista
    2.6. HTTP POST
    2.7. Example: Babelfish
3. The LWP Class Model
    3.1. The Basic Classes
    3.2. Programming with LWP Classes
    3.3. Inside the do_GET and do_POST Functions
    3.4. User Agents
    3.4.1. Connection Parameters
    3.4.2. Request Parameters
    3.4.3. Protocols
    3.4.4. Redirection
    3.4.5. Authentication
    3.4.6. Proxies
    3.4.7. Request Methods
    3.4.7.1. Saving response content to a file
    3.4.7.2. Sending response content to a callback
    3.4.7.3. Mirroring a URL to a file
    3.4.8. Advanced Methods
    3.5. HTTP::Response Objects
    3.5.1. Status Line
    3.5.2. Content
    3.5.3. Headers
    3.5.4. Expiration Times
    3.5.5. Base for Relative URLs
    3.5.6. Debugging
    3.6. LWP Classes: Behind the Scenes
4. URLs
    4.1. Parsing URLs
    4.1.1. Constructors
    4.1.2. Output
    4.1.3. Comparison
    4.1.4. Components of a URL
    4.1.5. Queries
    4.2. Relative URLs
    4.3. Converting Absolute URLs to Relative
    4.4. Converting Relative URLs to Absolute
5. Forms
    5.1. Elements of an HTML Form
    5.2. LWP and GET Requests
    5.2.1. GETting Fixed URLs
    5.2.2. GETting a query_form( ) URL
    5.3. Automating Form Analysis
    5.4. Idiosyncrasies of HTML Forms
    5.4.1. Hidden Elements
    5.4.2. Text Elements
    5.4.3. Password Elements
    5.4.4. Checkboxes
    5.4.5. Radio Buttons
    5.4.6. Submit Buttons
    5.4.7. Image Buttons
    5.4.8. Reset Buttons
    5.4.9. File Selection Elements
    5.4.10. Textarea Elements
    5.4.11. Select Elements and Option Elements
    5.5. POST Example: License Plates
    5.5.1. The Form
    5.5.2. Use formpairs.pl
    5.5.3. Translating This into LWP
    5.6. POST Example: ABEBooks.com
    5.6.1. The Form
    5.6.2. Translating This into LWP
    5.6.3. Adding Features
    5.6.4. Generalizing the Program
    5.7. File Uploads
    5.8. Limits on Forms
6. Simple HTML Processing with Regular Expressions
    6.1. Automating Data Extraction
    6.2. Regular Expression Techniques
    6.2.1. Anchor Your Match
    6.2.2. Whitespace
    6.2.3. Embedded Newlines
    6.2.4. Minimal and Greedy Matches
    6.2.5. Capture
    6.2.6. Repeated Matches
    6.2.7. Develop from Components
    6.2.8. Use Multiple Steps
    6.3. Troubleshooting
    6.4. When Regular Expressions Aren’t Enough
    6.5. Example: Extracting Links from a Bookmark File
    6.6. Example: Extracting Links from Arbitrary HTML
    6.7. Example: Extracting Temperatures from Weather Underground
7. HTML Processing with Tokens
    7.1. HTML as Tokens
    7.2. Basic HTML::TokeParser Use
    7.2.1. Start-Tag Tokens
    7.2.2. End-Tag Tokens
    7.2.3. Text Tokens
    7.2.4. Comment Tokens
    7.2.5. Markup Declaration Tokens
    7.2.6. Processing Instruction Tokens
    7.3. Individual Tokens
    7.3.1. Checking Image Tags
    7.3.2. HTML Filters
    7.4. Token Sequences
    7.4.1. Example: BBC Headlines
    7.4.2. Translating the Problem into Code
    7.4.3. Bundling into a Program
    7.5. More HTML::TokeParser Methods
    7.5.1. The get_text( ) Method
    7.5.2. The get_text( ) Method with Parameters
    7.5.3. The get_trimmed_text( ) Method
    7.5.4. The get_tag( ) Method
    7.5.4.1. Start-tags
    7.5.4.2. End-tags
    7.5.5. The get_tag( ) Method with Parameters
    7.6. Using Extracted Text
8. Tokenizing Walkthrough
    8.1. The Problem
    8.2. Getting the Data
    8.3. Inspecting the HTML
    8.4. First Code
    8.5. Narrowing In
    8.6. Rewrite for Features
    8.6.1. Debuggability
    8.6.2. Images and Applets
    8.6.3. Link Text
    8.6.4. Live Data
    8.7. Alternatives
9. HTML Processing with Trees
    9.1. Introduction to Trees
    9.2. HTML::TreeBuilder
    9.2.1. Constructors
    9.2.2. Parse Options
    9.2.3. Parsing
    9.2.4. Cleanup
    9.3. Processing
    9.3.1. Methods for Searching the Tree
    9.3.2. Attributes of a Node
    9.3.3. Traversing
    9.4. Example: BBC News
    9.5. Example: Fresh Air
10. Modifying HTML with Trees
    10.1. Changing Attributes
    10.1.1. Whitespace
    10.1.2. Other HTML Options
    10.2. Deleting Images
    10.3. Detaching and Reattaching
    10.3.1. The detach_content( ) Method
    10.3.2. Constraints
    10.4. Attaching in Another Tree
    10.4.1. Retaining Comments
    10.4.2. Accessing Comments
    10.4.3. Attaching Content
    10.5. Creating New Elements
    10.5.1. Literals
    10.5.2. New Nodes from Lists
11. Cookies, Authentication, and Advanced Requests
    11.1. Cookies
    11.1.1. Enabling Cookies
    11.1.2. Loading Cookies from a File
    11.1.3. Saving Cookies to a File
    11.1.4. Cookies and the New York Times Site
    11.2. Adding Extra Request Header Lines
    11.2.1. Pretending to Be Netscape
    11.2.2. Referer
    11.3. Authentication
    11.3.1. Comparing Cookies with Basic Authentication
    11.3.2. Authenticating via LWP
    11.3.3. Security
    11.4. An HTTP Authentication Example: The Unicode Mailing Archive
12. Spiders
    12.1. Types of Web-Querying Programs
    12.2. A User Agent for Robots
    12.3. Example: A Link-Checking Spider
    12.3.1. The Basic Spider Logic
    12.3.2. Overall Design in the Spider
    12.3.3. HEAD Response Processing
    12.3.4. Redirects
    12.3.5. Link Extraction
    12.3.6. Fleshing Out the URL Scheduling
    12.3.7. The Rest of the Code
    12.4. Ideas for Further Expansion
A. LWP Modules
B. HTTP Status Codes
    B.1. 100s: Informational
    B.2. 200s: Successful
    B.3. 300s: Redirection
    B.4. 400s: Client Errors
    B.5. 500s: Server Errors
C. Common MIME Types
D. Language Tags
E. Common Content Encodings
F. ASCII Table
G. User’s View of Object-Oriented Modules
    G.1. A User’s View of Object-Oriented Modules
    G.2. Modules and Their Functional Interfaces
    G.3. Modules with Object-Oriented Interfaces
    G.4. What Can You Do with Objects?
    G.5. What’s in an Object?
    G.6. What Is an Object Value?
    G.7. So Why Do Some Modules Use Objects?
    G.8. The Gory Details
Index
Colophon
Copyright

Content preview from Perl & LWP

Preface

Perl soared to popularity as a language for creating and managing web content. Perl is equally adept at consuming information on the Web. Most web sites are created for people, but quite often you want to automate tasks that involve accessing a web site in a repetitive way. Such tasks could be as simple as saying “here’s a list of URLs; I want to be emailed if any of them stop working,” or they could involve more complex processing of any number of pages. This book is about using LWP (the Library for World Wide Web in Perl) and Perl to fetch and process web pages.

For example, if you want to compare the prices of all O’Reilly books on Amazon.com and bn.com, you could look at each page yourself and keep track of the prices. Or you could write an LWP program to fetch the product pages, extract the prices, and generate a report. O’Reilly has a lot of books in print, and after reading this one, you’ll be able to write and run the program much more quickly than you could visit every catalog page.

Consider also a situation in which a particular page has links to several dozen files (images, music, and so on) that you want to download. You could download each individually, by monotonously selecting each link in your browser and choosing Save as..., or you could dash off a short LWP program that scans for URLs in that page and downloads each, unattended.

Besides extracting data from web pages, you can also automate submitting data through web forms. Whether this is a matter of uploading ...


Publisher Resources

ISBN: 0596001789
