Perl & LWP

Name: Perl & LWP
Author: Sean M. Burke
ISBN: 9780596001780

June 2002

Intermediate to advanced

260 pages

6h 51m

English

O'Reilly Media, Inc.

A Note Regarding Supplemental Files
Foreword
Preface
Audience for This Book
Structure of This Book
Order of Chapters
Important Standards Documents
Conventions Used in This Book
Comments & Questions
Acknowledgments
1. Introduction to Web Automation
1.1. The Web as Data Source
1.1.1. Screen Scraping
1.1.2. Brittleness
1.1.3. Web Services
1.2. History of LWP
1.3. Installing LWP
1.3.1. Installing LWP from the CPAN Shell
1.3.1.1. Configuring
1.3.1.2. Obtaining help
1.3.1.3. Installing LWP
1.3.2. Installing LWP Manually
1.3.2.1. Download distributions
1.3.2.2. Unpack and configure
1.3.2.3. Make, test, and install
1.4. Words of Caution
1.4.1. Network and Server Load
1.4.2. Copyright
1.4.3. Acceptable Use
1.5. LWP in Action
1.5.1. The Object-Oriented Interface
1.5.2. Forms
1.5.3. Parsing HTML
1.5.4. Authentication
2. Web Basics
2.1. URLs
2.2. An HTTP Transaction
2.2.1. Request
2.2.2. Response
2.3. LWP::Simple
2.3.1. Basic Document Fetch
2.3.2. Fetch and Store
2.3.3. Fetch and Print
2.3.4. Previewing with HEAD
2.4. Fetching Documents Without LWP::Simple
2.5. Example: AltaVista
2.6. HTTP POST
2.7. Example: Babelfish
3. The LWP Class Model
3.1. The Basic Classes
3.2. Programming with LWP Classes
3.3. Inside the do_GET and do_POST Functions
3.4. User Agents
3.4.1. Connection Parameters
3.4.2. Request Parameters
3.4.3. Protocols
3.4.4. Redirection
3.4.5. Authentication
3.4.6. Proxies
3.4.7. Request Methods
3.4.7.1. Saving response content to a file
3.4.7.2. Sending response content to a callback
3.4.7.3. Mirroring a URL to a file
3.4.8. Advanced Methods
3.5. HTTP::Response Objects
3.5.1. Status Line
3.5.2. Content
3.5.3. Headers
3.5.4. Expiration Times
3.5.5. Base for Relative URLs
3.5.6. Debugging
3.6. LWP Classes: Behind the Scenes
4. URLs
4.1. Parsing URLs
4.1.1. Constructors
4.1.2. Output
4.1.3. Comparison
4.1.4. Components of a URL
4.1.5. Queries
4.2. Relative URLs
4.3. Converting Absolute URLs to Relative
4.4. Converting Relative URLs to Absolute
5. Forms
5.1. Elements of an HTML Form
5.2. LWP and GET Requests
5.2.1. GETting Fixed URLs
5.2.2. GETting a query_form( ) URL
5.3. Automating Form Analysis
5.4. Idiosyncrasies of HTML Forms
5.4.1. Hidden Elements
5.4.2. Text Elements
5.4.3. Password Elements
5.4.4. Checkboxes
5.4.5. Radio Buttons
5.4.6. Submit Buttons
5.4.7. Image Buttons
5.4.8. Reset Buttons
5.4.9. File Selection Elements
5.4.10. Textarea Elements
5.4.11. Select Elements and Option Elements
5.5. POST Example: License Plates
5.5.1. The Form
5.5.2. Use formpairs.pl
5.5.3. Translating This into LWP
5.6. POST Example: ABEBooks.com
5.6.1. The Form
5.6.2. Translating This into LWP
5.6.3. Adding Features
5.6.4. Generalizing the Program
5.7. File Uploads
5.8. Limits on Forms
6. Simple HTML Processing with Regular Expressions
6.1. Automating Data Extraction
6.2. Regular Expression Techniques
6.2.1. Anchor Your Match
6.2.2. Whitespace
6.2.3. Embedded Newlines
6.2.4. Minimal and Greedy Matches
6.2.5. Capture
6.2.6. Repeated Matches
6.2.7. Develop from Components
6.2.8. Use Multiple Steps
6.3. Troubleshooting
6.4. When Regular Expressions Aren’t Enough
6.5. Example: Extracting Links from a Bookmark File
6.6. Example: Extracting Links from Arbitrary HTML
6.7. Example: Extracting Temperatures from Weather Underground
7. HTML Processing with Tokens
7.1. HTML as Tokens
7.2. Basic HTML::TokeParser Use
7.2.1. Start-Tag Tokens
7.2.2. End-Tag Tokens
7.2.3. Text Tokens
7.2.4. Comment Tokens
7.2.5. Markup Declaration Tokens
7.2.6. Processing Instruction Tokens
7.3. Individual Tokens
7.3.1. Checking Image Tags
7.3.2. HTML Filters
7.4. Token Sequences
7.4.1. Example: BBC Headlines
7.4.2. Translating the Problem into Code
7.4.3. Bundling into a Program
7.5. More HTML::TokeParser Methods
7.5.1. The get_text( ) Method
7.5.2. The get_text( ) Method with Parameters
7.5.3. The get_trimmed_text( ) Method
7.5.4. The get_tag( ) Method
7.5.4.1. Start-tags
7.5.4.2. End-tags
7.5.5. The get_tag( ) Method with Parameters
7.6. Using Extracted Text
8. Tokenizing Walkthrough
8.1. The Problem
8.2. Getting the Data
8.3. Inspecting the HTML
8.4. First Code
8.5. Narrowing In
8.6. Rewrite for Features
8.6.1. Debuggability
8.6.2. Images and Applets
8.6.3. Link Text
8.6.4. Live Data
8.7. Alternatives
9. HTML Processing with Trees
9.1. Introduction to Trees
9.2. HTML::TreeBuilder
9.2.1. Constructors
9.2.2. Parse Options
9.2.3. Parsing
9.2.4. Cleanup
9.3. Processing
9.3.1. Methods for Searching the Tree
9.3.2. Attributes of a Node
9.3.3. Traversing
9.4. Example: BBC News
9.5. Example: Fresh Air
10. Modifying HTML with Trees
10.1. Changing Attributes
10.1.1. Whitespace
10.1.2. Other HTML Options
10.2. Deleting Images
10.3. Detaching and Reattaching
10.3.1. The detach_content( ) Method
10.3.2. Constraints
10.4. Attaching in Another Tree
10.4.1. Retaining Comments
10.4.2. Accessing Comments
10.4.3. Attaching Content
10.5. Creating New Elements
10.5.1. Literals
10.5.2. New Nodes from Lists
11. Cookies, Authentication, and Advanced Requests
11.1. Cookies
11.1.1. Enabling Cookies
11.1.2. Loading Cookies from a File
11.1.3. Saving Cookies to a File
11.1.4. Cookies and the New York Times Site
11.2. Adding Extra Request Header Lines
11.2.1. Pretending to Be Netscape
11.2.2. Referer
11.3. Authentication
11.3.1. Comparing Cookies with Basic Authentication
11.3.2. Authenticating via LWP
11.3.3. Security
11.4. An HTTP Authentication Example: The Unicode Mailing Archive
12. Spiders
12.1. Types of Web-Querying Programs
12.2. A User Agent for Robots
12.3. Example: A Link-Checking Spider
12.3.1. The Basic Spider Logic
12.3.2. Overall Design in the Spider
12.3.3. HEAD Response Processing
12.3.4. Redirects
12.3.5. Link Extraction
12.3.6. Fleshing Out the URL Scheduling
12.3.7. The Rest of the Code
12.4. Ideas for Further Expansion
A. LWP Modules
B. HTTP Status Codes
B.1. 100s: Informational
B.2. 200s: Successful
B.3. 300s: Redirection
B.4. 400s: Client Errors
B.5. 500s: Server Errors
C. Common MIME Types
D. Language Tags
E. Common Content Encodings
F. ASCII Table
G. User’s View of Object-Oriented Modules
G.1. A User’s View of Object-Oriented Modules
G.2. Modules and Their Functional Interfaces
G.3. Modules with Object-Oriented Interfaces
G.4. What Can You Do with Objects?
G.5. What’s in an Object?
G.6. What Is an Object Value?
G.7. So Why Do Some Modules Use Objects?
G.8. The Gory Details
Index
Colophon
Copyright

Content preview from Perl & LWP

Chapter 8. Tokenizing Walkthrough

So far, I’ve been showing examples of data in a particular format, then presenting code for extracting the data out of that format, as an illustration of newly introduced HTML::TokeParser methods. But in real life, you do not proceed tidily from the problem to an immediate and fully formed solution. Ideally, the task of data extraction is simple: identify patterns surrounding the data you’re after, then write a program that matches those patterns and extracts the embedded data.

In practice, however, you write programs bit by bit and in fits and starts. With data extraction specifically, this involves a good amount of trying one pattern, finding that it matches too narrowly or too broadly, trying to amend it, possibly having to backtrack and try another pattern, and so on. Moreover, even equally effective patterns are not equal: some patterns are easier to capture in code than others, and some patterns are more temporary than others.

In this section, I’ll try to make these points by walking through the implementation of a data extraction task, with all alternatives considered, and even a misstep or two.

The Problem

As a starting point, consider the task of harvesting a month’s worth of listings and corresponding RealAudio URLs from the web site of the National Public Radio program Fresh Air, at http://freshair.npr.org. Fresh Air is on NPR stations each weekday, and on every show, different guests are interviewed. The show’s web site lists which ...

Publisher Resources

ISBN: 0596001789