Crawlers and Crawling
Web crawlers are robots that recursively traverse information webs, fetching first one web page, then all the web pages to which that page points, then all the web pages to which those pages point, and so on. When a robot recursively follows web links, it is called a crawler or a spider because it “crawls” along the web created by HTML hyperlinks.
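To make that traversal concrete, here is a minimal sketch of such a crawler in Python. The seed URL, the page limit, and the use of the standard library's urllib and html.parser modules are illustrative assumptions rather than anything prescribed by the text, and the sketch omits the politeness rules, aliasing checks, and robust loop detection a real spider needs.

    # A toy breadth-first crawler: fetch a page, extract its links,
    # then fetch the pages those links point to, and so on.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href values of <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_url, max_pages=50):
        frontier = deque([seed_url])    # pages waiting to be fetched
        visited = set()                 # pages already fetched (loop avoidance)
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue                # skip pages that cannot be fetched
            visited.add(url)
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)      # resolve relative links
                if absolute.startswith("http"):    # skip mailto:, javascript:, etc.
                    frontier.append(absolute)
        return visited

    crawl("http://example.com/")        # hypothetical starting page

Each fetched page adds new links to the frontier, so the crawl spreads outward from the seed exactly as described above, stopping here only when the page limit is reached.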
Internet search engines use crawlers to wander about the Web and pull back all the documents they encounter. These documents are then processed to create a searchable database, allowing users to find documents that contain particular words. With billions of web pages out there to find and bring back, these search-engine spiders are necessarily among the most sophisticated robots. Let’s look in more detail at how crawlers work.
Where to Start: The “Root Set”
Before you can unleash your hungry crawler, you need to give it a starting point. The initial set of URLs that a crawler starts visiting is referred to as the root set. When picking a root set, you should choose URLs from enough different places that crawling all the links will eventually get you to most of the web pages that interest you.
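Continuing the sketch above, a root set can be represented as nothing more than a list of seed URLs drawn from unrelated parts of the web, each fed to the crawler in turn; the URLs below are placeholders, and crawl() is the hypothetical function from the earlier sketch.

    # A hypothetical root set: seeds from several unrelated sites, so that
    # following links outward reaches pages no single seed links to.
    root_set = [
        "http://example.com/",        # placeholder URLs, not real seeds
        "http://example.org/news/",
        "http://example.net/catalog/",
    ]

    visited = set()
    for seed in root_set:
        visited |= crawl(seed)        # crawl() as sketched earlier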
What’s a good root set to use for crawling the web in Figure 9-1? As in the real Web, there is no single document that eventually links to every document. If you start with document A in Figure 9-1, you can get to B, C, and D, then to E and F, then to J, and then to K. But there’s no chain of links from A to G or from A to ...