HTTP: The Definitive Guide

by David Gourley, Brian Totty, Marjorie Sayer, Anshu Aggarwal, Sailu Reddy

September 2002

Intermediate to advanced

656 pages

22h 14m

English

O'Reilly Media, Inc.

HTTP: The Definitive Guide
Preface
Running Example: Joe’s Hardware Store
Chapter-by-Chapter Guide
Typographic Conventions
Comments and Questions
Acknowledgments
I. HTTP: The Web’s Foundation
1. Overview of HTTP
1.1. HTTP: The Internet’s Multimedia Courier

1.2. Web Clients and Servers
1.3. Resources
1.3.1. Media Types
1.3.2. URIs
1.3.3. URLs
1.3.4. URNs
1.4. Transactions
1.4.1. Methods
1.4.2. Status Codes
1.4.3. Web Pages Can Consist of Multiple Objects
1.5. Messages
1.5.1. Simple Message Example
1.6. Connections
1.6.1. TCP/IP
1.6.2. Connections, IP Addresses, and Port Numbers
1.6.3. A Real Example Using Telnet
1.7. Protocol Versions
1.8. Architectural Components of the Web
1.8.1. Proxies
1.8.2. Caches
1.8.3. Gateways
1.8.4. Tunnels
1.8.5. Agents
1.9. The End of the Beginning
1.10. For More Information
1.10.1. HTTP Protocol Information
1.10.2. Historical Perspective
1.10.3. Other World Wide Web Information
2. URLs and Resources
2.1. Navigating the Internet’s Resources
2.1.1. The Dark Days Before URLs
2.2. URL Syntax
2.2.1. Schemes: What Protocol to Use
2.2.2. Hosts and Ports
2.2.3. Usernames and Passwords
2.2.4. Paths
2.2.5. Parameters
2.2.6. Query Strings
2.2.7. Fragments
2.3. URL Shortcuts
2.3.1. Relative URLs
2.3.1.1. Base URLs
2.3.1.2. Resolving relative references
2.3.2. Expandomatic URLs
2.4. Shady Characters
2.4.1. The URL Character Set
2.4.2. Encoding Mechanisms
2.4.3. Character Restrictions
2.4.4. A Bit More
2.5. A Sea of Schemes
2.6. The Future
2.6.1. If Not Now, When?
2.7. For More Information
3. HTTP Messages
3.1. The Flow of Messages
3.1.1. Messages Commute Inbound to the Origin Server
3.1.2. Messages Flow Downstream
3.2. The Parts of a Message
3.2.1. Message Syntax
3.2.2. Start Lines
3.2.2.1. Request line
3.2.2.2. Response line
3.2.2.3. Methods
3.2.2.4. Status codes
3.2.2.5. Reason phrases
3.2.2.6. Version numbers
3.2.3. Headers
3.2.3.1. Header classifications
3.2.3.2. Header continuation lines
3.2.4. Entity Bodies
3.2.5. Version 0.9 Messages
3.3. Methods
3.3.1. Safe Methods
3.3.2. GET
3.3.3. HEAD
3.3.4. PUT
3.3.5. POST
3.3.6. TRACE
3.3.7. OPTIONS
3.3.8. DELETE
3.3.9. Extension Methods
3.4. Status Codes
3.4.1. 100-199: Informational Status Codes
3.4.1.1. Clients and 100 Continue
3.4.1.2. Servers and 100 Continue
3.4.1.3. Proxies and 100 Continue
3.4.2. 200-299: Success Status Codes
3.4.3. 300-399: Redirection Status Codes
3.4.4. 400-499: Client Error Status Codes
3.4.5. 500-599: Server Error Status Codes
3.5. Headers
3.5.1. General Headers
3.5.1.1. General caching headers
3.5.2. Request Headers
3.5.2.1. Accept headers
3.5.2.2. Conditional request headers
3.5.2.3. Request security headers
3.5.2.4. Proxy request headers
3.5.3. Response Headers
3.5.3.1. Negotiation headers
3.5.3.2. Response security headers
3.5.4. Entity Headers
3.5.4.1. Content headers
3.5.4.2. Entity caching headers
3.6. For More Information
4. Connection Management
4.1. TCP Connections
4.1.1. TCP Reliable Data Pipes
4.1.2. TCP Streams Are Segmented and Shipped by IP Packets
4.1.3. Keeping TCP Connections Straight
4.1.4. Programming with TCP Sockets
4.2. TCP Performance Considerations
4.2.1. HTTP Transaction Delays
4.2.2. Performance Focus Areas
4.2.3. TCP Connection Handshake Delays
4.2.4. Delayed Acknowledgments
4.2.5. TCP Slow Start
4.2.6. Nagle’s Algorithm and TCP_NODELAY
4.2.7. TIME_WAIT Accumulation and Port Exhaustion
4.3. HTTP Connection Handling
4.3.1. The Oft-Misunderstood Connection Header
4.3.2. Serial Transaction Delays
4.4. Parallel Connections
4.4.1. Parallel Connections May Make Pages Load Faster
4.4.2. Parallel Connections Are Not Always Faster
4.4.3. Parallel Connections May “Feel” Faster
4.5. Persistent Connections
4.5.1. Persistent Versus Parallel Connections
4.5.2. HTTP/1.0+ Keep-Alive Connections
4.5.3. Keep-Alive Operation
4.5.4. Keep-Alive Options
4.5.5. Keep-Alive Connection Restrictions and Rules
4.5.6. Keep-Alive and Dumb Proxies
4.5.6.1. The Connection header and blind relays
4.5.6.2. Proxies and hop-by-hop headers
4.5.7. The Proxy-Connection Hack
4.5.8. HTTP/1.1 Persistent Connections
4.5.9. Persistent Connection Restrictions and Rules
4.6. Pipelined Connections
4.7. The Mysteries of Connection Close
4.7.1. “At Will” Disconnection
4.7.2. Content-Length and Truncation
4.7.3. Connection Close Tolerance, Retries, and Idempotency
4.7.4. Graceful Connection Close
4.7.4.1. Full and half closes
4.7.4.2. TCP close and reset errors
4.7.4.3. Graceful close
4.8. For More Information
4.8.1. HTTP Connections
4.8.2. HTTP Performance Issues
4.8.3. TCP/IP
II. HTTP Architecture
5. Web Servers
5.1. Web Servers Come in All Shapes and Sizes
5.1.1. Web Server Implementations
5.1.2. General-Purpose Software Web Servers
5.1.3. Web Server Appliances
5.1.4. Embedded Web Servers
5.2. A Minimal Perl Web Server
5.3. What Real Web Servers Do
5.4. Step 1: Accepting Client Connections
5.4.1. Handling New Connections
5.4.2. Client Hostname Identification
5.4.3. Determining the Client User Through ident
5.5. Step 2: Receiving Request Messages
5.5.1. Internal Representations of Messages
5.5.2. Connection Input/Output Processing Architectures
5.6. Step 3: Processing Requests
5.7. Step 4: Mapping and Accessing Resources
5.7.1. Docroots
5.7.1.1. Virtually hosted docroots
5.7.1.2. User home directory docroots
5.7.2. Directory Listings
5.7.3. Dynamic Content Resource Mapping
5.7.4. Server-Side Includes (SSI)
5.7.5. Access Controls
5.8. Step 5: Building Responses
5.8.1. Response Entities
5.8.2. MIME Typing
5.8.3. Redirection
5.9. Step 6: Sending Responses
5.10. Step 7: Logging
5.11. For More Information
6. Proxies
6.1. Web Intermediaries
6.1.1. Private and Shared Proxies
6.1.2. Proxies Versus Gateways
6.2. Why Use Proxies?
6.3. Where Do Proxies Go?
6.3.1. Proxy Server Deployment
6.3.2. Proxy Hierarchies
6.3.2.1. Proxy hierarchy content routing
6.3.3. How Proxies Get Traffic
6.4. Client Proxy Settings
6.4.1. Client Proxy Configuration: Manual
6.4.2. Client Proxy Configuration: PAC Files
6.4.3. Client Proxy Configuration: WPAD
6.5. Tricky Things About Proxy Requests
6.5.1. Proxy URIs Differ from Server URIs
6.5.2. The Same Problem with Virtual Hosting
6.5.3. Intercepting Proxies Get Partial URIs
6.5.4. Proxies Can Handle Both Proxy and Server Requests
6.5.5. In-Flight URI Modification
6.5.6. URI Client Auto-Expansion and Hostname Resolution
6.5.7. URI Resolution Without a Proxy
6.5.8. URI Resolution with an Explicit Proxy
6.5.9. URI Resolution with an Intercepting Proxy
6.6. Tracing Messages
6.6.1. The Via Header
6.6.1.1. Via syntax
6.6.1.2. Via request and response paths
6.6.1.3. Via and gateways
6.6.1.4. The Server and Via headers
6.6.1.5. Privacy and security implications of Via
6.6.2. The TRACE Method
6.6.2.1. Max-Forwards
6.7. Proxy Authentication
6.8. Proxy Interoperation
6.8.1. Handling Unsupported Headers and Methods
6.8.2. OPTIONS: Discovering Optional Feature Support
6.8.3. The Allow Header
6.9. For More Information
7. Caching
7.1. Redundant Data Transfers
7.2. Bandwidth Bottlenecks
7.3. Flash Crowds
7.4. Distance Delays
7.5. Hits and Misses
7.5.1. Revalidations
7.5.2. Hit Rate
7.5.3. Byte Hit Rate
7.5.4. Distinguishing Hits and Misses
7.6. Cache Topologies
7.6.1. Private Caches
7.6.2. Public Proxy Caches
7.6.3. Proxy Cache Hierarchies
7.6.4. Cache Meshes, Content Routing, and Peering
7.7. Cache Processing Steps
7.7.1. Step 1: Receiving
7.7.2. Step 2: Parsing
7.7.3. Step 3: Lookup
7.7.4. Step 4: Freshness Check
7.7.5. Step 5: Response Creation
7.7.6. Step 6: Sending
7.7.7. Step 7: Logging
7.7.8. Cache Processing Flowchart
7.8. Keeping Copies Fresh
7.8.1. Document Expiration
7.8.2. Expiration Dates and Ages
7.8.3. Server Revalidation
7.8.4. Revalidation with Conditional Methods
7.8.5. If-Modified-Since: Date Revalidation
7.8.6. If-None-Match: Entity Tag Revalidation
7.8.7. Weak and Strong Validators
7.8.8. When to Use Entity Tags and Last-Modified Dates
7.9. Controlling Cachability
7.9.1. No-Cache and No-Store Headers
7.9.2. Max-Age Response Headers
7.9.3. Expires Response Headers
7.9.4. Must-Revalidate Response Headers
7.9.5. Heuristic Expiration
7.9.6. Client Freshness Constraints
7.9.7. Cautions
7.10. Setting Cache Controls
7.10.1. Controlling HTTP Headers with Apache
7.10.2. Controlling HTML Caching Through HTTP-EQUIV
7.11. Detailed Algorithms
7.11.1. Age and Freshness Lifetime
7.11.2. Age Computation
7.11.2.1. Apparent age is based on the Date header
7.11.2.2. Hop-by-hop age calculations
7.11.2.3. Compensating for network delays
7.11.3. Complete Age-Calculation Algorithm
7.11.4. Freshness Lifetime Computation
7.11.5. Complete Server-Freshness Algorithm
7.12. Caches and Advertising
7.12.1. The Advertiser’s Dilemma
7.12.2. The Publisher’s Response
7.12.3. Log Migration
7.12.4. Hit Metering and Usage Limiting
7.13. For More Information
8. Integration Points: Gateways, Tunnels, and Relays
8.1. Gateways
8.1.1. Client-Side and Server-Side Gateways
8.2. Protocol Gateways
8.2.1. HTTP/*: Server-Side Web Gateways
8.2.2. HTTP/HTTPS: Server-Side Security Gateways
8.2.3. HTTPS/HTTP: Client-Side Security Accelerator Gateways
8.3. Resource Gateways
8.3.1. Common Gateway Interface (CGI)
8.3.2. Server Extension APIs
8.4. Application Interfaces and Web Services
8.5. Tunnels
8.5.1. Establishing HTTP Tunnels with CONNECT
8.5.1.1. CONNECT requests
8.5.1.2. CONNECT responses
8.5.2. Data Tunneling, Timing, and Connection Management
8.5.3. SSL Tunneling
8.5.4. SSL Tunneling Versus HTTP/HTTPS Gateways
8.5.5. Tunnel Authentication
8.5.6. Tunnel Security Considerations
8.6. Relays
8.7. For More Information
9. Web Robots
9.1. Crawlers and Crawling
9.1.1. Where to Start: The “Root Set”
9.1.2. Extracting Links and Normalizing Relative Links
9.1.3. Cycle Avoidance
9.1.4. Loops and Dups
9.1.5. Trails of Breadcrumbs
9.1.6. Aliases and Robot Cycles
9.1.7. Canonicalizing URLs
9.1.8. Filesystem Link Cycles
9.1.9. Dynamic Virtual Web Spaces
9.1.10. Avoiding Loops and Dups
9.2. Robotic HTTP
9.2.1. Identifying Request Headers
9.2.2. Virtual Hosting
9.2.3. Conditional Requests
9.2.4. Response Handling
9.2.4.1. Status codes
9.2.4.2. Entities
9.2.5. User-Agent Targeting
9.3. Misbehaving Robots
9.4. Excluding Robots
9.4.1. The Robots Exclusion Standard
9.4.2. Web Sites and robots.txt Files
9.4.2.1. Fetching robots.txt
9.4.2.2. Response codes
9.4.3. robots.txt File Format
9.4.3.1. The User-Agent line
9.4.3.2. The Disallow and Allow lines
9.4.3.3. Disallow/Allow prefix matching
9.4.4. Other robots.txt Wisdom
9.4.5. Caching and Expiration of robots.txt
9.4.6. Robot Exclusion Perl Code
9.4.7. HTML Robot-Control META Tags
9.4.7.1. Robot META directives
9.4.7.2. Search engine META tags
9.5. Robot Etiquette
9.6. Search Engines
9.6.1. Think Big
9.6.2. Modern Search Engine Architecture
9.6.3. Full-Text Index
9.6.4. Posting the Query
9.6.5. Sorting and Presenting the Results
9.6.6. Spoofing
9.7. For More Information
10. HTTP-NG
10.1. HTTP’s Growing Pains
10.2. HTTP-NG Activity
10.3. Modularize and Enhance
10.4. Distributed Objects
10.5. Layer 1: Messaging
10.6. Layer 2: Remote Invocation
10.7. Layer 3: Web Application
10.8. WebMUX
10.9. Binary Wire Protocol
10.10. Current Status
10.11. For More Information
III. Identification, Authorization, and Security
11. Client Identification and Cookies
11.1. The Personal Touch
11.2. HTTP Headers
11.3. Client IP Address
11.4. User Login
11.5. Fat URLs
11.6. Cookies
11.6.1. Types of Cookies
11.6.2. How Cookies Work
11.6.3. Cookie Jar: Client-Side State
11.6.3.1. Netscape Navigator cookies
11.6.3.2. Microsoft Internet Explorer cookies
11.6.4. Different Cookies for Different Sites
11.6.4.1. Cookie Domain attribute
11.6.4.2. Cookie Path attribute
11.6.5. Cookie Ingredients
11.6.6. Version 0 (Netscape) Cookies
11.6.6.1. Version 0 Set-Cookie header
11.6.6.2. Version 0 Cookie header
11.6.7. Version 1 (RFC 2965) Cookies
11.6.7.1. Version 1 Set-Cookie2 header
11.6.7.2. Version 1 Cookie header
11.6.7.3. Version 1 Cookie2 header and version negotiation
11.6.8. Cookies and Session Tracking
11.6.9. Cookies and Caching
11.6.10. Cookies, Security, and Privacy
11.7. For More Information
12. Basic Authentication
12.1. Authentication
12.1.1. HTTP’s Challenge/Response Authentication Framework
12.1.2. Authentication Protocols and Headers
12.1.3. Security Realms
12.2. Basic Authentication
12.2.1. Basic Authentication Example
12.2.2. Base-64 Username/Password Encoding
12.2.3. Proxy Authentication
12.3. The Security Flaws of Basic Authentication
12.4. For More Information
13. Digest Authentication
13.1. The Improvements of Digest Authentication
13.1.1. Using Digests to Keep Passwords Secret
13.1.2. One-Way Digests
13.1.3. Using Nonces to Prevent Replays
13.1.4. The Digest Authentication Handshake
13.2. Digest Calculations
13.2.1. Digest Algorithm Input Data
13.2.2. The Algorithms H(d) and KD(s,d)
13.2.3. The Security-Related Data (A1)
13.2.4. The Message-Related Data (A2)
13.2.5. Overall Digest Algorithm
13.2.6. Digest Authentication Session
13.2.7. Preemptive Authorization
13.2.7.1. Next nonce pregeneration
13.2.7.2. Limited nonce reuse
13.2.7.3. Synchronized nonce generation
13.2.8. Nonce Selection
13.2.9. Symmetric Authentication
13.3. Quality of Protection Enhancements
13.3.1. Message Integrity Protection
13.3.2. Digest Authentication Headers
13.4. Practical Considerations
13.4.1. Multiple Challenges
13.4.2. Error Handling
13.4.3. Protection Spaces
13.4.4. Rewriting URIs
13.4.5. Caches
13.5. Security Considerations
13.5.1. Header Tampering
13.5.2. Replay Attacks
13.5.3. Multiple Authentication Mechanisms
13.5.4. Dictionary Attacks
13.5.5. Hostile Proxies and Man-in-the-Middle Attacks
13.5.6. Chosen Plaintext Attacks
13.5.7. Storing Passwords
13.6. For More Information
14. Secure HTTP
14.1. Making HTTP Safe
14.1.1. HTTPS
14.2. Digital Cryptography
14.2.1. The Art and Science of Secret Coding
14.2.2. Ciphers
14.2.3. Cipher Machines
14.2.4. Keyed Ciphers
14.2.5. Digital Ciphers
14.3. Symmetric-Key Cryptography
14.3.1. Key Length and Enumeration Attacks
14.3.2. Establishing Shared Keys
14.4. Public-Key Cryptography
14.4.1. RSA
14.4.2. Hybrid Cryptosystems and Session Keys
14.5. Digital Signatures
14.5.1. Signatures Are Cryptographic Checksums
14.6. Digital Certificates
14.6.1. The Guts of a Certificate
14.6.2. X.509 v3 Certificates
14.6.3. Using Certificates to Authenticate Servers
14.7. HTTPS: The Details
14.7.1. HTTPS Overview
14.7.2. HTTPS Schemes
14.7.3. Secure Transport Setup
14.7.4. SSL Handshake
14.7.5. Server Certificates
14.7.6. Site Certificate Validation
14.7.7. Virtual Hosting and Certificates
14.8. A Real HTTPS Client
14.8.1. OpenSSL
14.8.2. A Simple HTTPS Client
14.8.3. Executing Our Simple OpenSSL Client
14.9. Tunneling Secure Traffic Through Proxies
14.10. For More Information
14.10.1. HTTP Security
14.10.2. SSL and TLS
14.10.3. Public-Key Infrastructure
14.10.4. Digital Cryptography
IV. Entities, Encodings, and Internationalization
15. Entities and Encodings
15.1. Messages Are Crates, Entities Are Cargo
15.1.1. Entity Bodies
15.2. Content-Length: The Entity’s Size
15.2.1. Detecting Truncation
15.2.2. Incorrect Content-Length
15.2.3. Content-Length and Persistent Connections
15.2.4. Content Encoding
15.2.5. Rules for Determining Entity Body Length
15.3. Entity Digests
15.4. Media Type and Charset
15.4.1. Character Encodings for Text Media
15.4.2. Multipart Media Types
15.4.3. Multipart Form Submissions
15.4.4. Multipart Range Responses
15.5. Content Encoding
15.5.1. The Content-Encoding Process
15.5.2. Content-Encoding Types
15.5.3. Accept-Encoding Headers
15.6. Transfer Encoding and Chunked Encoding
15.6.1. Safe Transport
15.6.2. Transfer-Encoding Headers
15.6.3. Chunked Encoding
15.6.3.1. Chunking and persistent connections
15.6.3.2. Trailers in chunked messages
15.6.4. Combining Content and Transfer Encodings
15.6.5. Transfer-Encoding Rules
15.7. Time-Varying Instances
15.8. Validators and Freshness
15.8.1. Freshness
15.8.2. Conditionals and Validators
15.9. Range Requests
15.10. Delta Encoding
15.10.1. Instance Manipulations, Delta Generators, and Delta Appliers
15.11. For More Information
16. Internationalization
16.1. HTTP Support for International Content
16.2. Character Sets and HTTP
16.2.1. Charset Is a Character-to-Bits Encoding
16.2.2. How Character Sets and Encodings Work
16.2.3. The Wrong Charset Gives the Wrong Characters
16.2.4. Standardized MIME Charset Values
16.2.5. Content-Type Charset Header and META Tags
16.2.6. The Accept-Charset Header
16.3. Multilingual Character Encoding Primer
16.3.1. Character Set Terminology
16.3.2. Charset Is Poorly Named
16.3.3. Characters
16.3.4. Glyphs, Ligatures, and Presentation Forms
16.3.5. Coded Character Sets
16.3.5.1. US-ASCII: The mother of all character sets
16.3.5.2. iso-8859
16.3.5.3. JIS X 0201
16.3.5.4. JIS X 0208 and JIS X 0212
16.3.5.5. UCS
16.3.6. Character Encoding Schemes
16.3.6.1. 8-bit
16.3.6.2. UTF-8
16.3.6.3. iso-2022-jp
16.3.6.4. euc-jp
16.4. Language Tags and HTTP
16.4.1. The Content-Language Header
16.4.2. The Accept-Language Header
16.4.3. Types of Language Tags
16.4.4. Subtags
16.4.5. Capitalization
16.4.6. IANA Language Tag Registrations
16.4.7. First Subtag: Namespace
16.4.8. Second Subtag: Namespace
16.4.9. Remaining Subtags: Namespace
16.4.10. Configuring Language Preferences
16.4.11. Language Tag Reference Tables
16.5. Internationalized URIs
16.5.1. Global Transcribability Versus Meaningful Characters
16.5.2. URI Character Repertoire
16.5.3. Escaping and Unescaping
16.5.4. Escaping International Characters
16.5.5. Modal Switches in URIs
16.6. Other Considerations
16.6.1. Headers and Out-of-Spec Data
16.6.2. Dates
16.6.3. Domain Names
16.7. For More Information
16.7.1. Appendixes
16.7.2. Internet Internationalization
16.7.3. International Standards
17. Content Negotiation and Transcoding
17.1. Content-Negotiation Techniques
17.2. Client-Driven Negotiation
17.3. Server-Driven Negotiation
17.3.1. Content-Negotiation Headers
17.3.2. Content-Negotiation Header Quality Values
17.3.3. Varying on Other Headers
17.3.4. Content Negotiation on Apache
17.3.4.1. Using type-map files
17.3.4.2. Using MultiViews
17.3.5. Server-Side Extensions
17.4. Transparent Negotiation
17.4.1. Caching and Alternates
17.4.2. The Vary Header
17.5. Transcoding
17.5.1. Format Conversion
17.5.2. Information Synthesis
17.5.3. Content Injection
17.5.4. Transcoding Versus Static Pregeneration
17.6. Next Steps
17.7. For More Information
V. Content Publishing and Distribution
18. Web Hosting
18.1. Hosting Services
18.1.1. A Simple Example: Dedicated Hosting
18.2. Virtual Hosting
18.2.1. Virtual Server Request Lacks Host Information
18.2.2. Making Virtual Hosting Work
18.2.2.1. Virtual hosting by URL path
18.2.2.2. Virtual hosting by port number
18.2.2.3. Virtual hosting by IP address
18.2.2.4. Virtual hosting by Host header
18.2.3. HTTP/1.1 Host Headers
18.2.3.1. Syntax and usage
18.2.3.2. Missing Host headers
18.2.3.3. Interpreting Host headers
18.2.3.4. Host headers and proxies
18.3. Making Web Sites Reliable
18.3.1. Mirrored Server Farms
18.3.2. Content Distribution Networks
18.3.3. Surrogate Caches in CDNs
18.3.4. Proxy Caches in CDNs
18.4. Making Web Sites Fast
18.5. For More Information
19. Publishing Systems
19.1. FrontPage Server Extensions for Publishing Support
19.1.1. FrontPage Server Extensions
19.1.2. FrontPage Vocabulary
19.1.3. The FrontPage RPC Protocol
19.1.3.1. Request
19.1.3.2. Response
19.1.4. FrontPage Security Model
19.2. WebDAV and Collaborative Authoring
19.2.1. WebDAV Methods
19.2.2. WebDAV and XML
19.2.3. WebDAV Headers
19.2.4. WebDAV Locking and Overwrite Prevention
19.2.5. The LOCK Method
19.2.5.1. The opaquelocktoken scheme
19.2.5.2. The <lockdiscovery> XML element
19.2.5.3. Lock refreshes and the Timeout header
19.2.6. The UNLOCK Method
19.2.7. Properties and META Data
19.2.8. The PROPFIND Method
19.2.9. The PROPPATCH Method
19.2.10. Collections and Namespace Management
19.2.11. The MKCOL Method
19.2.12. The DELETE Method
19.2.13. The COPY and MOVE Methods
19.2.13.1. Overwrite header effect
19.2.13.2. COPY/MOVE of properties
19.2.13.3. Locked resources and COPY/MOVE
19.2.14. Enhanced HTTP/1.1 Methods
19.2.14.1. The PUT method
19.2.14.2. The OPTIONS method
19.2.15. Version Management in WebDAV
19.2.16. Future of WebDAV
19.3. For More Information
20. Redirection and Load Balancing
20.1. Why Redirect?
20.2. Where to Redirect
20.3. Overview of Redirection Protocols
20.4. General Redirection Methods
20.4.1. HTTP Redirection
20.4.2. DNS Redirection
20.4.2.1. DNS round robin
20.4.2.2. Multiple addresses and round-robin address rotation
20.4.2.3. DNS round robin for load balancing
20.4.2.4. The impact of DNS caching
20.4.2.5. Other DNS-based redirection algorithms
20.4.3. Anycast Addressing
20.4.4. IP MAC Forwarding
20.4.5. IP Address Forwarding
20.4.6. Network Element Control Protocol
20.4.6.1. Messages
20.5. Proxy Redirection Methods
20.5.1. Explicit Browser Configuration
20.5.2. Proxy Auto-configuration
20.5.3. Web Proxy Autodiscovery Protocol
20.5.3.1. PAC file autodiscovery
20.5.3.2. WPAD algorithm
20.5.3.3. CURL discovery using DHCP
20.5.3.4. DNS A record lookup
20.5.3.5. Retrieving the PAC file
20.5.3.6. When to execute WPAD
20.5.3.7. WPAD spoofing
20.5.3.8. Timeouts
20.5.3.9. Administrator considerations
20.6. Cache Redirection Methods
20.6.1. WCCP Redirection
20.6.1.1. How WCCP redirection works
20.6.1.2. WCCP2 messages
20.6.1.3. Message components
20.6.1.4. Service groups
20.6.1.5. GRE packet encapsulation
20.6.1.6. WCCP load balancing
20.7. Internet Cache Protocol
20.8. Cache Array Routing Protocol
20.9. Hyper Text Caching Protocol
20.9.1. HTCP Authentication
20.9.2. Setting Caching Policies
20.10. For More Information
21. Logging and Usage Tracking
21.1. What to Log?
21.2. Log Formats
21.2.1. Common Log Format
21.2.2. Combined Log Format
21.2.3. Netscape Extended Log Format
21.2.4. Netscape Extended 2 Log Format
21.2.5. Squid Proxy Log Format
21.3. Hit Metering
21.3.1. Overview
21.3.2. The Meter Header
21.4. A Word on Privacy
21.5. For More Information
VI. Appendixes
A. URI Schemes
B. HTTP Status Codes
B.1. Status Code Classifications
B.2. Status Codes
C. HTTP Header Reference
Accept
Accept-Charset
Accept-Encoding
Accept-Language
Accept-Ranges
Age
Allow
Authorization
Cache-Control
Client-ip
Connection
Content-Base
Content-Encoding
Content-Language
Content-Length
Content-Location
Content-MD5
Content-Range
Content-Type
Cookie
Cookie2
Date
ETag
Expect
Expires
From
Host
If-Modified-Since
If-Match
If-None-Match
If-Range
If-Unmodified-Since
Last-Modified
Location
Max-Forwards
MIME-Version
Pragma
Proxy-Authenticate
Proxy-Authorization
Proxy-Connection
Public
Range
Referer
Retry-After
Server
Set-Cookie
Set-Cookie2
TE
Trailer
Title
Transfer-Encoding
UA-(CPU, Disp, OS, Color, Pixels)
Upgrade
User-Agent
Vary
Via
Warning
WWW-Authenticate
X-Cache
X-Forwarded-For
X-Pad
X-Serial-Number
D. MIME Types
D.1. Background
D.2. MIME Type Structure
D.2.1. Discrete Types
D.2.2. Composite Types
D.2.3. Multipart Types
D.2.4. Syntax
D.3. MIME Type IANA Registration
D.3.1. Registration Trees
D.3.2. Registration Process
D.3.3. Registration Rules
D.3.4. Registration Template
D.3.5. MIME Media Type Registry
D.4. MIME Type Tables
D.4.1. application/*
D.4.2. audio/*
D.4.3. chemical/*
D.4.4. image/*
D.4.5. message/*
D.4.6. model/*
D.4.7. multipart/*
D.4.8. text/*
D.4.9. video/*
D.4.10. Experimental Types
E. Base-64 Encoding
E.1. Base-64 Encoding Makes Binary Data Safe
E.2. Eight Bits to Six Bits
E.3. Base-64 Padding
E.4. Perl Implementation
E.5. For More Information
F. Digest Authentication
F.1. Digest WWW-Authenticate Directives
F.2. Digest Authorization Directives
F.3. Digest Authentication-Info Directives
F.4. Reference Code
F.4.1. File “digcalc.h”
F.4.2. File “digcalc.c”
F.4.3. File “digtest.c”
G. Language Tags
G.1. First Subtag Rules
G.2. Second Subtag Rules
G.3. IANA-Registered Language Tags
G.4. ISO 639 Language Codes
G.5. ISO 3166 Country Codes
G.6. Language Administrative Organizations
H. MIME Charset Registry
H.1. MIME Charset Registry
H.2. Preferred MIME Names
H.3. Registered Charsets
Index
About the Authors
Colophon
Copyright

Content preview from HTTP: The Definitive Guide

Chapter 9. Web Robots

We continue our tour of HTTP architecture with a close look at the self-animating user agents called web robots.

Web robots are software programs that automate a series of web transactions without human interaction. Many robots wander from web site to web site, fetching content, following hyperlinks, and processing the data they find. These kinds of robots are given colorful names such as “crawlers,” “spiders,” “worms,” and “bots” because of the way they automatically explore web sites, seemingly with minds of their own.

Here are a few examples of web robots:

Stock-graphing robots issue HTTP GETs to stock market servers every few minutes and use the data to build stock price trend graphs.
Web-census robots gather “census” information about the scale and evolution of the World Wide Web. They wander the Web counting the number of pages and recording the size, language, and media type of each page.^[1]
Search-engine robots collect all the documents they find to create search databases.
Comparison-shopping robots gather web pages from online store catalogs to build databases of products and their prices.

^[1] http://www.netcraft.com collects great census metrics on what flavors of servers are being used by sites around the Web.
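
To make the idea of a web-census robot concrete, here is a minimal sketch in Python, not code from the book. It walks outward from a starting URL, follows the hyperlinks it finds, and records the size and media type of each page. The names `LinkExtractor`, `http_fetch`, and `census`, the `max_pages` limit, and the choice of a breadth-first walk are all assumptions made for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def http_fetch(url):
    """Issue an HTTP GET and return (media_type, body_bytes)."""
    with urlopen(url, timeout=10) as response:
        return response.headers.get_content_type(), response.read()


def census(start_url, fetch=http_fetch, max_pages=50):
    """Walk the web breadth-first from start_url (hypothetical helper).

    Returns a dict mapping each fetched URL to its media type and size.
    """
    seen = {start_url}          # URLs already queued, so we never loop
    queue = deque([start_url])
    report = {}
    while queue and len(report) < max_pages:
        url = queue.popleft()
        try:
            media_type, body = fetch(url)
        except OSError:
            continue            # unreachable page: skip it and keep going
        report[url] = {"media_type": media_type, "size": len(body)}
        if media_type != "text/html":
            continue            # only HTML pages contain links to follow
        parser = LinkExtractor()
        parser.feed(body.decode("utf-8", errors="replace"))
        for href in parser.links:
            # Resolve relative links and drop #fragments before comparing.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return report
```

Calling `census("http://www.example.com/")` would fetch real pages, so the `fetch` parameter exists to let an in-memory stand-in be swapped in for experiments. This sketch omits things a well-behaved robot needs, such as honoring robots.txt and pacing its requests.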

Publisher Resources

ISBN: 1565925092
