As we've mentioned earlier in this book, many server administrators consider automated clients, or robots, an invasion of their resources. A robot is a web client that retrieves documents automatically, often in rapid-fire succession. Examples of robots are indexers for search engines, content mirroring programs, and link traversal programs. While many server administrators welcome robots--how else will they be listed by search engines and attract potential customers?--others would prefer that they stay out.
The Robot Exclusion Standard was devised in 1994 to give administrators an opportunity to make their preferences known. It describes how a web server administrator can designate certain areas of a website as “off limits” for certain (or all) web robots. Its author, Martijn Koster, maintains the original document at http://info.webcrawler.com/mak/projects/robots/norobots.html and also provides an informational RFC at http://info.webcrawler.com/mak/projects/robots/norobots-rfc.txt. The RFC adds some features to those in the original 1994 document.
The success of the Robot Exclusion Standard depends on web application programmers being good citizens and heeding it carefully. While it can't serve as a locked door, it can serve as a clear “Do Not Disturb” sign. You ignore it at the peril of (at best) being called a cad, and (at worst) being explicitly locked out if you persist, and having angry complaints sent to your boss or system administrator or both. This appendix gives you the basic idea behind the Robot Exclusion Standard, but you should also check the RFC itself.
In a nutshell, the Robot Exclusion Standard declares that a web server administrator should create a document accessible at the relative URL /robots.txt. For example, a remote client would access a robots.txt file at the server hypothetical.ora.com using the following URL:
    http://hypothetical.ora.com/robots.txt
If the web server returns a status of 200 (OK) for the URL, the client should parse and interpret the resulting entity-body (described below). In other cases, status codes in the range of 300-399 indicate URL redirections, which should be followed by the client. Status codes of 401 (Unauthorized) or 403 (Forbidden) indicate access restrictions and the client should avoid the entire site. A 404 (Not Found) indicates that the administrator did not specify any Robot Exclusion Standard and the entire site is okay to visit.
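The status handling described above can be sketched as a small lookup. This is only an illustration in Python, not part of the standard: the function names robots_url and robots_policy and the returned labels are hypothetical, and the fallback for status codes the text does not mention is an assumption.

```python
from urllib.parse import urljoin


def robots_url(any_url):
    """Return the /robots.txt URL for the server that hosts any_url."""
    return urljoin(any_url, "/robots.txt")


def robots_policy(status):
    """Map the HTTP status of a /robots.txt request to a next step."""
    if status == 200:
        return "parse"            # read the rules in the entity-body
    if 300 <= status <= 399:
        return "follow-redirect"  # retry at the redirected URL
    if status in (401, 403):
        return "avoid-site"       # access is restricted; stay out
    if status == 404:
        return "allow-all"        # no exclusion rules were published
    # Assumption: the text does not cover other codes, so this sketch
    # takes the cautious route and avoids the site.
    return "avoid-site"
```

A robot would call robots_url() once per server, fetch the result, and pass the response status to robots_policy() before requesting anything else from that server.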
Here's the good news if you use LWP for your programs: LWP::RobotUA takes care of all this for you. While it's still good to know about the standard, you can rest easy--yet another perk of using LWP. See Chapter 5 for an example using LWP::RobotUA.
When clients receive the robots.txt file, they need to parse it to determine whether they are allowed access to the site. There are three basic directives that can be in the robots.txt file: User-agent, Allow, and Disallow.
The User-agent directive names the robot to which the subsequent Allow and Disallow statements apply. A robot should compare this value with its own user agent name, ignoring case. Version numbers are not used in the comparison.
If the robots.txt file specifies a * as a User-agent, the statements that follow apply to all robots rather than to any particular one. So if an administrator wants to shut out all robots from an entire site, the robots.txt file only needs the following two lines:
    User-agent: *
    Disallow: /
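The comparison a robot makes against a User-agent value might look like the following Python sketch. The function name agent_matches is hypothetical, and two details are assumptions: the version is taken to be whatever follows a slash in the robot's name (as in search-thingy/1.0), and the names must be equal once case and version are ignored.

```python
def agent_matches(directive_value, robot_agent):
    """Return True if a User-agent value applies to this robot."""
    value = directive_value.strip()
    if value == "*":
        return True  # the wildcard addresses every robot
    # Assumption: the version follows a slash, e.g. "search-thingy/1.0".
    robot_name = robot_agent.split("/", 1)[0].strip()
    listed_name = value.split("/", 1)[0].strip()
    # Case-insensitive comparison of the bare names.
    return listed_name.lower() == robot_name.lower()
```

For example, agent_matches("Search-Thingy", "search-thingy/1.0") is true, while agent_matches("friendly-indexer", "search-thingy/1.0") is false.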
The Allow and Disallow directives indicate areas of the site to which the previously listed User-agent is allowed or denied access. Instead of listing every URL that is allowed or disallowed, each directive gives a general prefix that describes what is allowed or disallowed. For example:
    Disallow: /index
would match both /index.html and /index/summary.html, while:
    Disallow: /index/
would match only URLs in /index/. In the extreme case,
    Disallow: /
specifies the entire web site.
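Because each directive is a prefix, matching reduces to a string comparison against the start of the requested path. A minimal Python sketch, with rule_matches as a hypothetical name:

```python
def rule_matches(prefix, path):
    """Return True if the URL path begins with the directive's prefix."""
    return path.startswith(prefix)


# The three prefixes discussed in the text, tried against two paths.
for prefix in ("/index", "/index/", "/"):
    for path in ("/index.html", "/index/summary.html"):
        print(prefix, path, rule_matches(prefix, path))
```

The only combination that fails to match is the prefix /index/ against /index.html, since that path has no slash after "index".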
Multiple User-agents can be specified within a robots.txt file. For example,
    User-agent: friendly-indexer
    User-agent: search-thingy
    Disallow: /cgi-bin/
    Allow: /
specifies that the allow and disallow statements apply to both the friendly-indexer and search-thingy robots.
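One way to read such a file is to group consecutive User-agent lines into a single record that shares the rules following them. The Python sketch below is an illustration under that assumption; parse_robots and its (agents, rules) tuples are hypothetical names and shapes, and the sketch ignores anything other than the three directives.

```python
def parse_robots(text):
    """Split robots.txt text into records of (agents, rules)."""
    records = []
    agents, rules = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue  # blank or unrecognized line
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if rules:
                # A User-agent line after some rules starts a new record.
                records.append((agents, rules))
                agents, rules = [], []
            agents.append(value)
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        records.append((agents, rules))
    return records
```

Given the four-line example above, the result is a single record whose agents are friendly-indexer and search-thingy and whose rules are a disallow for /cgi-bin/ followed by an allow for /.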
The robots.txt file moves from general to specific; that is, subsequent listings can override previous ones. For example:
    User-agent: *
    Disallow: /

    User-agent: search-thingy
    Allow: /
would specify that all robots should go away, except the search-thingy robot.
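The override behavior can be sketched by walking the records in order and letting each later record that applies to the robot replace the verdict of an earlier one. This Python sketch is an illustration, not a reference implementation. The name allowed and the record layout are hypothetical, and three choices are assumptions: a path that no rule mentions is open, the first matching rule within a record decides that record, and an empty rule value matches nothing.

```python
def allowed(records, robot_name, path):
    """Decide whether robot_name may fetch path.

    records is a list of (agents, rules) pairs in file order, where
    rules is a list of ("allow" | "disallow", prefix) pairs.
    """
    verdict = True  # assumption: paths that no rule mentions are open
    for agents, rules in records:
        applies = any(
            agent == "*" or agent.lower() == robot_name.lower()
            for agent in agents
        )
        if not applies:
            continue
        for kind, prefix in rules:
            # Assumption: an empty value matches nothing in this sketch.
            if prefix and path.startswith(prefix):
                verdict = (kind == "allow")  # later record overrides
                break  # assumption: first match within a record wins
    return verdict


records = [
    (["*"], [("disallow", "/")]),
    (["search-thingy"], [("allow", "/")]),
]
print(allowed(records, "search-thingy", "/index.html"))
print(allowed(records, "other-robot", "/index.html"))
```

With these records, search-thingy is admitted because the second record overrides the blanket Disallow, and every other robot is turned away.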