Robot Etiquette
In 1993, Martijn Koster, a pioneer in the web robot community, wrote up a list of guidelines for authors of web robots. While some of the advice is dated, much of it still is quite useful. Martijn’s original treatise, “Guidelines for Robot Writers,” can be found at http://www.robotstxt.org/wc/guidelines.html.
Table 9-6 provides a modern update for robot designers and operators, based heavily on the spirit and content of the original list. Most of these guidelines are targeted at World Wide Web robots; however, they are applicable to smaller-scale crawlers too.
Table 9-6. Guidelines for web robot operators
Guideline |
Description |
---|---|
(1) Identification | |
Identify Your Robot |
Use the HTTP User-Agent field to tell web servers the name of your robot. This will help administrators understand what your robot is doing. Some robots also include a URL describing the purpose and policies of the robot in the User-Agent header. |
Identify Your Machine |
Make sure your robot runs from a machine with a DNS entry, so web sites can reverse-DNS the robot IP address into a hostname. This will help the administrator identify the organization responsible for the robot. |
Identify a Contact |
Use the HTTP From field to provide a contact email address. |
(2) Operations | |
Be Alert |
Your robot will generate questions and complaints. Some of this is caused by robots that run astray. You must be cautious and watchful that your robot is behaving correctly. If your robot runs around the clock, ... |
Get HTTP: The Definitive Guide now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.