Misbehaving Robots
There are many ways that wayward robots can cause mayhem. Here are a few mistakes robots can make, and the impact of their misdeeds:
- Runaway robots
Robots issue HTTP requests much faster than human web surfers, and they commonly run on fast computers with fast network links. If a robot contains a programming logic error, or gets caught in a cycle, it can throw intense load against a web server, quite possibly enough to overload the server and deny service to anyone else. All robot authors must take extreme care to design in safeguards against runaway behavior; the first sketch after this list shows a few simple ones.
- Stale URLs
Some robots visit lists of URLs, and these lists can be old. If a web site makes a big change in its content, robots may request large numbers of nonexistent URLs. This annoys some web site administrators, who don’t like their error logs filling with access requests for nonexistent documents and don’t like having their web server capacity reduced by the overhead of serving error pages. The second sketch after this list shows how a robot can prune such dead URLs from its lists.
- Long, wrong URLs
As a result of cycles and programming errors, robots may request large, nonsense URLs from web sites. If the URL is long enough, it may reduce the performance of the web server, clutter the web server access logs, and even cause fragile web servers to crash. The third sketch after this list shows a simple length guard.
- Nosy robots
Some robots may get URLs that point to private data and make that data easily accessible through Internet search engines and other applications. If the owner of the data didn’t actively advertise the web pages, she may view the robotic publishing of them as an invasion of privacy. The last sketch after this list shows how a robot can at least honor a site’s robots.txt file before fetching.
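The safeguards mentioned under “Runaway robots” need not be elaborate. What follows is a minimal sketch, not code from this book: it assumes a simple list-based crawl frontier, and the names (polite_fetch, MAX_REQUESTS, PER_HOST_DELAY) are illustrative. A visited set breaks cycles, a per-host delay caps the request rate, and a hard request budget stops a crawl that has run away despite both.

```python
# Sketch of runaway-robot safeguards (illustrative, not the book's code).
import time
import urllib.parse
import urllib.request

MAX_REQUESTS = 1000        # hard budget: a runaway crawl stops here
PER_HOST_DELAY = 2.0       # seconds between requests to any one host

visited = set()            # URLs already fetched; breaks simple cycles
last_fetch = {}            # host -> monotonic time of last request

def polite_fetch(url):
    """Fetch url, sleeping first so no single host is hammered."""
    host = urllib.parse.urlsplit(url).hostname
    wait = PER_HOST_DELAY - (time.monotonic() - last_fetch.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)
    last_fetch[host] = time.monotonic()
    return urllib.request.urlopen(url, timeout=10)

def crawl(frontier):
    requests_made = 0
    while frontier and requests_made < MAX_REQUESTS:
        url = frontier.pop()
        if url in visited:
            continue                      # cycle guard: never refetch
        visited.add(url)
        response = polite_fetch(url)
        requests_made += 1
        # ... parse response, extract links, append new URLs to frontier ...
```

The budget is deliberately blunt: a robot that stops early is a minor nuisance, while a robot that loops forever amounts to a denial-of-service attack.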
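For stale URL lists, a robot can stop repeating its mistakes by removing entries the server reports as permanently gone. Another hedged sketch under the same assumptions; prune_stale is a hypothetical helper, and a real robot would also track retry counts.

```python
# Sketch: prune stale entries from an old URL list instead of
# re-requesting them forever (error handling simplified).
import urllib.error
import urllib.request

def prune_stale(url_list):
    """Return url_list minus entries the server says no longer exist."""
    live = []
    for url in url_list:
        try:
            # a HEAD request learns the status without fetching the body
            head = urllib.request.Request(url, method="HEAD")
            urllib.request.urlopen(head, timeout=10)
            live.append(url)
        except urllib.error.HTTPError as err:
            if err.code in (404, 410):
                continue              # document is gone: drop it for good
            live.append(url)          # other statuses may be transient
        except urllib.error.URLError:
            live.append(url)          # network trouble: keep, retry later
    return live
```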
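Guarding against long, wrong URLs can be as simple as refusing to enqueue anything past a sanity limit, before the URL ever reaches the network. The 2,048-character cutoff below is an assumption reflecting common practice, not a value from this book or a protocol constant.

```python
# Sketch: suspiciously long URLs are usually cycle artifacts or
# parser bugs; drop them at enqueue time (limit is an assumption).
MAX_URL_LEN = 2048

def enqueue(frontier, url):
    """Add url to the crawl frontier unless it looks like cycle junk."""
    if len(url) > MAX_URL_LEN:
        print("skipping oversized URL (%d chars)" % len(url))
        return
    frontier.append(url)
```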
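There is no programmatic test for “private,” but a robot can at least honor the Robots Exclusion Standard before fetching a page, so that pages an owner has excluded stay out of the index. Python’s standard library ships a robots.txt parser; the sketch below uses it, with the agent name example-robot as a placeholder.

```python
# Sketch: consult a site's robots.txt before fetching, using the
# standard-library parser (agent name is a placeholder).
import urllib.parse
import urllib.robotparser

def allowed(url, agent="example-robot"):
    """True if the site's robots.txt permits agent to fetch url."""
    parts = urllib.parse.urlsplit(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("%s://%s/robots.txt" % (parts.scheme, parts.netloc))
    rp.read()                     # fetch and parse the exclusion file
    return rp.can_fetch(agent, url)

# a polite robot checks before each fetch:
#     if allowed(url):
#         response = polite_fetch(url)
```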