
Robots and Spiders
Some hits to your web site will come from programs called robots. Some of these gather data for search engines and are also called spiders. A well-behaved robot is supposed to read and obey the robots.txt file in your site's home directory. This file tells it which files and directories may be searched. You should have a robots.txt file in the top directory of each web site. Exclude all directories with CGI scripts (anything marked as ScriptAlias, such as /cgi-bin), images, access-controlled content, or any other content that should not be exposed to the world. Here's a simple example:
User-agent: *
Disallow: /image_dir
Disallow: /cgi-bin
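robots.txt can also carry per-agent records. As a sketch of that form (HypotheticalBot is a made-up robot name standing in for one you want to shut out entirely), the file might look like this:
User-agent: HypotheticalBot
Disallow: /

User-agent: *
Disallow: /image_dir
Disallow: /cgi-bin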
Many robots are spiders, used by web search engines to help catalogue the Web's vast expanses. Good ones obey the robots.txt rules and have other indexing heuristics. They try to examine only static content and ignore things that look like CGI scripts (such as URLs containing ? or /cgi-bin). Web scripts can use the PATH_INFO environment variable and Apache rewriting rules to make CGI scripts search-engine friendly.
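As a rough sketch of that idea (the /catalog URL and the catalog.cgi script are hypothetical names, not anything shipped with Apache), a mod_rewrite rule in httpd.conf can present a static-looking URL to spiders while handing the trailing path to the script through PATH_INFO rather than a query string:
RewriteEngine On
# /catalog/widgets/42 is handled by /cgi-bin/catalog.cgi,
# which finds "/widgets/42" in the PATH_INFO environment variable.
# [PT] passes the result back through ScriptAlias mapping.
RewriteRule ^/catalog/(.*)$ /cgi-bin/catalog.cgi/$1 [PT]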
The robot exclusion standard is documented at http://www.robotstxt.org/wc/norobots.html and http://www.robotstxt.org/wc/robots.html.
Rude robots can be excluded with environment variables and access control: ...
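For instance, a minimal sketch of that approach with mod_setenvif and host-based access control (RudeRobot stands in for whatever User-Agent string the offending robot sends):
# Tag requests whose User-Agent starts with "RudeRobot"
BrowserMatchNoCase ^RudeRobot rude_robot
<Location />
    Order Allow,Deny
    Allow from all
    # Refuse any request that carries the rude_robot variable
    Deny from env=rude_robot
</Location>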