Excluding the Bot

There are a number of reasons you might want to block bots from all, or part, of your site. For example, if your site is not complete, if you have broken links, or if you haven’t prepared your site for a search engine visit, you probably don’t want to be indexed yet. You may also want to protect parts of your site from being indexed if those parts contain sensitive information or pages that you know cannot be accurately traversed or parsed.

Note

Google requests that you block URLs that will give the bot hiccups, such as dynamic URLs containing calendar information that can expand infinitely. You can block an individual link by adding a nofollow value to the rel attribute of its anchor tag. For example:

<a rel="nofollow" href="botcantgohere">No follow me</a>
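
To see how a well-behaved crawler might treat such links, here is a minimal Python sketch that splits the links on a page into those it may follow and those marked nofollow. The names LinkSorter and sort_links are hypothetical, chosen for illustration; the sketch assumes nothing about how Google actually implements this.

```python
from html.parser import HTMLParser


class LinkSorter(HTMLParser):
    """Sort anchor hrefs by whether their rel attribute contains nofollow."""

    def __init__(self):
        super().__init__()
        self.follow = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = {name: (value or "") for name, value in attrs}
        href = attr.get("href")
        if href is None:
            return  # an anchor without an href is not a link
        # rel may hold several space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = attr.get("rel", "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.follow.append(href)


def sort_links(html):
    """Return (followable_hrefs, nofollow_hrefs) for an HTML string."""
    sorter = LinkSorter()
    sorter.feed(html)
    return sorter.follow, sorter.nofollow
```

Passing the anchor above to sort_links places botcantgohere in the second list, so a compliant crawler would skip it.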

Figure 4-5. Lynx Viewer makes it easy to focus on text and links without the distraction of the image-rich version

If you need to, you can make sure that part of your site does not get indexed by any search engine.

Note

Following the no-robots protocol is voluntary and based on the honor system. All you can really be sure of is that a legitimate search engine that follows the protocol will not index the prohibited parts of your site when it crawls from the root of your site (if there are external links to excluded pages, these may still be traversed regardless of your policy file). Don’t rely on search engine exclusion for security. Information that needs to be protected should be kept in password-protected locations and served by software hardened for security purposes.

Figure 4-6. Compared with the identical page in a text-only view (Figure 4-5), it’s hard to focus on just the text and links

The robots.txt File

To block bots from traversing your site, place a text file named robots.txt in your site’s web root directory (where the HTML files for your site are placed). The following syntax in the robots.txt file blocks all compliant bots from traversing your entire site:

User-agent: *
Disallow: /
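
One way to check what a policy like this means for a compliant bot is to feed it to the robots.txt parser in Python's standard library. The sketch below is only an illustration: the helper is_allowed, the bot name somebot, and the example.com URL are placeholders, and the policy is held in a string rather than fetched from a live site.

```python
from urllib.robotparser import RobotFileParser

# The two-line policy from the text, kept in a string for testing.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a compliant bot named user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Both bots are refused, because the policy applies to every user agent.
    for agent in ("googlebot", "somebot"):
        print(agent, is_allowed(BLOCK_ALL, agent, "http://example.com/index.html"))
```

Keep in mind that this only models a bot that chooses to obey the file; nothing here enforces the policy.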

You can exercise more granular control over which bots you ban and which parts of your site are off-limits as follows:

  • The User-agent line specifies the bot that is to be banished.

  • The Disallow line specifies a path relative to your root directory that is banned territory.

Note

A single robots.txt file can include multiple User-agent bot bannings, each disallowing different paths.

For example, you would tell the Google search bot not to look in your cgi-bin directory (assuming the cgi-bin directory is right beneath your web root directory) by placing the following two lines in your robots.txt file:

User-agent: googlebot
Disallow: /cgi-bin
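
A single file can combine several such groups. As a rough sketch, the Python below parses a hypothetical policy with three groups and checks how each applies: Googlebot is kept out of the cgi-bin directory, a made-up bot called examplebot is banned from the whole site, and every other bot is unrestricted. The policy, bot names, and example.com host are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy with one group per user agent.
POLICY = """\
User-agent: googlebot
Disallow: /cgi-bin

User-agent: examplebot
Disallow: /

User-agent: *
Disallow:
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def check(parser, agent, path):
    """Return True if agent may fetch path on the placeholder host."""
    return parser.can_fetch(agent, "http://example.com" + path)


if __name__ == "__main__":
    parser = build_parser(POLICY)
    for agent, path in [
        ("googlebot", "/cgi-bin/search.pl"),
        ("googlebot", "/index.html"),
        ("examplebot", "/index.html"),
        ("otherbot", "/cgi-bin/search.pl"),
    ]:
        print(agent, path, check(parser, agent, path))
```

An empty Disallow line, as in the last group, means nothing is off-limits for the bots it covers.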

Warning

As I’ve mentioned, the robots.txt mechanism relies on the honor system. By definition, it is a text file that can be read by anyone with a browser. Don’t rely on every bot honoring the request within a robots.txt file, and don’t use robots.txt in an attempt to protect sensitive information from being uncovered on your site by humans (this is a different issue from using it to avoid publishing sensitive information in honest search engine indexes like Google). In fact, someone trying to hack your site might specifically read your robots.txt file in an attempt to uncover site areas that you deem sensitive.

For more information about working with the robots.txt file, see the Web Robots FAQ. You can also find tools for managing and generating custom robots.txt files and robot meta tags (explained later) at http://www.rietta.com/robogen/ (an evaluation version is available for free download).

Meta Robot Tags

The Googlebot and many other web robots can be instructed not to index specific pages (rather than entire directories), not to follow links on a specific page, and to index but not cache a specific page, all via the HTML meta tag placed inside of the head tag.

Note

Google maintains a cache of documents it has indexed. The Google search results provide a link to the cached version in addition to the version on the Web. The cached version can be useful when the Web version has changed and also because the cached version highlights the search terms (so you can easily find them).

The meta tag used to block a robot has two attributes: name and content. The name attribute is the name of the bot you are excluding. To exclude all robots, you’d include the attribute name="robots" in the meta tag.

To exclude a specific robot, the robot’s identifier is used. The Googlebot’s identifier is googlebot, and it is excluded by using the attribute name="googlebot". You can find the entire database of registered and excludable robots and their identifiers (currently about 300) at http://www.robotstxt.org/db.html.

Note

The more than 300 robots in the official database are the tip of the iceberg. There are at least 200,000 robots and crawlers “in the wild.” Some of these software programs have malicious intent; all of them eat up valuable web bandwidth. For more information about wild (and rogue) robots, visit Bots vs. Browsers.

The possible values of the content attribute are shown in Table 4-1. You can use multiple attribute values, separated by commas, but you should not use contradictory attribute values together (such as content="follow, nofollow").

Table 4-1. Content attribute values and their meanings

Attribute value   Meaning
follow            Bot can follow links on the page
index             Bot can index the page
noarchive         Only works with the Googlebot; tells the Googlebot not to cache the page
nofollow          Bot should not follow links on the page
noindex           Bot should not index the page

For example, you can block Google from indexing a page, following links on a page, and caching the page using this meta tag:

<meta name="googlebot" content="noindex, nofollow, noarchive">

More generally, the following tag tells legitimate bots (including the Googlebot) not to index a page or follow any of the links on the page:

<meta name="robots" content="noindex, nofollow">
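
To illustrate how a compliant bot could read these tags, the following Python sketch collects the directives that apply to a given bot from a page's meta tags. The class RobotsMetaParser and the function robots_directives are hypothetical names, and the logic is an assumption about reasonable behavior, not a description of how the Googlebot itself works.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect content directives from meta tags aimed at one bot."""

    def __init__(self, bot_name="googlebot"):
        super().__init__()
        # A bot heeds the generic "robots" name plus its own identifier.
        self.names = {"robots", bot_name.lower()}
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() not in self.names:
            return
        # Directives are comma-separated, e.g. "noindex, nofollow".
        for value in attr.get("content", "").split(","):
            value = value.strip().lower()
            if value:
                self.directives.add(value)


def robots_directives(html, bot_name="googlebot"):
    """Return the set of directives in html that apply to bot_name."""
    parser = RobotsMetaParser(bot_name)
    parser.feed(html)
    return parser.directives
```

A crawler built on this could, for instance, skip indexing whenever "noindex" is in the returned set and skip link traversal whenever "nofollow" is.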

For more information about Google’s page-specific tags that exclude bots, and about the Googlebot in general, see http://www.google.com/bot.html.
