Exclude Robots and Spiders from Your Analysis

One of the major complaints about web server logfiles is that they are often littered with activity from nonhuman user agents (“robots” and “spiders”). While they are not necessarily bad, you need to exclude robots and spiders from your “human” analysis or risk getting dramatically skewed results.
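One common way to do this, though not necessarily the approach this hack goes on to describe, is to drop log lines whose user-agent string identifies a robot. The following is a minimal sketch under stated assumptions: it assumes the NCSA/Apache "combined" log format, and the names `ROBOT_HINTS`, `user_agent`, `is_robot`, and `human_lines` are hypothetical, chosen only for illustration.

```python
import re

# Assumption: NCSA/Apache "combined" log format, in which the user agent
# is the last double-quoted field on each line.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')

# Hypothetical, deliberately short list of substrings. A real list would be
# far longer and would need regular updating.
ROBOT_HINTS = ("bot", "crawler", "spider", "slurp")


def user_agent(line):
    """Return the user-agent field of a combined-format log line, or None."""
    match = USER_AGENT_RE.search(line)
    return match.group(1) if match else None


def is_robot(line):
    """Guess whether a log line came from a self-identified robot."""
    agent = user_agent(line)
    if agent is None:
        return False  # unparseable lines are kept; review them separately
    agent = agent.lower()
    return any(hint in agent for hint in ROBOT_HINTS)


def human_lines(lines):
    """Yield only the log lines that were not flagged as robot traffic."""
    return (line for line in lines if not is_robot(line))
```

A filter like this only catches robots that announce themselves. A robot that sends a browser-like user agent, or none that matches the list, will still be counted as human traffic, so treat the result as a first pass rather than a clean separation.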

Robots and spiders (also known as “crawlers” or “agents”) are computer programs that scour the Web to collect information or take measurements. There are thousands of robots and spiders in use on the Web at any time, and their numbers increase every day. Common examples include:

  • Search engine robots that crawl the pages of sites across the Web and feed the information they collect into the indexes of search engines such as Google and Yahoo!, or of industry-specific engines that search for information such as airfares, flight schedules, or product prices.

  • Competitive intelligence robots that spider a site to collect competitive analysis data. For instance, a competitor may build robots that regularly gather information from your online product catalog, either to decide how to price their own products or to make product and price comparisons in their marketing.

  • Account aggregator robots that regularly collect data from online accounts (usually with the permission of the account owner) and feed that data to web-based “account consolidators.” Users of such account management sites benefit from having current information from their financial accounts, loyal program ...
