Chapter 5. Data Harvesting

This category covers a range of traffic types that access the publicly available information on your website and capture that data for use elsewhere. Typically this involves requesting many pages and extracting the relevant data using text pattern matching.
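To make the idea of text pattern matching concrete, here is a minimal sketch of how a harvesting bot might pull structured data out of page markup. The page contents, the `PAGES` dictionary, the CSS class names, and the `harvest` function are all hypothetical, invented for illustration. A real bot would fetch pages over HTTP and might well use an HTML parser rather than regular expressions.

```python
import re

# Hypothetical page markup keyed by URL path. A real bot would fetch
# these pages over HTTP; they are inlined so the sketch is self-contained.
PAGES = {
    "/products/1": '<h1 class="name">Blue Widget</h1><span class="price">$9.99</span>',
    "/products/2": '<h1 class="name">Red Widget</h1><span class="price">$12.50</span>',
}

# Patterns assume the (made-up) markup above: a product name in an <h1>
# and a dollar price in a <span>.
NAME_RE = re.compile(r'<h1 class="name">(.*?)</h1>')
PRICE_RE = re.compile(r'<span class="price">\$(\d+\.\d{2})</span>')


def harvest(pages):
    """Return (path, name, price) for every page where both patterns match."""
    rows = []
    for path, html in pages.items():
        name = NAME_RE.search(html)
        price = PRICE_RE.search(html)
        if name and price:
            rows.append((path, name.group(1), float(price.group(1))))
    return rows


if __name__ == "__main__":
    for row in harvest(PAGES):
        print(row)
```

Because the patterns are tied to the exact markup, even a small change to the page template would stop them matching, which is one reason this kind of extraction tends to be brittle.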

Let’s now look at some specific examples of data harvesting bots.

Search Engine Spiders

The most common form of data harvesting, and the one without which the internet as we know it today wouldn’t function, is the search engine spider. The most common of these is, of course, Googlebot, but there are many others from a range of global, regional, and specialist search engines. These bots will usually enter your site via the homepage or via a deep link from another site and then follow all active ...
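As a rough illustration of how a spider moves through a site, the sketch below starts from an entry page and visits every page it can reach, assuming the bot follows hyperlinks breadth-first. This is not how any particular search engine works. The in-memory `SITE` dictionary stands in for real HTTP requests, and `LinkParser` and `crawl` are hypothetical names chosen for this example.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.links.append(value)


# Hypothetical site: path -> HTML. "/missing" is linked but does not exist.
SITE = {
    "/": '<a href="/about">About</a> <a href="/products">Products</a>',
    "/about": '<a href="/">Home</a>',
    "/products": '<a href="/products/1">Widget</a> <a href="/missing">Old</a>',
    "/products/1": '<a href="/products">Back</a>',
}


def crawl(site, start="/"):
    """Visit pages breadth-first from `start`, returning them in visit order."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        path = queue.popleft()
        html = site.get(path)
        if html is None:
            continue  # dead link: nothing to fetch or parse
        visited.append(path)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


if __name__ == "__main__":
    print(crawl(SITE))                       # enter via the homepage
    print(crawl(SITE, start="/products/1"))  # enter via a deep link
```

The `start` parameter models the two entry routes mentioned above: the homepage or a deep link from another site. The `seen` set keeps the spider from looping forever on pages that link back to each other.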

