Being good netizens

While our end goal is to crawl and index the entire internet, the truth of the matter is that the links we retrieve and index point to content that belongs to someone else. Those third parties may object to us indexing some or all of the links to the domains under their control.

Fortunately, there is a standardized way for webmasters to notify crawlers not only about which links they are allowed to crawl and which they are not, but also to dictate an acceptable crawl rate so as not to place a high load on the remote host. This is achieved by authoring a robots.txt file and placing it at the root of each domain. The file contains a set of directives like the following:

  • User-Agent: The name of the crawler (or the * wildcard) that the block of directives which follows applies to.
  • Disallow: A path prefix that the crawler must not fetch.
  • Allow: A path prefix that the crawler may fetch even if a broader Disallow rule would otherwise exclude it.
  • Crawl-Delay: The minimum number of seconds the crawler should wait between successive requests to the host.
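
To make this concrete, here is a small, hypothetical robots.txt file; the agent name and paths are invented for illustration. Note that Allow and Crawl-Delay are widely supported extensions rather than part of the original specification, so not every crawler interprets them:

  User-agent: *
  Crawl-delay: 10
  Disallow: /private/
  Allow: /private/annual-report.html

  User-agent: BadBot
  Disallow: /

This file asks well-behaved crawlers to wait ten seconds between requests and to stay out of /private/ (except for the single annual-report.html page, whose more specific Allow rule wins), while a crawler identifying itself as BadBot is barred from the site entirely.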

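As a sketch of how a Go crawler might enforce these rules before fetching a page, the following example uses the third-party github.com/temoto/robotstxt parser. The fetchRobots helper, the host, the path, and the "Linkrunner" agent name are all assumptions made for this example:

  package main

  import (
      "fmt"
      "net/http"
      "time"

      "github.com/temoto/robotstxt"
  )

  // fetchRobots downloads and parses the robots.txt file for a host.
  // robotstxt.FromResponse also takes the HTTP status code into
  // account (for example, a 404 means everything is allowed).
  func fetchRobots(host string) (*robotstxt.RobotsData, error) {
      resp, err := http.Get("https://" + host + "/robots.txt")
      if err != nil {
          return nil, err
      }
      defer resp.Body.Close()
      return robotstxt.FromResponse(resp)
  }

  func main() {
      robots, err := fetchRobots("example.com")
      if err != nil {
          // If robots.txt cannot be fetched at all, a conservative
          // crawler backs off rather than assuming it may proceed.
          fmt.Println("could not fetch robots.txt:", err)
          return
      }

      // Look up the directive group that applies to our crawler's
      // user-agent ("Linkrunner" is a made-up name for this example).
      group := robots.FindGroup("Linkrunner")

      if !group.Test("/private/report.html") {
          fmt.Println("path is disallowed; skipping")
          return
      }

      // Honor the host's requested delay between successive requests.
      if group.CrawlDelay > 0 {
          time.Sleep(group.CrawlDelay)
      }
      fmt.Println("OK to crawl")
  }

In a real crawler the parsed rules would be cached per host, and the crawl delay would be enforced by the work scheduler rather than a blocking sleep, but the lookup logic would follow the same shape.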