Background research
Before diving into crawling a website, we should develop an understanding of the scale and structure of our target. The website itself can help us through its robots.txt and Sitemap files, and external tools such as Google Search and WHOIS can provide further details.
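As a quick sketch of the WHOIS step, a domain's registration details can be queried from Python, assuming the third-party python-whois package is installed (pip install python-whois); the domain below is a placeholder for your own target.

```python
# Minimal WHOIS lookup sketch, assuming the python-whois package
# (pip install python-whois). The domain is a placeholder.
import whois

details = whois.whois('example.com')
print(details.registrar)     # who the domain is registered with
print(details.name_servers)  # name servers can hint at the hosting provider
```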
Checking robots.txt
Most websites define a robots.txt file to let crawlers know of any restrictions on crawling their site. These restrictions are only a suggestion, but good web citizens will follow them. The robots.txt file is a valuable resource to check before crawling, both to minimize the chance of being blocked and to discover hints about a website's structure. More information about the robots.txt protocol is available ...
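As a minimal sketch of how this check can be automated, the standard library's urllib.robotparser module can download and parse a robots.txt file and report whether a given user agent is allowed to fetch a URL; the domain and user agent name below are placeholders.

```python
from urllib import robotparser

# Parse the target site's robots.txt; substitute your own target domain.
rp = robotparser.RobotFileParser()
rp.set_url('http://example.com/robots.txt')
rp.read()

# Ask whether a specific user agent may crawl a specific URL.
print(rp.can_fetch('MyCrawler', 'http://example.com/index.html'))

# Report the Crawl-delay value for that agent, if the file sets one.
print(rp.crawl_delay('MyCrawler'))
```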