Crawling with delays
Fast scraping is considered bad practice. Continuously pounding a website for pages can burn up its CPU and bandwidth, and a robust site will detect this and block your IP address. If you are unlucky, you might even get a nasty letter for violating the terms of service!
How you delay requests depends on how your crawler is implemented. If you are using Scrapy, you can set a parameter (the DOWNLOAD_DELAY setting) that tells the crawler how long to wait between requests. In a simple crawler that just processes a list of URLs sequentially, you can insert a time.sleep call between requests.
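As a rough illustration of the simple sequential case, the sketch below pauses between requests using time.sleep from the Python standard library. The function names (fetch, crawl), the two-second default delay, and the fetch_page and sleep parameters are assumptions made for this example, not part of the original recipe; the injectable parameters are only there so the loop can be exercised without touching the network.

```python
import time
from urllib.request import urlopen


def fetch(url):
    """Download a page and return its raw bytes."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def crawl(urls, delay_seconds=2.0, fetch_page=fetch, sleep=time.sleep):
    """Fetch each URL in order, pausing between consecutive requests.

    delay_seconds is an illustrative default; pick a value that suits
    the site you are crawling.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        pages[url] = fetch_page(url)
    return pages


# Example call (placeholder URLs, not from the original text):
# pages = crawl(["https://example.com/", "https://example.com/about"])
```

A fixed delay is the simplest option. For a Scrapy project, the equivalent would be a line such as DOWNLOAD_DELAY = 2 in the project's settings.py, with the value again chosen to suit the target site.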
Things can get more complicated if you have implemented a distributed cluster of crawlers that spread the load of page requests, such as using ...