In this chapter, we will discuss a crawling framework called Scrapy and walk through the steps needed to crawl a website and upload the crawl data to an S3 bucket.
We will also cover practical workarounds for common anti-bot measures, such as proxy IP and user-agent rotation and CAPTCHA-solving services.
Scrapy
Scrapy is a popular, production-ready web crawling framework for Python. It includes the features of a good web crawler that we talked about in Chapter 2, such as a robots.txt parser, crawl delay, and Selenium support, right out of ...
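As a rough illustration of how the robots.txt handling and crawl delay mentioned above are usually switched on, here is a minimal sketch of a Scrapy project's settings.py. The setting names (ROBOTSTXT_OBEY, DOWNLOAD_DELAY, AUTOTHROTTLE_ENABLED, USER_AGENT, FEEDS) are standard Scrapy settings, but every value is an assumption made for illustration: the bot name, the contact URL, and the bucket name are hypothetical and do not come from this chapter.

```python
# settings.py (sketch): all values below are illustrative assumptions.

BOT_NAME = "example_crawler"  # hypothetical project name

# Fetch and respect each site's robots.txt rules before requesting pages.
ROBOTSTXT_OBEY = True

# Seconds to wait between consecutive requests to the same website.
DOWNLOAD_DELAY = 2.0

# Let Scrapy adjust the delay based on how quickly the server responds.
AUTOTHROTTLE_ENABLED = True

# Identify the crawler to site owners; this string is a placeholder.
USER_AGENT = "example_crawler (+https://example.com/contact)"

# Export scraped items as JSON Lines to a hypothetical S3 bucket.
# %(name)s and %(time)s are filled in by Scrapy with the spider name and
# a timestamp. Writing to S3 also requires AWS credentials and an AWS
# client library (typically boto3) to be available.
FEEDS = {
    "s3://example-bucket/crawls/%(name)s/%(time)s.jsonl": {"format": "jsonlines"},
}
```

With settings along these lines, a spider in the same project would skip URLs disallowed by robots.txt and pace its requests without any extra code in the spider itself. The right delay depends on the target site, so treat the value here as a starting point only.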