J. M. Patel, Getting Structured Data from the Internet, https://doi.org/10.1007/978-1-4842-6576-5_8

8. Advanced Web Crawlers

Jay M. Patel

Specrom Analytics, Ahmedabad, India

In this chapter, we will discuss a crawling framework called Scrapy and go through the steps necessary to crawl and upload the web crawl data to an S3 bucket.

We will also cover practical workarounds for common antibot measures, including proxy IP rotation, user-agent rotation, and CAPTCHA-solving services.
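As a rough illustration of what proxy IP and user-agent rotation mean in practice, the sketch below cycles through a small pool of each and pairs them with the URLs to be fetched. It is a minimal sketch using only the Python standard library, not the chapter's own implementation: the user-agent strings, the proxy addresses, and the helper name `build_request_plans` are all hypothetical placeholders, and no network request is made.

```python
import itertools

# Hypothetical pools; in a real crawl these would come from a maintained
# user-agent list and a proxy provider.
USER_AGENTS = [
    "ExampleBrowser/1.0 (illustrative user agent A)",
    "ExampleBrowser/2.0 (illustrative user agent B)",
]
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]


def build_request_plans(urls):
    """Pair each URL with the next user agent and proxy in round-robin order."""
    ua_cycle = itertools.cycle(USER_AGENTS)
    proxy_cycle = itertools.cycle(PROXIES)
    plans = []
    for url in urls:
        proxy = next(proxy_cycle)
        plans.append(
            {
                "url": url,
                "headers": {"User-Agent": next(ua_cycle)},
                # Same proxy for both schemes in this sketch.
                "proxies": {"http": proxy, "https": proxy},
            }
        )
    return plans


plans = build_request_plans(
    [f"https://example.com/page/{i}" for i in range(4)]
)
for plan in plans:
    print(plan["url"], plan["headers"]["User-Agent"], plan["proxies"]["http"])
```

Each resulting dictionary could then be handed to whichever HTTP client the crawler uses. Round-robin rotation is chosen here only because it is easy to follow; random selection is an equally common choice.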

Scrapy

Scrapy is a very popular production-ready web crawling framework in Python; it contains all the features of a good web crawler such as robots.txt parser, crawl delay, and Selenium support that we talked about in Chapter 2 right out of ...
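To make the robots.txt parsing and crawl delay features mentioned above more concrete, here is a minimal sketch of what a crawler does with a robots.txt file. It uses Python's standard-library `urllib.robotparser` rather than Scrapy itself, so it only illustrates the idea and says nothing about how Scrapy implements it. The robots.txt content, the `example-bot` user agent, and the `example.com` URLs are made-up assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A polite crawler checks each URL before fetching it.
allowed = parser.can_fetch("example-bot", "https://example.com/index.html")
blocked = parser.can_fetch("example-bot", "https://example.com/private/data.html")

# The requested delay (in seconds) between successive requests, if any.
delay = parser.crawl_delay("example-bot")

print(allowed, blocked, delay)
```

A crawler would skip any URL for which `can_fetch` returns `False` and wait at least `delay` seconds between requests to the same site.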

Getting Structured Data from the Internet: Running Web Crawlers/Scrapers on a Big Data Production Scale by Jay M. Patel
