© Jay M. Patel 2020
J. M. Patel, Getting Structured Data from the Internet, https://doi.org/10.1007/978-1-4842-6576-5_7

7. Web Crawl Processing on Big Data Scale

Jay M. Patel
Specrom Analytics, Ahmedabad, India

In this chapter, we’ll learn how to process web crawl data on a big data scale with a distributed computing architecture built on Amazon Web Services (AWS).

There are distinct advantages to processing the data where it is stored: we do not waste server time downloading it first, a step whose speed is limited by the bandwidth of your Internet connection.

We will also learn about Amazon Athena, which can be used to query data located in S3 using SQL without setting up a server.
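To give a rough sense of what querying S3 data through Athena can look like, here is a minimal sketch using the boto3 Athena client. It is not taken from the chapter: the helper names `build_athena_request` and `run_athena_query`, the default region, and the polling interval are assumptions made for illustration, and the database name and S3 output location are whatever you have set up in your own AWS account.

```python
import time


def build_athena_request(sql, database, output_location):
    """Assemble the keyword arguments for Athena's StartQueryExecution call.

    Hypothetical helper for illustration; the argument values are supplied
    by the caller and are not prescribed by the chapter.
    """
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results as files under this S3 prefix.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_athena_query(sql, database, output_location,
                     region="us-east-1", poll_seconds=2):
    """Submit a query, wait until it finishes, and return the result rows.

    Requires boto3 and valid AWS credentials; the region default is an
    assumption, so adjust it to wherever your data lives.
    """
    import boto3  # imported lazily so the request builder works without it

    client = boto3.client("athena", region_name=region)
    request = build_athena_request(sql, database, output_location)
    query_id = client.start_query_execution(**request)["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a terminal state.
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")

    # Returns only the first page of results; the first row holds the headers.
    return client.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```

Calling `run_athena_query` with a SQL string, the name of a database you have registered in Athena, and an S3 prefix you own would submit the query and return the first page of rows. No server needs to be provisioned; Athena bills by the amount of data each query scans.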

The overall goal of this chapter is to get you to a stage where you ...
