J. M. Patel, Getting Structured Data from the Internet, https://doi.org/10.1007/978-1-4842-6576-5_7

7. Web Crawl Processing on Big Data Scale

Jay M. Patel¹

(1) Specrom Analytics, Ahmedabad, India

In this chapter, we’ll learn how to process web crawl data at big data scale with a distributed computing architecture built on Amazon Web Services (AWS).

There are distinct advantages to processing the data where it is stored: we do not waste server time downloading the data, a step whose speed is limited by your Internet connection.

We will also learn about Amazon Athena, which can be used to query data stored in S3 using SQL without setting up a server.
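As a rough illustration of what a serverless SQL query against S3 data can look like, the following minimal sketch builds the arguments for Athena's StartQueryExecution call and, optionally, submits them through boto3. The database name, table name, bucket, and region are hypothetical placeholders and do not come from this chapter. Submitting the query also assumes that the third-party boto3 package is installed and that valid AWS credentials are configured.

```python
def build_athena_request(database, table, output_location, limit=10):
    """Build keyword arguments for Athena's StartQueryExecution API."""
    # A simple SELECT over a table whose underlying files live in S3.
    query = f'SELECT * FROM "{table}" LIMIT {int(limit)}'
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request, region_name="us-east-1"):
    """Submit the query; needs boto3 and AWS credentials (assumed, not shown)."""
    import boto3  # imported here so the builder above works without boto3

    client = boto3.client("athena", region_name=region_name)
    response = client.start_query_execution(**request)
    return response["QueryExecutionId"]


# All names below are hypothetical placeholders for illustration only.
request = build_athena_request(
    database="webcrawl_db",
    table="crawl_index",
    output_location="s3://example-bucket/athena-results/",
)
print(request["QueryString"])
```

Calling `run_query(request)` only returns an execution ID. Athena runs queries asynchronously, so the results have to be fetched in a separate step once the query has finished.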

The overall goal of this chapter is to get you to a stage where you ...

Getting Structured Data from the Internet: Running Web Crawlers/Scrapers on a Big Data Production Scale by Jay M. Patel
