In this chapter, we’ll learn how to process web crawl data at big data scale with a distributed computing architecture on Amazon Web Services (AWS).
There are distinct advantages to processing the data where it is stored: we do not waste server time downloading it, a step whose speed is limited by your Internet connection.
We will also learn about Amazon Athena, which can be used to query data stored in S3 using SQL without setting up a server.
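To make the serverless idea concrete, here is a minimal sketch of submitting a SQL query to Athena from Python with the boto3 library. The database name `webcrawl_db`, the table `crawl_index`, and the results bucket `s3://my-athena-results/queries/` are hypothetical placeholders, not names from this chapter, and `run_query` assumes boto3 is installed and AWS credentials and a region are already configured.

```python
import time


def build_athena_request(sql, database, output_location):
    """Assemble the keyword arguments for Athena's start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results as files to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request, poll_seconds=2.0):
    """Submit the query and wait for it to finish (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without it

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(**request)["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)


# Hypothetical names: replace with your own database, table, and bucket.
request = build_athena_request(
    sql="SELECT url FROM crawl_index LIMIT 10",
    database="webcrawl_db",
    output_location="s3://my-athena-results/queries/",
)
```

Calling `run_query(request)` would submit the query. Note that no server is provisioned anywhere in this sketch: the only infrastructure involved is the S3 data itself and a bucket for the results.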
The overall goal of this chapter is to get you to a stage where you ...