5
Data Preparation in the Cloud
In this chapter, we will learn how to set up data preparation in the cloud using various AWS services. Because extract, transform, and load (ETL) operations are central to data preparation, we will take a closer look at setting up and scheduling ETL jobs in a cost-efficient manner. We will cover four setups: ETL running on a single-node EC2 instance, ETL running on an EMR cluster, and ETL jobs built with Glue and with SageMaker. This chapter will also introduce Apache Spark, one of the most popular frameworks for ETL. By the end of this chapter, you will be able to weigh the advantages of each setup and select the right set of tools for your project.