Simplify Big Data Analytics with Amazon EMR

by Sakti Mishra
March 2022
Beginner to intermediate
430 pages
9h 24m
English
Packt Publishing

Chapter 11: Implementing UPSERT on S3 Data Lake with Apache Spark and Apache Hudi

In the previous two chapters, we learned how to implement a batch ETL pipeline with Amazon EMR and real-time streaming with Spark Streaming. In this chapter, we will learn how to implement UPSERT or merge on your Amazon S3 data lake using the Apache Hudi framework integrated with Apache Spark.

Objects in Amazon S3 are immutable, which means you cannot update the content of an object or file in place. Instead, you have to read its content, modify it, and write a new object. As data lake and lake house architectures become popular, organizations look for update capability on Amazon S3 and other object stores. Frameworks such as Apache Hudi, Apache ...
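
To make the read, modify, and write cycle concrete, here is a minimal Python sketch of an UPSERT against whole-object storage. It is not how Apache Hudi implements the operation, and it does not call Amazon S3: the `store` dictionary, the `put_object` and `get_object` helpers, the `upsert_records` function, the `id` record key, and the `orders/part-0.json` object key are all hypothetical names introduced for illustration.

```python
import json

# Hypothetical in-memory stand-in for an object store: key -> immutable bytes.
# A real S3 client would take the place of these two helpers.
store = {}


def put_object(key, body):
    """Write a whole object; an existing key is replaced, never patched."""
    store[key] = bytes(body)


def get_object(key):
    """Read a whole object back."""
    return store[key]


def upsert_records(key, changes, record_key="id"):
    """Read the object, merge changes by record key, write a new object."""
    existing = json.loads(get_object(key)) if key in store else []
    merged = {row[record_key]: row for row in existing}
    for row in changes:
        # Update the row if its key already exists, insert it otherwise.
        merged[row[record_key]] = row
    new_rows = sorted(merged.values(), key=lambda r: r[record_key])
    put_object(key, json.dumps(new_rows).encode("utf-8"))
    return new_rows


# Seed one object, then apply one update (id 2) and one insert (id 3).
put_object(
    "orders/part-0.json",
    json.dumps([{"id": 1, "status": "new"}, {"id": 2, "status": "new"}]).encode("utf-8"),
)
result = upsert_records(
    "orders/part-0.json",
    [{"id": 2, "status": "shipped"}, {"id": 3, "status": "new"}],
)
```

Even a one-row change rewrites the entire object in this sketch, which is the cost that motivates looking for a framework to manage updates on an object store.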

Publisher Resources

ISBN: 9781801071079