Chapter 11: Implementing UPSERT on S3 Data Lake with Apache Spark and Apache Hudi

In the previous two chapters, we learned how to implement a batch ETL pipeline with Amazon EMR and real-time streaming with Spark Streaming. In this chapter, we will learn how to implement UPSERT or merge on your Amazon S3 data lake using the Apache Hudi framework integrated with Apache Spark.

Objects in Amazon S3 are immutable, which means you cannot update the content of an object or file in place. Instead, you have to read its content, modify it, and write a new object that replaces the old one. As data lake and lake house architectures become popular, organizations increasingly look for update capability on Amazon S3 and other object stores. Frameworks such as Apache Hudi, Apache ...
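The read, modify, and write cycle described above can be sketched in a few lines of Python. This is a minimal illustration only: it uses an in-memory dictionary as a stand-in for an object store rather than the real Amazon S3 API, and the names `store`, `put_object`, `get_object`, `upsert`, and the `customers.json` key are all hypothetical. It is not how Apache Hudi works internally; it only shows why an update to a single record means rewriting a whole object.

```python
import json

# Hypothetical in-memory stand-in for an object store: key -> object bytes.
# Objects are never edited in place; a write always stores a complete new body.
store = {}

def put_object(key, records):
    """Write a complete new object under the given key."""
    store[key] = json.dumps(records).encode("utf-8")

def get_object(key):
    """Read and decode the full content of an object."""
    return json.loads(store[key].decode("utf-8"))

def upsert(key, changes, id_field="id"):
    """Read the whole object, merge changes by id, and write a full replacement."""
    existing = get_object(key) if key in store else []
    merged = {row[id_field]: row for row in existing}
    for row in changes:
        merged[row[id_field]] = row  # update if the id exists, insert if it does not
    put_object(key, list(merged.values()))

put_object("customers.json", [{"id": 1, "city": "Austin"}, {"id": 2, "city": "Boston"}])
upsert("customers.json", [{"id": 2, "city": "Denver"}, {"id": 3, "city": "Miami"}])
result = get_object("customers.json")
```

After the call to `upsert`, the record with `id` 2 is updated, the record with `id` 3 is inserted, and the record with `id` 1 is carried over unchanged, yet the entire object had to be rewritten to achieve this.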
