Apache Hudi: The Definitive Guide
by Shiyan Xu, Prashant Wason, Bhavani Sudha Saktheeswaran, Rebecca Bilbro
Chapter 2. Getting Started with Hudi
In Chapter 1, we explored the foundational concepts that make Apache Hudi a compelling choice for modern data architectures. We saw how data lakes have evolved into lakehouses, discussed Hudi’s position in this ecosystem, and reviewed its high-level architecture, the Hudi stack, and key feature highlights. While these concepts provide essential context, the best way to truly understand Hudi’s capabilities is through hands-on experience.
This chapter shifts from theory to practice. Rather than simply listing features, we’ll demonstrate how Hudi tables behave under different configurations and operations, allowing you to observe firsthand how the underlying table layout evolves as you perform common lakehouse operations.
We’ll start with a simple purchase tracking table and use Apache Spark to perform typical Create, Read, Update, and Delete (CRUD) operations. As we execute these commands, we’ll examine the resulting changes to the table’s physical structure, helping you develop an intuitive understanding of how Hudi organizes and manages your data behind the scenes.
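To give a concrete sense of what follows, here is a minimal sketch of the kind of table we will work with. It assumes a Spark session started with the Hudi Spark SQL extensions enabled; the table name purchases and its columns are illustrative placeholders, not the exact schema used later in the chapter.

    -- A minimal sketch, assuming spark-sql was launched with the Hudi
    -- Spark SQL extensions enabled. The purchases schema is illustrative.
    CREATE TABLE purchases (
      purchase_id STRING,
      customer_id STRING,
      amount      DOUBLE,
      ts          BIGINT,
      country     STRING
    ) USING hudi
    TBLPROPERTIES (
      type = 'cow',                -- Copy-on-Write, the default table type
      primaryKey = 'purchase_id',  -- record key identifying each record
      preCombineField = 'ts'       -- breaks ties between versions on upsert
    )
    PARTITIONED BY (country);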
The chapter is organized into three sections that build on one another. “Basic Operations” creates a Hudi table using the default Copy-on-Write (COW) table type and explores fundamental CRUD operations. As we execute the SQL examples, we’ll examine how each operation affects the table layout and learn core concepts such as record keys, partitioning, and the internals of the timeline. ...
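As a preview of “Basic Operations,” the sketch below shows the shape of the CRUD statements we will run against the hypothetical purchases table defined above; the specific rows and predicates are placeholders.

    -- Sketch of the CRUD statements explored in "Basic Operations";
    -- the values and predicates are placeholders.
    INSERT INTO purchases
    VALUES ('p-001', 'c-042', 19.99, 1718000000, 'US');

    UPDATE purchases
    SET amount = 24.99
    WHERE purchase_id = 'p-001';

    DELETE FROM purchases
    WHERE purchase_id = 'p-001';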