Apache Hudi: The Definitive Guide
by Shiyan Xu, Prashant Wason, Bhavani Sudha Saktheeswaran, Rebecca Bilbro
Foreword
When we began building Apache Hudi in 2016, our goal was clear but ambitious: bring transactional database capabilities to the data lake. At the time, this idea sounded counterintuitive—even controversial. Data lakes were, by design, append-only file stores optimized for high throughput and scale, not fine-grained updates or consistent reads. At Uber, where Hudi was first conceived, our data volumes doubled every few months, and the traditional data warehouse could no longer keep up. Streaming systems were too expensive and lacked the capabilities we needed.
We needed a new kind of data platform—one that could scale like a data lake, provide transactional capabilities like a data warehouse, and deliver data incrementally like streaming systems.
That idea became Apache Hudi, and the first data lakehouse was born, even before the term was coined.
Hudi introduced several foundational concepts that have since become synonymous with the modern lakehouse architecture: incremental change capture, write-optimized storage formats like Merge-on-Read, record-level upserts, and background table services for compaction, clustering, and cleaning. Systems like Delta Lake and Apache Iceberg, which followed Hudi, adopted many of these principles and extended the conversation around openness and interoperability.
At the time, these ideas were radical. Today, they’re foundational.
In many ways, ...