Apache Hudi: The Definitive Guide

by Shiyan Xu, Prashant Wason, Bhavani Sudha Saktheeswaran, Rebecca Bilbro
October 2025
Intermediate to advanced
290 pages
7h 43m
English
O'Reilly Media, Inc.
Content preview from Apache Hudi: The Definitive Guide

Foreword

When we began building Apache Hudi in 2016, our goal was clear but ambitious: bring transactional database capabilities to the data lake. At the time, this idea sounded counterintuitive—even controversial. Data lakes were, by design, append-only file stores optimized for high throughput and scale, not fine-grained updates or consistent reads. At Uber, where Hudi was first conceived, our data volumes doubled every few months, and the traditional data warehouse could no longer keep up. Streaming systems were too expensive and lacked the capabilities we needed.

We needed a new kind of data platform—one that could scale like a data lake, provide transactional capabilities like a data warehouse, and deliver data incrementally like streaming systems.

That idea became Apache Hudi, and the first data lakehouse was born, even before the term was coined.

Hudi introduced several foundational concepts that have since become synonymous with the modern lakehouse architecture: incremental change capture, write-optimized storage formats like Merge-on-Read, record-level upserts, and background table services for compaction, clustering, and cleaning. These ideas were novel at the time but have since become core pillars across the ecosystem. Systems like Delta Lake and Apache Iceberg, which followed Hudi, adopted many of these principles and extended the conversation around openness and interoperability.

At the time, these ideas were radical. Today, they’re foundational.

In many ways, ...

Publisher Resources

ISBN: 9781098173821