Apache Hudi: The Definitive Guide
by Shiyan Xu, Prashant Wason, Bhavani Sudha Saktheeswaran, Rebecca Bilbro
Foreword
When we began building Apache Hudi in 2016, our goal was clear but ambitious: bring transactional database capabilities to the data lake. At the time, this idea sounded counterintuitive—even controversial. Data lakes were, by design, append-only file stores optimized for high throughput and scale, not fine-grained updates or consistent reads. At Uber, where Hudi was first conceived, our data volumes doubled every few months, and the traditional data warehouse could no longer keep up. Streaming systems were too expensive and lacked the capabilities we needed.
We needed a new kind of data platform—one that could scale like a data lake, provide transactional capabilities like a data warehouse, and deliver data incrementally like streaming systems.
That idea became Apache Hudi, and the first data lakehouse was born, even before the term was coined.
Hudi introduced several foundational concepts that have since become synonymous with the modern lakehouse architecture: incremental change capture, write-optimized storage formats like Merge-on-Read, record-level upserts, and background table services for compaction, clustering, and cleaning. Systems like Delta Lake and Apache Iceberg, which followed Hudi, adopted many of these principles and extended the conversation around openness and interoperability.
At the time, these ideas were radical. Today, they’re foundational.
In many ways, ...