Apache Hudi: The Definitive Guide

by Shiyan Xu, Prashant Wason, Bhavani Sudha Saktheeswaran, Rebecca Bilbro

October 2025

Intermediate to advanced

290 pages

7h 43m

English

O'Reilly Media, Inc.

Why This Book, and Why Now; Who This Book Is For; The Technology and Its Moment; What’s in This Book; How to Use This Book; Conventions Used in This Book; O’Reilly Online Learning; How to Contact Us; Acknowledgments
The Evolution of Data Management Architectures; The Rise of Data Lakehouses; Uber’s “Transactional Data Lake” Problem; What Is Hudi?; The Hudi Stack; Native Table Format; Pluggable Table Format; Storage Engine; Programming API; User Access; Shared Platform Components; Hudi in the Real World; Summary
Basic Operations; Create the Table; Insert, Update, Delete, and Fetch Records; Choose a Table Type; Create a Merge-on-Read Table; MOR Table’s Layout After Writes; Copy-on-Write Versus Merge-on-Read; Advanced Usage; Create Table As Select; Merge Source Data into the Table; Update and Delete Using Nonrecord Key Fields; Time Travel Query; Incremental Query; Summary
Breaking Down the Write Flow; Start Commit; Prepare Records; Partition Records; Write to Storage; Commit Changes; Summarize the Upsert Flow; Exploring Write Operations; Define Table Properties; Use INSERT INTO; Perform Partial Merge with MERGE INTO; Perform Deletion; Overwrite Partition or Table; Highlighting Noteworthy Features; Key Generators; Merge Modes; Schema Evolution on Write; Bootstrapping; Summary
Integrating with Query Engines; Query Lifecycle; Data Catalog; Hudi Integration; Exploring Query Types; Snapshot Query; Time Travel Query; Incremental Query: The Latest-State Mode; Incremental Query: The Change Data Capture Mode; Highlighting Noteworthy Features; Streaming Read; Schema Evolution on Read; Read Using Rust or Python; Summary
Overview of the Indexes in Hudi; Index Acceleration for Writes; General-Purpose Multimodal Indexing; Writer-Side Indexes; Comparison of Writer Indexing Choices; Index Acceleration for Reads; Data Skipping; Equality Matching; Indexing on Expressions; Build the Right Indexes; Summary
Table Service Overview; Deployment Mode: Inline; Deployment Mode: Async Execution; Deployment Mode: Standalone; Choosing a Suitable Mode; Compaction; Schedule Compaction; Execute Compaction; Clustering; Schedule Clustering; Execute Clustering; Layout Optimization Strategies; Clustering Versus Compaction; Cleaning; Schedule Cleaning; Execute Cleaning; Indexing; Summary
Why Concurrency Control Is Harder in Data Lakehouses; Concurrency Control Techniques; Multiwriter Scenarios; Why Multiwriters Are Necessary; Multiwriter Scenarios for OCC; Multiwriter Scenarios for NBCC and MVCC; The Simple Default: Single Writer with Table Services; How Hudi Handles Concurrency Control; The Foundations of Hudi’s Concurrency Control; The Three-Step Commit Process; Conflict Detection and Resolution; Locking Mechanisms; Challenges in Multiwriter Systems; Using Multiwriter Support in Hudi; Enabling Multiwriter Support; Configuring the Locking Mechanism; Multiwriters Using Hudi Streamer; Multiwriters Using Spark Data Source Writer; Single Writer and Multiple Table Services; Disabling Multiwriter Support; Tips and Best Practices; Implement Partitioning and File Grouping; Enable Early Conflict Detection; Optimize Locking Mechanisms; Run Asynchronous Table Services; Reduce Write Conflicts and Wasted Resources; Prevent Data Duplication When Using Multiple Writers; Summary
Alcubierre’s Data Silo Woes; Data Quality Assurance and Deduplication; Heterogeneous Data and Schema Evolution; Data Management, Localization, and Consistency; Problem Recap; Lakehouse Architecture to the Rescue; What Is Hudi Streamer?; Getting Started with Hudi Streamer; Ingesting Data from S3; Ingesting Data from Kafka; Ingesting Data from RDBMS; Hudi Streamer in Action; Preparing the Upstream Source; Setting Up Hudi Streamer; Unlocking the Power of Analytics; Exploring the Hudi Streamer Options; General Options; Source Options; Operational Options; Summary

Operating with Ease; Getting to Know the CLI; Performing Table Operations; Integrating into the Platform; Triggering Post-Commit Callbacks; Wiring Up Monitoring Systems; Syncing with Catalogs; Performance Tuning; Storage Layout Tuning; Write Performance Tuning; Read Performance Tuning; Table Services Tuning; Summary
Architecture Overview; RetailMax Corp: A Real-World Lakehouse Scenario; Implementing Medallion Architecture with Hudi; Configuring RetailMax’s Hudi Tables; Record Keys; Ordering Field; Partitioning; Table Types; Bronze Layer: Ingesting Upstream Data; Setting Up Upstream Data Sources; Streaming Mutable, Transactional Data with Debezium, Flink, and Hudi; Ingesting Application Event Streams with Hudi Kafka Connect Sink; Silver Layer: Creating Derived Datasets; Goals of the Silver Layer for RetailMax; Streaming-Based Transformations with Hudi Streamer; Batch and Incremental Transformations with Spark SQL; Maintaining Data Quality and Consistency in the Silver Layer; Gold Layer: Querying the Lakehouse for Insights; Interactive Analytics with Trino; Batch Analytics and Reporting with Spark SQL; Advanced Querying: Time Travel and Point-in-Time Analysis; Business Layer: AI-Driven Insights for RetailMax; Preparing Data for AI/Machine Learning in the Gold Layer; Building a Knowledge Base for LLM-Powered Applications with Ray and Hudi; Operationalizing and Optimizing the Hudi Lakehouse; Concurrency Control and Multiwriter Scenarios; Monitoring the Lakehouse; Data Resilience; Performance Benchmarks and Considerations; Summary

Content preview from Apache Hudi: The Definitive Guide

Preface

Why This Book, and Why Now

Modern data platforms are being asked to do more than ever before. They must serve fresh data to dashboards, power machine learning features in real time, and support operational applications alongside traditional analytics. At the same time, volumes of data are growing rapidly, pipelines are increasingly complex, and organizations cannot afford downtime or inconsistency. The gap between what businesses expect and what legacy systems can deliver has only widened.

Apache Hudi emerged to address exactly this gap. By bringing transactions, incremental ingestion, and advanced table services to the data lake, Hudi redefined what was possible. It pioneered the data lakehouse architecture, which unifies the openness and scalability of lakes with the reliability and performance of warehouses. In recent years, Hudi has matured into one of the most widely adopted open table formats, supported by a vibrant community and deployed at scale in industries ranging from technology and finance to retail and research.

The world of data architecture is at an inflection point. Lakehouses have transitioned from a cutting-edge idea to an industry standard. Hudi has kept pace, introducing powerful features such as multiwriter concurrency control, metadata-driven optimizations, and integrated streaming ingestion. Yet with this power comes the responsibility to make the right choices—there are design trade-offs, operational considerations, and architectural choices that ...