Apache Hudi: The Definitive Guide
by Shiyan Xu, Prashant Wason, Bhavani Sudha Saktheeswaran, Rebecca Bilbro
Chapter 2. Getting Started with Hudi
In Chapter 1, we explored the foundational concepts that make Apache Hudi a compelling choice for modern data architectures. We saw how data lakes have evolved into lakehouses, discussed Hudi’s position in this ecosystem, and reviewed its high-level architecture, the Hudi stack, and key feature highlights. While these concepts provide essential context, the best way to truly understand Hudi’s capabilities is through hands-on experience.
This chapter shifts from theory to practice. Rather than simply listing features, we’ll demonstrate how Hudi tables behave under different configurations and operations, allowing you to observe firsthand how the underlying table layout evolves as you perform common lakehouse operations.
We’ll start with a simple purchase tracking table and use Apache Spark to perform typical Create, Read, Update, and Delete (CRUD) operations. As we execute these commands, we’ll examine the resulting changes to the table’s physical structure, helping you develop an intuitive understanding of how Hudi organizes and manages your data behind the scenes.
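To give a concrete sense of what follows, here is a minimal sketch of the kind of table we will work with. It assumes a Spark session started with the Hudi Spark SQL extensions enabled; the table name purchases and its columns are illustrative placeholders, not the exact schema used later in the chapter.

    -- A minimal sketch, assuming spark-sql was launched with the Hudi
    -- Spark SQL extensions enabled. The purchases schema is illustrative.
    CREATE TABLE purchases (
      purchase_id STRING,
      customer_id STRING,
      amount      DOUBLE,
      ts          BIGINT,
      country     STRING
    ) USING hudi
    TBLPROPERTIES (
      type = 'cow',                -- Copy-on-Write, the default table type
      primaryKey = 'purchase_id',  -- record key identifying each record
      preCombineField = 'ts'       -- breaks ties between versions on upsert
    )
    PARTITIONED BY (country);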
The chapter is organized into three sections that build on one another. “Basic Operations” creates a Hudi table using the default Copy-on-Write (COW) table type and explores fundamental CRUD operations. As we execute the SQL examples, we’ll examine how each operation affects the table layout and learn core concepts such as record keys, partitioning, and the internals of the timeline. ...
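As a preview of “Basic Operations,” the sketch below shows the shape of the CRUD statements we will run against the hypothetical purchases table defined above; the specific rows and predicates are placeholders.

    -- Sketch of the CRUD statements explored in "Basic Operations";
    -- the values and predicates are placeholders.
    INSERT INTO purchases
    VALUES ('p-001', 'c-042', 19.99, 1718000000, 'US');

    UPDATE purchases
    SET amount = 24.99
    WHERE purchase_id = 'p-001';

    DELETE FROM purchases
    WHERE purchase_id = 'p-001';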