Apache Hudi: The Definitive Guide
by Shiyan Xu, Prashant Wason, Bhavani Sudha Saktheeswaran, Rebecca Bilbro
Chapter 6. Maintaining and Optimizing Hudi Tables
Just as we regularly maintain a house to keep it in good condition, maintaining Apache Hudi tables is essential for a well-functioning data lakehouse. A house needs regular sorting, decluttering, and reorganization to remain spacious and easy to navigate; likewise, tables must be periodically reviewed and reorganized to keep them efficient and accessible.
When writing data, users often focus on minimizing read and write latencies rather than on organizing the data well; neglecting table layout is a serious oversight, especially for high-throughput tables. As we discussed at the beginning of Chapter 1, Hudi was conceived as a data lakehouse platform that anticipates such pitfalls and guards against them from the get-go, saving users from inefficiencies and operational difficulties in their data lakehouses later on.
For instance, unmaintained Hudi tables can suffer from:
- Increased storage costs: Too many small files lead to high storage access latencies and inefficient compression, driving up storage costs for the lakehouse. Large numbers of objects in cloud storage can also balloon storage API costs. (A file-sizing sketch follows this list.)
- Slow query performance: Suboptimal table organization, such as an unclustered or poorly partitioned data layout, can result in long query execution times. Large numbers of small files also contribute to metadata bloat, especially for lakehouses retaining multiple versions of a table. (A clustering sketch follows this list.)
- Increased compute costs: Without index maintenance, ...
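To make the small-file problem concrete, here is a minimal Spark (Scala) write sketch that steers Hudi's automatic file sizing using the standard configs hoodie.parquet.small.file.limit and hoodie.parquet.max.file.size. The table name, record schema, and base path are hypothetical placeholders, and the sketch assumes the Hudi Spark bundle is on the classpath.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-file-sizing-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical batch of upserts.
val df = Seq(
  ("t1", "2024-01-01", 9.5),
  ("t2", "2024-01-01", 3.2)
).toDF("trip_id", "ts", "fare")

df.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  // Base files smaller than this (in bytes) are considered "small" and are
  // bin-packed with incoming inserts on later writes (Hudi default ~100 MB).
  .option("hoodie.parquet.small.file.limit", (100L * 1024 * 1024).toString)
  // Upper bound that Hudi targets for base file size (default ~120 MB).
  .option("hoodie.parquet.max.file.size", (120L * 1024 * 1024).toString)
  .mode(SaveMode.Append)
  .save("/tmp/hudi/trips")
```

With these two options, each write bin-packs new records into existing undersized files instead of always creating fresh ones, keeping file counts, and thus storage API costs, in check.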
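A poorly clustered layout, in turn, can be reorganized with Hudi's clustering table service. Continuing the previous sketch, the options below enable inline clustering so that every few commits Hudi rewrites undersized files into larger ones sorted by the given columns; the option keys are standard Hudi clustering configs, while the sort columns and sizes are illustrative assumptions.

```scala
df.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  // Run clustering as part of the write path once every 4 commits.
  .option("hoodie.clustering.inline", "true")
  .option("hoodie.clustering.inline.max.commits", "4")
  // Rewrite files smaller than ~300 MB into files of up to ~1 GB,
  // sorted by the (hypothetical) common query-predicate columns below.
  .option("hoodie.clustering.plan.strategy.small.file.limit", (300L * 1024 * 1024).toString)
  .option("hoodie.clustering.plan.strategy.target.file.max.bytes", (1024L * 1024 * 1024).toString)
  .option("hoodie.clustering.plan.strategy.sort.columns", "trip_id,ts")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/trips")
```

Sorting data files by frequently filtered columns lets query engines skip irrelevant files via min/max statistics, which directly addresses the slow-query symptom described above.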