Apache Hudi: The Definitive Guide
by Shiyan Xu, Prashant Wason, Bhavani Sudha Saktheeswaran, Rebecca Bilbro
Chapter 8. Building a Lakehouse Using Hudi Streamer
In modern organizations, data silos create more than just fragmented data; they foster fragmented efforts. Teams across the business often find themselves independently solving the same data engineering problems, building similar ETL tools, and defining their own conventions for schemas and formats. This redundancy not only wastes valuable resources but also erects significant barriers to sharing and normalizing data. The core challenge becomes a strategic one: how can an organization move beyond this inefficiency to provide a standardized set of tools and a unified platform? How can it empower teams to collaborate on ingesting and transforming data, while sharing common datasets, catalogs, and monitoring dashboards?
The modern answer to this challenge is the data lakehouse, and Apache Hudi is a particularly strong choice for building one. If your organization is suffering from data silos and has not yet converged on a single data storage solution, Hudi offers more flexibility than the alternatives. Not only does Hudi permit different parts of an organization to maintain sovereignty over their data stacks and architectures, but it also provides a specialized ingestion tool—Hudi Streamer—that can connect to a wide array of upstream sources and streamline the construction of a data lakehouse.
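To make the idea of Hudi Streamer concrete before we dive in, here is a minimal sketch of how such an ingestion job is typically launched with `spark-submit`. The bucket path, table name, Kafka properties file, and ordering field below are placeholders invented for illustration, and the exact class name and flags vary by Hudi version; the chapter develops real configurations step by step.

```shell
# Sketch only: ingest JSON records from Kafka into a Copy-on-Write Hudi table.
# Paths, table name, and the properties file are hypothetical placeholders.
spark-submit \
  --class org.apache.hudi.utilities.streamer.HoodieStreamer \
  hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts \
  --target-base-path s3://my-lake/flights \
  --target-table flights \
  --props kafka-source.properties \
  --op UPSERT
```

A single command like this replaces a bespoke ETL pipeline: the source connector, the write operation, and the target table layout are all expressed as configuration, which is precisely what makes the tool attractive for standardizing ingestion across teams.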
In this chapter, we’ll meet Alcubierre, a fictional airline company grappling with these common data silo challenges. As we imagine ourselves ...