Apache Hudi: The Definitive Guide

by Shiyan Xu, Prashant Wason, Bhavani Sudha Saktheeswaran, Rebecca Bilbro
October 2025
Intermediate to advanced
290 pages
7h 43m
English
O'Reilly Media, Inc.

Chapter 5. Achieving Efficiency with Indexing

Lakehouses must manage petabyte-scale datasets with complex, often unpredictable mutation patterns while maintaining both write efficiency and query performance. These systems operate at massive scale on distributed storage and need to support a mix of analytical and transactional workloads. To meet these demands, lakehouse tables require versatile indexing capabilities, similar to those of OLTP databases. On the write path, indexes must be maintained as new writes arrive so that they can be used to efficiently locate existing records for updates and deletes across massive datasets. On the read path, indexes need to handle diverse query patterns with equal efficiency: range predicates benefit from pruning based on file statistics, equality predicates benefit from index lookups, and function-based predicates need specialized expression handling.
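To make the two most common read-path cases concrete, here is a minimal Python sketch of the general ideas, not of Hudi's implementation or API. It assumes each file carries min/max statistics for one column and that a key index is just a dictionary from record key to file name. The names `FileStats`, `prune_by_range`, and `build_key_index` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Min/max of one column for a single data file (hypothetical type)."""
    name: str
    min_val: int
    max_val: int


def prune_by_range(files, lo, hi):
    """Range predicate: keep only files whose [min, max] overlaps [lo, hi]."""
    return [f.name for f in files if f.max_val >= lo and f.min_val <= hi]


def build_key_index(file_to_keys):
    """Equality predicate: map each record key to the file that holds it."""
    return {key: fname for fname, keys in file_to_keys.items() for key in keys}


# Three files with non-overlapping value ranges (made-up data).
files = [
    FileStats("f1", 0, 99),
    FileStats("f2", 100, 199),
    FileStats("f3", 200, 299),
]

# A query for values between 150 and 220 can skip f1 entirely.
candidates = prune_by_range(files, 150, 220)

# A lookup by key goes straight to one file instead of scanning all three.
key_index = build_key_index({"f1": ["k1", "k2"], "f2": ["k3"], "f3": ["k4", "k5"]})
target_file = key_index.get("k4")

print(candidates)   # ['f2', 'f3']
print(target_file)  # f3
```

The same key-to-file mapping is what a writer would consult to find the file that holds a record it needs to update or delete, which is why the index has to be kept current as new writes arrive.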

As of this writing, Apache Hudi is the only lakehouse storage system that natively supports indexing capabilities. In this chapter, we discuss how Hudi uses indexing techniques to keep read and write operations performant at scale. We will also see why getting your indexing strategy right is what makes near-real-time lakehouse performance possible. We’ll cover:

  • The essentials of indexing for lakehouse tables, with a look at how indexing techniques in readers and writers optimize performance

  • How multimodal indexing works via the Hudi metadata table, along with the different ...


ISBN: 9781098173821