Apache Hudi: The Definitive Guide

by Shiyan Xu, Prashant Wason, Bhavani Sudha Saktheeswaran, Rebecca Bilbro
October 2025
Intermediate to advanced
290 pages
7h 43m
English
O'Reilly Media, Inc.

Chapter 6. Maintaining and Optimizing Hudi Tables

Just as a house needs regular sorting, decluttering, and reorganization to stay spacious and easy to navigate, Apache Hudi tables need periodic review and organization to stay efficient and accessible. This upkeep is essential for a well-functioning data lakehouse.

When writing data, users often focus more on minimizing read and write latency than on organizing the data well. That is a serious oversight, especially for high-throughput tables. As we discussed at the beginning of Chapter 1, Hudi is conceived as a data lakehouse platform that anticipates such pitfalls and guards against them from the start, which saves users from inefficiencies and operational difficulties later on.

For instance, unmaintained Hudi tables can suffer from:

Increased storage costs

Too many small files lead to high storage access latency and inefficient compression, which increases storage costs for the lakehouse. A large number of objects in cloud storage can also inflate storage API costs.

Slow query performance

Suboptimal table organization, such as an unclustered or poorly partitioned data layout, can result in long query execution times. Large numbers of small files also contribute to metadata bloat, especially for lakehouses that retain multiple versions of a table.

Increased compute costs

Without index maintenance, ...
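
To make the small-file concern in the list above concrete, here is a rough back-of-the-envelope sketch of how the number of storage objects grows as the average file size shrinks. The table size and the two file sizes are hypothetical values chosen for illustration, not Hudi defaults or figures from the book, and `object_count` is a helper invented for this example.

```python
def object_count(total_bytes: int, avg_file_bytes: int) -> int:
    """Number of files needed to hold total_bytes at a given average file size."""
    # Integer ceiling division, so a partially filled last file still counts.
    return -(-total_bytes // avg_file_bytes)

MIB = 1024 ** 2
TIB = 1024 ** 4

table_size = 1 * TIB                                  # hypothetical table size
small_files = object_count(table_size, 1 * MIB)       # hypothetical 1 MiB files
large_files = object_count(table_size, 128 * MIB)     # hypothetical 128 MiB files

print(f"1 MiB files:   {small_files:,} objects")
print(f"128 MiB files: {large_files:,} objects")
print(f"ratio:         {small_files // large_files}x")
```

Under these assumptions the same data occupies 128 times as many objects when written as 1 MiB files, and each of those objects must be listed, opened, and tracked in table metadata.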



ISBN: 9781098173821