Apache Hudi: The Definitive Guide
by Shiyan Xu, Prashant Wason, Bhavani Sudha Saktheeswaran, Rebecca Bilbro
Chapter 6. Maintaining and Optimizing Hudi Tables
Just as we regularly maintain a house to keep it in good condition, maintaining Apache Hudi tables is essential for a well-functioning data lakehouse. A house needs regular sorting, decluttering, and reorganization to remain spacious and easy to navigate; likewise, tables must be periodically reviewed and reorganized to keep them efficient and accessible.
When writing data, users often focus on minimizing read and write latencies rather than on organizing the data well; neglecting table layout is a serious oversight, especially for high-throughput tables. As we discussed at the beginning of Chapter 1, Hudi was conceived as a data lakehouse platform that anticipates such pitfalls and guards against them from the get-go, saving users from inefficiencies and operational difficulties in their data lakehouses later on.
For instance, unmaintained Hudi tables can suffer from:
- Increased storage costs: Too many small files lead to high storage access latencies and inefficient compression, driving up storage costs for the lakehouse. Large numbers of objects in cloud storage can also balloon storage API costs. (A file-sizing sketch follows this list.)
- Slow query performance: Suboptimal table organization, such as an unclustered or poorly partitioned data layout, can result in long query execution times. Large numbers of small files also contribute to metadata bloat, especially for lakehouses retaining multiple versions of a table. (A clustering sketch follows this list.)
- Increased compute costs: Without index maintenance, ...
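To make the small-file problem concrete, here is a minimal Spark (Scala) write sketch that steers Hudi's automatic file sizing using the standard configs hoodie.parquet.small.file.limit and hoodie.parquet.max.file.size. The table name, record schema, and base path are hypothetical placeholders, and the sketch assumes the Hudi Spark bundle is on the classpath.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-file-sizing-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical batch of upserts.
val df = Seq(
  ("t1", "2024-01-01", 9.5),
  ("t2", "2024-01-01", 3.2)
).toDF("trip_id", "ts", "fare")

df.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  // Base files smaller than this (in bytes) are considered "small" and are
  // bin-packed with incoming inserts on later writes (Hudi default ~100 MB).
  .option("hoodie.parquet.small.file.limit", (100L * 1024 * 1024).toString)
  // Upper bound that Hudi targets for base file size (default ~120 MB).
  .option("hoodie.parquet.max.file.size", (120L * 1024 * 1024).toString)
  .mode(SaveMode.Append)
  .save("/tmp/hudi/trips")
```

With these two options, each write bin-packs new records into existing undersized files instead of always creating fresh ones, keeping file counts, and thus storage API costs, in check.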
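A poorly clustered layout, in turn, can be reorganized with Hudi's clustering table service. Continuing the previous sketch, the options below enable inline clustering so that every few commits Hudi rewrites undersized files into larger ones sorted by the given columns; the option keys are standard Hudi clustering configs, while the sort columns and sizes are illustrative assumptions.

```scala
df.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  // Run clustering as part of the write path once every 4 commits.
  .option("hoodie.clustering.inline", "true")
  .option("hoodie.clustering.inline.max.commits", "4")
  // Rewrite files smaller than ~300 MB into files of up to ~1 GB,
  // sorted by the (hypothetical) common query-predicate columns below.
  .option("hoodie.clustering.plan.strategy.small.file.limit", (300L * 1024 * 1024).toString)
  .option("hoodie.clustering.plan.strategy.target.file.max.bytes", (1024L * 1024 * 1024).toString)
  .option("hoodie.clustering.plan.strategy.sort.columns", "trip_id,ts")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/trips")
```

Sorting data files by frequently filtered columns lets query engines skip irrelevant files via min/max statistics, which directly addresses the slow-query symptom described above.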