Architecting an Apache Iceberg Lakehouse

Name: Architecting an Apache Iceberg Lakehouse
Author: Alex Merced
ISBN: 9781633435100

by Alex Merced

April 2026

Beginner to intermediate

408 pages

13h 26m

English

Manning Publications

Architecting an Apache Iceberg Lakehouse
copyright
contents
dedication
foreword
preface
acknowledgments
about this book
about the author
about the cover illustration

Part 1 The value of the Apache Iceberg lakehouse
1 The world of the data lakehouse
1.1 Evolution from database to data lakehouse
1.2 The rise of data warehouses
1.3 The move to cloud data warehouses
1.4 The data lake and the Hadoop era
1.5 Apache Iceberg: Giving data lakes data warehouse capabilities
1.6 The data lakehouse: Best of both worlds
2 Apache Iceberg and the lakehouse
2.1 What does it mean that Iceberg is a table format?
2.2 Why you need a table format
2.3 How Apache Iceberg manages metadata
2.4 Key features of Apache Iceberg
2.5 Apache Iceberg: An open source standard
2.6 Benefits of Apache Iceberg
2.6.1 ACID transactions
2.6.2 How tables evolve
2.6.3 Time travel and snapshot-based queries
2.6.4 Hidden partitioning to reduce accidental full-table scans
2.6.5 Cost efficiency and query performance
2.7 Apache Iceberg lakehouse components
2.7.1 Storage layer: Foundation of your lakehouse
2.7.2 Ingestion layer: Feeding data into Iceberg tables
2.7.3 Catalog layer: Your entry point to the lakehouse
2.7.4 Federation layer: Modeling and accelerating data
2.7.5 Consumption layer: Delivering business value
3 Hands-on with Apache Iceberg
3.1 Our example
3.2 Setting up an Apache Iceberg environment
3.2.1 Prerequisite: Installing Docker
3.2.2 Creating the Docker Compose file
3.2.3 Running the environment
3.2.4 Accessing services
3.3 Creating Iceberg tables in Spark
3.3.1 Populating the PostgreSQL database
3.3.2 Starting Apache Spark
3.3.3 Configuring Apache Spark for Iceberg
3.3.4 Loading data from PostgreSQL into Iceberg
3.3.5 Verifying data storage in MinIO
3.4 Reading Iceberg tables with Dremio
3.4.1 Starting Dremio
3.4.2 Connecting Dremio to the Nessie catalog
3.4.3 Querying Iceberg tables in Dremio
3.5 Creating a BI dashboard from your Iceberg tables
3.5.1 Starting Apache Superset
3.5.2 Connecting Superset to Dremio
3.5.3 Creating a dataset from Iceberg tables
3.5.4 Building charts and dashboards
Part 2 Designing your Iceberg architecture
4 Preparing for your move to Apache Iceberg
4.1 Auditing your data platform
4.1.1 Who are the stakeholders?
4.1.2 What should you ask stakeholders?
4.1.3 Conducting a technological audit
4.2 Hamerliva Bank’s audit in action
4.2.1 Hamerliva Bank interviews its stakeholders
4.2.2 Hamerliva Bank audits its technology stack
4.2.3 Hamerliva Bank summarizes the audit findings
4.3 From audit to requirements: Laying the groundwork for design
4.3.1 Defining storage requirements
4.3.2 Defining ingestion requirements
4.3.3 Defining catalog requirements
4.3.4 Defining federation requirements
4.3.5 Defining consumption requirements
4.4 Hamerliva Bank defines its requirements
4.4.1 Storage requirements
4.4.2 Ingestion requirements
4.4.3 Catalog requirements
4.4.4 Federation requirements
4.4.5 Consumption requirements
4.4.6 From requirements to design
4.5 Architectural plan and road show
4.5.1 Hamerliva Bank creates its architectural plan
4.5.2 Hamerliva Bank conducts a road show
5 Selecting the storage layer
5.1 Storage requirements
5.1.1 Performance requirements for file retrieval
5.1.2 Security requirements
5.1.3 Integrity requirements
5.1.4 Cost and operational overhead requirements
5.2 Block vs. object storage
5.2.1 Block storage
5.2.2 Object storage
5.3 Storage layer standards
5.3.1 Apache Parquet
5.3.2 The S3 API
5.4 Storage solutions
5.4.1 Vendor comparison summary
5.4.2 Hadoop
5.4.3 Amazon S3
5.4.4 Google Cloud Storage
5.4.5 Azure Blob Storage and ADLS
5.4.6 MinIO
5.4.7 Ceph
5.4.8 NetApp StorageGRID
5.4.9 Everpure
5.4.10 Dell ECS
5.4.11 Wasabi
5.5 Selecting storage based on requirements
5.5.1 Performance requirements
5.5.2 Security requirements
5.5.3 Integrity requirements
5.5.4 Cost and operational requirements
6 Architecting the ingestion layer
6.1 Ingestion requirements
6.1.1 Ingestion throughput and latency
6.1.2 Reliability and fault tolerance
6.1.3 Schema management and evolution
6.1.4 Operational complexity and maintainability
6.2 Ingestion models and architectures
6.2.1 Batch ingestion
6.2.2 Micro-batch and incremental ingestion
6.2.3 Streaming ingestion
6.3 How Iceberg manages writes
6.3.1 Write semantics in Iceberg
6.3.2 Commit protocols and conflict handling
6.4 Tools and frameworks for ingestion
6.4.1 Apache Spark
6.4.2 Apache Flink
6.4.3 Apache NiFi
6.4.4 Fivetran
6.4.5 Qlik
6.4.6 Airbyte
6.4.7 Confluent
6.4.8 Redpanda
6.4.9 Cloud-native ingestion services
6.4.10 Tool selection considerations
6.5 Applying ingestion requirements in context
6.5.1 Prioritizing low latency
6.5.2 Managing high throughput
6.5.3 Supporting complex transformations
6.5.4 Handling schema evolution
6.5.5 Balancing operational overhead
6.5.6 Considering existing cloud environments
7 Implementing the catalog layer
7.1 The role of the catalog in Apache Iceberg lakehouses
7.1.1 Responsibilities of the catalog
7.1.2 Catalog interactions with query and processing engines
7.2 Evaluating catalog requirements
7.2.1 Performance, availability, and scale
7.2.2 Metadata governance and lineage
7.2.3 Security and compliance
7.2.4 Deployment flexibility and ecosystem compatibility
7.2.5 Cost and operational overhead
7.2.6 Catalog federation and mesh architectures
7.3 Apache Iceberg REST Catalog specification
7.3.1 Before the Apache Iceberg REST Catalog specification
7.3.2 The solution
7.4 Catalog options: Exploring the ecosystem
7.4.1 Hadoop catalog
7.4.2 Hive catalog
7.4.3 JDBC catalog
7.4.4 Apache Polaris
7.4.5 Project Nessie
7.4.6 Apache Gravitino
7.4.7 Lakekeeper
7.4.8 AWS Glue Data Catalog
7.4.9 Dremio catalog
7.4.10 Snowflake Open Catalog
7.4.11 Databricks Unity Catalog
7.5 Choosing the right catalog: Evaluating options through scenarios
7.5.1 Scenario: A mid-sized data team migrating from Hive
7.5.2 Scenario: A rapidly scaling cloud-native startup
7.5.3 Scenario: A multinational enterprise with strict data governance
7.5.4 Scenario: A SaaS startup prioritizing operational simplicity
7.5.5 Scenario: A large enterprise with multicloud and federated governance needs
7.5.6 Scenario: A financial firm requiring daily environment cloning for stress testing
7.5.7 Scenario: Phased Iceberg migration with query federation across legacy systems
7.5.8 Scenario: A lightweight lakehouse adoption with Hadoop catalog and Python
7.6 Catalog-based access control
8 Designing the federation layer
8.1 What data federation is and why it matters
8.1.1 Common use cases and challenges driving federation needs
8.1.2 How federation aligns with agility and accessibility
8.2 Key requirements for federation
8.2.1 Supporting diverse data sources without duplication
8.2.2 Ensuring consistent semantics and business logic
8.2.3 Providing seamless connectivity for analytics tools
8.3 Introducing Dremio and Trino
8.3.1 Dremio
8.3.2 Dremio’s architecture
8.3.3 Dremio’s connector ecosystem and Iceberg-centric focus
8.3.4 Dremio’s performance enhancements
8.3.5 Trino
8.3.6 Trino’s modular architecture for wide-source support
8.3.7 Trino’s flexibility and configurability for complex environments
8.3.8 Trino’s community-led evolution and vendor extensions
8.3.9 Semantic layer considerations in Trino
8.4 Deployment models
8.4.1 Deploying Dremio
8.4.2 Deploying Trino
8.5 Federation platform decision scenarios
8.5.1 Fragmented multisource environment: Trino for connector breadth
8.5.2 Building a native Iceberg lakehouse: Dremio for Iceberg-native features
8.5.3 Empowering business users with UI and governed datasets: Dremio
8.5.4 Lightweight querying of Hudi datasets: Trino via AWS Athena
8.5.5 On-prem Cloudera modernization: Trino replacing Impala for performance
8.5.6 Hybrid cloud Iceberg strategy: Dremio bridging on-prem and ADLS
8.6 Federation alternatives
8.6.1 Virtualization via shortcuts in OneLake
8.6.2 AI-native data virtualization with Spice.ai
8.6.3 Choosing the right fit
9 Understanding the consumption layer
9.1 Revisiting the benefits of the lakehouse for consumption
9.1.1 Connecting the lakehouse to the people
9.2 Revisiting requirements from our audit
9.2.1 Interpreting requirements for consumption
9.2.2 Requirements for BI tools
9.2.3 Requirements for interactive notebook environments
9.2.4 Requirements for AI and specialized data consumption tools
9.3 Open interfaces for seamless consumption
9.3.1 JDBC and ODBC
9.3.2 Arrow Flight
9.3.3 Model Context Protocol (MCP)
9.4 Business intelligence tools in the lakehouse
9.4.1 Open source BI tools
9.4.2 Commercial BI tools
9.4.3 Tools for AI and machine learning workloads
9.5 Choosing the right consumption tools: Ten illustrated scenarios
9.5.1 Startup with a data science focus
9.5.2 Large financial institution with strict governance
9.5.3 Mid-sized e-commerce platform building embedded analytics
9.5.4 Decentralized media organization enabling self-service analytics
9.5.5 Government agency balancing public transparency and internal control
9.5.6 Healthcare provider with compliance and data locality constraints
9.5.7 Logistics company unifying real-time operations and historical analysis
9.5.8 SaaS company offering customizable data access to clients
9.5.9 Nonprofit organization supporting collaborative research
9.5.10 Manufacturing company enabling predictive maintenance
Part 3 Operating your Apache Iceberg lakehouse
10 Maintaining an Iceberg lakehouse
10.1 Problem: Suboptimal data files
10.1.1 Small files
10.1.2 Poorly colocated data
10.1.3 Metadata sprawl
10.1.4 Merge-on-Read performance hits
10.2 Solution: Compaction
10.2.1 What is compaction?
10.2.2 Target file size
10.2.3 Files to be included
10.2.4 Using filters to scope compaction
10.3 Storage footprint management and data retention
10.3.1 Running snapshot expiration
10.3.2 COW vs. MOR: Implications for data retention
10.3.3 Regulatory considerations for data deletion
10.4 Exploring Apache Iceberg’s metadata tables
11 Operationalizing Apache Iceberg
11.1 Orchestrating the lakehouse
11.1.1 Choosing orchestration tools and patterns
11.1.2 Metadata-driven triggers for proactive maintenance
11.1.3 Per-table maintenance policies
11.1.4 Monitoring and alerting integration
11.1.5 Putting orchestration into practice
11.2 Auditing the lakehouse
11.2.1 Using snapshot history for change tracking
11.2.2 Using branching and tagging for governance
11.2.3 Implementing file and snapshot retention policies
11.2.4 Practical retention policy orchestration
11.2.5 Secure data deletion
11.2.6 Access auditing and governance
11.2.7 Practical auditing with Iceberg: Example workflows
11.3 Disaster recovery in the lakehouse
11.3.1 The role of the metadata catalog in disaster recovery
11.3.2 Protecting against data loss and corruption
11.3.3 Cross-region and multi-environment recovery
11.3.4 Rollback and time travel in incident response
11.3.5 Automating disaster recovery procedures
11.3.6 Validating recovery readiness
11.3.7 Disaster recovery through automation
11.3.8 Practical examples: Automating recovery workflows
appendix A The metadata tables
A.1 Querying Iceberg metadata tables
A.2 The history metadata table
A.3 The snapshots metadata table
A.4 The metadata_log_entries metadata table
A.5 The manifests metadata table
A.6 The partitions metadata table
A.7 The files metadata table
A.8 The position_deletes metadata table
A.9 The all_data_files metadata table
A.10 The all_delete_files metadata table
A.11 The all_entries metadata table
A.12 The all_manifests metadata table
A.13 The refs metadata table
A.14 Monitoring table health with metadata tables
A.14.1 Example: Triggering compaction based on file metrics
A.14.2 Example: Monitoring snapshot frequency
A.14.3 Automating maintenance with insights
appendix B Python for Apache Iceberg
B.1 PyIceberg
B.2 Polars
B.3 DuckDB
B.4 Daft
B.5 Dremio
B.6 Bauplan
B.7 Spice AI
B.8 Summary and best practices
appendix C The Apache Iceberg specification
C.1 Understanding the Iceberg specification
C.1.1 What is a table format specification?
C.1.2 Why Iceberg formalizes table behavior
C.1.3 Evolution of the spec: Versioning principles and compatibility
C.2 Iceberg table format versions
C.2.1 Version 1: Foundation for analytical tables
C.2.2 Version 2: Row-level deletes and stricter writes
C.2.3 Version 3: Extended types and advanced capabilities
C.2.4 Version 4: Performance, portability, and real-time readiness
C.3 Snapshot management and table metadata
C.3.1 Table metadata files
C.3.2 Snapshots and the manifest list
C.3.3 Sequence numbers and optimistic concurrency
C.4 The REST Catalog specification
C.4.1 Overview and purpose
C.4.2 Catalog configuration and default endpoints
C.4.3 Namespaces, tables, and views
C.4.4 Table registration, metrics, and transactions
C.4.5 OAuth2 support and security considerations
C.4.6 The scan-planning endpoint
C.5 Puffin file format specification
C.5.1 What is a Puffin file?
C.5.2 Storing column-level metrics and custom indexes
C.5.3 Integration with Iceberg table metadata
C.6 Compatibility and migration
C.6.1 Reading and writing across format versions
C.6.2 Upgrading tables to newer spec versions
C.6.3 Handling backward compatibility in practice
afterword

Content preview from Architecting an Apache Iceberg Lakehouse

2 Apache Iceberg and the lakehouse

This chapter covers

What the Apache Iceberg table format is
The benefits of Apache Iceberg
Components of an Apache Iceberg–based data lakehouse

Apache Iceberg is a community-driven table format that defines how large analytical datasets are organized, versioned, and accessed on a data lake. It doesn’t change how data is stored at the file level. Instead, it adds a standard metadata layer on top of files, typically stored in Apache Parquet, which lets collections of files be treated as coherent, relational tables while remaining on low-cost object storage. This chapter will explore the architecture and value of Apache Iceberg as an open table format for data lakehouses.
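To make the idea of a metadata layer over files more concrete, here is a minimal toy sketch in plain Python. It is not Iceberg's actual metadata layout or API: the `ToyTable` and `Snapshot` classes and the `s3://bucket/...` paths are hypothetical names invented for illustration. It only shows how a small record of file lists, kept separately from the data files, lets a set of files be read as one versioned table.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """One immutable version of the table: the data files it covers."""
    snapshot_id: int
    data_files: Tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table's metadata layer (not Iceberg's real layout)."""
    name: str
    snapshots: List[Snapshot] = field(default_factory=list)

    def append_files(self, *paths: str) -> Snapshot:
        """Record a new snapshot that adds files; earlier snapshots stay untouched."""
        previous = self.snapshots[-1].data_files if self.snapshots else ()
        snapshot = Snapshot(len(self.snapshots) + 1, previous + tuple(paths))
        self.snapshots.append(snapshot)
        return snapshot

    def files_to_scan(self, snapshot_id: Optional[int] = None) -> Tuple[str, ...]:
        """Return the files a reader would scan for the latest or a past version."""
        if not self.snapshots:
            return ()
        if snapshot_id is None:
            return self.snapshots[-1].data_files
        for snapshot in self.snapshots:
            if snapshot.snapshot_id == snapshot_id:
                return snapshot.data_files
        raise KeyError(f"unknown snapshot {snapshot_id}")


# The data files themselves are never modified; only the metadata changes.
orders = ToyTable("orders")
orders.append_files("s3://bucket/orders/part-0001.parquet")
orders.append_files("s3://bucket/orders/part-0002.parquet")

current_files = orders.files_to_scan()
first_version_files = orders.files_to_scan(snapshot_id=1)
```

In this sketch, a reader asks the metadata which files make up the table at a given version instead of listing a directory, which is the general idea behind treating a collection of files as one coherent table.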

2.1 What does it mean that Iceberg ...
