Chapter 4. Optimizing the Performance of Iceberg Tables
As you saw in Chapter 3, Apache Iceberg tables provide a layer of metadata that allows the query engine to create smarter query plans for better performance. However, this metadata is only the beginning of how you can optimize the performance of your data.
You have various optimization levers at your disposal, including reducing the number of datafiles, data sorting, table partitioning, row-level update handling, metrics collection, and external factors. These levers play a vital role in enhancing data performance, and this chapter explores each of them, addressing potential slowdowns and providing acceleration insights. Implementing robust monitoring with preferred tools is crucial for identifying optimization needs, including the use of Apache Iceberg metadata tables, which we will cover in Chapter 10.
Compaction
Every procedure or process comes at a cost in terms of time, meaning longer queries and higher compute costs. Stated differently, the more steps you need to take to do something, the longer it will take for you to do it. When you are querying your Apache Iceberg tables, you need to open and scan each file and then close the file when you’re done. The more files you have to scan for a query, the greater the cost these file operations will put on your query. This problem is magnified in the world of streaming or “real-time” data, where data is ingested as it is created, generating lots of files with only a few records in each.
In contrast, batch ingestion, where you may ingest a whole day’s worth or a week’s worth of records in one job, allows you to more efficiently plan how to write the data as better-organized files. Even with batch ingestion, it is possible to run into the “small files problem,” where too many small files have an impact on the speed and performance of your scans because you’re doing more file operations, have a lot more metadata to read (there is metadata on each file), and have to delete more files when doing cleanup and maintenance operations. Figure 4-1 depicts both of these scenarios.
Essentially, when it comes to reading data, there are fixed costs you can’t avoid and variable costs you can avoid by using different strategies. Fixed costs include reading the particular data relevant to your query; you can’t avoid having to read the data to process it. Variable costs, on the other hand, include the file operations required to access that data; by using many of the strategies we discuss throughout this chapter, you can reduce those variable costs as much as possible. After applying these strategies, you’ll be using only the necessary compute to get your job done more cheaply and more quickly (and finishing more quickly has the benefit of being able to terminate compute clusters earlier, reducing their costs).
The solution to this problem is to periodically take the data in all these small files and rewrite it into fewer larger files (you may also want to rewrite manifests if there are too many manifests relative to the number of datafiles you have). This process is called compaction, as you are compacting many files into a few. Compaction is illustrated in Figure 4-2.
Hands-on with Compaction
You may be thinking that while the solution sounds simple, it will involve you having to write some extensive code in Java or Python. Fortunately, Apache Iceberg’s Actions package includes several maintenance procedures (the Actions package is specifically for Apache Spark, but other engines can create their own maintenance operation implementation). This package is used from within Spark either by writing SparkSQL as shown through most of this chapter or by writing imperative code such as the following (keep in mind that these actions still maintain the same ACID guarantees as normal Iceberg transactions):
Table table = catalog.loadTable("myTable");

SparkActions
    .get()
    .rewriteDataFiles(table)
    .option("rewrite-job-order", "files-desc")
    .execute();
In this snippet, we initiated a new instance of our table and then triggered rewriteDataFiles, which is the Spark action for compaction. The builder pattern used by SparkActions allows us to chain methods together to fine-tune the compaction job to express not only that we want compaction to be done, but also how we want it to be done.

There are several methods you can chain between the call to rewriteDataFiles and the execute method that begins the job:
- binPack: Sets the compaction strategy to binpack (discussed later), which is the default and doesn’t need to be explicitly supplied.
- sort: Changes the compaction strategy to sort the data rewritten by one or more fields in a priority order, further discussed in “Compaction Strategies”.
- zOrder: Changes the compaction strategy to z-order–sort the data based on multiple fields with equal weighting, further discussed in “Sorting”.
- filter: Enables you to pass an expression used to limit which files are rewritten.
- option: Sets a single configuration option for the job.
- options: Takes a map of multiple configuration options for the job.
There are several possible options you can pass to configure the job; here are a few important ones:
- target-file-size-bytes: Sets the intended size of the output files. By default, this uses the write.target.file-size-bytes property of the table, which defaults to 512 MB.
- max-concurrent-file-group-rewrites: The ceiling for the number of file groups to write simultaneously.
- max-file-group-size-bytes: The maximum size of a file group, not of any single file. This setting should be used when dealing with partitions larger than the memory available to the worker writing a particular file group, so that the partition can be split into multiple file groups to be written concurrently.
- partial-progress-enabled: Allows commits to occur while file groups are compacted, so for long-running compaction jobs, concurrent queries can benefit from already compacted files.
- partial-progress-max-commits: If partial progress is enabled, sets the maximum number of commits allowed to complete the job.
- rewrite-job-order: The order in which to write file groups, which can matter when using partial progress to make sure the higher-priority file groups are committed sooner rather than later. Groups can be ordered by byte size or by number of files in a group (bytes-asc, bytes-desc, files-asc, files-desc, none).
Note
As the engine plans the new files to be written in the compaction job, it will begin grouping these files into file groups that will be written in parallel (meaning one file from each group can be written concurrently). In your compaction jobs, you can configure options on how big these file groups can be and how many should be written simultaneously to help prevent memory issues.
The following code snippet uses several of the possible table options in practice:
Table table = catalog.loadTable("myTable");

SparkActions
    .get()
    .rewriteDataFiles(table)
    .sort()
    .filter(
        Expressions.and(
            Expressions.greaterThanOrEqual("date", "2023-01-01"),
            Expressions.lessThanOrEqual("date", "2023-01-31")))
    .option("rewrite-job-order", "files-desc")
    .execute();
In the preceding example, we implemented a sort strategy that, by default, adheres to the sort order specified in the table’s properties. Additionally, we incorporated a filter to exclusively rewrite data from the month of January. It’s important to note that this filter requires creating an expression using Apache Iceberg’s internal expression-building interface. Furthermore, we configured the rewrite-job-order to prioritize rewriting the file groups containing the most files first. This means a file group that rewrites five files will be processed before one that consolidates just two files.
Note
The Expressions library is designed to facilitate creating expressions around Apache Iceberg’s metadata structures. The library provides APIs to build and manipulate these expressions, which can then be used to filter data in tables and read operations. Iceberg’s expressions can also be used in manifest files to summarize the data in each datafile, which allows Iceberg to skip files that do not contain rows that could match a filter. This mechanism is essential for Iceberg’s scalable metadata architecture.
While this is all well and good, it can be done more easily using the Spark SQL extensions, which include call procedures that can be called using the following syntax from Spark SQL:
-- using positional arguments
CALL catalog.system.procedure(arg1, arg2, arg3)

-- using named arguments
CALL catalog.system.procedure(argkey1 => argval1, argkey2 => argval2)
Using the rewriteDataFiles procedure in this syntax would look like Example 4-1.

Example 4-1. Using the rewrite_data_files procedure to run compaction jobs
-- Rewrite Data Files CALL Procedure in SparkSQL
CALL catalog.system.rewrite_data_files(
  table => 'musicians',
  strategy => 'binpack',
  where => 'genre = "rock"',
  options => map(
    'rewrite-job-order', 'bytes-asc',
    'target-file-size-bytes', '1073741824',    -- 1GB
    'max-file-group-size-bytes', '10737418240' -- 10GB
  )
)
In this scenario, we may have been streaming some data into our musicians table and noticed that a lot of small files were generated for rock bands, so instead of running compaction on the whole table, which can be time-consuming, we targeted just the data that was problematic. We also told Spark to order the file group rewrites by size in bytes, ascending, and to aim for files of around 1 GB each, with each file group around 10 GB. You can see what the result of these settings would be in Figure 4-3.
Tip
Notice in Example 4-1 the use of double quotation marks in our where filter. Because we had to wrap the whole filter in single quotes, we use double quotes around the string value inside it, even though SQL would normally use single quotes for "rock". The where option is essentially equivalent to the filter method mentioned earlier; without it, the whole table could be rewritten.
Other engines can implement their own custom compaction tools. For example, Dremio has its own Iceberg table management feature via its OPTIMIZE command, which is a unique implementation but follows many of the APIs from the RewriteDataFiles action:
OPTIMIZE TABLE catalog.MyTable
The preceding command would achieve your basic binpack compaction by compacting all the files into fewer, more optimal files. But like the rewriteDataFiles procedure in Spark, we can get more granular.
For example, here we are compacting only a particular partition:
OPTIMIZE TABLE catalog.MyTable FOR PARTITIONS sales_year IN (2022, 2023) AND sales_month IN ('JAN', 'FEB', 'MAR')
And here we are compacting with particular file size parameters:
OPTIMIZE TABLE catalog.MyTable REWRITE DATA (MIN_FILE_SIZE_MB=100, MAX_FILE_SIZE_MB=1000, TARGET_FILE_SIZE_MB=512)
In this code snippet, we are rewriting only the manifests:
OPTIMIZE TABLE catalog.MyTable REWRITE MANIFESTS
As you can see, you can use Spark or Dremio to achieve compaction of your Apache Iceberg tables.
Compaction Strategies
As mentioned earlier, there are several compaction strategies that you can use with the rewriteDataFiles procedure. Table 4-1 summarizes these strategies, including their pros and cons. In this section, we will discuss binpack compaction; standard sorting and z-order sorting are covered later in this chapter.
Strategy | What it does | Pros | Cons
---------|--------------|------|-----
Binpack | Combines files only; no global sorting (will do local sorting within tasks) | This offers the fastest compaction jobs. | Data is not clustered.
Sort | Sorts by one or more fields sequentially prior to allocating tasks (e.g., sort by field a, then within that, sort by field b) | Data clustered by often-queried fields can lead to much faster read times. | This results in longer compaction jobs compared to binpack.
z-order | Sorts by multiple fields that are equally weighted, prior to allocating tasks (X and Y values in this range are in one grouping; those in another range are in another grouping) | If queries often rely on filters on multiple fields, this can improve read times even further. | This results in longer-running compaction jobs compared to binpack.
The binpack strategy is essentially pure compaction with no other considerations for how the data is organized beyond the size of the files. Of the three strategies, binpack is the fastest as it can just write the contents of the smaller files to a larger file of your target size, whereas sort and z-order must sort the data before they can allocate file groups for writing. This is particularly useful when you have streaming data and need compaction to run at a speed that meets your service level agreements (SLAs).
Tip
If an Apache Iceberg table has a sort order set within its settings, even if you use binpack, this sort order will be used for sorting data within a single task (local sort). Using the sort and z-order strategies will sort the data before the query engine allocates the records into different tasks, optimizing the clustering of data across tasks.
If you were ingesting streaming data, you may need to run a quick compaction on data that is ingested after every hour. You could do something like this:
CALL catalog.system.rewrite_data_files(
  table => 'streamingtable',
  strategy => 'binpack',
  where => 'created_at between "2023-01-26 09:00:00" and "2023-01-26 09:59:59"',
  options => map(
    'rewrite-job-order', 'bytes-asc',
    'target-file-size-bytes', '1073741824',
    'max-file-group-size-bytes', '10737418240',
    'partial-progress-enabled', 'true'
  )
)
In this compaction job, the binpack strategy is employed for faster alignment with streaming SLA requirements. It specifically targets data ingestion within a one-hour time frame, which can dynamically adjust to the most recent hour. The use of partial progress commits ensures that as file groups are written, they are immediately committed, leading to immediate performance enhancements for readers. Importantly, this compaction process focuses solely on previously written data, isolating it from any concurrent writes coming from streaming operations that would introduce new datafiles.
Using a faster strategy on a limited scope of data can make your compaction jobs much faster. Of course, you could probably compact the data even more if you allowed compaction beyond one hour, but you have to balance out the need to run the compaction job quickly with the need for optimization. You may have an additional compaction job for a day’s worth of data overnight and a compaction job for a week’s worth of data over the weekend to keep optimizing in continuous intervals while interfering as little as possible with other operations. Keep in mind that compaction always honors the current partition spec, so if data from an old partition spec is rewritten, it will have the new partitioning rules applied.
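The overnight job covering a full day of data, for example, has more time to run, so it might trade some compaction speed for better clustering by switching to the sort strategy. The following is a minimal sketch, reusing the assumed streamingtable and created_at names from the previous example; the exact filter, strategy, and sort order should be tuned to your own SLAs and query patterns:

-- Hypothetical nightly compaction covering a full day of data
CALL catalog.system.rewrite_data_files(
  table => 'streamingtable',
  strategy => 'sort',                 -- more time available overnight, so cluster the data
  sort_order => 'created_at ASC NULLS LAST',
  where => 'created_at between "2023-01-26 00:00:00" and "2023-01-26 23:59:59"',
  options => map('partial-progress-enabled', 'true')
)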
Automating Compaction
It would be a little tricky to meet all your SLAs if you have to manually run these compaction jobs, so looking into how to automate these processes could be a real benefit. Here are a couple of approaches you can take to automate these jobs:
- You can use an orchestration tool such as Airflow, Dagster, Prefect, Argo, or Luigi to send the proper SQL to an engine such as Spark or Dremio after an ingestion job completes or at a certain time or periodic interval.
- You can use serverless functions to trigger the job after data lands in cloud object storage.
- You can set up cron jobs to run the appropriate jobs at specific times.
These approaches require you to script out and deploy these services manually. However, there is also a class of managed Apache Iceberg catalog services that features automated table maintenance and includes compaction. Examples of these kinds of services include Dremio Arctic and Tabular.
Sorting
Before we get into the details of the sort compaction strategy, let’s understand sorting as it relates to optimizing a table.
Sorting or “clustering” your data has a very particular benefit when it comes to your queries: it helps limit the number of files that need to be scanned to get the data needed for a query. Sorting the data allows data with similar values to be concentrated into fewer files, allowing for more efficient query planning.
For example, suppose you have a dataset representing every player on every NFL team across 100 Parquet files that aren’t sorted in any particular way. If you did a query just for players on the Detroit Lions, even if a file of 100 records has only one record of a Detroit Lions player, that file must be added to the query plan and be scanned. This means you may need to scan up to 53 files (the maximum number of players that can be on an NFL team). If you sorted the data alphabetically by team name, all the Detroit Lions players should be in about four files (100 files divided by 32 NFL teams equals 3.125), which would probably include a handful of players from the Green Bay Packers and the Denver Broncos. So, by having the data sorted, you’ve reduced the number of files you have to scan from possibly 53 to 4, which, as we discussed in “Compaction Strategies”, greatly improves the performance of the query. Figure 4-4 depicts the benefits of scanning sorted datasets.
Sorted data can be quite useful if how the data is sorted leans into typical query patterns such as in this example, where you may regularly query the NFL data based on a particular team. Sorting data in Apache Iceberg can happen at many different points, so you want to make sure you leverage all these points.
There are two main ways to create a table. One way is with a standard CREATE TABLE statement:
-- Spark Syntax
CREATE TABLE catalog.nfl_players (
    id bigint,
    player_name varchar,
    team varchar,
    num_of_touchdowns int,
    num_of_yards int,
    player_position varchar,
    player_number int
)

-- Dremio Syntax
CREATE TABLE catalog.nfl_players (
    id bigint,
    player_name varchar,
    team varchar,
    num_of_touchdowns int,
    num_of_yards int,
    player_position varchar,
    player_number int
)
The other way is with a CREATE TABLE…AS SELECT (CTAS) statement:
-- Spark SQL & Dremio Syntax
CREATE
TABLE
catalog
.
nfl_players
AS
(
SELECT
*
FROM
non_iceberg_teams_table
);
After creating the table, you set the sort order of the table, which any engine that supports the property will use to sort the data before writing and will also be the default sort field when using the sort compaction strategy:
ALTER TABLE catalog.nfl_teams WRITE ORDERED BY team;
If doing a CTAS, sort the data in your AS query:
CREATE TABLE catalog.nfl_teams
AS (SELECT * FROM non_iceberg_teams_table ORDER BY team);

ALTER TABLE catalog.nfl_teams WRITE ORDERED BY team;
The ALTER TABLE statement sets a global sort order that will be used for all future writes by engines that honor the sort order. You could also specify it with INSERT INTO, like so:
INSERT INTO catalog.nfl_teams
SELECT * FROM staging_table
ORDER BY team
This will ensure that the data is sorted as you write it, but it isn’t perfect. Going back to the previous example, if the NFL dataset was updated each year for changes in the teams’ rosters, you may end up having many files splitting Lions and Packers players from multiple writes. This is because you’d now need to write a new file with the new Lions players for the current year. This is where the sort compaction strategy comes into play.
The sort compaction strategy will sort the data across all the files targeted by the job. So, for example, if you wanted to rewrite the entire dataset with all players sorted by team globally, you could run the following statement:
CALL catalog.system.rewrite_data_files(
  table => 'nfl_teams',
  strategy => 'sort',
  sort_order => 'team ASC NULLS LAST'
)
Here is a breakdown of the string that was passed for the sort order:
- team: Sorts the data by the team field.
- ASC: Sorts the data in ascending order (DESC would sort in descending order).
- NULLS LAST: Puts all players with a null value at the end of the sort, after the Washington Commanders (NULLS FIRST would put all players before the Arizona Cardinals).
Figure 4-5 shows the result of the sort.
You can sort by additional fields as well. For example, you may want the data sorted by team, but then within each team you may want it sorted alphabetically by name. You can achieve this by running a job with these parameters:
CALL catalog.system.rewrite_data_files(
  table => 'nfl_teams',
  strategy => 'sort',
  sort_order => 'team ASC NULLS LAST, name ASC NULLS FIRST'
)
Sorting by team will have the highest weight, followed by sorting by name. You’ll probably see players in this order in the file where the Lions roster ends and the Packers roster begins, as shown in Figure 4-6.
If end users regularly asked questions such as “Who are all the Lions players whose name starts with A,” this dual sort would accelerate the query even further. However, if end users asked “Who are all the NFL players whose name starts with A,” this wouldn’t be as helpful, as all the “A” players are stretched across more files than if you had just sorted by name alone. This is where z-ordering can be useful.
The bottom line is that to get the best advantage of sorting, you need to understand the types of questions your end users are asking so that you can have the data sorted to lean into their questions effectively.
Z-order
There are times when multiple fields are a priority when querying a table, and this is where a z-order sort may be quite helpful. With a z-order sort you are sorting the data by multiple data points, which allows engines a greater ability to reduce the files scanned in the final query plan. Let’s imagine we’re trying to locate item Z in a 4 × 4 grid (Figure 4-7).
Referring to “A” in Figure 4-7, we have a value (z), which we can say equals 3.5, and we want to narrow the area we want to search within our data. We can narrow down our search by breaking the field into four quadrants based on ranges of X and Y values, as shown in “B” in the figure.
So if we know what data we are looking for based on fields we z-ordered by, we can possibly avoid searching large portions of the data since it’s sorted by both fields. We can then take that quadrant and break it down even further and apply another z-order sort to the data in the quadrant, as shown in “C” in the figure. Since our search is based on multiple factors (X and Y), we could eliminate 75% of the searchable area by taking this approach.
You can sort and cluster your data in the datafiles in a similar way. For example, let’s say you have a dataset of all people involved in a medical cohort study, and you are trying to organize outcomes in the cohort by age and height; z-ordering the data may be quite worthwhile. You can see this in action in Figure 4-8.
Data that falls into a particular quadrant will be in the same datafiles, which can really slim down files to scan as you try to run analytics on different age/height groups. If you are searching for people with a height of 6 feet and an age of 60, you could immediately eliminate the datafiles that have data that belongs in the other three quadrants.
This works because the datafiles will fall into four categories:
- A: File with records containing Age 1–50 and Height 1–5
- B: File with records containing Age 51–100 and Height 1–5
- C: File with records containing Age 1–50 and Height 5–10
- D: File with records containing Age 51–100 and Height 5–10
If the engine knows you are searching for someone who is 60 years of age and is 6 feet tall, as it uses the Apache Iceberg metadata to plan the query, all the datafiles in categories A, B, and C will be eliminated and will never be scanned. Keep in mind that even if you only searched by age, you’d see a benefit from clustering by being able to eliminate at least two of the four quadrants.
Achieving this would involve running a compaction job:
CALL catalog.system.rewrite_data_files(
  table => 'people',
  strategy => 'sort',
  sort_order => 'zorder(age,height)'
)
Using the sort and z-order compaction strategies not only allows you to reduce the number of files your data exists in, but also makes sure the order of the data in those files enables even more efficient query planning.
While sorting is effective, it comes with some challenges. First, as new data is ingested, it becomes unsorted, and until the next compaction job, the data remains somewhat scattered across multiple files. This occurs because new data is added to a new file and is potentially sorted within that file but not in the context of all previous records. Second, files may still contain data for multiple values of the sorted field, which can be inefficient for queries that only require data with a specific value. For instance, in the earlier example, files contained data for both Lions and Packers players, making it inefficient to scan Packers records when you were only interested in Lions players.
Partitioning
If you know a particular field is pivotal to how the data is accessed, you may want to go beyond sorting and into partitioning. When a table is partitioned, instead of just sorting the order based on a field, it will write records with distinct values of the target field into their own datafiles.
For example, in politics, you’ll likely often query voter data based on a voter’s party affiliation, making this a good partition field. This would mean all voters in the “Blue” party will be listed in distinct files from those in the “Red,” “Yellow,” and “Green” parties. If you were to query for voters in the “Yellow” party, none of the datafiles you scan would include anyone from any other parties. You can see this illustrated in Figure 4-9.
Traditionally, partitioning a table based on derived values of a particular field required creating an additional field that had to be maintained separately and required users to have knowledge of that separate field when querying. For example:
- Partitioning by day, month, or year on a timestamp column required you to create an additional column based on the timestamp expressing the year, month, or day in isolation.
- Partitioning by the first letter of a text value required you to create an additional column that only had that letter.
- Partitioning into buckets (a set number of divisions to evenly distribute records into based on a hash function) required you to create an additional column that stated which bucket the record belonged in.
You’d then set the partitioning at table creation to be based on the derived fields, and the files would be organized into subdirectories based on their partition:
--Spark SQL
CREATE TABLE MyHiveTable (...) PARTITIONED BY month;
You’d have to manually transform the value every time you inserted records:
INSERT INTO MyTable (SELECT MONTH(time) AS month, ... FROM data_source);
When querying the table, the engine would have no awareness of the relationship between the original field and the derived field. This would mean that the following query would benefit from partitioning:
SELECT *
FROM MYTABLE
WHERE time BETWEEN '2022-07-01 00:00:00' AND '2022-07-31 00:00:00'
AND month = 7;
However, users often aren’t aware of this workaround column (and they shouldn’t have to be). This means that most of the time, users would issue a query similar to the following, which would result in a full table scan, making the query take much longer to complete and consume far more resources:
SELECT *
FROM MYTABLE
WHERE time BETWEEN '2022-07-01 00:00:00' AND '2022-07-31 00:00:00';
The preceding query is more intuitive for a business user or data analyst using the data, as they may not be as aware of the internal engineering of the table, resulting in many accidental full table scans. This is where Iceberg’s hidden partitioning capability comes in.
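With hidden partitioning, the partition spec declares a transform on the source column itself, so the engine applies the transform automatically and prunes partitions from filters on the raw column, with no helper column to maintain. As a minimal sketch with a hypothetical table and only the columns needed for illustration:

-- Spark SQL: partition on a transform of the timestamp column itself
CREATE TABLE catalog.MyTable (
    id bigint,
    time timestamp)
USING iceberg
PARTITIONED BY (months(time));

-- A filter on the raw timestamp column is enough for partition pruning;
-- no derived month column is needed
SELECT *
FROM catalog.MyTable
WHERE time BETWEEN '2022-07-01 00:00:00' AND '2022-07-31 00:00:00';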
Partition Evolution
Another challenge with traditional partitioning is that since it relied on the physical structure of the files being laid out into subdirectories, changing how the table was partitioned required you to rewrite the entire table. This becomes an inevitable problem as data and query patterns evolve, necessitating that we rethink how we partition and sort the data.
Apache Iceberg solves this problem with its metadata-tracked partitioning as well, because the metadata tracks not only partition values but also historical partition schemes, allowing the partition schemes to evolve. So, if the data in two different files were written based on two different partition schemes, the Iceberg metadata would make the engine aware so that it could create a plan with partition scheme A separately from partition scheme B, creating an overall scan plan at the end.
For example, let’s say you have a table of membership records partitioned by the year in which members registered:
CREATE TABLE catalog.members (...)
PARTITIONED BY years(registration_ts) USING iceberg;
Then, several years later, the pace of membership growth made it worthwhile to start breaking the records down by month. You could alter the table to adjust the partitioning like so:
ALTER TABLE catalog.members ADD PARTITION FIELD months(registration_ts)
The neat thing about Apache Iceberg’s date-related partition transforms is that if you evolve to something granular, there is no need to remove the less granular partitioning rule. However, if you are using bucket or truncate and you decide you no longer want to partition the table by a particular field, you can update your partition scheme like so:
ALTER TABLE catalog.members DROP PARTITION FIELD bucket(24, id);
When a partitioning scheme is updated, it applies only to new data written to the table going forward, so there is no need to rewrite the existing data. Also, keep in mind that any data rewritten by the rewriteDataFiles procedure will be rewritten using the new partitioning scheme, so if you want to keep older data in the old scheme, make sure to use the proper filters in your compaction jobs to not rewrite it.
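For example, if the monthly partition field was added at the start of 2023 and new rows arrive with current registration timestamps, a where filter can scope compaction to just the newer data so that older files keep their original year-based layout. A minimal sketch under those assumptions:

-- Only compact rows written under the new (monthly) partition scheme;
-- rows registered before 2023 keep their original layout
CALL catalog.system.rewrite_data_files(
  table => 'members',
  strategy => 'binpack',
  where => 'registration_ts >= "2023-01-01 00:00:00"'
)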
Other Partitioning Considerations
Say you migrate a Hive table using the migrate procedure (discussed in Chapter 13). It may currently be partitioned on a derived column (e.g., a month column based on a timestamp column in the same table), but you want to express to Apache Iceberg that it should use an Iceberg transform instead. There is a REPLACE PARTITION FIELD command for just this purpose:
ALTER TABLE catalog.members REPLACE PARTITION FIELD registration_day WITH days(registration_ts) AS day_of_registration;
This will not alter any datafiles, but it will allow the metadata to track the partition values using Iceberg transforms.
You can optimize tables in many ways. For example, using partitioning to write data with unique values to unique files, sorting the data in those files, and then making sure to compact those files into fewer larger files will keep your table performance nice and crisp. Although it’s not always about general use optimization, there are particular use cases, such as row-level updates and deletes, that you can optimize for as well using copy-on-write and merge-on-read.
Copy-on-Write Versus Merge-on-Read
Another consideration when it comes to the speed of your workloads is how you handle row-level updates. When you are adding new data, it just gets added to a new datafile, but when you want to update preexisting rows to either update or delete them, there are some considerations you need to be aware of:
- In data lakes, and therefore in Apache Iceberg, datafiles are immutable, meaning they can’t be changed. This provides lots of benefits, such as the ability to achieve snapshot isolation (since files that old snapshots refer to will have consistent data).
- If you’re updating 10 rows, there is no guarantee they are in the same file, so you may have to rewrite 10 files and every row of data in them to update 10 rows for the new snapshot.
There are three approaches to dealing with row-level updates, covered in detail throughout this section and summarized in Table 4-2.
Update style | Read speed | Write speed | Best practice
-------------|------------|-------------|--------------
Copy-on-write | Fastest reads | Slowest updates/deletes |
Merge-on-read (position deletes) | Fast reads | Fast updates/deletes | Use regular compaction to minimize read costs.
Merge-on-read (equality deletes) | Slow reads | Fastest updates/deletes | Use more frequent compaction to minimize read costs.
Copy-on-Write
The default approach is referred to as copy-on-write (COW). In this approach, if even a single row in a datafile is updated or deleted, that datafile is rewritten, and the new file takes its place in the new snapshot. You can see this exemplified in Figure 4-10.
This is ideal if you’re optimizing for reads because read queries can just read the data without having to reconcile any deleted or updated files. However, if your workloads consist of very regular row-level updates, rewriting entire datafiles for those updates may slow down your updates beyond what your SLAs allow. The pros of this approach include faster reads, while the cons involve slower row-level updates and deletes.
Merge-on-Read
The alternative to copy-on-write is merge-on-read (MOR), where instead of rewriting an entire datafile, you capture in a delete file the records to be updated in the existing file, with the delete file tracking which records should be ignored.
If you are deleting a record:

- The record is listed in a delete file.
- When a reader reads the table, it will reconcile the datafile with the delete file.

If you are updating a record:

- The record to be updated is tracked in a delete file.
- A new datafile is created with only the updated record.
- When a reader reads the table, it will ignore the old version of the record because of the delete file and use the new version in the new datafile.
This is depicted in Figure 4-11.
This avoids the need to rewrite unchanged records to new files just because they exist in a datafile with a record to be updated, speeding up the write transaction. But it comes at the cost of slower reads, as queries will have to scan the delete files to know which records to ignore in the proper datafiles.
To minimize the cost of reads, you’ll want to run regular compaction jobs, and to keep those compaction jobs running efficiently, you’ll want to take advantage of some of the properties you learned before:
- Use a filter/where clause to only run compaction on the files ingested in the last time frame (hour, day).
- Use partial progress mode to make commits as file groups are rewritten so that readers can start seeing marginal improvements sooner rather than later.
Using these techniques, you can speed up the write side of heavy update workloads while minimizing the costs to read performance. The advantage of this approach includes faster row-level updates, but this comes with the drawback of slower reads due to the need to reconcile delete files.
When doing MOR writes, delete files enable you to track which records need to be ignored in existing datafiles for future reads. We’ll use an analogy to help you understand the high-level differences between the types of delete files. (Keep in mind that which type of delete file gets written is usually decided by the engine for a particular use case, not by table settings.)
When you have a ton of data and you want to kick out a specific row, you have a couple of options:
- You can look for the row data based on where it’s sitting in the dataset, kind of like finding your friend in a movie theater by their seat number.
- You can look for the row data based on what it’s made of, like picking out your friend in a crowd because they’re wearing a bright red hat.
If you use the first option, you’ll use what are called positional delete files. But if you use the second option, you’ll need equality delete files. Each method has its own strengths and weaknesses. This means that depending on the situation, you might want to pick one over the other. It’s all about what works best for you!
Let’s explore these two types of delete files. Position deletes track which rows in which files should be ignored. The following table is an example of how this data is laid out in a position delete file:
Row to delete (position deletes)

Filepath | Position
---------|---------
001.parquet | 0
001.parquet | 5
006.parquet | 5
When reading the specified files, the position delete file will skip the row at the specified position. This requires a much smaller cost at read time since it has a pretty specific point at which it must skip a row. However, this has write time costs, since the writer of the delete file will need to know the position of the deleted record, which requires it to read the file with the deleted records to identify those positions.
Equality deletes instead specify values that, if a record matches, should be ignored. The following table shows how the data in an equality delete file may be laid out:
Rows to delete (equality deletes)

Team | State
-----|------
Yellow | NY
Green | MA
This requires no write time costs since you don’t need to open and read files to track the targeted values, but it has much greater read time costs. The read time costs exist because there is no information about where records with matching values exist, so when reading the data, a comparison has to be made against every record that could possibly contain a matching value. Equality deletes are great if you need the highest write speed possible, but aggressive compaction should be planned to reconcile those equality deletes to reduce the impact on your reads.
Configuring COW and MOR
Whether a table is configured to handle row-level updates via COW or MOR depends on the following:
- The table properties
- Whether the engine you use to write to Apache Iceberg supports MOR writes
The following table properties determine whether a particular transaction is handled via COW or MOR:
- write.delete.mode: Approach to use for delete transactions
- write.update.mode: Approach to use for update transactions
- write.merge.mode: Approach to use for merge transactions
Keep in mind that for this and all Apache Iceberg table properties, while many are part of the specification, it is still on the specific compute engine to honor the specification. You may run into different behavior, so read up on which table properties are honored by engines you use for particular jobs that use those properties. Query engine developers will have every intention of honoring all Apache Iceberg table properties, but this does require implementations for the specific engine’s architecture. Over time, engines should have all these properties honored so that you get the same behavior across all engines.
Since Apache Spark support for Apache Iceberg is handled from within the Apache Iceberg project, all these properties are honored from within Spark, and they can be set at the creation of a table in Spark like so:
CREATE TABLE catalog.people (
    id int,
    first_name string,
    last_name string
)
TBLPROPERTIES (
    'write.delete.mode'='copy-on-write',
    'write.update.mode'='merge-on-read',
    'write.merge.mode'='merge-on-read'
)
USING iceberg;
These properties can also be set after the table is created using an ALTER TABLE statement:
ALTER TABLE catalog.people SET TBLPROPERTIES (
    'write.delete.mode'='merge-on-read',
    'write.update.mode'='copy-on-write',
    'write.merge.mode'='copy-on-write'
);
It’s as simple as that. But remember that when working with non–Apache Spark engines, support for these properties can vary, so confirm that each engine you use honors the COW or MOR modes you have configured.
Other Considerations
Beyond your datafiles and how they are organized, there are many levers for improving performance. We will discuss many of them in the following sections.
Metrics Collection
As discussed in Chapter 2, the manifest for each group of datafiles is tracking metrics for each field in the table to help with min/max filtering and other optimizations. The types of column-level metrics that are tracked include:
- Counts of values, null values, and distinct values
- Upper and lower bound values
If you have very wide tables (i.e., tables with lots of fields; e.g., 100+), the number of metrics being tracked can start to become a burden on reading your metadata. Fortunately, using Apache Iceberg’s table properties, you can fine-tune which columns have their metrics tracked and which columns don’t. This way, you can track metrics on columns that are often used in query filters and not capture metrics on ones that aren’t, so their metrics data doesn’t bloat your metadata.
You can tailor the level of metrics collection for the columns you want (you don’t need to specify all of them) using table properties, like so:
ALTER TABLE catalog.db.students SET TBLPROPERTIES (
    'write.metadata.metrics.column.col1'='none',
    'write.metadata.metrics.column.col2'='full',
    'write.metadata.metrics.column.col3'='counts',
    'write.metadata.metrics.column.col4'='truncate(16)'
);
As you can see, you can set how the metrics are collected for each individual column to several potential values:
- none: Don’t collect any metrics.
- counts: Only collect counts (values, distinct values, null values).
- truncate(XX): Count and truncate the value to a certain number of characters, and base the upper/lower bounds on that. So, for example, a string column may be truncated to 16 characters and have its metadata value ranges based on the abbreviated string values.
- full: Base the counts and upper/lower bounds on the full value.
You don’t need to set this explicitly for every column as, by default, Iceberg sets this to truncate(16).
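For very wide tables, it can be simpler to lower the default for all columns and then opt back in for the handful of columns your queries actually filter on. The following sketch assumes the write.metadata.metrics.default table property and a hypothetical student_id column; verify the property against the Iceberg version you are running:

-- Assumed property names: lighter default metrics for every column,
-- with full metrics only for the column used in query filters
ALTER TABLE catalog.db.students SET TBLPROPERTIES (
    'write.metadata.metrics.default'='counts',
    'write.metadata.metrics.column.student_id'='full'
);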
Rewriting Manifests
Sometimes the issue isn’t your datafiles, as they are well sized and contain well-sorted data; it’s that they’ve been written across several snapshots, so each individual manifest may list only a handful of datafiles. While manifests are more lightweight, more manifests still means more file operations. There is a separate rewriteManifests procedure to rewrite only the manifest files so that you have a smaller total number of manifest files, and those manifest files list a large number of datafiles:
CALL catalog.system.rewrite_manifests('MyTable')
If you run into any memory issues while running this operation, you can turn off Spark caching by passing a second argument of false. If you are rewriting lots of manifests and they are being cached by Spark, it could result in issues with individual executor nodes:
CALL catalog.system.rewrite_manifests('MyTable', false)
A good time to run this operation is when your datafile sizes are optimal but the number of manifest files isn’t. For example, if you have 5 GB of data in one partition split among 10 datafiles but those files are listed across five manifest files, you don’t need to rewrite the datafiles, but you can probably consolidate the listing of those 10 files into a single manifest.
Optimizing Storage
As you make updates to the table or run compaction jobs, new files are created, but old files aren’t being deleted since those files are associated with historical snapshots of the table. To prevent storing a bunch of unneeded data, you should periodically expire snapshots. Keep in mind that you cannot time-travel to an expired snapshot. During expiration, any datafiles not associated with still-valid snapshots will get deleted.
You can expire snapshots that were created on or before a particular timestamp:
CALL catalog.system.expire_snapshots('MyTable', TIMESTAMP '2023-02-01 00:00:00.000', 100)
The second argument is the timestamp cutoff and the third is the minimum number of snapshots to retain (by default, only snapshots older than five days are eligible for expiration), so the procedure will only expire snapshots that are on or before the timestamp, and even then, a snapshot that falls within the 100 most recent snapshots will not be expired.
You can also expire particular snapshot IDs:
CALL catalog.system.expire_snapshots(table => 'MyTable', snapshot_ids => ARRAY(53))
In this example, a snapshot with the ID of 53 is expired. We can look up the snapshot ID by opening the metadata.json file and examining its contents or by using the metadata tables detailed in Chapter 10. You may have a snapshot where you expose sensitive data by accident and want to expire that single snapshot to clean up the datafiles created in that transaction. This would give you that flexibility. Expirations are a transaction, so a new metadata.json file is created with an updated list of valid snapshots.
There are six arguments that can be passed to the expire_snapshots procedure:
- table: Table to run the operation on
- older_than: Expires all snapshots on or before this timestamp
- retain_last: Minimum number of snapshots to retain
- snapshot_ids: Specific snapshot IDs to expire
- max_concurrent_deletes: Number of threads to use for deleting files
- stream_results: When true, sends deleted files to the Spark driver by Resilient Distributed Dataset (RDD) partition, which is useful for avoiding OOM issues when deleting large files
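As a sketch of how the arguments above fit together (the table name, cutoff, and retention count here are hypothetical), a retention policy that expires old snapshots while always keeping the 50 most recent and streaming results to avoid memory pressure might look like this:

-- Hypothetical retention policy using named arguments
CALL catalog.system.expire_snapshots(
  table => 'MyTable',
  older_than => TIMESTAMP '2023-02-01 00:00:00.000',
  retain_last => 50,
  stream_results => true
)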
Another consideration when optimizing storage is orphan files. These are files and artifacts that accumulate in the table’s data directory but are not tracked in the metadata tree because they were written by failed jobs. These files will not be cleaned up by expiring snapshots, so a special procedure should sporadically be run to deal with this. This procedure will look at every file in your table’s default location and assess whether it relates to active snapshots. This can be an intensive process (which is why you should only do it sporadically). To delete orphan files, run a command such as the following:
CALL catalog.system.remove_orphan_files(table => 'MyTable')
You can pass the following arguments to the removeOrphanFiles procedure:
- table: Table to operate on
- older_than: Only deletes files created on or before this timestamp
- location: Where to look for orphan files; defaults to the table’s default location
- dry_run: Boolean; if true, won’t delete files, but will return a list of what would be deleted
- max_concurrent_deletes: The maximum number of threads to use for deleting files
While for most tables the data will be located in its default location, there are times you may add external files via the addFiles
procedure (covered in Chapter 13) and later may want to clean artifacts in these directories. This is where the location argument comes in.
Write Distribution Mode
Write distribution mode requires an understanding of how massively parallel processing (MPP) systems handle writing files. These systems distribute the work across several nodes, each doing a job or task. The write distribution is how the records to be written are distributed across these tasks. If no specific write distribution mode is set, data will be distributed arbitrarily. The first X number of records will go to the first task, the next X number to the next task, and so on.
Each task is processed separately, so that task will create at least one file for each partition it has at least one record for. Therefore, if you have 10 records that belong in partition A distributed across 10 tasks, you will end up with 10 files in that partition with one record each, which isn’t ideal.
It would be better if all the records for that partition were allocated to the same tasks so that they can be written to the same file. This is where the write distribution comes in, that is, how the data is distributed among tasks. There are three options:
- none: There is no special distribution. This is the fastest during write time and is ideal for presorted data.
- hash: The data is hash-distributed by partition key.
- range: The data is range-distributed by partition key or sort order.
In a hash distribution, the value of each record is put through a hash function and grouped together based on the result. Multiple values may end up in the same grouping based on the hash function. For example, if you have the values 1, 2, 3, 4, 5, and 6 in your data, you may get a hash distribution of data with 1 and 4 in task A, 2 and 5 in task B, and 3 and 6 in task C. You’ll still write the smallest number of files needed for all your partitions, but less sequential writing will be involved.
In a range distribution, the data is sorted and distributed, so you’d likely have values 1 and 2 in task A, 3 and 4 in task B, and 5 and 6 in task C. This sorting will be done by the partition value or by the SortOrder
if the table has one. In other words, if a SortOrder
is specified, data will be grouped into tasks not just by partition value but also by the value of the SortOrder
field. This is ideal for data that can benefit from clustering on certain fields. However, sorting the data for distribution sequentially has more overhead than throwing the data in a hash function and distributing it based on the output.
There is also a write distribution property to specify the behavior for deletes, updates, and merges:
ALTER TABLE catalog.MyTable SET TBLPROPERTIES (
    'write.distribution-mode'='hash',
    'write.delete.distribution-mode'='none',
    'write.update.distribution-mode'='range',
    'write.merge.distribution-mode'='hash'
);
In a situation where you are regularly updating many rows but rarely deleting rows, you may want to have different distribution modes, as a different distribution mode may be more advantageous depending on your query patterns.
Object Storage Considerations
Object storage is a unique take on storing data. Instead of keeping files in a neat folder structure such as a traditional filesystem, object storage tosses everything into what are called buckets. Each file becomes an object and gets a bunch of metadata tagged along with it. This metadata tells us all sorts of things about the file and enables improved concurrency and resiliency, since the underlying files can be replicated for regional access or to handle concurrent requests while all users simply interact with a single logical “object.”
When you want to grab a file from object storage, you’re not clicking through folders. Instead, you’re using APIs. Just like you’d use a GET or PUT request to interact with a website, you’re doing the same here to access your data. For example, you’d use a GET request to ask for a file, the system checks the metadata to find the file, and voilà, you’ve got your data.
This API-first approach helps the system juggle your data, like making copies in different places or dealing with loads of requests at the same time. Object storage, which most cloud vendors provide, is ideal for data lakes and data lakehouses, but it has one potential bottleneck.
Because of the architecture and how object stores handle parallelism, there are often limits on how many requests can go to files under the same “prefix.” Therefore, if you wanted to access /prefix1/fileA.txt and /prefix1/fileB.txt, even though they are different files, accessing both counts toward the limit on prefix1. This becomes a problem in partitions with lots of files, as queries can result in many requests to these partitions and can then run into throttling, which slows down the query.
Running compaction to limit the number of files in a partition can help, but Apache Iceberg is uniquely suited for this scenario since it doesn’t rely on how its files are physically laid out, meaning it can write files in the same partition across many prefixes.
You can enable this in your table properties like so:
ALTER TABLE catalog.MyTable SET TBLPROPERTIES (
    'write.object-storage.enabled'= true
);
This will distribute files in the same partition across many prefixes, including a hash to avoid potential throttling.
So, instead of this:
s3://bucket/database/table/field=value1/datafile1.parquet
s3://bucket/database/table/field=value1/datafile2.parquet
s3://bucket/database/table/field=value1/datafile3.parquet
you’ll get this:
s3://bucket/4809098/database/table/field=value1/datafile1.parquet
s3://bucket/5840329/database/table/field=value1/datafile2.parquet
s3://bucket/2342344/database/table/field=value1/datafile3.parquet
With the hash in the filepath, each file in the same partition is now treated as if it were under a different prefix, thereby avoiding throttling.
Datafile Bloom Filters
A bloom filter is a way of knowing whether a value possibly exists in a dataset. Imagine a lineup of bits (those 0s and 1s in binary code), all set to a length you decide. Now, when you add data to your dataset, you run each value through a process called a hash function. This function spits out a spot on your bit lineup, and you flip that bit from a 0 to a 1. This flipped bit is like a flag that says, “Hey, a value that hashes to this spot might be in the dataset.”
For example, let’s say we feed 1,000 records through a bloom filter that has 10 bits. When it’s done, our bloom filter might look like this:
[0,1,1,0,0,1,1,1,1,0]
Now let’s say we want to find a certain value; we’ll call it X. We put X through the same hash function, and it points us to spot number 3 on our bit lineup. According to our bloom filter, there’s a 1 in that third spot. This means there’s a chance our value X could be in the dataset because a value hashed to this spot before. So we go ahead and check the dataset to see if X is really there.
Now let’s look for a different value; we’ll call it Y. When we run Y through our hash function, it points us to the fourth spot on our bit lineup. But our bloom filter has a 0 there, which means no value hashed to this spot. So we can confidently say that Y is definitely not in our dataset, and we can save time by not digging through the data.
Bloom filters are handy because they can help us avoid unnecessary data scans. If we want to make them more precise, we can add more hash functions and bits. But remember, the more we add, the bigger our bloom filter gets, and the more space it will need. As with most things in life, it’s a balancing act. Everything is a trade-off.
You can enable the writing of bloom filters for a particular column in your Parquet files (this can also be done for ORC files) via your table properties:
ALTER TABLE catalog.MyTable SET TBLPROPERTIES (
    'write.parquet.bloom-filter-enabled.column.col1'= true,
    'write.parquet.bloom-filter-max-bytes'= 1048576
);
Then engines querying your data may take advantage of these bloom filters to help make reading the datafiles even faster by skipping datafiles where bloom filters clearly indicate that the data you need doesn’t exist.
Conclusion
This chapter explored various strategies for optimizing the performance of Iceberg tables. We looked at critical table performance optimization methods such as compaction, sorting, z-ordering, copy-on-write versus merge-on-read mechanisms, and hidden partitioning. Each of these components plays a pivotal role in enhancing query efficiency, reducing read and write times, and ensuring optimal utilization of resources. Understanding and implementing these strategies effectively can lead to significant improvements in the management and operation of Apache Iceberg tables.
In Chapter 5, we’ll explore the concept of an Iceberg catalog, helping us make sure our Iceberg tables are portable and discoverable between our tools.