Apache Iceberg: The Definitive Guide

by Tomer Shiran, Jason Hughes, Alex Merced

May 2024

Intermediate to advanced

344 pages

8h 40m

English

O'Reilly Media, Inc.

Foreword by Gerrit Kazmaier
Foreword by Raghu Ramakrishnan
Foreword by Rick Sears
Preface
  About This Book
  Why We Wrote This Book
  What You Will Find Inside
  How to Use This Book
  Feedback and Questions
  Conventions Used in This Book
  Using Code Examples
  O’Reilly Online Learning
  How to Contact Us
  Acknowledgments
I. Fundamentals of Apache Iceberg
1. Introduction to Apache Iceberg
  How Did We Get Here? A Brief History
  Foundational Components of a System Designed for OLAP Workloads
  Bringing It All Together
  The Data Warehouse
  A Brief History
  Pros and Cons of a Data Warehouse
  The Data Lake
  A Brief History
  Pros and Cons of a Data Lake
  Should I Run Analytics on a Data Lake or a Data Warehouse?
  The Data Lakehouse
  What Is a Table Format?
  Hive: The Original Table Format
  Modern Data Lake Table Formats
  What Is Apache Iceberg?
  How Apache Iceberg Came to Be
  The Apache Iceberg Architecture
  Key Features of Apache Iceberg
  Conclusion
2. The Architecture of Apache Iceberg
  The Data Layer
  Datafiles
  Delete Files
  The Metadata Layer
  Manifest Files
  Manifest Lists
  Metadata Files
  Puffin Files
  The Catalog
  Conclusion
3. Lifecycle of Write and Read Queries
  Writing Queries in Apache Iceberg
  Create the Table
  Insert the Query
  Merge Query
  Reading Queries in Apache Iceberg
  The SELECT Query
  The Time-Travel Query
  Conclusion
4. Optimizing the Performance of Iceberg Tables
  Compaction
  Hands-on with Compaction
  Compaction Strategies
  Automating Compaction
  Sorting
  Z-order
  Partitioning
  Hidden Partitioning
  Partition Evolution
  Other Partitioning Considerations
  Copy-on-Write Versus Merge-on-Read
  Copy-on-Write
  Merge-on-Read
  Configuring COW and MOR
  Other Considerations
  Metrics Collection
  Rewriting Manifests
  Optimizing Storage
  Write Distribution Mode
  Object Storage Considerations
  Datafile Bloom Filters
  Conclusion
5. Iceberg Catalogs
  Requirements of an Iceberg Catalog
  Catalog Comparison
  The Hadoop Catalog
  The Hive Catalog
  The AWS Glue Catalog
  The Nessie Catalog
  The REST Catalog
  The JDBC Catalog
  Other Catalogs
  Catalog Migration
  Using the Apache Iceberg Catalog Migration CLI
  Using an Engine
  Conclusion

II. Hands-on with Apache Iceberg
6. Apache Spark
  Configuration
  Configuring Apache Iceberg and Spark
  Configuring the Catalogs
  Starting Spark with All the Configurations (AWS Glue Example)
  Data Definition Language Operations
  CREATE TABLE
  ALTER TABLE
  Alter a Table with Iceberg’s Spark SQL Extensions
  DROP TABLE
  Reading Data
  The Select All Query
  The Filter Rows Query
  Aggregation Queries
  Using Window Functions
  Writing Data
  INSERT INTO
  MERGE INTO
  INSERT OVERWRITE
  DELETE FROM
  UPDATE
  Iceberg Table Maintenance Procedures
  Expire Snapshots
  Rewrite Datafiles
  Rewrite Manifests
  Remove Orphan Files
  Conclusion
7. Dremio’s SQL Query Engine
  Configuration
  Data Definition Language Operations
  CREATE TABLE
  ALTER TABLE
  DROP TABLE
  Reading Data
  Using the SELECT Query
  Filtering Rows
  Using Aggregated Queries
  Using Window Functions
  Writing Data
  INSERT INTO
  COPY INTO
  MERGE INTO
  DELETE
  UPDATE
  Iceberg Table Maintenance
  Expire Snapshots
  Rewrite Datafiles
  Rewrite Manifests
  Conclusion
8. AWS Glue
  Configuration
  Creating a Glue Database
  Configuring the Glue ETL Job
  Create a Table Using the Glue Data Catalog
  Read the Table
  Insert the Data
  Conclusion
9. Apache Flink
  Configuration
  Prerequisites
  Start the Flink Cluster and Flink SQL Client
  Data Definition Language Operations
  CREATE CATALOG
  CREATE DATABASE
  CREATE TABLE
  ALTER TABLE
  DROP TABLE
  Reading Data
  Flink SQL Batch Read
  Flink SQL Streaming Read
  Metadata Table
  Writing Data
  INSERT INTO
  INSERT OVERWRITE
  UPSERT
  Flink DataFrame and Table API with Apache Iceberg Tables
  Prerequisites
  Configuring the Flink Job
  Starting the Cluster and Building the Package
  Running the Job
  Conclusion
III. Apache Iceberg in Practice
10. Apache Iceberg in Production
  Apache Iceberg Metadata Tables
  The history Metadata Table
  The metadata_log_entries Metadata Table
  The snapshots Metadata Table
  The files Metadata Table
  The manifests Metadata Table
  The partitions Metadata Table
  The all_data_files Metadata Table
  The all_manifests Metadata Table
  The refs Metadata Table
  The entries Metadata Table
  Using the Metadata Tables in Conjunction
  Isolation of Changes with Branches
  Table Branching and Tagging
  Catalog Branching and Tagging
  Multitable Transactions
  Rolling Back Changes
  Rolling Back at the Table Level
  Rolling Back at the Catalog Level
  Conclusion
11. Streaming with Apache Iceberg
  Streaming with Spark
  Streaming into Iceberg with Spark
  Streaming from Iceberg with Spark
  Streaming with Flink
  Streaming into Iceberg with Flink
  Example of Streaming into Iceberg with Flink
  Streaming with Kafka Connect
  The Iceberg Kafka Sink
  Streaming with AWS
  Conclusion
12. Governance and Security
  Securing Datafiles
  Securing Files: Best Practices
  Hadoop Distributed File System
  Amazon Simple Storage Service
  Azure Data Lake Storage
  Google Cloud Storage
  Securing and Governing at the Semantic Layer
  Semantic Layer Best Practices
  Dremio
  Trino
  Securing and Governing at the Catalog Level
  Nessie
  Tabular
  AWS Glue and Lake Formation
  Additional Security and Governance Considerations
  Conclusion
13. Migrating to Apache Iceberg
  Migration Considerations
  Three-Step In-Place Migration Plan
  Four-Phase Shadow Migration Plan
  Migrating Hive Tables to Apache Iceberg
  The Snapshot Procedure
  The Migrate Procedure
  Migrating Delta Lake to Apache Iceberg
  Migrating Apache Hudi to Apache Iceberg
  Migrating Individual Files to Apache Iceberg
  Using the add_files Procedure
  Migrating from Delta Lake or Apache Hudi Without Preserving History
  Migrating from Anywhere by Rewriting Data
  Migrating Data to a New Iceberg Table
  Migrating Data into an Existing Iceberg Table
  Conclusion
14. Real-World Use Cases of Apache Iceberg
  Ensuring High-Quality Data with Write-Audit-Publish in Apache Iceberg
  WAP Using Iceberg’s Branching Feature
  Running BI Workloads on the Data Lake
  Land the Raw Data into the Data Lake
  Curate Virtual Data Marts/Data Products
  Create a Reflection to Accelerate Our Dashboard
  Connect Our View to Our BI Tool
  Benefits of Running BI Workloads on the Data Lake
  Implementing Change Data Capture with Apache Iceberg
  Create Apache Iceberg Tables
  Apply Updates from Operational Systems
  Create the Change Log View to Capture Changes
  Merge Changed Data in the Aggregated Table
  Conclusion
Index
About the Authors

Content preview from Apache Iceberg: The Definitive Guide

Chapter 2. The Architecture of Apache Iceberg

In this chapter, we’ll look under the covers of an Iceberg table to discuss the architecture and specification that enable Apache Iceberg to resolve the problems inherent in the Hive table format. We’ll cover the different structures of an Iceberg table and what each one provides and enables, so that you understand what’s happening under the hood and can best architect your Apache Iceberg–based lakehouse.

As mentioned in Chapter 1, there are three different layers of an Apache Iceberg table: the catalog layer, the metadata layer, and the data layer. Figure 2-1 shows the different components that make up each layer.

In the following sections, we’ll go through each of these components in detail. Since it can be easier to understand concepts new to you by starting with a familiar one, we’ll work from the bottom up, starting with the data layer.
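As a rough orientation before the details, the three layers and their components can be sketched as a small data structure. This is only an illustrative model, not an Iceberg API: the `Layer` class, the `ICEBERG_LAYERS` list, and the `bottom_up` helper are hypothetical names, and the component names are taken from this chapter's section titles.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Layer:
    """Hypothetical container for one layer of an Iceberg table."""
    name: str
    components: List[str] = field(default_factory=list)


# Listed top-down; component names come from this chapter's section titles.
ICEBERG_LAYERS = [
    Layer("catalog layer", ["catalog"]),
    Layer("metadata layer",
          ["metadata files", "manifest lists", "manifest files", "Puffin files"]),
    Layer("data layer", ["datafiles", "delete files"]),
]


def bottom_up(layers: List[Layer]) -> List[Layer]:
    """Return a new list with the layers in bottom-up order (data layer first)."""
    return list(reversed(layers))


if __name__ == "__main__":
    for layer in bottom_up(ICEBERG_LAYERS):
        print(f"{layer.name}: {', '.join(layer.components)}")
```

Running the script lists the data layer first, then the metadata layer, then the catalog, which is the order the sections follow.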

The Data Layer

The data layer of an Apache Iceberg table is what stores the actual data of the table and is primarily made up of the datafiles themselves, although delete files are also included. The data layer is what provides the user with the data needed for their query. While there are some exceptions where structures in the metadata layer can provide a result (e.g., get me the max value for column X), most commonly the ...

Publisher Resources

ISBN: 9781098148614

Figure 2-1. The architecture of an Apache Iceberg table
