Apache Iceberg: The Definitive Guide

by Tomer Shiran, Jason Hughes, Alex Merced

May 2024

Intermediate to advanced

344 pages

8h 40m

English

O'Reilly Media, Inc.

About This Book | Why We Wrote This Book | What You Will Find Inside | How to Use This Book | Feedback and Questions | Conventions Used in This Book | Using Code Examples | O’Reilly Online Learning | How to Contact Us | Acknowledgments
How Did We Get Here? A Brief History | Foundational Components of a System Designed for OLAP Workloads | Bringing It All Together | The Data Warehouse | A Brief History | Pros and Cons of a Data Warehouse | The Data Lake | A Brief History | Pros and Cons of a Data Lake | Should I Run Analytics on a Data Lake or a Data Warehouse? | The Data Lakehouse | What Is a Table Format? | Hive: The Original Table Format | Modern Data Lake Table Formats | What Is Apache Iceberg? | How Apache Iceberg Came to Be | The Apache Iceberg Architecture | Key Features of Apache Iceberg | Conclusion
The Data Layer | Datafiles | Delete Files | The Metadata Layer | Manifest Files | Manifest Lists | Metadata Files | Puffin Files | The Catalog | Conclusion
Writing Queries in Apache Iceberg | Create the Table | Insert the Query | Merge Query | Reading Queries in Apache Iceberg | The SELECT Query | The Time-Travel Query | Conclusion
Compaction | Hands-on with Compaction | Compaction Strategies | Automating Compaction | Sorting | Z-order | Partitioning | Hidden Partitioning | Partition Evolution | Other Partitioning Considerations | Copy-on-Write Versus Merge-on-Read | Copy-on-Write | Merge-on-Read | Configuring COW and MOR | Other Considerations | Metrics Collection | Rewriting Manifests | Optimizing Storage | Write Distribution Mode | Object Storage Considerations | Datafile Bloom Filters | Conclusion
Requirements of an Iceberg Catalog | Catalog Comparison | The Hadoop Catalog | The Hive Catalog | The AWS Glue Catalog | The Nessie Catalog | The REST Catalog | The JDBC Catalog | Other Catalogs | Catalog Migration | Using the Apache Iceberg Catalog Migration CLI | Using an Engine | Conclusion

Configuration | Configuring Apache Iceberg and Spark | Configuring the Catalogs | Starting Spark with All the Configurations (AWS Glue Example) | Data Definition Language Operations | CREATE TABLE | ALTER TABLE | Alter a Table with Iceberg’s Spark SQL Extensions | DROP TABLE | Reading Data | The Select All Query | The Filter Rows Query | Aggregation Queries | Using Window Functions | Writing Data | INSERT INTO | MERGE INTO | INSERT OVERWRITE | DELETE FROM | UPDATE | Iceberg Table Maintenance Procedures | Expire Snapshots | Rewrite Datafiles | Rewrite Manifests | Remove Orphan Files | Conclusion
Configuration | Data Definition Language Operations | CREATE TABLE | ALTER TABLE | DROP TABLE | Reading Data | Using the SELECT Query | Filtering Rows | Using Aggregated Queries | Using Window Functions | Writing Data | INSERT INTO | COPY INTO | MERGE INTO | DELETE | UPDATE | Iceberg Table Maintenance | Expire Snapshots | Rewrite Datafiles | Rewrite Manifests | Conclusion
Configuration | Creating a Glue Database | Configuring the Glue ETL Job | Create a Table Using the Glue Data Catalog | Read the Table | Insert the Data | Conclusion
Configuration | Prerequisites | Start the Flink Cluster and Flink SQL Client | Data Definition Language Operations | CREATE CATALOG | CREATE DATABASE | CREATE TABLE | ALTER TABLE | DROP TABLE | Reading Data | Flink SQL Batch Read | Flink SQL Streaming Read | Metadata Table | Writing Data | INSERT INTO | INSERT OVERWRITE | UPSERT | Flink DataFrame and Table API with Apache Iceberg Tables | Prerequisites | Configuring the Flink Job | Starting the Cluster and Building the Package | Running the Job | Conclusion
Apache Iceberg Metadata Tables | The history Metadata Table | The metadata_log_entries Metadata Table | The snapshots Metadata Table | The files Metadata Table | The manifests Metadata Table | The partitions Metadata Table | The all_data_files Metadata Table | The all_manifests Metadata Table | The refs Metadata Table | The entries Metadata Table | Using the Metadata Tables in Conjunction | Isolation of Changes with Branches | Table Branching and Tagging | Catalog Branching and Tagging | Multitable Transactions | Rolling Back Changes | Rolling Back at the Table Level | Rolling Back at the Catalog Level | Conclusion
Streaming with Spark | Streaming into Iceberg with Spark | Streaming from Iceberg with Spark | Streaming with Flink | Streaming into Iceberg with Flink | Example of Streaming into Iceberg with Flink | Streaming with Kafka Connect | The Iceberg Kafka Sink | Streaming with AWS | Conclusion
Securing Datafiles | Securing Files: Best Practices | Hadoop Distributed File System | Amazon Simple Storage Service | Azure Data Lake Storage | Google Cloud Storage | Securing and Governing at the Semantic Layer | Semantic Layer Best Practices | Dremio | Trino | Securing and Governing at the Catalog Level | Nessie | Tabular | AWS Glue and Lake Formation | Additional Security and Governance Considerations | Conclusion
Migration Considerations | Three-Step In-Place Migration Plan | Four-Phase Shadow Migration Plan | Migrating Hive Tables to Apache Iceberg | The Snapshot Procedure | The Migrate Procedure | Migrating Delta Lake to Apache Iceberg | Migrating Apache Hudi to Apache Iceberg | Migrating Individual Files to Apache Iceberg | Using the add_files Procedure | Migrating from Delta Lake or Apache Hudi Without Preserving History | Migrating from Anywhere by Rewriting Data | Migrating Data to a New Iceberg Table | Migrating Data into an Existing Iceberg Table | Conclusion
Ensuring High-Quality Data with Write-Audit-Publish in Apache Iceberg | WAP Using Iceberg’s Branching Feature | Running BI Workloads on the Data Lake | Land the Raw Data into the Data Lake | Curate Virtual Data Marts/Data Products | Create a Reflection to Accelerate Our Dashboard | Connect Our View to Our BI Tool | Benefits of Running BI Workloads on the Data Lake | Implementing Change Data Capture with Apache Iceberg | Create Apache Iceberg Tables | Apply Updates from Operational Systems | Create the Change Log View to Capture Changes | Merge Changed Data in the Aggregated Table | Conclusion

Content preview from Apache Iceberg: The Definitive Guide

Chapter 13. Migrating to Apache Iceberg

Organizations are constantly seeking innovative solutions to manage their data more efficiently and effectively. Apache Iceberg has emerged as a powerful framework for data lakes, offering a high-performance table format that operates like a relational database management system (RDBMS) table. This chapter delves into the process of migrating your data architecture to leverage the benefits of Apache Iceberg.

Why would you migrate to Apache Iceberg?

You don’t have a data lakehouse or are using the Hive table format: Apache Iceberg will supercharge the data on your data lake with ACID transactions, schema/partition evolution, time travel, and more, effectively turning your data lake into a data lakehouse that gives you the flexibility of data lakes with the performance/features of data warehouses.
Iceberg offers unique benefits over other table formats: Apache Iceberg’s distinguishing features include an open specification, open source libraries, transparent and diverse project governance, no vendor lock-in, and a diverse ecosystem.

While migrating to Apache Iceberg promises a more streamlined data architecture, the process itself, as with any migration, can be intricate and demanding. The transition involves adapting existing data structures, modifying data ingestion pipelines, and updating data processing workflows. Moreover, organizations may need to refactor existing data models and restructure data storage in Iceberg-compatible ...