Chapter 2. The Architecture of Apache Iceberg
In this chapter, we’ll look under the covers of an Iceberg table to discuss the architecture and specification that enable Apache Iceberg to resolve the problems inherent in the Hive table format. We’ll cover the different structures of an Iceberg table and what each one provides and enables, so that you understand what’s happening under the hood and can best architect your Apache Iceberg–based lakehouse.
As mentioned in Chapter 1, there are three different layers of an Apache Iceberg table: the catalog layer, the metadata layer, and the data layer. Figure 2-1 shows the different components that make up each layer.
In the following sections, we’ll go through each of these components in detail. Since new concepts are often easier to understand when you start from a familiar one, we’ll work from the bottom up, starting with the data layer.
Figure 2-1. The architecture of an Apache Iceberg table
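To make the three layers concrete before going through them one by one, here is a minimal Python sketch of how a reader might walk from the catalog layer, through the metadata layer, down to the data layer. It is a simplified illustration, not Iceberg’s actual implementation or API: the class names, the `plan_scan` helper, the table name, and the file paths are all hypothetical, and real metadata structures carry far more information (snapshots, schemas, partition specs, column statistics, and so on).

```python
from __future__ import annotations

from dataclasses import dataclass, field


# --- Data layer (simplified): the files that hold the table's rows ---
@dataclass
class DataFile:
    path: str          # hypothetical path to a datafile
    record_count: int  # number of rows stored in the file


# --- Metadata layer (simplified): files that track the datafiles ---
@dataclass
class ManifestFile:
    data_files: list[DataFile]     # datafiles tracked by this manifest


@dataclass
class ManifestList:
    manifests: list[ManifestFile]  # manifests that make up one snapshot


@dataclass
class MetadataFile:
    location: str                  # hypothetical path of the metadata file
    current_snapshot: ManifestList


# --- Catalog layer (simplified): table name -> current metadata file ---
@dataclass
class Catalog:
    tables: dict[str, MetadataFile] = field(default_factory=dict)

    def load_table(self, name: str) -> MetadataFile:
        return self.tables[name]


def plan_scan(catalog: Catalog, table_name: str) -> list[str]:
    """Walk catalog -> metadata -> data and return the datafile paths."""
    metadata = catalog.load_table(table_name)
    return [
        data_file.path
        for manifest in metadata.current_snapshot.manifests
        for data_file in manifest.data_files
    ]


# Build a tiny, made-up table and plan a scan over it.
catalog = Catalog()
catalog.tables["db.orders"] = MetadataFile(
    location="metadata/v1.metadata.json",
    current_snapshot=ManifestList(
        manifests=[
            ManifestFile([DataFile("data/part-0.parquet", 100)]),
            ManifestFile([DataFile("data/part-1.parquet", 250)]),
        ]
    ),
)

paths = plan_scan(catalog, "db.orders")
print(paths)
```

The point of the sketch is only the direction of travel: a reader starts at the catalog, follows pointers through the metadata layer, and ends with a list of datafiles to read from the data layer.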
The Data Layer
The data layer of an Apache Iceberg table stores the table’s actual data. It is primarily made up of the datafiles themselves, although delete files are also included. The data layer is what provides the user with the data needed for their query. While there are some exceptions where structures in the metadata layer can provide a result (e.g., get me the max value for column X), most commonly the ...