Chapter 11. Streaming with Apache Iceberg

Streaming data is data that is generated and processed continuously, often from many sources at once, such as logfiles, sensors, social media feeds, and financial transactions. It arrives in small units (individual records or packets) so that systems can produce insights and react in real time. Streaming data is always in motion and has no finite beginning or end.

The concept of streaming data is essential in the current age of digital information, where businesses, research institutions, and government agencies often need to analyze and make decisions based on the freshest data possible. For example, financial institutions may use streaming data to detect fraudulent transactions as they occur. Similarly, social media platforms use streaming data to customize and update user feeds based on real-time engagement metrics.

There are several reasons why one might want to stream data into an Apache Iceberg table:

Scalability and performance

Apache Iceberg is designed to store and retrieve data efficiently in large datasets. Its file management procedures help maintain performance as a dataset continually changes and grows, which makes it a good fit for streaming analytics.

Schema evolution

As data changes over time, the structure of the data (the schema) may need to evolve as well. Apache Iceberg allows for schema evolution without interrupting ongoing data streaming processes, ...
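
To make the idea of schema evolution concrete, here is a minimal toy model in plain Python. It is not the Iceberg API: the names Field and read_row are hypothetical, and the data is invented for illustration. It relies only on the fact that Iceberg tracks each column by a unique field ID rather than by name, so records written under an older schema remain readable after a column is added or renamed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a column: a stable ID plus a display name."""
    field_id: int
    name: str


def read_row(stored, schema):
    """Project a stored row (keyed by field ID) onto the given schema.

    Columns missing from the stored row come back as None, which mimics
    reading older data after a new optional column has been added.
    """
    return {f.name: stored.get(f.field_id) for f in schema}


# Schema the stream started with.
schema_v1 = [Field(1, "event_id"), Field(2, "amount")]
# Later: add a column. Existing IDs are untouched.
schema_v2 = schema_v1 + [Field(3, "currency")]
# Later still: rename "amount" to "total". The ID stays 2.
schema_v3 = [Field(1, "event_id"), Field(2, "total"), Field(3, "currency")]

# Rows are stored by field ID, so they do not depend on column names.
old_row = {1: "e-100", 2: 12.5}             # written under schema_v1
new_row = {1: "e-101", 2: 7.0, 3: "USD"}    # written under schema_v2

rows = [read_row(r, schema_v3) for r in (old_row, new_row)]
for row in rows:
    print(row)
```

In this sketch the writer never has to stop or rewrite old rows: the reader resolves names through IDs at read time, which is the property that lets a streaming job keep running while the schema changes.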
