Chapter 9. Apache Flink
Apache Flink is an efficient stream processing framework that can process both batch and real-time data with high throughput and low latency. It offers robust features such as event-time processing, exactly-once semantics, and diverse windowing mechanisms. Combining Apache Flink with Apache Iceberg brings several advantages. Iceberg capabilities such as snapshot isolation for reads and writes, support for multiple concurrent operations, ACID-compliant queries, and incremental reads allow Flink to perform operations that were historically difficult with older table formats. Together they provide an efficient and scalable platform for processing large-scale data, particularly for streaming use cases.
In this chapter, we will delve into hands-on usage of Apache Flink with Apache Iceberg. For most of the examples, we will configure and use the Flink SQL Client with an Iceberg catalog: running DDL commands and executing read and write queries. We will then show how to perform some of these operations using the Flink DataStream and Table APIs in Java. All of the examples can run on your local machine with the steps provided.
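To give a sense of what that looks like before we dive into setup, an Iceberg catalog can be registered from the Flink SQL Client with a `CREATE CATALOG` statement. The sketch below uses a Hadoop-type catalog; the catalog name and warehouse path are illustrative placeholders, so adjust them to your environment:

```sql
-- Register an Iceberg catalog backed by a Hadoop warehouse.
-- 'iceberg_catalog' and the warehouse path are placeholders.
CREATE CATALOG iceberg_catalog WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hadoop',
  'warehouse' = 'file:///tmp/iceberg/warehouse',
  'property-version' = '1'
);

-- Make it the active catalog for subsequent DDL and queries.
USE CATALOG iceberg_catalog;
```

Once the catalog is registered, tables created within it are ordinary Iceberg tables, readable and writable from Flink SQL as well as from other Iceberg-aware engines.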
Configuration
Let’s start by going over the basic configuration and setup of a Flink cluster, whether you are writing jobs in Java with standard Flink or in Python with PyFlink, which compiles jobs from Python down to Java.
Prerequisites
You can either download and unpack the latest binary from the official ...
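Once you have a Flink distribution downloaded, a minimal local setup might look like the following sketch. The version numbers, archive name, and jar path are assumptions, so match them to the Flink and Iceberg releases you actually downloaded (the Iceberg Flink runtime jar must correspond to your Flink minor version):

```shell
# Unpack the Flink distribution (archive name is illustrative).
tar -xzf flink-1.17.1-bin-scala_2.12.tgz
cd flink-1.17.1

# The Iceberg Flink runtime jar must be on Flink's classpath;
# the jar name shown here is an assumption -- use the build that
# matches your Flink version.
cp /path/to/iceberg-flink-runtime-1.17-1.4.3.jar lib/

# Start a local cluster, then open the SQL Client against it.
./bin/start-cluster.sh
./bin/sql-client.sh embedded
```

With the SQL Client running, you can register an Iceberg catalog and begin issuing DDL and queries, as shown later in this chapter.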