O'Reilly logo

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Stream Processing with Apache Flink

Book Description

With Early Release ebooks, you get books in their earliest form—the author's raw and unedited content as he or she writes—so you can take advantage of these technologies long before the official release of these titles. You’ll also receive updates when significant changes are made, new chapters are available, and the final ebook bundle is released.

Get started with Apache Flink, the open source framework that enables you to process streaming data—such as user interactions, sensor data, and machine logs—as it arrives. With this practical guide, you’ll learn how to use Apache Flink’s stream processing APIs to implement, continuously run, and maintain real-world applications.

Authors Fabian Hueske, one of Flink’s creators, and Vasia Kalavri, a core contributor to Flink’s graph processing API (Gelly), explains the fundamental concepts of parallel stream processing and shows you how streaming analytics differs from traditional batch data analysis. Software engineers, data engineers, and system administrators will learn the basics of Flink’s DataStream API, including the structure and components of a common Flink streaming application.

  • Solve real-world problems with Apache Flink’s DataStream API
  • Set up an environment for developing stream processing applications for Flink
  • Design streaming applications and migrate periodic batch workloads to continuous streaming workloads
  • Learn about windowed operations that process groups of records
  • Ingest data streams into a DataStream application and emit a result stream into different storage systems
  • Implement stateful and custom operators common in stream processing applications
  • Operate, maintain, and update continuously running Flink streaming applications
  • Explore several deployment options, including the setup of highly available installations

Table of Contents

  1. 1. Introduction to Stateful Stream Processing
    1. Traditional Data Infrastructures
    2. Stateful Stream Processing
      1. Event-Driven Applications
      2. Data Pipelines and Real-Time ETL
      3. Streaming Analytics
    3. The Evolution of Open Source Stream Processing
    4. A Taste of Flink
    5. What You Will Learn in This Book
  2. 2. Stream Processing Fundamentals
    1. Introduction to Dataflow Programming
      1. Dataflow Graphs
      2. Data Parallelism and Task Parallelism
      3. Data Exchange Strategies
    2. Processing Infinite Streams in Parallel
      1. Latency and Throughput
      2. Operations on Data Streams
    3. Time Semantics
      1. What Is the Meaning of One Minute?
      2. Processing Time
      3. Event Time
      4. Watermarks
      5. Processing Time vs. Event Time
    4. State and Consistency Models
      1. Task Failures
      2. Result Guarantees
    5. Summary
  3. 3. The Architecture of Apache Flink
    1. System Architecture
      1. Components of a Flink Setup
      2. Application Deployment
      3. Task Execution
      4. Highly-Available Setup
    2. Data Transfer in Flink
      1. Credit-Based Flow Control
    3. Event Time Processing
      1. Timestamps
      2. Watermarks
      3. Watermark Propagation and Event Time
      4. Timestamp Assignment and Watermark Generation
    4. State Management
      1. Operator State
      2. Keyed State
      3. State Backends
      4. Scaling Stateful Operators
    5. Checkpoints, Savepoints, and State Recovery
      1. Consistent Checkpoints
      2. Recovery from a Consistent Checkpoint
      3. Flink’s Checkpointing Algorithm
      4. Savepoints
    6. Summary
  4. 4. Setting Up a Development Environment for Apache Flink
    1. Required Software
    2. Run and Debug Flink Applications in an IDE
      1. Import the Book’s Examples in Your IDE
      2. Run Flink Applications in an IDE
      3. Debug Flink Applications in an IDE
    3. Bootstrap a Flink Maven Project
  5. 5. The DataStream API (v1.4.0)
    1. Hello, Flink!
      1. Set Up the Execution Environment
      2. Read an Input Stream
      3. Apply Transformations
      4. Output the Result
      5. Execute
    2. Types
      1. Supported Data Types
      2. Type Hints
      3. TypeInformation
    3. Transformations
      1. Basic Transformations
      2. KeyedStream Transformations
      3. Multi-Stream Transformations
      4. Partitioning Transformations
    4. Setting the Parallelism
    5. Referencing Fields and Defining Keys
    6. Defining UDFs
    7. Including External and Flink Dependencies
    8. Summary
  6. 6. Time-Based and Window Operators
    1. Configuring Time Characteristics
      1. Timestamps and Watermarks for Event-Time Applications
      2. Watermarks, Latency, and Completeness
    2. Process Functions
      1. The TimerService and Timers
      2. Emitting to Side Outputs
      3. The CoProcessFunction
    3. Window Operators
      1. Defining Window Operators
      2. Built-in Window Assigners
      3. Applying Functions on Windows
      4. Customizing Window Operators
    4. Joining Streams on Time
      1. The Interval Join
      2. The Window Join
    5. Handling Late Data
      1. Dropping Late Events
      2. Redirecting Late Events
      3. Updating Results by Including Late Events
    6. Summary
  7. 7. Stateful Operators and User Functions
    1. Implementing Stateful Functions
      1. Declaring Keyed State at the RuntimeContext
      2. Implementing Operator List State with the ListCheckpointed Interface
      3. Using Connected Broadcast State
      4. Using the CheckpointedFunction Interface
      5. Receiving Notifications about Completed Checkpoints
    2. Robustness and Performance of Stateful Applications
      1. Choosing a State Backend
      2. Enabling Checkpointing
      3. Updating Stateful Operators
      4. Tuning the Performance of Stateful Applications
      5. Preventing Leaking State
    3. Queryable State
      1. Architecture and Enabling Queryable State
      2. Exposing Queryable State
      3. Querying State from External Applications
    4. Summary
  8. 8. Reading from and Writing to External Systems
    1. Application Consistency Guarantees
    2. Provided Connectors
      1. Apache Kafka Source Connector
      2. Apache Kafka Sink Connector
      3. File System Source Connector
      4. File System Sink Connector
      5. Apache Cassandra Sink Connector
    3. Implementing a Custom Source Function
      1. Resettable Source Functions
      2. Source Functions, Timestamps, and Watermarks
    4. Implementing a Custom Sink Function
      1. Idempotent Sink Connectors
      2. Transactional Sink Connectors
    5. Asynchronously Accessing External Systems
    6. Summary
  9. 9. Setting Flink Up for Streaming Applications
    1. Deployment Modes
      1. Stand-Alone Cluster
      2. Docker
      3. Apache Hadoop YARN
      4. Kubernetes
    2. Highly-Available Setups
      1. Highly-Available Stand-Alone Setup
      2. Highly-Available YARN Setup
      3. Highly-Available Kubernetes Setup
    3. Integration with Hadoop Components
    4. File System Configuration
    5. System Configuration
      1. Java and Classloading
      2. CPU
      3. Main Memory
      4. Disk Storage
      5. State Backends, Checkpointing, and Recovery
      6. Security
    6. Summary
  10. 10. Operating Flink and Streaming Applications
    1. Running and Managing Streaming Applications
      1. Savepoints
      2. Managing Applications with the Command-Line Client
      3. Managing Applications with the REST API
    2. Monitoring Flink Clusters and Applications
      1. Flink Web UI
      2. Metric System
      3. Monitoring Latency
    3. Configuring the Logging Behavior
    4. Summary
  11. 11. Where To Go from Here?
    1. The Rest of the Flink Ecosystem
    2. A Welcoming Community
  12. Index