Mastering Kafka Streams and ksqlDB

Book description

Working with unbounded and fast-moving data streams has historically been difficult. But with Kafka Streams and ksqlDB, building stream processing applications is easy and fun. This practical guide shows data engineers how to use these tools to build highly scalable stream processing applications for moving, enriching, and transforming large amounts of data in real time.

Mitch Seymour, data services engineer at Mailchimp, explains important stream processing concepts against a backdrop of several interesting business problems. You'll learn the strengths of both Kafka Streams and ksqlDB to help you choose the best tool for each unique stream processing project. Non-Java developers will find the ksqlDB path to be an especially gentle introduction to stream processing.

  • Learn the basics of Kafka and the pub/sub communication pattern
  • Build stateless and stateful stream processing applications using Kafka Streams and ksqlDB
  • Perform advanced stateful operations, including windowed joins and aggregations
  • Understand how stateful processing works under the hood
  • Learn about ksqlDB's data integration features, powered by Kafka Connect
  • Work with different types of collections in ksqlDB and perform push and pull queries
  • Deploy your Kafka Streams and ksqlDB applications to production

Table of contents

  1. Foreword
  2. Preface
    1. Who Should Read This Book
    2. Navigating This Book
    3. Source Code
    4. Kafka Streams Version
    5. ksqlDB Version
    6. Conventions Used in This Book
    7. Using Code Examples
    8. O’Reilly Online Learning
    9. How to Contact Us
    10. Acknowledgments
  3. I. Kafka
  4. 1. A Rapid Introduction to Kafka
    1. Communication Model
    2. How Are Streams Stored?
    3. Topics and Partitions
    4. Events
    5. Kafka Cluster and Brokers
    6. Consumer Groups
    7. Installing Kafka
    8. Hello, Kafka
    9. Summary
  5. II. Kafka Streams
  6. 2. Getting Started with Kafka Streams
    1. The Kafka Ecosystem
      1. Before Kafka Streams
      2. Enter Kafka Streams
    2. Features at a Glance
    3. Operational Characteristics
      1. Scalability
      2. Reliability
      3. Maintainability
    4. Comparison to Other Systems
      1. Deployment Model
      2. Processing Model
      3. Kappa Architecture
    5. Use Cases
    6. Processor Topologies
      1. Sub-Topologies
      2. Depth-First Processing
      3. Benefits of Dataflow Programming
      4. Tasks and Stream Threads
    7. High-Level DSL Versus Low-Level Processor API
    8. Introducing Our Tutorial: Hello, Streams
      1. Project Setup
      2. Creating a New Project
      3. Adding the Kafka Streams Dependency
      4. DSL
      5. Processor API
    9. Streams and Tables
      1. Stream/Table Duality
      2. KStream, KTable, GlobalKTable
    10. Summary
  7. 3. Stateless Processing
    1. Stateless Versus Stateful Processing
    2. Introducing Our Tutorial: Processing a Twitter Stream
    3. Project Setup
    4. Adding a KStream Source Processor
    5. Serialization/Deserialization
      1. Building a Custom Serdes
      2. Defining Data Classes
      3. Implementing a Custom Deserializer
      4. Implementing a Custom Serializer
      5. Building the Tweet Serdes
    6. Filtering Data
    7. Branching Data
    8. Translating Tweets
    9. Merging Streams
    10. Enriching Tweets
      1. Avro Data Class
      2. Sentiment Analysis
    11. Serializing Avro Data
      1. Registryless Avro Serdes
      2. Schema Registry–Aware Avro Serdes
    12. Adding a Sink Processor
    13. Running the Code
    14. Empirical Verification
    15. Summary
  8. 4. Stateful Processing
    1. Benefits of Stateful Processing
    2. Preview of Stateful Operators
    3. State Stores
      1. Common Characteristics
      2. Persistent Versus In-Memory Stores
    4. Introducing Our Tutorial: Video Game Leaderboard
    5. Project Setup
    6. Data Models
    7. Adding the Source Processors
      1. KStream
      2. KTable
      3. GlobalKTable
    8. Registering Streams and Tables
    9. Joins
      1. Join Operators
      2. Join Types
      3. Co-Partitioning
      4. Value Joiners
      5. KStream to KTable Join (players Join)
      6. KStream to GlobalKTable Join (products Join)
    10. Grouping Records
      1. Grouping Streams
      2. Grouping Tables
    11. Aggregations
      1. Aggregating Streams
      2. Aggregating Tables
    12. Putting It All Together
    13. Interactive Queries
      1. Materialized Stores
      2. Accessing Read-Only State Stores
      3. Querying Nonwindowed Key-Value Stores
      4. Local Queries
      5. Remote Queries
    14. Summary
  9. 5. Windows and Time
    1. Introducing Our Tutorial: Patient Monitoring Application
    2. Project Setup
    3. Data Models
    4. Time Semantics
    5. Timestamp Extractors
      1. Included Timestamp Extractors
      2. Custom Timestamp Extractors
      3. Registering Streams with a Timestamp Extractor
    6. Windowing Streams
      1. Window Types
      2. Selecting a Window
      3. Windowed Aggregation
    7. Emitting Window Results
      1. Grace Period
      2. Suppression
    8. Filtering and Rekeying Windowed KTables
    9. Windowed Joins
    10. Time-Driven Dataflow
      1. Alerts Sink
      2. Querying Windowed Key-Value Stores
    11. Summary
  10. 6. Advanced State Management
    1. Persistent Store Disk Layout
    2. Fault Tolerance
      1. Changelog Topics
      2. Standby Replicas
    3. Rebalancing: Enemy of the State (Store)
    4. Preventing State Migration
      1. Sticky Assignment
      2. Static Membership
    5. Reducing the Impact of Rebalances
      1. Incremental Cooperative Rebalancing
      2. Controlling State Size
    6. Deduplicating Writes with Record Caches
    7. State Store Monitoring
      1. Adding State Listeners
      2. Adding State Restore Listeners
    8. Built-in Metrics
    9. Interactive Queries
    10. Custom State Stores
    11. Summary
  11. 7. Processor API
    1. When to Use the Processor API
    2. Introducing Our Tutorial: IoT Digital Twin Service
    3. Project Setup
    4. Data Models
    5. Adding Source Processors
    6. Adding Stateless Stream Processors
    7. Creating Stateless Processors
    8. Creating Stateful Processors
    9. Periodic Functions with Punctuate
    10. Accessing Record Metadata
    11. Adding Sink Processors
    12. Interactive Queries
    13. Putting It All Together
    14. Combining the Processor API with the DSL
    15. Processors and Transformers
    16. Putting It All Together: Refactor
    17. Summary
  12. III. ksqlDB
  13. 8. Getting Started with ksqlDB
    1. What Is ksqlDB?
    2. When to Use ksqlDB
    3. Evolution of a New Kind of Database
      1. Kafka Streams Integration
      2. Connect Integration
      3. How Does ksqlDB Compare to a Traditional SQL Database?
      4. Similarities
      5. Differences
    4. Architecture
      1. ksqlDB Server
      2. ksqlDB Clients
    5. Deployment Modes
      1. Interactive Mode
      2. Headless Mode
    6. Tutorial
      1. Installing ksqlDB
      2. Running a ksqlDB Server
      3. Precreating Topics
      4. Using the ksqlDB CLI
      5. Summary
  14. 9. Data Integration with ksqlDB
    1. Kafka Connect Overview
    2. External Versus Embedded Connect
      1. External Mode
      2. Embedded Mode
    3. Configuring Connect Workers
      1. Converters and Serialization Formats
    4. Tutorial
    5. Installing Connectors
      1. Creating Connectors with ksqlDB
      2. Showing Connectors
      3. Describing Connectors
      4. Dropping Connectors
    6. Verifying the Source Connector
    7. Interacting with the Kafka Connect Cluster Directly
    8. Introspecting Managed Schemas
    9. Summary
  15. 10. Stream Processing Basics with ksqlDB
    1. Tutorial: Monitoring Changes at Netflix
    2. Project Setup
    3. Source Topics
    4. Data Types
      1. Custom Types
    5. Collections
      1. Creating Source Collections
      2. With Clause
    6. Working with Streams and Tables
      1. Showing Streams and Tables
      2. Describing Streams and Tables
      3. Altering Streams and Tables
      4. Dropping Streams and Tables
    7. Basic Queries
      1. Insert Values
      2. Simple Selects (Transient Push Queries)
      3. Projection
      4. Filtering
      5. Flattening/Unnesting Complex Structures
    8. Conditional Expressions
      1. Coalesce
      2. IFNULL
      3. Case Statements
    9. Writing Results Back to Kafka (Persistent Queries)
      1. Creating Derived Collections
    10. Putting It All Together
    11. Summary
  16. 11. Intermediate and Advanced Stream Processing with ksqlDB
    1. Project Setup
    2. Bootstrapping an Environment from a SQL File
    3. Data Enrichment
      1. Joins
      2. Windowed Joins
    4. Aggregations
      1. Aggregation Basics
      2. Windowed Aggregations
    5. Materialized Views
    6. Clients
    7. Pull Queries
      1. Curl
    8. Push Queries
      1. Push Queries via Curl
    9. Functions and Operators
      1. Operators
      2. Showing Functions
      3. Describing Functions
      4. Creating Custom Functions
      5. Additional Resources for Custom ksqlDB Functions
    10. Summary
  17. IV. The Road to Production
  18. 12. Testing, Monitoring, and Deployment
    1. Testing
      1. Testing ksqlDB Queries
      2. Testing Kafka Streams
      3. Behavioral Tests
      4. Benchmarking
      5. Kafka Cluster Benchmarking
      6. Final Thoughts on Testing
    2. Monitoring
      1. Monitoring Checklist
      2. Extracting JMX Metrics
    3. Deployment
      1. ksqlDB Containers
      2. Kafka Streams Containers
      3. Container Orchestration
    4. Operations
      1. Resetting a Kafka Streams Application
      2. Rate-Limiting the Output of Your Application
      3. Upgrading Kafka Streams
    5. Upgrading ksqlDB
    6. Summary
  19. A. Kafka Streams Configuration
    1. Configuration Management
    2. Configuration Properties
      1. Consumer-Specific Configurations
  20. B. ksqlDB Configuration
    1. Query Configurations
    2. Server Configurations
    3. Security Configurations
  21. Index

Product information

  • Title: Mastering Kafka Streams and ksqlDB
  • Author(s): Mitch Seymour
  • Release date: February 2021
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492062493