Kafka: The Definitive Guide, 2nd Edition

Book description

Every enterprise application creates data, whether it consists of log messages, metrics, user activity, or outgoing messages. Moving all this data is just as important as the data itself. With this updated edition, application architects, developers, and production engineers new to the Kafka streaming platform will learn how to handle data in motion. Additional chapters cover Kafka's AdminClient API, transactions, new security features, and tooling changes.

Engineers from Confluent and LinkedIn responsible for developing Kafka explain how to deploy production Kafka clusters, write reliable event-driven microservices, and build scalable stream processing applications with this platform. Through detailed examples, you'll learn Kafka's design principles, reliability guarantees, key APIs, and architecture details, including the replication protocol, the controller, and the storage layer.

You'll examine:

  • Best practices for deploying and configuring Kafka
  • Kafka producers and consumers for writing and reading messages
  • Patterns and use-case requirements to ensure reliable data delivery
  • Best practices for building data pipelines and applications with Kafka
  • How to perform monitoring, tuning, and maintenance tasks with Kafka in production
  • The most critical metrics among Kafka's operational measurements
  • Kafka's delivery capabilities for stream processing systems

Table of contents

  1. Foreword to the Second Edition
  2. Foreword to the First Edition
  3. Preface
    1. Who Should Read This Book
    2. Conventions Used in This Book
    3. Using Code Examples
    4. O’Reilly Online Learning
    5. How to Contact Us
    6. Acknowledgments
  4. 1. Meet Kafka
    1. Publish/Subscribe Messaging
      1. How It Starts
      2. Individual Queue Systems
    2. Enter Kafka
      1. Messages and Batches
      2. Schemas
      3. Topics and Partitions
      4. Producers and Consumers
      5. Brokers and Clusters
      6. Multiple Clusters
    3. Why Kafka?
      1. Multiple Producers
      2. Multiple Consumers
      3. Disk-Based Retention
      4. Scalable
      5. High Performance
      6. Platform Features
    4. The Data Ecosystem
      1. Use Cases
    5. Kafka’s Origin
      1. LinkedIn’s Problem
      2. The Birth of Kafka
      3. Open Source
      4. Commercial Engagement
      5. The Name
    6. Getting Started with Kafka
  5. 2. Installing Kafka
    1. Environment Setup
      1. Choosing an Operating System
      2. Installing Java
      3. Installing ZooKeeper
    2. Installing a Kafka Broker
    3. Configuring the Broker
      1. General Broker Parameters
      2. Topic Defaults
    4. Selecting Hardware
      1. Disk Throughput
      2. Disk Capacity
      3. Memory
      4. Networking
      5. CPU
    5. Kafka in the Cloud
      1. Microsoft Azure
      2. Amazon Web Services
    6. Configuring Kafka Clusters
      1. How Many Brokers?
      2. Broker Configuration
      3. OS Tuning
    7. Production Concerns
      1. Garbage Collector Options
      2. Datacenter Layout
      3. Colocating Applications on ZooKeeper
    8. Summary
  6. 3. Kafka Producers: Writing Messages to Kafka
    1. Producer Overview
    2. Constructing a Kafka Producer
    3. Sending a Message to Kafka
      1. Sending a Message Synchronously
      2. Sending a Message Asynchronously
    4. Configuring Producers
      1. client.id
      2. acks
      3. Message Delivery Time
      4. linger.ms
      5. buffer.memory
      6. compression.type
      7. batch.size
      8. max.in.flight.requests.per.connection
      9. max.request.size
      10. receive.buffer.bytes and send.buffer.bytes
      11. enable.idempotence
    5. Serializers
      1. Custom Serializers
      2. Serializing Using Apache Avro
      3. Using Avro Records with Kafka
    6. Partitions
    7. Headers
    8. Interceptors
    9. Quotas and Throttling
    10. Summary
  7. 4. Kafka Consumers: Reading Data from Kafka
    1. Kafka Consumer Concepts
      1. Consumers and Consumer Groups
      2. Consumer Groups and Partition Rebalance
      3. Static Group Membership
    2. Creating a Kafka Consumer
    3. Subscribing to Topics
    4. The Poll Loop
      1. Thread Safety
    5. Configuring Consumers
      1. fetch.min.bytes
      2. fetch.max.wait.ms
      3. fetch.max.bytes
      4. max.poll.records
      5. max.partition.fetch.bytes
      6. session.timeout.ms and heartbeat.interval.ms
      7. max.poll.interval.ms
      8. default.api.timeout.ms
      9. request.timeout.ms
      10. auto.offset.reset
      11. enable.auto.commit
      12. partition.assignment.strategy
      13. client.id
      14. client.rack
      15. group.instance.id
      16. receive.buffer.bytes and send.buffer.bytes
      17. offsets.retention.minutes
    6. Commits and Offsets
      1. Automatic Commit
      2. Commit Current Offset
      3. Asynchronous Commit
      4. Combining Synchronous and Asynchronous Commits
      5. Committing a Specified Offset
    7. Rebalance Listeners
    8. Consuming Records with Specific Offsets
    9. But How Do We Exit?
    10. Deserializers
      1. Custom Deserializers
      2. Using Avro Deserialization with Kafka Consumer
    11. Standalone Consumer: Why and How to Use a Consumer Without a Group
    12. Summary
  8. 5. Managing Apache Kafka Programmatically
    1. AdminClient Overview
      1. Asynchronous and Eventually Consistent API
      2. Options
      3. Flat Hierarchy
      4. Additional Notes
    2. AdminClient Lifecycle: Creating, Configuring, and Closing
      1. client.dns.lookup
      2. request.timeout.ms
    3. Essential Topic Management
    4. Configuration Management
    5. Consumer Group Management
      1. Exploring Consumer Groups
      2. Modifying Consumer Groups
    6. Cluster Metadata
    7. Advanced Admin Operations
      1. Adding Partitions to a Topic
      2. Deleting Records from a Topic
      3. Leader Election
      4. Reassigning Replicas
    8. Testing
    9. Summary
  9. 6. Kafka Internals
    1. Cluster Membership
    2. The Controller
      1. KRaft: Kafka’s New Raft-Based Controller
    3. Replication
    4. Request Processing
      1. Produce Requests
      2. Fetch Requests
      3. Other Requests
    5. Physical Storage
      1. Tiered Storage
      2. Partition Allocation
      3. File Management
      4. File Format
      5. Indexes
      6. Compaction
      7. How Compaction Works
      8. Deleted Events
      9. When Are Topics Compacted?
    6. Summary
  10. 7. Reliable Data Delivery
    1. Reliability Guarantees
    2. Replication
    3. Broker Configuration
      1. Replication Factor
      2. Unclean Leader Election
      3. Minimum In-Sync Replicas
      4. Keeping Replicas In Sync
      5. Persisting to Disk
    4. Using Producers in a Reliable System
      1. Send Acknowledgments
      2. Configuring Producer Retries
      3. Additional Error Handling
    5. Using Consumers in a Reliable System
      1. Important Consumer Configuration Properties for Reliable Processing
      2. Explicitly Committing Offsets in Consumers
    6. Validating System Reliability
      1. Validating Configuration
      2. Validating Applications
      3. Monitoring Reliability in Production
    7. Summary
  11. 8. Exactly-Once Semantics
    1. Idempotent Producer
      1. How Does the Idempotent Producer Work?
      2. Limitations of the Idempotent Producer
      3. How Do I Use the Kafka Idempotent Producer?
    2. Transactions
      1. Transactions Use Cases
      2. What Problems Do Transactions Solve?
      3. How Do Transactions Guarantee Exactly-Once?
      4. What Problems Aren’t Solved by Transactions?
      5. How Do I Use Transactions?
      6. Transactional IDs and Fencing
      7. How Transactions Work
    3. Performance of Transactions
    4. Summary
  12. 9. Building Data Pipelines
    1. Considerations When Building Data Pipelines
      1. Timeliness
      2. Reliability
      3. High and Varying Throughput
      4. Data Formats
      5. Transformations
      6. Security
      7. Failure Handling
      8. Coupling and Agility
    2. When to Use Kafka Connect Versus Producer and Consumer
    3. Kafka Connect
      1. Running Kafka Connect
      2. Connector Example: File Source and File Sink
      3. Connector Example: MySQL to Elasticsearch
      4. Single Message Transformations
      5. A Deeper Look at Kafka Connect
    4. Alternatives to Kafka Connect
      1. Ingest Frameworks for Other Datastores
      2. GUI-Based ETL Tools
      3. Stream Processing Frameworks
    5. Summary
  13. 10. Cross-Cluster Data Mirroring
    1. Use Cases of Cross-Cluster Mirroring
    2. Multicluster Architectures
      1. Some Realities of Cross-Datacenter Communication
      2. Hub-and-Spoke Architecture
      3. Active-Active Architecture
      4. Active-Standby Architecture
      5. Stretch Clusters
    3. Apache Kafka’s MirrorMaker
      1. Configuring MirrorMaker
      2. Multicluster Replication Topology
      3. Securing MirrorMaker
      4. Deploying MirrorMaker in Production
      5. Tuning MirrorMaker
    4. Other Cross-Cluster Mirroring Solutions
      1. Uber uReplicator
      2. LinkedIn Brooklin
      3. Confluent Cross-Datacenter Mirroring Solutions
    5. Summary
  14. 11. Securing Kafka
    1. Locking Down Kafka
    2. Security Protocols
    3. Authentication
      1. SSL
      2. SASL
      3. Reauthentication
      4. Security Updates Without Downtime
    4. Encryption
      1. End-to-End Encryption
    5. Authorization
      1. AclAuthorizer
      2. Customizing Authorization
      3. Security Considerations
    6. Auditing
    7. Securing ZooKeeper
      1. SASL
      2. SSL
      3. Authorization
    8. Securing the Platform
      1. Password Protection
    9. Summary
  15. 12. Administering Kafka
    1. Topic Operations
      1. Creating a New Topic
      2. Listing All Topics in a Cluster
      3. Describing Topic Details
      4. Adding Partitions
      5. Reducing Partitions
      6. Deleting a Topic
    2. Consumer Groups
      1. List and Describe Groups
      2. Delete Group
      3. Offset Management
    3. Dynamic Configuration Changes
      1. Overriding Topic Configuration Defaults
      2. Overriding Client and User Configuration Defaults
      3. Overriding Broker Configuration Defaults
      4. Describing Configuration Overrides
      5. Removing Configuration Overrides
    4. Producing and Consuming
      1. Console Producer
      2. Console Consumer
    5. Partition Management
      1. Preferred Replica Election
      2. Changing a Partition’s Replicas
      3. Dumping Log Segments
      4. Replica Verification
    6. Other Tools
    7. Unsafe Operations
      1. Moving the Cluster Controller
      2. Removing Topics to Be Deleted
      3. Deleting Topics Manually
    8. Summary
  16. 13. Monitoring Kafka
    1. Metric Basics
      1. Where Are the Metrics?
      2. What Metrics Do I Need?
      3. Application Health Checks
    2. Service-Level Objectives
      1. Service-Level Definitions
      2. What Metrics Make Good SLIs?
      3. Using SLOs in Alerting
    3. Kafka Broker Metrics
      1. Diagnosing Cluster Problems
      2. The Art of Under-Replicated Partitions
      3. Broker Metrics
      4. Topic and Partition Metrics
      5. JVM Monitoring
      6. OS Monitoring
      7. Logging
    4. Client Monitoring
      1. Producer Metrics
      2. Consumer Metrics
      3. Quotas
    5. Lag Monitoring
    6. End-to-End Monitoring
    7. Summary
  17. 14. Stream Processing
    1. What Is Stream Processing?
    2. Stream Processing Concepts
      1. Topology
      2. Time
      3. State
      4. Stream-Table Duality
      5. Time Windows
      6. Processing Guarantees
    3. Stream Processing Design Patterns
      1. Single-Event Processing
      2. Processing with Local State
      3. Multiphase Processing/Repartitioning
      4. Processing with External Lookup: Stream-Table Join
      5. Table-Table Join
      6. Streaming Join
      7. Out-of-Sequence Events
      8. Reprocessing
      9. Interactive Queries
    4. Kafka Streams by Example
      1. Word Count
      2. Stock Market Statistics
      3. ClickStream Enrichment
    5. Kafka Streams: Architecture Overview
      1. Building a Topology
      2. Optimizing a Topology
      3. Testing a Topology
      4. Scaling a Topology
      5. Surviving Failures
    6. Stream Processing Use Cases
    7. How to Choose a Stream Processing Framework
    8. Summary
  18. A. Installing Kafka on Other Operating Systems
    1. Installing on Windows
      1. Using Windows Subsystem for Linux
      2. Using Native Java
    2. Installing on macOS
      1. Using Homebrew
      2. Installing Manually
  19. B. Additional Kafka Tools
    1. Comprehensive Platforms
    2. Cluster Deployment and Management
    3. Monitoring and Data Exploration
    4. Client Libraries
    5. Stream Processing
  20. Index

Product information

  • Title: Kafka: The Definitive Guide, 2nd Edition
  • Author(s): Gwen Shapira, Todd Palino, Rajini Sivaram, Krit Petty
  • Release date: November 2021
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492043089