Kafka: The Definitive Guide

by Neha Narkhede, Gwen Shapira, Todd Palino

September 2017

Beginner to intermediate

319 pages

9h 10m

English

O'Reilly Media, Inc.

Foreword
Preface
Who Should Read This Book; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. Meet Kafka
Publish/Subscribe Messaging; How It Starts; Individual Queue Systems; Enter Kafka; Messages and Batches; Schemas; Topics and Partitions; Producers and Consumers; Brokers and Clusters; Multiple Clusters; Why Kafka?; Multiple Producers; Multiple Consumers; Disk-Based Retention; Scalable; High Performance; The Data Ecosystem; Use Cases; Kafka’s Origin; LinkedIn’s Problem; The Birth of Kafka; Open Source; The Name; Getting Started with Kafka
2. Installing Kafka
First Things First; Choosing an Operating System; Installing Java; Installing Zookeeper; Installing a Kafka Broker; Broker Configuration; General Broker; Topic Defaults; Hardware Selection; Disk Throughput; Disk Capacity; Memory; Networking; CPU; Kafka in the Cloud; Kafka Clusters; How Many Brokers?; Broker Configuration; OS Tuning; Production Concerns; Garbage Collector Options; Datacenter Layout; Colocating Applications on Zookeeper; Summary
3. Kafka Producers: Writing Messages to Kafka
Producer Overview; Constructing a Kafka Producer; Sending a Message to Kafka; Sending a Message Synchronously; Sending a Message Asynchronously; Configuring Producers; Serializers; Custom Serializers; Serializing Using Apache Avro; Using Avro Records with Kafka; Partitions; Old Producer APIs; Summary
4. Kafka Consumers: Reading Data from Kafka
Kafka Consumer Concepts; Consumers and Consumer Groups; Consumer Groups and Partition Rebalance; Creating a Kafka Consumer; Subscribing to Topics; The Poll Loop; Configuring Consumers; Commits and Offsets; Automatic Commit; Commit Current Offset; Asynchronous Commit; Combining Synchronous and Asynchronous Commits; Commit Specified Offset; Rebalance Listeners; Consuming Records with Specific Offsets; But How Do We Exit?; Deserializers; Standalone Consumer: Why and How to Use a Consumer Without a Group; Older Consumer APIs; Summary
5. Kafka Internals
Cluster Membership; The Controller; Replication; Request Processing; Produce Requests; Fetch Requests; Other Requests; Physical Storage; Partition Allocation; File Management; File Format; Indexes; Compaction; How Compaction Works; Deleted Events; When Are Topics Compacted?; Summary
6. Reliable Data Delivery
Reliability Guarantees; Replication; Broker Configuration; Replication Factor; Unclean Leader Election; Minimum In-Sync Replicas; Using Producers in a Reliable System; Send Acknowledgments; Configuring Producer Retries; Additional Error Handling; Using Consumers in a Reliable System; Important Consumer Configuration Properties for Reliable Processing; Explicitly Committing Offsets in Consumers; Validating System Reliability; Validating Configuration; Validating Applications; Monitoring Reliability in Production; Summary
7. Building Data Pipelines
Considerations When Building Data Pipelines; Timeliness; Reliability; High and Varying Throughput; Data Formats; Transformations; Security; Failure Handling; Coupling and Agility; When to Use Kafka Connect Versus Producer and Consumer; Kafka Connect; Running Connect; Connector Example: File Source and File Sink; Connector Example: MySQL to Elasticsearch; A Deeper Look at Connect; Alternatives to Kafka Connect; Ingest Frameworks for Other Datastores; GUI-Based ETL Tools; Stream-Processing Frameworks; Summary
8. Cross-Cluster Data Mirroring
Use Cases of Cross-Cluster Mirroring; Multicluster Architectures; Some Realities of Cross-Datacenter Communication; Hub-and-Spokes Architecture; Active-Active Architecture; Active-Standby Architecture; Stretch Clusters; Apache Kafka’s MirrorMaker; How to Configure; Deploying MirrorMaker in Production; Tuning MirrorMaker; Other Cross-Cluster Mirroring Solutions; Uber uReplicator; Confluent Replicator; Summary
9. Administering Kafka
Topic Operations; Creating a New Topic; Adding Partitions; Deleting a Topic; Listing All Topics in a Cluster; Describing Topic Details; Consumer Groups; List and Describe Groups; Delete Group; Offset Management; Dynamic Configuration Changes; Overriding Topic Configuration Defaults; Overriding Client Configuration Defaults; Describing Configuration Overrides; Removing Configuration Overrides; Partition Management; Preferred Replica Election; Changing a Partition’s Replicas; Changing Replication Factor; Dumping Log Segments; Replica Verification; Consuming and Producing; Console Consumer; Console Producer; Client ACLs; Unsafe Operations; Moving the Cluster Controller; Killing a Partition Move; Removing Topics to Be Deleted; Deleting Topics Manually; Summary
10. Monitoring Kafka
Metric Basics; Where Are the Metrics?; Internal or External Measurements; Application Health Checks; Metric Coverage; Kafka Broker Metrics; Under-Replicated Partitions; Broker Metrics; Topic and Partition Metrics; JVM Monitoring; OS Monitoring; Logging; Client Monitoring; Producer Metrics; Consumer Metrics; Quotas; Lag Monitoring; End-to-End Monitoring; Summary
11. Stream Processing
What Is Stream Processing?; Stream-Processing Concepts; Time; State; Stream-Table Duality; Time Windows; Stream-Processing Design Patterns; Single-Event Processing; Processing with Local State; Multiphase Processing/Repartitioning; Processing with External Lookup: Stream-Table Join; Streaming Join; Out-of-Sequence Events; Reprocessing; Kafka Streams by Example; Word Count; Stock Market Statistics; Click Stream Enrichment; Kafka Streams: Architecture Overview; Building a Topology; Scaling the Topology; Surviving Failures; Stream Processing Use Cases; How to Choose a Stream-Processing Framework; Summary
A. Installing Kafka on Other Operating Systems
Installing on Windows; Using Windows Subsystem for Linux; Using Native Java; Installing on MacOS; Using Homebrew; Installing Manually
Index

Content preview from Kafka: The Definitive Guide

Chapter 1. Meet Kafka

Every enterprise is powered by data. We take information in, analyze it, manipulate it, and create more as output. Every application creates data, whether it is log messages, metrics, user activity, outgoing messages, or something else. Every byte of data has a story to tell, something of importance that will inform the next thing to be done. In order to know what that is, we need to get the data from where it is created to where it can be analyzed. We see this every day on websites like Amazon, where our clicks on items of interest to us are turned into recommendations that are shown to us a little later.

The faster we can do this, the more agile and responsive our organizations can be. The less effort we spend on moving data around, the more we can focus on the core business at hand. This is why the pipeline is a critical component in the data-driven enterprise. How we move the data becomes nearly as important as the data itself.

Any time scientists disagree, it’s because we have insufficient data. Then we can agree on what kind of data to get; we get the data; and the data solves the problem. Either I’m right, or you’re right, or we’re both wrong. And we move on.

Neil deGrasse Tyson

Publish/Subscribe Messaging

Before discussing the specifics of Apache Kafka, we need to understand the concept of publish/subscribe messaging and why it matters. Publish/subscribe messaging is a pattern that is characterized by the sender (publisher) ...

Publisher Resources

ISBN: 9781491936153
