Chapter 7. Building Data Pipelines

When people discuss building data pipelines with Apache Kafka, they are usually referring to one of two use cases. The first is building a data pipeline in which Apache Kafka is one of the two endpoints: for example, getting data from Kafka to S3, or getting data from MongoDB into Kafka. The second is building a pipeline between two different systems with Kafka as an intermediary: for example, getting data from Twitter to Elasticsearch by sending it first from Twitter to Kafka and then from Kafka to Elasticsearch.
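
For the second pattern, a minimal sketch of the two decoupled stages might look like the following. It assumes a recent Kafka Java client and a broker at localhost:9092; the tweets topic, the tweet-indexer consumer group, and the indexDocument() helper are illustrative placeholders, and a real pipeline would more likely use Kafka Connect connectors (introduced below) than hand-written clients.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class TweetPipelineSketch {

    // Stage 1: the ingest side writes raw tweets to a Kafka topic.
    // It knows nothing about Elasticsearch or any other downstream system.
    static void produceTweet(KafkaProducer<String, String> producer,
                             String userId, String tweetJson) {
        producer.send(new ProducerRecord<>("tweets", userId, tweetJson));
    }

    // Stage 2: the indexing side reads the same topic and hands every record
    // to the downstream store. It knows nothing about Twitter.
    static void consumeAndIndex(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("tweets"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                indexDocument(record.key(), record.value()); // hypothetical call into the downstream store
            }
        }
    }

    // Placeholder for whatever client writes into the downstream system.
    static void indexDocument(String id, String json) {
        System.out.printf("indexing %s -> %s%n", id, json);
    }

    public static void main(String[] args) {
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "tweet-indexer");
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            produceTweet(producer, "user-42", "{\"text\": \"hello kafka\"}");
            consumeAndIndex(consumer);
        }
    }
}

The point of the sketch is that neither stage knows the other exists: the ingest side only writes to Kafka, and the indexing side only reads from Kafka.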

We added Kafka Connect to Apache Kafka in version 0.9 after seeing Kafka used in both of these ways at LinkedIn and other large organizations. We noticed that there were specific challenges in integrating Kafka into data pipelines that every organization had to solve on its own, and decided to add APIs to Kafka that solve some of those challenges rather than force every organization to figure them out from scratch.

The main value Kafka provides to data pipelines is its ability to serve as a very large, reliable buffer between the various stages of a pipeline, effectively decoupling producers and consumers of data within the pipeline. This decoupling, combined with reliability, security, and efficiency, makes Kafka a good fit for most data pipelines.
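
To make the buffer idea concrete, the sketch below creates a pipeline topic with a week of retention, so a downstream stage can lag behind, or be taken offline for maintenance, and still catch up without losing data. It assumes a recent Kafka AdminClient and a broker at localhost:9092; the topic name, partition count, replication factor, and retention period are illustrative choices, not recommendations from the text.

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class BufferTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Twelve partitions let downstream consumers scale out, three replicas
            // keep the buffer durable if a broker fails, and seven days of retention
            // let slow consumers fall behind and recover at their own pace.
            NewTopic pipelineTopic = new NewTopic("clicks-pipeline", 12, (short) 3)
                    .configs(Map.of("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));
            admin.createTopics(Collections.singleton(pipelineTopic)).all().get();
        }
    }
}

If the consuming stage of a pipeline built on such a topic is stopped for a few hours, records simply accumulate in the topic and are processed when it comes back, without the producing stage ever noticing.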

Putting Data Integration in Context

Some organizations think of Kafka as an end point of a pipeline. They look at problems such as “How do I get data from Kafka ...
