Kafka: The Definitive Guide

by Neha Narkhede, Gwen Shapira, Todd Palino

September 2017

Beginner to intermediate

319 pages

9h 10m

English

O'Reilly Media, Inc.

Foreword
Preface
Who Should Read This Book; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. Meet Kafka
Publish/Subscribe Messaging; How It Starts; Individual Queue Systems; Enter Kafka; Messages and Batches; Schemas; Topics and Partitions; Producers and Consumers; Brokers and Clusters; Multiple Clusters; Why Kafka?; Multiple Producers; Multiple Consumers; Disk-Based Retention; Scalable; High Performance; The Data Ecosystem; Use Cases; Kafka’s Origin; LinkedIn’s Problem; The Birth of Kafka; Open Source; The Name; Getting Started with Kafka
2. Installing Kafka
First Things First; Choosing an Operating System; Installing Java; Installing Zookeeper; Installing a Kafka Broker; Broker Configuration; General Broker; Topic Defaults; Hardware Selection; Disk Throughput; Disk Capacity; Memory; Networking; CPU; Kafka in the Cloud; Kafka Clusters; How Many Brokers?; Broker Configuration; OS Tuning; Production Concerns; Garbage Collector Options; Datacenter Layout; Colocating Applications on Zookeeper; Summary
3. Kafka Producers: Writing Messages to Kafka
Producer Overview; Constructing a Kafka Producer; Sending a Message to Kafka; Sending a Message Synchronously; Sending a Message Asynchronously; Configuring Producers; Serializers; Custom Serializers; Serializing Using Apache Avro; Using Avro Records with Kafka; Partitions; Old Producer APIs; Summary
4. Kafka Consumers: Reading Data from Kafka
Kafka Consumer Concepts; Consumers and Consumer Groups; Consumer Groups and Partition Rebalance; Creating a Kafka Consumer; Subscribing to Topics; The Poll Loop; Configuring Consumers; Commits and Offsets; Automatic Commit; Commit Current Offset; Asynchronous Commit; Combining Synchronous and Asynchronous Commits; Commit Specified Offset; Rebalance Listeners; Consuming Records with Specific Offsets; But How Do We Exit?; Deserializers; Standalone Consumer: Why and How to Use a Consumer Without a Group; Older Consumer APIs; Summary
5. Kafka Internals
Cluster Membership; The Controller; Replication; Request Processing; Produce Requests; Fetch Requests; Other Requests; Physical Storage; Partition Allocation; File Management; File Format; Indexes; Compaction; How Compaction Works; Deleted Events; When Are Topics Compacted?; Summary
6. Reliable Data Delivery
Reliability Guarantees; Replication; Broker Configuration; Replication Factor; Unclean Leader Election; Minimum In-Sync Replicas; Using Producers in a Reliable System; Send Acknowledgments; Configuring Producer Retries; Additional Error Handling; Using Consumers in a Reliable System; Important Consumer Configuration Properties for Reliable Processing; Explicitly Committing Offsets in Consumers; Validating System Reliability; Validating Configuration; Validating Applications; Monitoring Reliability in Production; Summary
7. Building Data Pipelines
Considerations When Building Data Pipelines; Timeliness; Reliability; High and Varying Throughput; Data Formats; Transformations; Security; Failure Handling; Coupling and Agility; When to Use Kafka Connect Versus Producer and Consumer; Kafka Connect; Running Connect; Connector Example: File Source and File Sink; Connector Example: MySQL to Elasticsearch; A Deeper Look at Connect; Alternatives to Kafka Connect; Ingest Frameworks for Other Datastores; GUI-Based ETL Tools; Stream-Processing Frameworks; Summary
8. Cross-Cluster Data Mirroring
Use Cases of Cross-Cluster Mirroring; Multicluster Architectures; Some Realities of Cross-Datacenter Communication; Hub-and-Spokes Architecture; Active-Active Architecture; Active-Standby Architecture; Stretch Clusters; Apache Kafka’s MirrorMaker; How to Configure; Deploying MirrorMaker in Production; Tuning MirrorMaker; Other Cross-Cluster Mirroring Solutions; Uber uReplicator; Confluent Replicator; Summary
9. Administering Kafka
Topic Operations; Creating a New Topic; Adding Partitions; Deleting a Topic; Listing All Topics in a Cluster; Describing Topic Details; Consumer Groups; List and Describe Groups; Delete Group; Offset Management; Dynamic Configuration Changes; Overriding Topic Configuration Defaults; Overriding Client Configuration Defaults; Describing Configuration Overrides; Removing Configuration Overrides; Partition Management; Preferred Replica Election; Changing a Partition’s Replicas; Changing Replication Factor; Dumping Log Segments; Replica Verification; Consuming and Producing; Console Consumer; Console Producer; Client ACLs; Unsafe Operations; Moving the Cluster Controller; Killing a Partition Move; Removing Topics to Be Deleted; Deleting Topics Manually; Summary
10. Monitoring Kafka
Metric Basics; Where Are the Metrics?; Internal or External Measurements; Application Health Checks; Metric Coverage; Kafka Broker Metrics; Under-Replicated Partitions; Broker Metrics; Topic and Partition Metrics; JVM Monitoring; OS Monitoring; Logging; Client Monitoring; Producer Metrics; Consumer Metrics; Quotas; Lag Monitoring; End-to-End Monitoring; Summary
11. Stream Processing
What Is Stream Processing?; Stream-Processing Concepts; Time; State; Stream-Table Duality; Time Windows; Stream-Processing Design Patterns; Single-Event Processing; Processing with Local State; Multiphase Processing/Repartitioning; Processing with External Lookup: Stream-Table Join; Streaming Join; Out-of-Sequence Events; Reprocessing; Kafka Streams by Example; Word Count; Stock Market Statistics; Click Stream Enrichment; Kafka Streams: Architecture Overview; Building a Topology; Scaling the Topology; Surviving Failures; Stream Processing Use Cases; How to Choose a Stream-Processing Framework; Summary
A. Installing Kafka on Other Operating Systems
Installing on Windows; Using Windows Subsystem for Linux; Using Native Java; Installing on MacOS; Using Homebrew; Installing Manually
Index

Content preview from Kafka: The Definitive Guide

Chapter 8. Cross-Cluster Data Mirroring

For most of the book, we discuss the setup, maintenance, and use of a single Kafka cluster. There are, however, a few scenarios in which an architecture may need more than one cluster.

In some cases, the clusters are completely separate. They belong to different departments or serve different use cases, and there is no reason to copy data from one cluster to another. Sometimes, different SLAs or workloads make it difficult to tune a single cluster to serve multiple use cases. Other times, there are different security requirements. Those use cases are fairly easy: managing multiple distinct clusters is the same as running a single cluster multiple times.

In other use cases, the different clusters are interdependent and the administrators need to continuously copy data between the clusters. In most databases, continuously copying data between database servers is called replication. Since we’ve used “replication” to describe movement of data between Kafka nodes that are part of the same cluster, we’ll call copying of data between Kafka clusters mirroring. Apache Kafka’s built-in cross-cluster replicator is called MirrorMaker.

In this chapter we will discuss cross-cluster mirroring of all or part of the data. We’ll start by discussing some of the common use cases for cross-cluster mirroring. Then we’ll show a few architectures that are used to implement these use cases and discuss the pros and cons of each architecture pattern. We’ll then discuss MirrorMaker ...

Publisher Resources

ISBN: 9781491936153
