Kafka: The Definitive Guide, 2nd Edition

by Gwen Shapira, Todd Palino, Rajini Sivaram, Krit Petty

November 2021

Beginner to intermediate

485 pages

14h 22m

English

O'Reilly Media, Inc.

Chapter 10. Cross-Cluster Data Mirroring

For most of the book we discuss the setup, maintenance, and use of a single Kafka cluster. There are, however, a few scenarios in which an architecture may need more than one cluster.

In some cases, the clusters are completely separated. They belong to different departments or different use cases, and there is no reason to copy data from one cluster to another. Sometimes, different SLAs or workloads make it difficult to tune a single cluster to serve multiple use cases. Other times, there are different security requirements. Those use cases are fairly easy—managing multiple distinct clusters is the same as running a single cluster multiple times.

In other use cases, the different clusters are interdependent, and the administrators need to continuously copy data between the clusters. In most databases, continuously copying data between database servers is called replication. Since we’ve used replication to describe movement of data between Kafka nodes that are part of the same cluster, we’ll call copying of data between Kafka clusters mirroring. Apache Kafka’s built-in cross-cluster replicator is called MirrorMaker.
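To make the idea of mirroring concrete, here is a minimal toy sketch in Python. It is not MirrorMaker and does not use any Kafka client library. The names `ToyCluster`, `ToyMirror`, and `run_once` are hypothetical, invented for illustration, and each "cluster" is assumed to be nothing more than an in-memory map from topic name to an ordered list of records. The sketch shows only the core loop: read what is new on the source cluster, write it to the target cluster, and remember how far the copy has progressed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (key, value)


@dataclass
class ToyCluster:
    """Hypothetical in-memory stand-in for a cluster: topic -> ordered records."""
    name: str
    topics: Dict[str, List[Record]] = field(default_factory=dict)

    def produce(self, topic: str, key: str, value: str) -> None:
        # Append to the end of the topic's log, creating the topic if needed.
        self.topics.setdefault(topic, []).append((key, value))


@dataclass
class ToyMirror:
    """Hypothetical mirroring process that copies records from source to target."""
    source: ToyCluster
    target: ToyCluster
    # Next source position to copy, tracked per topic.
    positions: Dict[str, int] = field(default_factory=dict)

    def run_once(self, topics: List[str]) -> int:
        """Copy every record not yet mirrored; return how many were copied."""
        copied = 0
        for topic in topics:
            log = self.source.topics.get(topic, [])
            start = self.positions.get(topic, 0)
            for key, value in log[start:]:
                self.target.produce(topic, key, value)
                copied += 1
            # Remember where to resume on the next pass.
            self.positions[topic] = len(log)
        return copied
```

Calling `run_once` repeatedly approximates "continuously copying": each pass picks up only the records produced to the source since the previous pass. A real mirroring tool also has to deal with partitions, failures, and configuration, none of which this toy models.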

In this chapter, we will discuss cross-cluster mirroring of all or part of the data. We’ll start by discussing some of the common use cases for cross-cluster mirroring. Then we’ll show a few architectures that are used to implement these use cases and discuss the pros and cons of each architecture pattern. We’ll then discuss ...