Chapter 4. Querying Kafka with Kafka Streams
AATD doesn’t currently have real-time insight into the number of orders being placed or the revenue being generated. The company would like to know if there are spikes or dips in the numbers of orders so that it can react more quickly in the operations part of the business.
The AATD engineering team is already familiar with Kafka Streams from other applications that they’ve built, so we’re going to create a Kafka Streams app that exposes an HTTP endpoint showing recent orders and revenue. We’ll build this app with the Quarkus framework, starting with a naive version. Then we’ll apply some optimizations. We’ll conclude with a summary of the limitations of using a stream processor to query streaming data. Figure 4-1 shows what we’ll be building in this chapter.
What Is Kafka Streams?
Kafka Streams is a library for building streaming applications that transform input Kafka topics into output Kafka topics. It is an example of the stream processor component of the real-time analytics stack described in Chapter 2.
Kafka Streams is often used for joining, filtering, and transforming streams, but in this chapter we’re going to use it to query an existing stream.
At the heart of a Kafka Streams application is a topology, which defines the stream processing logic of the application. A topology describes how data is consumed from input streams (source) and then transformed into something that can be produced to output streams (sink).
More specifically, Jacek Laskowski, author of The Internals of Kafka Streams, defines a topology as follows:
A directed acyclic graph of stream processing nodes that represents the stream processing logic of a Kafka Streams application.
In this graph, the nodes are the processing work, and the relationships are streams. Through this topology, we can create powerful streaming applications that can handle even the most complex data processing tasks. You can see an example topology in Figure 4-2.
Kafka Streams provides a domain-specific language (DSL) that simplifies the building of these topologies.
Let’s go through the definitions of some Kafka Streams abstractions that we’ll be using in this section. The following definitions are from the official documentation:
- KStream: A KStream is an abstraction of a record stream, where each data record represents a self-contained datum in the unbounded dataset. The data records in a KStream are interpreted as “INSERT” operations, where each record adds a new entry to an append-only ledger. In other words, each record represents a new piece of data that is added to the stream without replacing any existing data with the same key.
- KTable: A KTable is an abstraction of a change log stream, where each data record represents an update. Each record in a KTable represents an update to the previous value for a specific record key, if any exists. If a corresponding key doesn’t exist yet, the update is treated as an “INSERT” operation. In other words, each record in a KTable represents an update to the existing data with the same key or the addition of a new record with a new key-value pair.
- State Store: State Stores are storage engines for managing the state of stream processors. They can store the state in memory or in a database like RocksDB. When stateful functions like aggregation or windowing functions are called, intermediate data is stored in the State Store. This data can then be queried by the read side of a stream processing application to generate output streams or tables. State Stores are an efficient way to manage the state of stream processors and enable the creation of powerful stream processing applications in Kafka Streams.
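To make these abstractions concrete, here is a minimal sketch (not part of AATD’s application) showing how a KStream, a KTable, and a named state store relate. The topic names, store name, and package are hypothetical and chosen purely for illustration:

package pizzashop.examples; // hypothetical package, for illustration only

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.*;

public class WordCountTopology {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // KStream: each record on the (hypothetical) `words` topic is an independent event
        KStream<String, String> words =
            builder.stream("words", Consumed.with(Serdes.String(), Serdes.String()));

        // KTable: grouping and counting by key produces a changelog of the latest
        // count per word, materialized into a queryable state store
        KTable<String, Long> counts = words
            .groupByKey()
            .count(Materialized.as("WordCountStore"));

        // Emit the changelog to a sink topic
        counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

        return builder.build();
    }
}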
What Is Quarkus?
Quarkus is a Java framework optimized for building cloud native applications that are deployed on Kubernetes. Developed by Red Hat’s engineering team and released in 2019, Quarkus offers a modern, lightweight approach to building Java applications that is ideally suited to the needs of cloud native development.
The framework includes a wide range of extensions for popular technologies, including Camel, Hibernate, MongoDB, Kafka Streams, and more. These extensions provide a simple and efficient way to integrate these tools into your microservices architecture, speeding up development time and streamlining the process of building complex distributed systems.
The native Kafka Streams integration in particular makes it a great choice for us.
Quarkus Application
Now that we’ve got the definitions out of the way, it’s time to start building our Kafka Streams app.
Installing the Quarkus CLI
The Quarkus CLI is a powerful tool that lets us create and manage Quarkus applications from the command line. With the Quarkus CLI, we can quickly scaffold new applications, generate code, run tests, and deploy our applications to various environments. There are many ways to install the CLI, so you can almost certainly find one that you prefer.
I’m a big fan of SDKMAN, so I’m going to install it using that. SDKMAN makes it easy to install and manage software development kits (SDKs). It has lots of useful features, including automated updates, environment management, and support for multiple platforms. I use it to run different Java versions on my machine.
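If you don’t already have SDKMAN on your machine, it can be installed with a couple of shell commands (check the SDKMAN website for the current instructions, as these may change):

curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"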
We can install Quarkus with SDKMAN by running the following command:
sdk install quarkus
We can check that it’s installed by running the following command:
quarkus --version
You should see output similar to Example 4-1.
Example 4-1. Quarkus version
2.13.1.Final
Note
The Quarkus CLI isn’t mandatory, but having it installed does make the development process much smoother, so we suggest installing it!
Creating a Quarkus Application
Now that we’ve got that installed, we can run the following command to create our pizza shop app:
quarkus create app pizzashop --package-name pizzashop
cd pizzashop
This command will create a Maven application with most of the dependencies that we’ll need and a skeleton structure to get us started.
The only thing missing is Kafka Streams, which we can add using the kafka-streams extension:
quarkus extension add 'kafka-streams'
We’re now ready to start building our application.
Creating a Topology
The first thing we need to do is create a Kafka Streams topology. A Quarkus application can define a single topology, in which we’ll define all our stream operations. This could include joining streams together to create a new stream, filtering a stream, creating a key-value store based on a stream, and more.
Once we have our topology class, we’ll create a couple of window stores that keep track of the total orders and revenue generated in the last couple of minutes. This will allow us to create an HTTP endpoint that returns a summary of the latest orders based on the contents of these stores.
Create the file src/main/java/pizzashop/streams/Topology.java and add this:
package pizzashop.streams;

import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;
import org.apache.kafka.streams.state.WindowStore;
import pizzashop.deser.JsonDeserializer;
import pizzashop.deser.JsonSerializer;
import pizzashop.models.Order;

import javax.enterprise.context.ApplicationScoped;
import javax.enterprise.inject.Produces;
import java.time.Duration;

@ApplicationScoped
public class Topology {
    @Produces
    public org.apache.kafka.streams.Topology buildTopology() {
        final Serde<Order> orderSerde = Serdes.serdeFrom(
            new JsonSerializer<>(), new JsonDeserializer<>(Order.class));

        // Create a stream over the `orders` topic
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, Order> orders = builder.stream("orders",
            Consumed.with(Serdes.String(), orderSerde));

        // Defining the window size of our state store
        Duration windowSize = Duration.ofSeconds(60);
        Duration advanceSize = Duration.ofSeconds(1);
        Duration gracePeriod = Duration.ofSeconds(60);
        TimeWindows timeWindow = TimeWindows.ofSizeAndGrace(windowSize, gracePeriod)
            .advanceBy(advanceSize);

        // Create an OrdersCountStore that keeps track of the
        // number of orders over the last two minutes
        orders.groupBy(
                (key, value) -> "count",
                Grouped.with(Serdes.String(), orderSerde))
            .windowedBy(timeWindow)
            .count(Materialized.as("OrdersCountStore"));

        // Create a RevenueStore that keeps track of the amount of revenue
        // generated over the last two minutes
        orders.groupBy(
                (key, value) -> "count",
                Grouped.with(Serdes.String(), orderSerde))
            .windowedBy(timeWindow)
            .aggregate(
                () -> 0.0,
                (key, value, aggregate) -> aggregate + value.price,
                Materialized.<String, Double, WindowStore<Bytes, byte[]>>as("RevenueStore")
                    .withValueSerde(Serdes.Double()));

        return builder.build();
    }
}
In this code, we first create a KStream based on the orders topic, before creating the OrdersCountStore and RevenueStore, which store a one-minute rolling window of the number of orders and revenue generated.
The grace period is usually used to capture late-arriving events, but we’re using it so that we have two minutes’ worth of windows kept around, which we’ll need later on.
We also have the following model classes that represent events in the orders stream:
package pizzashop.models;

import io.quarkus.runtime.annotations.RegisterForReflection;
import java.util.List;

@RegisterForReflection
public class Order {
    public Order() { }

    public String id;
    public String userId;
    public String createdAt;
    public double price;
    public double deliveryLat;
    public double deliveryLon;
    public List<OrderItem> items;
}

package pizzashop.models;

public class OrderItem {
    public String productId;
    public int quantity;
    public double price;
}
Querying the Key-Value Store
Next, we’ll create the class src/main/java/pizzashop/streams/OrdersQueries.java, which will abstract our interactions with the OrdersCountStore and RevenueStore. Querying state stores like these uses a feature of Kafka Streams called interactive queries:
package pizzashop.streams;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.errors.InvalidStateStoreException;
import org.apache.kafka.streams.state.*;
import pizzashop.models.*;

import javax.enterprise.context.ApplicationScoped;
import javax.inject.Inject;
import java.time.Instant;

@ApplicationScoped
public class OrdersQueries {
    @Inject
    KafkaStreams streams;

    public OrdersSummary ordersSummary() {
        KStreamsWindowStore<Long> countStore =
            new KStreamsWindowStore<>(ordersCountsStore());
        KStreamsWindowStore<Double> revenueStore =
            new KStreamsWindowStore<>(revenueStore());

        Instant now = Instant.now();
        Instant oneMinuteAgo = now.minusSeconds(60);
        Instant twoMinutesAgo = now.minusSeconds(120);

        long recentCount = countStore.firstEntry(oneMinuteAgo, now);
        double recentRevenue = revenueStore.firstEntry(oneMinuteAgo, now);

        long previousCount = countStore.firstEntry(twoMinutesAgo, oneMinuteAgo);
        double previousRevenue = revenueStore.firstEntry(twoMinutesAgo, oneMinuteAgo);

        TimePeriod currentTimePeriod = new TimePeriod(recentCount, recentRevenue);
        TimePeriod previousTimePeriod = new TimePeriod(previousCount, previousRevenue);

        return new OrdersSummary(currentTimePeriod, previousTimePeriod);
    }

    private ReadOnlyWindowStore<String, Double> revenueStore() {
        while (true) {
            try {
                return streams.store(StoreQueryParameters.fromNameAndType(
                    "RevenueStore", QueryableStoreTypes.windowStore()));
            } catch (InvalidStateStoreException e) {
                System.out.println("e = " + e);
            }
        }
    }

    private ReadOnlyWindowStore<String, Long> ordersCountsStore() {
        while (true) {
            try {
                return streams.store(StoreQueryParameters.fromNameAndType(
                    "OrdersCountStore", QueryableStoreTypes.windowStore()));
            } catch (InvalidStateStoreException e) {
                System.out.println("e = " + e);
            }
        }
    }
}
Both ordersCountsStore and revenueStore return data from window stores that hold the order count and the amount of revenue generated, respectively.
The reason for the while(true) { try {} catch {} } block in both functions is that the store might not be available if we call this code before the stream thread is in a RUNNING state. Assuming we don’t have any bugs in our code, we will eventually get to the RUNNING state; it just might take a bit longer than it takes for the HTTP endpoint to start up.
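If you’d rather not rely on catching exceptions in a loop, one alternative is to check the Kafka Streams client’s state directly before querying a store. The following is a sketch of a hypothetical helper (not part of the book’s code); streams.state() and KafkaStreams.State.RUNNING are standard Kafka Streams APIs:

    // Hypothetical helper: block until the stream threads reach RUNNING
    // before querying a state store, instead of retrying on
    // InvalidStateStoreException
    private void waitUntilRunning() throws InterruptedException {
        while (streams.state() != KafkaStreams.State.RUNNING) {
            Thread.sleep(100);
        }
    }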
ordersSummary calls those two functions to get the number of orders for the last minute and the minute before that, as well as the total revenue for the last minute and the minute before that.
KStreamsWindowStore.java is defined here:
package pizzashop.models;

import org.apache.kafka.streams.state.ReadOnlyWindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;
import java.time.Instant;

public class KStreamsWindowStore<T> {
    private final ReadOnlyWindowStore<String, T> store;

    public KStreamsWindowStore(ReadOnlyWindowStore<String, T> store) {
        this.store = store;
    }

    public T firstEntry(Instant from, Instant to) {
        try (WindowStoreIterator<T> iterator = store.fetch("count", from, to)) {
            if (iterator.hasNext()) {
                return iterator.next().value;
            }
        }
        throw new RuntimeException(
            "No entries found in store between " + from + " and " + to);
    }
}
The firstEntry method finds the first entry in the window store in the provided date range and returns its value. If no entries exist, it throws an error.
OrdersSummary.java is defined here:
package pizzashop.models;

import io.quarkus.runtime.annotations.RegisterForReflection;

@RegisterForReflection
public class OrdersSummary {
    private TimePeriod currentTimePeriod;
    private TimePeriod previousTimePeriod;

    public OrdersSummary(TimePeriod currentTimePeriod, TimePeriod previousTimePeriod) {
        this.currentTimePeriod = currentTimePeriod;
        this.previousTimePeriod = previousTimePeriod;
    }

    public TimePeriod getCurrentTimePeriod() {
        return currentTimePeriod;
    }

    public TimePeriod getPreviousTimePeriod() {
        return previousTimePeriod;
    }
}
This class is a data object that keeps track of orders and revenue for the current and previous time periods.
TimePeriod.java is defined here:
package pizzashop.models;

import io.quarkus.runtime.annotations.RegisterForReflection;

@RegisterForReflection
public class TimePeriod {
    private int orders;
    private double totalPrice;

    public TimePeriod(long orders, double totalPrice) {
        // The window store returns a long count; narrow it to fit the int field
        this.orders = (int) orders;
        this.totalPrice = totalPrice;
    }

    public int getOrders() {
        return orders;
    }

    public double getTotalPrice() {
        return totalPrice;
    }
}
This class is a data object that keeps track of orders and revenue.
Creating an HTTP Endpoint
Finally, let’s create the HTTP endpoint that exposes the summary data to our users. Create the file src/main/java/pizzashop/rest/OrdersResource.java and add this:
package pizzashop.rest;

import pizzashop.models.OrdersSummary;
import pizzashop.streams.OrdersQueries;

import javax.enterprise.context.ApplicationScoped;
import javax.inject.Inject;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.core.Response;

@ApplicationScoped
@Path("/orders")
public class OrdersResource {
    @Inject
    OrdersQueries ordersQueries;

    @GET
    @Path("/overview")
    public Response overview() {
        OrdersSummary ordersSummary = ordersQueries.ordersSummary();
        return Response.ok(ordersSummary).build();
    }
}
Running the Application
Now that we’ve created all our classes, it’s time to run the application. We can do this by running the following command:
QUARKUS_KAFKA_STREAMS_BOOTSTRAP_SERVERS=localhost:29092 quarkus dev
We pass in the QUARKUS_KAFKA_STREAMS_BOOTSTRAP_SERVERS environment variable so that Quarkus can connect to our Kafka broker.
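Equivalently, this setting can live in src/main/resources/application.properties alongside the application id and input topics that the Quarkus Kafka Streams extension expects. The property keys below are the extension’s standard ones; the values are illustrative and should match your own broker and topic names:

# Illustrative values; adjust to your broker and topics
quarkus.kafka-streams.bootstrap-servers=localhost:29092
quarkus.kafka-streams.application-id=pizzashop
quarkus.kafka-streams.topics=orders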
Querying the HTTP Endpoint
Now we can query the HTTP endpoint to see how many orders our online service is receiving.
The endpoint is available on port 8080 at /orders/overview:
curl http://localhost:8080/orders/overview 2>/dev/null | jq '.'
The results of this command are shown in Example 4-2.
Example 4-2. Latest orders state
{
  "currentTimePeriod": {
    "orders": 994,
    "totalPrice": 4496973
  },
  "previousTimePeriod": {
    "orders": 985,
    "totalPrice": 4535117
  }
}
Success! We can see the number of orders and the total revenue in the current and previous time periods.
Limitations of Kafka Streams
While this approach for querying streams has been successful in many cases, certain factors could impact its efficacy for our particular use case. In this section, we will take a closer look at these limitations to better understand how they could affect the performance of this approach.
The underlying database used by Kafka Streams is RocksDB, a key-value store that allows you to store and retrieve data using key-value pairs. This fork of Google’s LevelDB is optimized for write-heavy workloads with large datasets.
One of its constraints is that we can create only one index per key-value store. This means that if we decide to query the data along another dimension, we’ll need to update the topology to create another key-value store. If we do a non-key search, RocksDB will do a full scan to find the matching records, leading to high query latency.
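For example, if AATD later wanted to query revenue along a different dimension, say per user, the topology itself would have to change to materialize another store. The following is a hypothetical addition (not code from this chapter) that reuses the topology’s existing orders stream, orderSerde, and timeWindow; the store name RevenuePerUserStore is made up for illustration:

        // Hypothetical: a second materialized window store keyed by userId,
        // needed just to support queries along another dimension
        orders.groupBy(
                (key, value) -> value.userId,
                Grouped.with(Serdes.String(), orderSerde))
            .windowedBy(timeWindow)
            .aggregate(
                () -> 0.0,
                (key, value, aggregate) -> aggregate + value.price,
                Materialized.<String, Double, WindowStore<Bytes, byte[]>>as("RevenuePerUserStore")
                    .withValueSerde(Serdes.Double()));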
Our key-value stores are also capturing only events that happened in the last one minute and the minute before that. If we wanted to capture data going further back, we’d need to update the topology to capture more events. In AATD’s case, we could imagine a future use case where we’d want to compare the sales numbers from right now with the numbers from this same time last week or last month. This would be difficult to do in Kafka Streams because we’d need to store historical data, which would take up a lot of memory.
So although we can use Kafka Streams to write real-time analytics queries and it will do a reasonable job, we probably need to find a tool that better fits the problem.
Summary
In this chapter, we looked at how to build an HTTP API on top of the orders stream so that we can get an aggregate view of what’s happening with orders in the business.
We built this solution using Kafka Streams, but we realized that this might not be the most appropriate tool for the job.
In the next chapter, we’ll learn why we need a serving layer to build a scalable real-time analytics application.