Building Real-Time Analytics Systems

Name: Building Real-Time Analytics Systems
Author: Mark Needham
ISBN: 9781098138790

by Mark Needham

September 2023

Beginner to intermediate

220 pages

4h 36m

English

O'Reilly Media, Inc.

Foreword
Preface
    Conventions Used in This Book
    Using Code Examples
    O’Reilly Online Learning
    How to Contact Us
    Acknowledgments
1. Introduction to Real-Time Analytics
    What Is an Event Stream?
    Making Sense of Streaming Data
    What Is Real-Time Analytics?
    Benefits of Real-Time Analytics
    New Revenue Streams
    Timely Access to Insights
    Reduced Infrastructure Cost
    Improved Overall Customer Experience
    Real-Time Analytics Use Cases
    User-Facing Analytics
    Personalization
    Metrics
    Anomaly Detection and Root Cause Analysis
    Visualization
    Ad Hoc Analytics
    Log Analytics/Text Search
    Classifying Real-Time Analytics Applications
    Internal Versus External Facing
    Machine Versus Human Facing
    Summary
2. The Real-Time Analytics Ecosystem
    Defining the Real-Time Analytics Ecosystem
    The Classic Streaming Stack
    Complex Event Processing
    The Big Data Era
    The Modern Streaming Stack
    Event Producers
    Streaming Data Platform
    Stream Processing Layer
    Serving Layer
    Frontend
    Summary
3. Introducing All About That Dough: Real-Time Analytics on Pizza
    Existing Architecture
    Setup
    MySQL
    Apache Kafka
    ZooKeeper
    Orders Service
    Spinning Up the Components
    Inspecting the Data
    Applications of Real-Time Analytics
    Summary
4. Querying Kafka with Kafka Streams
    What Is Kafka Streams?
    What Is Quarkus?
    Quarkus Application
    Installing the Quarkus CLI
    Creating a Quarkus Application
    Creating a Topology
    Querying the Key-Value Store
    Creating an HTTP Endpoint
    Running the Application
    Querying the HTTP Endpoint
    Limitations of Kafka Streams
    Summary
5. The Serving Layer: Apache Pinot
    Why Can’t We Use Another Stream Processor?
    Why Can’t We Use a Data Warehouse?
    What Is Apache Pinot?
    How Does Pinot Model and Store Data?
    Schema
    Table
    Setup
    Data Ingestion
    Pinot Data Explorer
    Indexes
    Updating the Web App
    Summary
6. Building a Real-Time Analytics Dashboard
    Dashboard Architecture
    What Is Streamlit?
    Setup
    Building the Dashboard
    Summary
7. Product Changes Captured with Change Data Capture
    Capturing Changes from Operational Databases
    Change Data Capture
    Why Do We Need CDC?
    What Is CDC?
    What Are the Strategies for Implementing CDC?
    Log-Based Data Capture
    Requirements for a CDC System
    Debezium
    Applying CDC to AATD
    Setup
    Connecting Debezium to MySQL
    Querying the Products Stream
    Updating Products
    Summary
8. Joining Streams with Kafka Streams
    Enriching Orders with Kafka Streams
    Adding Order Items to Pinot
    Updating the Orders Service
    Refreshing the Streamlit Dashboard
    Summary

9. Upserts in the Serving Layer
    Order Statuses
    Enriched Orders Stream
    Upserts in Apache Pinot
    Updating the Orders Service
    Creating UsersResource
    Adding an allUsers Endpoint
    Adding an Orders for User Endpoint
    Adding an Individual Order Endpoint
    Configuring Cross-Origin Resource Sharing
    Frontend App
    Order Statuses on the Dashboard
    Time Spent in Each Order Status
    Orders That Might Be Stuck
    Summary
10. Geospatial Querying
    Delivery Statuses
    Updating Apache Pinot
    Orders
    Delivery Statuses
    Updating the Orders Service
    Individual Orders
    Delayed Orders by Area
    Consuming the New API Endpoints
    Summary
11. Production Considerations
    Preproduction
    Capacity Planning
    Data Partitioning
    Throughput
    Data Retention
    Data Granularity
    Total Data Size
    Replication Factor
    Deployment Platform
    In-House Skills
    Data Privacy and Security
    Cost
    Control
    Postproduction
    Monitoring and Alerting
    Data Governance
    Summary
12. Real-Time Analytics in the Real World
    Content Recommendation (Professional Social Network)
    The Problem
    The Solution
    Benefits
    Operational Analytics (Streaming Service)
    The Problem
    The Solution
    Benefits
    Real-Time Ad Analytics (Online Marketplace)
    The Problem
    The Solution
    Benefits
    User-Facing Analytics (Collaboration Platform)
    The Problem
    The Solution
    Benefits
    Summary
13. The Future of Real-Time Analytics
    Edge Analytics
    Compute-Storage Separation
    Data Lakehouses
    Real-Time Data Visualization
    Streaming Databases
    Streaming Data Platform as a Service
    Reverse ETL
    Summary
Index
About the Author

Content preview from Building Real-Time Analytics Systems

Chapter 5. The Serving Layer: Apache Pinot

AATD has come to the conclusion that it’s going to need to introduce a new piece of infrastructure to achieve scalable real-time analytics, but isn’t yet convinced that a full-blown OLAP database is necessary.

In this chapter, we’ll start by explaining why we can’t just use a stream processor to serve queries on streams, before introducing Apache Pinot, one of the new breed of OLAP databases designed for real-time analytics. We’ll learn about Pinot’s architecture and data model, before ingesting the orders stream. After that, we’ll learn about timestamp indexes and how to write queries against Pinot using SQL.

Figure 5-1 shows how we’re going to evolve our infrastructure in this chapter.
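
To give a flavor of what querying Pinot with SQL can look like from application code, here is a minimal Python sketch that packages a SQL statement as a request to a Pinot broker's SQL endpoint. It is not taken from the chapter: the broker address (`localhost:8099`), the `orders` table, and the `ts` and `price` columns are all assumptions made for illustration, and `ts` is assumed to hold epoch milliseconds.

```python
import json
import urllib.request

# Assumption: a Pinot broker running locally on its default query port.
BROKER_URL = "http://localhost:8099/query/sql"


def build_query_request(sql: str, broker_url: str = BROKER_URL) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST carrying a SQL query for a Pinot broker."""
    payload = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        broker_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql: str, broker_url: str = BROKER_URL) -> dict:
    """Send the query and return the broker's JSON response as a dict."""
    with urllib.request.urlopen(build_query_request(sql, broker_url)) as response:
        return json.load(response)


# Hypothetical table and column names (orders, ts, price), for illustration only.
# Counts orders and sums revenue over the last minute.
RECENT_ORDERS_SQL = """
SELECT count(*) AS orders, sum(price) AS revenue
FROM orders
WHERE ts > ago('PT1M')
""".strip()
```

With a broker running and a matching table in place, `run_query(RECENT_ORDERS_SQL)["resultTable"]["rows"]` would return the aggregated row. Building the request separately from sending it keeps the sketch easy to inspect without a live cluster.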

Why Can’t We Use Another Stream Processor?

At the end of the last chapter, we described some of the limitations of using Kafka Streams to serve queries on top of streams. (See “Limitations of Kafka Streams”.) These were by no means a criticism of Kafka Streams as a technology; it’s just that we weren’t really using it for the types of problems for which it was designed.

A reasonable question might be, Why can’t we use another stream processor instead, such as ksqlDB or Flink? Both of these tools offer SQL interfaces, solving the issue of having to write Java code to query streams.

Unfortunately, it still doesn’t ...


Building Real-Time Analytics Applications

Publisher Resources

ISBN: 9781098138783


Figure 5-1. Evolution of the orders service
