Building Real-Time Analytics Systems

Book description

Gain deep insight into real-time analytics, including the features of these systems and the problems they solve. With this practical book, data engineers at organizations that use event-processing systems such as Kafka, Google Pub/Sub, and AWS Kinesis will learn how to analyze data streams in real time. The faster you derive insights, the quicker you can spot changes in your business and act accordingly.

Author Mark Needham from StarTree provides an overview of the real-time analytics space and an understanding of what goes into building real-time applications. The book's second part offers a series of hands-on tutorials that show you how to combine multiple software products to build real-time analytics applications for an imaginary pizza delivery service.

You will:

  • Learn common architectures for real-time analytics
  • Discover how event processing differs from real-time analytics
  • Ingest event data from Apache Kafka into Apache Pinot
  • Combine event streams with OLTP data using Debezium and Kafka Streams
  • Write real-time queries against event data stored in Apache Pinot
  • Build a real-time dashboard and order tracking app
  • Learn how Uber, Stripe, and Just Eat use real-time analytics

Table of contents

  1. Foreword
  2. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. O’Reilly Online Learning
    4. How to Contact Us
    5. Acknowledgments
  3. Chapter 1. Introduction to Real-Time Analytics
    1. What Is an Event Stream?
    2. Making Sense of Streaming Data
    3. What Is Real-Time Analytics?
    4. Benefits of Real-Time Analytics
      1. New Revenue Streams
      2. Timely Access to Insights
      3. Reduced Infrastructure Cost
      4. Improved Overall Customer Experience
    5. Real-Time Analytics Use Cases
      1. User-Facing Analytics
      2. Personalization
      3. Metrics
      4. Anomaly Detection and Root Cause Analysis
      5. Visualization
      6. Ad Hoc Analytics
      7. Log Analytics/Text Search
    6. Classifying Real-Time Analytics Applications
      1. Internal Versus External Facing
      2. Machine Versus Human Facing
    7. Summary
  4. Chapter 2. The Real-Time Analytics Ecosystem
    1. Defining the Real-Time Analytics Ecosystem
    2. The Classic Streaming Stack
      1. Complex Event Processing
      2. The Big Data Era
    3. The Modern Streaming Stack
      1. Event Producers
      2. Streaming Data Platform
      3. Stream Processing Layer
      4. Serving Layer
      5. Frontend
    4. Summary
  5. Chapter 3. Introducing All About That Dough: Real-Time Analytics on Pizza
    1. Existing Architecture
    2. Setup
      1. MySQL
      2. Apache Kafka
      3. ZooKeeper
      4. Orders Service
      5. Spinning Up the Components
    3. Inspecting the Data
    4. Applications of Real-Time Analytics
    5. Summary
  6. Chapter 4. Querying Kafka with Kafka Streams
    1. What Is Kafka Streams?
    2. What Is Quarkus?
    3. Quarkus Application
      1. Installing the Quarkus CLI
      2. Creating a Quarkus Application
      3. Creating a Topology
      4. Querying the Key-Value Store
      5. Creating an HTTP Endpoint
    4. Running the Application
    5. Querying the HTTP Endpoint
    6. Limitations of Kafka Streams
    7. Summary
  7. Chapter 5. The Serving Layer: Apache Pinot
    1. Why Can’t We Use Another Stream Processor?
    2. Why Can’t We Use a Data Warehouse?
    3. What Is Apache Pinot?
    4. How Does Pinot Model and Store Data?
      1. Schema
      2. Table
    5. Setup
    6. Data Ingestion
    7. Pinot Data Explorer
    8. Indexes
    9. Updating the Web App
    10. Summary
  8. Chapter 6. Building a Real-Time Analytics Dashboard
    1. Dashboard Architecture
    2. What Is Streamlit?
    3. Setup
    4. Building the Dashboard
    5. Summary
  9. Chapter 7. Product Changes Captured with Change Data Capture
    1. Capturing Changes from Operational Databases
    2. Change Data Capture
      1. Why Do We Need CDC?
      2. What Is CDC?
      3. What Are the Strategies for Implementing CDC?
      4. Log-Based Data Capture
      5. Requirements for a CDC System
      6. Debezium
    3. Applying CDC to AATD
    4. Setup
    5. Connecting Debezium to MySQL
    6. Querying the Products Stream
    7. Updating Products
    8. Summary
  10. Chapter 8. Joining Streams with Kafka Streams
    1. Enriching Orders with Kafka Streams
    2. Adding Order Items to Pinot
    3. Updating the Orders Service
    4. Refreshing the Streamlit Dashboard
    5. Summary
  11. Chapter 9. Upserts in the Serving Layer
    1. Order Statuses
    2. Enriched Orders Stream
    3. Upserts in Apache Pinot
    4. Updating the Orders Service
      1. Creating UsersResource
      2. Adding an allUsers Endpoint
      3. Adding an Orders for User Endpoint
      4. Adding an Individual Order Endpoint
      5. Configuring Cross-Origin Resource Sharing
    5. Frontend App
    6. Order Statuses on the Dashboard
      1. Time Spent in Each Order Status
      2. Orders That Might Be Stuck
    7. Summary
  12. Chapter 10. Geospatial Querying
    1. Delivery Statuses
    2. Updating Apache Pinot
      1. Orders
      2. Delivery Statuses
    3. Updating the Orders Service
      1. Individual Orders
      2. Delayed Orders by Area
    4. Consuming the New API Endpoints
    5. Summary
  13. Chapter 11. Production Considerations
    1. Preproduction
      1. Capacity Planning
      2. Data Partitioning
      3. Throughput
      4. Data Retention
      5. Data Granularity
      6. Total Data Size
      7. Replication Factor
    2. Deployment Platform
      1. In-House Skills
      2. Data Privacy and Security
      3. Cost
      4. Control
    3. Postproduction
      1. Monitoring and Alerting
      2. Data Governance
    4. Summary
  14. Chapter 12. Real-Time Analytics in the Real World
    1. Content Recommendation (Professional Social Network)
      1. The Problem
      2. The Solution
      3. Benefits
    2. Operational Analytics (Streaming Service)
      1. The Problem
      2. The Solution
      3. Benefits
    3. Real-Time Ad Analytics (Online Marketplace)
      1. The Problem
      2. The Solution
      3. Benefits
    4. User-Facing Analytics (Collaboration Platform)
      1. The Problem
      2. The Solution
      3. Benefits
    5. Summary
  15. Chapter 13. The Future of Real-Time Analytics
    1. Edge Analytics
    2. Compute-Storage Separation
    3. Data Lakehouses
    4. Real-Time Data Visualization
    5. Streaming Databases
    6. Streaming Data Platform as a Service
    7. Reverse ETL
    8. Summary
  16. Index
  17. About the Author

Product information

  • Title: Building Real-Time Analytics Systems
  • Author(s): Mark Needham
  • Release date: September 2023
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098138790