Chapter 1. Introduction to Real-Time Analytics

It’s a huge competitive advantage to see in real time what’s happening with your data.

Hilary Mason, Founder and CEO of Fast Forward Labs

A lot of data in a business environment is considered unbounded because it arrives gradually over time. Customers, employees, and machines produced data yesterday and today and will continue to produce more data tomorrow. This process never ends unless you go out of business, so the dataset is never complete in any meaningful way.

Note

Of the companies that participated in Confluent’s Business Impact of Data Streaming: State of Data in Motion Report 2022, 97% have access to real-time data streams, and 66% have widespread access.

Today, many businesses are adopting streaming data and real-time analytics to make faster, more reliable, and more accurate decisions, allowing them to gain a competitive advantage in their market segment.

This chapter provides an introduction to streaming and real-time analytics. We’ll start with a refresher about streaming data before explaining why organizations want to apply analytics on top of that data. After going through some use cases, we’ll conclude with an overview of the types of real-time analytics applications we can build.

What Is an Event Stream?

The term streaming describes a continuous, never-ending flow of data. The data is made available incrementally over time, which means that you can act upon it as it arrives instead of waiting for the whole dataset to become available and then downloading it.

A data stream consists of a series of data points ordered in time, that is, in chronological order, as shown in Figure 1-1.

Figure 1-1. A data stream

Each data point represents an event, or a change in the state of the business. For example, these might be real-time events like a stream of transactions coming from an organization or Internet of Things (IoT) sensors emitting their readings.

One thing event streams have in common is that they keep on producing data for as long as the business exists. Event streams are generated by different data sources in a business, in various formats and volumes.

We can also consider a data stream as an immutable, time-ordered stream of events, carrying facts about state changes that occurred in the business. The data sources that produce these events include, but are not limited to, ecommerce purchases, in-game player activity, information from social networks, clickstream data, activity logs from web servers, sensor data, and telemetry from connected devices or instrumentation in data centers.

An example of an event is the following:

A user with ID 1234 purchased item 567 for $3.99 on 2022/06/12 at 12:23:212

Events are an immutable representation of facts that happened in the past. The facts of this event are shown in Table 1-1.

Table 1-1. Facts in event example

| Fact | Value |
|---|---|
| User ID | 1234 |
| Item purchased | 567 |
| Price paid | $3.99 |
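To make the idea of an immutable fact concrete, here is a minimal Python sketch of the purchase event from Table 1-1. The class name `PurchaseEvent`, its field names, and the helper `try_to_change_price` are hypothetical and chosen only for illustration. The timestamp is left out because the table lists only three facts.

```python
from dataclasses import dataclass, FrozenInstanceError
from decimal import Decimal


@dataclass(frozen=True)  # frozen=True makes instances read-only after creation
class PurchaseEvent:
    """Hypothetical record type for the purchase event in Table 1-1."""
    user_id: int
    item_id: int
    price_paid: Decimal  # Decimal avoids float rounding on currency amounts


event = PurchaseEvent(user_id=1234, item_id=567, price_paid=Decimal("3.99"))


def try_to_change_price(e: PurchaseEvent) -> bool:
    """Return True if the mutation succeeded, False if the event rejected it."""
    try:
        e.price_paid = Decimal("0.00")
        return True
    except FrozenInstanceError:
        return False


print(try_to_change_price(event))  # False: the recorded fact cannot be rewritten
```

If the business later needs to correct a fact, the usual approach in an event stream is to append a new event rather than to modify an old one.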

By aggregating and analyzing event streams, businesses can uncover insights about their customers and use them to improve their offerings. In the next section, we will discuss different means of making sense of events.

Making Sense of Streaming Data

Events have a shelf life. The business value of events rapidly decreases over time, as shown in Figure 1-2.

Figure 1-2. Event shelf life

The sooner you understand events’ behavior, the sooner you can react and maximize your business outcome. For example, if we have an event that a user abandoned their shopping cart, we can reach out to them via SMS or email to find out why that happened. Perhaps we can offer them a voucher for one of the items in their cart to entice them to come back and complete the transaction.

But that only works if we’re able to react to the cart abandonment in real time. If we detect it tomorrow, the user has probably forgotten what they were doing and will likely ignore our email.
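As a rough illustration of reacting to cart abandonment, the following Python sketch scans a time-ordered list of events and reports users who added something to their cart and then went quiet. The event shape, the event type names, and the 30-minute threshold are all assumptions made for this example, not something prescribed by the text.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical event shape: (timestamp_seconds, user_id, event_type)
Event = Tuple[int, str, str]

ABANDON_AFTER_SECONDS = 30 * 60  # assumed threshold for "abandoned"


def abandoned_carts(events: Iterable[Event], now: int) -> List[str]:
    """Return users whose last cart activity is at least the threshold old
    and who have not checked out since. Events must be in time order."""
    last_cart_activity: Dict[str, int] = {}
    for ts, user_id, event_type in events:
        if event_type == "add_to_cart":
            last_cart_activity[user_id] = ts
        elif event_type == "checkout":
            # A checkout closes the cart, so the user is no longer a candidate.
            last_cart_activity.pop(user_id, None)
    return sorted(
        user for user, ts in last_cart_activity.items()
        if now - ts >= ABANDON_AFTER_SECONDS
    )


events = [
    (0, "alice", "add_to_cart"),
    (60, "bob", "add_to_cart"),
    (120, "bob", "checkout"),
    (1500, "carol", "add_to_cart"),
]
print(abandoned_carts(events, now=1800))  # ['alice']
```

Here bob completed his purchase and carol has been idle for only five minutes, so alice is the only user we would contact.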

What Is Real-Time Analytics?

Real-time analytics (RTA) describes an approach to data processing that allows us to extract value from events as soon as they are made available.

Tip

When we use the term real time in this book, we are referring to soft real time. Network latency and garbage collection pauses, for example, may delay the delivery and processing of events by hundreds of milliseconds or more.

Real-time analytics differs substantially from batch processing, where we collect data in batches and then process it, often with quite a long delay between event time and processing time. Figure 1-3 gives a visual representation of batch processing.

Figure 1-3. Batch processing

In contrast, with real-time analytics we react right after the event happens, as shown in Figure 1-4.

Figure 1-4. Real-time processing

Traditionally, batch processing was the only means of data analysis, but it required us to draw artificial time boundaries to make it easier to divide the data into chunks of fixed duration and process them in batches. For example, we might process a day’s worth of data at the end of every day or an hour’s worth of data at the end of every hour. That was too slow for many users because it produced stale results and didn’t allow them to react to things as they were happening.

Over time the impact of these problems was reduced by decreasing the size of processing batches down to the minute or even the second, which eventually led to events being processed as they arrived and fixed time slices being abandoned. And that is the whole idea behind real-time analytics!

Real-time analytics systems capture, analyze, and act upon events as soon as they become available. They are the unbounded, incrementally processed counterpart to the batch processing systems that have dominated the data analytics space for years.
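The difference between the two styles can be sketched in a few lines of Python. The function names and the purchase amounts (in cents) are made up for this example. The batch function needs the complete chunk before it can answer, while the streaming function produces an updated answer after every event.

```python
from typing import Iterable, Iterator, List


def batch_total(batch: List[int]) -> int:
    """Batch style: wait until the whole chunk is collected, then process it."""
    return sum(batch)


def running_totals(stream: Iterable[int]) -> Iterator[int]:
    """Streaming style: emit an updated result after every event."""
    total = 0
    for amount in stream:
        total += amount
        yield total


purchases = [399, 1000, 150]  # made-up purchase amounts in cents

print(batch_total(purchases))           # 1549, known only once the batch closes
print(list(running_totals(purchases)))  # [399, 1399, 1549], one result per event
```

Both approaches arrive at the same final number. What changes is how early the intermediate results become visible.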

Benefits of Real-Time Analytics

Speed is a decisive factor in decision making: organizations that understand and respond to events quickly are more likely to become market leaders, while others remain followers. A real-time analytics system can therefore benefit a business in many ways, as shown in Figure 1-5.

Figure 1-5. Benefits of real-time analytics

In this section, we will explore several benefits of real-time analytics systems.

New Revenue Streams

Real-time analytics can generate new revenue streams for organizations in a couple of different ways. By allowing end users to have access to analytical querying capabilities, organizations can develop brand new data-centered products that are compelling enough that users will pay to have access.

In addition, real-time analytics can make existing applications more compelling, leading to increased user engagement and retention, as well as driving more usage of the application, ultimately creating more revenue for the organization.

Timely Access to Insights

Real-time analytics enables better and faster decision making by providing timely access to actionable insights. Businesses can maximize their profits and reduce losses by understanding and reacting to events in real time. For example, real-time customer behavior analysis makes it possible to launch dynamic, more focused marketing campaigns, which often drive high returns. Also, a real-time temperature-monitoring system can reduce costs by shutting down air conditioning based on fluctuations in temperature.

Reduced Infrastructure Cost

In traditional batch processing, data storage and computation are often coupled, resulting in an exponential growth in infrastructure cost as the data volume grows over time. In real-time analytics, data is processed as it arrives, eliminating the need for costly data storage and processing systems.

Improved Overall Customer Experience

Addressing customer issues took a rather reactive approach in the past, as issues were reported, diagnosed, and solved in lengthy time frames. With real-time analytics, businesses can proactively attend to customers by constantly monitoring for potential issues and fixing them automatically, improving overall customer satisfaction.

Real-Time Analytics Use Cases

Real-time analytics is not a new thing. It has been around in many industries for quite some time. In this section, we will look at several real-world use cases where real-time analytics is applicable and has already been delivering value to businesses.

There are a variety of use cases for real-time analytics, and each use case has different requirements with respect to query throughput, query latency, query complexity, and data accuracy. A real-time metrics use case requires higher data accuracy, but it’s fine if queries take a bit longer to return. On the other hand, a user-facing analytical application must be optimized for query speed.

Table 1-2 describes some use cases and their query properties.

Table 1-2. Common use cases and their properties

| Use case | Query throughput (queries/second) | Query latency (p95) | Consistency & accuracy | Query complexity |
|---|---|---|---|---|
| User-facing analytics | Very high | Very low | Best effort | Low |
| Personalization | Very high | Very low | Best effort | Low |
| Metrics | High | Very low | Accurate | Low |
| Anomaly detection | Moderate | Very low | Best effort | Low |
| Root-cause analytics | High | Very low | Best effort | Low |
| Visualizations & dashboarding | Moderate | Low | Best effort | Medium |
| Ad hoc analytics | Low | High | Best effort | High |
| Log analytics & text search | Moderate | Moderate | Best effort | Medium |

User-Facing Analytics

Organizations have been generating and collecting massive amounts of data for a long time now. Analytics on that data has been playing a crucial role in analyzing user behavior, growth potential, and revenue spend, enabling employees and executives to make key business decisions.

This analytics has mostly been done inside organizations, but there is an increasing desire to provide this analytical capability directly to end users. Doing so will democratize decision making and provide an even more personalized experience. The term user-facing analytics has been coined to describe this process.

The key requirements are high throughput and low query latency, since this directly impacts the user experience.

Personalization

Personalization is a special type of user-facing analytics used to personalize the product experience for a given user. This might mean showing them content that they’ll be particularly interested in or presenting them with vouchers specific to their interests.

This is done by going through a user’s activity and product interaction and extracting real-time features, which are then used to generate personalized recommendations or actions.

Metrics

Tracking business metrics in real time allows us to get an up-to-date view on key performance indicators (KPIs). This enables organizations to identify issues and take proactive steps to address them in a timely manner.

Being able to do this is particularly critical for operational intelligence, anomaly/fraud detection, and financial planning.

This use case has a requirement for a high number of queries per second along with low latency. We must also achieve a high degree of data accuracy.

Anomaly Detection and Root Cause Analysis

Anomaly detection and root cause analysis are popular use cases when working with time-series data. Time-series data is a sequence of data points collected over a period of time.

In the context of ecommerce, this could include data like the number of transactions per day, average transaction value, or number of returns.

Anomaly detection is all about identifying unusual patterns in data that may indicate a problem that we need to address. That problem might be a sudden surge in orders of a particular product or even that fraudulent activity is being committed.

Either way, we need to find out what went wrong quickly. It’s no use finding out that we had a problem 24 hours ago; we need to know about it now!

And once we’ve detected that something unusual has happened, we also need to understand which dimensions were responsible for any anomalies. In other words, we need to find the root cause of the issue.

This use case performs temporal scans and Group By queries at a high number of queries per second.
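One simple way to flag unusual points in a time series is to compare each new value with the mean and standard deviation of the values that came just before it. The sketch below uses that technique, a trailing-window z-score, purely as an illustration. The function name, window size, threshold, and sample data are assumptions, and production systems typically use more robust methods.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Deque, Iterable, List, Tuple


def find_anomalies(values: Iterable[float], window: int = 5,
                   threshold: float = 3.0) -> List[Tuple[int, float]]:
    """Flag points more than `threshold` standard deviations away from
    the mean of the previous `window` points."""
    history: Deque[float] = deque(maxlen=window)
    anomalies: List[Tuple[int, float]] = []
    for index, value in enumerate(values):
        if len(history) == window:  # only score once the window is full
            mu = mean(history)
            sigma = pstdev(history)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                anomalies.append((index, value))
        history.append(value)  # the oldest point drops out automatically
    return anomalies


# Made-up transactions-per-day counts with one obvious spike.
daily_transactions = [100, 102, 98, 101, 99, 250, 100]
print(find_anomalies(daily_transactions))  # [(5, 250)]
```

This only covers the detection step. Finding the root cause would additionally require breaking the flagged period down by dimension, for example by product or region.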

Visualization

Much has been said about the death of the dashboard, but dashboards still have a role to play in the real-time analytics space.

This could be as simple as a dashboard that plots metrics on different charts or as complex as geospatial visualization, clustering, trend analysis, and more. The main difference from a typical dashboard is that tables and charts will be constantly updated as new data comes in.

The serving layer must integrate with existing visualization solutions like Apache Superset and Grafana.

Ad Hoc Analytics

Analysts often want to do real-time data exploration to debug issues and detect patterns as events are being ingested. This means that we need to be able to run SQL queries against the serving layer.

Analysts will also want to do some analysis that combines real-time data with historical data. For example, they might want to see how the business is performing this month compared to the same month in previous years. This means that we either need to bring the historical data into the serving layer or use a distributed SQL query engine that can combine multiple data sources. The number of queries per second will be low, but query complexity is likely to be high.
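The month-over-year comparison can be sketched with Python's built-in `sqlite3` module. SQLite here is only a stand-in for whichever serving layer or query engine is actually used. The table names, columns, and rows are hypothetical. The query unions the historical and real-time tables and then compares June revenue across years.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE historical_orders (order_date TEXT, amount REAL);
    CREATE TABLE realtime_orders   (order_date TEXT, amount REAL);
    INSERT INTO historical_orders VALUES
        ('2021-06-03', 100.0), ('2021-06-20', 50.0), ('2021-07-01', 999.0);
    INSERT INTO realtime_orders VALUES
        ('2022-06-05', 120.0), ('2022-06-12', 80.0);
""")

# Combine both sources, keep only June, and total the revenue per year.
rows = conn.execute("""
    SELECT strftime('%Y', order_date) AS year, SUM(amount) AS revenue
    FROM (
        SELECT order_date, amount FROM historical_orders
        UNION ALL
        SELECT order_date, amount FROM realtime_orders
    ) AS all_orders
    WHERE strftime('%m', order_date) = '06'
    GROUP BY year
    ORDER BY year
""").fetchall()

print(rows)  # [('2021', 150.0), ('2022', 200.0)]
```

In a real deployment the two tables would usually live in different systems, which is why the text mentions either copying the historical data into the serving layer or querying both through a distributed SQL engine.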

Log Analytics/Text Search

Running real-time text search queries on application log data is a less common but still important use case. Since logs are often unstructured, we must be able to run regex-style text search on this data to triage production issues.

The queries per second will be low for most applications, but will get higher if we are debugging a user-facing application.
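A regex-style search over log lines might look like the following Python sketch. The log format, the pattern, and the function name are invented for this example, and real application logs vary widely.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical log lines; real formats depend on the application.
LOG_LINES = [
    "2022-06-12 12:23:01 INFO  user=1234 action=purchase item=567",
    "2022-06-12 12:23:05 ERROR user=1234 payment gateway timeout after 3000ms",
    "2022-06-12 12:23:09 WARN  user=42 retrying request",
    "2022-06-12 12:23:12 ERROR user=42 payment gateway timeout after 5000ms",
]

# Capture the user ID and the timeout duration from ERROR lines.
TIMEOUT_PATTERN = re.compile(r"ERROR user=(\d+) .*timeout after (\d+)ms")


def search_timeouts(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (user_id, timeout_ms) for every line that matches the pattern."""
    results: List[Tuple[int, int]] = []
    for line in lines:
        match = TIMEOUT_PATTERN.search(line)
        if match:
            results.append((int(match.group(1)), int(match.group(2))))
    return results


print(search_timeouts(LOG_LINES))  # [(1234, 3000), (42, 5000)]
```

A linear scan like this works for a handful of lines. A serving layer that supports text search would normally rely on an index to answer the same question over much larger volumes.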

Classifying Real-Time Analytics Applications

Now that you’ve been introduced to streaming data and real-time analytics, along with its benefits and several industry use cases, the remaining chapters of the book will walk through the process of building real-time analytics applications to harness value from streaming data.

Before starting to build, let’s classify real-time analytics applications based on the audience and the use cases they serve. That helps us pick the right application type to solve our analytics needs.

The quadrant diagram shown in Figure 1-6 divides real-time analytics applications into four categories along two axes.

Figure 1-6. Real-time analytics quadrants

Internal Versus External Facing

There are two types of real-time analytics applications: internal and external facing.

Internal facing means the insights produced by applications are utilized within organizational boundaries, possibly for internal use cases. An example would be a transportation company monitoring vehicle performance to optimize fuel efficiency and detect maintenance issues, or a telecommunications provider that monitors network performance and network capacity.

External facing means that the insights are consumed by an audience external to the organization, possibly by end users. Examples would be a user of a ride-sharing application tracking the ride’s location in real time or a health-care app that tracks patient vital signs in real time and alerts health-care professionals to any changes that require attention.

Traditionally, many more applications have been built for internal users because doing so is easier. The number of users concurrently accessing an internal application is usually relatively small, and they also have a higher tolerance for query latency: if a query takes 10 seconds to run, so be it.

External users, on the other hand, aren’t nearly as forgiving. They expect queries to return results instantly, and there are many more of these users accessing a given application, all of whom will likely be using the system at the same time. Even so, serving real-time analytics to external users gives us a big opportunity, as shown in Figure 1-7.

Figure 1-7. The analytics flywheel

Real-time analytics improves the products that we’re offering to external users. This in turn means that more of them will use and engage with those products, in the process generating more data, which we can use to create new products.

Machine Versus Human Facing

The analytics produced by human-facing applications are delivered using a UI, such as a dashboard, and consumed by humans, including decision-makers, analysts, operators, engineers, and end users. Ad hoc interactive exploration of insights is the primary objective of these applications.

Machine-facing analytics applications are consumed by machines, such as microservices, recommendation engines, and machine learning algorithms. There, the logic to derive analytics is programmed into the application prior to the execution, eliminating any human intervention. Speed and accuracy are primary objectives of machine-facing applications, where humans often fail to deliver at scale.

Summary

In this chapter, we learned how streams of events form the foundation for real-time analytics, the practice of analyzing events as soon as they are made available. We also discussed the benefits of real-time analytics, along with a few industry use cases. Finally, we classified real-time analytics applications based on the audience they are serving.

In the next chapter, we will dive deep into the real-time analytics landscape to identify critical components that exist in a typical real-time analytics infrastructure.
