Chapter 1. The Four Requirements of Real-Time Analytics

Data alone is not valuable. It needs analysis to turn rows of attributes and numbers into actionable insights, and those insights usually have more value the sooner you generate them. Real-time analytics is the ideal case: data is available for reporting as soon as it is collected. Any delay in moving data from transactional systems to data storage to analysis can leave business value on the table.

Consider this customer service example:

  1. A company collects data on a customer unable to find a particular product during a web interaction.

  2. The data sits in the transactional system until it is loaded into an analytic database hours later that night.

  3. An analyst writes a query to pull information about what customers are unable to find. This takes several hours to write and run.

  4. The report arrives in the salesperson’s inbox 24 hours after the transaction occurred.

  5. The product the user was looking for was available at another nearby store, but because it took so long to collect, store, and analyze the data, the customer moved on to another vendor.

So how do companies reduce or eliminate the delay from transaction to reporting? More importantly, how can that company generate and use that insight at the point of sale at the very moment that customer is looking for that product? There are four key requirements for supporting real-time analytics. They are latency, freshness, throughput, and concurrency, as seen in Figure 1-1.

Figure 1-1. Where real-time analytics gets slowed down

These four ideas work together to control your business’s ability to make decisions on the fly. They are required to ensure that data transforms into actionable analytics as close to real time as possible (in fractions of a second). They are also necessary to quickly tie current transactions to historical data for comparisons and trends without having to preprocess the data. Let’s investigate each of these separately to understand how they impact the flow of data within your organization.

Latency

Speed is the most obvious requirement of real-time analytics. How long does it take from when you request data to when the data is returned? Actionable insights about your business are not useful if the queries that generate them take hours or days to run. Additionally, as your business evolves, the queries may become more complex and demanding. Combining historical and streaming data into a useful form can significantly slow things down, and when new insights and queries are needed, the entire process may have to be reengineered.

Many of the remaining requirements for real-time analytics impact speed in one way or another, but the queries themselves could be throttling you. Are you having to sacrifice sources of data that would enrich your analysis? Are there additional dimensions included that are not needed? How long does it take to aggregate and translate data into an actionable format? The speed of your data depends greatly on how it is stored, and that then dictates what needs to happen to pull and combine it into actionable information successfully.

Additionally, your data is always growing. As more data gets added to your data warehouse, more time and computational resources are needed to process the data for analysis. Changes in your business add new sources, tables, more dimensions, and additional indicators to measure. Can your current infrastructure handle the pressure as demands increase? Can you overcome any challenges you encounter with speed by simple brute force methods like adding servers, processing power, or RAM?
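Latency, in the sense used above, is something you can measure directly: the elapsed time between issuing a query and receiving its result. The following sketch is purely illustrative, using an in-memory SQLite table (not any particular analytic database) as a stand-in; the table name and data are hypothetical.

```python
import sqlite3
import time

# Illustrative stand-in for an analytic store: an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", "widget", i * 1.5) for i in range(10_000)],
)

# Query latency: wall-clock time from request to returned result.
start = time.perf_counter()
total = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE store = 'north'"
).fetchone()[0]
latency_ms = (time.perf_counter() - start) * 1000

print(f"query latency: {latency_ms:.2f} ms, total = {total}")
```

In practice you would record this measurement for every query (often as percentiles such as p95 and p99, since averages hide slow outliers) and watch how it grows as data volume and query complexity increase.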

Freshness

Freshness, also known as timeliness, is the time between when something happens and when data about that event is available to act on. In your business, you need to react quickly when things change. The value of a few milliseconds grows every day, from capturing sales opportunities to treating life-threatening conditions. How do you ensure that your data is up to date and ready to serve your organization?

The timeliness of data is important to both decision systems and decision makers. How long does your company wait before data from your transaction systems is available for a model to make a prediction from it? It may be possible for your systems to push data quickly from where it is stored; however, if the data only updates weekly, is that gap between data and decision leaving value on the table? Data needs to reach your AI/ML models and analytics tools as soon as possible after it is generated in production systems. When an event impacts your business, you need to be able to react to it as soon as possible to gain the most benefit or prevent a potential loss. Like latency, timely data can mark the difference between a successful interaction with a customer during a shopping session and a lost sale.
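Freshness can be expressed as a simple lag: the gap between an event's timestamp and the moment that event becomes queryable. A minimal sketch, with purely illustrative timestamps, contrasts the nightly-batch scenario from the customer service example with a streaming pipeline:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical timestamps: a customer action and the moment its data
# becomes available for analysis.
event_time = datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc)

# Nightly batch load: the data lands a day later.
batch_loaded = event_time + timedelta(hours=24)
batch_lag = batch_loaded - event_time
print(f"batch freshness lag: {batch_lag}")

# A streaming pipeline shrinks the same gap to well under a second.
stream_loaded = event_time + timedelta(milliseconds=250)
stream_lag = stream_loaded - event_time
print(f"streaming freshness lag: {stream_lag}")
```

Tracking this lag per record (or per pipeline stage) tells you exactly how stale the data feeding your models and dashboards is, and whether that staleness is leaving value on the table.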

Throughput

Most organizations have millions of rows of data in multiple tables across multiple schemas being generated monthly, daily, and sometimes every second. Throughput is the volume of data that can be ingested and processed, whether it comes from sensors in a supply chain or from signals generated during an online shopping session. In many cases, the volume you can process is heavily dependent on the bandwidth of the network passing the data, along with the number of nodes, processors, and memory levels of the databases used to store the data. Speed and throughput go hand in hand, as the ability to push and pull data to and from your database is a limiting factor in how much time it takes to access and use that data.

When you think of throughput, you must think about data traveling from start to end. Data travels not only from your database to your models but also from your production systems to your applications. Your customers, applications, and analysts may be able to run complex queries and generate insights from data without impacting performance, but what happens when your company suddenly sends an email blast and a large burst of behavioral data flows in from your website? Can your current system handle these larger volumes of records and events being generated simultaneously?

Another issue related to throughput is change. Can your database rapidly update existing fields while simultaneously being able to add new fields? Can it add those new fields without needing to create an entirely new schema? Businesses change over time, and data needs to be flexible. Your real-time databases need to be able to adapt instantly to changes in business processes and procedures to ensure that the data that is powering your applications and decisions is accurate and complete.
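One concrete lever on ingest throughput is batching: inserting rows one at a time pays per-statement overhead that a single batched insert avoids. The sketch below is a simplified illustration using an in-memory SQLite table (the event data is hypothetical); real analytic databases have their own bulk-load paths, but the principle is the same.

```python
import sqlite3
import time

# Hypothetical clickstream events to ingest.
rows = [(f"session-{i}", "click") for i in range(5_000)]

def insert_one_by_one(rows):
    """Ingest rows with one INSERT statement per row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (session TEXT, kind TEXT)")
    start = time.perf_counter()
    for row in rows:
        conn.execute("INSERT INTO events VALUES (?, ?)", row)
    conn.commit()
    return time.perf_counter() - start

def insert_batched(rows):
    """Ingest the same rows with a single batched executemany call."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (session TEXT, kind TEXT)")
    start = time.perf_counter()
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    conn.commit()
    return time.perf_counter() - start

single = insert_one_by_one(rows)
batched = insert_batched(rows)
print(f"row-at-a-time: {single:.4f}s, batched: {batched:.4f}s")
```

On most systems the batched path is noticeably faster, which is why streaming ingestion layers typically buffer incoming events and write them in micro-batches rather than record by record.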

Concurrency

The final requirement is concurrent access. Concurrency is the number of tasks your analytical system can perform simultaneously. It’s also about ensuring that your database isn’t blocked from updating, inserting, or executing queries while other tasks are performed. How many different actions can your database manage at one time? If you have production systems adding data regularly, is that restricting the number of users able to log in to your application or the speed at which they can pull insights from that data for analysis? A limit on the number of concurrent actions is a limit on the number of users you can serve at one time. The data in your databases may be up to date and transfer quickly, but if your customers or analysts are waiting for the production systems to finish returning a query, will you deliver a substandard experience?

Concurrency is also very susceptible to data growth. As mentioned earlier, your organization’s data is always growing. The data stored from previous years isn’t going away, and more data is added each second. Additional data means more frequent and longer reads and writes to the database. Will your production systems and database(s) start reaching a limit on how much can happen simultaneously? If more tasks are running at once than your system can handle, then one of them will inevitably fail, or at the very least you will deliver a bad experience to your end users.
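The benefit of concurrency is easiest to see by comparing sequential and concurrent execution of the same workload. In this simplified sketch, each "query" just sleeps to stand in for I/O-bound database work; the names and timings are hypothetical. Run concurrently, total wall time stays close to a single query's latency rather than the sum of all of them.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(name, seconds=0.1):
    """Stand-in for an I/O-bound query: just wait, then return a result."""
    time.sleep(seconds)
    return f"{name}: done"

queries = [f"report-{i}" for i in range(8)]

# Sequential: each user waits for everyone ahead of them.
start = time.perf_counter()
sequential = [run_query(q) for q in queries]
sequential_time = time.perf_counter() - start

# Concurrent: eight queries in flight at once.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    concurrent = list(pool.map(run_query, queries))
concurrent_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

The same comparison applies at the database level: a system that can serve only a few queries at a time forces everyone else into a queue, which is exactly the substandard experience described above.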

The combination of latency, freshness, throughput, and concurrency forms the basis of real-time analytics. We see how they go hand in hand with each other and that any failure in one area could easily cascade into another. For this reason, we need to ensure that each is working as effectively as possible. What steps can companies take to avoid these issues? How do companies mitigate risks to these requirements and maintain a true real-time system for their analytics? The next chapter explores some ways that companies might successfully improve performance and the costs to do so.
