Chapter 4. How Do You Analyze Infinite Data Sets?

Infinite data sets raise important questions about how to do certain operations when you don’t have all the data and never will. In particular, what do classic SQL operations like GROUP BY and JOIN mean in this context? What about statistics like min, max, and average?

A theory of streaming semantics has emerged that provides the answer. Central to this theory is the idea that aggregation operations like these make sense only in the context of windows of data, often over a fixed range of time.

Apache Beam, an open source project that grew out of Google Dataflow, arguably offers the most sophisticated formulation of these semantics. It has become the gold standard for defining how precise analytics should be performed in real-world streaming scenarios. Although Beam is not as widely used as the other streaming engines discussed in this report, their designs were strongly influenced by Beam.

To actually use Beam, a third-party “runner” is required to execute Beam data flows. In the open source world, this functionality has been implemented for Flink and Spark, while Google’s own runner is its cloud service, Cloud Dataflow. This means you can write Beam data flows and run them with these other tools. Not all constructs defined by Beam are supported by all runners. The Beam documentation has a Capability Matrix that shows what features each runner supports. There are even semantics defined that Beam and Google Cloud Dataflow themselves don’t yet support!

Usually, when a runner supports a Beam construct, the runner also provides access to the feature in its “native” API. So, if you don’t need runner portability, you might use that API instead.

For space reasons, I can only provide a sketch of these advanced semantics here, but I believe that every engineer working on streaming pipelines should take the time to understand them in depth. Tyler Akidau, the leader of the Beam/Dataflow team, has written two O’Reilly Radar blog posts, “The world beyond batch: Streaming 101” and “The world beyond batch: Streaming 102,” and co-wrote the O’Reilly book Streaming Systems, explaining these details.

If you follow no other links in this report, at least read those two blog posts!

Streaming Semantics

Suppose we set out to build our own streaming engine. We might start by implementing two “modes” of processing, to cover a large fraction of possible scenarios: single-record processing and what I’ll call “aggregation processing” over many records, including summary statistics and operations like GROUP BY and JOIN queries.

Single-record processing is the simplest case to support, where we process each record individually. We might trigger an alarm on an error record, or filter out records that aren’t interesting, or transform each record into a more useful format. Lots of ETL (extract, transform, and load) scenarios fall into this category. All of these actions can be performed one record at a time, although for efficiency reasons we might do small batches of them at a time. This simple processing model is supported by all the low-latency tools introduced in Chapter 2 and discussed in more depth shortly.
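For example, here is a minimal sketch of this one-record-at-a-time style using Akka Streams, one of the tools discussed later in this chapter. The LogRecord type and the hardcoded records are placeholders I made up for the example; in a real pipeline the source would be something like a Kafka topic:

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

// Hypothetical record type, just for illustration.
final case class LogRecord(level: String, message: String)

object SingleRecordEtl extends App {
  // Akka 2.6+ can materialize streams from an implicit ActorSystem.
  implicit val system: ActorSystem = ActorSystem("etl")

  // In a real pipeline this source would be a Kafka topic (e.g., via Alpakka).
  val records = Source(List(
    LogRecord("INFO",  "user logged in"),
    LogRecord("ERROR", "payment failed"),
    LogRecord("DEBUG", "cache hit")
  ))

  records
    .filter(_.level != "DEBUG")                 // drop records that aren't interesting
    .map(r => s"${r.level}: ${r.message}")      // transform each record into a more useful format
    .runWith(Sink.foreach(println))             // act on each record individually (e.g., raise an alarm)
}
```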

The next level of sophistication is aggregations of records. Because the stream may be infinite, the simplest approach is to trigger a computation over a fixed window,1 usually defined by a start and end time, in which case the actual number of records can vary from one window to the next. Another option is to use windows with a fixed number of records.

Time windows are more common, because they are usually tied to a business requirement, such as counting the number of transactions per minute. Suppose we implement this requirement, along with a second requirement to segregate the transactions by geographic region. We collect the data as it comes in, and at the end of each one-minute window we perform these tasks on the accumulated data while the next minute’s data accumulates. This fixed-window approach is the core construct in Spark’s original Streaming model, based on mini-batch processing. It was a clever way to adapt Spark, which started life as a batch-mode tool, to provide a stream processing capability.
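As an illustration, here is a sketch of this fixed-window, processing-time approach written against the original Spark Streaming (DStream) API, which we will revisit later. The socket source and the “region,amount” record format are assumptions made for the example:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object TransactionsPerMinuteByRegion extends App {
  val conf = new SparkConf().setAppName("tx-per-minute").setMaster("local[*]")
  val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second mini-batch interval

  // Hypothetical input: one "region,amount" line per transaction.
  val transactions = ssc.socketTextStream("localhost", 9999)

  val countsByRegion = transactions
    .map(line => (line.split(",")(0), 1L))                 // key each transaction by its region
    .reduceByKeyAndWindow(_ + _, Minutes(1), Minutes(1))   // count per region over 1-minute tumbling windows

  countsByRegion.print()   // results are emitted on processing-time boundaries, once per window
  ssc.start()
  ssc.awaitTermination()
}
```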

This notion of windows also applies to SQL extensions for streaming, which I mentioned in Chapter 2.

But the fixed-window model has challenges. Time is a first-class concern in streaming. We almost always want to know when things happened and the order in which they happened, especially if state transitions are involved.

There is a difference between event time, when something happened on a particular server, and processing time, some later time when the record that represents the event is processed, perhaps on a different server. Processing time is easiest to handle; for our example, we just trigger the computation once a minute on the processing server. However, event time is almost always what we need. Most likely, we need an accurate count of transactions per minute per region for accounting reasons. If we’re tracking potential fraudulent activity, we may need accuracy for legal reasons. So a fixed-window model based on processing time alone isn’t sufficient.

Another complication is the fact that clocks across servers will never be completely synchronized. The widely used Network Time Protocol for clock synchronization is accurate to a few milliseconds, at best. Precision Time Protocol achieves submicrosecond accuracy, but it is not as widely deployed. If I’m trying to reconstruct a sequence of events based on event times for activity that spans servers, I have to account for the inevitable clock skew between the servers. Still, it may be sufficient to accept the timestamp assigned to an event by the server where it was recorded.

Another implication of the difference between event and processing time is that I can’t really know whether my processing system has received all of the records for a particular window of event time. Arrival delays could be significant, due to network delays or partitions, servers that crash and then reboot, mobile devices that leave and then rejoin the network, and so on. Even under normal operations, data does not travel instantaneously between systems.

Because of inevitable latencies, however small, the actual records in the mini-batch will include stragglers from the previous window of event time. After a network partition is resolved, events from the other side of the partition could be significantly late, perhaps many windows of time late.

Figure 4-1 illustrates event versus processing times.

Figure 4-1. Event time versus processing time

Records generated on Servers 1 and 2 at particular event times may propagate to the Analysis server at different rates and traverse different distances. Most records will arrive in the same window in which they occurred, while others will arrive in subsequent windows. During each minute, events that arrive are accumulated. At minute boundaries, the analysis task for the previous minute’s events is invoked (represented by the right-facing arrowhead), but clearly some data will always be missing.

So, we really should delay processing until we have all the records for a given event-time window or decide when we can’t wait any longer. This impacts correctness, too. For stream processing to replace batch-mode processing, we require the same correctness guarantees. Our transactions-per-minute calculation should have the same accuracy whether we do it incrementally over the incoming records or compute it tomorrow in a batch job that analyzes today’s activity.

To balance these trade-offs, additional concepts are needed. We’ll keep finite windows, but we’ll add watermarks and triggers to decide when processing should proceed, and we’ll need to clarify how updates to previous results are handled (a code sketch follows these definitions):

Watermarks

A watermark indicates that all events up to a certain point in event time, for a window or other context, have been received, so it’s now safe to compute the final result for that context.

Triggers

A more general tool for invoking processing, whether triggered by the arrival of a watermark, a timer firing, or a threshold being reached. (Previously, we simply started processing when the window’s end time was reached.) The results may be incomplete for a given context if more data will arrive later.

Accumulation modes

How should we treat multiple results computed for the same window, for example when a trigger fires more than once? Are later results disjoint from earlier ones, or do they overlap? If they overlap, do later results replace earlier ones or are they merged somehow?
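To make these ideas concrete, here is a minimal sketch of how they surface in Spark’s Structured Streaming API (covered later in this chapter): the watermark bounds how late records may arrive, the default micro-batch trigger decides when results are emitted, and the “update” output mode plays the role of an accumulation mode by re-emitting revised counts when late records arrive. The Kafka topic and record format are assumptions made for the example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object EventTimeCountsByRegion extends App {
  val spark = SparkSession.builder.appName("event-time-counts").master("local[*]").getOrCreate()
  import spark.implicits._

  // Hypothetical source: a Kafka topic whose record values are "timestamp,region" strings.
  val transactions = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions")
    .load()
    .selectExpr("CAST(value AS STRING) AS line")
    .selectExpr(
      "CAST(split(line, ',')[0] AS TIMESTAMP) AS eventTime",
      "split(line, ',')[1] AS region")

  val countsByRegion = transactions
    .withWatermark("eventTime", "10 minutes")               // tolerate records up to 10 minutes late
    .groupBy(window($"eventTime", "1 minute"), $"region")   // 1-minute event-time windows, per region
    .count()

  countsByRegion.writeStream
    .outputMode("update")    // re-emit corrected counts as late records arrive
    .format("console")
    .start()
    .awaitTermination()
}
```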

It’s great when we have a watermark that tells us we’ve seen everything for a context. If we haven’t received one, we may need to trigger the calculation because downstream consumers can’t wait any longer for the result, even if it’s preliminary. We still need a plan for handling records that arrive too late.

One possibility is to structure our analysis so that everyone knows we’re computing a preliminary result when a trigger occurs, then accepts that an improved result will be forthcoming, if and when additional late records arrive. A dashboard might show a result that is approximate at first and grows increasingly correct over time. It’s probably better to show something early on, as long as it’s clear that the results aren’t final.

However, suppose instead that these point-in-time analytics are written to yet another stream of data that feeds a billing system. Simply overwriting a previous value isn’t an option. Accountants never modify an entry in bookkeeping. Instead, they add a second entry that corrects the earlier mistake. Similarly, our streaming tools need to support a way of correcting results, such as issuing retractions, followed by new values.

Recall that we had a second task to do: segregate transactions by geographic region. As long as our persistent storage supports incremental updates, this second task is less problematic; we just add late-arriving records as they arrive. Hence, scenarios that don’t have an explicit time component are easier to handle.

We’ve just scratched the surface of the important considerations required for processing infinite streams in a timely fashion, while preserving correctness, robustness, and durability.

Now that you understand some of the important concepts required for effective stream processing, let’s examine several of the currently available tools and how they support these concepts, as well as other potential requirements.

Which Streaming Engines Should You Use?

How do the streaming engines available today stand up? Which ones should you use?

There are five streaming engines shown in Figure 2-1. We discussed Beam already, because of its foundational role in defining the semantics of streaming. The other four cover a spectrum of features and ideal use cases. Each has its own vibrant community and ecosystem of other tools to leverage.

Criteria for Evaluating Streaming Engines

For a given application, the following criteria should be considered:

What is the latency budget?

If you must process records within milliseconds, this will constrain your options. If your budget is minutes to hours, you have a lot of options; you could even run frequent, short batch jobs!

What is the volume of data to be processed per unit time?

A streaming job may ultimately process petabytes of data, but if this happens one record at a time over a very long time, then almost anything can be made to work. If you’re processing the Twitter firehose, then you’ll need tools that provide lots of scalability through parallelism.

Which kinds of data processing will you do?

Simple record transformations and filtering, like many ETL tasks, can be done with many tools. You’ll need sophisticated tools if you are doing complex GROUP BY and JOIN operations over several streams and historical data. If you need the precise semantics discussed previously in this chapter, that will also constrain your choices.

How strong and dynamic is the community around the project?

Is the project evolving quickly? Are there lots of people contributing enhancements and add-on libraries? Are quality standards high? Does it appear that the project will have a long and healthy life?

How do you build, deploy, and manage services?

If your organization already has a mature CI/CD (continuous integration/continuous delivery) process for microservices, it would be very nice to continue using those tools for stream processing as well. If you are accustomed to running and using a diverse set of tools, then adding new ones will be less of a challenge.

Let’s explore the four tools based on these criteria.

Spark and Flink: Scalable Data Processing Systems

Spark and Flink are examples of systems that automatically manage a group of processes across a cluster, partitioning and distributing your data to those processes. They handle high-volume scenarios very well and can also operate at relatively low latencies. They offer rich options for stream processing, from simple transformations to subsets of the Beam semantics, exposed either as data flows or as SQL queries.

Both support several application deployment options, and both are integrated with the popular resource managers YARN, Kubernetes, and Mesos, which are used to obtain resources on demand. In the most common scenario, you submit a job to a master service, which decides how to partition the data set to be processed and then schedules tasks (run in JVM processes) to do the work of the job.

All of this work behind the scenes is done for you, as long as your job fits this runtime model. The APIs are reasonably intuitive and promote concise code. Today, they come in two basic forms:

Data flows

A programming model based on sequencing transformations on collections. For Spark in particular, this API was heavily inspired by the Scala language collections API, which also inspired recent changes in Java’s collections API. For problems that are naturally described as a sequence of processing steps, forming a DAG (directed acyclic graph), this approach is very flexible and powerful.

SQL

Recall that SQL is a declarative language: you specify the relational constraints, and the query engine determines how to select and transform the data to satisfy them. We saw earlier that SQL with extensions for streaming has become a popular way to write these applications, because it is concise and easier for nonprogrammers to use.

Both models produce new streams, even SQL. In batch-mode SQL, data at rest is processed to return a new, fixed data set. In the streaming context, the input stream is transformed into an output stream matching the criteria of the query.

Both models require window primitives in order for them to work on infinite streams of data. Beam’s concepts for windowing were influential in the designs of Flink and Spark.
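To show what the SQL form can look like, here is a self-contained sketch that registers a streaming DataFrame as a view in Spark and queries it with the same window primitive used by the data flow API. The built-in “rate” source and the made-up region column are assumptions chosen so the example runs on its own:

```scala
import org.apache.spark.sql.SparkSession

object StreamingSqlExample extends App {
  val spark = SparkSession.builder.appName("streaming-sql").master("local[*]").getOrCreate()

  // The built-in "rate" source generates (timestamp, value) rows; we fake a region from the value.
  val events = spark.readStream
    .format("rate")
    .option("rowsPerSecond", "5")
    .load()
    .selectExpr("timestamp AS eventTime",
                "CASE WHEN value % 2 = 0 THEN 'NA' ELSE 'EU' END AS region")

  events.createOrReplaceTempView("transactions")

  // The same windowed aggregation as the data flow version, expressed in SQL.
  val countsByRegion = spark.sql("""
    SELECT region, window(eventTime, '1 minute') AS win, COUNT(*) AS txCount
    FROM transactions
    GROUP BY region, window(eventTime, '1 minute')
  """)

  // The result is itself a stream, written here to the console for illustration.
  countsByRegion.writeStream
    .outputMode("complete")
    .format("console")
    .start()
    .awaitTermination()
}
```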

Spark and Flink have different histories. Spark started as a batch-mode system, then added stream processing later. Flink started as a streaming engine first, with batch support added later.

The initial Spark Streaming system is based on mini-batch processing, with windows (called intervals) of approximately 100 milliseconds or more, and it supports only processing-time windows. It is built on Spark’s original storage model, RDDs (Resilient Distributed Datasets), and provides the data flow style mentioned previously, inspired by the Scala collections API, with Scala, Java, Python, and R versions. Spark Streaming is now effectively deprecated, replaced by a new streaming system.

The new system, Structured Streaming, introduced in Spark version 2.0, is based on SparkSQL, Spark’s declarative SQL system, with an idiomatic API in Scala, Java, Python, and R, as well as SQL itself. The declarative nature of SQL has provided the separation between user code and implementation that has allowed the Spark project to focus on aggressive internal optimizations, as well as support for advanced streaming semantics, like event-time processing. This is now the standard streaming API for Spark. A major goal of the Spark project is to promote so-called continuous applications, where the division between batch and stream processing is further reduced. In parallel, a new low-latency streaming engine has been developed to break through the 100-millisecond barrier of the original engine.

Spark also includes a machine learning library, MLlib, and many third-party libraries integrate with Spark.

Because Flink has always excelled at low-latency stream processing, it has maintained a lead over Spark in supporting advanced semantics. Specifically, it supports more of the semantics defined by Beam than Spark does, as described in the Beam Capability Matrix, and it offers the more mature Beam runner of the two systems.

Flink also provides a data-flow-style API in Scala and Java, as well as SQL. Its integrated machine learning support is less mature than Spark’s.
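For a taste of Flink’s data-flow style, here is the transactions-per-minute-per-region example again, written roughly against the Flink 1.x Scala DataStream API. The socket source, the record format, and the simplistic ascending-timestamp watermarking are assumptions, and the exact API details vary across Flink versions:

```scala
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object FlinkTxPerMinuteByRegion extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

  // Hypothetical input: "epochMillis,region" lines from a socket.
  val counts = env
    .socketTextStream("localhost", 9999)
    .map { line =>
      val Array(ts, region) = line.split(",")
      (ts.toLong, region, 1L)
    }
    .assignAscendingTimestamps(_._1)   // naive watermarking, fine for a sketch
    .keyBy(_._2)                       // group by region
    .timeWindow(Time.minutes(1))       // 1-minute event-time windows
    .sum(2)                            // count transactions per window

  counts.print()
  env.execute("transactions per minute per region")
}
```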

To conclude, Spark and Flink are comparable on the list of criteria we defined. Both provide rich processing options, good low-latency and very good high-volume performance, and automatic data partitioning and task management. Both have vibrant and growing communities and tool ecosystems, and both are high-quality projects. Spark’s community is much larger at this time, while Flink is more mature for stream processing.

Akka Streams and Kafka Streams: Data-Centric Microservices

Akka Streams and Kafka Streams are libraries that you integrate into applications, as opposed to standalone services like Spark and Flink. They are designed for the sweet spot of building microservices with integrated data processing, versus using separate services focused exclusively on data analytics. There are several advantages to this approach:

  • You would rather use the same microservice-based development and production workflow for all services, versus supporting a more heterogeneous environment.

  • You want more control over how applications are deployed and managed.

  • You want more flexibility in how different domain logic is integrated together—for example, incorporating machine learning model training or serving into other microservices, rather than deploying them separately.

If you run separate data analytics jobs (for example, Spark jobs) and separate microservices that use the results, you’ll need to exchange the results between the services (for example, via a Kafka topic). This is a great pattern, of course, as I argued in Chapter 3, but I also pointed out that sometimes you don’t want the extra overhead of going through Kafka. Nothing is faster, or has lower overhead, than making a function call in the same process, which you can do if your data transformation is done by a library rather than a separate service! Also, a drawback of microservices is the increase in the number of different services you have to manage; you may want to minimize that number.

Because Akka Streams and Kafka Streams are libraries, the drawback of using them is that you have to handle some details yourself that Spark or Flink would do for you, like automatic task management and data partitioning. Now you really need to write all the logic for creating a runnable process and actually running it.

Despite these drawbacks, because Kafka Streams runs as a library on top of the Kafka cluster, it requires no additional cluster setup and only modest additional management overhead.

You can use Akka Streams in a similar way, too, although you can leverage Akka Cluster for building highly distributed, resilient applications, when needed. The Akka ecosystem of which Akka Streams is a part—combined with the other Lightbend Reactive Platform projects, Play and Lagom—provides a full spectrum of tools and libraries for microservice development. While Kafka Streams is focused on reading and writing Kafka topics, the connector library Alpakka makes it relatively easy to connect different data sources and sinks to Akka Streams.

Both streaming libraries provide single-event processing with very low latency and high throughput. When exchanging data over a Kafka topic, you do need to carefully watch consumer lag (i.e., queue depth), which is a source of latency.

Kafka Streams doesn’t attempt to provide a full spectrum of microservice tools, so it’s common to embed Kafka Streams in applications built with other tools, like Lightbend Reactive Platform and the Spring Framework.

Akka Streams also implements the Reactive Streams specification. This is a simple, yet powerful, standard for defining composable streams with built-in back pressure, a flow control mechanism. If a single stream has built-in flow control, such that it can control how much data is pushed into it to avoid data loss, then this back pressure composes when these streams are joined together, either in the same process or across process boundaries. This keeps the internal stream “segments” robust against data loss and allows strategic decisions to be made at the entry points for the assembly, where it is more likely that a good strategy can be defined and implemented. Hence, reactive streams are a very robust mechanism for reliable processing. If you don’t use a Kafka topic as a giant buffer between microservices, you really need a back pressure mechanism like this!
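To see what back pressure buys you, consider this minimal Akka Streams sketch; the numbers and the throttled stage standing in for a slow consumer are arbitrary choices for the demo. The fast upstream is automatically slowed to the rate the downstream can handle, with no unbounded buffering and no data loss:

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

import scala.concurrent.duration._

object BackPressureDemo extends App {
  implicit val system: ActorSystem = ActorSystem("backpressure-demo")

  Source(1 to 1000000)                                   // a fast upstream with plenty of data ready
    .map { n => println(s"emitted  $n"); n }             // runs only as fast as downstream demand allows
    .throttle(10, 1.second)                              // a slow stage, standing in for a slow consumer
    .runWith(Sink.foreach(n => println(s"consumed $n"))) // both lines appear at roughly 10 per second
}
```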

Neither library does automatic data partitioning and dynamic task management associated with these partitions, so they are not as well suited for high-volume scenarios as Spark and Flink. Nothing stops you from implementing this yourself, of course, but good luck…

When you are comparing these two libraries, perhaps the most important thing to understand is that each emerged out of a different community, which has influenced the features it offers.

It’s useful to know that Kafka Streams was really designed with data processing in mind, while Akka Streams emerged out of the microservices focus of Akka. Compared to Flink and Spark, neither library is as advanced in the Beam-style streaming semantics, but Kafka Streams is very good at supporting many of the most common data processing scenarios, such as GROUP BY and JOIN operations and aggregations. When you want to see every record in a stream, which is the usual way of using Kafka topics, Kafka Streams offers the KStream abstraction. When you just want to see the latest value for a key—say, a statistic computed over data in a stream—the KTable abstraction, analogous to a database table, makes this easy. There is even a SQL query capability that lets you query the state of a running Kafka Streams application. In contrast to Spark and Flink, however, SQL is not offered as an API for writing applications.
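For a flavor of this API, here is a sketch of the counts-per-region example using the Kafka Streams Scala DSL. The topic names and the assumption that each record’s value is a region name are mine, and the Serdes import path varies across Kafka versions:

```scala
import java.util.Properties

import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder

object KafkaStreamsTxCounts extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "tx-counts-by-region")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")

  val builder = new StreamsBuilder()

  // KStream: every record in the topic; here each value is assumed to be a region name.
  val transactions = builder.stream[String, String]("transactions")

  // KTable: the latest count per region, updated continuously as records arrive.
  val countsByRegion = transactions
    .groupBy((_, region) => region)
    .count()

  // Publish the table's changes as a stream of updates to another topic.
  countsByRegion.toStream.to("transaction-counts-by-region")

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}
```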

Compared to Kafka Streams, Akka Streams is more microservice oriented and less data-analytics oriented. You can still do many operations like filtering and transforming, as well as aggregations like grouping and joining, but the latter are not as feature-complete as comparable SQL operations. Of the five systems we’re discussing here, Akka Streams and Beam are the only ones that don’t offer a SQL abstraction (although Beam’s semantics fit the requirements for streaming SQL very well). Similarly, Akka Streams doesn’t implement many of the Beam semantics that the other engines implement for handling late arrival of data, grouping by event time, and so on.

So why discuss Akka Streams, then? Because it is unparalleled for defining complex event processing (CEP) scenarios, such as tracking life cycle state transitions, triggering alarms on certain events, and building a wide range of microservices. Of the five engines we’re discussing, only Akka Streams supports feedback loops in the data flow graph, so you’re not restricted to DAGs. If you think of typical data processing as one end of a spectrum and general microservice capabilities as the other end, then Akka Streams sits closer to the general microservice end, while Kafka Streams sits closer to the data processing end.
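As a taste of this kind of event processing, here is a minimal Akka Streams sketch that detects life cycle state transitions per device and could raise an alarm on particular transitions. The DeviceEvent type, the sample events, and the rule that “failed” is alarm-worthy are assumptions made for the example:

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

final case class DeviceEvent(deviceId: String, state: String)
final case class Transition(deviceId: String, from: String, to: String)

object StateTransitionDetector extends App {
  implicit val system: ActorSystem = ActorSystem("cep")

  // In a real service this would come from Kafka (via Alpakka) or an HTTP endpoint.
  val events = Source(List(
    DeviceEvent("d1", "starting"),
    DeviceEvent("d1", "running"),
    DeviceEvent("d2", "running"),
    DeviceEvent("d1", "failed")
  ))

  events
    .statefulMapConcat { () =>
      var lastState = Map.empty[String, String]   // per-device state kept inside the stage
      event => {
        val previous = lastState.get(event.deviceId)
        lastState += event.deviceId -> event.state
        previous match {
          case Some(from) if from != event.state => List(Transition(event.deviceId, from, event.state))
          case _                                 => Nil
        }
      }
    }
    .runWith(Sink.foreach { t =>
      if (t.to == "failed") println(s"ALARM: ${t.deviceId} went from ${t.from} to ${t.to}")
      else println(s"transition: $t")
    })
}
```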

Okay, So What Should I Use?

I hesitated to include four streaming engines, plus Beam. I was concerned about the paradox of choice: with too many options, you might panic at the thought of making the wrong choice. The reason I included so many is that a real environment may need several of them, and each one brings unique strengths as well as weaknesses. There is certainly enough overlap that you won’t need all of them, and you shouldn’t take on the burden of learning and using more than a few. Two will probably suffice—for example, one system like Spark or Flink and one microservice library like Akka Streams or Kafka Streams.

If you already use Spark for interactive SQL queries, machine learning, and batch jobs, then Spark’s Structured Streaming is a good choice that covers a wide class of problems and lets you share logic between streaming and batch applications. To paraphrase an old saying about IBM, no one gets fired for choosing Spark.

If you need the sophisticated semantics and low latency provided by Beam, then use Flink with the Beam API or Flink’s own API. Choose Flink if you really are focused on streaming and batch processing is not a major requirement.

Since we assume that Kafka is a given, use Kafka Streams for low-overhead processing that doesn’t require the sophistication or flexibility provided by the other tools. The management overhead is minimal, once Kafka is running. ETL tasks like filtering, transformation, and many aggregation tasks are ideal for Kafka Streams. Use it for Kafka-centric microservices that process asynchronous data streams.

Are you coming to streaming data from a microservice background? Do you want one full-featured toolbox with the widest spectrum of options? Use Akka Streams to implement your general-purpose and data-processing microservices.

1 Sometimes the word window is used in a slightly different context in streaming engines and the word interval is used instead for the idea we’re discussing here. For example, in Spark Streaming’s mini-batch system, I define a fixed mini-batch interval to capture data, but I process a window of one or more intervals together at a time.
