Chapter 1. The Anatomy of Fast Data Applications
Nowadays, it is becoming the norm for enterprises to move toward creating data-driven business-value streams in order to compete effectively. This requires all related data, created internally or externally, to be available to the right people at the right time, so real value can be extracted in different forms at different stages—for example, reports, insights, and alerts. Capturing data is only the first step. Distributing data to the right places and in the right form within the organization is key for a successful data-driven strategy.
A Basic Application Model
From a high-level perspective, we can observe three main functional areas in Fast Data applications, illustrated in Figure 1-1:
- Data sources: How and where we acquire the data
- Processing engines: How to transform the incoming raw data into valuable assets
- Data sinks: How to connect the results of the stream analytics with other streams or applications
Streaming Data Sources
Streaming data is a potentially infinite sequence of data points, generated by one or many sources, that is continuously collected and delivered to a consumer over a transport (typically, a network).
In a data stream, we discern individual messages that contain records about an interaction. These records could be, for example, a set of measurements of our electricity meter, a description of the clicks on a web page, or still images from a security camera. As we can observe, some of these data sources are distributed, as in the case of electricity meters at each home, while others might be centralized in a particular place, like a web server in a data center.
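To make this concrete, the following Scala sketch models the electricity meter example as a small record type and a potentially infinite, lazily evaluated stream of readings. The field names and the one-reading-per-second cadence are illustrative assumptions, and the stream merely stands in for a real network source such as a message broker topic or a device gateway.

// A minimal sketch of one kind of streaming record: a household electricity
// meter reading. The names are illustrative, not taken from any real system.
final case class MeterReading(meterId: String, timestampMillis: Long, kilowattHours: Double)

// An unbounded, lazily evaluated stream of readings, standing in for a real
// network source.
def readings(start: MeterReading): LazyList[MeterReading] =
  LazyList.iterate(start) { previous =>
    previous.copy(
      timestampMillis = previous.timestampMillis + 1000, // one reading per second (assumed)
      kilowattHours   = previous.kilowattHours + 0.01)   // consumption keeps accumulating
  }

// Consumers only ever see a finite prefix of the potentially infinite stream.
val firstMinute = readings(MeterReading("meter-42", 0L, 0.0)).take(60).toList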
In this report, we will make an abstraction of how the data gets to our processing backend and assume that our stream is available at the point of ingestion. This will enable us to focus on how to process the data and create value out of it.
Stream Properties
We can characterize a stream by the number of messages we receive over a period of time. Called the throughput of the data source, this is an important metric to take into consideration when defining our architecture, as we will see later.
Another important metric often associated with streaming sources is latency. Latency can only be measured between two points in a given application flow. Going back to our electricity meter example, the time it takes for a reading produced by the meter at our home to arrive at the server of the utility provider is the network latency between the edge and the server. When we talk about the latency of a streaming source, we usually mean how long it takes for the data to travel from the actual producer to our collection point. We also talk about processing latency: the time it takes the system to handle a message, from the moment it enters the system until the moment it produces a result.
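As a rough illustration of these two metrics, the following sketch (with illustrative names) computes throughput as messages received per second over an observation window, and latency as the average gap between each message's production and arrival timestamps.

// A message as observed at the collection point: when it was produced at the
// source and when it arrived here. Names are illustrative.
final case class Arrival(producedAtMillis: Long, receivedAtMillis: Long)

// Throughput: messages received per second over the observation window.
def throughputPerSecond(arrivals: Seq[Arrival], windowMillis: Long): Double =
  arrivals.size * 1000.0 / windowMillis

// Latency: average time from production at the source to arrival at the
// collection point, or None when nothing was observed.
def averageLatencyMillis(arrivals: Seq[Arrival]): Option[Double] =
  if (arrivals.isEmpty) None
  else Some(arrivals.map(a => (a.receivedAtMillis - a.producedAtMillis).toDouble).sum / arrivals.size)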
From the perspective of a Fast Data platform, streaming data arrives over the network and is typically terminated by a scalable adaptor that persists the data within the internal infrastructure. This capture process needs to scale to match the throughput of the streaming source, or provide some means of feedback so that the originating party can adapt its data production to the capacity of the receiver. In many distributed scenarios, such adaptation is not possible, as edge devices often assume that the processing backend is always available.
Note
Once the event messages are within the backend infrastructure, a streaming flow-control protocol such as Reactive Streams can provide the bidirectional signaling needed to keep a series of streaming applications working at their optimal load.
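The following sketch illustrates the flow-control idea in its simplest form: a bounded buffer between the capture layer and the processing layer. It is a simplification built on a blocking queue, not the Reactive Streams API itself, and all names are illustrative.

import java.util.concurrent.ArrayBlockingQueue

// A bounded buffer between a producer and a consumer, standing in for
// backpressure-aware signaling between streaming stages.
final class BoundedIngestBuffer[A](capacity: Int) {
  private val queue = new ArrayBlockingQueue[A](capacity)

  // Producer side: `offer` returns false when the buffer is full, which is the
  // signal the upstream party can use to slow down or shed load.
  def tryIngest(message: A): Boolean = queue.offer(message)

  // Consumer side: `take` blocks until a message is available, so the
  // downstream stage naturally pulls at the pace it can sustain.
  def next(): A = queue.take()
}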
The amount of data we can receive is usually limited by how much data we can process and how fast that processing needs to happen to maintain a stable system. This takes us to the next architectural area of interest: processing engines.
Processing Engines
The processing area of our Fast Data architecture is where the business logic gets implemented. This is the component, or set of components, that implements the streaming transformation logic specific to our application requirements and the business goals behind them.
When characterized by the methods used to handle messages, stream processing engines can be classified into two general groups:
- One-at-a-time: These streaming engines process each record individually, which is optimized for latency at the expense of either higher system resource consumption or lower throughput when compared to micro-batch.
- Micro-batch: Instead of processing each record as it arrives, micro-batching engines group messages together following certain criteria. When those criteria are fulfilled, the batch is closed and sent for execution, and all the messages in the batch undergo the same series of transformations. (Both styles are sketched after this list.)
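A minimal sketch of the two styles, assuming nothing about any particular engine's API, could look like this; transform stands in for the per-record business logic.

// One-at-a-time: every record is handled individually, as soon as it arrives.
def oneAtATime[A, B](records: Iterator[A])(transform: A => B): Iterator[B] =
  records.map(transform)

// Micro-batch: records are grouped first (here by a fixed count; engines also
// close batches on a time interval), and each closed batch is then sent
// through the same transformation as a unit.
def microBatch[A, B](records: Iterator[A], batchSize: Int)(transform: A => B): Iterator[Seq[B]] =
  records.grouped(batchSize).map(batch => batch.map(transform))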
Processing engines offer an API and programming model through which requirements can be translated into executable code. They also provide guarantees with regard to data integrity, such as no data loss and recovery from failure. Processing engines implement data-processing semantics that define how each message is handled by the engine:
- At-most-once: Messages are only ever sent to their destination once; they are either received successfully or they are not. At-most-once has the best performance because it forgoes processes such as acknowledgment of message receipt, write-consistency guarantees, and retries, avoiding their overhead and latency at the expense of potential data loss. If the stream can tolerate some loss and requires very low latency at high volume, this may be acceptable.
- At-least-once: Messages are sent to their destination, and an acknowledgment is required so the sender knows the message was received. In the event of failure, the source can retry sending the message, so one or more duplicates may reach the sink. Sink systems can tolerate this by persisting messages in an idempotent way. This is the most common compromise between at-most-once and exactly-once semantics (the pattern is sketched after this list).
- Exactly-once: Messages are sent once and only once, and the sink processes each message only once. Messages also arrive in the order they are sent. While desirable, this type of transactional delivery requires additional coordination, usually at the expense of message throughput.
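The at-least-once pattern paired with an idempotent sink could be sketched as follows; the retry loop, the simulated lost acknowledgments, and all names are illustrative rather than drawn from a specific messaging system.

import scala.collection.mutable
import scala.util.Random

// The sender retries until it sees an acknowledgment, so duplicates can reach
// the sink; the sink stays correct by applying writes idempotently, keyed by a
// message ID.
final case class Message(id: String, payload: String)

final class IdempotentSink {
  private val stored = mutable.Map.empty[String, String]
  // Writing the same message ID twice has no additional effect.
  def write(message: Message): Unit = stored.update(message.id, message.payload)
  def contents: Map[String, String] = stored.toMap
}

def sendAtLeastOnce(message: Message, sink: IdempotentSink, maxAttempts: Int): Boolean = {
  var attempt = 0
  var acknowledged = false
  while (!acknowledged && attempt < maxAttempts) {
    attempt += 1
    sink.write(message)
    // The acknowledgment travels back over an unreliable channel (simulated
    // here); when it is lost we retry, which is what produces duplicates.
    acknowledged = Random.nextDouble() > 0.2
  }
  acknowledged
}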
When we look at how streaming engines process data from a macro perspective, their three main intrinsic characteristics are scalability, sustained performance, and resilience:
- Scalability: If the load increases, we can add more resources (computing power, memory, and storage) to the processing framework to handle it.
- Sustained performance: In contrast to batch workloads that run from launch to finish within a given timeframe, streaming applications need to run 24/7, all year round. An unexpected external situation can trigger a massive increase in the volume of data being processed without any notice. The engine needs to deal gracefully with peak loads and deliver consistent performance over time.
- Resilience: In any physical system, failure is a question of when, not if. In a distributed system, the probability that some machine fails is multiplied across all the machines in the cluster. Streaming frameworks offer recovery mechanisms to resume processing on a different host in case of failure.
Data Sinks
At this point in the architecture, we have captured the data, processed it in different forms, and now we want to create value with it. This exchange point is usually implemented by storage subsystems, such as a (distributed) file system, databases, or (distributed) caches.
For example, we might want to store our raw data as records in “cold storage,” which is large and cheap, but slow to access. On the other hand, our dashboards are consulted by people all over the world, and the data needs to be not only readily accessible, but also replicated to data centers across the globe. As we can see, our choice of storage backend for our Fast Data applications is directly related to the read/write patterns of the specific use cases being implemented.
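As a rough sketch of such a dual read/write pattern, the following code archives every raw record to a cheap cold store while keeping only a per-key aggregate in a fast serving cache; both stores are in-memory stand-ins for real backends, and all names are illustrative.

import scala.collection.mutable

// A raw streaming record and two sinks with very different access patterns.
final case class RawRecord(key: String, value: Double)

val coldStore    = mutable.ArrayBuffer.empty[RawRecord]   // large, cheap, slow to scan
val servingCache = mutable.Map.empty[String, Double]      // small, fast, readily accessible

def sinkBatch(batch: Seq[RawRecord]): Unit = {
  coldStore ++= batch                                      // keep the complete raw history
  batch.groupBy(_.key).foreach { case (key, records) =>
    servingCache(key) = records.map(_.value).sum           // keep only the aggregate hot
  }
}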