Chapter 1. What Is Fast Data?
Into a world dominated by discussions of big data, fast data has been born with little fanfare. Yet fast data will be the agent of change in the information-management industry, as we will show in this report.
Fast data is data in motion, streaming into applications and computing environments from hundreds of thousands to millions of endpoints—mobile devices, sensor networks, financial transactions, stock tick feeds, logs, retail systems, telco call routing and authorization systems, and more. Real-time applications built on top of fast data are changing the game for businesses that are data dependent: telco, financial services, health/medical, energy, and others. It’s also changing the game for developers, who must build applications to handle increasing streams of data.1
We’re all familiar with big data. It’s data at rest: collections of structured and unstructured data, stored in Hadoop and other “data lakes,” awaiting historical analysis. Fast data, by contrast, is streaming data: data in motion. Fast data demands to be dealt with as it streams in to the enterprise in real time. Big data can be dealt with some other time—typically after it’s been stored in a Hadoop data warehouse—and analyzed via batch processing.
A stack is emerging across verticals and industries to help developers build applications to process fast streams of data. This fast data stack has a unique purpose: to process real-time data and output recommendations, analytics, and decisions—transactions—in milliseconds (billing authorization and up-sell of service level, for example, in telecoms), although some fast data use cases can tolerate up to minutes of latency (energy sensor networks, for example).
Applications of Fast Data
Fast data applications share a number of requirements that influence architectural choices. Three of particular interest are:
- Rapid ingestion of millions of data events—streams of live data from multiple endpoints
- Streaming analytics on incoming data
- Per-event transactions made on live streams of data in real time as events arrive.
Ingestion is the first stage in the processing of streaming data. The job of ingestion is to interface with streaming data sources and to accept and transform or normalize incoming data. Ingestion marks the first point at which data can be transacted against, applying key functions and processes to extract value from data—value that includes insight, intelligence, and action.
Developers have two choices for ingestion. The first is to use “direct ingestion,” where a code module hooks directly into the data-generating API, capturing the entire stream at the speed at which the API and the network will run, e.g., at “wire speed.” In this case, the analytic/decision engines have a direct ingestion “adapter.” With some amount of coding, the analytic/decision engines can handle streams of data from an API pipeline without the need to stage or cache any data on disk.
If access to the data-generating API is not available, an alternative is using a message queue, e.g., Kafka. In this case, an ingestion system processes incoming data from the queue. Modern queuing systems handle partitioning, replication, and ordering of data, and can manage backpressure from slower downstream components.
As data is created, it arrives in the enterprise in fast-moving streams. Data in a stream may arrive in many data types and formats. Most often, the data provides information about the process that generated it; this information may be called messages or events. This includes data from new sources, such as sensor data, as well as clickstreams from web servers, machine data, and data from devices, events, transactions, and customer interactions.
The increase in fast data presents the opportunity to perform analytics on data as it streams in, rather than post-facto, after it’s been pushed to a data warehouse for longer-term analysis. The ability to analyze streams of data and make in-transaction decisions on this fresh data is the most compelling vision for designers of data-driven applications.
As analytic platforms mature to produce real-time summary and reporting on incoming data, the speed of analysis exceeds a human operator’s ability to act. To derive value from real-time analytics, one must be able to take action in real time. This means being able to transact against event data as it arrives, using real-time analysis in combination with business logic to make optimal decisions—to detect fraud, alert on unusual events, tune operational tolerances, balance work across expensive resources, suggest personalized responses, or tune automated behavior to real-time customer demand.
At a data-management level, all of these actions mean being able to read and write multiple, related pieces of data together, recording results and decisions. It means being able to transact against each event as it arrives.
High-speed streams of incoming data can add up to massive amounts of data, requiring systems that ensure high availability and at-least-once delivery of events. It is a significant challenge for enterprise developers to create apps not only to ingest and perform analytics on these feeds of data, but also to capture value, via per-event transactions, from them.
Uses of Fast Data
Front End for Hadoop
Building a fast front end for Hadoop is an important use of fast data application development. A fast front end for Hadoop should perform the following functions on fast data: filter, dedupe, aggregate, enrich, and denormalize. Performing these operations on the front end, before data is moved to Hadoop, is much easier to do in a fast data front end than it is to do in batch mode, which is the approach used by Spark Streaming and the Lambda Architecture. Using a fast front end carries almost zero cost in time to do filter, deduce, aggregate, etc., at ingestion, as opposed to doing these operations in a separate batch job or layer. A batch approach would need to clean the data, which would require the data to be stored twice, also introducing latency to the processing of data.
An alternative is to dump everything in HDFS and sort it all out later. This is easy to do at ingestion time, but it’s a big job to sort out later. Filtering at ingestion time also eliminates bad data, data that is too old, and data that is missing values; developers can fill in the values, or remove the data if it doesn’t make sense.
Then there’s aggregation and counting. Some developers maintain it’s difficult to count data at scale, but with an ingestion engine as the fast front end of Hadoop it’s possible to do a tremendous amount of counting and aggregation. If you’ve got a raw stream of data, say 100,000 events per second, developers can filter that data by several orders of magnitude, using counting and aggregations, to produce less data. Counting and aggregations reduce large streams of data and make it manageable to stream data into Hadoop.
Developers also can delay sending aggregates to HDFS to allow for late-arriving events in windows. This is a common problem with other streaming systems—data streams in a few seconds too late to a window that has already been sent to HDFS. A fast data front end allows developers to update aggregates when they come in.
Enriching Streaming Data
Enrichment is another option for a fast data front end for Hadoop.
Streaming data often needs to be filtered, correlated, or enriched before it can be “frozen” in the historical warehouse. Performing this processing in a streaming fashion against the incoming data feed offers several benefits:
Unnecessary latency created by batch ETL processes is eliminated and time-to-analytics is minimized.
Unnecessary disk IO is eliminated from downstream big data systems (which are usually disk-based, not memory-based, when ETL is real time and not batch oriented).
Application-appropriate data reduction at the ingest point eliminates operational expense downstream—less hardware is necessary.
The input data feed in fast data applications is a stream of information. Maintaining stream semantics while processing the events in the stream discretely creates a clean, composable processing model. Accomplishing this requires the ability to act on each input event—a capability distinct from building and processing windows, as is done in traditional CEP systems.
These per-event actions need three capabilities: fast look-ups to enrich each event with metadata; contextual filtering and sessionizing (re-assembly of discrete events into meaningful logical events is very common); and a stream-oriented connection to downstream pipeline systems (e.g., distributed queues like Kafka, OLAP storage, or Hadoop/HDFS clusters). This requires a stateful system fast enough to transact on a per-event basis against unlimited input streams and able to connect the results of that transaction processing to downstream components.
Queries that make a decision on ingest are another example of using fast data front-ends to deliver business value. For example, a click event arrives in an ad-serving system, and we need to know which ad was shown, and analyze the response to the ad. Was the click fraudulent? Was it a robot? Which customer account do we debit because the click came in and it turns out that it wasn’t fraudulent? Using queries that look for certain conditions, we might ask questions such as: “Is this router under attack based on what I know from the last hour?” Another example might deal with SLAs: “Is my SLA being met based on what I know from the last day or two? If so, what is the contractual cost?” In this case, we could populate a dashboard that says SLAs are not being met, and it has cost n in the last week. Other deep analytical queries, such as “How many purple hats were sold on Tuesdays in 2015 when it rained?” are really best served by systems such as Hive or Impala. These types of queries are ad-hoc and may involve scanning lots of data; they’re typically not fast data queries.
1 Where is all this data coming from? We’ve all heard the statement that “data is doubling every two years”—the so-called Moore’s Law of data. And according to the oft-cited EMC Digital Universe Study (2014), which included research and analysis by IDC, this statement is true. The study states that data “will multiply 10-fold between 2013 and 2020—from 4.4 trillion gigabytes to 44 trillion gigabytes”. This data, much of it new, is coming from an increasing number of new sources: people, social, mobile, devices, and sensors. It’s transforming the business landscape, creating a generational shift in how data is used, and a corresponding market opportunity. Applications and services tapping this market opportunity require the ability to process data fast.