Chapter 4. Recipe: Design Data Pipelines

Idea in Brief

Processing big data effectively often requires multiple database engines, each specialized to a purpose. Databases that are very good at event-oriented real-time processing are likely not good at batch analytics against large volumes. Some systems are good for high-velocity problems. Others are good for large-volume problems. However, in most cases, these systems need to interoperate to support meaningful applications.

Minimally, data arriving at the high-velocity, ingest-oriented systems needs to be processed and captured into the volume-oriented systems. In more advanced cases, reports, analytics, and predictive models generated on the volume side need to flow back to the velocity-oriented systems to support real-time applications. Real-time analytics from the velocity side, in turn, need to be integrated into operational dashboards or downstream applications that process real-time alerts, alarms, insights, and trends.

In practice, this means that many big data applications sit on top of a platform of tools. Usually the components of the platform include at least a large shared storage pool (like HDFS), a high-performance BI analytics query tool (like a columnar SQL system), a batch processing system (MapReduce or perhaps Spark), and a streaming system. Data and processing outputs move between all of these systems. Designing that dataflow, the processing pipeline that coordinates these different platform components, is key to solving many big data challenges.

Pattern: Use Streaming Transformations to Avoid ETL

New events being captured into a long-term repository often require transformation, filtering, or processing before they are available for reporting use cases. For example, many applications assemble sessions from several discrete events, enrich events with static dimension data to avoid expensive repeated joins in the batch layer, or deduplicate events so that the dataset stores only unique values.

There are at least two approaches to running these transformations.

  1. All of the data can be landed in a long-term repository and then extracted, transformed, and re-loaded in its final form. This approach trades I/O, additional storage requirements, and longer time to insight (reporting is delayed until the ETL process completes) for a slightly simpler architecture (data likely moves directly from a queue to HDFS). This approach is sometimes referred to as schema-on-read. It narrows the choice of backend systems to those that are relatively schema-free, which may not be optimal depending on your specific reporting requirements.

  2. The transformations can be executed in a streaming fashion before the data reaches the long-term repository. This approach adds a streaming component between the source queue and the final repository, creating a continuous processing pipeline, and it has several advantages. First, write I/O to the backend system is at least halved: in the first model, raw data is written and then the ETL’d data is written again, while in this model only ETL’d data is written. This leaves more I/O budget available for data science and reporting activity, the primary purpose of the backend repository. Second, operational errors become visible in near real time; with ETL, raw data is not inspected until the ETL process runs, which delays notification of missing or corrupt inputs. Finally, time to insight is reduced. For example, when organizing session data, a session is available to participate in batch reporting as soon as it is “closed,” whereas with ETL you must wait, on average, half of the ETL period before the data is available for backend processing.
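
To make the streaming approach concrete, here is a minimal sketch in Python of a transformation stage that sits between the source queue and the long-term repository, enriching each event and dropping duplicates before anything is written. The read_events and write_to_repository callables and the USER_DIMENSIONS table are hypothetical stand-ins for your queue consumer, repository writer, and dimension data; the dedup structure is deliberately simplistic.

    import json

    # Static dimension data, loaded once so the batch layer never has to
    # repeat the join. (Hypothetical table; substitute your dimension source.)
    USER_DIMENSIONS = {
        "u1": {"segment": "premium", "region": "us-east"},
    }

    def transform(events):
        """Enrich and deduplicate events in a streaming fashion."""
        seen_ids = set()  # simplistic; production code would bound or window this
        for event in events:
            if event["event_id"] in seen_ids:
                continue  # drop redundant events before they are ever written
            seen_ids.add(event["event_id"])
            # Enrich with static dimension data to avoid repeated joins later.
            event["user"] = USER_DIMENSIONS.get(event["user_id"], {})
            yield event

    def run(read_events, write_to_repository):
        """Continuous pipeline: queue -> transform -> repository.
        Only transformed data is written, so the write I/O is not paid twice."""
        for event in transform(read_events()):
            write_to_repository(json.dumps(event))

Because only the transformed events reach the repository, the raw-then-reload write cycle of the first approach disappears, which is where the halving of write I/O comes from.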

Pattern: Connect Big Data Analytics to Real-Time Stream Processing

Real-time applications processing incoming events often require analytics from backend systems. For example, an alerting system that issues notifications when activity in a moving five-minute window exceeds historical patterns needs the data describing that historical pattern to be available. Likewise, applications managing real-time customer experience or personalization often use customer segmentation reports generated by statistical analysis run on the batch analytics system. A third common example is hosting OLAP outputs in a fast, scalable query cache to support operators and applications that need high-speed, highly concurrent access to data.
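
As a sketch of the first example, the snippet below keeps a moving five-minute window of event arrivals and flags when the window exceeds a baseline. The HISTORICAL_BASELINE constant is an assumed stand-in for the pattern computed on the batch side; a real deployment would key it by time of day, day of week, and so on.

    import time
    from collections import deque

    WINDOW_SECONDS = 5 * 60

    # Stand-in for the historical pattern computed by the batch system.
    HISTORICAL_BASELINE = 1000  # events per five-minute window

    class MovingWindowAlerter:
        def __init__(self, baseline, window=WINDOW_SECONDS):
            self.baseline = baseline
            self.window = window
            self.arrivals = deque()  # event arrival times inside the window

        def observe(self, now=None):
            """Record one event; return True if the window exceeds the baseline."""
            now = time.time() if now is None else now
            self.arrivals.append(now)
            # Evict arrivals that have slid out of the five-minute window.
            while self.arrivals and self.arrivals[0] <= now - self.window:
                self.arrivals.popleft()
            return len(self.arrivals) > self.baseline

The application calls observe() for each incoming event and issues a notification whenever it returns True; the batch system's job is simply to keep the baseline current.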

In many cases, the reports, analytics, or models from big data analytics need to be made available to real-time applications. Here, information flows from the batch-oriented big data system to the high-velocity streaming system. This introduces a few important requirements. First, the fast data, velocity-oriented application requires a data management system capable of holding the state generated by the batch system; second, this state needs to be regularly updated or replaced in full. There are a few common ways to manage the refresh cycle; the best tradeoff depends on your specific application.

Some applications (for example, applications based on user segmentation) require per-record consistency but can tolerate eventual consistency across records. In these cases, updating state in the velocity-oriented database on a per-record basis is sufficient. Updates will need to communicate new records (in this example, new customers), updates to existing records (customers that have been recategorized), and deletions (ex-customers). Records in the velocity system should be timestamped so that operational monitoring can generate alerts if stale records persist beyond the expected refresh cycle.
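
A minimal sketch of this per-record refresh cycle, using an in-memory dictionary as a stand-in for the velocity-oriented store (the change-record format and REFRESH_INTERVAL are illustrative assumptions):

    import time

    REFRESH_INTERVAL = 24 * 60 * 60  # expected batch refresh cycle, in seconds

    # Stand-in for the velocity store: record id -> (payload, updated_at)
    segments = {}

    def apply_change(change):
        """Apply one record-level change communicated by the batch system."""
        if change["op"] == "delete":              # ex-customers
            segments.pop(change["id"], None)
        else:                                     # new or recategorized customers
            segments[change["id"]] = (change["payload"], time.time())

    def stale_record_ids(now=None):
        """Operational check: ids not refreshed within the expected cycle."""
        now = time.time() if now is None else now
        return [rid for rid, (_, updated_at) in segments.items()
                if now - updated_at > REFRESH_INTERVAL]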

Other applications require the analytics data to be strictly consistent: it is not sufficient for each record to be internally consistent; the set of records as a whole must satisfy a consistency guarantee. These cases are often seen in analytic query caches, which are frequently queried for additional levels of aggregation, aggregations that span multiple records. Producing a correct result therefore requires that the full dataset be consistent. A reasonable approach to transferring this report data from the batch analytics system to the real-time system is to write the data to a shadow table. Once the shadow table is completely written, it can be atomically renamed, or swapped, with the main table addressed by the application. The application will see either only data from the previous version of the report or only data from the new version, never a mix of the two in a single query.
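
The shadow-table swap can be sketched with ordinary SQL renames. The example below uses Python's sqlite3 purely for illustration (the report table and its schema are hypothetical); the renames run inside a single transaction, so any one query sees either the old report or the new one.

    import sqlite3

    # Autocommit mode; the swap transaction is managed explicitly below.
    conn = sqlite3.connect("reports.db", isolation_level=None)
    conn.execute("CREATE TABLE IF NOT EXISTS report (key TEXT, value REAL)")

    def publish_report(rows):
        """Load a new report into a shadow table, then atomically swap it in."""
        conn.execute("DROP TABLE IF EXISTS report_shadow")
        conn.execute("CREATE TABLE report_shadow (key TEXT, value REAL)")
        conn.executemany("INSERT INTO report_shadow VALUES (?, ?)", rows)
        # The swap runs in one transaction: readers see the old report until
        # the commit and only the new report afterwards, never a mix of both.
        conn.execute("BEGIN")
        try:
            conn.execute("ALTER TABLE report RENAME TO report_old")
            conn.execute("ALTER TABLE report_shadow RENAME TO report")
            conn.execute("DROP TABLE report_old")
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")
            raise

    publish_report([("total_sales", 1250000.0)])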

Pattern: Use Loose Coupling to Improve Reliability

When connecting multiple systems, it is imperative that all systems have an independent fate. Any part of the pipeline should be able to fail while leaving other systems available and functional. If the batch backend is offline, the high-velocity front end should still be operating, and vice versa. This requires thinking through several design decisions:

  1. Where does data that cannot be pushed (or pulled) through the pipeline rest? Which components are responsible for the durability of stalled data? (A spooling sketch follows this list.)

  2. How do systems recover? Which systems are systems of record—meaning they can be recovery sources for lost data or interrupted processing?

  3. What is the failure and availability model of each component in the pipeline?

  4. When a component fails, which other components become unavailable? How long can upstream components maintain functionality (for example, how long can they log processed work to disk)? These numbers inform your recovery time objective (RTO).
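
As one concrete answer to question 1, an upstream component can spool stalled work to a durable local file and drain it when the downstream system recovers. This is a minimal sketch; the send_downstream callable, the spool path, and the use of ConnectionError as the failure signal are assumptions:

    import json
    import os

    SPOOL_PATH = "pipeline.spool"  # hypothetical on-disk buffer location

    def deliver(event, send_downstream):
        """Try to push an event downstream; spool it durably on failure so the
        upstream component keeps functioning while the downstream is offline."""
        try:
            send_downstream(event)
        except ConnectionError:
            with open(SPOOL_PATH, "a", encoding="utf-8") as spool:
                spool.write(json.dumps(event) + "\n")
                spool.flush()
                os.fsync(spool.fileno())  # durability: survive a process crash

    def drain_spool(send_downstream):
        """On downstream recovery, replay stalled events in arrival order."""
        if not os.path.exists(SPOOL_PATH):
            return
        with open(SPOOL_PATH, encoding="utf-8") as spool:
            for line in spool:
                send_downstream(json.loads(line))
        os.remove(SPOOL_PATH)

The rate at which the spool grows relative to available disk is exactly the "how long can upstream components maintain functionality" number from question 4, and it feeds directly into your RTO.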

In every pipeline there is, by definition, a slowest component: a bottleneck. When designing a pipeline, explicitly choose the component that will be your bottleneck. If every component has identical performance, a minor degradation in any one of them creates a new overall bottleneck, which is operationally painful. It is often better to make your most reliable component, or your most expensive resource, the deliberate bottleneck; overall, you will achieve a more predictable level of reliability.

When to Avoid Pipelines

Tying multiple systems together is always complex, and this complexity is rarely linear in the number of systems. Typically, complexity grows with the number of connections: in the worst case, N systems can require on the order of N^2 connections. To the extent that your problem can be fully solved by a single stand-alone system, you should prefer that approach. However, large-volume and high-velocity data management problems typically require a combination of specialized systems. When combining these systems, carefully and consciously design the dataflow and connections between them. Loosely couple systems so that each operates independently of the failure of its connected partners. And use the combination to simplify where possible, for example by replacing batch ETL processes with continuous streaming transformations.
