This chapter focuses on the real-life challenges of managing deep and complex data processing pipelines. It considers the frequency continuum that runs from periodic pipelines executing very infrequently to continuous pipelines that never stop running, and discusses the discontinuities that can produce significant operational problems. A fresh take on the leader-follower model is presented as a more reliable and better-scaling alternative to the periodic pipeline for processing Big Data.
The classic approach to data processing is to write a program that reads in data, transforms it in some desired way, and outputs new data. Typically, the program runs under the control of a periodic scheduler such as cron. This design pattern is called a data pipeline. Data pipelines go as far back as co-routines [Con63], the DTSS communication files [Bul80], the UNIX pipe [McI86], and later, ETL pipelines, but such pipelines have gained increased attention with the rise of “Big Data,” or “datasets that are so large and so complex that traditional data processing applications are inadequate.”
Programs that perform periodic or continuous transformations on Big Data are usually referred to as “simple, one-phase pipelines.”
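To make the read, transform, and output pattern described above concrete, here is a minimal sketch of a one-phase pipeline in Python. It is an illustration under stated assumptions, not a reference implementation: the record layout (`user`, `bytes`), the `transform` function, and the `run_pipeline` helper are all hypothetical names invented for this example, and in-memory buffers stand in for real input and output files.

```python
import csv
import io


def transform(row):
    """Hypothetical transformation: normalize the user name, convert bytes to KiB."""
    return {"user": row["user"].strip().lower(), "kb": int(row["bytes"]) / 1024}


def run_pipeline(source, sink):
    """Read every record from source, transform it, and write it to sink.

    Returns the number of records written.
    """
    reader = csv.DictReader(source)
    writer = csv.DictWriter(sink, fieldnames=["user", "kb"])
    writer.writeheader()
    count = 0
    for row in reader:
        writer.writerow(transform(row))
        count += 1
    return count


if __name__ == "__main__":
    # In-memory stand-ins for the input and output files (assumed sample data).
    source = io.StringIO("user,bytes\n Alice ,2048\nBOB,512\n")
    sink = io.StringIO()
    written = run_pipeline(source, sink)
    print(written, "records written")
    print(sink.getvalue())
```

Saved as a script, a program like this could be run periodically by a crontab entry such as `0 2 * * * python3 /path/to/pipeline.py` (the path is a placeholder), which would start it once a day at 02:00.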
Given the scale and processing complexity inherent to Big Data, ...