Chapter 3. Modern Data Pipeline Alternatives

Now that we’ve covered some of the new patterns emerging in the world of data architecture, let’s compare the approaches available for building modern data pipelines by considering three aspects: ingestion, transformation, and the processing platform (data lake or data warehouse).

Criteria for Evaluating Approaches to Data Pipelines

To compare different approaches, we’ll apply the following criteria.

Data source connectors

The variety of modern data sources is a common reason for having multiple pipeline platforms. Options range from tools supporting a multitude of SaaS applications to those supporting a smaller set of general-purpose data infrastructure systems such as databases, raw file systems, and message queues.

Compute capabilities

Batch is the traditional way to process data. Streaming usually requires a dedicated engine such as Spark Streaming or Flink. Processing data with large state has historically been a limitation of stream processing, but advancements in state stores have closed the gap and allowed stream processing to replace batch in an increasing number of use cases. Transformations range from compute-light (for example, data cleansing) to compute-heavy (for example, joining two data streams or aggregating streaming data).
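To make the compute-light versus compute-heavy distinction concrete, here is a minimal sketch in plain Python rather than a real streaming engine. The event shape, the function names `cleanse` and `tumbling_sum`, and the 60-second window are all assumptions chosen for illustration. Cleansing looks at one event at a time and keeps no state, while the aggregation must hold a running total for every (window, user) pair it has seen.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape: (timestamp_seconds, user_id, raw_amount)
RawEvent = Tuple[int, str, str]
CleanEvent = Tuple[int, str, float]


def cleanse(events: Iterable[RawEvent]) -> Iterator[CleanEvent]:
    """Compute-light and stateless: each event is handled on its own."""
    for ts, user, raw in events:
        raw = raw.strip()
        if not raw:
            continue  # drop events with a missing amount
        yield ts, user.lower(), float(raw)


def tumbling_sum(
    events: Iterable[CleanEvent], window_seconds: int = 60
) -> Dict[Tuple[int, str], float]:
    """Compute-heavy and stateful: keeps a running total per (window, user)."""
    state: Dict[Tuple[int, str], float] = defaultdict(float)
    for ts, user, amount in events:
        window_start = ts - ts % window_seconds  # align to the window boundary
        state[(window_start, user)] += amount
    return dict(state)


# Illustrative input only; not data from any real system.
events = [(0, "Ann", " 10.0 "), (30, "ann", "5"), (61, "Bob", ""), (65, "bob", "2.5")]
totals = tumbling_sum(cleanse(events))
```

The `state` dictionary grows with the number of distinct keys and open windows, which hints at why large state has been the harder problem for stream processing. A real engine would also emit results incrementally and deal with late or out-of-order events, which this sketch ignores.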

Orchestration capabilities

Pipelines require orchestration that describes pipeline stages and dependencies as a DAG. The complexity of defining and maintaining DAGs led to the introduction of declarative data pipelines where orchestration ...
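As a rough illustration of what describing stages and dependencies as a DAG involves, the sketch below uses the `graphlib` module from the Python standard library (Python 3.9 or later). The stage names are hypothetical, and this is not how any particular orchestrator is implemented; it only shows that, once dependencies are written down, a valid execution order can be derived from them.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline stages: each key maps to the set of stages it depends on.
dag = {
    "ingest_orders": set(),
    "ingest_users": set(),
    "clean_orders": {"ingest_orders"},
    "join_orders_users": {"clean_orders", "ingest_users"},
    "publish_report": {"join_orders_users"},
}

# static_order() yields every stage after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())

for stage in order:
    print("run", stage)
```

Even in this five-stage example, every new stage means editing the dependency map by hand, and a mistake such as a circular dependency raises `graphlib.CycleError` only when the order is computed.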
