Chapter 1. Data Ingestion
Data ingestion, in essence, involves transferring data from a source to a designated target. Its primary aim is to usher data into an environment primed for staging, processing, analysis, and artificial intelligence/machine learning (AI/ML). While massive organizations may focus on moving data internally (among teams), for most of us, data ingestion emphasizes pulling data from external sources and directing it to in-house targets.
In an era where data holds central importance in both business and product development, the significance of accurate and timely data cannot be overstated. This heightened reliance on data has given rise to a multitude of “sources” from which teams extract information to refine decision making, craft outstanding products, and much more. For instance, a marketing team might need to retrieve data from several advertising and analytics platforms, such as Meta, Google (including Ads and Analytics), Snapchat, LinkedIn, and Mailchimp.
However, as time marches on, APIs and data sources undergo modifications. Columns might be introduced or removed, fields could get renamed, and new versions might replace outdated ones. Handling changes from a single source might be feasible, but what about juggling alterations from multiple sources—five, ten, or even a hundred? The pressing challenge is this: “How can a budding data team efficiently handle these diverse sources in a consistent and expandable way?” As data engineers, how do we ensure our reputation for providing reliable and straightforward data access, especially when every department’s demands are continuously escalating?
Data Ingestion—Now Versus Then
Though the principles of ingestion largely remain the same, much has changed. As the volume, velocity, and variety of data evolve, so too must our methods.
We’ve had multiple industry changes to accommodate this—the move to the cloud, the progression from the warehouse to the data lake to the lakehouse, and the simplification of streaming technologies, to name a few. This has manifested as a shift from extract-transform-load (ETL) workflows to extract-load-transform (ELT), the key difference being that data is now loaded into a target system before it is transformed. We’ll discuss these environments in the context of transformation in Chapter 2.
We refrain from being too pedantic about the terms ETL and ELT; however, we’d like to emphasize that almost every modern data engineering workflow will involve staging almost all data in the cloud. The notable exception is cases where hundreds of trillions of rows of highly granular data are processed daily (e.g., Internet of Things [IoT] or sensor data), where it makes sense to aggregate or discard data before staging.
Despite constant changes, the fundamental truth of extraction is that data is pulled from a source and written to a target. Hence, the discussion around extraction must be centered on precisely that.
Sources and Targets
While most associate ingestion with extraction, it’s also tightly coupled with loading; after all, every source requires a destination. In this guide, we assume that you have an established warehouse or data lake; therefore, storage will not be a primary topic in this chapter. Instead, we’ll highlight both best practices for staging and the hallmarks of ideal storage implementations.
It’s our mission to arm you with a toolkit for architecture design, keeping in mind that a “perfect” solution might not exist. We’ll navigate a framework for appraising sources and untangling the unique knots of data ingestion. Our high-level approach is designed to give you a bird’s-eye view of the landscape, enabling you to make informed, appropriate decisions.
The Source
Our primary consideration for ingesting data is the source and its characteristics. Unless you’re extremely lucky, there will be many sources. Each must be separately assessed to ensure adequate resources and set the criteria for your ingestion solution(s).
With the sheer volume of data sources and the nature of business requirements (we’ve seldom been asked to remove a source, but adding one is just another Thursday), it’s highly likely that you’ll encounter one or many sources that do not fit into a single solution. While building trust takes weeks, months, and years, it can be lost in a day. Reliable, timely ingestion is paramount. So, what’s important when assessing a source?
Examining sources
As a practical guide, we take the approach of presenting time-tested questions that will guide you toward understanding the source data, both its characteristics and how the business will get value.
We recommend taking a highly critical stance: it is always possible that source data is not needed or that a different source would better suit the business. You are your organization’s data expert, and it’s your job to check and double-check assumptions. It’s normal to bias toward action and complexity, but it’s imperative that we consider essentialism and simplicity.
When examining sources, keep in mind that you’ll likely be working with software engineers on upstream data, but downstream considerations are just as important. Neglecting these can be highly costly, since your error may not manifest itself until weeks of work have taken place.
Questions to ask
- Who will we work with?
  In an age of artificial intelligence, we prioritize real intelligence. The most important part of any data pipeline is the people it will serve. Who are the stakeholders involved? What are their primary motives? OKRs (objectives and key results) or organizational mandates can be useful for aligning incentives and moving projects along quickly.
- How will the data be used?
  Closely tied to “who,” how the data will be used should largely guide subsequent decisions. This is a way for us to check our stakeholder requirements and learn the “problem behind the problem” that our stakeholders are trying to solve. We highly recommend a list of technical yes/no requirements to avoid ambiguity.
- What’s the frequency?
  As we’ll discuss in detail later, most practitioners immediately jump to batch versus streaming. Any data can be processed as a batch or a stream; what usually matters is whether the data has characteristics that call for streaming. We advocate first considering if the data is bounded or unbounded—i.e., does it end (for example, the 2020 Census American Community Survey dataset), or is it continuous (for example, log data from a fiber cabinet)?
  After bounds are considered, the minimum frequency available sets a hard limit on how often we can pull from the source. If an API only updates daily, there’s a hard limit on the frequency of your reporting. Bounds, velocity, and business requirements will inform the frequency at which we choose to extract data.
- What is the expected data volume?
  Data volume is no longer a limiter on the ability to store data—after all, “storage is cheap,” and while compute can be costly, it’s less expensive than ever (by a factor of millions; see Figures 1-1 and 1-2). However, volume closely informs how we choose to write and process our data and the scalability of our desired solution.
- What’s the format?
  While we will eventually choose a format for storage, the input format is an important consideration. How is the data being delivered? Is it via a JavaScript Object Notation (JSON) payload over an API? Perhaps an FTP server? If you’re lucky, it already lives in a relational database somewhere. What does the schema look like? Is there a schema? The endless number of data formats keeps our livelihoods interesting, but it also presents a challenge.
- What’s the quality?
  The quality of the dataset will largely determine whether any transformation is necessary. As data engineers, it’s our job to ensure consistent datasets for our users. Data might need to be heavily processed or even enriched from external sources to supplement missing characteristics.

We’ll use these characteristics to answer our final question:

- How will the data be stored?
  As we mentioned, this book assumes some fixed destination for your data. Even then, there are a few key considerations in data storage: to stage or not to stage (is it really a question?), business requirements, and stakeholder fit are the most important.
Source checklist
For every source you encounter, consider these guiding questions. Though it might seem daunting as the number of sources piles up, remember: this isn’t a writing task. It’s a framework to unpack the challenges of each source, helping you sketch out apt solutions that hit your targets.
While it might feel repetitive, this groundwork is a long-term time and resource saver.
| Question | Example |
| --- | --- |
| Who will we collaborate with? | Engineering (Payments) |
| How will the data be used? | Financial reporting and quarterly strategizing |
| Are there multiple sources? | Yes |
| What’s the format? | Semi-structured APIs (Stripe and internal) |
| What’s the frequency? | Hourly |
| What’s the volume? | Approximately 1K new rows/day, with an existing pool of ~100K |
| What processing is required? | Data tidying, such as column renaming, and enrichment from supplementary sources |
| How will the data be stored? | Staged data stored in Delta tables via Databricks |
The Destination
While designing an end-to-end system from scratch raises considerations of its own, we assume most readers will be tasked with building a pipeline into an existing system, where the storage technology is already chosen.
Choosing data storage technology is beyond the scope of this guide, but we will briefly consider destinations, as they are highly important to the total value created by a data system. Thus, when analyzing (or selecting) a destination, we recommend using a similar checklist to that of a source. Usually, there are far fewer destinations than sources, so this should be a much simpler exercise.
Examining destinations
A key differentiator of destinations is that they pivot around stakeholders: they either directly fuel BI, analytics, and AI/ML applications or indirectly power them when holding staged data, not to mention user-facing apps. Though we recommend the same checklist as for sources, we suggest framing it slightly toward the stakeholder to be sure the destination meets their requirements while still working toward engineering goals.
We fully recognize this is not always possible. As an engineer, your role is to craft the most fitting solution, even if it means settling on a middle ground or admitting there’s no clear-cut answer. Despite technology’s incredible strides, certain logical dilemmas do not have straightforward solutions.
Staging ingested data
We advocate for a data lake approach to data ingestion. This entails ingesting most data into cloud object storage, such as Amazon S3, Google Cloud Storage, or Azure Data Lake Storage, before loading it into a data warehouse for analysis.
One step further is a lakehouse—leveraging data storage protocols, like Delta Lake, which use metadata to add performance, reliability, and expanded capability. Lakehouses can even replicate some warehouse functionality; this means avoiding the need to load data to a separate warehouse system. Adding in a data governance layer, like Databricks’ Unity Catalog, can provide better discoverability, access management, and collaboration for all data assets across an organization.
A prevailing and effective practice for staging data is utilizing metadata-centric Parquet-based file formats, including Delta Lake, Apache Iceberg, or Apache Hudi. Grounded in Parquet—a compressed, columnar format designed for large datasets—these formats incorporate a metadata layer, offering features such as time travel, ACID (atomicity, consistency, isolation, and durability) compliance, and more.
Integrating these formats with the medallion architecture, which processes staged data in three distinct quality layers, ensures the preservation of the entire data history. This facilitates adding new columns, retrieving lost data, and backfilling historical data.
The nuances of the medallion architecture will be elaborated upon in our chapter on data transformation (Chapter 2). For the current discussion, it’s pertinent to consider the viability of directing all data to a “staging layer” within your chosen cloud storage provider.
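To make the idea of a staging layer concrete, here is a minimal PySpark sketch (assuming a Spark session configured with the delta-spark package) that lands raw JSON in a Delta table in cloud storage; the bucket, paths, and source layout are hypothetical.

```python
# A minimal sketch of landing raw data in a Delta Lake staging ("bronze") layer.
# Bucket names and paths are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("staging-ingest").getOrCreate()

# Read the raw payloads exactly as delivered (JSON, in this example).
raw = spark.read.json("s3://my-bucket/landing/orders/2024-01-01/")

# Append to a Delta table in the staging area of the lake. Delta adds a
# transaction log on top of Parquet files, enabling ACID writes and time travel.
(raw.write
    .format("delta")
    .mode("append")
    .save("s3://my-bucket/bronze/orders"))
```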
Change data capture
Change data capture (CDC) is a data engineering design pattern that captures and tracks changes in source databases to update downstream systems. Rather than batch-loading entire databases, CDC transfers only the changed data, optimizing both speed and resource usage.
This technique is crucial for real-time analytics and data warehousing, as it ensures that data in the target systems is current and in sync with the source. By enabling incremental updates, CDC enhances data availability and consistency across the data pipeline. Put simply by Joe Reis and Matt Housley in Fundamentals of Data Engineering (O’Reilly, 2022), “CDC…is the process of ingesting changes from a source database system.”
CDC becomes important in analytics and engineering patterns—like creating Slowly Changing Dimension (SCD) type 1 and 2 tables, a process that can be unnecessarily complex and time-consuming. Choosing platforms or solutions that natively support CDC can expedite common tasks, letting you focus on what matters most. One example is Delta Live Tables (DLT) on Databricks, which provide native support for SCD type 1 and 2 in both batch and streaming pipelines.
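As a rough illustration of the CDC pattern (a generic Delta Lake MERGE rather than the DLT API itself), the sketch below applies a batch of captured changes to a target table as an SCD type 1 upsert. The paths, the customer_id key, and the op column describing each change are assumptions.

```python
# A sketch of applying CDC changes as an SCD type 1 upsert with Delta Lake's
# MERGE builder (delta-spark). Paths, keys, and the "op" column are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

changes = spark.read.format("delta").load("s3://my-bucket/bronze/customer_changes")
target = DeltaTable.forPath(spark, "s3://my-bucket/silver/customers")

(target.alias("t")
    .merge(changes.alias("c"), "t.customer_id = c.customer_id")
    .whenMatchedDelete(condition="c.op = 'DELETE'")      # drop rows flagged as deletes
    .whenMatchedUpdateAll(condition="c.op != 'DELETE'")  # overwrite current values (type 1)
    .whenNotMatchedInsertAll(condition="c.op != 'DELETE'")
    .execute())
```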
Destination checklist
Here’s a sample checklist for questions to consider when selecting a destination:
| Question | Example |
| --- | --- |
| Who will we collaborate with? | Human Resources |
| How will the data be used? | Understand how tenure/contract duration affects bottom-line results in a multinational organization |
| Are there multiple destinations? | Data is staged in Delta Lake; final tables are built in Databricks SQL |
| What’s the format? | Parquet in Delta Lake; structured/semi-structured in Databricks SQL |
| What’s the frequency? | Batch |
| What’s the volume? | Approximately 1K new rows/day, with an existing pool of ~100K |
| What processing is required? | Downstream processing using DLT; ML models and generative AI applications in Spark |
| How will the data be stored? | Blend of Databricks SQL and external tables in Delta Lake |
Ingestion Considerations
In this section, we outline pivotal data characteristics. While this list isn’t exhaustive and is influenced by specific contexts, aspects like frequency, volume, format, and processing emerge as primary concerns.
Frequency
We already mentioned that the first consideration in frequency should be bounds, i.e., whether the dataset is bounded or unbounded. Bounds and business needs will dictate a frequency for data ingestion, either in batch or streaming form. Figure 1-3 neatly shows the difference between batch and streaming processes; streaming captures events as they occur, while batch groups them up, as the name would suggest.
We’ll present batch, micro-batch, and streaming along with our thoughts to help you select the most appropriate frequency and a compatible ingestion solution.
Batch
Batch processing is the act of processing data in discrete chunks rather than continuously as it arrives. Like a for loop that iterates over a source, batch processing involves either chunking a bounded dataset and processing each component or periodically processing whatever new data has accumulated from an unbounded source.
Micro-batch
A micro-batch is simply “turning the dial down” on batch processing. If a typical batch ingestion operates daily, a micro-batch might function hourly or even by the minute. Of course, you might say, “At what point is this just semantics? A micro-batch pipeline with 100 ms latency seems a lot like streaming to me.”
We agree!
For the rest of this chapter (and book), we’ll refer to low-latency micro-batch solutions as streaming solutions—the most obvious being Spark Structured Streaming. While “technically” micro-batch, latency in the hundreds of milliseconds makes Spark Structured Streaming effectively a real-time solution.
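To illustrate how thin that line is, here is a small Structured Streaming sketch in which the trigger interval is the only thing separating a leisurely micro-batch from something that feels real-time. The rate source and console sink are built-in test utilities, and the interval shown is arbitrary.

```python
# "Turning the dial down": the trigger interval controls how often each
# micro-batch runs. The rate source and console sink are test utilities.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("microbatch-demo").getOrCreate()

events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (events.writeStream
    .format("console")
    .trigger(processingTime="10 seconds")  # daily would be classic batch; seconds
                                           # of latency starts to feel like streaming
    .start())

query.awaitTermination()
```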
Streaming
Streaming refers to the continuous reading of datasets, either bounded or unbounded, as they are generated. While we will not discuss streaming in great detail, it does warrant further research. For a comprehensive understanding of streaming data sources, we suggest exploring resources like Streaming 101, Streaming Databases, and, of course, Fundamentals of Data Engineering.
Methods
Common methods of streaming unbounded data include:
- Windowing
  Segmenting a data source into finite chunks based on temporal boundaries.
- Fixed windows
  Data is essentially “micro-batched” and read in small fixed windows to a target.
- Sliding windows
  Similar to fixed windows, but with overlapping boundaries.
- Sessions
  Dynamic windows in which sequences of events are separated by gaps of inactivity—in sessions, the “window” is defined by the data itself.
- Time-agnostic
  Suitable for data where time isn’t crucial, often utilizing batch workloads.
Figure 1-4 demonstrates the difference between fixed windows, sliding windows, and sessions. It’s crucial to differentiate between the actual event time and the processing time, since discrepancies may arise.
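For a sense of how these strategies look in code, here is a PySpark sketch expressing fixed, sliding, and session windows over an event-time column; the rate source and column names are stand-ins for a real stream.

```python
# Fixed, sliding, and session windows with PySpark's built-in window functions.
# The rate source is a stand-in for a real event stream.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
events = (spark.readStream.format("rate").load()
          .withColumnRenamed("timestamp", "event_time"))

# Fixed (tumbling) windows: non-overlapping 5-minute buckets.
fixed = events.groupBy(F.window("event_time", "5 minutes")).count()

# Sliding windows: 5-minute windows that start every minute, so they overlap.
sliding = events.groupBy(F.window("event_time", "5 minutes", "1 minute")).count()

# Session windows: a new window starts after 30 minutes of inactivity.
sessions = events.groupBy(F.session_window("event_time", "30 minutes")).count()

# Each of these would then be written out with writeStream and an output mode.
```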
Message services
When we say “message services,” we refer to “transportation layers” or systems for communicating and transporting streaming data. One important note is that this is not a direct comparison; while there is overlap in these services, many operate under fundamentally different architectures, rendering “Kafka versus Pub/Sub” or “Kinesis versus Redpanda” discussions largely irrelevant.
- Apache Kafka
  Originated at LinkedIn in 2011, Apache Kafka started as a message queue system but quickly evolved into a distributed streaming platform. While Kafka’s design allows for high throughput and scalability, its inherent complexity remains a hurdle for many. (A minimal Python producer sketch follows this list.)
- Redpanda
  Developed as an alternative to Kafka, Redpanda boasts similar performance with simplified configuration and setup. Redpanda is written in C++ rather than Java and is compatible with the Kafka APIs.
- Pub/Sub
  Pub/Sub is the Google Cloud offering for a durable, dynamic messaging queue. Unlike Kafka, Google Pub/Sub scales dynamically to handle variable workloads. Rather than dealing in “streams” and “shards,” Pub/Sub opts for “topics” and “subscriptions.” A big draw to Pub/Sub is the elimination of most “maintenance” tasks—it’s almost fully managed.
- Kinesis
  Kinesis is another robust, fully managed service. As an Amazon service, Kinesis offers the obvious ease of integration with other Amazon Web Services (AWS) offerings while bringing automatic scalability and real-time data processing. Like Pub/Sub, Kinesis stands out for its managed service nature, offering a lower operational burden than Apache Kafka.
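To ground the discussion, here is a minimal sketch of publishing an event to a Kafka topic from Python using the confluent-kafka client (covered again later in this chapter); the broker address, topic, and payload are placeholders.

```python
# A minimal Kafka producer sketch using the confluent-kafka Python client.
# Broker address, topic name, and payload are placeholders.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def delivery_report(err, msg):
    # Called once per message to confirm delivery or surface an error.
    if err is not None:
        print(f"Delivery failed: {err}")

event = {"order_id": 123, "amount_usd": 42.50}
producer.produce(
    "orders",
    key=str(event["order_id"]),
    value=json.dumps(event),
    callback=delivery_report,
)
producer.flush()  # block until outstanding messages are delivered
```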
Stream processing engines
Stream processing is about analyzing and acting on real-time data (streams). Thanks in part to Kafka’s longevity, the three most popular and well-known stream processing engines are:
- Apache Flink
  An open source engine that continuously processes both unbounded and bounded datasets with minimal downtime. Apache Flink ensures low latency through in-memory computations, offers high availability by eliminating single points of failure, and scales horizontally.
- Apache Spark Structured Streaming
  An arm of the Apache Spark ecosystem designed to handle real-time data processing. It brings the familiarity and power of Spark’s DataFrame and Dataset APIs to streaming data. Structured Streaming might be an attractive option given the popularity of Apache Spark in data processing and the ubiquity of the engine in tools like Databricks. (A short example follows this list.)
- Apache Kafka Streams
  A library built on Kafka that provides stateful processing capabilities, though its tight coupling to Java can be limiting.
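As a brief Structured Streaming example, the sketch below reads a Kafka topic and appends it to a Delta staging path. The broker, topic, and paths are placeholders, and the cluster is assumed to have the Kafka connector and Delta Lake available.

```python
# Structured Streaming sketch: read a Kafka topic and append it to a Delta
# staging path. Broker, topic, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "orders")
    .load())

# Kafka delivers keys and values as binary; cast them before parsing downstream.
decoded = stream.select(
    F.col("key").cast("string").alias("order_id"),
    F.col("value").cast("string").alias("payload"),
    "timestamp",
)

(decoded.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/orders")
    .start("s3://my-bucket/bronze/orders_stream"))
```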
Simplifying stream processing
Several relatively new solutions simplify stream processing by offering straightforward clients, with a focus on performance and simple development cycles.
- Managed platforms
  Taking Databricks as an example: leveraging tools like Delta Live Tables (DLT) or simply running Spark Structured Streaming jobs on the Databricks runtime can abstract away much of the complexity and drastically simplify the process of building streaming systems.
- Confluent Kafka
  An attempt to bring Apache Kafka capabilities to Python, although it remains rudimentary compared with its Java counterpart. Confluent Kafka is simply a client library, in the same way psycopg2 is a Postgres client library.
- Bytewax
  A library that aims to bridge the gap by offering a more intuitive, Pythonic way of dealing with stream processing, making it more accessible to a wider range of developers. Built on Rust, Bytewax is highly performant, simpler than Flink, and boasts shorter feedback loops and easy deployment/scalability.
Still newer tools that seek to unify stream processing—like Apache Beam or Estuary Flow—or combine stream processing directly with databases (streaming databases) are growing in popularity. We recommend Streaming Systems and Streaming Databases for an in-depth look.
The streaming landscape, while complex, has seen strides in simplification and user-friendliness, especially when considering managed platforms, like Databricks, and low-latency micro-batch solutions, like Spark Structured Streaming.
While many think of real-time data as the ultimate goal, we emphasize a “right-time” approach. As with any solution, latency can always be optimized further, but the cost (and complexity) of the solution will increase proportionally. Most will find going from daily or semi-daily data to hourly or subhourly data a perfectly acceptable solution.
Payload
The term “payload” refers to the actual message being transmitted, along with any metadata or headers used for routing, processing, or formatting the data. Data payloads are inherently broadly defined, since they can take almost any shape imaginable. In this section, we’ll discuss typical payload characteristics.
Volume
Volume is a pivotal factor in data ingestion decisions, influencing the scalability of both processing and storage solutions. When assessing data volume, be sure to consider factors like:
- Cost
  Higher volumes often lead to increased costs, both in terms of storage and compute resources. Make sure to align the cost factor with your budget and project needs, including the storage/staging costs associated with warehouses, lakes, or lakehouses, depending on your solution.
- Latency
  Depending on whether you need real-time, near-real-time, or batch data ingestion, latency can be a critical factor. Real-time processing not only requires more resources, but it also necessitates greater efficiency when volume spikes. For any data volume, be sure your systems can handle your latency requirements.
- Throughput/scalability
  It’s essential to know the ingestion tool’s ability to handle the sheer volume of incoming data. If the data source generates large amounts of data, the ingestion tool should be capable of ingesting that data without causing bottlenecks.
- Retention
  With high volumes, data retention policies become more important. You’ll need a strategy to age out old data or move it to cheaper, long-term storage. Beyond retaining old data, also consider security and the ability to backfill (restore) lost data.
For handling sizable datasets, using compressed formats like Apache Avro or Parquet is crucial. Each offers distinct advantages and constraints, especially regarding schema evolution. For each of these considerations, be sure to look to the future—tenable solutions can quickly disintegrate with order-of-magnitude increases in data volume.
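As a small illustration, the sketch below writes staged data as compressed Parquet with PySpark; the paths are placeholders, and Avro output would look similar (via the spark-avro package).

```python
# Writing staged data as compressed, columnar Parquet to keep storage costs
# manageable as volume grows. Paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("s3://my-bucket/landing/events/")

(df.write
    .mode("append")
    .option("compression", "snappy")  # snappy trades a bit of size for faster reads/writes
    .parquet("s3://my-bucket/bronze/events_parquet"))
```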
Structure and shape
Data varies widely in form and structure, ranging from neat, relational “structured” data to more free-form “unstructured” data. Importantly, structure doesn’t equate to quality; it merely signifies the presence of a schema.
In today’s AI-driven landscape, unstructured data’s value is soaring as advancements in large language and machine learning models enable us to mine rich insights from such data. Despite this, humans have a penchant for structured data when it comes to in-depth analysis, a fact underscored by the enduring popularity of SQL—Structured Query Language—a staple in data analytics for nearly half a century.
Unstructured
As we’ve alluded, unstructured data is data without any predefined schema or structure. Most often, it’s represented as text, but other forms of media represent unstructured data, too. Video, audio, and imagery all have elements that may be analyzed numerically. Unstructured data might be text from a Pierce Brown novel:
“A man thinks he can fly, but he is afraid to jump. A poor friend pushes him from behind.” He looks up at me. “A good friend jumps with.”
Such data often feeds into machine learning or AI applications, underlining the need to understand stakeholder requirements comprehensively. Given the complexity of machine learning, it’s vital to grasp how this unstructured data will be utilized before ingesting it. Metrics like text length or uncompressed size may serve as measures of shape.
Semi-structured
Semi-structured data lies somewhere between structured and unstructured data—XML and JSON are two popular formats. Semi-structured data might take the form of a JSON payload:
'{"friends": ["steph", "julie", "thomas", "tommy", "michelle", "tori", “larry”]}'
As data platforms continue to mature, so too will the ability to process and analyze semi-structured data directly. The following snippet shows how to parse semi-structured JSON in Google BigQuery to pivot a list into rows of data:
```sql
WITH j_data AS (
  SELECT
    JSON '{"friends": ["steph", "julie", "thomas", "tommy", "michelle", "tori", "larry"]}'
      AS my_friends_json
),
l_data AS (
  SELECT
    JSON_EXTRACT_ARRAY(
      JSON_QUERY(j_data.my_friends_json, "$.friends"),
      '$'
    ) AS my_friends_list
  FROM j_data
)
SELECT my_friends
FROM l_data, UNNEST(l_data.my_friends_list) AS my_friends
ORDER BY RAND()
```
In some situations, moving data processing downstream to the analytics layer is worthwhile—it allows analysts and analytics engineers greater flexibility in how they query and store data. Semi-structured data in a warehouse (or accessed via external tables) allows for flexibility in the case of changing schemas or missing data while still providing all the benefits of tabular data manipulation and SQL.
Still, we must be careful to properly validate this data to ensure that missing data isn’t causing errors. Many invalid queries result from the improper consideration of NULL data.
Describing the shape of JSON frequently involves discussing keys, values, and the number of nested elements. We highly recommend tools like JSONLint and JSON Crack for this purpose. VS Code extensions also exist to validate and format JSON/XML data.
Structured
Structured sources are the golden “ideal”: neatly organized with fixed schemas and unchanging keys. For over 50 years, SQL has been the language of choice for querying structured data. When storing structured data, we frequently concern ourselves with the number of columns and the length of the table (in rows). These characteristics inform our use of materialization and incremental builds and, in aggregate, the choice between an OLAP (column-oriented) and OLTP (row-oriented) database.
Though much data today lacks structure, we still find SQL to be the dominant tool for analysis. Will this change? Possibly, but as we showed, it’s more likely that SQL will simply adapt to accommodate semi-structured formats. Though natural-language querying tools have started appearing with AI advancements, SQL is often the intermediary. If SQL disappears from data analysis, it will likely live on as an API.
Format
What is the best format for source data? While JSON and CSV (comma-separated values) are common choices, an infinite number of format considerations can arise. For instance, some older SFTP/FTP transfers might arrive compressed, necessitating an extra extraction step.
The data format often dictates processing requirements and available solutions. While a tool like Airbyte might seamlessly integrate with a CSV source, it could stumble with a custom compression method or a quirky Windows encoding (believe us, it happens).
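As a quick example of coping with one of these quirks, here is a pandas sketch that reads a gzip-compressed CSV exported with a legacy Windows encoding; the filename, encoding, and delimiter are illustrative assumptions.

```python
# Reading a less friendly delivery: a gzip-compressed CSV with a legacy
# Windows encoding. Filename, encoding, and delimiter are assumptions.
import pandas as pd

df = pd.read_csv(
    "vendor_export_2024-01-01.csv.gz",
    compression="gzip",   # pandas decompresses on the fly
    encoding="cp1252",    # legacy Windows encoding rather than UTF-8
    sep=";",              # non-standard delimiters are another common surprise
)
print(df.head())
```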
If at all possible, we advise opting for familiar, popular data formats. Like repairing a vehicle, the more popular the format, the easier it will be to find resources, libraries, and instructions. Still, in our experience it’s a rite of passage to grapple with a perplexing format, but that’s part of what makes our jobs fun!
Variety
It’s highly likely you’ll be dealing with multiple sources and thus varying payloads. Data variety plays a large role in choosing your ingestion solution—it must not only be capable of handling disparate data types but also be flexible enough to adapt to schema changes and varying formats. Variety makes governance and observability particularly challenging, something we’ll discuss in Chapter 5.
Failing to account for data variety can result in bottlenecks, increased latency, and a haphazard pipeline, compromising the integrity and usefulness of the ingested data.
Choosing a Solution
The best tools for your team will be the ones that support your sources and targets. Given the unique data requirements of each organization, your choices will be context-specific. Still, we can’t stress enough the importance of tapping into the knowledge of mentors, peers, and industry experts. While conferences can be pricey, their knowledge yield can be priceless. For those on a tight budget, data communities can be invaluable. Look for them on platforms like Slack and LinkedIn and through industry newsletters.
When considering an ingestion solution, we think in terms of general and solution-specific considerations—the former applying to all tools we’ll consider, and the latter being specific to the class of tooling.
General considerations include extensibility, the cost to build, the cost to maintain, the cost to switch, and other, system-level concerns.
Solution-specific considerations are dependent on the class of tooling, which commonly takes one of two forms:
- Declarative solutions dictate outcomes. For example, “I use the AWS Glue UI to build a crawler that systematically processes data” or “I create a new Airbyte connection via a UI.”
- Imperative solutions dictate actions. For example, “I build a lambda that calls the Stripe API, encodes/decodes data, and incrementally loads it to Snowflake.” (A simplified sketch of this pattern follows this list.)
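To make the imperative flavor concrete, here is a deliberately simplified sketch: pull new records from an API and land them in cloud storage for a downstream load. The endpoint, bucket, response shape, and bookmark logic are all hypothetical placeholders.

```python
# A simplified imperative ingestion sketch: call an API incrementally and land
# the raw payload in S3. Endpoint, bucket, and response shape are hypothetical.
import datetime
import json

import boto3
import requests

def ingest_payments(last_run: str) -> None:
    # Pull records created since the previous run (incremental extraction).
    resp = requests.get(
        "https://api.example.com/v1/payments",
        params={"created_after": last_run},
        timeout=30,
    )
    resp.raise_for_status()
    records = resp.json()["data"]  # assumes the API wraps records in a "data" key

    # Land the raw payload in object storage; a downstream job loads it from there.
    key = f"landing/payments/{datetime.date.today().isoformat()}.json"
    boto3.client("s3").put_object(
        Bucket="my-ingestion-bucket",
        Key=key,
        Body=json.dumps(records).encode("utf-8"),
    )

ingest_payments(last_run="2024-01-01T00:00:00Z")
```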
Each of these solutions has its pros and cons. We’ll briefly discuss each and present our recommended method for approaching data integration.
Declarative Solutions
We classify declarative solutions as legacy, modern, or native, largely dependent on their adherence to modern data stack (MDS) principles, like infrastructure-as-code, DRY (don’t repeat yourself), and other software engineering best practices. Native platforms differ from the first two—they are integrated directly into a cloud provider:
- Legacy
  Think Talend, WhereScape, and Pentaho. These tools have robust connectors and benefit from a rich community and extensive support. However, as the data landscape evolves, many of these tools lag behind, not aligning with the demands of the MDS. Unless there’s a compelling reason, we’d recommend looking beyond legacy enterprise tools.
- Modern
  Here’s where Fivetran, Stitch, and Airbyte come into play. Designed around “connectors,” these tools can seamlessly link various sources and targets, powered by state-of-the-art tech and leveraging the best of the MDS.
- Native
  In the first two solutions, we’re working from the assumption that data must be moved from one source to another—but what if you had a managed platform that supported ingestion out of the box? Databricks, for example, can natively ingest from message buses and cloud storage:
```sql
CREATE STREAMING TABLE raw_data
AS SELECT *
FROM cloud_files("/raw_data", "json");

CREATE STREAMING TABLE clean_data
AS SELECT SUM(profit)...
FROM raw_data;
```
While there is no “right” type of declarative solution, many will benefit from the reduced cost to build and maintain these solutions, especially those native to your existing cloud provider/platform.
Cost to build/maintain
Declarative solutions are largely hands-off—here’s where you get a bang for your buck! Dedicated engineers handle the development and upkeep of connectors. This means you’re delegating the heavy lifting to specialists. Most paid solutions come with support teams or dedicated client managers, offering insights and guidance tailored to your needs. These experts likely have a bird’s-eye view of the data landscape and can help you navigate ambiguity around specific data problems or connect you with other practitioners.
Extensibility
Extensibility revolves around how easy it is to build upon existing solutions. How likely is that new connector to be added to Airbyte or Fivetran? Can you do it yourself, building on the same framework? Or do you have to wait weeks/months/years for a product team? Will using Stitch suffice, or will you need a complementary solution? Remember, juggling multiple solutions can inflate costs and complicate workflows. In declarative solutions, extensibility is huge. No one wants to adopt a solution only to learn it solves just 15% of their needs.
Cost to switch
The cost to switch is where the limitations of declarative tools come to light. Proprietary frameworks and specific CDC methods make migrating to another tool expensive. While sticking with a vendor might be a necessary compromise, it’s essential to factor this in when evaluating potential solutions.
Imperative Solutions
Imperative data ingestion approaches can be in-house Singer taps, lambda functions, Apache Beam templates, or jobs orchestrated through systems like Apache Airflow.
Typically, larger organizations with substantial resources find the most value in adopting an imperative methodology. Maintaining and scaling a custom, in-house ingestion framework generally requires the expertise of multiple data engineers or even a dedicated team.
The biggest benefit of imperative solutions is their extensibility.
Extensibility
By nature, imperative is custom—that means each tap and target is tailored to the needs of the business. When exploring data integration options, it quickly becomes apparent that no single tool meets every criterion. Standard solutions inevitably involve compromises. However, with an imperative approach, there’s the freedom to design it precisely according to the desired specifications. Unfortunately, without a large, dedicated data engineering organization, this extensibility is incredibly expensive to build and maintain.
Cost to build/maintain
While imperative solutions can solve complex and difficult ingestion problems, they require quite a bit of engineering. One look at the Stripe entity relationship diagram should be enough to convince you that this can be incredibly time-consuming. Additionally, the evolving nature of data—like changes in schema or the deprecation of an API—can amplify the complexity. Managing a single data source is one thing, but what about when you scale to multiple sources?
A genuinely resilient, imperative system should incorporate best practices in software design, emphasizing modularity, testability, and clarity. Neglecting these principles might compromise system recovery times and hinder scalability, ultimately affecting business operations. Hence, we suggest that only enterprises with a robust data engineering infrastructure consider going fully imperative.
Cost to switch
Transitioning from one imperative solution to another might not always be straightforward, given the potential incompatibility between different providers’ formats. However, on a brighter note, platforms based on common frameworks, like Singer, might exhibit more compatibility, potentially offering a smoother switch compared with purely declarative tools such as Fivetran or Airbyte.
Hybrid Solutions
Striking the right balance in integration often means adopting a hybrid approach. This might involve leveraging tools like Fivetran for most integration tasks, while crafting in-house solutions for unique sources, or opting for platforms like Airbyte/Meltano and creating custom components for unsupported data sources.
Contributing to open source can also be rewarding in a hybrid environment. Though not without faults, hybrid connectors, like those in Airbyte or Singer taps, benefit from expansive community support. Notably, Airbyte’s contributions to the sector have positively influenced market dynamics, compelling competitors like Fivetran to introduce free tiers. We also encourage a proactive approach to exploring emerging libraries and tools. For instance, dlt (data load tool—not to be confused with Delta Live Tables!) is an open source library showing significant promise.
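To give a flavor of the library, here is a rough dlt sketch; treat the exact API as illustrative (the pipeline name, the duckdb destination, and the sample rows are assumptions) and check the project’s documentation for current usage.

```python
# A rough sketch of a lightweight dlt (data load tool) pipeline. The API shown
# is illustrative; names, destination, and sample rows are assumptions.
import dlt

# Any iterable of dicts (an API response, a generator, etc.) can act as a source.
rows = [
    {"id": 1, "customer": "steph", "amount_usd": 20.0},
    {"id": 2, "customer": "julie", "amount_usd": 35.5},
]

pipeline = dlt.pipeline(
    pipeline_name="payments_demo",
    destination="duckdb",   # a local destination for experimentation
    dataset_name="staging",
)

# dlt infers the schema, creates the table, and tracks load state for you.
load_info = pipeline.run(rows, table_name="payments")
print(load_info)
```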
Consider data integration akin to choosing an automobile. Not every task demands the prowess of a Formula One car. More often, what’s required is a dependable, versatile vehicle. However, while a Toyota will meet 99% of uses, you won’t find a Sienna in an F1 race. The optimal strategy? Rely on the trusty everyday vehicle, but ensure access to high performance tools when necessary.