Chapter 4. Data Ingestion: Extracting Data
As discussed in Chapter 3, the ELT pattern is the ideal design for data pipelines built for data analysis, data science, and data products. The first two steps in the ELT pattern, extract and load, are collectively referred to as data ingestion. This chapter discusses getting your development environment and infrastructure set up for both, and it goes through the specifics of extracting data from various source systems. Chapter 5 discusses loading the resulting datasets into a data warehouse.
Note
The extract and load code samples in this chapter are fully decoupled from each other. Coordinating the two steps to complete a data ingestion is a topic that’s discussed in Chapter 7.
As discussed in Chapter 2, there are numerous types of source systems to extract from, as well as numerous destinations to load into. In addition, data comes in many forms, each of which presents its own ingestion challenges.
This chapter and the next include code samples for extracting data from, and loading data into, common systems. The code is highly simplified and contains only minimal error handling. Each example is intended as an easy-to-understand starting point for data ingestion, yet each is fully functional and can be extended into a more scalable solution.
Note
The code samples in this chapter write extracted data to CSV files to be loaded into the destination data warehouse. There are times when it makes more sense to store extracted data in another format, ...
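As a rough illustration of what "writing extracted data to a CSV file" can look like, here is a minimal sketch using only Python's standard `csv` module. The function name `write_rows_to_csv`, the sample rows, and the output filename are hypothetical and chosen for illustration; they are not taken from the chapter's own samples, which may use a different delimiter or layout.

```python
import csv
import os
import tempfile


def write_rows_to_csv(rows, path):
    """Write an iterable of row tuples to a CSV file and return the row count.

    Hypothetical helper for illustration; not the chapter's own code.
    """
    count = 0
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row)  # None values are written as empty fields
            count += 1
    return count


# Example usage with made-up rows standing in for extracted source data.
sample_rows = [(1, "widget", 9.99), (2, "gadget", 19.5)]
out_path = os.path.join(tempfile.gettempdir(), "extract_sample.csv")
written = write_rows_to_csv(sample_rows, out_path)
print(f"wrote {written} rows to {out_path}")
```

Keeping the extract step's only output a flat file is what lets it stay decoupled from the load step: the loader needs to know the file's location and layout, not how the rows were obtained.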