Chapter 9. Best Practices for Maintaining Pipelines
Up to this point, this book has been focused on building data pipelines. This chapter discusses how to maintain those pipelines as you encounter increased complexity and deal with the inevitable changes in the systems that your pipelines rely on.
Handling Changes in Source Systems
One of the most common maintenance challenges for data engineers is dealing with the fact that the systems they ingest data from are not static. Developers are always making changes to their software, whether adding features, refactoring the codebase, or fixing bugs. When those changes modify the schema or meaning of the data to be ingested, a pipeline is at risk of failure or inaccuracy.
As discussed throughout this book, the reality of a modern data infrastructure is that data is ingested from a wide variety of sources. As a result, it’s difficult to find a one-size-fits-all solution to handling schema and business logic changes in source systems. Nonetheless, there are a few best practices I recommend investing in.
Introduce Abstraction
Whenever possible, it’s best to introduce a layer of abstraction between the source system and the ingestion process. It’s also important for the owner of the source system to either maintain or be aware of the abstraction method.
For example, instead of ingesting data directly from a Postgres database, consider working with the owner of the database to build a REST API that pulls from the database ...
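To make the idea of an abstraction layer concrete, here is a minimal Python sketch, not a definitive implementation. It assumes a hypothetical contract of three fields (order_id, customer_id, order_total) agreed on with the source system's owner, and hypothetical internal column names (id, cust_id, total_cents). None of these names come from a real system. The owner maintains the mapping function, and the ingestion side checks every record against the contract, so a change to the internal columns surfaces as an explicit error rather than as silently wrong data.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical contract agreed on with the source system's owner:
# field name -> expected Python type.
CONTRACT_FIELDS = {"order_id": int, "customer_id": int, "order_total": float}


def to_contract(row: Dict[str, Any]) -> Dict[str, Any]:
    """Map internal column names (assumed) to the contract's field names.

    This is the piece the source owner would maintain, for example
    inside an API that sits in front of the database.
    """
    return {
        "order_id": row["id"],
        "customer_id": row["cust_id"],
        "order_total": float(row["total_cents"]) / 100,
    }


def validate(record: Dict[str, Any]) -> None:
    """Raise if a record does not match the contract."""
    missing = CONTRACT_FIELDS.keys() - record.keys()
    if missing:
        raise ValueError(f"missing contract fields: {sorted(missing)}")
    for field, expected_type in CONTRACT_FIELDS.items():
        if not isinstance(record[field], expected_type):
            raise TypeError(f"{field} should be {expected_type.__name__}")


def ingest(rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert source rows to contract records and validate each one."""
    records = []
    for row in rows:
        record = to_contract(row)
        validate(record)
        records.append(record)
    return records
```

If the owner renames an internal column, only to_contract needs to change, and the ingestion code that depends on the contract fields is left alone.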