In the last two chapters, we discussed how to orchestrate machine learning workflows and manage their metadata. With this in mind, we can start building our workflows.
Data is the basis for every machine learning model, and the model's usefulness and performance depend on the data used to train, validate, and analyze it. As you can imagine, without robust data, we can't build robust models. You might have heard the colloquial phrase "garbage in, garbage out," meaning that our models won't perform well if the underlying data isn't curated and validated. This is exactly the purpose of the first workflow step in our machine learning pipeline: data validation.
In this chapter, we introduce a Python package from the TensorFlow ecosystem called TensorFlow Data Validation. We show you how to set up the package in your data science projects, walk you through the common use cases, and highlight some particularly useful workflows.
TensorFlow Data Validation assists you in comparing multiple datasets with each other. It highlights whether your data schema changes over time (data drift) and whether your training data differs significantly from the data used to validate your model or the data sent to the model for inference (data skew).
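Before we turn to TensorFlow Data Validation's own API in the following sections, the underlying idea of drift detection can be sketched with plain pandas: summarize each dataset with per-column statistics, then flag columns whose distribution has shifted between the training data and newly arriving data. The helper names `summarize` and `detect_drift` and the threshold value below are illustrative choices for this sketch, not part of the TensorFlow Data Validation API:

```python
import pandas as pd

def summarize(df):
    """Compute simple per-column statistics, loosely analogous to the
    dataset statistics that data validation tools generate."""
    return {
        col: {
            "mean": df[col].mean(),
            "std": df[col].std(),
            "missing": int(df[col].isna().sum()),
        }
        for col in df.select_dtypes("number").columns
    }

def detect_drift(train_stats, new_stats, threshold=0.1):
    """Flag columns whose mean shifted by more than `threshold`
    training standard deviations (an illustrative drift criterion)."""
    drifted = []
    for col, stats in train_stats.items():
        if col not in new_stats:
            continue  # column missing entirely -- a schema change
        std = stats["std"] or 1.0  # avoid division by zero
        shift = abs(new_stats[col]["mean"] - stats["mean"]) / std
        if shift > threshold:
            drifted.append(col)
    return drifted

# Hypothetical example data: the income distribution shifts between
# the training set and the data arriving at serving time.
train = pd.DataFrame({"age": [25, 30, 35, 40], "income": [50, 55, 60, 65]})
serving = pd.DataFrame({"age": [26, 31, 34, 41], "income": [90, 95, 100, 105]})

drifted = detect_drift(summarize(train), summarize(serving))
print(drifted)  # only 'income' exceeds the drift threshold
```

TensorFlow Data Validation performs this kind of comparison for you at scale, with a much richer set of statistics and a formal schema, as we show later in the chapter.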
At the end of the chapter, we integrate our first workflow step into our Airflow pipelines.
In machine learning, we are trying to learn from patterns in data sets ...