Chapter 6. Artifact and Metadata Store

Machine learning typically involves working with large amounts of raw and intermediate (transformed) data, with the ultimate goal of creating and deploying a model. To understand a model, we need to be able to explore the datasets used to create it and the transformations applied to them (data lineage). The collection of these datasets and the transformations applied to them is called the metadata of our model.1

Model metadata is critical for reproducibility in machine learning;2 reproducibility is critical for reliable production deployments. Capturing the metadata allows us to understand variations when rerunning jobs or experiments. Understanding variations is necessary to iteratively develop and improve our models. It also provides a solid foundation for model comparisons. As Pete Warden defined it in this post:

To reproduce results, code, training data, and the overall platform need to be recorded accurately.

The same information is also required for other common ML operations—model comparison, reproducible model creation, etc.
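To make this concrete, here is a minimal sketch of what recording code, training data, and platform information for a single run might look like. It uses only the Python standard library and is not the API of Kubeflow ML Metadata or MLflow. The function names (`file_sha256`, `capture_run_metadata`), the record layout, and the sample values are hypothetical and chosen purely for illustration.

```python
import hashlib
import json
import platform
import sys
import tempfile
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def capture_run_metadata(code_version, data_paths, params=None):
    """Build a metadata record for one training run (hypothetical layout).

    code_version: an identifier for the code, e.g. a version-control commit.
    data_paths:   files used as training data; each is fingerprinted by hash.
    params:       optional dictionary of run parameters.
    """
    return {
        "code_version": code_version,
        "training_data": {str(p): file_sha256(p) for p in data_paths},
        "platform": {
            "python": sys.version.split()[0],
            "os": platform.platform(),
            "machine": platform.machine(),
        },
        "params": dict(params or {}),
    }


if __name__ == "__main__":
    # Illustrative usage with a throwaway data file.
    with tempfile.TemporaryDirectory() as tmp:
        data_file = Path(tmp) / "train.csv"
        data_file.write_text("x,y\n1,2\n")
        record = capture_run_metadata("abc123", [data_file], {"lr": 0.01})
        print(json.dumps(record, indent=2))
```

Hashing the training files rather than storing only their paths means that two runs can be compared even if a file was overwritten in place: a changed digest signals that the data differed between runs.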

There are many different options for tracking the metadata of models. Kubeflow has a built-in tool for this called Kubeflow ML Metadata.3 The goal of this tool is to help Kubeflow users understand and manage their ML workflows by tracking and managing the metadata that the workflows produce. Another tool for tracking metadata that we can integrate into our Kubeflow pipelines is MLflow Tracking. It provides API ...
