Reliable Machine Learning
by Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley, Todd Underwood
Chapter 4. Feature and Training Data
It should be clear by this point that models come from data. This chapter is about the data: how it is created, processed, annotated, stored, and ultimately used to create the model. You will see that managing and handling the data creates specific challenges for repeatability, manageability, and reliability, and we will make some concrete recommendations about how to approach those challenges. For background, make sure to see (if you haven’t already) Chapters 2 and 3.
This chapter covers the infrastructure that accepts data from a source and readies it for use by the training system. We will discuss three fundamental functional subsystems involved in this task: a feature system, a system for human annotations, and a metadata system. We discussed features a little in the previous chapter; another way of thinking about them is that they are characteristics of the input data, especially characteristics that we have determined are predictive of something we care about. Labels are specific cases of the output that we want from the model that we ultimately train. They are used as examples to train that model. Another way to think about labels is that they are the target or “correct” values for a specific data instance that the model will learn. Labels can be extracted from logs by correlating the data with another independent event, or they can be generated by humans. We’ll discuss the systems needed for generation ...