Chapter 2. Data Management Principles
In this book, we are rarely concerned with the algorithmic details of how models are constructed or how they’re structured. The most exciting algorithmic development of last year is the mundane executable of next year. Instead, we are overwhelmingly interested in two things: the data used to construct the models, and the processing pipeline that takes the data and transforms it into models.
Ultimately, ML systems are data processing pipelines, and their purpose is to extract usable and repeatable insights from data. There are, however, some key differences between ML pipelines and conventional log processing or analysis pipelines. ML pipelines have very different and specific constraints, and they fail in different ways. Their success is hard to measure, and many failures are difficult to detect. (We cover these topics at length in Chapter 9.) Fundamentally, though, both kinds of pipeline consume data and output a processed representation of that data, albeit in vastly different forms. As such, ML systems depend thoroughly and completely on the structure, performance, accuracy, and reliability of their underlying data systems. Treating an ML system as a data processing pipeline is the most useful way to think about it from the reliability point of view.
In this chapter, we will start with a deep dive into data itself:
Where data comes from
How to interpret data
Data quality
Updating data sources (which we use and how we use them)
Assembling data into an appropriate form for use
We’ll cover the production ...