Chapter 5. Advanced Labeling, Augmentation, and Data Preprocessing

The topics in this chapter are central to shaping your data so that your model gets the most value from it, especially in a supervised learning setting. Labeling in particular can easily be one of the most expensive and time-consuming activities in the creation, maintenance, and evolution of an ML application. A good understanding of the options available will help you make the most of your resources and budget.

To that end, in this chapter we will discuss data augmentation, a class of methods in which you expand your training dataset to improve training, most often to improve generalization. Data augmentation is almost always based on manipulating your current data to create new, but still valid, variations of your examples.
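
To make this concrete, here is a minimal sketch of label-preserving augmentation for image data, assuming images are float arrays with values in [0, 1]. The function name augment_image and the specific transformations (a horizontal flip and a small brightness shift) are illustrative choices, not a prescribed recipe:

import numpy as np

def augment_image(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Create a new, still-valid variation of an image example.

    Minimal sketch: a random horizontal flip and a small brightness shift,
    two transformations that typically preserve an image's label.
    """
    augmented = image.copy()
    if rng.random() < 0.5:
        augmented = np.fliplr(augmented)        # mirror left-right
    brightness = rng.uniform(-0.1, 0.1)         # small brightness jitter
    augmented = np.clip(augmented + brightness, 0.0, 1.0)
    return augmented

# Example: double a small training set by adding one augmented copy per image.
rng = np.random.default_rng(seed=42)
images = np.random.rand(8, 32, 32, 3)           # stand-in for real training images
augmented_set = np.concatenate(
    [images, np.stack([augment_image(im, rng) for im in images])]
)

The key property is that each transformation produces an example that differs from the original but would still receive the same label.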

We will also discuss data preprocessing, focusing here on domain-specific preprocessing. Different domains, such as time series, text, and images, have specialized forms of feature engineering. We discussed one of these, tokenizing text, in “Consider Instance-Level Versus Full-Pass Transformations”. In this chapter, we’ll review common methods for working with time series data.
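
As a preview of that discussion, the sketch below shows one of the most common time series preprocessing steps, sliding-window framing: consecutive values become an input window, and the value a fixed horizon after the window becomes the target. The helper make_windows and its parameters are illustrative, assuming a univariate series stored as a NumPy array:

import numpy as np

def make_windows(series: np.ndarray, window: int, horizon: int = 1):
    """Turn a univariate series into (input window, target) training pairs.

    Each example is `window` consecutive values; the label is the value
    `horizon` steps after the window ends.
    """
    X, y = [], []
    for start in range(len(series) - window - horizon + 1):
        X.append(series[start : start + window])
        y.append(series[start + window + horizon - 1])
    return np.array(X), np.array(y)

# Example: a toy daily series windowed into one-week inputs predicting the next day.
series = np.arange(30, dtype=float)
X, y = make_windows(series, window=7, horizon=1)
print(X.shape, y.shape)   # (23, 7) (23,)

Each row of X is now a fixed-length feature vector, which lets standard supervised models consume a series of arbitrary length.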

But first, let’s address an important question: How can we assign labels in ways other than going through each example manually? In other words, can we automate labeling, even at the expense of introducing some inaccuracy into the process? The answer ...
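
As a taste of what such automation can look like, the following sketch encodes a simple heuristic as a labeling function that assigns weak sentiment labels or abstains. The function name, label constants, and keyword lists are illustrative assumptions, not the book’s method:

# Programmatic labeling sketch: encode a heuristic as a labeling function
# and accept that its output is noisier than human labels.
POSITIVE_WORDS = {"great", "excellent", "love"}
NEGATIVE_WORDS = {"terrible", "awful", "hate"}
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def keyword_label(text: str) -> int:
    """Assign a weak sentiment label from keyword hits, or abstain."""
    words = set(text.lower().split())
    if words & POSITIVE_WORDS:
        return POSITIVE
    if words & NEGATIVE_WORDS:
        return NEGATIVE
    return ABSTAIN

unlabeled = ["I love this product", "Awful battery life", "Arrived on Tuesday"]
weak_labels = [keyword_label(t) for t in unlabeled]
print(weak_labels)   # [1, 0, -1] -- the last example is left unlabeled

Heuristics like this are cheap to write and apply at scale, but their output is noisier than human labels, which is exactly the trade-off the question above raises.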
