Chapter 5. Data and Feature Preparation

Machine learning algorithms are only as good as their training data. Getting good data for training involves data and feature preparation.

Data preparation is the process of sourcing the data and making sure it’s valid. This is a multistep process that can include data collection, augmentation, statistics calculation, schema validation, outlier pruning, and other validation techniques. Too little data can lead to overfitting, to missing significant correlations, and to other problems. Putting in the effort during data preparation to collect more records, and more information about each sample, can considerably improve the model.
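To make two of these steps concrete, here is a minimal sketch of schema validation and outlier pruning in plain Python. It is an illustration under stated assumptions, not the approach this chapter goes on to use: the `SCHEMA` mapping, the field names, and both helper functions are hypothetical, and the interquartile-range (IQR) rule is just one common way to prune outliers.

```python
from statistics import quantiles

# Hypothetical schema for illustration: field name -> expected Python type.
SCHEMA = {"age": int, "income": float}


def validate_schema(records, schema):
    """Keep only records that contain every schema field with the expected type."""
    return [
        record
        for record in records
        if all(isinstance(record.get(name), typ) for name, typ in schema.items())
    ]


def prune_outliers(records, field, k=1.5):
    """Drop records whose `field` falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [record[field] for record in records]
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [record for record in records if low <= record[field] <= high]


def prepare(records, schema=SCHEMA, outlier_field="income"):
    """Run schema validation first, then outlier pruning on one numeric field."""
    return prune_outliers(validate_schema(records, schema), outlier_field)
```

Validating before pruning matters here: the quartiles are computed only from records that already have a well-typed value in the field being checked.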

Feature preparation (sometimes called feature engineering) refers to transforming the raw input data into features that the machine learning model can use. Poor feature preparation can cause the model to miss important relationships. Examples include a linear model whose nonlinear terms have not been expanded and a deep learning model trained on images with inconsistent orientation.
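The linear-model case can be sketched in a few lines of plain Python. This is an illustrative toy rather than code from the book: `expand_polynomial` and `fit_line` are hypothetical helpers, and the data is synthetic (y is exactly x squared). A line fitted to the raw input finds no relationship at all, while the same fit on the expanded squared term recovers it.

```python
def expand_polynomial(x, degree=2):
    """Expand a scalar input into polynomial features [x, x**2, ..., x**degree]."""
    return [x ** d for d in range(1, degree + 1)]


def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic data: a purely quadratic relationship on symmetric inputs.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]

# Fit on the raw feature: the slope is zero, so the model sees no relation.
raw_slope, raw_intercept = fit_line(xs, ys)

# Fit on the expanded squared feature: the relation becomes linear.
squared = [expand_polynomial(x)[1] for x in xs]
sq_slope, sq_intercept = fit_line(squared, ys)
```

The model and the fitting procedure are identical in both cases; only the feature handed to them differs.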

Small changes in data and feature preparation can lead to significantly different model outputs. Both are best approached iteratively: revisit them as your understanding of the problem and the model changes. Kubeflow Pipelines makes it easier for us to iterate on our data and feature preparation. We will explore how to use hyperparameter tuning to iterate in Chapter 10.

In this chapter, we will cover different approaches to data and feature preparation ...
