Kubeflow for Machine Learning

by Trevor Grant, Holden Karau, Boris Lublinsky, Richard Liu, Ilan Filonenko
October 2020
Intermediate to advanced
261 pages
6h 19m
English
O'Reilly Media, Inc.

Chapter 5. Data and Feature Preparation

Machine learning algorithms are only as good as their training data. Getting good data for training involves data and feature preparation.

Data preparation is the process of sourcing the data and making sure it’s valid. This is a multistep process that can include data collection, augmentation, statistics calculation, schema validation, outlier pruning, and various validation techniques. Not having enough data can lead to overfitting, missing significant correlations, and more. Putting in the effort to collect more records, and more information about each sample, during data preparation can considerably improve the model.
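
To make a few of these steps concrete, here is a minimal sketch of schema validation followed by outlier pruning, using only the Python standard library. The `SCHEMA` mapping, the sample records, and the median-absolute-deviation rule with a threshold of 5 are all assumptions chosen for illustration; they are not taken from the book, and a real pipeline would likely use a dedicated validation tool.

```python
import statistics

# Hypothetical schema for illustration: column name -> expected Python type.
SCHEMA = {"age": int, "income": float}


def validate_schema(records, schema):
    """Keep only records that contain every schema column with the expected type."""
    return [
        record for record in records
        if all(isinstance(record.get(col), typ) for col, typ in schema.items())
    ]


def prune_outliers(records, column, max_deviations=5.0):
    """Drop records more than max_deviations MADs away from the column median."""
    values = [record[column] for record in records]
    median = statistics.median(values)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return list(records)  # no spread to measure against; keep everything
    return [
        record for record in records
        if abs(record[column] - median) / mad <= max_deviations
    ]


# Made-up sample data: one record has a bad type, one has an extreme income.
raw = [
    {"age": 34, "income": 40_000.0},
    {"age": 45, "income": 42_000.0},
    {"age": 29, "income": 38_000.0},
    {"age": 51, "income": 41_000.0},
    {"age": 38, "income": 1_000_000.0},
    {"age": "unknown", "income": 39_000.0},
]

valid = validate_schema(raw, SCHEMA)
pruned = prune_outliers(valid, "income")
print(len(raw), len(valid), len(pruned))
```

The median-based rule is used here instead of a mean and standard deviation because, on a small sample, a single extreme value inflates the standard deviation enough to hide itself.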

Feature preparation (sometimes called feature engineering) refers to transforming the raw input data into features that the machine learning model can use. Poor feature preparation can cause the model to miss important relationships: for example, a linear model whose nonlinear terms have not been expanded, or a deep learning model trained on images with inconsistent orientation.
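
As a rough illustration of expanding nonlinear terms for a linear model, the sketch below appends every degree-2 product to a feature vector, so that a model that is linear in its inputs can still fit squared and interaction terms. The function name `expand_degree_two` is hypothetical and the approach is one simple option, not the book's method.

```python
from itertools import combinations_with_replacement


def expand_degree_two(features):
    """Return the original features followed by all degree-2 products.

    For [a, b] this yields [a, b, a*a, a*b, b*b].
    """
    products = [x * y for x, y in combinations_with_replacement(features, 2)]
    return list(features) + products


expanded = expand_degree_two([2.0, 3.0])
print(expanded)  # [2.0, 3.0, 4.0, 6.0, 9.0]
```

The number of generated terms grows quadratically with the number of input features, so in practice this kind of expansion is usually applied to a selected subset of columns.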

Small changes in data and feature preparation can lead to significantly different model outputs. An iterative approach works best for both data and feature preparation: revisit them as your understanding of the problem and the model changes. Kubeflow Pipelines makes it easier to iterate on data and feature preparation. We will explore how to use hyperparameter tuning to iterate in Chapter 10.

In this chapter, we will cover different approaches to data and feature preparation ...



ISBN: 9781492050117