2. Feature Engineering Techniques in Machine Learning

[Opening photograph: top view of a wooden slab with several horizontal layers. Photo by Annamária Borsos.]

A dataset is a sample composed of observations; each observation is a set of values associated with a set of variables. The transformations we need to perform before training models depend on the nature of those variables, or features. Variables can be quantitative or qualitative. Quantitative variables are either continuous, meaning they take real values, or discrete, meaning they can only take values from a finite or countably infinite set. Qualitative variables do not represent mathematical magnitudes; they are ordinal when an order relationship can be defined between their values, and nominal when no such order exists.
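As a minimal sketch of these four kinds of variables, consider a small hypothetical pandas DataFrame (the column names and values here are illustrative, not from the book). An ordered categorical type lets us make the order relationship of an ordinal variable explicit:

```python
import pandas as pd

# Hypothetical dataset mixing the four variable kinds described above.
df = pd.DataFrame({
    "temperature": [21.5, 19.8, 23.1],          # quantitative, continuous
    "num_children": [0, 2, 1],                   # quantitative, discrete
    "education": ["primary", "master", "phd"],   # qualitative, ordinal
    "city": ["Paris", "Lyon", "Nice"],           # qualitative, nominal
})

# An explicit ordered categorical encodes the order relationship
# among the values of the ordinal variable.
df["education"] = pd.Categorical(
    df["education"],
    categories=["primary", "master", "phd"],
    ordered=True,
)

print(df.dtypes)
print(df["education"].cat.codes.tolist())  # integer codes respecting the order
```

Nominal variables such as `city` have no such ordering, which is why they typically call for a different encoding (for example one-hot) later in the workflow.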

Feature engineering is an important part of the data science workflow and can greatly impact the performance of machine learning algorithms. Anomalies must be corrected, or at least not ignored; we also need to handle missing values, eliminate duplicate observations, digitize the data to facilitate the use of machine learning tools, encode categorical data, rescale data, and perform other tasks. For instance, we can often replace a missing value with the mean, but when the number of missing values is large, this can introduce bias. Instead, we can choose linear regression, which can ...
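The contrast between mean imputation and regression-based imputation can be sketched with scikit-learn; the toy matrix below is an assumption for illustration, with a second column roughly twice the first, so a regression-based imputer can exploit that structure where the column mean cannot:

```python
import numpy as np
from sklearn.impute import SimpleImputer
# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix with a missing value (np.nan); the columns are correlated.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, np.nan],
    [4.0, 8.0],
])

# Mean imputation: simple, but ignores the relationship between columns
# and can bias the data when many values are missing.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Regression-based imputation: models each feature from the others,
# so the filled value follows the (roughly linear) trend in the data.
X_reg = IterativeImputer(random_state=0).fit_transform(X)

print(X_mean[2, 1])  # the mean of the observed values in the column
print(X_reg[2, 1])   # follows the trend of the correlated columns
```

Here the mean imputer fills in the average of the observed column values, while the iterative imputer regresses the second column on the first and produces a value consistent with the linear trend.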
