Preparing Your Data

In the previous chapter, we worked with clean data: every value was present, every column was numeric, and when we faced too many features, we had a regularization technique on our side. In real life, data is often not as clean as you would like it to be. Even clean data can sometimes be preprocessed in ways that make things easier for our machine learning algorithm. In this chapter, we will learn about the following data preprocessing techniques:

  • Imputing missing values
  • Encoding non-numerical columns
  • Changing the data distribution
  • Reducing the number of features via selection
  • Projecting data into new dimensions
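
As a rough preview of how these five steps can fit together, here is a minimal sketch using scikit-learn. The toy DataFrame, its column names (age, income, city), and the target values are made up for illustration, and the specific choices (median imputation, one-hot encoding, standardization as one example of changing the data distribution, SelectKBest, and PCA) are assumptions rather than the methods the chapter necessarily settles on.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns with gaps, one text column
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 35.0, 52.0, 29.0],
    "income": [40.0, 52.0, np.nan, 61.0, 80.0, 45.0],
    "city": ["Cairo", "Paris", "Cairo", "Lima", "Paris", "Lima"],
})
y = np.array([1.0, 2.0, 3.5, 2.5, 4.0, 1.5])  # made-up regression target

# Numeric columns: impute missing values, then rescale them
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Route each column to the right transformer; the text column is one-hot encoded.
# sparse_threshold=0.0 forces a dense array so PCA can consume it directly.
columns = ColumnTransformer(
    [
        ("num", numeric, ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0.0,
)

preprocess = Pipeline([
    ("columns", columns),                                   # 2 numeric + 3 one-hot = 5 features
    ("select", SelectKBest(score_func=f_regression, k=3)),  # keep the 3 highest-scoring features
    ("project", PCA(n_components=2)),                       # project those onto 2 new dimensions
])

X_new = preprocess.fit_transform(df, y)
print(X_new.shape)
```

Chaining the steps in a Pipeline means every transformer is fitted on the training data only and then reused unchanged on new data, which helps avoid leaking information from a test set into the preprocessing.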

Imputing missing values

"It is a capital mistake ...
