Chapter 13

Preprocessing Data

IN THIS CHAPTER

Deciding whether to get more data or start preparing it

Fixing bad data like wrong or missing values

Creating new meaningful features

Compressing and reconstructing any redundant information

Understanding why you should beware of outliers

When building a new house, before thinking of any beautiful architecture, aesthetic addition, or even furniture designed to beautify it, you need to build a solid foundation over which to construct walls. In addition, the more difficult the terrain you have to work on, the more time and effort it will take. If you neglect to create a sturdy foundation, nothing built on it can withstand time and nature for long.

The same issue exists in machine learning. No matter how sophisticated the learning algorithm, if you don’t prepare your foundation (that is, your data) well, your algorithm won’t last long when tested on real-world data. You can’t prepare data by just glancing at it; you must expend the effort to examine it closely. Unfortunately, cleaning data can take around 80 percent of the total time you devote to a machine learning project.

Preparing data consists of several steps (as detailed in the sections that follow):

  1. Obtain meaningful data (also called ground truth), which is data that someone has correctly measured or labeled.
  2. Acquire enough data for the learner algorithm to work correctly. You can’t tell in advance how much data you’ll need because it all depends on ...
