Chapter 13
Preprocessing Data
IN THIS CHAPTER
- Deciding whether to get more data or start preparations
- Fixing bad data, such as wrong or missing values
- Creating new meaningful features
- Compressing and reconstructing any redundant information
- Understanding why you should beware of outliers
When building a new house, before thinking of any beautiful architecture, aesthetic addition, or even furniture designed to beautify it, you need to build a solid foundation over which to construct walls. In addition, the more difficult the terrain you have to work on, the more time and effort it will take. If you neglect to create a sturdy foundation, nothing built on it can withstand time and nature for long.
The same issue exists in machine learning. No matter how sophisticated the learning algorithm, if you don’t prepare your foundation well (that is, your data), your algorithm won’t last long when tested on real-world data. You can’t prepare data by just looking at it; you must expend the effort to examine it closely. Unfortunately, cleaning data can take around 80 percent of the total time you devote to a machine learning project.
Preparing data consists of several steps (as detailed in the sections that follow):
- Obtain meaningful data (also called ground truth), which is data that someone has correctly measured or labeled.
- Acquire enough data for the learner algorithm to work correctly. You can’t tell in advance how much data you’ll need because it all depends on ...