IN THIS CHAPTER
Deciding whether to get more data or start preparing it
Fixing bad data, such as wrong or missing values
Creating meaningful new features
Compressing and reconstructing any redundant information
Understanding why you should beware of outliers
When building a new house, before thinking of any beautiful architecture, aesthetic addition, or even furniture designed to beautify it, you need to build a solid foundation on which to construct walls. In addition, the more difficult the terrain you have to work on, the more time and effort the foundation will take. If you neglect to create a sturdy foundation, nothing built on it can withstand time and nature for long.
The same principle applies to machine learning. No matter how sophisticated the learning algorithm, if you don’t prepare your foundation well (that is, your data), your algorithm won’t hold up when tested on real-world data. You can’t prepare data by just glancing at it; you must expend the effort to examine it closely. Unfortunately, cleaning data can consume around 80 percent of the total time you devote to a machine learning project.
Preparing data consists of several steps (as detailed in the sections that follow):