Chapter 13
Preprocessing Data
IN THIS CHAPTER
- Deciding to get more data or start preparations
- Fixing bad data like wrong or missing values
- Creating new meaningful features
- Compressing and reconstructing any redundant information
- Understanding why you should beware of outliers
When building a new house, before thinking of any beautiful architecture, aesthetic addition, or even furniture designed to beautify it, you need to build a solid foundation over which to construct walls. In addition, the more difficult the terrain you have to work on, the more time and effort it will take. If you neglect to create a sturdy foundation, nothing built on it can withstand time and nature for long.
The same holds in machine learning. No matter how sophisticated the learning algorithm, if you don’t prepare your foundation well (that is, your data), your algorithm won’t last long when tested on real-world data. You can’t prepare data by just glancing at it; you must expend the effort to examine it closely. Unfortunately, cleaning data can take around 80 percent of the total time you devote to a machine learning project.
Preparing data consists of several steps (as detailed in the sections that follow):
- Obtain meaningful data (also called ground truth), which is data that someone has correctly measured or labeled.
- Acquire enough data for the learner algorithm to work correctly. You can’t tell in advance how much data you’ll need because it all depends on ...