November 2015
Beginner to intermediate
31 pages
56m
English
If you torture the data long enough, it will confess.
Ronald Coase, Economist
As every data scientist knows, about 70%–80% of the time in data science projects is spent on what is commonly known as data munging, a popular term that refers to two main activities:
Identifying and remediating data quality problems
Transforming the raw data into what is known as a feature matrix, a task commonly referred to as feature generation or feature engineering
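To make the two activities concrete, here is a minimal sketch in plain Python. The records, field names, and the helper functions `clean` and `to_feature_matrix` are hypothetical and exist only for illustration. Median imputation and one-hot encoding are assumed here as simple stand-ins for whatever techniques a real project would choose.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
raw = [
    {"user": "a", "age": 34, "country": "US"},
    {"user": "b", "age": None, "country": "us "},
    {"user": "c", "age": 51, "country": "DE"},
]

def clean(records):
    """Activity 1: identify and remediate data quality problems."""
    known_ages = [r["age"] for r in records if r["age"] is not None]
    fill = median(known_ages)  # assumption: impute missing ages with the median
    return [
        {
            "user": r["user"],
            "age": r["age"] if r["age"] is not None else fill,
            # Normalize inconsistent country codes ("us " becomes "US").
            "country": r["country"].strip().upper(),
        }
        for r in records
    ]

def to_feature_matrix(records):
    """Activity 2: turn cleaned records into a numeric feature matrix."""
    countries = sorted({r["country"] for r in records})
    header = ["age"] + [f"country={c}" for c in countries]
    rows = [
        # One numeric column for age, then one 0/1 column per country.
        [float(r["age"])] + [1.0 if r["country"] == c else 0.0 for c in countries]
        for r in records
    ]
    return header, rows

header, matrix = to_feature_matrix(clean(raw))
```

Each row of `matrix` corresponds to one record and each column to one feature named in `header`, which is the shape most modeling tools expect as input.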
This eBook, which is part of our upcoming book, Data Science with Hadoop