Data Munging with Hadoop
If you torture the data long enough, it will confess.
Ronald Coase, Economist
As every data scientist knows, about 70%–80% of the time in a data science project goes to what is commonly known as data munging, a popular term that covers two main activities:
Identifying and remediating data quality problems
Transforming the raw data into what is known as a feature matrix, a task commonly referred to as feature generation or feature engineering
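To make the two activities concrete, here is a minimal sketch in plain Python, run in memory on a handful of made-up records rather than on Hadoop. The record fields (`age`, `country`), the median-fill rule, and the helper names `remediate` and `to_feature_matrix` are all assumptions chosen for illustration; they are not taken from the book.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
raw_records = [
    {"age": 34, "country": "US"},
    {"age": None, "country": "DE"},
    {"age": 52, "country": "US"},
]

def remediate(records):
    """Data quality step: fill missing ages with the median observed age."""
    observed = [r["age"] for r in records if r["age"] is not None]
    fill = median(observed)
    return [{**r, "age": fill if r["age"] is None else r["age"]} for r in records]

def to_feature_matrix(records):
    """Feature generation step: numeric age plus one-hot country columns."""
    countries = sorted({r["country"] for r in records})
    columns = ["age"] + [f"country={c}" for c in countries]
    rows = [
        [float(r["age"])] + [1.0 if r["country"] == c else 0.0 for c in countries]
        for r in records
    ]
    return columns, rows

columns, matrix = to_feature_matrix(remediate(raw_records))
for row in matrix:
    print(dict(zip(columns, row)))
```

Each row of `matrix` is one record expressed purely as numbers, which is the form most modeling libraries expect. On real data, the same two steps would typically run as distributed jobs rather than list comprehensions.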
This eBook, which is part of our upcoming book, Data Science with Hadoop