Data preparation

Data quality has always been a pervasive problem in the industry. Incorrect or inconsistent data can make the results of your analysis misleading. Implementing a better algorithm or building a better model will not help much if the data has not been cleansed and prepared to suit the requirement. The industry term for data sourcing and preparation is data engineering. This work is typically done by data scientists, although a few organizations have a dedicated team for it. However, preparing data well often calls for a scientific perspective. For example, rather than simply applying mean substitution to treat missing values, you may need to look into the data distribution to find a more appropriate ...
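To make the point about mean substitution concrete, here is a minimal sketch in plain Python (not Spark). The `impute` helper and the `incomes` sample are hypothetical and made up for illustration. It only shows how a single extreme value can make the mean a poor fill value for a skewed column, while the median stays close to the typical observation.

```python
from statistics import mean, median


def impute(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values.

    Hypothetical helper for illustration only; not part of any library.
    """
    if strategy not in ("mean", "median"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


# Made-up, right-skewed sample: one large value pulls the mean upward.
incomes = [30, 32, 35, None, 31, 400, None, 33]

mean_filled = impute(incomes, "mean")      # gaps filled with 93.5
median_filled = impute(incomes, "median")  # gaps filled with 32.5
```

On this sample the mean-based fill is nearly three times larger than most observed values, which is one reason to inspect the distribution before choosing how to treat missing data.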
