O'Reilly logo

Spark for Data Science by Bikramaditya Singhal, Srinivas Duvvuri

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Data preparation

Data quality has always been a pervasive problem in the industry. The presence of incorrect or inconsistent data can produce misleading results of your analysis. Implementing better algorithm or building better models will not help much if the data is not cleansed and prepared well, as per the requirement. There is an industry jargon called data engineering that refers to data sourcing and preparation. This is typically done by data scientists and in a few organizations, there is a dedicated team for this purpose. However, while preparing data, a scientific perspective is often needed to do it right. As an example, you may not just do mean substitution to treat missing values and look into data distribution to find more appropriate ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required