Data acquisition and data cleansing
Data acquisition is the logical next step. It may be as simple as selecting data from a single spreadsheet, or it may be an elaborate project that takes several months in itself. A data scientist has to collect as much relevant data as possible. 'Relevant' is the keyword here. Remember, more relevant data beats clever algorithms.
We have already covered how to source data from heterogeneous data sources and consolidate it into a single data matrix, so we will not reiterate those fundamentals here. Instead, we source our data from a single source and extract a subset of it.
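As a rough illustration of sourcing from a single source and extracting a subset, here is a minimal sketch in plain Python using only the standard library. It is not one of the chapter's Spark scripts: the sample data, the column names, and the helper `extract_subset` are all hypothetical and exist only to show the idea of keeping selected columns and rows.

```python
import csv
import io

# Hypothetical sample standing in for the single data source (assumption).
RAW = """id,city,temp_c,humidity
1,Pune,31.5,40
2,Oslo,,81
3,Lima,19.0,77
4,Pune,29.0,45
"""


def extract_subset(text, columns, predicate):
    """Return rows as dicts, restricted to `columns`, that satisfy `predicate`."""
    reader = csv.DictReader(io.StringIO(text))
    return [{c: row[c] for c in columns} for row in reader if predicate(row)]


# Keep three of the four columns, and only the rows for one city.
subset = extract_subset(
    RAW,
    columns=["id", "city", "temp_c"],
    predicate=lambda row: row["city"] == "Pune",
)

for row in subset:
    print(row)
```

Values are left as strings, and the empty `temp_c` field in the sample is kept on purpose: gaps like that are the kind of issue that viewing and cleansing the data would need to address.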
Now it is time to view the data and start cleansing it. The scripts presented in this chapter tend to be longer than the previous examples but still ...