February 2017
Intermediate to advanced
274 pages
5h 58m
English
Building a statistical model in an informed way requires intimate knowledge of the dataset. It is possible to build a successful model without knowing the data, but the task becomes much more arduous, or it requires more technical resources to test all the possible combinations of features. Therefore, after spending the required 80% of the time cleaning the data, we spend the next 15% getting to know it!
I normally start with descriptive statistics. Although DataFrames expose the .describe() method, we are working with MLlib, so we will use the .colStats(...) method instead.
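As a minimal sketch of what such a call might look like, the snippet below wraps MLlib's `Statistics.colStats(...)` in a small helper. The helper names `describe_columns` and `demo`, the local session settings, and the three rows in `SAMPLE_ROWS` are assumptions made up for illustration; they are not taken from the text or its dataset.

```python
# Hypothetical sample: three observations with two numeric features each.
SAMPLE_ROWS = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]


def describe_columns(sc, rows):
    """Return per-column count, mean, and standard deviation via colStats."""
    # Imported lazily so this file can be loaded where Spark is not installed.
    import numpy as np
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.stat import Statistics

    # colStats expects an RDD of vectors, one vector per observation.
    rdd = sc.parallelize([Vectors.dense(row) for row in rows])
    summary = Statistics.colStats(rdd)

    return {
        "count": summary.count(),
        "mean": summary.mean().tolist(),
        # colStats reports the variance; take the square root for std. dev.
        "stdev": np.sqrt(summary.variance()).tolist(),
    }


def demo():
    """Run the helper against a throwaway local Spark session."""
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[1]")
             .appName("colstats-sketch")
             .getOrCreate())
    try:
        return describe_columns(spark.sparkContext, SAMPLE_ROWS)
    finally:
        # Stops the session; skip this if you are reusing an existing one.
        spark.stop()
```

Calling `demo()` on a machine with Spark installed returns a dictionary of column-wise summaries. The summary object also exposes `.min()`, `.max()`, and `.numNonzeros()` if those are more useful for your data.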
A word of warning: the .colStats(...) method calculates the descriptive statistics based on a sample. For ...