Summary and descriptive statistics
In this recipe, we will see how to compute summary statistics for data at scale in Spark. Descriptive summary statistics help in understanding the distribution of the data.
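To make the idea concrete before working at scale, here is a minimal sketch of the quantities a descriptive summary usually reports (count, mean, standard deviation, minimum, and maximum). It uses plain Python on a tiny in-memory list rather than Spark. The `loan_amounts` values and the `summarize` helper are hypothetical, introduced only for illustration, and are not taken from the recipe's dataset.

```python
import statistics

# Hypothetical loan amounts (illustrative values only, not from the real dataset).
loan_amounts = [120.0, 66.0, 141.0, 128.0, 95.0]


def summarize(values):
    """Return the count, mean, sample standard deviation, min and max of values."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample (n - 1) standard deviation
        "min": min(values),
        "max": max(values),
    }


summary = summarize(loan_amounts)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Read together, these figures give a quick sense of where a column is centered and how widely it is spread; a large gap between the mean and the extremes, for instance, hints at skew or outliers worth inspecting.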
To step through this recipe, you need a machine running Ubuntu 14.04 (a Linux flavor) with Apache Hadoop 2.6 and Apache Spark 1.6.0 installed.
How to do it…
Let's take an example of loan prediction data. Here is what the sample data looks like:
Download the data from the following location: https://github.com/ChitturiPadma/datasets/blob/master/Loan_Prediction_Data.csv.
- The preceding data contains numerical as well as ...