In this recipe, we will see how to compute summary statistics for data at scale in Spark. Descriptive summary statistics help in understanding the distribution of the data.
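To make the idea concrete before moving to Spark, here is a minimal sketch in plain Python of the statistics such a summary typically contains: count, mean, standard deviation, minimum, and maximum. The function name `summarize` and the sample values are hypothetical and are not taken from the loan dataset. The sketch assumes the sample standard deviation (dividing by n - 1). It illustrates the computation on a small in-memory list and is not how Spark performs it at scale.

```python
import math


def summarize(values):
    """Return count, mean, sample standard deviation, min and max of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: sample variance (divide by n - 1); it is undefined for a single value.
    if n > 1:
        variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    else:
        variance = float("nan")
    return {
        "count": n,
        "mean": mean,
        "stddev": math.sqrt(variance),
        "min": min(values),
        "max": max(values),
    }


# Hypothetical values for illustration only.
amounts = [100.0, 120.0, 140.0, 160.0, 180.0]
summary = summarize(amounts)
for name, value in summary.items():
    print(name, value)
```

Reading the five numbers together gives a quick picture of a column: where its values are centered, how widely they spread, and the range they cover.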
To step through this recipe, you need a machine running Ubuntu 14.04 (a Linux flavor) with Apache Hadoop 2.6 and Apache Spark 1.6.0 installed.
Let's take loan prediction data as an example. Here is what the sample data looks like:
Download the data from the following location: https://github.com/ChitturiPadma/datasets/blob/master/Loan_Prediction_Data.csv.