Summary and descriptive statistics

In this recipe, we will see how to get summary statistics for data at scale in Spark. Descriptive summary statistics help in understanding the distribution of the data.
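As a rough illustration of what such a summary contains, the following sketch computes the count, mean, standard deviation, minimum, and maximum of one numeric column in plain Python, without Spark. These are the same five figures that Spark's `DataFrame.describe` reports. The helper name `describe_column` and the income values are hypothetical and invented for this example. The choice of the sample standard deviation (dividing by n - 1) is also an assumption of the sketch.

```python
import math


def describe_column(values):
    """Return count, mean, sample standard deviation, min and max of a numeric column."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    mean = sum(values) / n
    if n > 1:
        # Sample standard deviation: divide the squared deviations by n - 1.
        variance = sum((v - mean) ** 2 for v in values) / (n - 1)
        stddev = math.sqrt(variance)
    else:
        # A single value has no spread to estimate.
        stddev = float("nan")
    return {
        "count": n,
        "mean": mean,
        "stddev": stddev,
        "min": min(values),
        "max": max(values),
    }


# Hypothetical applicant incomes, invented for illustration only.
incomes = [2000, 3000, 4000, 5000, 6000]
summary = describe_column(incomes)
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

Comparing the mean with the minimum and maximum, and looking at the size of the standard deviation, gives a first impression of how spread out or skewed a column is before any modeling is done.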

Getting ready

To step through this recipe, you need Ubuntu 14.04 (Linux flavor) installed on the machine. Also, have Apache Hadoop 2.6 and Apache Spark 1.6.0 installed.

How to do it…

Let's take an example of loan prediction data. Here is what the sample data looks like:

[Figure: sample data]

Note

Download the data from the following location: https://github.com/ChitturiPadma/datasets/blob/master/Loan_Prediction_Data.csv.

  1. The preceding data contains numerical as well as ...
