July 2017
Beginner to intermediate
418 pages
9h 46m
English
In this section, we will explore the dataset and run some simple but useful analytics on it.
First, we will create the boilerplate code for Spark configuration and the Spark session:
SparkConf conf = ...
SparkSession session = ...
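The boilerplate above is elided, so the following is only a minimal sketch of what such a setup could look like, not the book's exact configuration. The class name `RetailSessionSketch`, the application name `RetailAnalytics`, and the `local[*]` master URL are all assumptions chosen for illustration; on a cluster the master would normally be supplied by `spark-submit` instead.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class RetailSessionSketch {

    // Builds a session that runs Spark inside the current JVM.
    // Both the application name and the master URL are illustrative assumptions.
    static SparkSession createSession() {
        SparkConf conf = new SparkConf()
                .setAppName("RetailAnalytics")  // hypothetical application name
                .setMaster("local[*]");         // local mode, using all available cores
        return SparkSession.builder()
                .config(conf)
                .getOrCreate();                 // reuses an existing session if one is active
    }

    public static void main(String[] args) {
        SparkSession session = createSession();
        System.out.println("Spark version --> " + session.version());
        session.stop();
    }
}
```

Keeping the session creation in one helper method means the rest of the analysis code only needs a `SparkSession` reference and does not care how it was configured.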
Next, we will load the dataset and find the number of rows in it:
Dataset<Row> rawData = session.read().csv("data/retail/Online_Retail.csv");
System.out.println("Number of rows --> " + rawData.count());
This will print the number of rows in the dataset as:
Number of rows --> 541909
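One detail worth keeping in mind when counting rows: by default, Spark's CSV reader treats the first line of a file as data, so a header line, if the file has one, is included in the count unless the `header` option is set. The sketch below illustrates the difference. The class name `RetailCountSketch`, the helper `countRows`, and the local-mode session settings are assumptions made for this example, not code from the book.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RetailCountSketch {

    // Reads a CSV file and returns its row count.
    // When hasHeader is true, the first line supplies column names and is not counted.
    static long countRows(SparkSession session, String path, boolean hasHeader) {
        Dataset<Row> data = session.read()
                .option("header", hasHeader)
                .csv(path);
        return data.count();
    }

    public static void main(String[] args) {
        // Local-mode session; the application name and master URL are assumptions.
        SparkSession session = SparkSession.builder()
                .appName("RetailCount")
                .master("local[*]")
                .getOrCreate();

        String path = "data/retail/Online_Retail.csv";
        System.out.println("Rows, first line read as data --> " + countRows(session, path, false));
        System.out.println("Rows, first line read as header --> " + countRows(session, path, true));
        session.stop();
    }
}
```

If the file starts with a header line, the two counts differ by exactly one, which is a quick way to check how a file is being interpreted.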
As you can see, this is not a very small dataset, but it is not big data either; big data can run into terabytes. Now that we have seen the number of rows, let's look at the first few rows:
rawData.show();
This will print the result as:
As you can see, this dataset is ...