Data exploration
In this section, we will explore the dataset and run some simple but useful analytics on it.
First, we will create the boilerplate code for Spark configuration and the Spark session:
SparkConf conf = ...
SparkSession session = ...
Next, we will load the dataset and find the number of rows in it:
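// Load the CSV file into a Dataset of rows
Dataset<Row> rawData = session.read().csv("data/retail/Online_Retail.csv");
// Count the rows and print the total
System.out.println("Number of rows --> " + rawData.count());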
This will print the number of rows in the dataset as:
Number of rows --> 541909
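When reading a row count like this one, it helps to know whether the header line was counted as data. By default, Spark's CSV reader treats the first line as an ordinary row unless the header option is set to true. The following is a minimal plain-Java sketch of that difference and does not use Spark at all. The class name CsvRowCount, the helper countRows, and the three sample lines are hypothetical illustrations, not taken from the book or from the dataset.

```java
import java.util.Arrays;
import java.util.List;

public class CsvRowCount {

    // Counts the non-empty lines of a CSV text.
    // When hasHeader is true, the first line is treated as column names
    // and is not counted as a data row.
    public static long countRows(List<String> lines, boolean hasHeader) {
        long nonEmpty = lines.stream()
                .filter(line -> !line.trim().isEmpty())
                .count();
        return (hasHeader && nonEmpty > 0) ? nonEmpty - 1 : nonEmpty;
    }

    public static void main(String[] args) {
        // Made-up sample: one header line followed by two data lines.
        List<String> sample = Arrays.asList(
                "InvoiceNo,Quantity",
                "A1,2",
                "A2,5");

        // Header counted as data, as a reader without a header setting would do.
        System.out.println("Number of rows --> " + countRows(sample, false));
        // Header skipped.
        System.out.println("Number of rows --> " + countRows(sample, true));
    }
}
```

For the made-up sample, the two counts differ by exactly one, which is the size of the discrepancy to expect if a header line is read as data.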
As you can see, this is not a small dataset, but it is not big data either; big data can run into terabytes. Now that we know the number of rows, let's look at the first few of them. By default, show() prints the first 20 rows.
rawData.show();
This will print the result as:
As you can see, this dataset is ...