In this subsection, we will perform some exploratory data analysis (EDA) on the dataset before we start preprocessing and feature engineering; only then does building an analytics pipeline make sense. First, let's import the necessary packages and libraries as follows:
import org.apache.spark._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.Dataset
Then, let's specify the data source and schema for the dataset to be processed. When loading the data into a DataFrame, we can supply the schema explicitly. This avoids the extra pass over the data that schema inference requires, giving better performance than relying on the inference mechanism introduced before Spark 2.x.
First, let's create a Scala case class with all the fields ...
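The steps above can be sketched as follows. This is a minimal, hypothetical example: the case class `CustomerRecord`, its fields, and the file path `data/customers.csv` are placeholders and not the actual schema used in this chapter. It shows the general pattern of pairing an explicit `StructType` with a case class so the loaded DataFrame can be converted to a typed `Dataset`:

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.types._

// Hypothetical record type; the real field names and types
// depend on the dataset used in this chapter.
case class CustomerRecord(id: Int, age: Double, income: Double, label: Int)

object SchemaLoadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("EDA")
      .master("local[*]")
      .getOrCreate()

    // Explicit schema: skips the extra scan of the data that
    // schema inference would otherwise perform.
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("age", DoubleType, nullable = true),
      StructField("income", DoubleType, nullable = true),
      StructField("label", IntegerType, nullable = false)))

    import spark.implicits._
    val ds: Dataset[CustomerRecord] = spark.read
      .option("header", "true")
      .schema(schema)
      .csv("data/customers.csv") // hypothetical path
      .as[CustomerRecord]

    ds.describe().show() // quick EDA: count, mean, stddev, min, max
    spark.stop()
  }
}

Keeping the `StructField` names and types aligned with the case class fields is what makes the `.as[CustomerRecord]` conversion succeed; a mismatch surfaces as an analysis error at that call.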