Exploratory analysis and feature engineering

In this sub-section, we will see some EDA of the dataset before we start preprocessing and feature engineering. Only then creation of an analytics pipeline makes sense. At first, let's import necessary packages and libraries as follows:

import org.apache.spark._import org.apache.spark.sql.functions._import org.apache.spark.sql.types._import org.apache.spark.sql._import org.apache.spark.sql.Dataset

Then, let's specify the data source and schema for the dataset to be processed. When loading the data into a DataFrame, we can specify the schema. This specification provides optimized performance compared to the pre-Spark 2.x schema inference.

At first, let's create a Scala case class with all the fields ...

Get Scala Machine Learning Projects now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.