This step matters because the model we are about to create needs a training set. One way to create one is to randomly partition the current dataframe, assigning roughly 80% of its rows to a new training dataset and the remaining 20% to a testing dataset:
val splitFeaturizedDF = featurizedDF.randomSplit(Array(0.80, 0.20), 98765L)
splitFeaturizedDF: Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] = Array([filteredMailFeatures: string, label: double ... 2 more fields], [filteredMailFeatures: string, label: double ... 2 more fields])
Now, let's retrieve the training set:
val trainFeaturizedDF = splitFeaturizedDF(0)
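To see how randomSplit behaves, here is a minimal, self-contained sketch. It does not use the chapter's featurizedDF; the object name RandomSplitSketch, the helper splitSizes, and the column names text and label are hypothetical stand-ins chosen for illustration. It assumes Spark SQL is on the classpath and runs against a local session. Note that the weights are sampling probabilities, so the split sizes are only approximately 80/20:

```scala
import org.apache.spark.sql.SparkSession

object RandomSplitSketch {

  // Builds a small demo dataframe (a hypothetical stand-in for featurizedDF),
  // splits it 80/20 with a fixed seed, and returns the row count of each part.
  def splitSizes(spark: SparkSession, rows: Int): Array[Long] = {
    import spark.implicits._

    val demoDF = (1 to rows)
      .map(i => (s"mail-$i", (i % 2).toDouble))
      .toDF("text", "label")

    // Same weights and seed as in the text; parts come back in weight order.
    val parts = demoDF.randomSplit(Array(0.80, 0.20), 98765L)
    parts.map(_.count())
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the app name is arbitrary.
    val spark = SparkSession.builder()
      .appName("RandomSplitSketch")
      .master("local[*]")
      .getOrCreate()

    val sizes = splitSizes(spark, 1000)
    // The two parts are disjoint and together cover every row.
    println(s"split sizes: ${sizes.mkString(", ")} (total ${sizes.sum})")

    spark.stop()
  }
}
```

Fixing the seed, as the text does with 98765L, keeps the split repeatable across runs over the same data and partitioning, which makes later evaluation numbers comparable.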
Next comes the testing dataset. Here is how we create it:
val testFeaturizedDF = splitFeaturizedDF( ...