July 2018
Intermediate to advanced
334 pages
8h 20m
English
We will split our DataFrame in two: a training set holding 75% of the rows, which is used to train (fit) the model, and a test set holding the remaining 25%, which will be put to use for testing:
val splitDataSet: Array[org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] = indexedDataFrame.randomSplit(Array(0.75, 0.25), 98765L)
// create two vals to hold TrainingData and TestingData respectively
val trainDataFrame = splitDataSet(0)
val testDataFrame = splitDataSet(1)
To verify that our split went well, run the count method on both the trainDataFrame and testDataFrame dataframes; we leave this as an exercise for the reader. Next, we will move on to creating a LogisticRegression classifier model ...
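For readers who want a starting point for the count check, here is a minimal sketch that can be pasted into spark-shell. It assumes a local SparkSession and uses a hypothetical 1,000-row stand-in for indexedDataFrame (the real one is built earlier in the chapter); the names totalCount, trainCount, and testCount are ours, not the book's.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("SplitCheck").getOrCreate()

// Hypothetical stand-in data; substitute the chapter's indexedDataFrame here.
val indexedDataFrame = spark.range(0, 1000).toDF("id")

// Same weights and seed as in the text.
val splitDataSet = indexedDataFrame.randomSplit(Array(0.75, 0.25), 98765L)
val trainDataFrame = splitDataSet(0)
val testDataFrame = splitDataSet(1)

val totalCount = indexedDataFrame.count()
val trainCount = trainDataFrame.count()
val testCount = testDataFrame.count()

// randomSplit assigns each row by a random draw, so the counts are only
// approximately 75% and 25% of the total, not exact.
println(s"train: $trainCount, test: $testCount, total: $totalCount")
println(f"train fraction: ${trainCount.toDouble / totalCount}%.3f")
```

The two counts should add up to the total, and the training fraction should land close to 0.75. Because the split is random, expect small deviations from the nominal weights; the fixed seed makes the result repeatable for the same input data.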