As in the previous chapter, we need to prepare the training and validation data. In this case, we'll reuse the Spark API to split the data:
val trainValidSplits = inputData.randomSplit(Array(0.8, 0.2))
val (trainData, validData) = (trainValidSplits(0), trainValidSplits(1))
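Note that `randomSplit` assigns rows by random sampling, so the 0.8/0.2 weights yield approximate rather than exact proportions, and the split changes from run to run unless a seed is supplied (Spark also offers an overload that takes a `seed: Long` as a second argument). The following plain Scala sketch mimics that behaviour on a local collection, without Spark. The names `SplitSketch` and `weightedSplit` and the seed value 42 are hypothetical, chosen only for illustration; this is not how Spark implements the split internally.

```scala
import scala.util.Random

object SplitSketch {
  // Hypothetical helper: sends each element to the first part with
  // probability `trainFraction`, otherwise to the second part.
  def weightedSplit[A](data: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val rng = new Random(seed)
    // Tag every element with one random draw, then separate by the tag
    val tagged = data.map(x => (x, rng.nextDouble() < trainFraction))
    val (train, valid) = tagged.partition(_._2)
    (train.map(_._1), valid.map(_._1))
  }

  def main(args: Array[String]): Unit = {
    val rows = 1 to 1000
    val (train, valid) = weightedSplit(rows, 0.8, seed = 42L)
    // Sizes are close to 800 and 200, but not exactly those values
    println(s"train=${train.size}, valid=${valid.size}")
  }
}
```

Fixing the seed makes the sketch repeatable, which is useful when comparing models trained on the same split.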
Now, let's perform a grid search using a simple decision tree and a few hyperparameters:
val gridSearch =
  for (
    hpImpurity <- Array("entropy", "gini");
    hpDepth <- Array(5, 20);
    hpBins <- Array(10, 50))
  yield {
    println(s"Building model with: impurity=${hpImpurity}, depth=${hpDepth}, bins=${hpBins}")
    val model = new DecisionTreeClassifier()
      .setFeaturesCol("reviewVector")
      .setLabelCol("label")
      .setImpurity(hpImpurity)
      .setMaxDepth(hpDepth)
      .setMaxBins(hpBins)
      ...
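The for comprehension above iterates over the Cartesian product of the three hyperparameter arrays, so it builds 2 × 2 × 2 = 8 models. The sketch below shows only that enumeration in plain Scala, with model training left out. `GridSketch` and the `HyperParams` case class are hypothetical names introduced for illustration.

```scala
object GridSketch {
  // Hypothetical container for one point of the hyperparameter grid
  case class HyperParams(impurity: String, depth: Int, bins: Int)

  // Same generators as the grid search, yielding only the combinations
  val grid: Array[HyperParams] =
    for (
      hpImpurity <- Array("entropy", "gini");
      hpDepth <- Array(5, 20);
      hpBins <- Array(10, 50))
    yield HyperParams(hpImpurity, hpDepth, hpBins)

  def main(args: Array[String]): Unit = {
    println(s"Number of combinations: ${grid.length}")
    grid.foreach(println)
  }
}
```

Because the generators nest from left to right, the last one (`hpBins`) varies fastest. Each value added to any array multiplies the number of models to train, so the grid is best kept small.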