In the previous section, we explored the data and unified it into a form without missing values. We still need to transform the data into the form expected by Spark MLlib. As explained in the previous chapter, this involves creating an RDD of LabeledPoints. Each LabeledPoint is defined by a label and a vector of input features. The label serves as the training target for model builders, and it holds the index of the categorical value rather than the raw value itself (see the previously prepared transformation activityId2Idx):
import org.apache.spark.mllib
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils
...
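To make the shape of a LabeledPoint concrete, the following is a minimal sketch, not the chapter's actual transformation. It assumes that the first element of each row is the raw activity ID and that the remaining elements are features. The contents of activityId2Idx and the sample rows are hypothetical values chosen for illustration, and a local Seq stands in for the RDD so the snippet can be pasted into spark-shell without any input file:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical mapping from raw activity ID to a zero-based class index.
// In the chapter, activityId2Idx is derived from the data; these entries are made up.
val activityId2Idx: Map[Double, Double] = Map(1.0 -> 0.0, 2.0 -> 1.0, 4.0 -> 2.0)

// Hypothetical rows: element 0 is the raw activity ID, the rest are input features.
val rows: Seq[Array[Double]] = Seq(
  Array(1.0, 0.5, 36.6),
  Array(4.0, 1.2, 37.1)
)

// Translate the raw ID into its class index and wrap the features in a dense vector.
val points: Seq[LabeledPoint] = rows.map { row =>
  val label = activityId2Idx(row.head)
  val features = Vectors.dense(row.tail)
  LabeledPoint(label, features)
}

points.foreach(println)
```

The same function passed to map here could be applied to an RDD of rows instead of a Seq, which would yield the RDD of LabeledPoints that MLlib model builders expect.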