June 2017
The decision tree algorithm we will use comes from the standard Spark MLlib library. This implementation requires its input in labeled point format. A labeled point pairs a label (the target variable) with a vector of features, which tells the algorithm which variable to predict and which variables to predict it from.
For this example, the target variable (frisked2) is the first variable listed. We will call this first variable column 0, so the target is designated within the LabeledPoint call as line[0].
The independent variables, or features, are contained in the remaining columns and are specified as line[1:]. This is Python slice notation for column 1 and every column after it, through the last one.
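The split between line[0] and line[1:] can be sketched in plain Python. This is a minimal illustration, not the book's own code: the namedtuple below is a stand-in for the real LabeledPoint class so the example runs without a Spark installation, and the helper name to_labeled_point and the row values are hypothetical.

```python
from collections import namedtuple

# Stand-in for pyspark.mllib.regression.LabeledPoint so the sketch runs
# without Spark; the real class likewise takes a label and a feature vector.
LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])


def to_labeled_point(line):
    """Treat column 0 as the target and every remaining column as a feature."""
    return LabeledPoint(label=line[0], features=list(line[1:]))


# Hypothetical rows: the frisked2 value first, then three made-up features.
rows = [
    [1.0, 25.0, 0.0, 3.0],
    [0.0, 41.0, 1.0, 7.0],
]

points = [to_labeled_point(row) for row in rows]
for point in points:
    print(point.label, point.features)
```

With PySpark available, the same mapping would typically import LabeledPoint from pyspark.mllib.regression and apply it to each row of an RDD, for example with rdd.map(lambda line: LabeledPoint(line[0], line[1:])). Here rdd is an assumed name for the RDD of numeric rows.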
When referring to features, they are also numbered ...