Practical Predictive Analytics by Ralph Winters

Mapping to an RDD

The decision tree algorithm we will be using comes from the standard Spark MLlib library. This implementation requires the input to be in labeled point format. A labeled point pairs a target value (the label) with the feature values for one observation, which tells the algorithm which variable is the target and which ones are features.

For this example, the target variable (frisked2) is the first variable listed, which we will call column 0. It is therefore passed to the LabeledPoint call as line[0].

The independent variables, or features, are contained in the remaining columns and are specified as line[1:]. This slice selects column 1 and every column after it.
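The split between line[0] and line[1:] can be illustrated with a minimal sketch in plain Python. This is not the book's code: LabeledRow is a hypothetical stand-in for MLlib's LabeledPoint(label, features) so that the example runs without Spark, and the rows and their feature columns are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.mllib.regression.LabeledPoint(label, features),
# used here so the sketch runs without a Spark installation.
LabeledRow = namedtuple("LabeledRow", ["label", "features"])


def to_labeled_row(line):
    """Split one record into its label (column 0) and features (columns 1 onward)."""
    return LabeledRow(
        label=float(line[0]),                    # target variable, e.g. frisked2
        features=[float(x) for x in line[1:]],   # every remaining column
    )


# Made-up rows: the target in column 0, followed by three illustrative features.
rows = [
    [1, 0, 34, 1],
    [0, 1, 22, 0],
]

labeled = [to_labeled_row(line) for line in rows]
for point in labeled:
    print(point.label, point.features)
```

With an actual RDD, the same idea would most likely be written as rdd.map(lambda line: LabeledPoint(line[0], line[1:])), assuming each record is already a sequence of numeric values.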

When referring to features, they are also numbered ...
