Extracting the right features from your data

You might recall from Chapter 4, "Obtaining, Processing, and Preparing Data with Spark", that most machine learning models operate on numerical data in the form of feature vectors. In addition, supervised learning methods such as classification and regression need the target variable (or variables in the case of multiclass situations) together with the feature vector.

Classification models in MLlib operate on instances of LabeledPoint, which is a wrapper around the target variable (called label) and the feature vector:

case class LabeledPoint(label: Double, features: Vector) 
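
As a minimal sketch of how such instances might be built, the following creates two labeled points by hand, one with a dense feature vector and one with a sparse one. It assumes the spark-mllib library is on the classpath (for example, inside spark-shell). The names positive and negative and all feature values are made up for illustration and do not come from any dataset in this chapter.

```scala
// Assumes spark-mllib is on the classpath (e.g. when run inside spark-shell).
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical positive instance: label 1.0 with three dense features.
val positive = LabeledPoint(1.0, Vectors.dense(0.5, 0.0, 3.2))

// Hypothetical negative instance: label 0.0 with a sparse feature vector
// of size 3 whose non-zero values sit at indices 0 and 2.
val negative = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 4.0)))

// The label and the features are plain fields of the case class.
println(positive.label)
println(negative.features.toArray.mkString(", "))
```

The label is always a Double, even for classification, so class identifiers have to be encoded as numbers (0.0 and 1.0 in this binary sketch) before a LabeledPoint is constructed.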

While in most examples of using classification, you will come across existing datasets that ...
