Machine Learning with Spark - Second Edition by Nick Pentreath, Manpreet Singh Ghotra, Rajdeep Dua
Normalization

It is common practice to standardize input data prior to running dimensionality reduction models, particularly for PCA. As we did in Chapter 6, Building a Classification Model with Spark, we will do this using the built-in StandardScaler provided by MLlib's feature package. In this case, we will only subtract the mean from the data.

import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.feature.StandardScaler

val scaler = new StandardScaler(withMean = true, withStd = false)
  .fit(vectors)
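
The fitted scaler can then be applied to subtract the column means from each vector. A minimal sketch, assuming vectors is the RDD[Vector] of input features used to fit the scaler above:

// Subtract the column means from each vector using the fitted model;
// StandardScalerModel.transform also accepts a whole RDD[Vector]
val scaledVectors = vectors.map(v => scaler.transform(v))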
StandardScaler: it standardizes features by removing the mean and scaling to unit standard deviation, using column summary statistics computed on the samples in the training set.
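
With the means removed, the scaled data can be fed into PCA through MLlib's RowMatrix. A brief sketch, not from the original text, where scaledVectors is the mean-centered RDD from the previous snippet and k = 10 is an illustrative choice:

import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Build a distributed row matrix from the mean-centered vectors
val matrix = new RowMatrix(scaledVectors)
// Compute the top k principal components; k = 10 is an illustrative choice
val k = 10
val pc = matrix.computePrincipalComponents(k)
// pc is a local Matrix of size (number of features) x k
println(s"Principal components matrix: ${pc.numRows} x ${pc.numCols}")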
