Many of the models we employ make inherent assumptions about the distribution or scale of the input data. One of the most common is that features are normally distributed. Let's take a closer look at the distribution of our features.
To do this, we can represent the feature vectors as a distributed matrix in MLlib, using the RowMatrix class. A RowMatrix is backed by an RDD of vectors, where each vector is a row of our matrix.
The RowMatrix class comes with some useful methods to operate on the matrix, one of which is a utility to compute statistics on the columns of the matrix.
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val vectors = data.map(lp => lp.features)
val matrix = new RowMatrix(vectors)
...
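To make the column statistics utility concrete, here is a minimal, self-contained sketch. It assumes a local Spark context and uses a small hypothetical toy dataset (toyRows) in place of our real feature vectors; the object name ColumnStatsSketch and the helper summarize are illustrative names of our own, not part of MLlib. The MLlib call it relies on is RowMatrix.computeColumnSummaryStatistics, which returns per-column summaries such as the mean, variance, minimum, maximum, and number of nonzero entries.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary
import org.apache.spark.rdd.RDD

object ColumnStatsSketch {
  // Hypothetical toy rows standing in for data.map(lp => lp.features)
  val toyRows: Seq[Vector] = Seq(
    Vectors.dense(1.0, 10.0, 0.0),
    Vectors.dense(2.0, 20.0, 0.0),
    Vectors.dense(3.0, 30.0, 5.0)
  )

  // Wrap the RDD of vectors in a RowMatrix and compute per-column statistics
  def summarize(rows: RDD[Vector]): MultivariateStatisticalSummary = {
    val matrix = new RowMatrix(rows)
    matrix.computeColumnSummaryStatistics()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local two-thread master is fine for experimentation
    val conf = new SparkConf().setAppName("ColumnStatsSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val summary = summarize(sc.parallelize(toyRows))
      // Each statistic is a vector with one entry per column (feature)
      println(s"count:       ${summary.count}")
      println(s"mean:        ${summary.mean}")
      println(s"variance:    ${summary.variance}")
      println(s"min:         ${summary.min}")
      println(s"max:         ${summary.max}")
      println(s"numNonzeros: ${summary.numNonzeros}")
    } finally {
      sc.stop()
    }
  }
}
```

Comparing the mean, variance, and range across columns is a quick way to see whether features sit on very different scales, which is usually a hint that some form of scaling or transformation is worth considering before training.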