Using StringIndexer for indexing categorical features and labels

In this exercise, we will train a random forest classifier. First, we will index the categorical features and labels, because spark.ml classifiers work with numeric indices rather than raw strings. Next, we will assemble the feature columns into a single vector column, which is the input format that spark.ml classifiers expect. Finally, we will train our random forest on a training Dataset. Optionally, we can also unindex the labels, converting them back to their original strings, to make them more readable.
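To make the indexing, assembling, and unindexing steps concrete, here is a minimal plain-Python sketch of what they do conceptually. It is not the spark.ml API: the helper names (fit_string_index, assemble, unindex) and the toy rows are hypothetical, invented for illustration. The sketch assumes indices are assigned by descending frequency, which is StringIndexer's default ordering; breaking ties alphabetically is an assumption of this sketch. Training the random forest itself is left out.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to a float index, most frequent value first.

    Mirrors the idea behind StringIndexer's default frequency ordering.
    Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


def assemble(row, columns):
    """Collect the named numeric columns into one feature list.

    Mirrors the idea behind VectorAssembler: many columns in, one vector out.
    """
    return [float(row[c]) for c in columns]


# Hypothetical toy data: one categorical feature, one numeric feature, a string label.
rows = [
    {"color": "red", "size": 2.0, "label": "yes"},
    {"color": "blue", "size": 1.0, "label": "no"},
    {"color": "red", "size": 3.0, "label": "yes"},
    {"color": "green", "size": 1.5, "label": "yes"},
    {"color": "blue", "size": 0.5, "label": "no"},
    {"color": "red", "size": 2.5, "label": "yes"},
]

# Step 1: index the categorical feature and the label.
color_index = fit_string_index(r["color"] for r in rows)
label_index = fit_string_index(r["label"] for r in rows)

indexed = [
    {
        "color_idx": color_index[r["color"]],
        "size": r["size"],
        "label_idx": label_index[r["label"]],
    }
    for r in rows
]

# Step 2: assemble the feature columns into a single vector per row.
features = [assemble(r, ["color_idx", "size"]) for r in indexed]
labels = [r["label_idx"] for r in indexed]

# Step 3 (optional): unindex numeric predictions back to the original strings,
# which is the idea behind IndexToString.
label_names = sorted(label_index, key=label_index.get)


def unindex(predictions):
    """Translate numeric label indices back to their original strings."""
    return [label_names[int(p)] for p in predictions]
```

A classifier trained on features and labels would return numeric predictions such as 0.0 or 1.0; passing them through unindex restores the readable labels. In spark.ml, the corresponding Transformers operate on Dataset columns rather than Python lists.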

There are several ready-to-use Transformers available to index categorical features. We can assemble all the features into one vector (using VectorAssembler) and then use a VectorIndexer to index it. The drawback of VectorIndexer is that it will index every feature that has ...
