Using StringIndexer for indexing categorical features and labels

In this exercise, we will train a random forest classifier. First, we index the categorical features and labels, as spark.ml requires. Next, we assemble the feature columns into a single vector column, because spark.ml algorithms expect their input features in that form. Finally, we train the random forest on a training Dataset. Optionally, we can also unindex the labels to make them more readable.
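To make the indexing and unindexing steps concrete, here is a minimal plain-Python sketch of the kind of mapping that StringIndexer computes by default (the most frequent string gets index 0.0) and of the reverse lookup that IndexToString performs. It does not use Spark. The function names and sample data are hypothetical, and the alphabetical tie-break is an assumption made only to keep the sketch deterministic.

```python
from collections import Counter


def fit_string_index(values):
    """Return the distinct labels ordered by descending frequency.

    A label's position in the returned list is its index, mimicking the
    default frequency-descending ordering of spark.ml's StringIndexer.
    """
    counts = Counter(values)
    # Sort by count (descending); break ties alphabetically (an assumption
    # made here so the result is deterministic).
    return sorted(counts, key=lambda label: (-counts[label], label))


def index_values(values, labels):
    """Map each string to its index, stored as a float like a Spark double."""
    lookup = {label: float(i) for i, label in enumerate(labels)}
    return [lookup[v] for v in values]


def unindex_values(indices, labels):
    """Map indices back to the original strings (the 'unindex' step)."""
    return [labels[int(i)] for i in indices]


# Hypothetical label column: "no" occurs 3 times, "yes" 2 times, "maybe" once.
raw = ["yes", "no", "no", "maybe", "no", "yes"]

labels = fit_string_index(raw)           # ["no", "yes", "maybe"]
indexed = index_values(raw, labels)      # [1.0, 0.0, 0.0, 2.0, 0.0, 1.0]
restored = unindex_values(indexed, labels)

print(labels)
print(indexed)
print(restored == raw)                   # True
```

Keeping the fitted label list around is what makes the optional unindexing step possible: the same ordering that assigned the indices is used to translate them back into readable strings.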

There are several ready-to-use Transformers available to index categorical features. We can assemble all the features into one vector (using VectorAssembler) and then use a VectorIndexer to index it. The drawback of VectorIndexer is that it will index every feature that has ...
