Feature engineering

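Now it is time to run the first transformer, which is actually an estimator. StringIndexer needs to build and keep track of an internal mapping table between strings and indexes, and that table has to be learned from the data. It is therefore an estimator: calling fit on it returns a model, and that model is the transformer that performs the actual conversion: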

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

var indexer = new StringIndexer()
  .setHandleInvalid("skip")
  .setInputCol("L0_S22_F545")
  .setOutputCol("L0_S22_F545Index")

var indexed = indexer.fit(df_notnull).transform(df_notnull)

indexed.printSchema
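To make the split between fit and transform in the code above more concrete, here is a minimal sketch on a toy DataFrame. The column names category and categoryIndex, the sample values, and the local SparkSession are assumptions made for illustration; they are not part of the dataset used in this chapter. The sketch is meant to be pasted into spark-shell or a notebook.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.StringIndexer

// Local session for illustration only; in spark-shell or a notebook,
// getOrCreate() simply returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("StringIndexerSketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical toy data standing in for df_notnull.
val toy = Seq("a", "b", "a", "c", "a", "b").toDF("category")

val toyIndexer = new StringIndexer()
  .setHandleInvalid("skip")
  .setInputCol("category")
  .setOutputCol("categoryIndex")

// fit() scans the data and returns a StringIndexerModel that holds
// the learned string-to-index mapping.
val model = toyIndexer.fit(toy)

// With the default ordering, labels are sorted by descending frequency,
// so the position of a label in this array is its index.
model.labels.zipWithIndex.foreach { case (label, idx) =>
  println(s"$label -> $idx")
}

// transform() applies the learned mapping and appends the index column.
val indexedToy = model.transform(toy)
indexedToy.printSchema()
indexedToy.show()
```

Keeping the fitted model in its own value, rather than chaining fit and transform, makes it possible to inspect the mapping and to apply the same mapping to another DataFrame later.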

The printed schema, shown in the following image, now contains an additional column called L0_S22_F545Index:

Finally, let's examine ...
