We have seen that we need to be careful about standardizing and potentially normalizing our features, and the impact on model performance can be serious. In this case, we used only a portion of the features available. For example, we completely ignored the category variable and the textual content in the boilerplate variable column.
This was done for ease of illustration, but let's assess the impact of adding an additional feature such as the category feature.
First, we will inspect the categories, and form a mapping of index to category, which you might recognize as the basis for a 1-of-k encoding of this categorical feature:
val categories = records.map(r => r(3)).distinct.collect.zipWithIndex.toMap val numCategories ...