One-hot encoding categorical features

As in the previous chapter, we need to encode each categorical feature into a set of binary features, by executing the following steps:

  1. In our case, the categorical features include the following:
>>> categorical = df_train.columns
>>> categorical.remove('label')
>>> print(categorical)
['C1', 'banner_pos', 'site_id', 'site_domain', 'site_category', 'app_id', 'app_domain', 'app_category', 'device_model', 'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21']

In PySpark, one-hot encoding is not as direct as it is in scikit-learn (specifically, with its OneHotEncoder class).

  2. We first need to index each categorical column using the StringIndexer module:
>>> from pyspark.ml.feature ...
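
To make the index-then-encode idea concrete, here is a minimal plain-Python sketch of what the two stages do conceptually. It is not PySpark's API: the helper names `fit_string_index` and `one_hot` and the toy `device_type` values are hypothetical, chosen only for illustration. The frequency-based ordering mirrors the default behavior of StringIndexer, which assigns index 0 to the most frequent category.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct category to an integer index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically so the
    # mapping is deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {category: idx for idx, category in enumerate(ordered)}


def one_hot(index, num_categories):
    """Turn an integer index into a binary vector with a single 1."""
    vector = [0] * num_categories
    vector[index] = 1
    return vector


# Hypothetical toy column, not taken from the click dataset.
device_type = ["1", "0", "1", "4", "1", "0"]

# Stage 1: learn the category-to-index mapping from the column.
mapping = fit_string_index(device_type)

# Stage 2: expand each index into a set of binary features.
encoded = [one_hot(mapping[v], len(mapping)) for v in device_type]

print(mapping)
print(encoded[0])
```

The sketch keeps every category as its own binary feature. PySpark's OneHotEncoder, by contrast, drops the last category by default and returns sparse vectors, so its output will not match this one element for element.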
