One-hot encoding categorical features

Similar to the previous chapter, we need to encode categorical features into sets of multiple binary features by executing the following steps:

  1. In our case, the categorical features include the following:
>>> categorical = df_train.columns
>>> categorical.remove('label')
>>> print(categorical)
['C1', 'banner_pos', 'site_id', 'site_domain', 'site_category', 'app_id', 'app_domain', 'app_category', 'device_model', 'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21']

In PySpark, one-hot encoding is not as direct as it is in scikit-learn, where the OneHotEncoder class handles it in a single step.

  2. We first need to index each categorical column using the StringIndexer class:
>>> from pyspark.ml.feature ...
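
To make the index-then-encode idea concrete, here is a minimal sketch in plain Python rather than PySpark. It is not the book's listing: the helper names fit_string_index and one_hot and the toy site_category values are hypothetical, chosen only for illustration. The sketch assigns indices by descending frequency, which matches the default ordering of StringIndexer, and then expands each index into a binary vector.

```python
# Conceptual sketch in plain Python (no Spark required). The helper names
# are hypothetical and only mimic the two stages: index, then one-hot encode.
from collections import Counter


def fit_string_index(values):
    """Map each distinct category to an integer, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically so the
    # resulting mapping is deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: idx for idx, value in enumerate(ordered)}


def one_hot(index, size):
    """Return a binary list of length `size` with a single 1 at `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector


# Toy column standing in for one categorical feature (made-up values).
site_category = ["news", "sports", "news", "games", "news", "sports"]

# Stage 1: index the string categories.
mapping = fit_string_index(site_category)
indexed = [mapping[v] for v in site_category]

# Stage 2: expand each index into a binary vector.
encoded = [one_hot(i, len(mapping)) for i in indexed]

print(mapping)
print(indexed)
print(encoded[0])
```

Note that this sketch keeps one binary column per category. PySpark's one-hot encoder, by default, drops the last category, so its output vectors would be one element shorter than those produced here.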
