Similar to the previous chapter, we need to encode categorical features into sets of multiple binary features by executing the following steps:
- In our case, the categorical features include the following:
>>> categorical = df_train.columns
>>> categorical.remove('label')
>>> print(categorical)
['C1', 'banner_pos', 'site_id', 'site_domain', 'site_category', 'app_id', 'app_domain', 'app_category', 'device_model', 'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21']
In PySpark, one-hot encoding is not as direct as it is in scikit-learn, where a single OneHotEncoder class handles string categories in one step.
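To make the extra work concrete, here is a minimal plain-Python sketch of the two stages involved: mapping each category string to an integer index, then expanding each index into a binary vector. It does not use the PySpark API. The helper names `fit_string_index` and `one_hot` and the toy `site_category` values are hypothetical, chosen only for illustration. The sketch assumes indices are assigned by descending frequency with ties broken alphabetically, and it keeps every category as its own dense column.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an integer index, most frequent first."""
    counts = Counter(values)
    # Assumption of this sketch: sort by descending frequency,
    # breaking ties alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: index for index, value in enumerate(ordered)}


def one_hot(index, size):
    """Return a binary list of length `size` with a single 1 at `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector


# Hypothetical toy column, not taken from the click-through dataset.
site_category = ['news', 'sports', 'news', 'games', 'news', 'sports']

# Stage 1: string -> integer index.
mapping = fit_string_index(site_category)
indexed = [mapping[v] for v in site_category]

# Stage 2: integer index -> binary vector.
encoded = [one_hot(i, len(mapping)) for i in indexed]

print(mapping)
print(indexed)
print(encoded[3])
```

With a real Spark DataFrame, both stages run column by column over distributed data and the encoded output is typically stored as a sparse vector, so the actual PySpark code looks different from this sketch.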
- We first need to index each categorical column using the StringIndexer class:
>>> from pyspark.ml.feature ...