February 2019
Similar to the previous chapter, we need to encode categorical features into sets of multiple binary features by executing the following steps:
>>> categorical = df_train.columns
>>> categorical.remove('label')
>>> print(categorical)
['C1', 'banner_pos', 'site_id', 'site_domain', 'site_category', 'app_id', 'app_domain', 'app_category', 'device_model', 'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21']
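To make "sets of multiple binary features" concrete, here is a minimal conceptual sketch in plain Python. It is not the PySpark API and not necessarily how the chapter's pipeline orders its columns. The helper name `one_hot_encode`, the sorted column order, and the toy values are all assumptions made for illustration.

```python
# Conceptual sketch only: plain Python, not the PySpark API.
def one_hot_encode(values):
    """Map each value to a binary indicator vector, one column per distinct category."""
    # Give every distinct category a column index. Sorted order is an
    # arbitrary choice made here so the result is deterministic.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}

    encoded = []
    for value in values:
        row = [0] * len(categories)   # all zeros ...
        row[index[value]] = 1         # ... except the column for this category
        encoded.append(row)
    return categories, encoded


# Hypothetical toy values standing in for a column such as 'device_type'.
toy_column = ['1', '0', '1', '4']
categories, encoded = one_hot_encode(toy_column)
print(categories)
print(encoded)
```

Each row holds exactly one 1, so a column with k distinct values becomes k binary features. This is why high-cardinality columns such as `site_id` or `device_model` can expand into a very large number of features.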
In PySpark, one-hot encoding is not as direct as it is in scikit-learn (specifically, with its OneHotEncoder class).
>>> from pyspark.ml.feature ...