Exercises

  • In the one-hot encoding solution, can you swap logistic regression for other classifiers supported in PySpark, such as decision tree, random forest, or linear SVM?
  • In the feature hashing solution, can you try other hash sizes, such as 5,000 and 20,000? What do you observe?
  • In the feature interaction solution, can you try other interacting pairs, such as C1 and C20?
  • Can you first apply feature interaction and then feature hashing, so that the expanded dimensionality is reduced? Are you able to obtain a higher AUC?
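To build intuition for the hash-size and interaction-then-hashing exercises, here is a minimal pure-Python sketch. In PySpark you would use the `FeatureHasher` and `Interaction` transformers at scale; the `hash_index` and `interact` helpers below, along with the synthetic stand-ins for columns such as C1 and C20, are purely illustrative. The idea: crossing two columns multiplies the number of raw feature values, and the chosen hash size then caps the dimensionality at the cost of collisions, with smaller hash sizes colliding more often.

```python
import hashlib

def hash_index(value: str, hash_size: int) -> int:
    """Deterministically map a categorical value to a bucket in [0, hash_size)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16) % hash_size

def interact(v1: str, v2: str) -> str:
    """Cross two categorical values into one combined feature value."""
    return f"{v1}_{v2}"

# Hypothetical category values standing in for two columns such as C1 and C20
c1_values = [f"c1_{i}" for i in range(100)]
c20_values = [f"c20_{j}" for j in range(50)]
crossed = [interact(a, b) for a in c1_values for b in c20_values]  # 5,000 combinations

# Larger hash sizes spread the crossed values over more buckets,
# so fewer distinct values share the same index (fewer collisions).
for hash_size in (5000, 10000, 20000):
    buckets = {hash_index(v, hash_size) for v in crossed}
    collisions = len(crossed) - len(buckets)
    print(f"hash_size={hash_size}: {collisions} collisions for {len(crossed)} crossed values")
```

Running this shows the trade-off directly: hashing after interaction keeps the feature dimension fixed at `hash_size` no matter how many crossed values exist, but the smaller the hash size, the more information is lost to collisions.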
