• In the one-hot encoding solution, can you replace logistic regression with other classifiers supported in PySpark, such as decision tree, random forest, and linear SVM?
  • In the feature hashing solution, can you try other hash sizes, such as 5,000 and 20,000? What do you observe?
  • In the feature interaction solution, can you try other interactions, such as C1 and C20?
  • Can you apply feature interaction first and then feature hashing, in order to lower the expanded dimensionality? Can you obtain a higher AUC this way?
