Best practice 12 – performing feature engineering without domain expertise

If we unfortunately have very little domain knowledge, how can we generate features? Don't panic. There are several generic approaches that you can follow:

  • Binarization: This is the process of converting a numerical feature to a binary one with a preset threshold. For example, in spam email detection, for the feature (or term) prize, we can generate a new feature whether prize occurs: any term frequency value greater than 1 becomes 1, otherwise it is 0. The feature number of visits per week can be used to produce a new feature is frequent visitor by judging whether the value is greater than or equal to 3. We implement such binarization using scikit-learn, as follows: ...

Get Python Machine Learning By Example - Second Edition now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.