If we unfortunately have very little domain knowledge, how can we generate features? Don't panic. There are several generic approaches that you can follow:
- Binarization: This is the process of converting a numerical feature to a binary one with a preset threshold. For example, in spam email detection, for the feature (or term) prize, we can generate a new feature whether prize occurs: any term frequency value greater than 1 becomes 1, otherwise it is 0. The feature number of visits per week can be used to produce a new feature is frequent visitor by judging whether the value is greater than or equal to 3. We implement such binarization using scikit-learn, as follows: ...