Thinking about features for text data

From the preceding analysis, we can conclude that, if we want to figure out whether a document came from the newsgroup, the presence or absence of words such as car, doors, and bumper can be very useful features. Whether a word is present is a Boolean variable; we can also look at how many times certain words occur. For instance, car occurs multiple times in the document. It may be that the more often such a word appears in a text, the more likely the document is to have something to do with cars.
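To make the two kinds of features concrete, here is a minimal sketch using only the Python standard library. The vocabulary, the sample sentence, and the helper names (`tokenize`, `presence_features`, `count_features`) are assumptions made up for illustration; they are not taken from the dataset or from any library.

```python
import re
from collections import Counter

# Hypothetical vocabulary of words we suspect are informative (an assumption).
VOCABULARY = ["car", "doors", "bumper"]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def presence_features(text, vocabulary=VOCABULARY):
    """Map each vocabulary word to True if it occurs in the text, else False."""
    tokens = set(tokenize(text))
    return {word: word in tokens for word in vocabulary}


def count_features(text, vocabulary=VOCABULARY):
    """Map each vocabulary word to the number of times it occurs in the text."""
    counts = Counter(tokenize(text))
    return {word: counts[word] for word in vocabulary}


# Made-up sample text, not an actual newsgroup post.
sample = "I was wondering what this car is. It has two doors; a car like this is rare."

print(presence_features(sample))  # {'car': True, 'doors': True, 'bumper': False}
print(count_features(sample))     # {'car': 2, 'doors': 1, 'bumper': 0}
```

The Boolean features only record that car appears, while the count features also capture that it appears twice, which is the extra signal the paragraph above suggests might be useful.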
