As we have seen earlier, the bag-of-words approach is both fast and robust. It is, however, not without challenges. Let's dive directly into them.
We do not have to write custom code for counting words and representing those counts as a vector. SciKit's
CountVectorizer class not only does the job efficiently but also has a very convenient interface. SciKit's functions and classes are imported via the sklearn package:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=1)
The min_df parameter (minimum document frequency) determines how
CountVectorizer treats rarely occurring words. If it is set to ...
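
To make the counting and the role of min_df more concrete, here is a minimal sketch. It assumes a tiny made-up corpus (the three sentences are invented for illustration and do not come from the text) and relies on scikit-learn's documented behavior that an integer min_df is read as an absolute number of documents, so words occurring in fewer documents than that are dropped. The variable names are likewise only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented purely for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# min_df=1 keeps every word that occurs in at least one document.
vectorizer = CountVectorizer(min_df=1)
counts = vectorizer.fit_transform(corpus)    # sparse matrix: documents x words
vocabulary = sorted(vectorizer.vocabulary_)  # words in column order

# min_df=2 drops words that occur in fewer than two documents.
strict = CountVectorizer(min_df=2)
strict.fit(corpus)
strict_vocabulary = sorted(strict.vocabulary_)

print(vocabulary)
print(counts.toarray())
print(strict_vocabulary)
```

With this corpus, the first vectorizer learns ten words and each row of counts holds one document's word counts, while the stricter vectorizer keeps only "on", "sat", and "the", the three words that appear in two documents. Note that the default tokenizer lowercases the text and ignores single-character tokens.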