As we have seen previously, the bag-of-words approach is both fast and robust. However, it is not without challenges. Let's dive directly into them.
We do not have to write custom code for counting words and representing those counts as a vector. Scikit's
CountVectorizer does the job very efficiently and has a convenient interface. Scikit's functions and classes are imported via the
sklearn package as follows:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=1)
min_df (minimum document frequency) determines how CountVectorizer treats words that are not used frequently. ...
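
To make the effect of min_df concrete, here is a minimal sketch that fits two vectorizers on the same toy corpus. The three example sentences and all variable names are made up for illustration and are not from the text above. It relies on the behavior that an integer min_df is an absolute document count, so a word that appears in fewer than min_df documents is left out of the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, chosen only for illustration
docs = [
    "machine learning is fun",
    "machine learning is hard",
    "databases store data",
]

# min_df=1: keep every word that occurs in at least one document
vec_all = CountVectorizer(min_df=1)
X_all = vec_all.fit_transform(docs)
vocab_all = sorted(vec_all.vocabulary_)

# min_df=2: drop words that occur in fewer than two documents
vec_common = CountVectorizer(min_df=2)
X_common = vec_common.fit_transform(docs)
vocab_common = sorted(vec_common.vocabulary_)

print(vocab_all)          # all eight distinct words
print(vocab_common)       # only the words shared by two documents
print(X_common.toarray()) # one row per document, one column per kept word
```

With min_df=2 the third sentence shares no words with the other two, so its row in the count matrix contains only zeros. This hints at the trade-off: a higher threshold removes rare, noisy words, but it can also leave some documents with an empty vector.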