O'Reilly logo

Building Machine Learning Systems with Python - Second Edition by Luis Pedro Coelho, Willi Richert

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Preprocessing – similarity measured as a similar number of common words

As we have seen earlier, the bag of word approach is both fast and robust. It is, though, not without challenges. Let's dive directly into them.

Converting raw text into a bag of words

We do not have to write custom code for counting words and representing those counts as a vector. SciKit's CountVectorizer method does the job not only efficiently but also has a very convenient interface. SciKit's functions and classes are imported via the sklearn package:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=1)

The min_df parameter determines how CountVectorizer treats seldom words (minimum document frequency). If it is set to ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required