How it works...
CountVectorizer() from scikit-learn converts a collection of text documents into a matrix of token counts. The tokens can be individual words or sequences of two or more consecutive words, that is, n-grams. In this recipe, we created a BoW from a text variable in a dataframe.
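To make the idea of a token-count matrix concrete, here is a minimal sketch on a two-sentence toy corpus. The corpus and the variable names are assumptions made for illustration and are not part of the recipe; ngram_range=(1, 2) is used here only to show single words and two-word n-grams side by side.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration (not the recipe's dataset).
docs = ["the cat sat", "the cat ran"]

# ngram_range=(1, 2) asks for single words and pairs of consecutive words.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each token to its column index; sort tokens by that index.
features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
counts = X.toarray()

print(features)
print(counts)
```

Each row of counts corresponds to a document and each column to a token or n-gram, so a cell holds the number of times that token appears in that document.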
We loaded the 20 Newsgroups text dataset from scikit-learn. First, we removed punctuation and numbers from the text rows using pandas' replace(), accessed through the str accessor, to replace digits, '\d+', and symbols, '[^\w\s]', with empty strings, ''. Then, we used CountVectorizer() to create the BoW. We set the lowercase parameter to True to convert the words to lowercase before extracting the BoW. We set the stop_words argument to english to ...
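The cleaning and vectorizing steps described above can be sketched roughly as follows. This is an illustration under assumptions: the two-row dataframe, the column name text, and the variable names are hypothetical stand-ins for the 20 Newsgroups data, and only the parameters named in the text (lowercase and stop_words) are set.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-row dataframe standing in for the newsgroup text.
df = pd.DataFrame({"text": ["Hello, World! 123", "The world is 42% Python."]})

# Remove symbols first, then digits, replacing each match with an empty string.
df["text"] = df["text"].str.replace(r"[^\w\s]", "", regex=True)
df["text"] = df["text"].str.replace(r"\d+", "", regex=True)

# Lowercase the text and apply scikit-learn's built-in English stop word list.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(df["text"])

# Wrap the sparse counts in a dataframe with one column per remaining word.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
bow = pd.DataFrame(X.toarray(), columns=columns)
print(bow)
```

On a real corpus the resulting matrix can be very wide, so it may be preferable to keep X sparse rather than converting it with toarray().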