How it works...

CountVectorizer() from scikit-learn converts a collection of text documents into a matrix of token counts. The tokens can be individual words or sequences of two or more consecutive words, that is, n-grams. In this recipe, we created a BoW from a text variable in a dataframe.

We loaded the 20 Newsgroups text dataset from scikit-learn and first removed punctuation and numbers from the text rows using replace(), which pandas exposes through the str accessor, to replace digits, '\d+', and symbols, '[^\w\s]', with empty strings, ''. Then, we used CountVectorizer() to create the BoW. We set the lowercase parameter to True to convert the words to lowercase before extracting the BoW. We set the stop_words argument to 'english' to ...
