Bag of words

Machine learning algorithms work on numbers, so before we can use text in them we have to convert it into a numerical representation called a feature vector. A vector can be as simple as a list of numbers. The bag-of-words model is one feature-extraction technique for text: it represents each document by how many times each word occurs in it and ignores the order of the words.
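
To make the idea concrete before turning to a library, here is a minimal sketch of a bag of words built by hand. The function name bag_of_words and the two sample sentences are made up for illustration, and splitting on whitespace is a simplifying assumption; a real tokenizer would also deal with punctuation.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, vectors) for a list of strings (illustrative helper)."""
    # Lowercase and split on whitespace; this is a simplification, not a full tokenizer.
    tokenized = [doc.lower().split() for doc in documents]
    # A sorted vocabulary fixes the position of each word in every vector.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One count per vocabulary word; a Counter returns 0 for absent words.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


docs = ["the cat sat", "the cat saw the dog"]  # made-up sample data
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
print(vectors)
```

Each inner list is the feature vector for one document, and every position in it corresponds to the same word of the vocabulary.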

To build one, we import CountVectorizer from scikit-learn (the sklearn package in Python):

from sklearn.feature_extraction.text import CountVectorizer

We are going to use CountVectorizer to create the bag of words:

corpus = []
len(text.sentences)
for sentence in text.sentences:
    corpus.append(str(sentence))

vectorizer = CountVectorizer()
print(vectorizer.fit_transform(corpus).todense() ...
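
Because the listing above is cut off and depends on a text object that is not defined in this section, the following self-contained sketch shows the same CountVectorizer steps on a small corpus. The two sentences are made-up sample data, and toarray() is used here in place of todense() so that the result is a plain NumPy array.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sample corpus standing in for the sentences of the original text
corpus = ["The cat sat.", "The cat saw the dog."]

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse document-term matrix
counts = vectorizer.fit_transform(corpus)

# vocabulary_ maps each word to its column index; sort the words by that index
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
matrix = counts.toarray()

print(vocabulary)
print(matrix)
```

Each row of the matrix is one sentence and each column is one word of the vocabulary. With the default settings, CountVectorizer lowercases the text and drops punctuation and single-character tokens.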
