July 2017
The algorithm is very simple: each token is represented by the number of times it appears in a document. The whole corpus must first be processed to determine the unique tokens present and their frequencies. Let's see an example of the CountVectorizer class on a simple corpus:
from sklearn.feature_extraction.text import CountVectorizer

>>> corpus = [
        'This is a simple test corpus',
        'A corpus is a set of text documents',
        'We want to analyze the corpus and the documents',
        'Documents can be automatically tokenized'
]

>>> cv = CountVectorizer()
>>> vectorized_corpus = cv.fit_transform(corpus)
>>> print(vectorized_corpus.todense())
[[0 0 0 0 0 1 0 1 0 0 1 1 0 0 1 0 0 0 0]
 [0 0 0 0 0 1 1 1 1 1 ...
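To see how the count vectors line up with tokens, a minimal sketch (using a two-document corpus of my own for illustration) inspects the fitted vocabulary_ attribute, which maps each unique token to its column index, and then transforms a new document with the learned vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative corpus (not the one from the example above)
corpus = [
    'This is a simple test corpus',
    'A corpus is a set of text documents',
]

cv = CountVectorizer()
X = cv.fit_transform(corpus)

# vocabulary_ maps each unique token to its column index.
# Note: the default tokenizer lowercases and drops single-character
# tokens, so 'a' does not appear in the vocabulary.
print(sorted(cv.vocabulary_.items(), key=lambda kv: kv[1]))

# Transform a new document with the fitted vocabulary;
# tokens never seen during fitting are simply ignored
new_vec = cv.transform(['a new test document']).todense()
print(new_vec)
```

Because transform reuses the vocabulary learned by fit_transform, the new vector has the same dimensionality as the training matrix, with counts only in the columns of tokens that were already known.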