July 2017
So far, we have considered only single tokens (also called unigrams), but in many contexts it's useful to treat short sequences of words (bigrams or trigrams) as atomic features for our classifiers, alongside the individual tokens. For example, if we are analyzing the sentiment of some texts, it can be a good idea to consider bigrams such as pretty good, very bad, and so on. From a semantic viewpoint, in fact, it's important to consider not just the adverb but the whole compound form. We can inform our vectorizers about the range of n-grams we want to consider. For example, if we need unigrams and bigrams, we can use this snippet:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cv = CountVectorizer(tokenizer=tokenizer, ngram_range=(1, 2))
>>> vectorized_corpus = cv.fit_transform(corpus)
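To make this concrete, here is a minimal self-contained sketch; the two-sentence corpus is invented for illustration, and the default scikit-learn tokenizer is used instead of a custom one:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative corpus (hypothetical data, not from the book)
corpus = [
    'this film is pretty good',
    'the plot is very bad',
]

# ngram_range=(1, 2) keeps the unigrams and also extracts bigrams,
# so compound forms such as 'pretty good' become features of their own
cv = CountVectorizer(ngram_range=(1, 2))
vectorized_corpus = cv.fit_transform(corpus)

# The learned vocabulary now mixes unigrams and space-joined bigrams
print(sorted(cv.vocabulary_))
```

With `ngram_range=(1, 1)` only the single words would survive, and the opposite sentiment signals in "pretty good" versus "very bad" would be split across unrelated features; widening the range to `(1, 2)` keeps those compounds intact at the cost of a larger, sparser vocabulary.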