The bag-of-words model takes into account isolated terms called unigrams. This loses the order of the words, which can be important in some cases. A generalization of the technique is called n-grams, where, in addition to single words, we also keep word pairs or word triplets, called bigrams and trigrams, respectively. In general, an n-gram representation keeps sequences of up to n consecutive words together in the data. Naturally, the number of possible features grows combinatorially with n, so the data can become very large, and processing a sizable corpus can require significant computing power.
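As a minimal sketch of the idea (the helper name extract_ngrams and the example tokens are illustrative, not taken from the book), the following Python snippet builds unigrams, bigrams, and trigrams from a list of tokens:

# Minimal sketch of n-gram extraction from a tokenized sentence.
# The function name and example tokens below are illustrative only.

def extract_ngrams(tokens, n):
    """Return all contiguous n-grams from a list of tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "quick", "brown", "fox", "jumps"]

unigrams = extract_ngrams(tokens, 1)  # [('the',), ('quick',), ...]
bigrams  = extract_ngrams(tokens, 2)  # [('the', 'quick'), ('quick', 'brown'), ...]
trigrams = extract_ngrams(tokens, 3)  # [('the', 'quick', 'brown'), ...]

print(bigrams)

Keeping unigrams alongside bigrams and trigrams is what makes the feature set grow so quickly: each value of n adds a new family of features on top of the previous ones.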
With the sentence object we created earlier to illustrate how the tokenization process works (it contains ...