The n-gram features and the hashing trick

As you have seen, the BoW of the vocabulary is used to arrive at the word representation that is later used in the classification process. But the BoW is unordered and carries no syntactic information. Hence, a bag of n-grams is used as additional features to capture some of the syntactic information.
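To see why the n-grams help, consider two sentences made of the same words in a different order. The following is a minimal sketch in plain Python, not fastText's own code; the helper name `word_ngrams` and the two example sentences are ours, chosen only for illustration.

```python
def word_ngrams(tokens, n):
    """Return the contiguous word n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Two sentences with the same words but a different meaning.
a = "the dog bit the man".split()
b = "the man bit the dog".split()

# Their bags of words are identical, so a BoW model cannot tell them apart.
same_bow = sorted(a) == sorted(b)

# Their bags of bigrams differ, so some word-order information is kept.
same_bigrams = sorted(word_ngrams(a, 2)) == sorted(word_ngrams(b, 2))

print(same_bow, same_bigrams)
```

The bigram `("dog", "bit")` appears only in the first sentence and `("man", "bit")` only in the second, which is exactly the kind of local ordering that the unordered BoW throws away.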

As we have already discussed, large-scale NLP problems almost always involve a large corpus. As we have seen from Zipf's law, the number of unique words in such a corpus is effectively unbounded. Words are generally defined as strings of characters separated by a delimiter, such as a space in English. Hence, taking word n-grams is simply not scalable to large corpora, which is essential ...
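The hashing trick named in the section title is the usual answer to this growth: instead of storing every distinct n-gram in a dictionary, each n-gram is hashed into one of a fixed number of buckets, so memory stays bounded no matter how many distinct n-grams the corpus contains. The following is a rough sketch of the general idea under our own assumptions, not fastText's exact implementation. The names `ngram_bucket` and `hashed_bigram_counts` and the table size `NUM_BUCKETS` are hypothetical, and the 32-bit FNV-1a hash is used here only because it is simple and deterministic.

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash of a byte string."""
    h = 2166136261  # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # multiply by the FNV prime, keep 32 bits
    return h


NUM_BUCKETS = 1 << 20  # hypothetical table size, chosen for illustration


def ngram_bucket(ngram, num_buckets=NUM_BUCKETS):
    """Map a word n-gram to a bucket index in [0, num_buckets)."""
    key = " ".join(ngram).encode("utf-8")
    return fnv1a_32(key) % num_buckets


def hashed_bigram_counts(tokens, num_buckets=NUM_BUCKETS):
    """Count bigrams by bucket index instead of by the bigram itself."""
    counts = {}
    for i in range(len(tokens) - 1):
        idx = ngram_bucket(tokens[i:i + 2], num_buckets)
        counts[idx] = counts.get(idx, 0) + 1
    return counts


tokens = "the dog bit the man".split()
features = hashed_bigram_counts(tokens)
print(features)
```

The trade-off is that two different n-grams can land in the same bucket and then share a feature. A larger number of buckets makes such collisions rarer at the cost of more memory.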
