July 2017
Intermediate to advanced
254 pages
6h 29m
English
In this chapter's previous examples, a dictionary containing all of the corpus's unique tokens is used to map a document's tokens to the elements of a feature vector. Creating this dictionary, however, has two drawbacks. First, two passes are required over the corpus: the first pass is used to create the dictionary, and the second pass is used to create feature vectors for the documents.
Second, the dictionary must be stored in memory, which can be prohibitively expensive for large corpora. Both drawbacks can be avoided by skipping the dictionary altogether: applying a hash function to each token determines its index in the feature vector directly. This shortcut is called the hashing trick:
# In[1]: ...
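The listing above is elided in this excerpt, so what follows is not the chapter's own code. It is a minimal sketch of the idea using only the standard library, assuming whitespace tokenization and a fixed vector length. The names `token_index`, `hash_vectorize`, and `n_features` are hypothetical and chosen for illustration. MD5 is used only because it gives the same value in every run; Python's built-in `hash()` is salted per process for strings, so it would map tokens to different indices from one run to the next.

```python
import hashlib


def token_index(token, n_features):
    """Map a token to a column index without consulting a dictionary."""
    # A stable digest is needed so the same token always lands in the
    # same column, across documents and across runs.
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_features


def hash_vectorize(tokens, n_features=16):
    """Build a fixed-length count vector in a single pass over the tokens."""
    vector = [0] * n_features
    for token in tokens:
        vector[token_index(token, n_features)] += 1
    return vector


# Hypothetical example document; no vocabulary is built or stored.
doc = "the quick brown fox jumps over the lazy dog".split()
vec = hash_vectorize(doc, n_features=16)
print(vec)
```

Because the index is computed from the token alone, each document can be vectorized in one pass and nothing grows with the size of the corpus. The cost is that distinct tokens can collide on the same index, and a larger `n_features` makes such collisions less likely.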