July 2017
Intermediate to advanced
254 pages
6h 29m
English
While stop word filtering is an easy strategy for dimensionality reduction, most stop word lists contain only a few hundred words. A large corpus may still have hundreds of thousands of unique words after filtering. Two similar strategies for further reducing dimensionality are called stemming and lemmatization.
A high-dimensional document vector may separately encode several derived or inflected forms of the same word. For example, "jumping" and "jumps" are both forms of the word "jump"; a document vector in a corpus of long-jumping articles may encode each inflected form with a separate element in the feature vector. Stemming and lemmatization are two strategies for condensing inflected and derived forms of a ...
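To make the "jumping" and "jumps" example concrete, here is a minimal sketch in plain Python. The `toy_stem` function and its two-entry suffix list are hypothetical and used only for illustration. They are not the Porter stemmer or any library's algorithm, and they would mangle many real words. The sketch only shows how collapsing inflected forms shrinks the vocabulary, and with it the number of elements in each document vector.

```python
# Hypothetical toy suffix list; real stemmers apply many more rules.
SUFFIXES = ("ing", "s")


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def vocabulary(documents, normalize=lambda token: token):
    """Return the sorted set of features from lowercased, whitespace-split tokens."""
    return sorted(
        {normalize(token) for doc in documents for token in doc.lower().split()}
    )


# Assumed three-document corpus, invented for this example.
docs = ["He jumps far", "She is jumping", "They jump"]

raw_vocab = vocabulary(docs)                 # "jump", "jumping", "jumps" are separate features
stemmed_vocab = vocabulary(docs, toy_stem)   # all three collapse to "jump"

print(len(raw_vocab), raw_vocab)
print(len(stemmed_vocab), stemmed_vocab)
```

On this corpus the raw vocabulary has eight features and the stemmed vocabulary has six, because the three forms of "jump" now share a single element.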