June 2017
Beginner to intermediate
576 pages
15h 22m
English
Most term-document matrices (TDMs) start out mostly empty. Every word in the corpus is indexed, and many words occur so infrequently that they carry no analytical weight. Removing sparse terms reduces the number of terms to a manageable size and saves memory at the same time.
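Here, removeSparseTerms() with a sparsity threshold of 0.99 drops every term that is absent from 99 percent or more of the documents. The dimensions of the result show 268034 documents (rows) and only 62 remaining terms (columns):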
dtms <- removeSparseTerms(dtm, 0.99)
dim(dtms)
> [1] 268034 62
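To make the sparsity threshold concrete, here is a minimal base R sketch of the same idea on a toy matrix. It does not call the tm package. The matrix m, its term names, and the 0.6 threshold are made up for illustration, and the filter only approximates what removeSparseTerms() does: it keeps terms whose share of empty documents falls below the threshold.

```r
# Toy document-term matrix: rows are documents, columns are terms.
# The data and names are hypothetical, for illustration only.
m <- matrix(
  c(1, 0, 0,
    1, 0, 1,
    1, 0, 0,
    1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("data", "rare", "model"))
)

# Sparsity of a term: the share of documents in which it does not occur.
sparsity <- colMeans(m == 0)

# Keep only the terms whose sparsity is below the threshold.
threshold <- 0.6
kept <- m[, sparsity < threshold, drop = FALSE]

# Rows (documents) are unchanged; only the columns (terms) shrink.
dim(kept)
```

In this toy case, "rare" and "model" are each missing from three of the four documents (sparsity 0.75), so only "data" survives. A threshold closer to 1, such as 0.99, is more permissive and removes only the very rarest terms.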
As an alternative to inspect(), we can also use View() to display dtms in matrix form:
View(as.matrix(dtms))
Here is the output from the View() command. A 1 indicates that the term occurs in the document, and a 0 indicates that it does not: