Practical Predictive Analytics by Ralph Winters

Removing sparse terms

Most term-document matrices (TDMs) are initially very sparse. Every word in the corpus is indexed, and many words occur so infrequently that they do not matter analytically. Removing sparse terms reduces the number of terms to a manageable size and saves space at the same time.

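The removeSparseTerms() function drops every term whose sparsity (the share of documents in which the term does not occur) is at or above the given threshold. With a threshold of 0.99, only terms that appear in more than 1% of the documents are kept. As the dim() output shows, all 268034 rows of the description matrix are retained, but only 62 terms (columns) remain: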

dtms <- removeSparseTerms(dtm, 0.99)
dim(dtms)
> [1] 268034 62
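To make the threshold concrete, here is a minimal base-R sketch of the same filtering rule applied to a toy matrix. It is not the tm implementation: the data are made up, and keep_dense_terms() is a hypothetical helper that keeps a term only when its sparsity is below the threshold.

```r
# Toy document-term matrix: 5 documents (rows) x 4 terms (columns).
# The values are invented purely for illustration.
m <- matrix(
  c(1, 0, 0, 1,
    1, 1, 0, 0,
    1, 1, 0, 0,
    1, 0, 0, 0,
    1, 0, 1, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("doc", 1:5), c("credit", "refund", "fee", "late"))
)

# Sparsity of a term = share of documents in which it does not occur.
term_sparsity <- colMeans(m == 0)
print(term_sparsity)

# Hypothetical helper: keep only terms whose sparsity is below `sparse`.
keep_dense_terms <- function(mat, sparse) {
  mat[, colMeans(mat == 0) < sparse, drop = FALSE]
}

# "fee" and "late" are each missing from 4 of 5 documents (sparsity 0.8),
# so a threshold of 0.7 removes them and keeps "credit" and "refund".
dense <- keep_dense_terms(m, 0.7)
print(dim(dense))
```

The number of rows is unchanged because only columns (terms) are dropped, which matches the shape of the dim(dtms) result above.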

As an alternative to inspect(), we can convert the result to an ordinary matrix and open it with View():

View(as.matrix(dtms))

In the output from the View() command, a 1 indicates that the term occurs in the document, and a 0 indicates that it does not.
