Practical Predictive Analytics by Ralph Winters

Converting to a document term matrix

Once we have a corpus, we can convert it to a document term matrix (DTM). When building a DTM, take care to limit the amount of data and the number of resulting terms that are processed; if the call is not parameterized correctly, it can take a very long time to run. Parameterization is accomplished via the control options. We will remove stopwords, punctuation, and numbers. Additionally, we will keep only words that are at least four characters long and that appear in at least five documents:

library(tm)

dtm <- DocumentTermMatrix(corp, control = list(
  removePunctuation = TRUE,
  wordLengths = c(4, 999),                # keep words of 4 to 999 characters
  stopwords = TRUE,
  removeNumbers = TRUE,
  stemming = FALSE,
  bounds = list(global = c(5, Inf))))     # keep terms found in at least 5 documents
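To get a feel for how much these options shrink the vocabulary, the following is a minimal sketch on a made-up corpus. The `docs` vector and the `toy_corp`, `dtm_default`, and `dtm_limited` names are hypothetical and are not part of the book's example; the lower bound of 2 documents is also an assumption, chosen only because the toy corpus has just three documents.

```r
library(tm)

# Hypothetical toy corpus; it stands in for the book's `corp` object.
docs <- c("The 3 quick brown foxes jumped over the lazy dogs.",
          "Quick brown foxes are rarely lazy, and dogs know that.",
          "Dogs and foxes: 2 animals that seldom share a garden.")
toy_corp <- VCorpus(VectorSource(docs))

# DTM built with tm's default settings, for comparison.
dtm_default <- DocumentTermMatrix(toy_corp)

# DTM built with the same kinds of restrictions as in the text.
# Assumption: the document-frequency bound is lowered to 2 for this tiny corpus.
dtm_limited <- DocumentTermMatrix(toy_corp, control = list(
  removePunctuation = TRUE,
  removeNumbers = TRUE,
  stopwords = TRUE,
  wordLengths = c(4, Inf),
  bounds = list(global = c(2, Inf))))

# Compare the number of terms (columns) kept by each call.
ncol(dtm_default)
ncol(dtm_limited)

# List the terms that survived the restrictions.
Terms(dtm_limited)
```

The restricted matrix should have noticeably fewer columns than the default one. On a real corpus the same restrictions are what keep the run time and memory use manageable.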

We can begin to look at the data by using the inspect() function.
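As a rough illustration of what inspect() shows, the sketch below builds a DTM from `crude`, a sample corpus of 20 Reuters articles that ships with the tm package. Using `crude` in place of the book's `corp` is an assumption made so that the example runs on its own, and the size of the subset that is displayed is arbitrary.

```r
library(tm)

# Assumption: `crude` (a sample corpus bundled with tm) stands in for `corp`.
data("crude")

dtm <- DocumentTermMatrix(crude, control = list(
  removePunctuation = TRUE, wordLengths = c(4, 999), stopwords = TRUE,
  removeNumbers = TRUE, stemming = FALSE,
  bounds = list(global = c(5, Inf))))

# Printing the object gives a short summary (dimensions, sparsity).
dtm

# inspect() prints that summary plus a sample of the document-by-term counts.
# Subsetting first keeps the display small: 5 documents, up to 8 terms.
inspect(dtm[1:5, 1:min(8, ncol(dtm))])
```

Subsetting with `dtm[rows, cols]` before calling inspect() is one way to look at a specific slice of a large matrix.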

This is ...
