June 2017
Beginner to intermediate
576 pages
15h 22m
English
Once we have a corpus, we can convert it to a document-term matrix (DTM). When building a DTM, take care to limit the amount of data and the number of resulting terms that are processed; if the call is not parameterized correctly, it can take a very long time to run. Parameterization is done through the control options. We will remove stopwords, punctuation, and numbers, skip stemming, and keep only words that are at least four characters long and appear in at least five documents:
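library(tm)

# Build the DTM from the corpus, filtering terms as they are counted
dtm <- DocumentTermMatrix(corp, control = list(
  removePunctuation = TRUE,
  wordLengths = c(4, 999),            # keep words of 4 to 999 characters
  stopwords = TRUE,
  removeNumbers = TRUE,
  stemming = FALSE,
  bounds = list(global = c(5, Inf))   # keep terms found in at least 5 documents
))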
We can begin to look at the data by using the inspect() function.
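To see the control options and inspect() at work without the full corpus, here is a minimal sketch on a three-document toy corpus. The names docs, toy_corp, and toy_dtm and the sample sentences are hypothetical, not part of the original example. The global lower bound is also reduced to 2, an assumption made only because the toy corpus has three documents.

```r
library(tm)

# Hypothetical toy documents standing in for the real corpus `corp`
docs <- c(
  "The quick brown foxes jumped over 3 lazy dogs.",
  "Lazy dogs sleep while quick foxes hunt.",
  "Foxes and dogs rarely share 2 meals."
)
toy_corp <- VCorpus(VectorSource(docs))

# Same kinds of control options as above; the global bound is lowered to 2
# (an assumption for this sketch) because there are only three documents
toy_dtm <- DocumentTermMatrix(
  toy_corp,
  control = list(
    removePunctuation = TRUE,
    removeNumbers = TRUE,
    stopwords = TRUE,
    wordLengths = c(4, Inf),
    bounds = list(global = c(2, Inf))
  )
)

# Print a summary (dimensions, sparsity) and a sample of the matrix
inspect(toy_dtm)

# List the terms that survived the filters
Terms(toy_dtm)
```

Tightening wordLengths or raising the lower global bound shrinks the number of columns, which is the main lever for keeping the matrix manageable on a large corpus.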
This is ...