Improving the model

From the word cloud, it is clear that there are still numbers and words that are not relevant to us. For example, the technical term package appears, and we can remove such terms. In addition, showing plural versions of nouns is redundant. Let's use the removeNumbers function to remove numbers:

> v <- tm_map(v, removeNumbers)
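To see what this transformation does, here is a minimal sketch on a small stand-in corpus. The `toy` corpus and its two sentences are hypothetical and are not the corpus `v` used in this chapter; the sketch assumes the tm package is installed.

```r
library(tm)

# Hypothetical two-document corpus standing in for v
toy <- VCorpus(VectorSource(c("version 2 of the package",
                              "released in 2015")))

# Apply the same transformation as above: strip all digits
toy <- tm_map(toy, removeNumbers)

# Pull the cleaned text back out as a plain character vector
cleaned <- unname(sapply(toy, as.character))
print(cleaned)
```

Note that removeNumbers only deletes the digits themselves, so the surrounding spaces stay in place.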

To remove some frequent domain-specific words that are less relevant to our purpose, we need to see the most common words in our corpus. To do that, we can compute a TermDocumentMatrix, as shown in the following snippet:

> tdm <- TermDocumentMatrix(v)

The tdm object is a matrix that holds the words in the rows and the documents in the columns, where the cells show the number of occurrences. ...
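One possible way to read the most common words off such a matrix is to sum each row and sort the totals. The sketch below uses a hypothetical three-document corpus (`docs`, `toy`, and `toy_tdm` are illustrative names, not objects from this chapter) and assumes the tm package is installed.

```r
library(tm)

# Hypothetical toy documents; the chapter's real corpus is v
docs <- c("data model data",
          "model package",
          "data package package package")
toy <- VCorpus(VectorSource(docs))

# Terms in the rows, documents in the columns, counts in the cells
toy_tdm <- TermDocumentMatrix(toy)

# Converting to a dense matrix is fine for a toy example;
# a large corpus may not fit in memory this way
m <- as.matrix(toy_tdm)

# Total occurrences of each term across all documents, most frequent first
freq <- sort(rowSums(m), decreasing = TRUE)
print(head(freq, 3))
```

Once a frequent but uninformative term such as package shows up at the top of this list, it could then be dropped from the corpus, for example with tm's removeWords transformation.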
