Further cleanup
There are still some small, disturbing glitches in the wordlist. Maybe we do not really want to keep numbers in the package descriptions at all (or we might want to replace all numbers with a placeholder text, such as NUM), and there are some frequent technical words that can be ignored as well, for example, package. Showing the plural version of nouns is also redundant. Let's improve our corpus with some further tweaks, step by step!
Removing the numbers from the package descriptions is fairly straightforward, based on the previous examples:
> v <- tm_map(v, removeNumbers)
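The placeholder alternative mentioned earlier can be sketched in a similar way. The following is only a minimal illustration, not the book's own code: the replaceNumbers helper, the toy corpus, and the sample descriptions are all made up for this example, and it assumes a tm version that provides content_transformer.

```r
# Hypothetical helper (not from the book): replace every run of digits
# with a placeholder string instead of deleting it.
replaceNumbers <- function(x, placeholder = "NUM") {
  gsub("[0-9]+", placeholder, x)
}

# Made-up sample descriptions, for illustration only.
descriptions <- c("Tools for 2D and 3D plots",
                  "Bindings for libfoo version 1.2")
print(replaceNumbers(descriptions))

# With tm, a plain character function has to be wrapped in
# content_transformer() so that tm_map() keeps returning documents.
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  toy <- VCorpus(VectorSource(descriptions))
  toy <- tm_map(toy, content_transformer(replaceNumbers))
  print(content(toy[[1]]))
}
```

Note that the pattern matches each run of digits separately, so a version string such as 1.2 becomes NUM.NUM. Also, if the corpus is lowercased after this step, the placeholder is lowercased as well, so the order of the transformations matters.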
To remove some frequent domain-specific words that carry little meaning, let's look at the most common words in the documents. To this end, first we have to ...