Cleaning the corpus
One of the nicest features of the tm package is the variety of bundled transformations that can be applied to corpora (corpuses). The tm_map function provides a convenient way of running these transformations on a corpus to filter out all the data that is irrelevant to the actual research. To see the list of available transformation methods, simply call the getTransformations function:
> getTransformations()
[1] "as.PlainTextDocument" "removeNumbers"
[3] "removePunctuation"    "removeWords"
[5] "stemDocument"         "stripWhitespace"
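As a minimal sketch of how tm_map might be used with some of these transformations, the following builds a tiny corpus and cleans it step by step. The two example sentences and the docs, corpus and cleaned variable names are made up for illustration and do not come from the book's dataset; the exact list returned by getTransformations can also differ between tm versions.

```r
library(tm)

# Hypothetical toy documents, invented for illustration only
docs <- c("R 3.1 was released in 2014!",
          "Text   mining, with tm:   easy?")

# Build an in-memory corpus from the character vector
corpus <- VCorpus(VectorSource(docs))

# Apply some of the bundled transformations one after another
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# Extract the cleaned text of each document for inspection
cleaned <- unname(sapply(corpus, as.character))
print(cleaned)
```

Each tm_map call returns a new corpus, so the transformations can be chained in whatever order suits the data. Running stripWhitespace last collapses the extra spaces left behind by the earlier removals.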
We should usually start by removing the most frequently used words, the so-called stopwords, from the corpus. These are the most common, short function terms, which usually carry less important meanings than the other ...
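Stopword removal could be sketched along the following lines, assuming the English stopword list bundled with tm is appropriate for the text. The example sentence and the variable names are hypothetical, and lowercasing first is a choice made here because removeWords matches case-sensitively.

```r
library(tm)

# Hypothetical example sentence, invented for illustration only
corpus <- VCorpus(VectorSource("This is one of the most common examples of the text"))

# Peek at the first few entries of the bundled English stopword list
print(head(stopwords("english")))

# Lowercase first, so that capitalized words such as "This" also match
corpus <- tm_map(corpus, content_transformer(tolower))

# Drop the stopwords, then collapse the gaps they leave behind
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

result <- as.character(corpus[[1]])
print(result)
```

Extra arguments to tm_map, such as the word list here, are passed on to the transformation function, so a custom vector of words can be supplied in the same way.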