Cleaning the corpus

One of the nicest features of the tm package is the variety of bundled transformations that can be applied to corpora (corpuses). The tm_map function provides a convenient way of running these transformations on a corpus to filter out all the data that is irrelevant to the research at hand. To see the list of available transformations, simply call the getTransformations function:

> getTransformations()
[1] "as.PlainTextDocument" "removeNumbers"
[3] "removePunctuation"    "removeWords"
[5] "stemDocument"         "stripWhitespace"
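
As a rough illustration of how tm_map applies these transformations, here is a minimal sketch. It assumes the tm package is installed; the two toy documents and the variable names (docs, corpus, cleaned) are invented for this example and do not come from the book.

```r
library(tm)

# Two toy documents, invented purely for illustration
docs <- c("R 3.1 is   great, isn't it?",
          "Text mining with tm: 100 examples!")

# Build an in-memory corpus from the character vector
corpus <- VCorpus(VectorSource(docs))

# Apply three of the bundled transformations, one after another
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

# Pull the cleaned text back out as a plain character vector
cleaned <- unname(sapply(corpus, as.character))
print(cleaned)
```

Each tm_map call returns a new corpus, so the steps can be chained in whatever order suits the analysis. Running stripWhitespace last tidies up the gaps left behind by the earlier removals.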

We should usually start by removing the most frequently used words, the so-called stopwords, from the corpus. These are the most common, short function terms, which usually carry less important meanings than the other ...
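
A hedged sketch of what stopword removal might look like with the removeWords transformation listed above. It assumes the tm package is installed and uses tm's bundled English stopword list; the toy sentence and the variable names are made up for illustration.

```r
library(tm)

# Toy sentence, invented for illustration
corpus <- VCorpus(VectorSource("the cat sat on the mat and it was happy"))

# stopwords("english") returns tm's bundled English stopword list;
# extra arguments to tm_map are passed on to removeWords
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Collapse the gaps left where the stopwords used to be
corpus <- tm_map(corpus, stripWhitespace)

# Split the remaining text into individual words
remaining <- strsplit(trimws(as.character(corpus[[1]])), " ")[[1]]
print(remaining)
```

Function words such as "the" and "on" are dropped, while content words such as "cat" are kept.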

Mastering Data Analysis with R by Gergely Daroczi
