R for Data Science by Dan Toomey

Summary

In this chapter, we discussed different methods of mining a text source. We took a raw document, cleaned it up using built-in R functions, and produced a corpus suitable for analysis. We removed sparse terms and stop words so that we could focus on the words that carry the real value of the text.
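
As a reminder of what the cleanup step can look like, here is a minimal sketch that assumes the tm package is installed. The sample text and the exact order of transformations are illustrative assumptions, not the chapter's own listing.

```r
library(tm)

# Hypothetical sample text standing in for the raw document
raw_text <- c("The quick brown fox jumps over the lazy dog!",
              "Text mining in R: 2 corpora, 1 matrix.")

# Build a corpus with one document per element of the character vector
corpus <- VCorpus(VectorSource(raw_text))

# Apply the usual cleanup transformations in sequence
corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                      # drop digits
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, stripWhitespace)                    # collapse spaces

# Extract the cleaned text of each document for inspection
cleaned <- sapply(corpus, as.character)
print(cleaned)
```

Lowercasing comes before stop word removal because the stop word list is lowercase, so "The" would otherwise survive.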

From the corpus, we generated a document-term matrix, which records how often each word occurs in each document of the source.
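
The following sketch shows one way to build such a matrix and drop sparse terms, again assuming the tm package. The three toy documents and the 0.5 sparsity threshold are assumptions chosen only to keep the example small.

```r
library(tm)

# Hypothetical toy documents; "durian" appears in only one of them
docs <- c("apple banana apple",
          "banana cherry durian",
          "apple cherry cherry")

corpus <- VCorpus(VectorSource(docs))

# Rows are documents, columns are terms, cells are term counts
dtm <- DocumentTermMatrix(corpus)
m <- as.matrix(dtm)
print(m)

# Keep only terms whose sparsity (share of documents where the term
# is absent) does not exceed 0.5; this drops "durian"
dtm_dense <- removeSparseTerms(dtm, 0.5)
print(Terms(dtm_dense))
```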

Once the matrix was available, we organized the words into clusters and plotted the text data accordingly. With the words grouped in this way, we could also apply standard R clustering techniques to the data.
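
To illustrate the clustering step, the sketch below uses only base R on a small, made-up term count matrix. The counts, the Ward linkage, and the choice of two clusters are all assumptions for illustration; the chapter's own data and settings may differ.

```r
# Hypothetical term counts: rows are documents, columns are terms
m <- matrix(c(5, 4, 0, 0,
              4, 5, 0, 1,
              0, 0, 6, 5,
              0, 1, 5, 6),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4),
                            c("data", "model", "river", "lake")))

# Transpose so that each row is a term, then measure distances between terms
term_dist <- dist(t(m))

# Hierarchical clustering of terms, drawn as a dendrogram
hc <- hclust(term_dist, method = "ward.D2")
plot(hc, main = "Term clusters")

# Cut the tree into two groups of terms
groups <- cutree(hc, k = 2)
print(groups)

# k-means on the same term vectors as an alternative technique
set.seed(42)
km <- kmeans(t(m), centers = 2, nstart = 10)
print(km$cluster)
```

Terms that occur in the same documents end up close together, so "data" and "model" form one group and "river" and "lake" the other.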

Finally, we looked at using raw XML as the text source for our processing and examined some of the XML processing features available in R. ...
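
For the XML part, a minimal sketch using the XML package might look like the following. The inline catalog document and its element names are hypothetical and serve only to show parsing and XPath extraction.

```r
library(XML)

# Hypothetical XML document supplied as a string
xml_text <- "<catalog>
  <book><title>R Basics</title><price>10</price></book>
  <book><title>Text Mining</title><price>20</price></book>
</catalog>"

# Parse the string into an internal XML document
doc <- xmlParse(xml_text, asText = TRUE)

# Pull out node values with XPath expressions
titles <- xpathSApply(doc, "//book/title", xmlValue)
prices <- as.numeric(xpathSApply(doc, "//book/price", xmlValue))

# Flatten the repeated <book> records into a data frame
books <- xmlToDataFrame(doc)

print(titles)
print(prices)
print(books)
```

The extracted titles are plain character vectors, so they can be passed to the same corpus and matrix functions used for any other text source.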
