Chapter 6

Security and Text Mining


Massive amounts of unstructured data are being collected from online sources such as e-mails, call center transcripts, wikis, online bulletin boards, blogs, tweets, and Web pages. The R programming language offers a rich collection of packages and functions for analyzing unstructured text data. Some functions split text into individual words, a process known as tokenizing, and count how often each unique word occurs. Other functions cleanse text data by removing white space and punctuation, converting to lowercase, and removing less meaningful words through a stop word list. Apache Hive functions also provide the means for tokenizing large amounts of text ...
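The cleansing and tokenizing steps described above can be sketched in a few lines of base R. This is only an illustrative sketch, not the chapter's own code: the sample sentences in `docs` and the short `stop_words` vector are made up for the example, and a real analysis would more likely use a dedicated text mining package and a much fuller stop word list.

```r
# Hypothetical sample text; not taken from the chapter.
docs <- c("The firewall blocked the attack.",
          "An attack on the  Firewall was logged!")

# Assumed, deliberately tiny stop word list for illustration only.
stop_words <- c("the", "an", "on", "was", "a")

clean <- tolower(docs)                      # convert to lowercase
clean <- gsub("[[:punct:]]", "", clean)     # remove punctuation
clean <- gsub("\\s+", " ", trimws(clean))   # trim and collapse white space

# Tokenize: split each cleaned string on single spaces.
tokens <- unlist(strsplit(clean, " ", fixed = TRUE))

# Drop the less meaningful words found in the stop word list.
tokens <- tokens[!tokens %in% stop_words]

# Count how often each unique word occurs, most frequent first.
freq <- sort(table(tokens), decreasing = TRUE)
print(freq)
```

Cleansing before counting matters here: without lowercasing, "firewall" and "Firewall" would be tallied as two different words.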

