Chapter 6

Security and Text Mining


Massive amounts of unstructured data are collected from online sources such as e-mails, call center transcripts, wikis, online bulletin boards, blogs, tweets, and Web pages. The R programming language offers a rich collection of packages and functions for analyzing unstructured text data. Some functions split text into individual words, a process known as tokenizing, and count how often each unique word occurs. Other functions cleanse text data by removing white space and punctuation, converting to lowercase, and removing less meaningful words through a stop word list. Apache Hive functions also provide the means for tokenizing large amounts of text ...
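
To make the cleansing and tokenizing steps concrete, here is a minimal sketch. It is written in plain Python rather than R or Hive, purely as a language-neutral illustration, and it is not the chapter's own method. The stop word list, the function names (`cleanse`, `tokenize`, `word_frequencies`), and the sample log lines are all assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical stop word list for illustration only; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "for"}


def cleanse(text):
    """Lowercase the text, drop punctuation, and collapse white space."""
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit, or white space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of white space into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split cleansed text into word tokens and remove stop words."""
    return [word for word in cleanse(text).split() if word not in STOP_WORDS]


def word_frequencies(documents):
    """Count how often each remaining word occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


# Made-up sample text standing in for unstructured security data.
docs = [
    "Failed login for admin;  the account was LOCKED.",
    "Login failed again: account locked, password reset requested!",
]
freqs = word_frequencies(docs)
print(freqs.most_common(4))
```

The order of the steps matters in this sketch: lowercasing and punctuation removal happen before the stop word check, so that "The" and "the," are both recognized as the same stop word.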

