Chapter 6

Security and Text Mining


Massive amounts of unstructured data are being collected from online sources such as e-mails, call center transcripts, wikis, online bulletin boards, blogs, tweets, and Web pages. The R programming language offers a rich collection of packages and functions for analyzing unstructured text data. Some functions split text into individual words, a process known as tokenizing, and count how often each unique word occurs. Others cleanse text data by removing white space and punctuation, converting to lowercase, and removing less meaningful words through a stop word list. Apache Hive functions also provide the means for tokenizing large amounts of text ...
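
The cleansing and tokenizing steps described above can be sketched in a few lines. This is a minimal illustration, not the chapter's R code: it uses only the Python standard library, and the function names, the tiny stop word list, and the sample log text are all hypothetical choices made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "from", "is", "in"}


def clean_and_tokenize(text):
    """Lowercase, strip punctuation, collapse white space, drop stop words."""
    text = text.lower()
    # Replace anything that is not a word character or white space with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # split() with no arguments also collapses runs of white space.
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]


def word_frequencies(text):
    """Count how often each remaining unique word occurs."""
    return Counter(clean_and_tokenize(text))


# Made-up sample text standing in for a log line or transcript.
sample = "Failed login from the VPN.  FAILED login, and a password reset!"
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

Lowercasing before counting is what lets "Failed" and "FAILED" be tallied as the same word. Removing stop words afterward keeps frequent but uninformative terms from dominating the counts.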
