In the last several years, we’ve had an explosion of data that led to the big data movement. A large quantity of this data is in text form. Sometimes it is structured data that could include numbers, sometimes it is semi-structured data such as JSON, XML, and even HTML documents, and sometimes it is unstructured text, which can also be embedded in these semi-structured documents.
When it comes to text, we often hear about sentiment analysis and buzz analysis, but there is a lot more that can be done with text. This chapter discusses text analytics and how Streams can help address a number of text analytics problems. It also covers the Text Analytics toolkit, which allows for a more deterministic analysis that can support decisions based on meaning.
What is Text Analytics?
When we think about text analytics, we may think about anything from buzz (where it could be enough to find the mention of a specific subject) to machine learning and artificial intelligence (where the system can give recommendations in complex subjects such as law and healthcare). An example of the latter is IBM Watson, which is used in areas such as cancer research.
Between these two extremes lies a wide range of text analytics tasks, such as email spam filters, article classification, keyword matching, and information extraction. Different tasks call for different approaches, as we are about to see.
Word count revisited
If you look at basic examples ...