Before we discuss unstructured data analytics in more detail, let’s first define what is meant by “unstructured data.” Unstructured data, or unstructured information, is information that does not have a predefined data model or does not fit well into relational database tables. Such data typically has no identifiable structure and may include bitmap images and other objects, text, and other data types that are not part of a typical database. Unstructured information is frequently text-heavy but may also contain dates, numbers, and facts. The result is irregularity and ambiguity that make unstructured data harder for traditional computer programs to interpret than data stored in fielded form in relational databases or annotated (semantically tagged) in documents. For the same reason, unstructured data cannot easily be analyzed with traditional analytics techniques.2

Unstructured data analytics first emerged in the late 1990s as “text mining.” Early approaches treated and analyzed text as a bag of words, that is, as an unordered collection of words in which only the frequency of each word matters. Text mining soon evolved to use shallow linguistic techniques to handle variant word forms, such as abbreviations, plurals, and conjugations, as well as multiword terms known as n-grams. An n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items in question can be phonemes, syllables, letters, or words, depending on the application. An n-gram text analytics model is a type of probabilistic ...
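
To make the bag-of-words and n-gram ideas above concrete, here is a minimal Python sketch. The function names (`tokenize`, `bag_of_words`, `ngrams`) and the sample sentence are assumptions introduced for illustration only, and the tokenizer is deliberately naive: it does none of the abbreviation, plural, or conjugation handling mentioned above.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text, split on whitespace, and strip edge punctuation."""
    stripped = (word.strip(".,;:!?\"'()") for word in text.lower().split())
    return [token for token in stripped if token]


def bag_of_words(tokens):
    """Count how often each token occurs; word order is discarded."""
    return Counter(tokens)


def ngrams(items, n):
    """Return every contiguous run of n items, in order of appearance."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Hypothetical sample sentence, chosen only for this illustration.
sample = "Text mining treats text as a bag of words."

tokens = tokenize(sample)
word_counts = bag_of_words(tokens)         # word -> frequency
word_bigrams = ngrams(tokens, 2)           # word-level 2-grams
letter_trigrams = ngrams(list("text"), 3)  # letter-level 3-grams

print(word_counts.most_common(3))
print(word_bigrams[:3])
print(letter_trigrams)
```

The same `ngrams` helper works on words or letters because an n-gram is defined over any sequence of items; only the choice of item changes with the application.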
