1.3

The “Great Divide”

Abstract

Corporate data consists of structured data and unstructured data. Unstructured data, in turn, consists of repetitive and nonrepetitive data. The separation between repetitive data and nonrepetitive data can be called the “great divide”. Repetitive Big Data centers on Hadoop, where most of the activity consists of data management functions for very large amounts of data. Nonrepetitive data is organized around textual disambiguation, which includes such functions as sub doc processing, inline contextualization, taxonomical resolution, acronym resolution, standardization, stop word processing, homographic resolution, proximity resolution, and other functions.
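As a rough illustration of two of the functions named above, stop word processing and acronym resolution, the following minimal Python sketch may help. The function name `disambiguate` and the `STOP_WORDS` and `ACRONYMS` tables are hypothetical and invented for this example; they are not taken from the book, and real textual disambiguation involves far more than a pair of lookups.

```python
import re

# Hypothetical, illustrative lookup tables (assumptions, not from the book).
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "was", "and"}
ACRONYMS = {"hr": "human resources", "po": "purchase order"}


def disambiguate(text):
    """Lowercase the text, expand known acronyms, and drop stop words."""
    # Split into lowercase alphanumeric tokens, discarding punctuation.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    result = []
    for token in tokens:
        # Acronym resolution: replace a known acronym with its expansion.
        expanded = ACRONYMS.get(token, token)
        for word in expanded.split():
            # Stop word processing: skip words that carry little meaning.
            if word not in STOP_WORDS:
                result.append(word)
    return result


print(disambiguate("The PO was sent to HR"))
```

In this sketch the order of operations is a design choice: acronyms are expanded before stop words are removed, so that any stop words inside an expansion would also be filtered out.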

Keywords

corporate data
Hadoop
Big Data
textual disambiguation ...
