1.3

The “Great Divide”

Abstract

Corporate data consists of structured data and unstructured data. Unstructured data, in turn, consists of repetitive and nonrepetitive data. The separation between repetitive data and nonrepetitive data can be called the “great divide.” Repetitive Big Data centers on Hadoop, where most of the activity consists of data management functions for very large amounts of data. Nonrepetitive data is organized around textual disambiguation, which includes such functions as sub doc processing, inline contextualization, taxonomical resolution, acronym resolution, standardization, stop word processing, homographic resolution, and proximity resolution, among others.
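To make two of the functions named above more concrete, the following is a minimal sketch of acronym resolution followed by stop word processing. It is an illustration only, not the method the book describes: the lookup tables `STOP_WORDS` and `ACRONYMS` and every function name are hypothetical, and a real textual disambiguation system would draw on curated taxonomies and context rather than small hard-coded dictionaries.

```python
import re

# Hypothetical lookup tables, invented for this illustration.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "is", "was"}
ACRONYMS = {"hr": "human resources", "po": "purchase order"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def resolve_acronyms(tokens):
    """Replace each known acronym with the words of its expansion."""
    resolved = []
    for token in tokens:
        resolved.extend(ACRONYMS.get(token, token).split())
    return resolved


def remove_stop_words(tokens):
    """Drop tokens that carry little meaning on their own."""
    return [token for token in tokens if token not in STOP_WORDS]


def disambiguate(text):
    """Run the two steps in order: acronym resolution, then stop word removal."""
    return remove_stop_words(resolve_acronyms(tokenize(text)))


print(disambiguate("The PO was sent to HR"))
```

Acronyms are expanded before stop words are removed so that an expansion containing a stop word is filtered in the same pass. The ordering is a design choice made for this sketch, not something the text prescribes.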

Keywords

corporate data
Hadoop
Big Data
textual disambiguation ...

From Data Architecture: A Primer for the Data Scientist.