Chapter 10

CSRUD for Big Data

Abstract

This chapter describes how a distributed processing environment such as Hadoop Map/Reduce can be used to support the CSRUD Life Cycle for Big Data. The examples in this chapter use the match key blocking described in Chapter 9 as a data partitioning strategy for performing ER on large datasets. The chapter presents an iterative algorithm for finding the transitive closure of multiple match keys in a distributed processing environment, designed to minimize the amount of local memory required by each processor. It also outlines a structure for an identity knowledge base in a distributed key-value data store, and describes strategies and distributed processing workflows for capture and update phases ...
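
To make the idea of a transitive closure over multiple match keys concrete, the following is a minimal single-process sketch. It is not the chapter's algorithm. It substitutes a simple minimum-label propagation scheme, in which each pass over the match key groups stands in for one map/reduce round. The function name transitive_closure, the record identifiers, and the match key values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def transitive_closure(match_keys):
    """Cluster records that are linked, directly or indirectly, by shared match keys.

    match_keys: dict mapping a record id to the list of match keys generated
                for that record (hypothetical input shape).
    Returns a dict mapping each record id to a cluster label, which is the
    smallest record id in its connected component.
    """
    # Every record starts out as its own cluster.
    labels = {rid: rid for rid in match_keys}

    # "Map" step: group record ids by match key. A distributed job would
    # rebuild these groups in the shuffle of every round; here they are
    # built once because the keys do not change.
    groups = defaultdict(list)
    for rid, keys in match_keys.items():
        for key in keys:
            groups[key].append(rid)

    # Iterate until a full pass changes no label.
    changed = True
    while changed:
        changed = False
        # "Reduce" step: within each key group, pull every record down to
        # the smallest label currently seen in that group.
        for rids in groups.values():
            smallest = min(labels[r] for r in rids)
            for r in rids:
                if labels[r] > smallest:
                    labels[r] = smallest
                    changed = True
    return labels


# Hypothetical records: 1-2 share a name key, 1-3 share a phone key,
# 3-4 share another name key, and 5 shares nothing.
records = {
    1: ["SMITH|1980", "555-0100"],
    2: ["SMITH|1980"],
    3: ["555-0100", "JONES|1975"],
    4: ["JONES|1975"],
    5: ["DOE|1990"],
}
clusters = transitive_closure(records)
```

In this sketch, records 1 through 4 end up with the same label even though records 2 and 4 share no match key directly, which is the effect a transitive closure is meant to capture. Each reduce step only needs the labels of the records in one key group at a time, which hints at why an iterative formulation can keep per-processor memory small.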
