Identity matching

This section covers an important data preparation topic: identity matching. We discuss the Spark features that help resolve identity issues, along with some data matching solutions that Spark makes easier.

After this section, we will be able to handle some common data identity problems with Apache Spark.

Identity issues

During data preparation, we often have to deal with data elements that belong to the same person or unit but do not look alike. For example, we may have purchase data for customer Larry Z. and web activity data for L. Zhang. Is Larry Z. the same person as L. Zhang? How many identity variations like this does the data contain?
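To make the Larry Z. versus L. Zhang question concrete, here is a minimal sketch in plain Python. It is not a Spark API and not necessarily the approach this book takes. The function names (`name_tokens`, `token_compatible`, `possibly_same_person`) and the matching rule itself are assumptions made for illustration: two names are treated as candidates for the same identity when each pair of tokens is either equal or one is a single-letter initial of the other.

```python
import re


def name_tokens(name):
    """Lowercase a name and split it into alphabetic tokens.

    Example: "Larry Z." becomes ["larry", "z"].
    """
    return re.findall(r"[a-z]+", name.lower())


def token_compatible(a, b):
    """Hypothetical rule: tokens match if they are equal, or if the
    shorter one is a single-letter initial of the longer one."""
    if a == b:
        return True
    short, long_ = sorted((a, b), key=len)
    return len(short) == 1 and long_.startswith(short)


def possibly_same_person(name_a, name_b):
    """Return True when both names have the same number of tokens and
    every aligned pair of tokens is compatible."""
    ta, tb = name_tokens(name_a), name_tokens(name_b)
    if not ta or len(ta) != len(tb):
        return False
    return all(token_compatible(x, y) for x, y in zip(ta, tb))


print(possibly_same_person("Larry Z.", "L. Zhang"))  # True
print(possibly_same_person("Larry Z.", "M. Zhang"))  # False
```

A rule like this only flags candidate pairs. Larry Z. and L. Zhang are compatible, but so are Lisa Zhang and L. Zhang, so confirming a match usually needs further evidence such as an address, an email, or an account ID.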

Matching entities ...
