In this section, we will cover one important data preparation topic, which is about identity matching and related solutions. We will discuss some of Spark's special features for solving identity issues and also some data matching solutions made easy with Spark.
After this section, we will be capable of taking care of some common data identity problems with Apache Spark.
For data preparation, we often need to deal with some data elements that belong to the same person or units, but which do not look similar to them. For example, we may have purchased some data for customer Larry Z. and web activity data for L. Zhang. Is Larry Z a same person as L. Zhang? Are there many identity variations in the data?
Matching entities ...