Chapter 2. Data Standardization

As we discussed in Chapter 1, before we can successfully match or deduplicate data sources, we need to ensure our data is presented in a consistent manner and that any anomalies are removed or corrected. We will use the term data standardization to cover both the transformation of datasets into consistent formats and the cleansing of data to remove unhelpful extra characters that would otherwise interfere with the matching process.
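To make the two halves of that definition concrete, here is a minimal sketch using only the Python standard library. The function name `standardize_name` and the specific rules it applies (stripping accents, lowercasing, replacing punctuation with spaces, collapsing whitespace) are illustrative assumptions, not the method developed later in this chapter; the right rules always depend on the data at hand.

```python
import re
import unicodedata


def standardize_name(raw: str) -> str:
    """Return a simplified form of a name for comparison purposes.

    Hypothetical helper: the rules below are illustrative assumptions only.
    """
    # Transformation: split accented characters into base letter + accent mark,
    # then drop the accent marks so "Seán" and "Sean" look the same.
    decomposed = unicodedata.normalize("NFKD", raw)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    # Transformation: use a single, consistent letter case.
    lowered = unaccented.lower()

    # Cleansing: replace anything that is not a letter, digit, or space
    # (commas, hyphens, apostrophes, and so on) with a space.
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", lowered)

    # Cleansing: trim the ends and collapse repeated whitespace.
    return " ".join(cleaned.split())


if __name__ == "__main__":
    for example in ["  O'Brien,  Seán ", "SMITH-JONES", "Smith  Jones"]:
        print(repr(example), "->", repr(standardize_name(example)))
```

Note that aggressive rules like these trade information for consistency: after standardization, "Smith-Jones" and "Smith Jones" are indistinguishable, which helps matching but may not suit every dataset.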

In this chapter, we will get hands-on and work through a real-world example of this process. We will create our working environment, acquire the data we need, cleanse that data, and then carry out a simple entity resolution exercise that supports some basic analysis. We will conclude by examining the performance of our data matching process and considering how we might improve it.

First, let’s introduce our example and why we need entity resolution to solve it.

Sample Problem

Let’s work through an example problem to illustrate some of the common challenges we see in resolving entities between data sources and why data cleansing is an essential first step. Because we are limited to openly available public data sources, the example is slightly contrived, but it should still illustrate the need for entity resolution.

Let’s imagine we are researching factors that may influence whether members of the House of Commons, the lower house of the Parliament of the United Kingdom (UK), are reelected. We surmise that politicians ...
