Chapter 3. Pragmatic Challenges in Building Data Cleaning Systems

Acquiring and collecting data often introduces errors, including missing values, typos, mixed formats, duplicate entries for the same real-world entity, and even violations of business rules. As a result, “dirty data” has become the norm rather than the exception, and most solutions that deal with real-world enterprise data suffer from related pragmatic problems that hinder deployment in practical industry and business settings.

In the field of big data, we need new technologies that provide solutions for quality data analytics and retrieval on large-scale databases that contain inconsistent and dirty data. Not surprisingly, developing pragmatic data quality solutions is a challenging task, rich with deep theoretical and engineering problems. In this chapter, we discuss several of the pragmatic challenges caused by dirty data, and a series of principles that will help you develop and deploy data cleaning solutions.

Data Cleaning Challenges

In the process of building data cleaning software, there are many challenges to consider. In this section, we’ll explore seven characteristics of real-world applications, and the often-overlooked challenges they pose to the data cleaning process.

1. Scale

Record linkage and consistency checking are two of the building blocks of data quality. For example, detecting functional dependency violations involves algorithms of (at least) quadratic complexity, such as those that ...
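To make the quadratic cost concrete, here is a minimal sketch of a naive functional dependency check that compares every pair of records. The function name fd_violations, the dependency zip -> city, and the sample rows are all hypothetical, chosen only for illustration; this is not a method taken from the chapter.

```python
from itertools import combinations


def fd_violations(rows, lhs, rhs):
    """Return index pairs (i, j) of rows that agree on every lhs column
    but disagree on at least one rhs column.

    Hypothetical helper for illustration: it inspects all n*(n-1)/2
    pairs of rows, so its running time grows quadratically with n.
    """
    violations = []
    for (i, a), (j, b) in combinations(enumerate(rows), 2):
        same_lhs = all(a[col] == b[col] for col in lhs)
        same_rhs = all(a[col] == b[col] for col in rhs)
        if same_lhs and not same_rhs:
            violations.append((i, j))
    return violations


# Made-up sample data: the assumed rule is that zip determines city.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "10001", "city": "New York"},
]

violations = fd_violations(rows, lhs=["zip"], rhs=["city"])
print(violations)  # [(0, 1), (1, 3)]
```

With four rows the loop performs only six comparisons, but a table of one million rows would need roughly 5 × 10^11, which is why pairwise checks of this kind become a problem at scale.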
