Getting Data Right by Shannon Cutt

Chapter 3. Pragmatic Challenges in Building Data Cleaning Systems

Acquiring and collecting data often introduces errors, including missing values, typos, mixed formats, replicated entries of the same real-world entity, and even violations of business rules. As a result, “dirty data” has become the norm, rather than the exception, and most solutions that deal with real-world enterprise data suffer from related pragmatic problems that hinder deployment in practical industry and business settings.

Big data calls for new technologies that support reliable analytics and retrieval over large-scale databases containing inconsistent and dirty data. Not surprisingly, developing pragmatic data quality solutions is a challenging task, rich with deep theoretical and engineering problems. In this chapter, we discuss several of the pragmatic challenges caused by dirty data, and a series of principles that will help you develop and deploy data cleaning solutions.

Data Cleaning Challenges

In the process of building data cleaning software, there are many challenges to consider. In this section, we’ll explore seven characteristics of real-world applications, and the often-overlooked challenges they pose to the data cleaning process.

1. Scale

One of the building blocks in data quality is record linkage and consistency checking. For example, detecting functional dependency violations involves (at least) quadratic complexity algorithms, such as those that ...
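To make the quadratic cost concrete, here is a minimal sketch of a naive functional dependency check that compares every pair of records. It is an illustration only, not the chapter's method: the function name `fd_violations`, the column names `zip` and `city`, and the sample rows are all hypothetical.

```python
from itertools import combinations

def fd_violations(rows, lhs, rhs):
    """Return index pairs (i, j) of rows that violate the dependency lhs -> rhs.

    Two rows violate the dependency when they agree on every lhs column
    but differ on at least one rhs column. Every pair is compared, so the
    running time grows quadratically with the number of rows.
    """
    violations = []
    for (i, a), (j, b) in combinations(enumerate(rows), 2):
        same_lhs = all(a[c] == b[c] for c in lhs)
        same_rhs = all(a[c] == b[c] for c in rhs)
        if same_lhs and not same_rhs:
            violations.append((i, j))
    return violations

# Hypothetical sample data: zip is assumed to determine city.
rows = [
    {"zip": "02139", "city": "Cambridge"},
    {"zip": "02139", "city": "Boston"},
    {"zip": "10001", "city": "New York"},
    {"zip": "02139", "city": "Cambridge"},
]

result = fd_violations(rows, ["zip"], ["city"])
print(result)
```

With n records the loop performs n(n-1)/2 comparisons, which is why a check like this becomes impractical on large tables without grouping, blocking, or indexing on the left-hand-side columns.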
