5

Techniques for Data Cleaning

In this chapter, we will cover six key dimensions of data quality and the techniques that improve each of them, commonly known in machine learning as data cleaning techniques. Simply put, data cleaning is the process of improving data quality by fixing errors in data or removing erroneous data. As covered in Chapters 1 and 2, reducing errors in data is a highly efficient and effective way to improve model quality compared with model-centric techniques such as adding more data and/or implementing complex algorithms.
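
To make the distinction between fixing errors and removing erroneous data concrete, here is a minimal sketch in plain Python. The toy records, the field names (`id`, `country`, `age`), and the 0 to 120 age range are all hypothetical and chosen only for illustration. They are not taken from the book's datasets or code.

```python
# Hypothetical toy records; field names and validity rules are assumptions.
raw_records = [
    {"id": 1, "country": " us ", "age": 34},   # inconsistent formatting
    {"id": 2, "country": "US", "age": None},   # incomplete: missing age
    {"id": 3, "country": "Us", "age": -5},     # invalid: negative age
    {"id": 4, "country": "GB", "age": 51},     # already clean
]


def clean(records, min_age=0, max_age=120):
    """Fix records that can be repaired and drop those that cannot."""
    cleaned = []
    for rec in records:
        age = rec["age"]
        # Remove: a missing or out-of-range age cannot be repaired here.
        if age is None or not min_age <= age <= max_age:
            continue
        # Fix: normalise whitespace and casing of the country code.
        country = rec["country"].strip().upper()
        cleaned.append({"id": rec["id"], "country": country, "age": age})
    return cleaned


cleaned_records = clean(raw_records)
print(cleaned_records)
# [{'id': 1, 'country': 'US', 'age': 34}, {'id': 4, 'country': 'GB', 'age': 51}]
```

In this sketch the first record is fixed in place, while the second and third are removed because nothing in the record allows the age to be recovered. Whether to fix or remove is a judgment call that depends on the dataset and on how much information the remaining fields carry.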

At a high level, data cleaning techniques involve fixing or removing incorrect, incomplete, invalid, biased, inconsistent, stale, or corrupted data. As data is ...

