7

Data Profiling and Data Quality

When we work with multiple sources of data, bad data can easily pass through if no checks are in place. This can cause serious problems in downstream systems that rely on the accuracy of upstream data to build models, run business-critical applications, and so on. To make our data pipelines resilient, we need data quality checks that ensure the data being processed meets the requirements of both the business and downstream applications.

Six primary data quality dimensions can be measured individually and used to improve data quality:

  • Completeness: Does your customer dataset that you plan to use for an upcoming marketing campaign have all of the ...
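As a rough illustration of measuring one dimension on its own, the sketch below scores completeness as the fraction of records in which a given field is present and non-blank. It is plain Scala with no Spark dependency, and everything in it is an assumption made for illustration: the `Customer` record, its field names, the `CompletenessCheck` helpers, and the choice to treat an empty dataset as fully complete are hypothetical and not taken from the book.

```scala
// Hypothetical record type; the field names are illustrative only.
final case class Customer(id: Long, name: Option[String], email: Option[String])

object CompletenessCheck {
  // Fraction of records for which the selected field is present and non-blank.
  def completeness[A](records: Seq[A])(field: A => Option[String]): Double =
    if (records.isEmpty) 1.0 // assumption: an empty dataset counts as complete
    else records.count(r => field(r).exists(_.trim.nonEmpty)).toDouble / records.size

  // Returns the records unchanged when the score meets the caller-chosen
  // threshold, otherwise a message describing the shortfall.
  def requireCompleteness[A](records: Seq[A], threshold: Double)(
      field: A => Option[String]): Either[String, Seq[A]] = {
    val score = completeness(records)(field)
    if (score >= threshold) Right(records)
    else Left(s"completeness $score is below threshold $threshold")
  }
}

object Demo {
  // Made-up sample data: two of the four records have a usable email.
  val customers: Seq[Customer] = Seq(
    Customer(1L, Some("Ada"), Some("ada@example.com")),
    Customer(2L, Some("Grace"), None),
    Customer(3L, None, Some("  ")),
    Customer(4L, Some("Linus"), Some("linus@example.com"))
  )

  def main(args: Array[String]): Unit = {
    println(CompletenessCheck.completeness(customers)(_.email))
    println(CompletenessCheck.requireCompleteness(customers, 0.9)(_.email).isRight)
  }
}
```

Returning an `Either` rather than throwing lets a pipeline stage decide for itself whether a low score should stop the run or only be logged. The same ratio could be computed over a Spark DataFrame, but that variant is not shown here.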

