Chapter 2. Curating Your Data

Academics define data curation as “the act of discovering a data source(s) of interest, cleaning and transforming the new data, semantically integrating it with other local data sources, and deduplicating the resulting composite.”1
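The steps named in that definition can be sketched in a few lines of Python. This is a minimal illustration only, not a method from the book: the field names, the `FIELD_MAP` mapping, the sample records, and the choice of email address as the deduplication key are all assumptions made for the example.

```python
# Hypothetical mapping from a new source's field names to the local schema
# (a very simple stand-in for "semantic integration").
FIELD_MAP = {"full_name": "name", "mail": "email"}


def transform(record):
    """Rename the new source's fields to match the local schema."""
    return {FIELD_MAP.get(key, key): value for key, value in record.items()}


def clean(record):
    """Collapse stray whitespace in names and normalize email case."""
    return {
        "name": " ".join(record["name"].split()),
        "email": record["email"].strip().lower(),
    }


def curate(new_source, local_source):
    """Transform, clean, integrate, and deduplicate (assumed key: email)."""
    prepared = [clean(transform(r)) for r in new_source]
    combined = local_source + prepared
    unique = {}
    for record in combined:
        # Keep the first record seen for each email address.
        unique.setdefault(record["email"], record)
    return list(unique.values())


local_records = [{"name": "Ada Lovelace", "email": "ada@example.com"}]
new_records = [
    {"full_name": "  Ada   Lovelace ", "mail": " ADA@Example.com "},
    {"full_name": "Grace Hopper", "mail": "grace@example.com"},
]

curated = curate(new_records, local_records)
```

Here the duplicate Ada Lovelace record from the new source is dropped after cleaning, while the Grace Hopper record is added to the composite. Real pipelines typically need fuzzier matching than an exact key comparison.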

CDOs think of data curation more broadly: as the strategic and systematic process of organizing, managing, and maintaining data to ensure its quality, integrity, and usability across the enterprise, so that it can serve a variety of business use cases and applications, from basic reporting to advanced ML and AI.

Both parties agree that data curation involves data collection, validation, transformation, storage, preservation, and dissemination. From the practical perspective of the C-suite, however, data curation needs to go beyond preparing data for individual applications. As data volumes continue to grow, the ability to automate data curation effectively at scale has become a critical factor in supporting modern, complex, cross-functional business initiatives.

This chapter explores the methods for automating and managing data curation. Let’s start by looking at what good data curation at scale looks like.

The Value of Curating Data

Effective large-scale data curation forms the cornerstone upon which robust data governance practices are built, enabling organizations to establish trust in their data to meet business needs. Achieving this entails implementing data integration ...
