Chapter 2. Curating Your Data
Academics define data curation as “the act of discovering a data source(s) of interest, cleaning and transforming the new data, semantically integrating it with other local data sources, and deduplicating the resulting composite.”1
CDOs think of data curation more broadly: as the strategic and systematic process of organizing, managing, and maintaining data so that its quality, integrity, and usability hold across the enterprise and meet the needs of a variety of business use cases and applications, from basic reporting to advanced ML and AI.
Both parties agree that data curation involves data collection, validation, transformation, storage, preservation, and dissemination. From the practical perspective of the C-suite, however, data curation needs to go beyond preparing data for individual applications. As data volumes continue to grow, the ability to automate data curation effectively at scale has become a critical factor in supporting modern, complex, cross-functional business initiatives.
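To make the steps in the academic definition concrete (cleaning, transforming, integrating with local data, and deduplicating), here is a minimal Python sketch. It is an illustration under stated assumptions, not a method from this book: the function names, the sample records, and the choice of email as the matching key are all hypothetical, and real curation at scale typically relies on fuzzy or ML-based matching rather than exact keys.

```python
def clean(record):
    """Trim whitespace and drop empty fields (illustrative cleaning rule)."""
    return {k: v.strip() for k, v in record.items() if v and v.strip()}


def transform(record):
    """Normalize the fields used for matching (illustrative rule)."""
    out = dict(record)
    if "email" in out:
        out["email"] = out["email"].lower()
    if "name" in out:
        out["name"] = " ".join(out["name"].split()).title()
    return out


def curate(new_records, local_records, key="email"):
    """Clean and transform new records, merge with local ones, dedupe on `key`."""
    merged = {}
    prepared = [transform(clean(r)) for r in new_records]
    for record in list(local_records) + prepared:
        k = record.get(key)
        if k is None:
            continue  # records without the key cannot be matched in this sketch
        # Later records fill in or overwrite fields of earlier ones.
        merged[k] = {**merged.get(k, {}), **record}
    return list(merged.values())


# Hypothetical sample data: one existing local record, two newly discovered ones.
local = [{"name": "Ada Lovelace", "email": "ada@example.com"}]
new = [
    {"name": "  ada   lovelace ", "email": "ADA@Example.com ", "phone": "555-0100"},
    {"name": "Alan Turing", "email": "alan@example.com", "phone": ""},
]
curated = curate(new, local)
```

In this sketch the two "Ada" records collapse into one composite record that gains a phone number, while the empty phone field on the second new record is dropped during cleaning.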
This chapter explores the methods for automating and managing data curation. Let’s start by looking at what good data curation at scale looks like.
The Value of Curating Data
Effective large-scale data curation forms the cornerstone upon which robust data governance practices are built, enabling organizations to establish trust in their data to meet business needs. Achieving this entails implementing data integration ...