Chapter 2. Metadata Catalog Service
Assume a data user wants to develop a revenue dashboard. By talking to peer data analysts and scientists, the user finds a dataset containing customer billing records. Within that dataset, they come across an attribute called “billing rate.” What does the attribute mean? Is it the source of truth, or is it derived from another dataset? Other questions follow: What is the schema of the data? Who manages it? How was it transformed? How reliable is the data quality? When was it last refreshed? There is no shortage of data within the enterprise, but consuming that data to solve business problems is a major challenge today. This is because building insights in the form of dashboards and ML models requires a clear understanding of the properties of the data (referred to as metadata). In the absence of comprehensive metadata, one can make inaccurate assumptions about the meaning of the data and about its quality, leading to incorrect insights.
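To make the questions above concrete, here is a minimal Python sketch of what a metadata record for the billing dataset might hold and how a missing answer could be detected. The class names, field names, the `metadata_gaps` helper, and the sample values are all hypothetical and chosen for illustration only; they are not the schema of any particular catalog product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class AttributeMetadata:
    """Metadata for one attribute (column). Field names are hypothetical."""
    name: str
    data_type: str
    description: str = ""                # business meaning of the attribute
    derived_from: Optional[str] = None   # upstream attribute; None if source of truth


@dataclass
class DatasetMetadata:
    """Metadata for one dataset. Field names are hypothetical."""
    name: str
    owner: str = ""                               # who manages the dataset
    last_refreshed: Optional[datetime] = None     # when the data was last refreshed
    attributes: Dict[str, AttributeMetadata] = field(default_factory=dict)


def metadata_gaps(dataset: DatasetMetadata, attribute: str) -> List[str]:
    """Return the questions the metadata cannot answer for one attribute."""
    gaps = []
    if not dataset.owner:
        gaps.append("owner")
    if dataset.last_refreshed is None:
        gaps.append("last_refreshed")
    attr = dataset.attributes.get(attribute)
    if attr is None:
        gaps.append("schema")        # the attribute is not cataloged at all
    elif not attr.description:
        gaps.append("description")   # cataloged, but its meaning is undocumented
    return gaps


# Made-up sample record: the schema is known, but the meaning of
# "billing_rate" was never documented.
billing = DatasetMetadata(
    name="customer_billing",
    owner="billing-team",
    last_refreshed=datetime(2024, 1, 1, tzinfo=timezone.utc),
    attributes={
        "billing_rate": AttributeMetadata(name="billing_rate", data_type="decimal"),
    },
)

print(metadata_gaps(billing, "billing_rate"))  # ['description']
```

In this sketch the user can see the type of the attribute but not its meaning, which is the situation described above: the data exists, yet its properties have to be guessed.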
Getting reliable metadata is a pain point for data users. Prior to the big data era, data was curated before being added to the central warehouse—the metadata details, including schema, lineage, owners, business taxonomy, and so on, were cataloged first. This is known as schema-on-write (illustrated in Figure 2-1). Today, the approach with data lakes is to first aggregate the data and then infer the data details at the time of consumption. This is known as schema-on-read (illustrated ...
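The contrast between schema-on-write and schema-on-read can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not how any specific warehouse or data lake works: `BILLING_SCHEMA`, `write_with_schema`, `infer_schema`, and the sample records are all hypothetical.

```python
import json
from typing import Any, Dict, List, Set

# Hypothetical schema, declared up front as in the schema-on-write approach.
BILLING_SCHEMA = {"customer_id": str, "billing_rate": float}


def write_with_schema(table: List[Dict[str, Any]], record: Dict[str, Any]) -> None:
    """Schema-on-write: validate a record against the declared schema before storing it."""
    if set(record) != set(BILLING_SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(record)}")
    for column, expected_type in BILLING_SCHEMA.items():
        if not isinstance(record[column], expected_type):
            raise ValueError(f"{column} must be {expected_type.__name__}")
    table.append(record)


def infer_schema(raw_lines: List[str]) -> Dict[str, Set[str]]:
    """Schema-on-read: data was stored raw; infer field names and types when reading."""
    inferred: Dict[str, Set[str]] = {}
    for line in raw_lines:
        for key, value in json.loads(line).items():
            inferred.setdefault(key, set()).add(type(value).__name__)
    return inferred


# Schema-on-write: a malformed record is rejected at ingestion time.
warehouse: List[Dict[str, Any]] = []
write_with_schema(warehouse, {"customer_id": "c1", "billing_rate": 9.5})

# Schema-on-read: raw lines are aggregated as-is, and inconsistencies
# only surface when a consumer infers the schema.
lake = [
    '{"customer_id": "c1", "billing_rate": 9.5}',
    '{"customer_id": "c2", "billing_rate": "9.50"}',
]
print(infer_schema(lake))
```

In the second case the inferred schema reports two types for `billing_rate`, a small example of the details a consumer has to discover when the data was not cataloged before it was stored.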