9 Data Discovery

After reading this chapter, you should be able to:

  • Reason about the importance of data discovery
  • Learn data governance practices
  • Understand tools for data discovery

Data discovery is a broad term that can mean anything from generating insights from different data sources to any sort of business intelligence activity. In this chapter, however, I use it in a narrower sense: data discovery makes metadata easily accessible, documented, and presented in different ways. It is a key piece of the data platform because it enables the rest of the platform's activities. When data is not discoverable, consuming it involves friction. The better documented, more available, and more searchable the metadata, the quicker data can be extracted.
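To make the idea of documented, searchable metadata concrete, here is a minimal sketch of an in-memory metadata catalog. It is only an illustration under stated assumptions: the names `DatasetMetadata` and `MetadataCatalog`, the chosen fields, and the sample datasets are all hypothetical and do not correspond to any particular discovery tool.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record describing one dataset."""
    name: str
    owner: str
    description: str
    tags: list[str] = field(default_factory=list)


class MetadataCatalog:
    """Toy in-memory catalog; a real discovery tool would persist and index entries."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetMetadata] = {}

    def register(self, entry: DatasetMetadata) -> None:
        # Later registrations under the same name overwrite earlier ones.
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[str]:
        """Return sorted dataset names whose name, description, or tags contain keyword."""
        needle = keyword.lower()
        hits = []
        for entry in self._entries.values():
            haystack = " ".join([entry.name, entry.description, *entry.tags]).lower()
            if needle in haystack:
                hits.append(entry.name)
        return sorted(hits)


# Illustrative sample data (made up for this sketch).
catalog = MetadataCatalog()
catalog.register(DatasetMetadata(
    "orders_daily", "sales-team", "Daily order totals per region", ["sales", "orders"]))
catalog.register(DatasetMetadata(
    "clickstream_raw", "web-team", "Unprocessed web click events", ["web", "events"]))

print(catalog.search("orders"))
```

Even in this toy form, a consumer can find a dataset from a keyword rather than by asking around, which is the friction that discoverable metadata is meant to remove.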

This chapter discusses the importance of data discovery for a modern Big Data platform, examines the link between data discovery and data governance, and explores the tooling used in Big Data discovery.

9.1 Need for Data Discovery

Data warehouses, and later data lakes, have made data available in many forms. The sheer quantity of data piped into data lakes creates confusion, inconsistency, and errors. Since data is stored without requiring a structure, it becomes inherently harder to keep metadata consistent and meaningful. In big organizations, the problem is more challenging because numerous teams in different departments use different tooling on shared infrastructure.

For a typical data lake, data sources are staggeringly ...
