Chapter 10. Managing Metadata in Azure

When you think of metadata, you probably think of schema: the names of tables, and the names and types of the fields each table contains. This is the sort of information managed by localized metadata stores such as the Hive metastore, which manages the metadata for external tables used by both Spark and Hive on HDInsight.

There is also a bigger-picture concern: how you manage metadata across all of the data assets in your data lake. In this chapter we explore one approach to doing so, using Azure Data Catalog (Figure 10-1).

Managing Metadata with Azure Data Catalog

When you first start collecting data assets, keeping track of what data lives where is easy. You have this database for your ecommerce transactions and that data warehouse for your analytics, and you could describe both to a new person on your team in a sentence or two. However, as your data needs evolve to encompass a data lake, you face an explosion of databases, multiple data warehouses, and hyperscale filesystems. How do you help your new hire find the transaction history log they are asking for? That is the goal of Azure Data Catalog, a fully managed cloud service that enables users to discover the data sources they need for themselves, and to confirm that they have found the data they are looking for, all without moving the data out of the data store in which it resides.
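To make the idea concrete, here is a minimal sketch of what a catalog does: it stores descriptions of data sources, not the data, and lets users search those descriptions. This is not the Azure Data Catalog API. The `DataAsset` and `Catalog` classes, the asset names, and the locations are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataAsset:
    """Metadata describing a data source; the data itself stays where it lives."""
    name: str
    source_type: str          # e.g. "SQL database", "data warehouse"
    location: str             # hypothetical address of the source
    description: str = ""
    tags: set = field(default_factory=set)


class Catalog:
    """A toy in-memory catalog (hypothetical, not a real service client)."""

    def __init__(self):
        self._assets = {}

    def register(self, asset):
        # Only the metadata is recorded; nothing is copied from the source.
        self._assets[asset.name] = asset

    def search(self, term):
        # Case-insensitive match against name, description, and tags.
        term = term.lower()
        return [
            a for a in self._assets.values()
            if term in a.name.lower()
            or term in a.description.lower()
            or any(term in t.lower() for t in a.tags)
        ]


catalog = Catalog()
catalog.register(DataAsset(
    "ecommerce_db", "SQL database", "sql://example/ecommerce",
    "Ecommerce transactions", {"sales"}))
catalog.register(DataAsset(
    "analytics_dw", "data warehouse", "dw://example/analytics",
    "Analytics reporting", {"reporting"}))
catalog.register(DataAsset(
    "transaction_history_log", "filesystem", "fs://example/logs/transactions",
    "Raw transaction history log files", {"logs"}))

# The new hire searches the catalog instead of asking around.
for asset in catalog.search("history log"):
    print(asset.name, "->", asset.location)
```

In this sketch the search result points the user to where the data lives; the data never passes through the catalog.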

Azure Data Catalog is intended to enable any user—from developers ...

