Chapter 4. Data Catalogs
The storage layer is the foundation of the lakehouse architecture because it holds the data for the entire platform. To search, explore, and discover this stored data, users need a data catalog. This chapter focuses on what a data catalog is and on the overall metadata management process that enables lakehouse platform users to search and access the data.
In the first section of this chapter, I’ll explain fundamental concepts like metadata, metastore, and data catalogs. These are not new concepts; organizations have long been implementing data catalogs in both traditional data warehouses and modern data platforms. I’ll explain these core concepts first in order to set up our discussion of the advanced features later in the chapter.
We will discuss how data catalogs in lakehouse architecture differ from those in traditional and combined architectures, and how they give users a unified view of all metadata. We will also discuss the additional benefits of data catalogs in lakehouse architecture, which let users leverage metadata to implement unified data governance, permission control, lineage, and sharing mechanisms.
In the last section of this chapter, I’ll discuss some of the popular data catalog technology options available across cloud platforms. You’ll learn about design considerations and practical limitations that can help you make informed decisions when designing the data catalogs in your lakehouse platform.
Understanding Metadata ...