Chapter 4. Data Catalogs

The storage layer within the lakehouse architecture is important, as it stores the data for the entire platform. To search, explore, and discover this stored data, users need a data catalog. This chapter will focus on understanding a data catalog and the overall metadata management process that enables lakehouse platform users to search and access the data.

In the first section of this chapter, I’ll explain fundamental concepts like metadata, metastore, and data catalogs. These are not new concepts; organizations have long been implementing data catalogs in both traditional data warehouses and modern data platforms. I’ll explain these core concepts first in order to set up our discussion of the advanced features later in the chapter.

We will discuss how data catalogs differ in lakehouse architecture, as compared to the traditional and combined architectures, and how they help users get a unified view of all metadata. We will also discuss the additional benefits of data catalogs in lakehouse architecture, which allow users to leverage metadata to implement unified data governance, permission control, lineage, and sharing mechanisms.

In the last section of this chapter, I’ll discuss some of the popular data catalog technology options available across cloud platforms. You’ll learn about design considerations and practical limitations that can help you make an informed decision while designing the data catalogs in your lakehouse platform.

Understanding Metadata

Just as we need processes to manage the data within the platform, we also need well-defined approaches to manage the metadata. A sound metadata management process helps to simplify data search and discovery for platform users.

Metadata is often defined as “data about data.” It is as significant as the data itself. Metadata helps define the data by providing additional descriptive information, such as attribute names, datatypes, filenames, and file sizes.

Metadata provides the required structure and other relevant information to make sense of data. It helps users discover, understand, and find the exact data they need for their specific requirements.

Metadata is broadly categorized as technical metadata and business metadata.

Technical Metadata

Technical metadata provides technical information about the data. A simple example of technical metadata is the schema details of any table. The schema comprises attribute names, datatypes, lengths, and other associated information. Table 4-1 lists the schema of a product table with three attributes.

Table 4-1. Product table schema
Attribute name     Attribute type  Attribute length  Attribute constraint
product_id         integer         -                 Not Null
product_name       string          100               Null
product_category   string          50                Null

Similar to tables, other objects (like files) also have metadata. File metadata provides details like the filename, creation or update time, file size, and access permission. Files like CSVs sometimes have a header record that defines the attribute names of the data. JSON and XML files also have attribute names within them. As seen in Chapter 3, file formats like Apache Parquet, Apache ORC, and Apache Avro also carry metadata information.
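To make this concrete, here is a minimal sketch, in Python with the PyArrow library, of reading the metadata embedded in a Parquet file without scanning its data; the filename products.parquet is a hypothetical example:

import pyarrow.parquet as pq

# Open the Parquet file and inspect the metadata it carries internally
parquet_file = pq.ParquetFile("products.parquet")

print(parquet_file.schema_arrow)             # attribute names and datatypes
print(parquet_file.metadata.num_rows)        # number of rows in the file
print(parquet_file.metadata.num_row_groups)  # internal row group layout
print(parquet_file.metadata.created_by)      # library/version that wrote the file

The schema and row counts printed here are exactly the kind of technical metadata that a metastore ingests and a data catalog later exposes for search and exploration.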

Business Metadata

Business metadata helps users understand the business meaning of data. Business metadata augments the technical metadata to give a business context to the data. Table 4-2 lists the business metadata of the product table.

Table 4-2. Product table business metadata
Attribute technical name  Attribute business name  Attribute business meaning
product_id                Product identifier       Unique identifier of the product
product_name              Product name             Name of the product
product_category          Product category         Category of the product

In this example, the technical attribute names are self-explanatory and you can easily understand their business meaning. However, this is not always the case.

Consider a scenario where you are using SAP as a source system, specifically its logistics module Materials Management (MM). MARA, which holds the general material data, is one of the most widely used tables in this SAP module. As shown in Table 4-3, the technical names of its attributes are not self-explanatory, and you would need to add business context for users to understand what data each attribute holds.

Table 4-3. SAP MARA table business metadata
Attribute technical name  Attribute business name  Attribute business meaning
MANDT                     Client                   Client name
MATNR                     Material number          Unique identifier of material
ERSDA                     Created on               Date when the material entry was created

Technical and business metadata are essential to better understand the data in your platform. A sound metadata management process should provide capabilities to maintain and manage technical and business metadata. It should support governance and security features like access control, sensitive data handling, and data sharing, which we will discuss later in this chapter.

How Metastores and Data Catalogs Work Together

While metadata management is a process to manage the metadata and make it available to users, we need solutions and tools to implement this process. Metastores and data catalogs are the solutions that help to build a sound metadata management process.

A metastore is a repository within the data platform where the metadata is physically stored. It acts as the central metadata storage system. You can access all metadata from this central storage.

A data catalog provides a mechanism to access the metadata stored within the metastore. It provides the required user interface to explore the metadata and search for various tables and attributes.

Figure 4-1 shows how metastores and data catalogs are related, and how they enable users to access metadata.

Figure 4-1. Metadata flow diagram

For example, in the traditional on-premises Hadoop ecosystems, Hive provided Hive Metastore (HMS) for storing metadata (for Hive tables created on top of HDFS data) and Hive catalog (HCatalog) to access the HMS tables from Spark or MapReduce applications.

Modern data catalogs provide a mechanism to manage metadata in a more organized way. They enable you to provision the right access controls to the right users so that they can access your data securely. You can logically divide the catalogs into databases or schemas that hold tables, views, and other objects. You can manage user access permissions at the catalog level or at the more granular schema or table level.

Figure 4-2 shows a real-world scenario of how a user might access specific catalogs based on their roles and permissions.

Figure 4-2. Catalog based on business units

As shown in Figure 4-2, users from “X” business unit can access only the catalog for business unit “X”, and users from “Y” business unit can only access the catalog for business unit “Y.”

It is not always necessary to categorize the catalogs based on business units. There can be various approaches and you can select what works best for you. You can create catalogs based on the various environments—like dev, test, and prod—or you can create one single catalog and control permissions at the schema- or table-level.
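As an illustration of these choices, here is a minimal sketch, assuming a Spark session and a Unity Catalog-style SQL dialect, of creating an environment-specific catalog and granting permissions at the catalog and schema levels; the catalog, schema, and group names are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One catalog per environment (dev shown here), with a schema per domain
spark.sql("CREATE CATALOG IF NOT EXISTS dev")
spark.sql("CREATE SCHEMA IF NOT EXISTS dev.sales")

# Coarse-grained: allow a group to browse the catalog
spark.sql("GRANT USE CATALOG ON CATALOG dev TO `sales_analysts`")

# Finer-grained: read access limited to one schema
spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA dev.sales TO `sales_analysts`")

The same pattern works if you prefer one catalog per business unit: only the grant statements change, not the overall approach.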

Note

Many data practitioners use the terms metastores and data catalogs interchangeably to describe metadata storage systems. Most of the modern cloud services that offer data catalog capabilities abstract the physical storage of metadata and only expose the data catalogs for users to browse and access schema, tables, and attributes. Behind every catalog, there is a physical storage where the actual metadata is stored.

Features of a Data Catalog

Data catalogs provide several key features that help platform administrators organize, manage, and govern data. The features discussed in this section help platform users search, explore, and discover the relevant data quickly.

Search, Explore, and Discover Data

Data catalogs provide users with an easy mechanism to search the required data and to understand where (which schema, table, or attribute) data exists so that they can query it. Data catalogs also offer features to add business descriptions to the tables and attributes.

Users can traverse the catalog, understand the business context, and discover data that might help them in further analysis.

Data Classification

Classification is the process of categorizing attributes based on certain specifications or standards. You can classify attributes based on domains (like customer, product, and sales) or sensitivity (like confidential, internal, or public). Classification helps users to more fully understand and leverage the data. For example, an attribute classified as “internal” indicates that users should not share the data outside their organization.

As part of the classification process, you can add tags to your metadata. For example, consider a scenario where you are implementing a lakehouse for an insurance provider. You would have several tables with data related to customers—like customer name, date of birth, and national identifier. All such attributes are personally identifiable information, or PII, attributes. You can tag these attributes as “pii_attributes” in your catalog and use these tags to implement governance policies to abstract this sensitive data from non-eligible or external users. We will discuss how to handle sensitive data in more detail in Chapter 6.
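The sketch below shows, assuming a Unity Catalog-style SQL dialect and hypothetical table and column names, how such PII tags could be applied to columns so that downstream governance policies can target them:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Columns in the customer table that hold personally identifiable information
pii_columns = ["customer_name", "date_of_birth", "national_id"]

for column in pii_columns:
    spark.sql(
        f"ALTER TABLE insurance.customers ALTER COLUMN {column} "
        "SET TAGS ('classification' = 'pii_attributes')"
    )

Once tagged, masking or row-filtering policies can reference the tag instead of listing every sensitive column individually.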

Note

PII attributes are portions of data that can be used to identify a particular individual and include national IDs, email IDs, phone numbers, and date of birth.

For compliance purposes, it is mandatory to abstract such information from data consumers. You should give access to PII attributes only to a specific set of users based on their organizational role.

You should also implement data governance policies to hide or mask PII attributes from users who are not authorized to see their values.

Data classification helps in managing data, implementing governance policies, and securing data within the platform.

Data Governance and Security

Data catalogs act as gatekeepers of data and help in implementing the data governance and security policies necessary to manage, govern, and secure the data across the organization.

Data catalogs provide the following governance and security features:

  • Support for implementing standard rules and constraints to maintain data quality

  • Support for audit processes, such as tracking which users access specific tables or attributes, as required for compliance reporting

  • Support for fine-grained permission control for users who access the data

  • Capabilities to filter or abstract sensitive data stored within the platform

  • Support for secure data sharing with data consumers

Data governance is a broad topic, and we’ll cover it in more detail in Chapter 6.

Data Lineage

Any data and analytics ecosystem consists of multiple jobs that ingest data from source systems, transform it, and finally load it to the target storage for users’ consumption. Within this storage, there can be hundreds of tables with thousands of attributes, through which the data flows across various components. As the system grows, the data assets keep increasing. To understand this flow across all the components in your platform, you need a mechanism that gives end-to-end details about how data navigates through these attributes. Data lineage is the process that provides this information.

Data lineage can also help you perform impact analysis whenever any attribute name, type, or length changes, and it can help you identify data assets, like tables, that are redundant or not used by any consumers. Data catalogs help you implement a data lineage solution to track the relationship between source and target attributes. We will discuss this in more detail in Chapter 6.

The features of a data catalog enable collaboration between different data teams and data personas within an organization, allowing business users to perform self-serve analysis by discovering and leveraging the data they need to make better decisions.

Unified Data Catalog

As discussed in Chapter 2, the combined architecture faces several limitations because it uses two different storage tiers—one for data lake and one for data warehouse. In such systems, you also face challenges associated with managing separate, siloed metastores and catalogs for the data lake and the data warehouse.

Challenges of Siloed Metadata Management

Most of the challenges associated with the siloed, individual data storage tiers in the combined architecture also apply to metadata management. These challenges include:

Maintenance

You need to maintain separate metadata for data lake objects and data warehouse tables, which adds to the overall maintenance efforts. You have to frequently replicate metadata between the two systems to sync changes from one system to another.

Data discovery

Data discovery becomes challenging in the combined architecture, as users have to browse two different data catalogs. Some objects, like summarized tables and aggregated views, might be available only in the data warehouse. In such cases, platform users need to know which system holds the data they seek.

Data governance and security

Due to siloed storage tiers, implementing data governance and security policies like access control, sensitive data handling, and secure sharing becomes challenging. In such environments, you cannot have a unified data governance policy that is easy and practical to implement and maintain.

Data lineage

For any change in name, datatype, or length of a specific column, you need to perform an impact analysis to identify the tables where the specific column is present. In combined architectures, the lineage view is limited to individual ecosystems (data lake or data warehouse); you can’t get an end-to-end understanding of the data flow.

Considering these challenges, it is beneficial to use a unified data catalog that can simplify the metadata management, data discovery, and governance processes. Lakehouse architecture enables you to implement this unified data catalog.

What Is a Unified Data Catalog?

A unified data catalog is a catalog that can hold metadata of all data assets, like tables, views, reports, and functions, as well as AI assets, like ML models and feature tables. A unified data catalog enables its users to govern all their data and AI assets from a single, central platform. In lakehouse architecture, all the assets across data and AI workloads reside within a single cloud storage layer, enabling platform administrators to implement a unified data catalog to manage and govern the entire ecosystem.

Figure 4-3 shows a unified data catalog within a lakehouse platform and the key features that it provides.

Figure 4-3. Unified data catalog in a lakehouse platform

As discussed earlier, a data catalog offers key features like search, discovery, governance, and lineage. In a unified data catalog, organizations can implement these features across all data objects, like tables, views, and reports, as well as across all assets, like models and feature stores, used in AI workloads.

A unified data catalog provides a single interface for different data personas—like data engineers, analysts, and scientists—to collaborate efficiently and work together to explore and leverage data. It acts as a central repository for technical as well as business users to search and discover data.

To summarize, as a data consumer, you can use a unified data catalog as your window to explore all the data and AI assets within the platform, browse the technical metadata of these assets, and understand the business context of the data.

Benefits of a Unified Data Catalog

The key benefits of a unified data catalog are as follows:

Unified search and data discovery

In lakehouse architecture, you can implement a single metastore layer to hold all the metadata across the ecosystem. Unlike the combined architecture, users can browse and explore the metadata of all data assets using a unified data catalog. This enables users to search the required tables or attributes quickly without knowing where the data physically resides within the system.

Data catalogs also provide features to augment technical data with business context. Data owners can add business descriptions and business meanings to attributes. This enables business users and technical users to easily discover data.

Consistent access controls

Managing and maintaining access to data is difficult. It becomes more challenging when you want to implement consistent access levels across your platform. Unified data catalogs help implement a consistent access control mechanism across the data ecosystem.

You can implement a consistent access control mechanism for different personas, irrespective of which compute engines they use. Consider a scenario where you want your sales team’s data engineers and data scientists to be able to access their business unit-specific data assets. Data engineers might use notebooks, while data scientists might want to query the feature store tables to access the data. Using the unified access control mechanism, you can provide the same access levels to both personas.

Unified data governance and security

With a unified data catalog, you can implement unified data governance and security policies that apply to all assets, including tables, files, functions, ML models, and feature tables. You can secure your data by applying consistent masking policies to sensitive data within the lakehouse. Any persona, irrespective of the tool used to access the data, can only see the data that they are eligible to access.

End-to-end data lineage

With a lakehouse employing a unified metastore and catalog, you can easily see the end-to-end lineage across all components. Some of the advanced data catalogs also provide capabilities for implementing federated catalogs, which can show the metadata for sources outside your data platform, as well as lineage that includes these sources.

Unifying various aspects of data management processes across all assets gives lakehouse platform users a consistent experience from wherever they access the data.

Implementing a Data Catalog: Key Design Considerations and Options

There are multiple tools and platforms that can help you implement data catalogs in a lakehouse. Every cloud provider has their own native services and most of the leading third-party products have features to implement data catalogs.

You can design and implement a unified data catalog based on your use case and the overall technical landscape. In this section, I’ll discuss some of the leading data catalog tools, design considerations, possible design choices, and key limitations to implementing a data catalog in a lakehouse platform.

We will discuss the widely adopted Hive metastore; cloud native data catalogs from AWS, Azure, and GCP; and data catalog offerings from third parties like Databricks.

Using Hive Metastore

Hive has been popular since the Hadoop days. Many organizations have adopted Hive metastore (HMS) to support their metadata management needs while implementing Hadoop ecosystems or modern data platforms. Traditional Hadoop ecosystems used MapReduce as a compute engine and HCatalog as the data catalog for accessing HMS. Spark also has a data catalog API to access metadata stored in HMS.

You can consider using HMS for storing the metadata for your data platform. HMS provides users the flexibility to use an external RDBMS to store metadata like table types, column names, and column datatypes. It serves as the central repository for storing and managing the metadata of tables created using different compute engines like Hive, Spark, or Flink. Native cloud services like AWS Glue, third-party platforms like Databricks, and many others offer options to use HMS for storing metadata.
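As a minimal sketch of what this looks like in practice, the following Spark session configuration points a job at an external HMS; the thrift endpoint is a hypothetical address:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hms-example")
    # Address of the Hive metastore service backed by an external RDBMS
    .config("hive.metastore.uris", "thrift://metastore-host:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Tables registered here become visible to any engine using the same HMS
spark.sql("SHOW DATABASES").show()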

Though many organizations have adopted HMS as their primary metastore, it has a couple of key challenges:

  • You have to provision and manage a separate RDBMS to store the metadata, adding to the maintenance overhead.

  • Since it is not a native cloud service, you have to spend extra effort on its integration, compared to cloud native data catalog services.

Considering these challenges, cloud service providers (CSPs) have introduced native cataloging services that simplify the metadata management process.

Using AWS Services

AWS offers two options for storing metadata—HMS and Glue Data Catalog.

Glue Data Catalog, a native AWS service, integrates easily with services like AWS Glue ETL, Amazon EMR, Amazon Athena, and AWS Lake Formation. You would use most of these services while implementing a lakehouse platform in AWS.

Note

Here is a quick description of the AWS services just mentioned. I’ll discuss these in detail in subsequent chapters of this book:

  • AWS Glue ETL is a serverless data integration service to create Spark jobs for data processing.

  • Amazon EMR provides a big data platform to run frameworks like Spark, Hive, Presto, and HBase for data processing, interactive analytics, and machine learning.

  • Amazon Athena is a serverless service for interactive analysis of data stored on S3.

  • AWS Lake Formation provides capabilities to secure and govern data in S3.

Figure 4-4 shows a simple flow diagram of how you can create Hudi files in S3, parse them to create metadata in the Glue Data Catalog, govern Hudi tables using Lake Formation, and query them using the Amazon Athena engine. You can also use other open table formats, like Iceberg or Delta Lake, instead of Hudi.

Figure 4-4. Lakehouse data flow in AWS ecosystem

As shown in the diagram, the Glue Data Catalog plays a significant role in lakehouse architecture to enable different personas to create, access, and query data residing on S3.

Like HMS, Glue Data Catalog also provides a central metadata repository for all data assets. Key features of AWS Glue Data Catalog are as follows:

  • It has deep integrations with other AWS services.

  • It is a fully managed, serverless service that does not need to be deployed or maintained by the user.

  • You can use Glue crawlers to parse the data files from S3 to create metadata within the catalog.

  • Its integration with Amazon Athena provides a UI to easily explore the schema, tables, and attributes.

  • Tables created using open source frameworks like Spark, Hive, and Presto within the EMR cluster can use Glue Data Catalog to store their metadata.

  • It integrates with AWS Lake Formation to provide users with fine-grained access control.

Amazon also offers a service called Amazon DataZone, which can help you implement a data catalog that has capabilities to augment the technical metadata with business metadata. You can import the metadata stored in Glue Data Catalog into DataZone and add business descriptions to the technical attributes to give them business context. You can further govern and share your data using DataZone, which internally uses Lake Formation for permission management and data sharing.

Consider the following key points when using AWS services to implement a data catalog for your lakehouse platform:

Glue Data Catalog instead of HMS

Glue Data Catalog is an alternative to HMS. You can use Glue Data Catalog as a metastore to store the metadata of tables created using query engines like Hive, Spark, and Presto within an Amazon EMR cluster. Glue Data Catalog supports storing metadata for Hudi, Iceberg, and Delta Lake tables. Support for your preferred open table format is one of the most important considerations when selecting a data catalog service.

Glue crawlers for automated metadata creation

You can use AWS Glue crawlers to crawl the data files in S3 and fetch (parse) the metadata. Glue crawlers store the metadata in the Glue Data Catalog and create the tables based on the records parsed from the files. You can use crawlers to generate the metadata for all your files stored in S3. Glue crawlers can also detect schema changes in the S3 data store. You can configure the crawlers to either update or ignore the table changes in the data catalog. A minimal sketch of creating a crawler programmatically appears after this list.

Table format support

Glue crawlers now also support Hudi, Iceberg, and Delta Lake files to automatically create tables in the Glue Data Catalog. Depending on your choice of table format, you can select the relevant option while creating the crawlers.

Lake Formation for data governance

Glue Data Catalog is well-integrated with Lake Formation, which helps you implement fine-grained access controls and other data governance features like role-based data filtering and secure data sharing.

Athena for data exploration

Glue Data Catalog has deep integration with Amazon Athena, a service for querying data in the S3 data lake. We will discuss this in detail in Chapter 5. Athena allows you to explore all databases, tables, and columns in the Glue Data Catalog.
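Here is the crawler sketch referenced earlier: a minimal example, using the boto3 SDK, of creating and starting a Glue crawler that parses S3 files into Glue Data Catalog tables. The bucket path, IAM role, database, and crawler names are hypothetical:

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="sales-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",  # target database in the Glue Data Catalog
    Targets={"S3Targets": [{"Path": "s3://example-lake/sales/"}]},
    SchemaChangePolicy={      # how the crawler handles schema drift
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
)

glue.start_crawler(Name="sales-data-crawler")

Once the crawler run completes, the resulting tables are visible in the Glue Data Catalog and can be queried through Athena or governed through Lake Formation.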

Using Azure Services

If you plan to implement a lakehouse using the Azure ecosystem, you will use services like Azure Synapse Analytics as the compute layer and ADLS as the storage layer.

Synapse Analytics offers two compute engines to process the data stored in ADLS: Synapse Spark pools and Synapse serverless SQL pools. Data engineers familiar with Spark programming can use the Spark pools. For data analysts who are more comfortable with SQL, Synapse offers serverless SQL pools. You can use either of them to process data stored in ADLS. We will discuss these compute engines in more detail in Chapter 5.

Figure 4-5 shows the two options that Synapse Analytics offers to maintain and manage the metadata for the data stored in ADLS—the lake database and the SQL database:

Lake database

Synapse Spark pools manage the lake databases. You can use lake databases for storing the metadata of the objects created using the Synapse notebooks. This includes the metadata of delta tables created using Spark pools.

SQL database

Serverless SQL pools manage the Synapse SQL databases. You can create tables using Synapse serverless SQL pools in the SQL database, and you can use the serverless SQL endpoints to connect Management Studio or Power BI to the tables within the SQL database and query data.

Figure 4-5. Metadata management using Synapse Analytics
Note

The Synapse SQL database is different from the Azure SQL database (relational database) service. The Synapse SQL database holds the metadata of tables created using the Synapse serverless SQL pools. It does not hold the actual data, as the data resides on ADLS.

Depending on which compute engine you use, you can select lake database (for Spark pools) or SQL database (for serverless SQL pools). As shown in Figure 4-5, the advantage of using the lake database is that you can access it from Synapse notebooks, as well as Synapse serverless SQL pools.

Figure 4-6 shows how you can create delta tables in ADLS and their metadata in a lake database, implement a unified catalog using Microsoft Purview, and query the data securely using Synapse serverless SQL pools for further analysis.

Figure 4-6. Lakehouse data flow in Azure ecosystem

As shown in Figure 4-6, you can create delta tables using Synapse notebooks, and then use serverless SQL pools endpoint to query the data using Power BI or any other database query editors like SQL Server Management Studio (SSMS).
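The following is a minimal sketch of that flow from the Spark side, assuming a Synapse notebook (where the spark session is pre-created) and hypothetical ADLS paths and names; the table registered here lands in the lake database, and its metadata also becomes visible through the serverless SQL pool endpoint:

# Read raw files from ADLS using the notebook's pre-created Spark session
df = spark.read.option("header", "true").csv(
    "abfss://raw@examplelake.dfs.core.windows.net/products/"
)

# Create a lake database and register a delta table in it
spark.sql("CREATE DATABASE IF NOT EXISTS lakedb")

(
    df.write.format("delta")
    .mode("overwrite")
    .saveAsTable("lakedb.products")  # metadata is stored in the lake database
)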

While the Synapse lake database and SQL database persist the metadata, they are not full-fledged cataloging solutions. Azure offers a service called Microsoft Purview that provides catalog capabilities. You can consider using it to implement a data catalog for your lakehouse.

Microsoft Purview offers support for unified data governance across on-premises, Azure native, and multi-cloud platforms, as well as support for data classification and the handling of sensitive data. It also offers features like data lineage, access control, and data sharing, as well as features to create and maintain a business glossary for business users. You can import the metadata from the Synapse SQL database into Microsoft Purview and leverage these features in your platform.

Tip

Microsoft Purview also supports importing metadata from Databricks. This is a useful feature if you are processing data using Databricks compute and want to maintain a data catalog using Microsoft Purview.

Using GCP Services

As with AWS and Azure, you can follow a similar pattern for creating your data catalog in GCP and using it as a central, unified layer for metadata management.

Figure 4-7 shows how you can create Iceberg tables in GCS and metadata in BigLake, centrally govern the metadata using Dataplex, and query it securely using BigQuery for further analysis.

Figure 4-7. Lakehouse data flow in GCP ecosystem
Note

Here is a quick description of the compute services shown in Figure 4-7, which I’ll discuss in subsequent chapters in detail:

  • Dataproc is a GCP-managed service to run Spark and Hadoop clusters.

  • BigQuery is a serverless data warehouse service that offers built-in BI and ML features.

BigLake

BigLake is a GCP service that enables BigQuery and other open source frameworks like Spark to access data stored in GCS with fine-grained access control. It supports the Iceberg open table format and enables BigQuery users to query Iceberg data stored in GCS as BigLake tables with controlled permissions.

Key features of BigLake are:

  • It provides a metastore to access Iceberg tables from BigQuery.

  • It can sync Iceberg tables created in Dataproc or BigQuery and make them available to users via the BigQuery SQL interface.

  • It enables administrators to implement fine-grained access control for Iceberg tables.

BigLake currently only supports Parquet data files for Iceberg and has a few more limitations.
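As an illustration, here is a minimal sketch of querying a BigLake table backed by Iceberg data in GCS through the BigQuery Python client; the project, dataset, and table names are hypothetical:

from google.cloud import bigquery

client = bigquery.Client(project="example-project")

query = """
    SELECT product_category, COUNT(*) AS product_count
    FROM `example-project.lake_dataset.products_iceberg`
    GROUP BY product_category
"""

# Fine-grained access controls defined on the BigLake table are enforced
# transparently when the query runs
for row in client.query(query).result():
    print(row.product_category, row.product_count)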

Dataplex

Dataplex is a GCP service that enables organizations to discover and govern their data assets. It provides capabilities to explore data, manage data lifecycle, and understand the data flow using end-to-end lineage.

BigLake integrates with Dataplex to provide a central access control mechanism for BigLake tables. You can consider Dataplex for your platform if you want to manage all your data assets from a single pane of glass.

Using Databricks

With multi-cloud adoption picking up, many organizations have started looking for third-party products that integrate well with their multi-cloud strategy. Databricks offers one such product, which enables organizations to adopt a multi-cloud strategy because it can leverage AWS, Azure, and GCP infrastructure for compute and storage.

Note

A multi-cloud strategy is an approach that uses multiple cloud platforms to get the best features and cost advantages provided by different CSPs. Many organizations now opt for more than one cloud provider to implement their data ecosystems.

Databricks offers a couple of options for cataloging metadata. You can use HMS or a native service known as Databricks Unity Catalog for managing, maintaining, and governing metadata. Unity Catalog helps implement a unified governance solution across the data and AI assets on a lakehouse.

Similar to AWS Glue Data Catalog, Unity Catalog is a native service within Databricks and provides easy integrations with Databricks features like Notebooks and Databricks SQL.

Note

Databricks SQL is a serverless compute engine within the Databricks lakehouse platform. You can use it to execute interactive queries in the lakehouse. When combined with a query editor (a service available within the Databricks UI for query authoring), you can easily browse data assets from the HMS or Unity Catalog and execute queries using SQL commands.

Figure 4-8 shows a simple flow diagram within a lakehouse implemented using Azure Databricks. You can create the delta files using Databricks Notebooks and access the delta tables from Databricks SQL.

Figure 4-8. Lakehouse data flow in Databricks ecosystem

Unity Catalog plays a significant role by providing capabilities to manage metadata at a central location that is accessible by Databricks Notebooks as well as Databricks SQL. It also provides the central access control mechanism for implementing data governance policies.
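As a minimal sketch, assuming a Databricks notebook (where the spark session is pre-created) and hypothetical catalog, schema, and table names, the following shows how a table created through Unity Catalog’s three-level (catalog.schema.table) namespace becomes queryable from both Notebooks and Databricks SQL:

# Sample data written as a managed table under catalog.schema.table
df = spark.createDataFrame(
    [(1, "Laptop", "Electronics"), (2, "Desk", "Furniture")],
    ["product_id", "product_name", "product_category"],
)

spark.sql("CREATE SCHEMA IF NOT EXISTS main.retail")

df.write.mode("overwrite").saveAsTable("main.retail.products")

# The same three-level name works from the Databricks SQL query editor:
# SELECT * FROM main.retail.products
spark.table("main.retail.products").show()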

Key features offered by the Unity Catalog are as follows:

  • Capability to manage and govern all your data and AI assets like tables, views, notebooks, ML models, feature tables, and dashboards

  • Ability to add business context, resulting in easy search and data discovery

  • Ability to provide federated catalogs (in preview at the time of writing this book) for external sources like MySQL, PostgreSQL, Snowflake, and Redshift

  • End-to-end data lineage across the Databricks ecosystems, including AI components

  • Secure data sharing by providing fine-grained access controls on data shares

  • Any schema and metadata changes done in Notebooks are reflected immediately in Databricks SQL without any lag (compared to Azure Synapse Analytics with Delta Lake)

Unity Catalog has recently been open sourced by Databricks and will soon start supporting various data and AI products.

Unity Catalog is an excellent option for implementing a lakehouse within the Databricks ecosystem. However, if you want to access and govern metadata outside Databricks, you can also consider other enterprise-grade catalogs that can ingest data from Databricks Unity Catalog and make it available to users outside Databricks for easier discovery and central governance.

Tip

Many features and services discussed in this section are relatively new or still in preview mode. These will evolve and mature gradually, and workarounds or solutions will be offered for some of the limitations we’ve discussed here. When exploring these tools for your use case, please consult the latest documentation and evaluate the latest versions.

Along with the cloud native catalogs, there are open source catalogs like Project Nessie and enterprise cataloging tools like Alation, Collibra, and Atlan that provide additional features and benefits that you can explore for specific requirements.

Key Takeaways

In this chapter, we discussed how you can store the metadata for all of your data assets in a metastore and access it using data catalogs based on access permissions. Lakehouse architecture enables you to implement a unified data catalog to manage, govern, and share all your data and AI assets.

Table 4-4 summarizes the various services available across the cloud platforms for implementing metadata management processes.

Table 4-4. Metadata management services across providers
Provider     Technical metadata management                Business metadata management and data governance
AWS          HMS, Glue Data Catalog                       DataZone
Azure        Synapse Lake Database, Synapse SQL Database  Microsoft Purview
GCP          BigLake                                      Dataplex
Databricks   HMS, Unity Catalog                           Unity Catalog

Table 4-5 summarizes the key design considerations, per ecosystem, when implementing a data catalog in lakehouse architecture.

Table 4-5. Key points for data catalog implementation
Ecosystem Key design considerations
AWS
  • Consider implementing a unified data catalog using services like Glue Data Catalog, Lake Formation, and DataZone.

  • You can manage, govern, and share data using these services.

  • Use DataZone to add the business context to the technical metadata.

  • Glue Data Catalog and Lake Formation integrate with DataZone.

  • Organizations can store metadata for Hudi, Iceberg, or Delta Lake tables in the Glue Data Catalog and query the data stored in S3 using Athena.

Azure
  • You can use Azure Synapse Analytics for implementing lakehouse architecture.

  • Use Synapse notebooks to create delta files in ADLS and metadata in the Synapse Lake database.

  • Synapse serverless SQL pools can query data from the Lake database as well as the SQL database.

  • Organizations can use Microsoft Purview to catalog technical and business metadata and apply governance policies.

GCP
  • BigLake stores the metadata for tables created in GCP.

  • BigLake supports Iceberg tables natively and BigQuery can use BigLake tables to query data.

  • Organizations can use Dataplex to implement a unified catalog within GCP for governing data.

Databricks
  • Organizations can use Unity Catalog to implement a unified catalog.

  • Unity Catalog offers features to implement fine-grained access controls across all data and AI/ML assets.

  • Databricks has recently open sourced Unity Catalog, which can now be used outside Databricks.

In this chapter, we focused our learning on metastores and data catalogs, and their features and benefits in lakehouse architecture. In the next chapter, we will discuss the different compute and data consumption options within lakehouse architecture, and how they help different personas perform data and analytics workloads efficiently.

