Building Medallion Architectures

by Piethein Strengholt

March 2025

Intermediate to advanced

396 pages

11h 36m

English

O'Reilly Media, Inc.

Who Should Read This Book; Navigating This Book; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
What Is a Medallion Architecture?; A Brief History of Data Warehouse Architecture; OLTP Systems; Data Warehouses; The Staging Area; Inmon Methodology; Kimball Methodology; Key Takeaways from Traditional Data Warehouses; A Brief History of Data Lakes; Hadoop’s Distributed File System; MapReduce; Apache Hive; Spark Project; Moving Forward with Data Lakes; A Brief History of Lakehouse Architecture; Founders of Spark; Emergence of Open Table Formats; The Rise of Lakehouse Architectures; Medallion Architecture and Its Practical Challenges; Conclusion
Foundational Preconditions; Extra Landing Zones; Raw Data; Batch Processing; Real-Time Data Processing; Spark Structured Streaming; Change Data Feed; Change Data Capture; Considerations and Learning Resources; ETL and Orchestration Tools; Managing Delta Tables; Z-Ordering; V-Ordering; Table Partitioning; Liquid Clustering; Compaction and Optimized Writes; DeltaLog; Conclusion
The Three-Layered Design; Bronze Layer; Processing Hierarchy; Processing Full Data Loads; Processing Incremental Data Loads; Data Historization Within the Bronze Layer; Schema Evolution and Management; MergeSchema and Schema Enforcement; Technical Validation Checks; Usage and Governance; The Bronze Layer in Practice; Silver Layer; Cleaning Data Activities; Designing the Silver Layer’s Data Model; Harmonization with Other Sources; 3NF and Data Vault; Operational Querying and Machine Learning; Managing Overlapping Requirements; Automation Tasks; The Silver Layer in Practice; Gold Layer; Star Schema; Star Schema Design Nuances; Curated, Semantic, and Platinum Layers; One-Big-Table Design; Serving Layer; The Gold Layer in Practice; Conclusion
Our Case Study: Oceanic Airlines; Introducing Microsoft Fabric; Domains; Workspaces and Capacities; OneLake; Data Engineering with Spark; Data Warehousing with T-SQL; Other Fabric Workload Types; Setting Up the Foundation; Setting up Capacities; Setting up Domains; Setting up Workspaces; Creating Lakehouses; Capacity Considerations; Domain Considerations; Workspace Considerations; Lakehouse Entities Considerations; Storage Account Considerations; Conclusion
Building the Data Pipeline; Deploying the AdventureWorks Sample Database; Set Up an Azure SQL Database Connection; Creating a New Data Pipeline; Additional Considerations; Implementation of Lakehouse Tables; Traverse Parquet Files to Managed Delta Tables; Using External Tables; Updating Tables with MERGE Operations; Spark Structured Streaming; Using Change Data Capture; Navigating Data Handling Techniques; Schema Management; Create Tables Without Defining Schemas; Define Schemas with the DataFrame API; SQL DDL Statements; YAML or JSON Configurations; Metadata-Driven Approach; Databricks Auto Loader; Third-Party Tools; Handling Schema Evolution; Conclusion
Quick Recap; Implementation of a Metadata-Driven Approach; Implementation of the Metadata Store; Implementation of Dynamic Data Validations; Improvement Areas; Data Cleansing; Implementation of Data Cleansing Tasks; Data Cleansing Considerations; Data Transformation Frameworks and Data Quality Tools; Optimization of Query Performance with Denormalization; Lightweight Enrichments; Data Historization; Optimization Jobs; Orchestration with Apache AirFlow; Final Recommendations; Silver-Layer Data as a Product; Conclusion

Design of the Gold Layer; Transform Data Using a Star Schema; Creation of the Semantic Model; Creation of the First Power BI Report; Creation of Task Flows; Enhancements for Gold-Layer Design; Microsoft Fabric in Practice; Data Products; Data Governance with Microsoft Purview; Microsoft Purview Design Considerations; Guidance for Medallion Architectures; Conclusion
Medallion Architecture; Other Considerations; Final Recommendations
Medallion Architecture; FinOps; Data Models; Data Contracts; Data Governance
Data Platform Evolution; Medallion Architecture; Data Products and Sharing; Recommendations and Best Practices
Decentralization of Data Management; Flexibility in Federation; Medallion Mesh; Number of Medallion Architectures; Medallion Inner Architecture Variations; Separate Data Product Layers; Tailored Medallions Architectures; Adaptability of the Bronze Layer; Silver Layer Variations; Gold Layer Variations; Enterprise Data Models; Master Data Management; Reference Data Management; Conclusion
Data Governance; Governance Within a Medallion Architecture; Unity Catalog; Medallion Architecture with Unity Catalog; Data Contracts; Contracts Within a Catalog; Contracts Within a Metastore; Data Contracts Using YAML Files and GitOps; Other Data Contract Specifications; Data Security and Access Management; Conclusion
Unstructured Data Processing; Retrieval-Augmented Generation; Bronze Layer; Silver Layer; Gold Layer; Integration of LLMs and Medallion Architectures; Role of Agents; Training and Fine-Tuning LLMs; Future of Medallion Architectures; Conclusion

Content preview from Building Medallion Architectures

Chapter 5. Construct the Bronze Layer

Having established the foundation of your data platform, whether it is Microsoft Fabric or Azure Databricks, it’s time to build the Bronze layer. This is the layer where all raw data first lands and where it is kept in its original form. It serves both as a historical archive and as a reliable single source for the layers built on top of it.
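
To make the idea of landing data "in its original form" concrete, here is a minimal sketch in plain Python. It is not the pipeline this chapter builds (that one uses Data Factory, Spark, and Delta Lake); it only illustrates the principle with local files. The function name `land_raw_file`, the folder layout, and the `_load_metadata.json` sidecar are all hypothetical choices made for this illustration.

```python
import hashlib
import json
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path


def land_raw_file(bronze_root: Path, source: str, name: str, payload: bytes) -> Path:
    """Store a delivery byte-for-byte in its own folder (hypothetical layout)."""
    loaded_at = datetime.now(timezone.utc)
    # One folder per load: earlier deliveries are never overwritten,
    # which is what lets the layer act as a historical archive.
    load_id = f"{loaded_at.strftime('%Y%m%dT%H%M%S')}_{uuid.uuid4().hex[:8]}"
    target_dir = bronze_root / source / load_id
    target_dir.mkdir(parents=True, exist_ok=False)

    target = target_dir / name
    target.write_bytes(payload)  # no parsing, no cleaning, no type casting

    # Technical metadata lives next to the data, not inside it.
    metadata = {
        "source": source,
        "loaded_at": loaded_at.isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    (target_dir / "_load_metadata.json").write_text(json.dumps(metadata))
    return target


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    first = land_raw_file(root, "adventureworks", "customer.csv", b"id,name\n1, Alice \n")
    second = land_raw_file(root, "adventureworks", "customer.csv", b"id,name\n1,Alice\n2,Bob\n")
    print(first)
    print(second)
```

Note that the stray whitespace in the first delivery is kept as-is. In this sketch, cleaning is deliberately left out, since the Bronze layer holds data in its original form and later layers handle refinement.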

As part of setting up this first layer, you’ll set up connections, build your first data pipeline, and explore how to handle data ingestion and schema management. You’ll come across various code snippets along the way. Some are purely illustrative, and others you can use directly in your coding exercises. Keep in mind that these examples are streamlined for educational purposes, so you may need to adapt them before applying them to real-world scenarios.

By the end of this chapter, you will understand how to build and implement the Bronze layer of your Medallion architecture, including the nuances of ingesting and managing data in that layer. This solid base will prepare you for the subsequent Silver and Gold layers. Let’s start by building the data pipeline.

Building the Data Pipeline

In this section, we will construct a data pipeline using Data Factory,¹ while integrating Spark and Delta Lake into the process. This hands-on journey will equip you with the skills to understand how these tools interconnect ...