Deciphering Data Architectures

Book description

Data fabric, data lakehouse, and data mesh have recently appeared as viable alternatives to the modern data warehouse. These new architectures have solid benefits, but they're also surrounded by a lot of hyperbole and confusion. This practical book provides a guided tour of these architectures to help data professionals understand the pros and cons of each. James Serra, big data and data warehousing solution architect at Microsoft, examines common data architecture concepts, including how data warehouses have had to evolve to work with data lake features. You'll learn what data lakehouses can help you achieve, as well as how to distinguish data mesh hype from reality. Best of all, you'll be able to determine the most appropriate data architecture for your needs.

With this book, you'll:

  • Gain a working understanding of several data architectures
  • Learn the strengths and weaknesses of each approach
  • Distinguish data architecture theory from reality
  • Pick the best architecture for your use case
  • Understand the differences between data warehouses and data lakes
  • Learn common data architecture concepts to help you build better solutions
  • Explore the historical evolution and characteristics of data architectures
  • Learn essentials of running an architecture design session, team organization, and project success factors

Free from product discussions, this book will serve as a timeless resource for years to come.

Table of contents

  1. Foreword
  2. Preface
    1. Conventions Used in This Book
    2. O’Reilly Online Learning
    3. How to Contact Us
    4. Acknowledgments
  3. I. Foundation
  4. 1. Big Data
    1. What Is Big Data, and How Can It Help You?
    2. Data Maturity
      1. Stage 1: Reactive
      2. Stage 2: Informative
      3. Stage 3: Predictive
      4. Stage 4: Transformative
    3. Self-Service Business Intelligence
    4. Summary
  5. 2. Types of Data Architectures
    1. Evolution of Data Architectures
    2. Relational Data Warehouse
    3. Data Lake
    4. Modern Data Warehouse
    5. Data Fabric
    6. Data Lakehouse
    7. Data Mesh
    8. Summary
  6. 3. The Architecture Design Session
    1. What Is an ADS?
    2. Why Hold an ADS?
    3. Before the ADS
      1. Preparing
      2. Inviting Participants
    4. Conducting the ADS
      1. Introductions
      2. Discovery
      3. Whiteboarding
    5. After the ADS
    6. Tips for Conducting an ADS
    7. Summary
  7. II. Common Data Architecture Concepts
  8. 4. The Relational Data Warehouse
    1. What Is a Relational Data Warehouse?
    2. What a Data Warehouse Is Not
    3. The Top-Down Approach
    4. Why Use a Relational Data Warehouse?
    5. Drawbacks to Using a Relational Data Warehouse
    6. Populating a Data Warehouse
      1. How Often to Extract the Data
      2. Extraction Methods
      3. How to Determine What Data Has Changed Since the Last Extraction
    7. The Death of the Relational Data Warehouse Has Been Greatly Exaggerated
    8. Summary
  9. 5. Data Lake
    1. What Is a Data Lake?
    2. Why Use a Data Lake?
    3. Bottom-Up Approach
    4. Best Practices for Data Lake Design
    5. Multiple Data Lakes
      1. Advantages
      2. Disadvantages
    6. Summary
  10. 6. Data Storage Solutions and Processes
    1. Data Storage Solutions
      1. Data Marts
      2. Operational Data Stores
      3. Data Hubs
    2. Data Processes
      1. Master Data Management
      2. Data Virtualization and Data Federation
      3. Data Catalogs
      4. Data Marketplaces
    3. Summary
  11. 7. Approaches to Design
    1. Online Transaction Processing Versus Online Analytical Processing
    2. Operational and Analytical Data
    3. Symmetric Multiprocessing and Massively Parallel Processing
    4. Lambda Architecture
    5. Kappa Architecture
    6. Polyglot Persistence and Polyglot Data Stores
    7. Summary
  12. 8. Approaches to Data Modeling
    1. Relational Modeling
      1. Keys
      2. Entity–Relationship Diagrams
      3. Normalization Rules and Forms
      4. Tracking Changes
    2. Dimensional Modeling
      1. Facts, Dimensions, and Keys
      2. Tracking Changes
      3. Denormalization
    3. Common Data Model
    4. Data Vault
    5. The Kimball and Inmon Data Warehousing Methodologies
      1. Inmon’s Top-Down Methodology
      2. Kimball’s Bottom-Up Methodology
      3. Choosing a Methodology
      4. Hybrid Models
    6. Methodology Myths
    7. Summary
  13. 9. Approaches to Data Ingestion
    1. ETL Versus ELT
    2. Reverse ETL
    3. Batch Processing Versus Real-Time Processing
      1. Batch Processing Pros and Cons
      2. Real-Time Processing Pros and Cons
    4. Data Governance
    5. Summary
  14. III. Data Architectures
  15. 10. The Modern Data Warehouse
    1. The MDW Architecture
    2. Pros and Cons of the MDW Architecture
    3. Combining the RDW and Data Lake
      1. Data Lake
      2. Relational Data Warehouse
    4. Stepping Stones to the MDW
      1. EDW Augmentation
      2. Temporary Data Lake Plus EDW
      3. All-in-One
    5. Case Study: Wilson & Gunkerk’s Strategic Shift to an MDW
      1. Challenge
      2. Solution
      3. Outcome
    6. Summary
  16. 11. Data Fabric
    1. The Data Fabric Architecture
      1. Data Access Policies
      2. Metadata Catalog
      3. Master Data Management
      4. Data Virtualization
      5. Real-Time Processing
      6. APIs
      7. Services
      8. Products
    2. Why Transition from an MDW to a Data Fabric Architecture?
    3. Potential Drawbacks
    4. Summary
  17. 12. Data Lakehouse
    1. Delta Lake Features
    2. Performance Improvements
    3. The Data Lakehouse Architecture
    4. What If You Skip the Relational Data Warehouse?
    5. Relational Serving Layer
    6. Summary
  18. 13. Data Mesh Foundation
    1. A Decentralized Data Architecture
    2. Data Mesh Hype
    3. Dehghani’s Four Principles of Data Mesh
      1. Principle #1: Domain Ownership
      2. Principle #2: Data as a Product
      3. Principle #3: Self-Serve Data Infrastructure as a Platform
      4. Principle #4: Federated Computational Governance
    4. The “Pure” Data Mesh
    5. Data Domains
    6. Data Mesh Logical Architecture
    7. Different Topologies
    8. Data Mesh Versus Data Fabric
    9. Use Cases
    10. Summary
  19. 14. Should You Adopt Data Mesh? Myths, Concerns, and the Future
    1. Myths
      1. Myth: Using Data Mesh Is a Silver Bullet That Solves All Data Challenges Quickly
      2. Myth: A Data Mesh Will Replace Your Data Lake and Data Warehouse
      3. Myth: Data Warehouse Projects Are All Failing, and a Data Mesh Will Solve That Problem
      4. Myth: Building a Data Mesh Means Decentralizing Absolutely Everything
      5. Myth: You Can Use Data Virtualization to Create a Data Mesh
    2. Concerns
      1. Philosophical and Conceptual Matters
      2. Combining Data in a Decentralized Environment
      3. Other Issues of Decentralization
      4. Complexity
      5. Duplication
      6. Feasibility
      7. People
      8. Domain-Level Barriers
    3. Organizational Assessment: Should You Adopt a Data Mesh?
    4. Recommendations for Implementing a Successful Data Mesh
    5. The Future of Data Mesh
    6. Zooming Out: Understanding Data Architectures and Their Applications
    7. Summary
  20. IV. People, Processes, and Technology
  21. 15. People and Processes
    1. Team Organization: Roles and Responsibilities
      1. Roles for MDW, Data Fabric, or Data Lakehouse
      2. Roles for Data Mesh
    2. Why Projects Fail: Pitfalls and Prevention
      1. Pitfall: Allowing Executives to Think That BI Is “Easy”
      2. Pitfall: Using the Wrong Technologies
      3. Pitfall: Gathering Too Many Business Requirements
      4. Pitfall: Gathering Too Few Business Requirements
      5. Pitfall: Presenting Reports Without Validating Their Contents First
      6. Pitfall: Hiring an Inexperienced Consulting Company
      7. Pitfall: Hiring a Consulting Company That Outsources Development to Offshore Workers
      8. Pitfall: Passing Project Ownership Off to Consultants
      9. Pitfall: Neglecting the Need to Transfer Knowledge Back into the Organization
      10. Pitfall: Slashing the Budget Midway Through the Project
      11. Pitfall: Starting with an End Date and Working Backward
      12. Pitfall: Structuring the Data Warehouse to Reflect the Source Data Rather Than the Business’s Needs
      13. Pitfall: Presenting End Users with a Solution with Slow Response Times or Other Performance Issues
      14. Pitfall: Overdesigning (or Underdesigning) Your Data Architecture
      15. Pitfall: Poor Communication Between IT and the Business Domains
    3. Tips for Success
      1. Don’t Skimp on Your Investment
      2. Involve Users, Show Them Results, and Get Them Excited
      3. Add Value to New Reports and Dashboards
      4. Ask End Users to Build a Prototype
      5. Find a Project Champion/Sponsor
      6. Make a Project Plan That Aims for 80% Efficiency
    4. Summary
  22. 16. Technologies
    1. Choosing a Platform
      1. Open Source Solutions
      2. On-Premises Solutions
      3. Cloud Provider Solutions
    2. Cloud Service Models
      1. Major Cloud Providers
      2. Multi-Cloud Solutions
    3. Software Frameworks
      1. Hadoop
      2. Databricks
      3. Snowflake
    4. Summary
  23. Index
  24. About the Author

Product information

  • Title: Deciphering Data Architectures
  • Author(s): James Serra
  • Release date: February 2024
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098150761