Practical Lakehouse Architecture

Book description

This concise yet comprehensive guide explains how to adopt a data lakehouse architecture to implement modern data platforms. It reviews the design considerations, challenges, and best practices for implementing a lakehouse and provides key insights into the ways that using a lakehouse can impact your data platform, from managing structured and unstructured data and supporting BI and AI/ML use cases to enabling more rigorous data governance and security measures.

Practical Lakehouse Architecture shows you how to:

  • Understand key lakehouse concepts and features like transaction support, time travel, and schema evolution
  • Understand the differences between traditional and lakehouse data architectures
  • Differentiate between various file formats and table formats
  • Design lakehouse architecture layers for storage, compute, metadata management, and data consumption
  • Implement data governance and data security within the platform
  • Evaluate technologies and decide on the best technology stack to implement the lakehouse for your use case
  • Make critical design decisions and address practical challenges to build a future-ready data platform
  • Start your lakehouse implementation journey and migrate data from existing systems to the lakehouse

Table of contents

  1. Brief Table of Contents (Not Yet Final)
  2. 1. Introduction to Lakehouse Architecture
    1. Understanding a Data Architecture
      1. What is a Data Architecture?
      2. How does Data Architecture Help Build the Data Platform?
      3. Core Components of a Data Platform
    2. Why do we need a New Data Architecture?
    3. Lakehouse - A New Architectural Pattern
      1. Lakehouse - Best of both worlds
      2. Understanding a Lakehouse Architecture
      3. Lakehouse Characteristics
      4. Lakehouse Benefits
    4. Key Takeaways
      1. Data Architecture
      2. Lakehouse Architecture
      3. Lakehouse Benefits
    5. References
  3. 2. Traditional Architectures and Modern Data Platforms
    1. Traditional Architectures: Data Lakes and Data Warehouses
      1. Data Warehouse Fundamentals
      2. Benefits and Advantages
      3. Limitations and Challenges
      4. Data Lake Fundamentals
      5. Benefits and Advantages
      6. Limitations and Challenges
    2. Modern Data Platforms
      1. Finding answers in the cloud
      2. Standalone Approach
      3. Combined Approach
      4. Modern Data Platform Expectations
    3. Comparison - Data Warehouse, Data Lake, Lakehouse
      1. Platform Capabilities and Limitations
      2. Platform Implementation Activities
      3. Platform Administration and Management
      4. Business Outcomes
    4. Lakehouse - The default choice for future data platforms?
    5. Comparison Summary
  4. 3. Storage: Heart of the Lakehouse
    1. Lakehouse Storage - Key Concepts
      1. Row vs Columnar Storage
      2. Storage based Performance Optimization
    2. Lakehouse Storage Components
      1. Cloud Storage
      2. File Formats
      3. Table Formats
    3. Key Design Considerations
    4. Comparison Summary
  5. 4. Data Catalogs
    1. Understanding Metadata
      1. Technical metadata
      2. Business metadata
    2. How Do Metastores and Data Catalogs Work Together?
    3. Features of Data Catalog
      1. Search, explore and discover data
      2. Data classification
      3. Data governance
      4. Data lineage
    4. Unified Data Catalog
      1. Challenges with Two-Tier Combined Architectures
      2. What is a Unified Data Catalog
      3. Benefits of Unified Data Catalog
    5. Implementing a Data Catalog: Key Design Considerations and Options
      1. Data catalog using Hive metastore
      2. Data catalog using AWS services
    6. Key Takeaways
    7. References
  6. 5. Compute Engines for Lakehouse Architectures
    1. Data Computation Benefits of a Lakehouse Architecture
      1. Multiple Compute Engines for Single Storage Tier
      2. Unified Batch and Real-time Processing
      3. Enhanced BI Performance
      4. Freedom to Choose Different Engine Types
      5. Perform Analysis across Storage Zones
    2. Compute Options for Lakehouse Platforms
      1. Open Source Tools
      2. Cloud Services
      3. Third-Party Platforms
    3. Key Design Considerations
      1. Open Table Format Support
      2. Supported Version and Features
      3. Ecosystem Support
      4. Persona-based Preferences
      5. Managed Open-source vs Cloud-native vs Third Party
      6. Data Consumption Workloads
    4. Key Takeaways
    5. References
  7. 6. Lakehouse Data (and AI) Governance and Security
    1. What is Data Governance and Data Security?
    2. Benefits of Data Governance and Data Security
    3. Unified Governance and Security in Lakehouse
    4. Governance and Security Processes in Lakehouse
      1. Metadata Management
      2. Compliance and Regulations
      3. Data and ML Model Quality
      4. Lineage across Data and AI assets
      5. Data and AI Asset Sharing
      6. Data Ownership
      7. Auditing and Monitoring
      8. Access Management
      9. Data Protection
      10. Sensitive Data Handling
    5. What’s Your Role?
      1. Business sponsors
      2. Data owners
      3. Data stewards
      4. Data analysts
      5. Data architects
      6. Data engineers
      7. Platform and data administrators
      8. Data scientists, ML engineers, and BI engineers
      9. Data governance committee members
    6. Key Takeaways
    7. References
  8. 7. The Big Picture—Designing and Implementing Your Lakehouse Platform
    1. Pre-Design Activities
      1. Understanding Requirements
      2. Studying Existing System
      3. Understanding Vision and Data Strategy
      4. Conducting Workshops and Interviews
    2. Choosing the Right Architecture
    3. Establishing Guiding Principles
      1. Data Ecosystem
      2. Scalability and Performance
      3. Cost Control and Optimization
      4. Platform Operations
      5. Governance and Security
    4. Design Considerations and Implementation Best Practices
      1. Architecture Blueprint
      2. Data Ingestion
      3. Data Storage
      4. Data Processing
      5. Data Consumption and Delivery
      6. Common Services
    5. Design References
      1. A Step-by-Step Design Guide
      2. Design Questionnaire
    6. Key Takeaways
    7. References
  9. 8. Lakehouse in the Real World
    1. Delivering a Real World Lakehouse
    2. Planning Phase
      1. Estimation and planning
    3. Analysis and Design Phase
      1. Analyzing the existing system
      2. Data modeling
      3. Finalizing the tech stack
    4. Implementation and Test Phase
      1. Historical data migration
      2. Data reconciliation and testing
      3. Reverse engineering
      4. Data quality and sensitive data handling
    5. Support and Maintenance Phase
      1. Auditing and tracking
      2. Disaster recovery strategy
      3. Decommissioning old system
    6. Delivery References
      1. Project Deliverables
      2. Reference Architectures
    7. Key Takeaways
    8. References
  10. About the Author

Product information

  • Title: Practical Lakehouse Architecture
  • Author(s): Gaurav Ashok Thalpati
  • Release date: August 2024
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098153014