Data Virtualization in the Cloud Era

Book description

Data virtualization had been held back by complexity for decades until recent advances in cloud technology, data lakes, networking hardware, and machine learning transformed the dream into reality. It's becoming increasingly practical to access data through an interface that hides low-level details about where it's stored, how it's organized, and which systems are needed to manipulate or process it. You can combine and query data from anywhere and leave the complex details behind.

In this practical book, authors Dr. Daniel Abadi and Andrew Mott discuss in detail what data virtualization is and the trends in technology that are making data virtualization increasingly useful. With this book, data engineers, data architects, and data scientists will explore the architecture of modern data virtualization systems and learn how these systems differ from one another at technical and practical levels.

By the end of the book, you'll understand:

  • The architecture of data virtualization systems
  • Technical and practical ways that data virtualization systems differ from one another
  • Where data virtualization fits into modern data mesh and data fabric paradigms
  • Modern best practices and use cases drawn from case studies

Table of contents

  1. Introduction to Data Virtualization and Data Lakes
    1. A Quick Overview of Data Virtualization System Architecture
      1. Data Lakes
      2. Horizontal Scalability
      3. Support for Structured, Semi-Structured and Unstructured Data
      4. Open File Formats
      5. Support for Schema on Read
      6. The Cloud Era
    2. Data Virtualization Over Data Lakes
  2. Recent Technology Developments Driving the Rebirth of Data Virtualization
    1. Definitions
    2. Five Challenges of Data Virtualization
    3. The Death and Rebirth of Data Virtualization
      1. Technology Trends Driving the Rebirth of Data Virtualization
      2. Data Virtualization and Mainstream Adoption
  3. How Data Virtualization Systems Work
    1. The Basic Architecture of Data Virtualization
      1. Push-Based DV Engines
      2. Pull-Based DV Engines
      3. Hybrid Approaches
      4. Common Pitfalls
  4. Advanced Architectural Components
    1. Caching
    2. Query Cache
    3. Block/Partition Cache
    4. Database Table Cache
    5. Automated Pre-Computation Based Cache
    6. Materialized View Caching
    7. DV Engine–Initiated Writes to Underlying Data Sources
    8. Multiregion (and/or Multicloud) DV Systems
    9. Multiregion DV Architecture
  5. Data Virtualization Systems in Practice
    1. Benchmark
    2. Additional Considerations
      1. Interfaces
      2. Abstraction Layer
      3. Centralized Metadata Layer
      4. Security Management
      5. Query Optimization
      6. Caching
      7. Native Data Lake Access
      8. Multiregion DV Architecture
      9. Support for On-Premises, Cloud, and Hybrid Data Sources
    3. Choosing a System: Both the Quantitative and the Qualitative Matter
  6. Case Studies
    1. Data Platforms Used to Virtualize Data
      1. Organization 1
      2. Organization 2
      3. Organization 3
    2. Accessing Data
      1. Duplicate Data
      2. Hybrid Architectures and Storage
      3. Caching and Freshness of Data
      4. Mergers and Acquisitions
      5. Data Discovery
      6. Historical Data and Regulatory Compliance
    3. Abstraction
      1. Translation Layer and Reducing the Barrier to Entry
      2. Reducing the Swivel
      3. Fail Fast
    4. Decentralized Data Ownership
      1. Redundant Technology
      2. Ownership of the Truth
      3. Distributed Pipeline Responsibility
    5. Performance and Scale
      1. Query Performance
      2. Scale
    6. Security
    7. Decision Criteria
      1. Connectivity
      2. Pull-Based
      3. Caching Capabilities
      4. Open Source
    8. Reducing Friction
  7. Data Architectures Supported by Data Virtualization Systems
    1. Data Warehouse
    2. Data Lakehouses and Icehouses
    3. Data Products
    4. Data Mesh
      1. Domain-Oriented Ownership
      2. Data as a Product
      3. Self-Service Data Platform
      4. Federated Computational Governance
      5. DV System Features for the Data Mesh
    5. Data Fabric
  8. The Future of Data Virtualization
    1. Hybrid Push-Pull Systems
    2. Data Lakehouses and Icehouses
    3. Conclusion
  9. About the Authors

Product information

  • Title: Data Virtualization in the Cloud Era
  • Author(s): Daniel Abadi, Andrew Mott
  • Release date: July 2024
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098160340