Streaming Data Mesh

by Hubert Dulay, Stephen Mooney
Released June 2023
Publisher(s): O'Reilly Media, Inc.
ISBN: 9781098130725

Book description

Data lakes and warehouses have become increasingly fragile, costly, and difficult to maintain as data gets bigger and moves faster. Data meshes can help your organization decentralize data, giving ownership back to the engineers who produced it. This book provides a concise yet comprehensive overview of data mesh patterns for streaming and real-time data services.

Authors Hubert Dulay and Stephen Mooney examine the vast differences between streaming and batch data meshes. Data engineers, architects, data product owners, and those in DevOps and MLOps roles will learn steps for implementing a streaming data mesh, from defining a data domain to building a good data product. Over the course of the book, you'll create a complete self-service data platform and devise a data governance system that enables your mesh to work seamlessly.

With this book, you will:

  • Design a streaming data mesh using Kafka
  • Learn how to identify a domain
  • Build your first data product using self-service tools
  • Apply data governance to the data products you create
  • Learn the differences between synchronous and asynchronous data services
  • Implement self-services that support decentralized data

Table of contents

  1. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. OâReilly Online Learning
    4. How to Contact Us
    5. Acknowledgments
  2. 1. Data Mesh Introduction
    1. Data Divide
    2. Data Mesh Pillars
      1. Domain Ownership
      2. Data Products
      3. Federated Computational Data Governance
      4. Self-service Data Infrastructure
      5. Data Mesh Diagram
    3. Other Similar Architectural Patterns
      1. Data Fabric
      2. Data Gateways and Data Services (DaaS)
      3. Data Democratization
      4. Data Virtualization
    4. Focusing on Implementation
      1. Apache Kafka
      2. AsyncAPI
  3. 2. Streaming Data Mesh Introduction
    1. The Streaming Advantage
      1. Streaming Enables Real-time Use Cases
      2. Streaming Enables Data Optimization Advantages
    2. The Kappa Architecture
      1. Lambda Architecture Introduction
      2. Kappa Architecture Introduction
    3. Summary
  4. 3. Domain Ownership
    1. Identifying domains
      1. Discernible Domains
      2. Geographic Regions
      3. Hybrid Architecture
      4. Multi-cloud
    2. Avoiding Ambiguous Domains
    3. Domain-Driven Design
      1. Domain model
      2. Domain logic
      3. Bounded context
      4. The Ubiquitous Language
    4. Data Mesh Domain Roles
      1. Data Product Engineer
      2. Data Product Owner or Data Steward
    5. Streaming Data Mesh Tools and Platforms to consider
    6. Domain Chargebacks
      1. Usage-based Chargebacks
      2. Task-level resource Chargebacks
      3. Cost-splitting Chargebacks
      4. Data Products Chargebacks
  5. 4. Streaming Data Products
    1. Motivation
    2. Define Data Product Requirements
    3. Identifying Data Product Derivatives
      1. Derivatives from Other Domains
    4. Ingestion Data Product Derivatives with Kafka Connect
      1. Consumability
      2. Synchronous Data Sources
      3. Asynchronous Data Sources & Change Data Capture (CDC)
    5. Transforming Data Derivatives to Data Products
      1. Data Quality
      2. Data Security
      3. SQL
      4. Extract Transform Load (ETL)
    6. Step 5: Publishing Data Products with AsyncAPI
      1. Building an AsyncAPI YAML Document
      2. Determining data product type
      3. Data Tags
      4. Versioning
      5. Monitoring
  6. 5. Federated Computational Data Governance
    1. Data Governance in a Streaming Data Mesh
      1. Streaming Data Catalog to Organize Data Products
    2. Metadata
      1. Schemas
      2. Lineage
      3. Security
      4. Scalability
    3. Generating the Data Product Page from AsyncAPI
      1. APICurio Registry
      2. Access Workflow
    4. Centralized vs Decentralized
      1. Centralized Engineers
      2. Decentralized (Domain) Engineers
    5. Data Governance Tools Summary
      1. Streaming Data Catalog
  7. 6. Self-Service Data Infrastructure
    1. Resource Related Commands
      1. Cluster Related Commands
      2. Topic Related Commands
      3. Domain Command
      4. Connect Command
      5. Streaming Command
      6. Publishing a Streaming Data Product
    2. Data Governance Related Services
      1. Security Services
      2. Standards Services
      3. Lineage Services
    3. SaaS Services and APIs
      1. Infrastructure As Code
  8. 7. Architecting a Streaming Data Mesh
    1. Infrastructure
    2. Two Architecture Solutions
      1. Dedicated Infrastructure
      2. Multi-Tenant Infrastructure
    3. Streaming Data Mesh Central Architecture
      1. The Domain Agent (aka SideCar)
      2. Data Plane
      3. Control Plane
    4. Summary
