Architecting Data and Machine Learning Platforms

Book description

All cloud architects need to know how to build data platforms that enable businesses to make data-driven decisions and deliver enterprise-wide intelligence quickly and efficiently. This handbook shows you how to design, build, and modernize cloud native data and machine learning platforms using AWS, Azure, Google Cloud, and multicloud tools like Snowflake and Databricks.

Authors Marco Tranquillin, Valliappa Lakshmanan, and Firat Tekiner cover the entire data lifecycle, from ingestion to activation, in a cloud environment using real-world enterprise architectures. You'll learn how to transform, secure, and modernize familiar solutions like data warehouses and data lakes, and you'll be able to apply recent AI/ML patterns to get faster, more accurate insights that drive competitive advantage.

You'll learn how to:

  • Design a modern and secure cloud native or hybrid data analytics and machine learning platform
  • Accelerate data-led innovation by consolidating enterprise data in a governed, scalable, and resilient data platform
  • Democratize access to enterprise data and govern how business teams extract insights and build AI/ML capabilities
  • Enable your business to make decisions in real time using streaming pipelines
  • Build an MLOps platform to move to a predictive and prescriptive analytics approach

Table of contents

  1. Preface
    1. Why Do You Need a Cloud Data Platform?
    2. Who Is This Book For?
    3. Organization of This Book
    4. Conventions Used in This Book
    5. Using Code Examples
    6. O’Reilly Online Learning
    7. How to Contact Us
    8. Acknowledgments
  2. 1. Modernizing Your Data Platform: An Introductory Overview
    1. The Data Lifecycle
      1. The Journey to Wisdom
      2. Water Pipes Analogy
      3. Collect
      4. Store
      5. Process/Transform
      6. Analyze/Visualize
      7. Activate
    2. Limitations of Traditional Approaches
      1. Antipattern: Breaking Down Silos Through ETL
      2. Antipattern: Centralization of Control
      3. Antipattern: Data Marts and Hadoop
    3. Creating a Unified Analytics Platform
      1. Cloud Instead of On-Premises
      2. Drawbacks of Data Marts and Data Lakes
      3. Convergence of DWHs and Data Lakes
    4. Hybrid Cloud
      1. Reasons Why Hybrid Is Necessary
      2. Challenges of Hybrid Cloud
      3. Why Hybrid Can Work
      4. Edge Computing
    5. Applying AI
      1. Machine Learning
      2. Uses of ML
    6. Why Cloud for AI?
      1. Cloud Infrastructure
      2. Democratization
      3. Real Time
      4. MLOps
    7. Core Principles
    8. Summary
  3. 2. Strategic Steps to Innovate with Data
    1. Step 1: Strategy and Planning
      1. Strategic Goals
      2. Identify Stakeholders
      3. Change Management
    2. Step 2: Reduce Total Cost of Ownership by Adopting a Cloud Approach
      1. Why Cloud Costs Less
      2. How Much Are the Savings?
      3. When Does Cloud Help?
    3. Step 3: Break Down Silos
      1. Unifying Data Access
      2. Choosing Storage
      3. Semantic Layer
    4. Step 4: Make Decisions in Context Faster
      1. Batch to Stream
      2. Contextual Information
      3. Cost Management
    5. Step 5: Leapfrog with Packaged AI Solutions
      1. Predictive Analytics
      2. Understanding and Generating Unstructured Data
      3. Personalization
      4. Packaged Solutions
    6. Step 6: Operationalize AI-Driven Workflows
      1. Identifying the Right Balance of Automation and Assistance
      2. Building a Data Culture
      3. Populating Your Data Science Team
    7. Step 7: Product Management for Data
      1. Applying Product Management Principles to Data
      2. 1. Understand and Maintain a Map of Data Flows in the Enterprise
      3. 2. Identify Key Metrics
      4. 3. Agreed Criteria, Committed Roadmap, and Visionary Backlog
      5. 4. Build for the Customers You Have
      6. 5. Don’t Shift the Burden of Change Management
      7. 6. Interview Customers to Discover Their Data Needs
      8. 7. Whiteboard and Prototype Extensively
      9. 8. Build Only What Will Be Used Immediately
      10. 9. Standardize Common Entities and KPIs
      11. 10. Provide Self-Service Capabilities in Your Data Platform
    8. Summary
  4. 3. Designing for Your Data Team
    1. Classifying Data Processing Organizations
    2. Data Analysis–Driven Organization
      1. The Vision
      2. The Personas
      3. The Technological Framework
    3. Data Engineering–Driven Organization
      1. The Vision
      2. The Personas
      3. The Technological Framework
    4. Data Science–Driven Organization
      1. The Vision
      2. The Personas
      3. The Technological Framework
    5. Summary
  5. 4. A Migration Framework
    1. Modernize Data Workflows
      1. Holistic View
      2. Modernize Workflows
      3. Transform the Workflow Itself
    2. A Four-Step Migration Framework
      1. Prepare and Discover
      2. Assess and Plan
      3. Execute
      4. Optimize
    3. Estimating the Overall Cost of the Solution
      1. Audit of the Existing Infrastructure
      2. Request for Information/Proposal and Quotation
      3. Proof of Concept/Minimum Viable Product
    4. Setting Up Security and Data Governance
      1. Framework
      2. Artifacts
      3. Governance over the Life of the Data
    5. Schema, Pipeline, and Data Migration
      1. Schema Migration
      2. Pipeline Migration
      3. Data Migration
      4. Migration Stages
    6. Summary
  6. 5. Architecting a Data Lake
    1. Data Lake and the Cloud—A Perfect Marriage
      1. Challenges with On-Premises Data Lakes
      2. Benefits of Cloud Data Lakes
    2. Design and Implementation
      1. Batch and Stream
      2. Data Catalog
      3. Hadoop Landscape
      4. Cloud Data Lake Reference Architecture
    3. Integrating the Data Lake: The Real Superpower
      1. APIs to Extend the Lake
      2. The Evolution of Data Lake with Apache Iceberg, Apache Hudi, and Delta Lake
      3. Interactive Analytics with Notebooks
    4. Democratizing Data Processing and Reporting
      1. Build Trust in the Data
      2. Data Ingestion Is Still an IT Matter
    5. ML in the Data Lake
      1. Training on Raw Data
      2. Predicting in the Data Lake
    6. Summary
  7. 6. Innovating with an Enterprise Data Warehouse
    1. A Modern Data Platform
      1. Organizational Goals
      2. Technological Challenges
      3. Technology Trends and Tools
    2. Hub-and-Spoke Architecture
      1. Data Ingest
      2. Business Intelligence
      3. Transformations
      4. Organizational Structure
    3. DWH to Enable Data Scientists
      1. Query Interface
      2. Storage API
      3. ML Without Moving Your Data
    4. Summary
  8. 7. Converging to a Lakehouse
    1. The Need for a Unique Architecture
      1. User Personas
      2. Antipattern: Disconnected Systems
      3. Antipattern: Duplicated Data
    2. Converged Architecture
      1. Two Forms
      2. Lakehouse on Cloud Storage
      3. SQL-First Lakehouse
      4. The Benefits of Convergence
    3. Summary
  9. 8. Architectures for Streaming
    1. The Value of Streaming
      1. Industry Use Cases
      2. Streaming Use Cases
    2. Streaming Ingest
      1. Streaming ETL
      2. Streaming ELT
      3. Streaming Insert
      4. Streaming from Edge Devices (IoT)
      5. Streaming Sinks
    3. Real-Time Dashboards
      1. Live Querying
      2. Materialize Some Views
    4. Stream Analytics
      1. Time-Series Analytics
      2. Clickstream Analytics
      3. Anomaly Detection
      4. Resilient Streaming
    5. Continuous Intelligence Through ML
      1. Training Model on Streaming Data
      2. Streaming ML Inference
      3. Automated Actions
    6. Summary
  10. 9. Extending a Data Platform Using Hybrid and Edge
    1. Why Multicloud?
      1. A Single Cloud Is Simpler and Cost-Effective
      2. Multicloud Is Inevitable
      3. Multicloud Could Be Strategic
    2. Multicloud Architectural Patterns
      1. Single Pane of Glass
      2. Write Once, Run Anywhere
      3. Bursting from On Premises to Cloud
      4. Pass-Through from On Premises to Cloud
      5. Data Integration Through Streaming
    3. Adopting Multicloud
      1. Framework
      2. Time Scale
      3. Define a Target Multicloud Architecture
    4. Why Edge Computing?
      1. Bandwidth, Latency, and Patchy Connectivity
      2. Use Cases
      3. Benefits
      4. Challenges
    5. Edge Computing Architectural Patterns
      1. Smart Devices
      2. Smart Gateways
      3. ML Activation
    6. Adopting Edge Computing
      1. The Initial Context
      2. The Project
      3. The Final Outcomes and Next Steps
    7. Summary
  11. 10. AI Application Architecture
    1. Is This an AI/ML Problem?
      1. Subfields of AI
      2. Generative AI
      3. Problems Fit for ML
    2. Buy, Adapt, or Build?
      1. Data Considerations
      2. When to Buy
      3. What Can You Buy?
      4. How Adapting Works
    3. AI Architectures
      1. Understanding Unstructured Data
      2. Generating Unstructured Data
      3. Predicting Outcomes
      4. Forecasting Values
      5. Anomaly Detection
      6. Personalization
      7. Automation
    4. Responsible AI
      1. AI Principles
      2. ML Fairness
      3. Explainability
    5. Summary
  12. 11. Architecting an ML Platform
    1. ML Activities
    2. Developing ML Models
      1. Labeling Environment
      2. Development Environment
      3. User Environment
      4. Preparing Data
      5. Training ML Models
    3. Deploying ML Models
      1. Deploying to an Endpoint
      2. Evaluate Model
      3. Hybrid and Multicloud
      4. Training-Serving Skew
    4. Automation
      1. Automate Training and Deployment
      2. Orchestration with Pipelines
      3. Continuous Evaluation and Training
    5. Choosing the ML Framework
      1. Team Skills
      2. Task Considerations
      3. User-Centric
    6. Summary
  13. 12. Data Platform Modernization: A Model Case
    1. New Technology for a New Era
      1. The Need for Change
      2. It Is Not Only a Matter of Technology
    2. The Beginning of the Journey
      1. The Current Environment
      2. The Target Environment
      3. The PoC Use Case
    3. The RFP Responses Proposed by Cloud Vendors
      1. The Target Environment
      2. The Approach on Migration
    4. The RFP Evaluation Process
      1. The Scope of the PoC
      2. The Execution of the PoC
      3. The Final Decision
    5. Peroration
    6. Summary
  14. Index
  15. About the Authors

Product information

  • Title: Architecting Data and Machine Learning Platforms
  • Author(s): Marco Tranquillin, Valliappa Lakshmanan, Firat Tekiner
  • Release date: October 2023
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098151614