Architecting Data and Machine Learning Platforms

by Marco Tranquillin, Valliappa Lakshmanan, Firat Tekiner
Released December 2023
Publisher(s): O'Reilly Media, Inc.
ISBN: 9781098151614

Book description

All cloud architects need to know how to build data platforms—the key to enabling businesses with data and delivering enterprise-wide intelligence in a fast and efficient way. This handbook is ideal for learning how to design, build, and modernize cloud native data and machine learning platforms using AWS, Azure, Google Cloud, or multicloud tools like Fivetran, dbt, Snowflake, and Databricks.

Authors Marco Tranquillin, Valliappa Lakshmanan, and Firat Tekiner cover the entire data lifecycle in a cloud environment, from ingestion to activation, using real-world enterprise architectures. You'll learn how to transform and modernize familiar solutions, like data warehouses and data lakes, and you'll be able to leverage recent AI/ML patterns to get more accurate insights faster and drive competitive advantage.

This book shows you how to:

  • Design a modern cloud native or hybrid data analytics and machine learning platform
  • Accelerate data-led innovation by consolidating enterprise data in a data platform
  • Democratize access to enterprise data and allow business teams to extract insights and build AI/ML capabilities
  • Enable your business to make decisions in real time using streaming pipelines
  • Move from a descriptive analytics approach to a more predictive and prescriptive one by building an MLOps platform
  • Make your organization more effective in working with data analytics and machine learning in a cloud environment

Table of contents

  1. Preface
    1. Why do you need a Cloud data platform?
    2. Who is this book for?
    3. Organization of this book
  2. 1. Modernizing Your Data Platform: An Introductory Overview
    1. Why do organizations need a data platform?
      1. Data silos means data movement tools
      2. Centralization of control
      3. The technologies underlying data warehouses and data lakes
      4. The cloud as a solution for main challenges
    2. Creating a Unified Analytics Platform
      1. Drawbacks of Data Warehouses and Data Lakes
      2. Convergence of Data Warehouses and Data Lakes
      3. Lakehouse
      4. Data Mesh
    3. Hybrid Cloud
      1. Reasons why hybrid is necessary
      2. Challenges of Hybrid Cloud
      3. Why Hybrid can Work
      4. Edge Computing
    4. Applying AI
      1. Machine Learning
      2. Uses of ML
    5. Why Cloud for AI?
      1. Cloud infrastructure
      2. Democratization
      3. Real-time
      4. ML Ops
    6. Core Principles
    7. Summary
  3. 2. Strategic steps to innovate with data
    1. Step 1: Strategy and Planning
      1. Strategic goals
      2. Identify stakeholders
      3. The importance of change management
      4. Craft the digital transformation journey
    2. Step 2: Reduce Total Cost of Ownership adopting a cloud approach
      1. Why Cloud costs less
      2. How much are the savings?
      3. When does Cloud help?
    3. Step 3: Break down silos
      1. Centralizing data
      2. Choosing storage
      3. Semantic layer
    4. Step 4: Make decisions in context faster
      1. Batch to Stream
      2. Contextual information
      3. Cost management
    5. Step 5: Leapfrog with packaged AI solutions
      1. Predictive analytics
      2. Unstructured data
      3. Personalization
      4. Packaged solutions
    6. Step 6: Operationalize AI-driven Workflows
      1. Identifying the right balance of automation and assistance
      2. Building a data culture
      3. Populating Your Data science team
    7. Step 7: Product Management for Data
      1. Applying Product Management Principles to Data
      2. 1. Understand and maintain a map of data flows in the enterprise
      3. 2. Identify key metrics
      4. 3. Agreed criteria, committed roadmap, and visionary backlog
      5. 4. Build for the customers you have
      6. 5. Don’t shift the burden of change management
      7. 6. Interview customers to discover their data needs
      8. 7. Whiteboard and prototype extensively
      9. 8. Build only what will be used immediately
      10. 9. Standardize common entities and KPIs
      11. 10. Provide self-service capabilities in your data platform
    8. Summary
  4. 3. Creating a Modern Data Analytics Capability
    1. The Data Life Cycle
      1. The journey to wisdom
      2. The water pipes structure analogy
      3. Collect
      4. Store
      5. Process/Transform
      6. Analyze/Visualize
      7. Activate
    2. Foundational elements
      1. From spreadsheet to the data warehouse
      2. The advent of Hadoop and the birth of Data Lake
    3. Governance and security
      1. The role of metadata
      2. Data lineage to foster security
    4. Moving to the public cloud
      1. Handling data explosion
      2. Real time analytics as a new standard
      3. Easy integration of AI/ML capabilities
    5. Modernizing Data Workflows
      1. Job to be Done
      2. It’s all about dependencies
      3. Modernize workflows
      4. Templatized setup
      5. Transform the workflow itself
    6. Summary
  5. 4. Designing your data team
    1. Classifying data processing organizations
    2. Data-analysis driven organization (DADO)
      1. The vision
      2. The personas
      3. The technological framework
    3. Data-engineering driven organization (DEDO)
      1. The vision
      2. The personas
      3. The technological framework
    4. Data-science driven organization (DSDO)
      1. The vision
      2. The personas
      3. The technological framework
    5. Summary
  6. 5. A Migration Framework
    1. A Four-Step Migration Framework
      1. Prepare and discover
      2. Assess and plan
      3. Execute
      4. Optimize
    2. Estimating the overall cost of the solution
      1. Audit of the existing infrastructure
      2. Request for Information/Proposal and Quotation
      3. Proof of Concept (PoC)/Minimum Viable Product (MVP)
    3. Setting up security and data governance
      1. Framework
      2. Artifacts
      3. Governance over the life of the data
    4. Schema, pipeline and data migration
      1. Schema migration
      2. Pipeline migration
      3. Data migration
    5. Summary
  7. 6. Architecting a data lake
    1. Data Lake and the cloud - A perfect marriage
      1. ROI Challenges with on-premises data lakes
      2. Cloud data lakes as a perfect habitat
    2. Architecture design and implementation details
      1. The role of the data catalog
      2. Batch, streaming and lambda/kappa
      3. Hadoop landscape
      4. Cloud data lake reference architecture
    3. Integrating the data lake: the real superpower
      1. APIs to extend the lake
      2. The evolution of data lake with Apache Iceberg, Apache Hudi and Delta Lake
      3. Interactive analytics with notebooks
    4. Democratizing data processing and reporting
      1. Build trust in the data
      2. Data ingestion is still an IT matter
    5. Machine Learning in the Data Lake
      1. Training on Raw data
      2. Predicting in the Data Lake
    6. Summary
  8. 7. Innovate with an enterprise data warehouse
    1. A modern data platform
      1. Organizational goals
      2. Technological challenges
      3. Technology trends and tools
    2. Hub-and-Spoke architecture
      1. Data ingest
      2. Business intelligence
      3. Transformations
      4. Organizational Structure
    3. Data Warehouse to enable Data Scientists
      1. Query interface
      2. Storage API
      3. Machine learning without moving your data
    4. Summary
  9. 8. Converging to a Lakehouse
    1. The need for a unique architecture
      1. User personas
      2. Anti-pattern: Duplicated data
    2. Converged architecture
      1. Lakehouse on cloud storage
      2. SQL-first lakehouse
    3. The benefits of convergence
    4. Summary
  10. 9. Architectures for Streaming
    1. The value of streaming
      1. Industry use cases
      2. Streaming use cases
    2. Streaming Ingest
      1. Streaming ETL
      2. Streaming ELT
      3. Streaming Insert
      4. Streaming from Edge Devices (IoT)
      5. Streaming Sinks
    3. Real-time Dashboards
      1. Live Querying
      2. Materialize Some Views
    4. Stream Analytics
      1. Time Series Analytics
      2. Clickstream Analytics
      3. Anomaly Detection
      4. Resilient Streaming
    5. Continuous Intelligence through ML
      1. Training model on streaming data
      2. Streaming ML Inference
      3. Automated Actions
    6. Summary
  11. About the Authors

