The Self-Service Data Roadmap

Book description

Data-driven insights are a key competitive advantage for any industry today, but deriving insights from raw data can still take days or weeks. Most organizations can’t scale data science teams fast enough to keep up with the growing amounts of data to transform. What’s the answer? Self-service data.

With this practical book, data engineers, data scientists, and team managers will learn how to build a self-service data science platform that helps anyone in their organization extract insights from data. Sandeep Uttamchandani provides a scorecard to track and address bottlenecks that slow down time to insight across data discovery, transformation, processing, and production. This book bridges the gap between data scientists bottlenecked by engineering realities and data engineers unclear about ways to make self-service work.

  • Build a self-service portal to support data discovery, quality, lineage, and governance
  • Select the best approach for each self-service capability using open source cloud technologies
  • Tailor self-service for the people, processes, and technology maturity of your data platform
  • Implement capabilities to democratize data and reduce time to insight
  • Scale your self-service portal to support a large number of users within your organization

Table of contents

  1. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. O’Reilly Online Learning
    4. How to Contact Us
  2. 1. Introduction
    1. Journey Map from Raw Data to Insights
      1. Discover
      2. Prep
      3. Build
      4. Operationalize
    2. Defining Your Time-to-Insight Scorecard
    3. Build Your Self-Service Data Roadmap
  3. I. Self-Service Data Discovery
  4. 2. Metadata Catalog Service
    1. Journey Map
      1. Understanding Datasets
      2. Analyzing Datasets
      3. Knowledge Scaling
    2. Minimizing Time to Interpret
      1. Extracting Technical Metadata
      2. Extracting Operational Metadata
      3. Gathering Team Knowledge
    3. Defining Requirements
      1. Technical Metadata Extractor Requirements
      2. Operational Metadata Requirements
      3. Team Knowledge Aggregator Requirements
    4. Implementation Patterns
      1. Source-Specific Connectors Pattern
      2. Lineage Correlation Pattern
      3. Team Knowledge Pattern
    5. Summary
  5. 3. Search Service
    1. Journey Map
      1. Determining Feasibility of the Business Problem
      2. Selecting Relevant Datasets for Data Prep
      3. Reusing Existing Artifacts for Prototyping
    2. Minimizing Time to Find
      1. Indexing Datasets and Artifacts
      2. Ranking Results
      3. Access Control
    3. Defining Requirements
      1. Indexer Requirements
      2. Ranking Requirements
      3. Access Control Requirements
      4. Nonfunctional Requirements
    4. Implementation Patterns
      1. Push-Pull Indexer Pattern
      2. Hybrid Search Ranking Pattern
      3. Catalog Access Control Pattern
    5. Summary
  6. 4. Feature Store Service
    1. Journey Map
      1. Finding Available Features
      2. Training Set Generation
      3. Feature Pipeline for Online Inference
    2. Minimize Time to Featurize
      1. Feature Computation
      2. Feature Serving
    3. Defining Requirements
      1. Feature Computation
      2. Feature Serving
      3. Nonfunctional Requirements
    4. Implementation Patterns
      1. Hybrid Feature Computation Pattern
      2. Feature Registry Pattern
    5. Summary
  7. 5. Data Movement Service
    1. Journey Map
      1. Aggregating Data Across Sources
      2. Moving Raw Data to Specialized Query Engines
      3. Moving Processed Data to Serving Stores
      4. Exploratory Analysis Across Sources
    2. Minimizing Time to Data Availability
      1. Data Ingestion Configuration and Change Management
      2. Compliance
      3. Data Quality Verification
    3. Defining Requirements
      1. Ingestion Requirements
      2. Transformation Requirements
      3. Compliance Requirements
      4. Verification Requirements
      5. Nonfunctional Requirements
    4. Implementation Patterns
      1. Batch Ingestion Pattern
      2. Change Data Capture Ingestion Pattern
      3. Event Aggregation Pattern
    5. Summary
  8. 6. Clickstream Tracking Service
    1. Journey Map
    2. Minimizing Time to Click Metrics
      1. Managing Instrumentation
      2. Event Enrichment
      3. Building Insights
    3. Defining Requirements
      1. Instrumentation Requirements Checklist
      2. Enrichment Requirements Checklist
    4. Implementation Patterns
      1. Instrumentation Pattern
      2. Rule-Based Enrichment Patterns
      3. Consumption Patterns
    5. Summary
  9. II. Self-Service Data Prep
  10. 7. Data Lake Management Service
    1. Journey Map
      1. Primitive Life Cycle Management
      2. Managing Data Updates
      3. Managing Batching and Streaming Data Flows
    2. Minimizing Time to Data Lake Management
      1. Requirements
    3. Implementation Patterns
      1. Data Life Cycle Primitives Pattern
      2. Transactional Pattern
      3. Advanced Data Management Pattern
    4. Summary
  11. 8. Data Wrangling Service
    1. Journey Map
    2. Minimizing Time to Wrangle
      1. Defining Requirements
      2. Curating Data
      3. Operational Monitoring
    3. Defining Requirements
    4. Implementation Patterns
      1. Exploratory Data Analysis Patterns
      2. Analytical Transformation Patterns
    5. Summary
  12. 9. Data Rights Governance Service
    1. Journey Map
      1. Executing Data Rights Requests
      2. Discovery of Datasets
      3. Model Retraining
    2. Minimizing Time to Comply
      1. Tracking the Customer Data Life Cycle
      2. Executing Customer Data Rights Requests
      3. Limiting Data Access
    3. Defining Requirements
      1. Current Pain Point Questionnaire
      2. Interop Checklist
      3. Functional Requirements
      4. Nonfunctional Requirements
    4. Implementation Patterns
      1. Sensitive Data Discovery and Classification Pattern
      2. Data Lake Deletion Pattern
      3. Use Case–Dependent Access Control
    5. Summary
  13. III. Self-Service Build
  14. 10. Data Virtualization Service
    1. Journey Map
      1. Exploring Data Sources
      2. Picking a Processing Cluster
    2. Minimizing Time to Query
      1. Picking the Execution Environment
      2. Formulating Polyglot Queries
      3. Joining Data Across Silos
    3. Defining Requirements
      1. Current Pain Point Analysis
      2. Operational Requirements
      3. Functional Requirements
      4. Nonfunctional Requirements
    4. Implementation Patterns
      1. Automatic Query Routing Pattern
      2. Unified Query Pattern
      3. Federated Query Pattern
    5. Summary
  15. 11. Data Transformation Service
    1. Journey Map
      1. Production Dashboard and ML Pipelines
      2. Data-Driven Storytelling
    2. Minimizing Time to Transform
      1. Transformation Implementation
      2. Transformation Execution
      3. Transformation Operations
    3. Defining Requirements
      1. Current State Questionnaire
      2. Functional Requirements
      3. Nonfunctional Requirements
    4. Implementation Patterns
      1. Implementation Pattern
      2. Execution Patterns
    5. Summary
  16. 12. Model Training Service
    1. Journey Map
      1. Model Prototyping
      2. Continuous Training
      3. Model Debugging
    2. Minimizing Time to Train
      1. Training Orchestration
      2. Tuning
      3. Continuous Training
    3. Defining Requirements
      1. Training Orchestration
      2. Tuning
      3. Continuous Training
      4. Nonfunctional Requirements
    4. Implementation Patterns
      1. Distributed Training Orchestrator Pattern
      2. Automated Tuning Pattern
      3. Data-Aware Continuous Training
    5. Summary
  17. 13. Continuous Integration Service
    1. Journey Map
      1. Collaborating on an ML Pipeline
      2. Integrating ETL Changes
      3. Validating Schema Changes
    2. Minimizing Time to Integrate
      1. Experiment Tracking
      2. Reproducible Deployment
      3. Testing Validation
    3. Defining Requirements
      1. Experiment Tracking Module
      2. Pipeline Packaging Module
      3. Testing Automation Module
    4. Implementation Patterns
      1. Programmable Tracking Pattern
      2. Reproducible Project Pattern
    5. Summary
  18. 14. A/B Testing Service
    1. Journey Map
    2. Minimizing Time to A/B Test
      1. Experiment Design
      2. Execution at Scale
      3. Experiment Optimization
    3. Implementation Patterns
      1. Experiment Specification Pattern
      2. Metrics Definition Pattern
      3. Automated Experiment Optimization
    4. Summary
  19. IV. Self-Service Operationalize
  20. 15. Query Optimization Service
    1. Journey Map
      1. Avoiding Cluster Clogs
      2. Resolving Runtime Query Issues
      3. Speeding Up Applications
    2. Minimizing Time to Optimize
      1. Aggregating Statistics
      2. Analyzing Statistics
      3. Optimizing Jobs
    3. Defining Requirements
      1. Current Pain Points Questionnaire
      2. Interop Requirements
      3. Functionality Requirements
      4. Nonfunctional Requirements
    4. Implementation Patterns
      1. Avoidance Pattern
      2. Operational Insights Pattern
      3. Automated Tuning Pattern
    5. Summary
  21. 16. Pipeline Orchestration Service
    1. Journey Map
      1. Invoke Exploratory Pipelines
      2. Run SLA-Bound Pipelines
    2. Minimizing Time to Orchestrate
      1. Defining Job Dependencies
      2. Distributed Execution
      3. Production Monitoring
    3. Defining Requirements
      1. Current Pain Points Questionnaire
      2. Operational Requirements
      3. Functional Requirements
      4. Nonfunctional Requirements
    4. Implementation Patterns
      1. Dependency Authoring Patterns
      2. Orchestration Observability Patterns
      3. Distributed Execution Pattern
    5. Summary
  22. 17. Model Deploy Service
    1. Journey Map
      1. Model Deployment in Production
      2. Model Maintenance and Upgrade
    2. Minimizing Time to Deploy
      1. Deployment Orchestration
      2. Performance Scaling
      3. Drift Monitoring
    3. Defining Requirements
      1. Orchestration
      2. Model Scaling and Performance
      3. Drift Verification
      4. Nonfunctional Requirements
    4. Implementation Patterns
      1. Universal Deployment Pattern
      2. Autoscaling Deployment Pattern
      3. Model Drift Tracking Pattern
    5. Summary
  23. 18. Quality Observability Service
    1. Journey Map
      1. Daily Data Quality Monitoring Reports
      2. Debugging Quality Issues
      3. Handling Low-Quality Data Records
    2. Minimizing Time to Insight Quality
      1. Verify the Accuracy of the Data
      2. Detect Quality Anomalies
      3. Prevent Data Quality Issues
    3. Defining Requirements
      1. Detection and Handling Data Quality Issues
      2. Functional Requirements
      3. Nonfunctional Requirements
    4. Implementation Patterns
      1. Accuracy Models Pattern
      2. Profiling-Based Anomaly Detection Pattern
      3. Avoidance Pattern
    5. Summary
  24. 19. Cost Management Service
    1. Journey Map
      1. Monitoring Cost Usage
      2. Continuous Cost Optimization
    2. Minimizing Time to Optimize Cost
      1. Expenditure Observability
      2. Matching Supply and Demand
      3. Continuous Cost Optimization
    3. Defining Requirements
      1. Pain Points Questionnaire
      2. Functional Requirements
      3. Nonfunctional Requirements
    4. Implementation Patterns
      1. Continuous Cost Monitoring Pattern
      2. Automated Scaling Pattern
      3. Cost Advisor Pattern
    5. Summary
  25. Index
  26. About the Author

Product information

  • Title: The Self-Service Data Roadmap
  • Author(s): Sandeep Uttamchandani
  • Release date: September 2020
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492075257