Data Science on AWS

Book description

If you use data to make critical business decisions, this book is for you. Whether you're a data analyst, research scientist, data engineer, ML engineer, data scientist, application developer, or systems developer, this guide helps you broaden your understanding of the modern data science stack, create your own machine learning pipelines, and deploy them to applications at production scale.

The AWS data science stack unifies data science, data engineering, and application development to help you level up your skills beyond your current role. Authors Antje Barth and Chris Fregly show you how to build your own ML pipelines from existing APIs, submit them to the cloud, and integrate results into your application in minutes instead of days.

  • Innovate quickly and save money with AWS's on-demand, serverless, and cloud-managed services
  • Implement open source technologies such as Kubeflow, Kubernetes, TensorFlow, and Apache Spark on AWS
  • Build and deploy an end-to-end, continuous ML pipeline with the AWS data science stack
  • Perform advanced analytics on at-rest and streaming data with AWS and Spark
  • Integrate streaming data into your ML pipeline for continuous delivery of ML models using AWS and Apache Kafka
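
The workflow the book teaches centers on the SageMaker Python SDK. As a minimal sketch (not taken from the book's code; the IAM role ARN, S3 path, and train.py entry point below are placeholder assumptions), launching a managed training job and deploying the resulting model to a real-time endpoint looks roughly like this:

    # Minimal sketch of a SageMaker Python SDK workflow, assuming a
    # placeholder IAM role, S3 training path, and a train.py script.
    from sagemaker.tensorflow import TensorFlow

    role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

    # Configure a managed TensorFlow training job that runs train.py.
    estimator = TensorFlow(
        entry_point="train.py",              # placeholder training script
        role=role,
        instance_count=1,
        instance_type="ml.m5.xlarge",
        framework_version="2.3.1",
        py_version="py37",
        hyperparameters={"epochs": 1, "learning_rate": 1e-5},
    )

    # Launch training on data already staged in S3 (placeholder bucket/prefix).
    estimator.fit({"train": "s3://my-bucket/train/"})

    # Deploy to a real-time HTTPS endpoint and invoke a prediction.
    predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
    print(predictor.predict({"instances": [["This book is great!"]]}))

Later chapters build on this same flow with Autopilot, distributed training, SageMaker Pipelines, and endpoint monitoring.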

Table of contents

  1. Automated Machine Learning
    1. Automated Machine Learning
    2. Automated Machine Learning with SageMaker Autopilot
      1. Understand Autopilot’s Transparent Approach to AutoML
      2. Track Experiments with Autopilot
    3. Train and Deploy a Text Classifier with Autopilot
      1. Train and Deploy with the Autopilot UI
      2. Train and Deploy a Model with the Autopilot Python SDK
      3. Predict with Amazon Athena and Autopilot
      4. Train and Predict with Amazon Redshift ML and Autopilot
    4. Automated Machine Learning with the Comprehend AI Service
      1. Predict with the Comprehend Built-In Model
      2. Train and Deploy a Comprehend Custom Model with the Comprehend UI
      3. Train and Deploy a Comprehend Custom Model with the Comprehend Python SDK
    5. Summary
  2. Ingest Data into the Cloud
    1. Data Lakes
      1. Import Your Data into the S3 Data Lake
      2. Describe the Dataset
    2. Query the S3 Data Lake with Amazon Athena
      1. Access Athena from the AWS Console
      2. Register Your S3 Data as an Athena Table
      3. Update the Athena Table as New Data Arrives in S3
      4. Create a Parquet-based Table in Athena
    3. Continuously Ingest New Data with AWS Glue Crawler
    4. Build a Lake House with Redshift Spectrum
      1. Export Redshift Data to S3 Data Lake as Parquet
      2. Share Data Between Redshift Clusters
    5. Choose Between Athena and Redshift
    6. Reduce Cost and Increase Performance
      1. S3 Intelligent Tiering
      2. Parquet Partitions and Compression
      3. Redshift Table Design and Compression
    7. Summary
  3. Explore the Dataset
    1. Tools for Exploring Data in AWS
    2. Visualize Our Data Lake with SageMaker Studio
      1. Prepare SageMaker Studio to Visualize Our Dataset
      2. Run a Sample Athena Query in SageMaker Studio
      3. Dive Deep into the Dataset with Athena and SageMaker
    3. Query Our Data Warehouse
      1. Prepare SageMaker Studio for Redshift Queries
      2. Run a Sample Redshift Query from SageMaker Studio
      3. Dive Deep into the Dataset with Redshift and SageMaker
    4. Create Dashboards with QuickSight
      1. Set Up the Data Source
      2. Query and Visualize the Dataset Within QuickSight
    5. Detect Data Quality Issues with SageMaker and Apache Spark
      1. SageMaker Processing Jobs
      2. Analyze Our Dataset with Deequ and Apache Spark
    6. Detect Data Bias with SageMaker Clarify
      1. Generate and Visualize Bias Reports with SageMaker Data Wrangler
      2. Detect Bias with a SageMaker Clarify Processing Job
      3. Integrate Bias Detection into Custom Scripts with Clarify Open Source
    7. Identify Feature Importance with SageMaker Data Wrangler Quick Model
    8. Detect Different Types of Drift with SageMaker Clarify
    9. Analyze Our Data with AWS Glue DataBrew
    10. Reduce Cost and Increase Performance
      1. Approximate Counts with HyperLogLog
      2. Dynamically Scale Your Data Warehouse with Redshift AQUA
      3. Improve Dashboard Performance with QuickSight SPICE
    11. Summary
  4. Prepare the Dataset for Model Training
    1. Perform Feature Selection and Engineering
      1. Select Training Features and Labels
      2. Balance the Dataset to Improve Your Model
      3. Split the Dataset into Train, Validation, and Test
      4. Transform Raw Text into BERT Embeddings
      5. Convert Features to TFRecord File Format
    2. Scale Feature Engineering with SageMaker Processing Jobs
      1. Transform with Scikit-Learn and TensorFlow
      2. Transform with Apache Spark and TensorFlow
    3. Share Features through a Feature Store
      1. Ingest Features into SageMaker Feature Store
      2. Retrieve Features from SageMaker Feature Store
    4. Ingest and Transform Data with SageMaker Data Wrangler
    5. Track Lineage with SageMaker Lineage and Experiments
      1. Understand Lineage-Tracking Concepts
      2. Show Lineage of a Feature Engineering Job
      3. Understand the SageMaker Experiments API
    6. Ingest and Transform Data with AWS Glue DataBrew
    7. Reduce Cost and Increase Performance
      1. Test Processing Scripts Locally in the Notebook
    8. Summary
  5. Train Your First Model
    1. Understand the SageMaker Infrastructure
      1. Introduction to SageMaker Containers
      2. Increase Availability with Compute and Network Isolation
    2. Deploy a Pre-Trained BERT Model with SageMaker JumpStart
    3. Develop a SageMaker Model
      1. Built-In Algorithms
      2. Bring Your Own Script or Script Mode
      3. Bring Your Own Container
    4. A Brief History of Natural Language Processing
      1. BERT Transformer Architecture
    5. Training BERT from Scratch
      1. Masked Language Model (Masked LM)
      2. Next Sentence Prediction
    6. Use Pre-Trained BERT Models
      1. Fine-Tune the BERT Model to Create a Custom Classifier
    7. Create the Training Script
      1. Set Up the Train, Validation, and Test Datasets
      2. Set Up the Custom Classifier Model
      3. Train and Validate the Model
      4. Save the Model
    8. Launch the Training Script from a SageMaker Notebook
      1. Define the Metrics to Capture and Monitor
      2. Configure the Hyper-Parameters for Our Algorithm
      3. Putting It All Together in the Notebook
      4. Download and Inspect Our Trained Model from S3
      5. Show Experiment Lineage for Our SageMaker Training Job
      6. Show Artifact Lineage For Our SageMaker Training Job
    9. Evaluate Our Models
      1. Run Some Ad Hoc Predictions from the Notebook
      2. Analyze Our Classifier with a Confusion Matrix
      3. Visualize Our Neural Network with TensorBoard
      4. Monitor Metrics with SageMaker Studio
      5. Monitor Metrics with CloudWatch Metrics
    10. Debug and Profile Model Training with SageMaker Debugger
      1. Detect and Resolve Issues with Debugger Rules and Actions
      2. Profile Training Jobs
    11. Interpret and Explain Model Predictions
    12. Detect Model Bias and Explain Predictions
      1. Detect Bias with a SageMaker Clarify Processing Job
      2. Feature Attribution and Importance with SageMaker Clarify and SHAP
    13. More Training Options for BERT
      1. Convert TensorFlow BERT Model to PyTorch
      2. Train PyTorch BERT Models with SageMaker
      3. Train MXNet BERT Models with SageMaker
      4. Train BERT Models with PyTorch and AWS Deep Java Library
    14. Reduce Cost and Increase Performance
      1. Use Small Notebook Instances
      2. Test Model-Training Scripts Locally in the Notebook
      3. Profile Training Jobs with SageMaker Debugger
      4. Start with a Pre-Trained Model
      5. Use 16-bit Half Precision and bfloat16
      6. Mixed 32-bit Full and 16-bit Half Precision
      7. Quantization
      8. Use Training-Optimized Hardware
      9. Spot Instances and Checkpoints
      10. Early Stopping Rule in SageMaker Debugger
    15. Summary
  6. Train and Optimize Models at Scale
    1. Automatically Find the Best Model Hyper-Parameters
      1. Set Up the Hyper-Parameter Ranges
      2. Run the Hyper-Parameter Tuning Job
      3. Analyze the Best Hyper-Parameters from the Tuning Job
      4. Show Experiment Lineage for Our SageMaker Tuning Job
    2. Warm Start Additional Hyper-Parameter Tuning Jobs
      1. Run Hyper-Parameter Tuning Job using Warm Start
      2. Analyze the Best Hyper-Parameters from the Warm-Start Tuning Job
    3. Scale Out with SageMaker Distributed Training
      1. Choose a Distributed-Communication Strategy
      2. Choose a Parallelism Strategy
      3. Choose a Distributed File System
      4. Launch the Distributed Training Job
    4. Reduce Cost and Increase Performance
      1. Start with Reasonable Hyper-Parameter Ranges
      2. Shard the Data with ShardedByS3Key
      3. Stream Data On-the-Fly with PipeMode
      4. Enable Enhanced Networking
    5. Summary
  7. Deploy Models to Production
    1. Choose Real-Time or Batch Predictions
    2. Real-Time Predictions with SageMaker Endpoints
      1. Deploy a Model Using the SageMaker Python SDK
      2. Track Model Deployment in Our Experiment
      3. Analyze the Lineage of a Deployed Model
      4. Invoke Predictions Using the SageMaker Python SDK
      5. Invoke Predictions Using HTTP POST
      6. Create Inference Pipelines
      7. Invoke SageMaker Models from SQL and Graph-based Queries
    3. Auto-Scale SageMaker Endpoints Using CloudWatch
      1. Define a Scaling Policy with AWS-Provided Metrics
      2. Define a Scaling Policy with a Custom Metric
      3. Tune Responsiveness Using a Cooldown Period
      4. Auto-Scale Policies
    4. Strategies to Deploy New and Updated Models
      1. Split Traffic for Canary Rollouts
      2. Shift Traffic for Blue/Green Deployments
    5. Testing and Comparing New Models
      1. Perform A/B Tests to Compare Model Variants
      2. Reinforcement Learning with Multi-Armed Bandit Testing
    6. Monitor Model Performance and Detect Drift
      1. Enable Data Capture
      2. Understand Baselines and Drift
    7. Monitor Data Quality of a Deployed SageMaker Endpoint
      1. Create a Baseline to Measure Data Quality
      2. Schedule Data-Quality Monitoring Jobs
      3. Inspect Data-Quality Results
    8. Monitor Model Quality of Deployed SageMaker Endpoints
      1. Create a Baseline to Measure Model Quality
      2. Schedule Model-Quality Monitoring Jobs
      3. Inspect Model-Quality Monitoring Results
    9. Monitor Bias Drift of Deployed SageMaker Endpoints
      1. Create a Baseline to Detect Bias
      2. Schedule Bias-Drift Monitoring Jobs
      3. Inspect Bias-Drift Monitoring Results
    10. Monitor Explainability Drift of Deployed SageMaker Endpoints
      1. Create a Baseline to Monitor Explainability
      2. Schedule Explainability-Drift Monitoring Jobs
      3. Inspect Explainability-Drift Monitoring Results
    11. Perform Batch Predictions with SageMaker Batch Transform
      1. Select an Instance Type
      2. Set Up the Input Data
      3. Tune the Batch Transformation Configuration
      4. Prepare the Batch Transformation Job
      5. Run the Batch Transformation Job
      6. Review the Batch Predictions
    12. Lambda Functions and API Gateway
    13. Optimize and Manage Models at the Edge
    14. Deploy a PyTorch Model with TorchServe
    15. TensorFlow-BERT Inference with AWS Deep Java Library
    16. Reduce Cost and Increase Performance
      1. Delete Unused Endpoints and Scale In Under-Utilized Clusters
      2. Deploy Multiple Models in One Container
      3. Attach a GPU-based Elastic Inference Accelerator
      4. Optimize a Trained Model with SageMaker Neo and TensorFlow Lite
      5. Use Inference-Optimized Hardware
    17. Summary
  8. Pipelines and MLOps
    1. Machine Learning Operations
    2. Software Pipelines
    3. Machine Learning Pipelines
      1. Components of Effective Machine Learning Pipelines
      2. Steps of an Effective Machine Learning Pipeline
    4. Pipeline Orchestration with SageMaker Pipelines
    5. More Pipeline Automation Options
      1. Step Functions and the Data Science SDK
      2. Kubeflow Pipelines
      3. Apache Airflow
      4. MLflow
      5. TensorFlow Extended (TFX)
    6. Pipeline Automation with SageMaker Pipelines
      1. GitOps Trigger When New Code Is Committed
      2. S3 Trigger When New Data Arrives
      3. Time-Based Schedule Trigger
      4. Statistical Drift Trigger
    7. Human-in-the-Loop Workflows
      1. Improving Model Accuracy with Amazon Augmented AI
      2. Active-Learning Feedback Loops with Ground Truth
    8. Summary

Product information

  • Title: Data Science on AWS
  • Author(s): Chris Fregly, Antje Barth
  • Release date: July 2021
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492079392