Book description
With this practical book, AI and machine learning practitioners will learn how to build and deploy data science projects on Amazon Web Services. The Amazon AI and machine learning stack unifies data science, data engineering, and application development to help you level up your skills. This guide shows you how to build and run pipelines in the cloud, then integrate the results into applications in minutes instead of days; a brief, illustrative code sketch follows the list of topics below. Throughout the book, authors Chris Fregly and Antje Barth demonstrate how to reduce cost and improve performance.
- Apply the Amazon AI and ML stack to real-world use cases for natural language processing, computer vision, fraud detection, conversational devices, and more
- Use automated machine learning to implement a specific subset of use cases with SageMaker Autopilot
- Dive deep into the complete model development lifecycle for a BERT-based NLP use case including data ingestion, analysis, model training, and deployment
- Tie everything together into a repeatable machine learning operations pipeline
- Explore real-time ML, anomaly detection, and streaming analytics on data streams with Amazon Kinesis and Amazon Managed Streaming for Apache Kafka (Amazon MSK)
- Learn security best practices for data science projects and workflows including identity and access management, authentication, authorization, and more
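As a taste of the "minutes instead of days" integration described above, here is a minimal sketch of calling an already-deployed SageMaker model from application code with boto3. The endpoint name and JSON payload shape are hypothetical assumptions for illustration, not examples taken from the book:

```python
import json

import boto3

# A minimal sketch: invoke an already-deployed SageMaker endpoint from an
# application. The endpoint name and JSON payload shape are hypothetical.
runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="product-reviews-classifier",  # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps({"inputs": "I love this product!"}),
)

# The response body is a streaming object; read and decode the prediction.
print(json.loads(response["Body"].read()))
```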
Table of contents
- Preface
- 1. Introduction to Data Science on AWS
- Benefits of Cloud Computing
- Data Science Pipelines and Workflows
- MLOps Best Practices
- Amazon AI Services and AutoML with Amazon SageMaker
- Data Ingestion, Exploration, and Preparation in AWS
- Data Ingestion and Data Lakes with Amazon S3 and AWS Lake Formation
- Data Analysis with Amazon Athena, Amazon Redshift, and Amazon QuickSight
- Evaluate Data Quality with AWS Deequ and SageMaker Processing Jobs
- Label Training Data with SageMaker Ground Truth
- Data Transformation with AWS Glue DataBrew, SageMaker Data Wrangler, and SageMaker Processing Jobs
- Model Training and Tuning with Amazon SageMaker
- Model Deployment with Amazon SageMaker and AWS Lambda Functions
- Streaming Analytics and Machine Learning on AWS
- AWS Infrastructure and Custom-Built Hardware
- Reduce Cost with Tags, Budgets, and Alerts
- Summary
- 2. Data Science Use Cases
- Innovation Across Every Industry
- Personalized Product Recommendations
- Detect Inappropriate Videos with Amazon Rekognition
- Demand Forecasting
- Identify Fake Accounts with Amazon Fraud Detector
- Enable Privacy-Leak Detection with Amazon Macie
- Conversational Devices and Voice Assistants
- Text Analysis and Natural Language Processing
- Cognitive Search and Natural Language Understanding
- Intelligent Customer Support Centers
- Industrial AI Services and Predictive Maintenance
- Home Automation with AWS IoT and Amazon SageMaker
- Extract Medical Information from Healthcare Documents
- Self-Optimizing and Intelligent Cloud Infrastructure
- Cognitive and Predictive Business Intelligence
- Educating the Next Generation of AI and ML Developers
- Program Nature’s Operating System with Quantum Computing
- Increase Performance and Reduce Cost
- Summary
- 3. Automated Machine Learning
- 4. Ingest Data into the Cloud
- 5. Explore the Dataset
- Tools for Exploring Data in AWS
- Visualize Our Data Lake with SageMaker Studio
- Query Our Data Warehouse
- Create Dashboards with Amazon QuickSight
- Detect Data-Quality Issues with Amazon SageMaker and Apache Spark
- Detect Bias in Our Dataset
- Detect Different Types of Drift with SageMaker Clarify
- Analyze Our Data with AWS Glue DataBrew
- Reduce Cost and Increase Performance
- Summary
- 6. Prepare the Dataset for Model Training
- Perform Feature Selection and Engineering
- Scale Feature Engineering with SageMaker Processing Jobs
- Share Features Through SageMaker Feature Store
- Ingest and Transform Data with SageMaker Data Wrangler
- Track Artifact and Experiment Lineage with Amazon SageMaker
- Ingest and Transform Data with AWS Glue DataBrew
- Summary
- 7. Train Your First Model
- Understand the SageMaker Infrastructure
- Deploy a Pre-Trained BERT Model with SageMaker JumpStart
- Develop a SageMaker Model
- A Brief History of Natural Language Processing
- BERT Transformer Architecture
- Training BERT from Scratch
- Fine-Tune a Pre-Trained BERT Model
- Create the Training Script
- Launch the Training Script from a SageMaker Notebook
- Define the Metrics to Capture and Monitor
- Configure the Hyper-Parameters for Our Algorithm
- Select Instance Type and Instance Count
- Putting It All Together in the Notebook
- Download and Inspect Our Trained Model from S3
- Show Experiment Lineage for Our SageMaker Training Job
- Show Artifact Lineage for Our SageMaker Training Job
- Evaluate Models
- Debug and Profile Model Training with SageMaker Debugger
- Interpret and Explain Model Predictions
- Detect Model Bias and Explain Predictions
- More Training Options for BERT
- Reduce Cost and Increase Performance
- Use Small Notebook Instances
- Test Model-Training Scripts Locally in the Notebook
- Profile Training Jobs with SageMaker Debugger
- Start with a Pre-Trained Model
- Use 16-Bit Half Precision and bfloat16
- Mixed 32-Bit Full and 16-Bit Half Precision
- Quantization
- Use Training-Optimized Hardware
- Spot Instances and Checkpoints
- Early Stopping Rule in SageMaker Debugger
- Summary
- 8. Train and Optimize Models at Scale
- 9. Deploy Models to Production
- Choose Real-Time or Batch Predictions
- Real-Time Predictions with SageMaker Endpoints
- Deploy Model Using SageMaker Python SDK (see the sketch after this table of contents)
- Track Model Deployment in Our Experiment
- Analyze the Experiment Lineage of a Deployed Model
- Invoke Predictions Using the SageMaker Python SDK
- Invoke Predictions Using HTTP POST
- Create Inference Pipelines
- Invoke SageMaker Models from SQL and Graph-Based Queries
- Auto-Scale SageMaker Endpoints Using Amazon CloudWatch
- Strategies to Deploy New and Updated Models
- Testing and Comparing New Models
- Monitor Model Performance and Detect Drift
- Monitor Data Quality of Deployed SageMaker Endpoints
- Monitor Model Quality of Deployed SageMaker Endpoints
- Monitor Bias Drift of Deployed SageMaker Endpoints
- Monitor Feature Attribution Drift of Deployed SageMaker Endpoints
- Perform Batch Predictions with SageMaker Batch Transform
- AWS Lambda Functions and Amazon API Gateway
- Optimize and Manage Models at the Edge
- Deploy a PyTorch Model with TorchServe
- TensorFlow-BERT Inference with AWS Deep Java Library
- Reduce Cost and Increase Performance
- Summary
- 10. Pipelines and MLOps
- Machine Learning Operations
- Software Pipelines
- Machine Learning Pipelines
- Pipeline Orchestration with SageMaker Pipelines
- Create an Experiment to Track Our Pipeline Lineage
- Define Our Pipeline Steps
- Configure the Pipeline Parameters
- Create the Pipeline
- Start the Pipeline with the Python SDK
- Start the Pipeline with the SageMaker Studio UI
- Approve the Model for Staging and Production
- Review the Pipeline Artifact Lineage
- Review the Pipeline Experiment Lineage
- Automation with SageMaker Pipelines
- More Pipeline Options
- Human-in-the-Loop Workflows
- Reduce Cost and Improve Performance
- Summary
- 11. Streaming Analytics and Machine Learning
- Online Learning Versus Offline Learning
- Streaming Applications
- Windowed Queries on Streaming Data
- Streaming Analytics and Machine Learning on AWS
- Classify Real-Time Product Reviews with Amazon Kinesis, AWS Lambda, and Amazon SageMaker
- Implement Streaming Data Ingest Using Amazon Kinesis Data Firehose
- Summarize Real-Time Product Reviews with Streaming Analytics
- Setting Up Amazon Kinesis Data Analytics
- Amazon Kinesis Data Analytics Applications
- Classify Product Reviews with Apache Kafka, AWS Lambda, and Amazon SageMaker
- Reduce Cost and Improve Performance
- Summary
- 12. Secure Data Science on AWS
- Shared Responsibility Model Between AWS and Customers
- Applying AWS Identity and Access Management
- Isolating Compute and Network Environments
- Securing Amazon S3 Data Access
- Require a VPC Endpoint with an S3 Bucket Policy
- Limit S3 APIs for an S3 Bucket with a VPC Endpoint Policy
- Restrict S3 Bucket Access to a Specific VPC with an S3 Bucket Policy
- Limit S3 APIs with an S3 Bucket Policy
- Restrict S3 Data Access Using IAM Role Policies
- Restrict S3 Bucket Access to a Specific VPC with an IAM Role Policy
- Restrict S3 Data Access Using S3 Access Points
- Encryption at Rest
- Create an AWS KMS Key
- Encrypt the Amazon EBS Volumes During Training
- Encrypt the Uploaded Model in S3 After Training
- Store Encryption Keys with AWS KMS
- Enforce S3 Encryption for Uploaded S3 Objects
- Enforce Encryption at Rest for SageMaker Jobs
- Enforce Encryption at Rest for SageMaker Notebooks
- Enforce Encryption at Rest for SageMaker Studio
- Encryption in Transit
- Securing SageMaker Notebook Instances
- Securing SageMaker Studio
- Securing SageMaker Jobs and Models
- Securing AWS Lake Formation
- Securing Database Credentials with AWS Secrets Manager
- Governance
- Auditability
- Reduce Cost and Improve Performance
- Summary
- Index
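The deployment workflow in Chapter 9 ("Deploy Model Using SageMaker Python SDK", sketched here as noted above) centers on the SageMaker Python SDK's Model and Predictor abstractions. Below is a minimal sketch of deploying a trained PyTorch model from S3 to a real-time endpoint; the S3 path, entry-point script, framework versions, and instance type are illustrative assumptions, not values from the book:

```python
import sagemaker
from sagemaker.pytorch import PyTorchModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# A minimal sketch of deploying a trained model with the SageMaker Python SDK.
# The S3 path, entry point, and versions below are hypothetical placeholders.
role = sagemaker.get_execution_role()  # IAM role that SageMaker will assume

model = PyTorchModel(
    model_data="s3://my-bucket/models/model.tar.gz",  # hypothetical artifact
    role=role,
    entry_point="inference.py",  # hypothetical inference handler script
    framework_version="1.13",
    py_version="py39",
)

# Provision a real-time HTTPS endpoint backed by one ml.m5.xlarge instance,
# exchanging JSON with the (hypothetical) inference handler.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

print(predictor.predict({"inputs": "I love this product!"}))
```

Once you have validated the endpoint, `predictor.delete_endpoint()` tears it down and stops the per-hour instance charge.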
Product information
- Title: Data Science on AWS
- Author(s): Chris Fregly, Antje Barth
- Release date: April 2021
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781492079392