Data Science on AWS

by Chris Fregly, Antje Barth

April 2021

Intermediate to advanced

521 pages

13h 33m

English

O'Reilly Media, Inc.

Overview of the Chapters
Who Should Read This Book
Other Resources
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments

Benefits of Cloud Computing
Agility
Cost Savings
Elasticity
Innovate Faster
Deploy Globally in Minutes
Smooth Transition from Prototype to Production
Data Science Pipelines and Workflows
Amazon SageMaker Pipelines
AWS Step Functions Data Science SDK
Kubeflow Pipelines
Managed Workflows for Apache Airflow on AWS
MLflow
TensorFlow Extended
Human-in-the-Loop Workflows
MLOps Best Practices
Operational Excellence
Security
Reliability
Performance Efficiency
Cost Optimization
Amazon AI Services and AutoML with Amazon SageMaker
Amazon AI Services
AutoML with SageMaker Autopilot
Data Ingestion, Exploration, and Preparation in AWS
Data Ingestion and Data Lakes with Amazon S3 and AWS Lake Formation
Data Analysis with Amazon Athena, Amazon Redshift, and Amazon QuickSight
Evaluate Data Quality with AWS Deequ and SageMaker Processing Jobs
Label Training Data with SageMaker Ground Truth
Data Transformation with AWS Glue DataBrew, SageMaker Data Wrangler, and SageMaker Processing Jobs
Model Training and Tuning with Amazon SageMaker
Train Models with SageMaker Training and Experiments
Built-in Algorithms
Bring Your Own Script (Script Mode)
Bring Your Own Container
Pre-Built Solutions and Pre-Trained Models with SageMaker JumpStart
Tune and Validate Models with SageMaker Hyper-Parameter Tuning
Model Deployment with Amazon SageMaker and AWS Lambda Functions
SageMaker Endpoints
SageMaker Batch Transform
Serverless Model Deployment with AWS Lambda
Streaming Analytics and Machine Learning on AWS
Amazon Kinesis Streaming
Amazon Managed Streaming for Apache Kafka
Streaming Predictions and Anomaly Detection
AWS Infrastructure and Custom-Built Hardware
SageMaker Compute Instance Types
GPUs and Amazon Custom-Built Compute Hardware
GPU-Optimized Networking and Custom-Built Hardware
Storage Options Optimized for Large-Scale Model Training
Reduce Cost with Tags, Budgets, and Alerts
Summary

Innovation Across Every Industry
Personalized Product Recommendations
Recommend Products with Amazon Personalize
Generate Recommendations with Amazon SageMaker and TensorFlow
Generate Recommendations with Amazon SageMaker and Apache Spark
Detect Inappropriate Videos with Amazon Rekognition
Demand Forecasting
Predict Energy Consumption with Amazon Forecast
Predict Demand for Amazon EC2 Instances with Amazon Forecast
Identify Fake Accounts with Amazon Fraud Detector
Enable Privacy-Leak Detection with Amazon Macie
Conversational Devices and Voice Assistants
Speech Recognition with Amazon Lex
Text-to-Speech Conversion with Amazon Polly
Speech-to-Text Conversion with Amazon Transcribe
Text Analysis and Natural Language Processing
Translate Languages with Amazon Translate
Classify Customer-Support Messages with Amazon Comprehend
Extract Resume Details with Amazon Textract and Comprehend
Cognitive Search and Natural Language Understanding
Intelligent Customer Support Centers
Industrial AI Services and Predictive Maintenance
Home Automation with AWS IoT and Amazon SageMaker
Extract Medical Information from Healthcare Documents
Self-Optimizing and Intelligent Cloud Infrastructure
Predictive Auto Scaling for Amazon EC2
Anomaly Detection on Streams of Data
Cognitive and Predictive Business Intelligence
Ask Natural-Language Questions with Amazon QuickSight
Train and Invoke SageMaker Models with Amazon Redshift
Invoke Amazon Comprehend and SageMaker Models from Amazon Aurora SQL Database
Invoke SageMaker Model from Amazon Athena
Run Predictions on Graph Data Using Amazon Neptune
Educating the Next Generation of AI and ML Developers
Build Computer Vision Models with AWS DeepLens
Learn Reinforcement Learning with AWS DeepRacer
Understand GANs with AWS DeepComposer
Program Nature’s Operating System with Quantum Computing
Quantum Bits Versus Digital Bits
Quantum Supremacy and the Quantum Computing Eras
Cracking Cryptography
Molecular Simulations and Drug Discovery
Logistics and Financial Optimizations
Quantum Machine Learning and AI
Programming a Quantum Computer with Amazon Braket
AWS Center for Quantum Computing
Increase Performance and Reduce Cost
Automatic Code Reviews with CodeGuru Reviewer
Improve Application Performance with CodeGuru Profiler
Improve Application Availability with DevOps Guru
Summary

Automated Machine Learning with SageMaker Autopilot
Track Experiments with SageMaker Autopilot
Train and Deploy a Text Classifier with SageMaker Autopilot
Train and Deploy with SageMaker Autopilot UI
Train and Deploy a Model with the SageMaker Autopilot Python SDK
Predict with Amazon Athena and SageMaker Autopilot
Train and Predict with Amazon Redshift ML and SageMaker Autopilot
Automated Machine Learning with Amazon Comprehend
Predict with Amazon Comprehend’s Built-in Model
Train and Deploy a Custom Model with the Amazon Comprehend UI
Train and Deploy a Custom Model with the Amazon Comprehend Python SDK
Summary

Data Lakes
Import Data into the S3 Data Lake
Describe the Dataset
Query the Amazon S3 Data Lake with Amazon Athena
Access Athena from the AWS Console
Register S3 Data as an Athena Table
Update Athena Tables as New Data Arrives with AWS Glue Crawler
Create a Parquet-Based Table in Athena
Continuously Ingest New Data with AWS Glue Crawler
Build a Lake House with Amazon Redshift Spectrum
Export Amazon Redshift Data to S3 Data Lake as Parquet
Share Data Between Amazon Redshift Clusters
Choose Between Amazon Athena and Amazon Redshift
Reduce Cost and Increase Performance
S3 Intelligent-Tiering
Parquet Partitions and Compression
Amazon Redshift Table Design and Compression
Use Bloom Filters to Improve Query Performance
Materialized Views in Amazon Redshift Spectrum
Summary

Tools for Exploring Data in AWS
Visualize Our Data Lake with SageMaker Studio
Prepare SageMaker Studio to Visualize Our Dataset
Run a Sample Athena Query in SageMaker Studio
Dive Deep into the Dataset with Athena and SageMaker
Query Our Data Warehouse
Run a Sample Amazon Redshift Query from SageMaker Studio
Dive Deep into the Dataset with Amazon Redshift and SageMaker
Create Dashboards with Amazon QuickSight
Detect Data-Quality Issues with Amazon SageMaker and Apache Spark
SageMaker Processing Jobs
Analyze Our Dataset with Deequ and Apache Spark
Detect Bias in Our Dataset
Generate and Visualize Bias Reports with SageMaker Data Wrangler
Detect Bias with a SageMaker Clarify Processing Job
Integrate Bias Detection into Custom Scripts with SageMaker Clarify Open Source
Mitigate Data Bias by Balancing the Data
Detect Different Types of Drift with SageMaker Clarify
Analyze Our Data with AWS Glue DataBrew
Reduce Cost and Increase Performance
Use a Shared S3 Bucket for Nonsensitive Athena Query Results
Approximate Counts with HyperLogLog
Dynamically Scale a Data Warehouse with AQUA for Amazon Redshift
Improve Dashboard Performance with QuickSight SPICE
Summary

Perform Feature Selection and Engineering
Select Training Features Based on Feature Importance
Balance the Dataset to Improve Model Accuracy
Split the Dataset into Train, Validation, and Test Sets
Transform Raw Text into BERT Embeddings
Convert Features and Labels to Optimized TensorFlow File Format
Scale Feature Engineering with SageMaker Processing Jobs
Transform with scikit-learn and TensorFlow
Transform with Apache Spark and TensorFlow
Share Features Through SageMaker Feature Store
Ingest Features into SageMaker Feature Store
Retrieve Features from SageMaker Feature Store
Ingest and Transform Data with SageMaker Data Wrangler
Track Artifact and Experiment Lineage with Amazon SageMaker
Understand Lineage-Tracking Concepts
Show Lineage of a Feature Engineering Job
Understand the SageMaker Experiments API
Ingest and Transform Data with AWS Glue DataBrew
Summary

Understand the SageMaker Infrastructure
Introduction to SageMaker Containers
Increase Availability with Compute and Network Isolation
Deploy a Pre-Trained BERT Model with SageMaker JumpStart
Develop a SageMaker Model
Built-in Algorithms
Bring Your Own Script
Bring Your Own Container
A Brief History of Natural Language Processing
BERT Transformer Architecture
Training BERT from Scratch
Masked Language Model
Next Sentence Prediction
Fine Tune a Pre-Trained BERT Model
Create the Training Script
Setup the Train, Validation, and Test Dataset Splits
Set Up the Custom Classifier Model
Train and Validate the Model
Save the Model
Launch the Training Script from a SageMaker Notebook
Define the Metrics to Capture and Monitor
Configure the Hyper-Parameters for Our Algorithm
Select Instance Type and Instance Count
Putting It All Together in the Notebook
Download and Inspect Our Trained Model from S3
Show Experiment Lineage for Our SageMaker Training Job
Show Artifact Lineage for Our SageMaker Training Job
Evaluate Models
Run Some Ad Hoc Predictions from the Notebook
Analyze Our Classifier with a Confusion Matrix
Visualize Our Neural Network with TensorBoard
Monitor Metrics with SageMaker Studio
Monitor Metrics with CloudWatch Metrics
Debug and Profile Model Training with SageMaker Debugger
Detect and Resolve Issues with SageMaker Debugger Rules and Actions
Profile Training Jobs
Interpret and Explain Model Predictions
Detect Model Bias and Explain Predictions
Detect Bias with a SageMaker Clarify Processing Job
Feature Attribution and Importance with SageMaker Clarify and SHAP
More Training Options for BERT
Convert TensorFlow BERT Model to PyTorch
Train PyTorch BERT Models with SageMaker
Train Apache MXNet BERT Models with SageMaker
Train BERT Models with PyTorch and AWS Deep Java Library
Reduce Cost and Increase Performance
Use Small Notebook Instances
Test Model-Training Scripts Locally in the Notebook
Profile Training Jobs with SageMaker Debugger
Start with a Pre-Trained Model
Use 16-Bit Half Precision and bfloat16
Mixed 32-Bit Full and 16-Bit Half Precision
Quantization
Use Training-Optimized Hardware
Spot Instances and Checkpoints
Early Stopping Rule in SageMaker Debugger
Summary

Automatically Find the Best Model Hyper-Parameters
Set Up the Hyper-Parameter Ranges
Run the Hyper-Parameter Tuning Job
Analyze the Best Hyper-Parameters from the Tuning Job
Show Experiment Lineage for Our SageMaker Tuning Job
Use Warm Start for Additional SageMaker Hyper-Parameter Tuning Jobs
Run HPT Job Using Warm Start
Analyze the Best Hyper-Parameters from the Warm-Start Tuning Job
Scale Out with SageMaker Distributed Training
Choose a Distributed-Communication Strategy
Choose a Parallelism Strategy
Choose a Distributed File System
Launch the Distributed Training Job
Reduce Cost and Increase Performance
Start with Reasonable Hyper-Parameter Ranges
Shard the Data with ShardedByS3Key
Stream Data on the Fly with Pipe Mode
Enable Enhanced Networking
Summary

Choose Real-Time or Batch Predictions
Real-Time Predictions with SageMaker Endpoints
Deploy Model Using SageMaker Python SDK
Track Model Deployment in Our Experiment
Analyze the Experiment Lineage of a Deployed Model
Invoke Predictions Using the SageMaker Python SDK
Invoke Predictions Using HTTP POST
Create Inference Pipelines
Invoke SageMaker Models from SQL and Graph-Based Queries
Auto-Scale SageMaker Endpoints Using Amazon CloudWatch
Define a Scaling Policy with AWS-Provided Metrics
Define a Scaling Policy with a Custom Metric
Tuning Responsiveness Using a Cooldown Period
Auto-Scale Policies
Strategies to Deploy New and Updated Models
Split Traffic for Canary Rollouts
Shift Traffic for Blue/Green Deployments
Testing and Comparing New Models
Perform A/B Tests to Compare Model Variants
Reinforcement Learning with Multiarmed Bandit Testing
Monitor Model Performance and Detect Drift
Enable Data Capture
Understand Baselines and Drift
Monitor Data Quality of Deployed SageMaker Endpoints
Create a Baseline to Measure Data Quality
Schedule Data-Quality Monitoring Jobs
Inspect Data-Quality Results
Monitor Model Quality of Deployed SageMaker Endpoints
Create a Baseline to Measure Model Quality
Schedule Model-Quality Monitoring Jobs
Inspect Model-Quality Monitoring Results
Monitor Bias Drift of Deployed SageMaker Endpoints
Create a Baseline to Detect Bias
Schedule Bias-Drift Monitoring Jobs
Inspect Bias-Drift Monitoring Results
Monitor Feature Attribution Drift of Deployed SageMaker Endpoints
Create a Baseline to Monitor Feature Attribution
Schedule Feature Attribution Drift Monitoring Jobs
Inspect Feature Attribution Drift Monitoring Results
Perform Batch Predictions with SageMaker Batch Transform
Select an Instance Type
Set Up the Input Data
Tune the SageMaker Batch Transform Configuration
Prepare the SageMaker Batch Transform Job
Run the SageMaker Batch Transform Job
Review the Batch Predictions
AWS Lambda Functions and Amazon API Gateway
Optimize and Manage Models at the Edge
Deploy a PyTorch Model with TorchServe
TensorFlow-BERT Inference with AWS Deep Java Library
Reduce Cost and Increase Performance
Delete Unused Endpoints and Scale In Underutilized Clusters
Deploy Multiple Models in One Container
Attach a GPU-Based Elastic Inference Accelerator
Optimize a Trained Model with SageMaker Neo and TensorFlow Lite
Use Inference-Optimized Hardware
Summary

Machine Learning Operations
Software Pipelines
Machine Learning Pipelines
Components of Effective Machine Learning Pipelines
Steps of an Effective Machine Learning Pipeline
Pipeline Orchestration with SageMaker Pipelines
Create an Experiment to Track Our Pipeline Lineage
Define Our Pipeline Steps
Configure the Pipeline Parameters
Create the Pipeline
Start the Pipeline with the Python SDK
Start the Pipeline with the SageMaker Studio UI
Approve the Model for Staging and Production
Review the Pipeline Artifact Lineage
Review the Pipeline Experiment Lineage
Automation with SageMaker Pipelines
GitOps Trigger When Committing Code
S3 Trigger When New Data Arrives
Time-Based Schedule Trigger
Statistical Drift Trigger
More Pipeline Options
AWS Step Functions and the Data Science SDK
Kubeflow Pipelines
Apache Airflow
MLflow
TensorFlow Extended
Human-in-the-Loop Workflows
Improving Model Accuracy with Amazon A2I
Active-Learning Feedback Loops with SageMaker Ground Truth
Reduce Cost and Improve Performance
Cache Pipeline Steps
Use Less-Expensive Spot Instances
Summary

Online Learning Versus Offline Learning
Streaming Applications
Windowed Queries on Streaming Data
Stagger Windows
Tumbling Windows
Sliding Windows
Streaming Analytics and Machine Learning on AWS
Classify Real-Time Product Reviews with Amazon Kinesis, AWS Lambda, and Amazon SageMaker
Implement Streaming Data Ingest Using Amazon Kinesis Data Firehose
Create Lambda Function to Invoke SageMaker Endpoint
Create the Kinesis Data Firehose Delivery Stream
Put Messages on the Stream
Summarize Real-Time Product Reviews with Streaming Analytics
Setting Up Amazon Kinesis Data Analytics
Create a Kinesis Data Stream to Deliver Data to a Custom Application
Create AWS Lambda Function to Send Notifications via Amazon SNS
Create AWS Lambda Function to Publish Metrics to Amazon CloudWatch
Transform Streaming Data in Kinesis Data Analytics
Understand In-Application Streams and Pumps
Amazon Kinesis Data Analytics Applications
Calculate Average Star Rating
Detect Anomalies in Streaming Data
Calculate Approximate Counts of Streaming Data
Create Kinesis Data Analytics Application
Start the Kinesis Data Analytics Application
Put Messages on the Stream
Classify Product Reviews with Apache Kafka, AWS Lambda, and Amazon SageMaker
Reduce Cost and Improve Performance
Aggregate Messages
Consider Kinesis Firehose Versus Kinesis Data Streams
Enable Enhanced Fan-Out for Kinesis Data Streams
Summary

Shared Responsibility Model Between AWS and Customers
Applying AWS Identity and Access Management
IAM Users
IAM Policies
IAM User Roles
IAM Service Roles
Specifying Condition Keys for IAM Roles
Enable Multifactor Authentication
Least Privilege Access with IAM Roles and Policies
Resource-Based IAM Policies
Identity-Based IAM Policies
Isolating Compute and Network Environments
Virtual Private Cloud
VPC Endpoints and PrivateLink
Limiting Athena APIs with a VPC Endpoint Policy
Securing Amazon S3 Data Access
Require a VPC Endpoint with an S3 Bucket Policy
Limit S3 APIs for an S3 Bucket with a VPC Endpoint Policy
Restrict S3 Bucket Access to a Specific VPC with an S3 Bucket Policy
Limit S3 APIs with an S3 Bucket Policy
Restrict S3 Data Access Using IAM Role Policies
Restrict S3 Bucket Access to a Specific VPC with an IAM Role Policy
Restrict S3 Data Access Using S3 Access Points
Encryption at Rest
Create an AWS KMS Key
Encrypt the Amazon EBS Volumes During Training
Encrypt the Uploaded Model in S3 After Training
Store Encryption Keys with AWS KMS
Enforce S3 Encryption for Uploaded S3 Objects
Enforce Encryption at Rest for SageMaker Jobs
Enforce Encryption at Rest for SageMaker Notebooks
Enforce Encryption at Rest for SageMaker Studio
Encryption in Transit
Post-Quantum TLS Encryption in Transit with KMS
Encrypt Traffic Between Training-Cluster Containers
Enforce Inter-Container Encryption for SageMaker Jobs
Securing SageMaker Notebook Instances
Deny Root Access Inside SageMaker Notebooks
Disable Internet Access for SageMaker Notebooks
Securing SageMaker Studio
Require a VPC for SageMaker Studio
SageMaker Studio Authentication
Securing SageMaker Jobs and Models
Require a VPC for SageMaker Jobs
Require Network Isolation for SageMaker Jobs
Securing AWS Lake Formation
Securing Database Credentials with AWS Secrets Manager
Governance
Secure Multiaccount AWS Environments with AWS Control Tower
Manage Accounts with AWS Organizations
Enforce Account-Level Permissions with SCPs
Implement Multiaccount Model Deployments
Auditability
Tag Resources
Log Activities and Collect Events
Track User Activity and API Calls
Reduce Cost and Improve Performance
Limit Instance Types to Control Cost
Quarantine or Delete Untagged Resources
Use S3 Bucket KMS Keys to Reduce Cost and Increase Performance
Summary

Content preview from Data Science on AWS

Chapter 4. Ingest Data into the Cloud

In this chapter, we will show how to ingest data into the cloud. For that purpose, we will look at a typical scenario in which an application writes files into an Amazon S3 data lake, which in turn needs to be accessed by the ML engineering/data science team as well as the business intelligence/data analyst team, as shown in Figure 4-1.

Amazon Simple Storage Service (Amazon S3) is fully managed object storage that offers high durability, high availability, and virtually unlimited scalability at low cost. This makes it a strong foundation for data lakes, training datasets, and models. We will learn more about the advantages of building data lakes on Amazon S3 in the next section.

Let’s assume our application continually captures data (e.g., customer interactions on our website and product review messages) and writes the data to S3 in the tab-separated values (TSV) file format.
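
To make the scenario concrete, here is a minimal sketch of how an application might serialize captured records as TSV and write them to S3. The column names, bucket, and key are hypothetical placeholders chosen for illustration, not values from the book. The upload uses the boto3 `put_object` call and assumes AWS credentials are already configured.

```python
import csv
import io

# Hypothetical column names, chosen for illustration only.
COLUMNS = ["review_id", "star_rating", "review_body"]


def to_tsv(records):
    """Serialize a list of dicts into TSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=COLUMNS,
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def upload_tsv(records, bucket, key):
    """Write the records to S3 as a single TSV object.

    The bucket and key are supplied by the caller; boto3 is imported
    here so the serialization above can be used without it installed.
    """
    import boto3

    s3 = boto3.client("s3")
    s3.put_object(Bucket=bucket, Key=key, Body=to_tsv(records).encode("utf-8"))


if __name__ == "__main__":
    sample = [{"review_id": "R1", "star_rating": 5, "review_body": "Great product"}]
    print(to_tsv(sample), end="")
```

Writing each batch as a separate object under a common prefix is one reasonable layout, since a query service can then read every file under that prefix as one dataset.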

As data scientists or machine learning engineers, we want to quickly explore raw datasets. We will introduce Amazon Athena and show how to use it as an interactive query service to analyze data in S3 using standard SQL, without moving the data. In the first step, we will register the TSV data in our S3 bucket with ...
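
The preview is cut off at this point, so the following is only a rough sketch of how TSV files in S3 might be registered as an Athena table and then queried in place. The database, table, column names, and S3 locations are hypothetical. The DDL uses Athena's `CREATE EXTERNAL TABLE` syntax for delimited text, and the query is submitted with the boto3 `start_query_execution` call.

```python
def build_create_table_ddl(database, table, s3_location):
    """Build DDL that registers tab-separated files in S3 as an Athena table.

    The three columns are hypothetical and must match the real file layout.
    """
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        "  review_id string,\n"
        "  star_rating int,\n"
        "  review_body string\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'\n"
        "LINES TERMINATED BY '\\n'\n"
        f"LOCATION '{s3_location}'\n"
        # Skip the header row that each TSV file is assumed to contain.
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


def run_athena_query(sql, database, output_location):
    """Submit a SQL statement to Athena and return its query execution ID.

    Athena runs queries asynchronously, so the caller still has to poll
    for completion before reading results from output_location.
    """
    import boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


if __name__ == "__main__":
    ddl = build_create_table_ddl(
        "example_db", "reviews_tsv", "s3://example-bucket/reviews/tsv/"
    )
    print(ddl)
```

Because the table is external, the DDL only records where the files live and how to parse them. The data itself stays in S3.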

Publisher Resources

ISBN: 9781492079385

Figure 4-1. An application writes data into our S3 data lake for the data science, machine learning engineering, and business intelligence teams.