AWS Certified Data Engineer Associate Study Guide

by Sakti Mishra, Dylan Qu, Anusha Challa

August 2025

Intermediate to advanced

476 pages

12h 52m

English

O'Reilly Media, Inc.

Preface
What This Book Isn’t; What This Book Is About; Who Should Read This Book; How This Book Is Organized; Accessing the Book’s Images Online; Conventions Used in This Book; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. Certification Essentials
Who Is a Data Engineer?; Becoming an AWS Data Engineer Associate; Exam Topics; Exam Format; Registering for the Exam; Exam-Style Questions; Think Like an AWS Solutions Architect: Translating a Real-World Problem-Solving Framework into Certification; The Solutions Architect’s Problem-Solving Framework; Real-World Example: Designing a Serverless Stream Analytics Platform to Detect Fraud; How This Thought Process Applies to Certification Questions; Study Plan; Conclusion
2. Prerequisite Knowledge for Aspiring Data Engineers
Databases and Types of Databases; What Is a Database?; What Is a Database Management System?; Types of Databases; Hierarchical Databases; Relational Databases; NoSQL Databases; OLTP Versus OLAP; Overview of Big Data; Distributed Processing Frameworks for Big Data; MapReduce; Spark; Flink; Hive; Presto; Trino; What Is a Data Lake?; What Is a Data Warehouse?; Data Warehouse Versus Data Lake; ETL Versus ELT; Different Ways to Process Data; Batch Processing Pipeline; Real-Time Stream Processing; Event-Driven Processing; High-Level Architecture Overview of Data Processing Pipelines; Working with Code Repositories; What Is a Code Repository?; How to Work with Code Repositories; CI/CD; Cloud Computing and AWS; What Is Cloud Computing?; An Overview of Amazon Web Services; Getting Started with AWS; How to Set Up an AWS Account; Configure Access with AWS IAM; Create an IAM User for Authentication; Add Permissions to Authorize the User; What Is an IAM Policy?; What Is an IAM Role?; Best Practices to Follow with AWS IAM; Conclusion; Resources
3. Overview of AWS Analytics and Auxiliary Services
AWS Analytics Services; Amazon Kinesis Data Streams; Amazon Data Firehose; Amazon Managed Service for Apache Flink; Amazon Managed Streaming for Apache Kafka; Reference Architecture: Streaming Analytics Pattern with Apache Flink and MSK; AWS Glue; AWS Glue DataBrew; Amazon Athena; Amazon EMR; Amazon Redshift; Amazon QuickSight; Reference Architecture: Lakehouse with Glue, Redshift, and Athena; Amazon OpenSearch Service; Amazon DataZone; AWS Lake Formation; Auxiliary Services for Analytics; Application Integration; Compute and Containers; Database; Storage; Machine Learning; Migration and Transfer; Networking and Content Delivery; Security, Identity, and Compliance; Management Governance; Developer Tools; Cloud Financial Management; AWS Well-Architected Tool; Conclusion; Additional Resources
4. Data Ingestion and Transformation
Data Ingestion; Real-Time Streaming Data Ingestion; Kinesis Data Streams Versus Amazon MSK; Sample Streaming Ingestion Use Cases; Ingesting Data Using Zero-ETL Integrations; Ingesting Data from Databases with CDC Using AWS Data Migration Service; Supported Sources for AWS DMS; Supported Targets for AWS DMS; Sample Use Cases; Best Practices for Data Ingestion; Best Practices for Streaming Ingestion; Best Practices for Choosing Data Stream Capacity Mode; Best Practices for Sharding; Best Practices for Consuming Data from KDS; Best Practices for Amazon MSK; Best Practices for Amazon Data Firehose; Best Practices for AWS DMS Replication Instances and Tasks; Best Practices for AWS DMS Tasks with Amazon Redshift Target; Data Transformation; Batch Data Transformation; Streaming Data Transformation; Data Transformation Using AWS Glue; Glue Connectors; Glue Bookmarks; Data Processing Units; Worker Type; Glue Jobs; Data Sources and Destinations; Best Practices for AWS Glue; Data Transformation Using Amazon EMR; Storage; Deployment Options; Instance Types; Best Practices for Amazon EMR; AWS Glue Versus Amazon EMR Options; SQL-Based Data Transformation Using Amazon Redshift; Amazon Redshift Compute; Amazon Redshift Storage; SQL Data Transformations; Amazon Managed Service for Apache Flink; Amazon Data Firehose for Transformation; AWS Lambda for Transformation; Choosing the Right Streaming Transformation Service; Choosing the Right Batch Transformation Service; Data Preparation for Nontechnical Personas; Fill Missing Values; Identify Duplicate Records; Formatting Functions; Integrating Data from Multiple Sources; Nesting and Unnesting Data Structures; Protecting Sensitive Data; Other Data Preparation Transformations; Orchestrating Data Pipelines; AWS Step Functions; Managed Workflows for Apache Airflow; Sample Use Case; AWS Glue Workflows; Sample Use Case; Amazon Redshift Scheduler; Amazon EventBridge; Sample Use Case; Choosing the Right Orchestration Service; Conclusion; Practice Questions; Additional Resources
5. Data Store Management
Choosing a Data Store; AWS Core Storage Services; AWS Cloud Databases; Data Storage Formats for Data Lakes; Row-Based File Formats; Column-Based File Formats; Table Formats; Building a Data Strategy with Multiple Data Stores; Data Cataloging Systems; Components of Metadata and Data Catalogs; Populating an AWS Glue Data Catalog; Data Catalog Best Practices; Enriching Data Catalogs with Data Classification; Managing the Lifecycle of Data; Selecting Storage Solutions for Hot and Cold Data; Example: Building a Petabyte-Scale Log Analytics Solution on AWS; Storage Tier Decisions for Different Access Patterns; Defining Data Retention Policy and Archiving Strategies; Performing COPY and UNLOAD Operations to Move Data Between Amazon S3 and Amazon Redshift; Optimizing Data Management with Amazon S3; Overview of S3 Storage Classes; Choosing the Right Storage Class; S3 Intelligent-Tiering; Managing the Data Lifecycle with Amazon S3 Lifecycle; Monitoring the Amazon S3 Data Lifecycle; Expiring Snapshots from Open Table Formats; Archiving Data from Amazon DynamoDB to Amazon S3; Ensuring S3 Data Resiliency with S3 Versioning; Enabling Versioning on an S3 Bucket; S3 Versioning and Object Lifecycle Management; Designing Data Models and Schema; Introduction to Data Modeling; Data Modeling Strategies for Amazon Redshift; Data Modeling Strategies for Amazon DynamoDB; Data Modeling Strategies for Data Lakes; Amazon S3 Data Lake Best Practices; Conclusion; Practice Questions; Additional Resources
6. Data Operations and Support
Amazon QuickSight; Data Sources; Datasets; Refreshing SPICE Datasets; Visualizations; Presentation Formats; QuickSight GenBI Capabilities (QuickSight Q); SQL Analytics Using Amazon Athena; Choice of Querying Engine; Workgroups; Capacity Reservations; Athena Federated SQL; Use Cases; DDL Capabilities; Best Practices When Using Amazon Athena; SQL Analytics Using Amazon Redshift; SQL Functions; Semi-Structured Data Analysis; Geospatial Data Analysis; Query Data from Data Lake; Analyzing Data from Operational Data Stores Using Amazon Redshift; Redshift ML and Generative AI; User-Defined Functions; Analyzing Data Using Notebooks; AWS Glue Interactive Sessions; Amazon EMR Notebooks; Data Pipeline Resiliency; Monitoring; Alerting; Event-Driven Pipeline Maintenance with EventBridge; Ensuring Data Quality and Reliability: Deequ and DQDL; Automated Data Quality Checks and Error Handling; Troubleshooting and Performance Tuning; CI/CD Pipelines; Version Control and Collaboration; Infrastructure as Code; Disaster Recovery and High Availability; Cost Optimization for Data Pipelines; Leveraging Serverless Services; Autoscaling; Tiered Storage; Columnar Formats; Monitor and Control Data Transfer Costs; Follow Cost Optimization Best Practices; Conclusion; Practice Questions; Additional Resources
7. Data Security and Governance
Network Security; Amazon VPC Overview; Security Groups Overview; Best Practices for Configuring Security Groups for Your Workloads; Configuring a VPC and Security Group for an Amazon EMR Cluster; Managed Services Versus Unmanaged Services; VPC Endpoints Overview; User Authentication and Authorization; Authenticating Users with IAM Credentials; IAM Role-Based Authentication and Authorization; Service-Linked Roles; Managed Versus Self-Managed Policies; Enable Single Sign-on with AWS IAM Identity Center; Data Security and Privacy; Secure Data in Amazon S3; Manage Database Credentials; Data Encryption and Decryption and Managing the Encryption Keys; Managing Encryption Keys with AWS KMS; Enabling Encryption in AWS Analytics Services; Sensitive Data Detection and Redaction; Fine-Grained Access Control with AWS Lake Formation; Database Security in Amazon Redshift; Fine-Grained Access Control in Amazon QuickSight; Data Governance; Metadata Management and Technical Catalog; Data Sharing; Data Quality; Data Profiling; Data Lifecycle Management; Data Lineage; Logging and Auditing; Analyzing Logs Using AWS Services; Conclusion; Practice Questions; Additional Resources
8. Implementing Batch and Streaming Pipelines
Data Processing Pipeline; Implementing a Batch Processing Pipeline; Use Case and Architecture Overview; Overview of Input Dataset; Step-by-Step Implementation Guide; Best Practices and Optimization Techniques; Implementing a Real-Time Streaming Pipeline; Use Case and Architecture Overview; Step-by-Step Implementation Guide; Conclusion; Resources
9. Practice Exam

10. What’s New in AWS for Data Engineers
Amazon SageMaker Unified Studio; Amazon SageMaker Catalog; Amazon SageMaker Lakehouse; Amazon SageMaker AI; Amazon S3 Tables; Amazon S3 Metadata; Improving the Developer Experience with Generative AI; Generative AI–Powered Code Generation with Amazon Q Developer; Automated Script Upgrade in AWS Glue; GenAI-Powered Troubleshooting for Spark in AWS Glue; Conclusion; Resources
Appendix. Solutions to the Practice Questions
Chapter 4; Chapter 5; Chapter 6; Chapter 7; Chapter 9
Index
About the Authors

Content preview from AWS Certified Data Engineer Associate Study Guide

Chapter 6. Data Operations and Support

In the evolving world of data-driven decision making, the ability to manage, monitor, and optimize data processing pipelines effectively is crucial for organizations seeking to unlock the full potential of their data assets. As a data engineer, you play a pivotal role in ensuring the reliability, performance, and cost-effectiveness of these pipelines, which power the critical analytics and business intelligence initiatives within your organization.

This chapter will explore the key aspects of data operations and support, equipping you with the knowledge and skills required to automate data processing, analyze data, maintain and monitor data pipelines, and ensure data quality. By mastering these techniques, you will become a valuable asset in your organization’s data-driven journey, enabling seamless data operations and supporting the delivery of actionable insights.

This chapter will help you learn how to do the following:

Analyze data using a variety of AWS services, including Amazon QuickSight, Amazon Athena, and Amazon Redshift.
Monitor data pipelines by deploying comprehensive logging and monitoring solutions, leveraging tools like Amazon CloudWatch, AWS CloudTrail, Amazon Macie, and system tables for specific services.
Apply best practices for performance tuning and troubleshooting data processing pipelines.
Build robust data pipelines to achieve your recovery point objective (RPO) and recovery time objective (RTO) in case ...

Publisher Resources

ISBN: 9781098170066
