Data Engineering with AWS

Name: Data Engineering with AWS
Author: Gareth Eagar
ISBN: 9781800560413

December 2021

Beginner to intermediate

482 pages

11h 27m

English

Packt Publishing

Data Engineering with AWS
Contributors
About the author
Additional contributors
About the reviewers
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Share Your Thoughts
Section 1: AWS Data Engineering Concepts and Trends
Chapter 1: An Introduction to Data Engineering
Technical requirements
The rise of big data as a corporate asset
The challenges of ever-growing datasets
Data engineers – the big data enablers
Understanding the role of the data engineer
Understanding the role of the data scientist
Understanding the role of the data analyst
Understanding other common data-related roles
The benefits of the cloud when building big data analytic solutions
Hands-on – creating and accessing your AWS account
Creating a new AWS account
Accessing your AWS account
Summary
Chapter 2: Data Management Architectures for Analytics
Technical requirements
The evolution of data management for analytics
Databases and data warehouses
Dealing with big, unstructured data
A lake on the cloud and a house on that lake
Understanding data warehouses and data marts – fountains of truth
Distributed storage and massively parallel processing
Columnar data storage and efficient data compression
Dimensional modeling in data warehouses
Understanding the role of data marts
Feeding data into the warehouse – ETL and ELT pipelines
Building data lakes to tame the variety and volume of big data
Data lake logical architecture
Bringing together the best of both worlds with the lake house architecture
Data lakehouse implementations
Building a data lakehouse on AWS
Hands-on – configuring the AWS Command Line Interface tool and creating an S3 bucket
Installing and configuring the AWS CLI
Creating a new Amazon S3 bucket
Summary
Chapter 3: The AWS Data Engineer's Toolkit
Technical requirements
AWS services for ingesting data
Overview of Amazon Database Migration Service (DMS)
Overview of Amazon Kinesis for streaming data ingestion
Overview of Amazon MSK for streaming data ingestion
Overview of Amazon AppFlow for ingesting data from SaaS services
Overview of Amazon Transfer Family for ingestion using FTP/SFTP protocols
Overview of Amazon DataSync for ingesting from on-premises storage
Overview of the AWS Snow family of devices for large data transfers
AWS services for transforming data
Overview of AWS Lambda for light transformations
Overview of AWS Glue for serverless Spark processing
Overview of Amazon EMR for Hadoop ecosystem processing
AWS services for orchestrating big data pipelines
Overview of AWS Glue workflows for orchestrating Glue components
Overview of AWS Step Functions for complex workflows
Overview of Amazon managed workflows for Apache Airflow
AWS services for consuming data
Overview of Amazon Athena for SQL queries in the data lake
Overview of Amazon Redshift and Redshift Spectrum for data warehousing and data lakehouse architectures
Overview of Amazon QuickSight for visualizing data
Hands-on – triggering an AWS Lambda function when a new file arrives in an S3 bucket
Creating a Lambda layer containing the AWS Data Wrangler library
Creating new Amazon S3 buckets
Creating an IAM policy and role for your Lambda function
Creating a Lambda function
Configuring our Lambda function to be triggered by an S3 upload
Summary
Chapter 4: Data Cataloging, Security, and Governance
Technical requirements
Getting data security and governance right
Common data regulatory requirements
Core data protection concepts
Personal data
Encryption
Anonymized data
Pseudonymized data/tokenization
Authentication
Authorization
Putting these concepts together
Cataloging your data to avoid the data swamp
How to avoid the data swamp
The AWS Glue/Lake Formation data catalog
AWS services for data encryption and security monitoring
AWS Key Management Service (KMS)
Amazon Macie
Amazon GuardDuty
AWS services for managing identity and permissions
AWS Identity and Access Management (IAM) service
Using AWS Lake Formation to manage data lake access
Hands-on – configuring Lake Formation permissions
Creating a new user with IAM permissions
Transitioning to managing fine-grained permissions with AWS Lake Formation
Summary
Section 2: Architecting and Implementing Data Lakes and Data Lake Houses
Chapter 5: Architecting Data Engineering Pipelines
Technical requirements
Approaching the data pipeline architecture
Architecting houses and architecting pipelines
Whiteboarding as an information-gathering tool
Conducting a whiteboarding session
Identifying data consumers and understanding their requirements
Identifying data sources and ingesting data
Identifying data transformations and optimizations
File format optimizations
Data standardization
Data quality checks
Data partitioning
Data denormalization
Data cataloging
Whiteboarding data transformation
Loading data into data marts
Wrapping up the whiteboarding session
Hands-on – architecting a sample pipeline
Detailed notes from the project "Bright Light" whiteboarding meeting of GP Widgets, Inc
Summary
Chapter 6: Ingesting Batch and Streaming Data
Technical requirements
Understanding data sources
Data variety
Data volume
Data velocity
Data veracity
Data value
Questions to ask
Ingesting data from a relational database
AWS Database Migration Service (DMS)
AWS Glue
Other ways to ingest data from a database
Deciding on the best approach for ingesting from a database
Ingesting streaming data
Amazon Kinesis versus Amazon Managed Streaming for Kafka (MSK)
Hands-on – ingesting data with AWS DMS
Creating a new MySQL database instance
Loading the demo data using an Amazon EC2 instance
Creating an IAM policy and role for DMS
Configuring DMS settings and performing a full load from MySQL to S3
Querying data with Amazon Athena
Hands-on – ingesting streaming data
Configuring Kinesis Data Firehose for streaming delivery to Amazon S3
Configuring Amazon Kinesis Data Generator (KDG)
Adding newly ingested data to the Glue Data Catalog
Querying the data with Amazon Athena
Summary

Chapter 7: Transforming Data to Optimize for Analytics
Technical requirements
Transformations – making raw data more valuable
Cooking, baking, and data transformations
Transformations as part of a pipeline
Types of data transformation tools
Apache Spark
Hadoop and MapReduce
SQL
GUI-based tools
Data preparation transformations
Protecting PII data
Optimizing the file format
Optimizing with data partitioning
Data cleansing
Business use case transforms
Data denormalization
Enriching data
Pre-aggregating data
Extracting metadata from unstructured data
Working with change data capture (CDC) data
Traditional approaches – data upserts and SQL views
Modern approaches – the transactional data lake
Hands-on – joining datasets with AWS Glue Studio
Creating a new data lake zone – the curated zone
Creating a new IAM role for the Glue job
Configuring a denormalization transform using AWS Glue Studio
Finalizing the denormalization transform job to write to S3
Create a transform job to join streaming and film data using AWS Glue Studio
Summary
Chapter 8: Identifying and Enabling Data Consumers
Technical requirements
Understanding the impact of data democratization
A growing variety of data consumers
Meeting the needs of business users with data visualization
AWS tools for business users
Meeting the needs of data analysts with structured reporting
AWS tools for data analysts
Meeting the needs of data scientists and ML models
AWS tools used by data scientists to work with data
Hands-on – creating data transformations with AWS Glue DataBrew
Configuring new datasets for AWS Glue DataBrew
Creating a new Glue DataBrew project
Building your Glue DataBrew recipe
Creating a Glue DataBrew job
Summary
Chapter 9: Loading Data into a Data Mart
Technical requirements
Extending analytics with data warehouses/data marts
Cold data
Warm data
Hot data
What not to do – anti-patterns for a data warehouse
Using a data warehouse as a transactional datastore
Using a data warehouse as a data lake
Using data warehouses for real-time, record-level use cases
Storing unstructured data
Redshift architecture review and storage deep dive
Data distribution across slices
Redshift Zone Maps and sorting data
Designing a high-performance data warehouse
Selecting the optimal Redshift node type
Selecting the optimal table distribution style and sort key
Selecting the right data type for columns
Selecting the optimal table type
Moving data between a data lake and Redshift
Optimizing data ingestion in Redshift
Exporting data from Redshift to the data lake
Hands-on – loading data into an Amazon Redshift cluster and running queries
Uploading our sample data to Amazon S3
IAM roles for Redshift
Creating a Redshift cluster
Creating external tables for querying data in S3
Creating a schema for a local Redshift table
Running complex SQL queries against our data
Summary
Chapter 10: Orchestrating the Data Pipeline
Technical requirements
Understanding the core concepts for pipeline orchestration
What is a data pipeline, and how do you orchestrate it?
How do you trigger a data pipeline to run?
How do you handle the failures of a step in your pipeline?
Examining the options for orchestrating pipelines in AWS
AWS Data Pipeline for managing ETL between data sources
AWS Glue Workflows to orchestrate Glue resources
Apache Airflow as an open source orchestration solution
Pros and cons of using MWAA
AWS Step Function for a serverless orchestration solution
Pros and cons of using AWS Step Function
Deciding on which data pipeline orchestration tool to use
Hands-on – orchestrating a data pipeline using AWS Step Function
Creating new Lambda functions
Creating an SNS topic and subscribing to an email address
Creating a new Step Function state machine
Configuring AWS CloudTrail and Amazon EventBridge
Summary
Section 3: The Bigger Picture: Data Analytics, Data Visualization, and Machine Learning
Chapter 11: Ad Hoc Queries with Amazon Athena
Technical requirements
Amazon Athena – in-place SQL analytics for the data lake
Tips and tricks to optimize Amazon Athena queries
Common file format and layout optimizations
Writing optimized SQL queries
Federating the queries of external data sources with Amazon Athena Query Federation
Querying external data sources using Athena Federated Query
Managing governance and costs with Amazon Athena Workgroups
Athena Workgroups overview
Enforcing settings for groups of users
Enforcing data usage controls
Hands-on – creating an Amazon Athena workgroup and configuring Athena settings
Hands-on – switching Workgroups and running queries
Summary
Chapter 12: Visualizing Data with Amazon QuickSight
Technical requirements
Representing data visually for maximum impact
Benefits of data visualization
Popular uses of data visualizations
Understanding Amazon QuickSight's core concepts
Standard versus enterprise edition
SPICE – the in-memory storage and computation engine for QuickSight
Ingesting and preparing data from a variety of sources
Preparing datasets in QuickSight versus performing ETL outside of QuickSight
Creating and sharing visuals with QuickSight analyses and dashboards
Visual types in Amazon QuickSight
Understanding QuickSight's advanced features – ML Insights and embedded dashboards
Amazon QuickSight ML Insights
Amazon QuickSight embedded dashboards
Hands-on – creating a simple QuickSight visualization
Setting up a new QuickSight account and loading a dataset
Creating a new analysis
Summary
Chapter 13: Enabling Artificial Intelligence and Machine Learning
Technical requirements
Understanding the value of ML and AI for organizations
Specialized ML projects
Everyday use cases for ML and AI
Exploring AWS services for ML
AWS ML services
Exploring AWS services for AI
AI for unstructured speech and text
AI for extracting metadata from images and video
AI for ML-powered forecasts
AI for fraud detection and personalization
Hands-on – reviewing reviews with Amazon Comprehend
Setting up a new Amazon SQS message queue
Creating a Lambda function for calling Amazon Comprehend
Adding Comprehend permissions for our IAM role
Adding a Lambda function as a trigger for our SQS message queue
Testing the solution with Amazon Comprehend
Summary
Further reading
Chapter 14: Wrapping Up the First Part of Your Learning Journey
Technical requirements
Looking at the data analytics big picture
Managing complex data environments with DataOps
Examining examples of real-world data pipelines
A decade of data wrapped up for Spotify users
Ingesting and processing streaming files at Netflix scale
Imagining the future – a look at emerging trends
ACID transactions directly on data lake data
More data and more streaming ingestion
Multi-cloud
Decentralized data engineering teams, data platforms, and a data mesh architecture
Data and product thinking convergence
Data and self-serve platform design convergence
Implementations of the data mesh architecture
Hands-on – cleaning up your AWS account
Reviewing AWS Billing to identify the resources being charged for
Closing your AWS account
Summary
Why subscribe?
Other Books You May Enjoy
Packt is searching for authors like you
Share Your Thoughts

Content preview from Data Engineering with AWS

Chapter 10: Orchestrating the Data Pipeline

Throughout this book, we have been discussing various services that data engineers can use to ingest and transform data, and to make it available to consumers. We looked at how we could ingest data via Amazon Kinesis Data Firehose and AWS Database Migration Service, and how we could run AWS Lambda functions and AWS Glue jobs to transform our data. We also discussed the importance of updating a data catalog as new datasets are added to a data lake, and how we can load subsets of data into a data mart for specific use cases.

For the hands-on exercises, we made use of various services, but for the most part, we triggered these services manually. However, in a real production environment, it ...
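
To make the contrast between manually triggering each service and orchestrating them more concrete, here is a minimal sketch in plain Python. It is not taken from the book and does not call any AWS API. The function `run_pipeline` and the steps `ingest`, `transform`, and `update_catalog` are hypothetical stand-ins for the ingestion, transformation, and cataloging stages mentioned above. The sketch only illustrates two ideas listed in the chapter outline: running the steps of a pipeline in order, and handling the failure of a step by retrying it.

```python
import time
from typing import Callable, Dict, List, Tuple

# A step is a (name, function) pair; each function takes the pipeline
# state and returns an updated copy. All names here are hypothetical.
Step = Tuple[str, Callable[[Dict], Dict]]


def run_pipeline(steps: List[Step], max_retries: int = 2,
                 delay_seconds: float = 0.0) -> Dict:
    """Run steps in order, retrying a failed step up to max_retries times."""
    state: Dict = {}
    for name, func in steps:
        for attempt in range(1, max_retries + 2):
            try:
                state = func(state)
                break  # step succeeded, move on to the next one
            except Exception as exc:
                if attempt > max_retries:
                    raise RuntimeError(
                        f"step '{name}' failed after {attempt} attempts"
                    ) from exc
                time.sleep(delay_seconds)  # wait before retrying
    return state


# Counter used to simulate a transient failure in the transform step.
calls = {"transform": 0}


def ingest(state: Dict) -> Dict:
    # Stand-in for an ingestion service delivering raw records.
    return {**state, "raw": [1, 2, 3]}


def transform(state: Dict) -> Dict:
    # Fails on the first call only, to show the retry path.
    calls["transform"] += 1
    if calls["transform"] == 1:
        raise ConnectionError("simulated transient failure")
    return {**state, "curated": [x * 10 for x in state["raw"]]}


def update_catalog(state: Dict) -> Dict:
    # Stand-in for registering the new dataset in a data catalog.
    return {**state, "cataloged": True}


result = run_pipeline([
    ("ingest", ingest),
    ("transform", transform),
    ("update_catalog", update_catalog),
])
print(result)
```

In this sketch the second attempt at `transform` succeeds, so the pipeline completes without anyone re-running the step by hand. A managed orchestration service plays the same role at production scale, with scheduling, monitoring, and notifications added.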

