Data Engineering with AWS - Second Edition

Looking to revolutionize your data transformation game with AWS? Look no further! From strong foundations to hands-on building of data engineering pipelines, our expert-led manual has got you covered.

Key Features

  • Delve into robust AWS tools for ingesting, transforming, and consuming data, and for orchestrating pipelines
  • Stay up to date with a comprehensively revised chapter on Data Governance
  • Build modern data platforms with a new section covering transactional data lakes and data mesh

Book Description

This book, authored by a seasoned Senior Data Architect with 25 years of experience, aims to help you achieve proficiency in using the AWS ecosystem for data engineering. This revised edition provides updates in every chapter to cover the latest AWS services and features, takes a refreshed look at data governance, and includes a brand-new section on building modern data platforms, which covers implementing a data mesh approach, open table formats (such as Apache Iceberg), and using DataOps for automation and observability.

You'll begin by reviewing the key concepts and essential AWS tools in a data engineer's toolkit and getting acquainted with modern data management approaches. You'll then architect a data pipeline, review raw data sources, transform the data, and learn how that transformed data is used by various data consumers. You'll learn how to ensure strong data governance, how to populate data marts and data warehouses, and how a data lakehouse fits into the picture. After that, you'll be introduced to AWS tools for analyzing data, including those for ad-hoc SQL queries and creating visualizations. Then, you'll explore how the power of machine learning and artificial intelligence can be used to draw new insights from data. In the final chapters, you'll discover transactional data lakes, data meshes, and how to build a cutting-edge data platform on AWS.

By the end of this AWS book, you'll be able to execute data engineering tasks and implement a data pipeline on AWS like a pro!

What you will learn

  • Seamlessly ingest streaming data with Amazon Kinesis Data Firehose
  • Optimize, denormalize, and join datasets with AWS Glue Studio
  • Use Amazon S3 events to trigger a Lambda process to transform a file (see the sketch after this list)
  • Load data into a Redshift data warehouse and run queries with ease
  • Visualize and explore data using Amazon QuickSight
  • Extract sentiment data from a dataset using Amazon Comprehend
  • Build transactional data lakes using Apache Iceberg with Amazon Athena
  • Learn how a data mesh approach can be implemented on AWS
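
To give a flavor of the hands-on exercises, here is a minimal sketch of the kind of S3-triggered Lambda transformation mentioned above, written in Python. The bucket name, the CSV-to-Parquet conversion, and the use of the AWS SDK for pandas (awswrangler) library via a Lambda layer are illustrative assumptions, not the book's exact code:

    import urllib.parse

    import awswrangler as wr  # assumed to be available via an "AWS SDK for pandas" Lambda layer


    def lambda_handler(event, context):
        # Each record in the S3 event describes one object uploaded to the source bucket.
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

            # Read the newly arrived CSV file from S3...
            df = wr.s3.read_csv(f"s3://{bucket}/{key}")

            # ...and write it back out in Parquet format to a hypothetical curated bucket.
            wr.s3.to_parquet(
                df=df,
                path=f"s3://my-curated-bucket/{key.rsplit('.', 1)[0]}.parquet",
            )

        return {"statusCode": 200}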

Who this book is for

This book is for data engineers, data analysts, and data architects who are new to AWS and looking to extend their skills to the AWS cloud. Anyone new to data engineering who wants to learn about the foundational concepts, while gaining practical experience with common data engineering services on AWS, will also find this book useful. A basic understanding of big data-related topics and Python coding will help you get the most out of this book, but it’s not a prerequisite. Familiarity with the AWS console and core services will also help you follow along.

Table of contents

  1. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Get in touch
  2. Section 1: AWS Data Engineering Concepts and Trends
  3. An Introduction to Data Engineering
    1. Technical requirements
    2. The rise of big data as a corporate asset
    3. The challenges of ever-growing datasets
    4. The role of the data engineer as a big data enabler
      1. Understanding the role of the data engineer
      2. Understanding the role of the data scientist
      3. Understanding the role of the data analyst
      4. Understanding other common data-related roles
    5. The benefits of the cloud when building big data analytic solutions
    6. Hands-on – creating and accessing your AWS account
      1. Creating a new AWS account
      2. Accessing your AWS account
    7. Summary
  4. Data Management Architectures for Analytics
    1. Technical requirements
    2. The evolution of data management for analytics
      1. Databases and data warehouses
      2. Dealing with big, unstructured data
      3. Cloud-based solutions for big data analytics
    3. A deeper dive into data warehouse concepts and architecture
      1. Dimensional modeling in data warehouses
      2. Understanding the role of data marts
      3. Distributed storage and massively parallel processing
      4. Columnar data storage and efficient data compression
      5. Feeding data into the warehouse – ETL and ELT pipelines
    4. An overview of data lake architecture and concepts
      1. Data lake logical architecture
        1. The storage layer and storage zones
        2. Catalog and search layers
        3. Ingestion layer
        4. The processing layer
        5. The consumption layer
        6. Data lake architecture summary
    5. Bringing together the best of data warehouses and data lakes
      1. The data lakehouse approach
        1. New data lake table formats
        2. Federated queries across database engines
    6. Hands-on – using the AWS Command Line Interface (CLI) to create Simple Storage Service (S3) buckets
      1. Accessing the AWS CLI
        1. Using AWS CloudShell to access the CLI
      2. Creating new Amazon S3 buckets
    7. Summary
  5. The AWS Data Engineer’s Toolkit
    1. Technical requirements
    2. An overview of AWS services for ingesting data
      1. AWS Database Migration Service (DMS)
      2. Amazon Kinesis for streaming data ingestion
        1. Amazon Kinesis Agent
        2. Amazon Kinesis Data Firehose
        3. Amazon Kinesis Data Streams
        4. Amazon Kinesis Data Analytics
        5. Amazon Kinesis Video Streams
      3. Amazon MSK for streaming data ingestion
      4. Amazon AppFlow for ingesting data from SaaS services
      5. AWS Transfer Family for ingestion using FTP/SFTP protocols
      6. AWS DataSync for ingesting from on-premises and multi-cloud storage services
      7. The AWS Snow family of devices for large data transfers
      8. AWS Glue for data ingestion
    3. An overview of AWS services for transforming data
      1. AWS Lambda for light transformations
      2. AWS Glue for serverless data processing
        1. Serverless ETL processing
        2. AWS Glue DataBrew
        3. AWS Glue Data Catalog
        4. AWS Glue crawlers
      3. Amazon EMR for Hadoop ecosystem processing
    4. An overview of AWS services for orchestrating big data pipelines
      1. AWS Glue workflows for orchestrating Glue components
      2. AWS Step Functions for complex workflows
      3. Amazon Managed Workflows for Apache Airflow (MWAA)
    5. An overview of AWS services for consuming data
      1. Amazon Athena for SQL queries in the data lake
      2. Amazon Redshift and Redshift Spectrum for data warehousing and data lakehouse architectures
      3. Overview of Amazon QuickSight for visualizing data
    6. Hands-on – triggering an AWS Lambda function when a new file arrives in an S3 bucket
      1. Creating a Lambda layer containing the AWS SDK for pandas library
      2. Creating an IAM policy and role for your Lambda function
      3. Creating a Lambda function
      4. Configuring our Lambda function to be triggered by an S3 upload
    7. Summary
  6. Data Governance, Security, and Cataloging
    1. Technical requirements
    2. The many different aspects of data governance
    3. Data security, access, and privacy
      1. Common data regulatory requirements
      2. Core data protection concepts
        1. Personally identifiable information (PII)
      3. Personal data
      4. Encryption
      5. Anonymized data
      6. Pseudonymized data/tokenization
      7. Authentication
      8. Authorization
      9. Putting these concepts together
    4. Data quality, data profiling, and data lineage
      1. Data quality
      2. Data profiling
      3. Data lineage
    5. Business and technical data catalogs
      1. Implementing a data catalog to avoid creating a data swamp
      2. Business data catalogs
      3. Technical data catalogs
    6. AWS services that help with data governance
      1. The AWS Glue/Lake Formation technical data catalog
      2. AWS Glue DataBrew for profiling datasets
      3. AWS Glue Data Quality
      4. AWS Key Management Service (KMS) for data encryption
      5. Amazon Macie for detecting PII data in Amazon S3 objects
      6. The AWS Glue Studio Detect PII transform for detecting PII data in datasets
      7. Amazon GuardDuty for detecting threats in an AWS account
      8. AWS Identity and Access Management (IAM) service
      9. Using AWS Lake Formation to manage data lake access
        1. Permissions management before Lake Formation
        2. Permissions management using AWS Lake Formation
    7. Hands-on – configuring Lake Formation permissions
      1. Creating a new user with IAM permissions
      2. Transitioning to managing fine-grained permissions with AWS Lake Formation
        1. Activating Lake Formation permissions for a database and table
        2. Granting Lake Formation permissions
    8. Summary
  7. Section 2: Architecting and Implementing Data Engineering Pipelines and Transformations
  8. Architecting Data Engineering Pipelines
    1. Technical requirements
    2. Approaching the data pipeline architecture
      1. Architecting houses and pipelines
      2. Whiteboarding as an information-gathering tool
      3. Conducting a whiteboarding session
    3. Identifying data consumers and understanding their requirements
    4. Identifying data sources and ingesting data
    5. Identifying data transformations and optimizations
      1. File format optimizations
      2. Data standardization
      3. Data quality checks
      4. Data partitioning
      5. Data denormalization
      6. Data cataloging
      7. Whiteboarding data transformation
    6. Loading data into data marts
    7. Wrapping up the whiteboarding session
    8. Hands-on – architecting a sample pipeline
      1. Detailed notes from the project “Bright Light” whiteboarding meeting of GP Widgets, Inc
        1. Meeting notes
    9. Summary
  9. Ingesting Batch and Streaming Data
    1. Technical requirements
    2. Understanding data sources
      1. Data variety
        1. Structured data
        2. Semi-structured data
        3. Unstructured data
      2. Data volume
      3. Data velocity
      4. Data veracity
      5. Data value
      6. Questions to ask
    3. Ingesting data from a relational database
      1. AWS DMS
      2. AWS Glue
        1. Full one-off loads from one or more tables
        2. Initial full loads from a table, and subsequent loads of new records
        3. Creating AWS Glue jobs with AWS Lake Formation
      3. Other ways to ingest data from a database
      4. Deciding on the best approach to ingesting from a database
        1. The size of the database
        2. Database load
        3. Data ingestion frequency
        4. Technical requirements and compatibility
    4. Ingesting streaming data
      1. Amazon Kinesis versus Amazon Managed Streaming for Kafka (MSK)
        1. Serverless services versus managed services
        2. Open-source flexibility versus proprietary software with strong AWS integration
        3. At-least-once messaging versus exactly-once messaging
        4. A single processing engine versus niche tools
        5. Deciding on a streaming ingestion tool
    5. Hands-on – ingesting data with AWS DMS
      1. Deploying MySQL and an EC2 data loader via CloudFormation
      2. Creating an IAM policy and role for DMS
      3. Configuring DMS settings and performing a full load from MySQL to S3
      4. Querying data with Amazon Athena
    6. Hands-on – ingesting streaming data
      1. Configuring Kinesis Data Firehose for streaming delivery to Amazon S3
      2. Configuring Amazon Kinesis Data Generator (KDG)
      3. Adding newly ingested data to the Glue Data Catalog
      4. Querying the data with Amazon Athena
    7. Summary
  10. Transforming Data to Optimize for Analytics
    1. Technical requirements
    2. Overview of how transformations can create value
      1. Cooking, baking, and data transformations
      2. Transformations as part of a pipeline
    3. Types of data transformation tools
      1. Apache Spark
      2. Hadoop and MapReduce
      3. SQL
      4. GUI-based tools
    4. Common data preparation transformations
      1. Protecting PII data
      2. Optimizing the file format
      3. Optimizing with data partitioning
      4. Data cleansing
    5. Common business use case transformations
      1. Data denormalization
      2. Enriching data
      3. Pre-aggregating data
      4. Extracting metadata from unstructured data
    6. Working with Change Data Capture (CDC) data
      1. Traditional approaches – data upserts and SQL views
      2. Modern approaches – Open Table Formats (OTFs)
        1. Apache Iceberg
        2. Apache Hudi
        3. Databricks Delta Lake
    7. Hands-on – joining datasets with AWS Glue Studio
      1. Creating a new data lake zone – the curated zone
      2. Creating a new IAM role for the Glue job
      3. Configuring a denormalization transform using AWS Glue Studio
      4. Finalizing the denormalization transform job to write to S3
      5. Creating a transform job to join streaming and film data using AWS Glue Studio
    8. Summary
  11. Identifying and Enabling Data Consumers
    1. Technical requirements
    2. Understanding the impact of data democratization
      1. A growing variety of data consumers
      2. How a data mesh helps data consumers
    3. Meeting the needs of business users with data visualization
      1. AWS tools for business users
        1. A quick overview of Amazon QuickSight
    4. Meeting the needs of data analysts with structured reporting
      1. AWS tools for data analysts
        1. Amazon Athena
        2. AWS Glue DataBrew
        3. Running Python or R in AWS
    5. Meeting the needs of data scientists and ML models
      1. AWS tools used by data scientists to work with data
        1. SageMaker Ground Truth
        2. SageMaker Data Wrangler
        3. SageMaker Clarify
    6. Hands-on – creating data transformations with AWS Glue DataBrew
      1. Configuring new datasets for AWS Glue DataBrew
      2. Creating a new Glue DataBrew project
      3. Building your Glue DataBrew recipe
      4. Creating a Glue DataBrew job
    7. Summary
  12. A Deeper Dive into Data Marts and Amazon Redshift
    1. Technical requirements
    2. Extending analytics with data warehouses/data marts
      1. Cold and warm data
        1. Cold data
        2. Warm data
        3. Amazon S3 storage classes
      2. Hot data
    3. What not to do – anti-patterns for a data warehouse
      1. Using a data warehouse as a transactional datastore
      2. Using a data warehouse as a data lake
      3. Storing unstructured data
    4. Redshift architecture review and storage deep dive
      1. Data distribution across slices
      2. Redshift Zone Maps and sorting data
    5. Designing a high-performance data warehouse
      1. Provisioned versus Redshift Serverless clusters
      2. Selecting the optimal Redshift node type for provisioned clusters
      3. Selecting the optimal table distribution style and sort key
      4. Selecting the right data type for columns
        1. Character types
        2. Numeric types
        3. Datetime types
        4. Boolean type
        5. HLLSKETCH type
        6. SUPER type
      5. Selecting the optimal table type
        1. Local Redshift tables
        2. External tables for querying data in Amazon S3 with Redshift Spectrum
        3. Temporary staging tables for loading data into Redshift
        4. Data caching using Redshift materialized views
    6. Moving data between a data lake and Redshift
      1. Optimizing data ingestion in Redshift
      2. Automating data loads from Amazon S3 into Redshift
      3. Exporting data from Redshift to the data lake
    7. Exploring advanced Redshift features
      1. Data sharing between Redshift clusters
      2. Machine learning capabilities in Amazon Redshift
      3. Running Redshift clusters across multiple Availability Zones
      4. Redshift Dynamic Data Masking
      5. Zero-ETL between Amazon Aurora and Amazon Redshift
      6. Resizing a Redshift cluster
    8. Hands-on – deploying a Redshift Serverless cluster and running Redshift Spectrum queries
      1. Uploading our sample data to Amazon S3
      2. IAM roles for Redshift
      3. Creating a Redshift cluster
      4. Querying data in the sample database
      5. Using Redshift Spectrum to directly query data in the data lake
    9. Summary
  13. Orchestrating the Data Pipeline
    1. Technical requirements
    2. Understanding the core concepts for pipeline orchestration
      1. What is a data pipeline, and how do you orchestrate it?
        1. What is a directed acyclic graph?
      2. How do you trigger a data pipeline to run?
        1. Using manifest files as pipeline triggers
      3. How do you handle the failures of a step in your pipeline?
        1. Common reasons for failure in data pipelines
        2. Pipeline failure retry strategies
    3. Examining the options for orchestrating pipelines in AWS
      1. AWS Data Pipeline (now in maintenance mode)
      2. AWS Glue workflows to orchestrate Glue resources
        1. Monitoring and error handling
        2. Triggering Glue workflows
      3. Apache Airflow as an open-source orchestration solution
        1. Core concepts for creating Apache Airflow pipelines
      4. AWS Step Functions for a serverless orchestration solution
        1. A sample Step Functions state machine
      5. Deciding on which data pipeline orchestration tool to use
    4. Hands-on – orchestrating a data pipeline using AWS Step Functions
      1. Creating new Lambda functions
        1. Using a Lambda function to determine the file extension
        2. Using Lambda to randomly generate failures
      2. Creating an SNS topic and subscribing to an email address
      3. Creating a new Step Functions state machine
      4. Configuring our S3 bucket to send events to EventBridge
        1. Creating an EventBridge rule for triggering our Step Functions state machine
        2. Testing our event-driven data orchestration pipeline
    5. Summary
  14. Section 3: The Bigger Picture: Data Analytics, Data Visualization, and Machine Learning
  15. Ad Hoc Queries with Amazon Athena
    1. Technical requirements
    2. An introduction to Amazon Athena
    3. Tips and tricks to optimize Amazon Athena queries
      1. Common file format and layout optimizations
        1. Transforming raw source files to optimized file formats
        2. Partitioning the dataset
        3. Other file-based optimizations
      2. Writing optimized SQL queries
        1. Selecting only the specific columns that you need
        2. Using approximate aggregate functions
        3. Reusing Athena query results
    4. Exploring advanced Athena functionality
      1. Querying external data sources using Athena Federated Query
        1. Pre-built connectors and custom connectors
      2. Using Apache Spark in Amazon Athena
      3. Working with open table formats in Amazon Athena
      4. Provisioning capacity for queries
    5. Managing groups of users with Amazon Athena workgroups
      1. Managing Athena costs with Athena workgroups
        1. Per query data usage control
        2. Athena workgroup data usage controls
      2. Implementing governance controls with Athena workgroups
    6. Hands-on – creating an Amazon Athena workgroup and configuring Athena settings
    7. Hands-on – switching workgroups and running queries
    8. Summary
  16. Visualizing Data with Amazon QuickSight
    1. Technical requirements
    2. Representing data visually for maximum impact
      1. Benefits of data visualization
      2. Popular uses of data visualizations
        1. Trends over time
        2. Data over a geographic area
        3. Heat maps to represent the intersection of data
    3. Understanding Amazon QuickSight’s core concepts
      1. Standard versus Enterprise edition
      2. SPICE – the in-memory storage and computation engine for QuickSight
        1. Managing SPICE capacity
    4. Ingesting and preparing data from a variety of sources
      1. Preparing datasets in QuickSight versus performing ETL outside of QuickSight
    5. Creating and sharing visuals with QuickSight analyses and dashboards
      1. Visual types in Amazon QuickSight
        1. AutoGraph for automatic graphing
        2. Line, geospatial, and heat maps
        3. Bar charts
        4. Key performance indicators
        5. Tables as visuals
        6. Custom visual types
        7. Other visual types
    6. Understanding QuickSight’s advanced features
      1. Amazon QuickSight ML Insights
        1. Amazon QuickSight autonarratives
        2. ML-powered anomaly detection
        3. ML-powered forecasting
      2. Amazon QuickSight Q for natural language queries
        1. Generative BI dashboard authoring capabilities
        2. QuickSight Q Topics
        3. Fine-tuning your QuickSight Q Topics
      3. Amazon QuickSight embedded dashboards
        1. Embedding for registered QuickSight users
        2. Embedding for unauthenticated users
      4. Generating multi-page formatted reports
    7. Hands-on – creating a simple QuickSight visualization
      1. Setting up a new QuickSight account and loading a dataset
      2. Creating a new analysis
      3. Publishing our visual as a dashboard
    8. Summary
  17. Enabling Artificial Intelligence and Machine Learning
    1. Technical requirements
    2. Understanding the value of AI and ML for organizations
      1. Specialized AI projects
        1. Medical clinical decision support platform
        2. Early detection of diseases
        3. Making sports safer
      2. Everyday use cases for AI and ML
        1. Forecasting
        2. Personalization
        3. Natural language processing
        4. Image recognition
    3. Exploring AWS services for ML
      1. AWS ML services
        1. SageMaker in the ML preparation phase
        2. SageMaker in the ML build phase
        3. SageMaker in the ML training and tuning phase
        4. SageMaker in the ML deployment and management phase
    4. Exploring AWS services for AI
      1. AI for unstructured speech and text
        1. Amazon Transcribe for converting speech into text
        2. Amazon Textract for extracting text from documents
        3. Amazon Comprehend for extracting insights from text
      2. AI for extracting metadata from images and video
        1. Amazon Rekognition
      3. AI for ML-powered forecasts
        1. Amazon Forecast
      4. AI for fraud detection and personalization
        1. Amazon Fraud Detector
        2. Amazon Personalize
    5. Building generative AI solutions on AWS
      1. Understanding the foundations of generative AI technology
      2. Building on foundational models using Amazon SageMaker JumpStart
      3. Building on foundational models using Amazon Bedrock
    6. Common use cases for LLMs
    7. Hands-on – reviewing reviews with Amazon Comprehend
      1. Setting up a new Amazon SQS message queue
      2. Creating a Lambda function for calling Amazon Comprehend
      3. Adding Comprehend permissions for our IAM role
      4. Adding a Lambda function as a trigger for our SQS message queue
      5. Testing the solution with Amazon Comprehend
    8. Summary
  18. Section 4: Modern Strategies: Open Table Formats, Data Mesh, DataOps, and Preparing for the Real World
  19. Building Transactional Data Lakes
    1. Technical requirements
    2. What does it mean for a data lake to be transactional?
      1. Limitations of Hive-based data lakes
      2. High-level benefits of open table formats
        1. ACID transactions
        2. Record-level updates
        3. Schema evolution
        4. Time travel
      3. Overview of how open table formats work
      4. Approaches used by table formats for updating tables
        1. COW approach to table updates
        2. MOR approach to table updates
      5. Choosing between COW and MOR
    3. An overview of Delta Lake, Apache Hudi, and Apache Iceberg
      1. Deep dive into Delta Lake
        1. Advanced features available in Delta Lake
      2. Deep dive into Apache Hudi
        1. Hudi Primary Keys
        2. File groups
        3. Compaction
        4. Record-level index
      3. Deep dive into Apache Iceberg
        1. Iceberg Metadata file
        2. The manifest list file
        3. The manifest file
        4. Putting it together
        5. Maintenance tasks for Iceberg tables
    4. AWS service integrations for building transactional data lakes
      1. Open table format support in AWS Glue
        1. AWS Glue crawler support
        2. AWS Glue ETL engine support
      2. Open table support in AWS Lake Formation
      3. Open table support in Amazon EMR
      4. Open table support in Amazon Redshift
      5. Open table support in Amazon Athena
    5. Hands-on – Working with Apache Iceberg tables in AWS
      1. Creating an Apache Iceberg table using Amazon Athena
      2. Adding data to our Iceberg table and running queries
      3. Modifying data in our Iceberg table and running queries
      4. Iceberg table maintenance tasks
        1. Optimizing the table layout
        2. Reducing disk space by deleting snapshots
    6. Summary
  20. Implementing a Data Mesh Strategy
    1. Technical requirements
    2. What is a data mesh?
      1. Domain-oriented, decentralized data ownership
      2. Data as a product
      3. Self-service data infrastructure as a platform
      4. Federated computational governance
      5. Data producers and consumers
    3. Challenges that a data mesh approach attempts to resolve
      1. Bottlenecks with a centralized data team
      2. The “Analytics is not my problem” problem
      3. No organization-wide visibility into datasets that are available
    4. The organizational and technical challenges of building a data mesh
      1. Changing the way that an organization approaches analytical data
        1. Changes for the centralized data & analytics team
        2. Changes for line of business teams
      2. Technical challenges for building a data mesh
        1. Integrating existing analytical tools
        2. Centralizing dataset metadata in a single catalog and building automation
        3. Compromising on integrations
    5. AWS services that help enable a data mesh approach
      1. Querying data across AWS accounts
        1. Sharing data with AWS Lake Formation
      2. Amazon DataZone, a business data catalog with data mesh functionality
        1. DataZone concepts
        2. DataZone components
    6. A sample architecture for a data mesh on AWS
      1. Architecture for a data mesh using AWS-native services
      2. Architecture for a data mesh using non-AWS analytic services
        1. Automating the sharing of data in Snowflake
        2. Using query federation instead of data sharing
    7. Hands-on – Setting up Amazon DataZone
      1. Setting up AWS IAM Identity Center
      2. Enabling and configuring Amazon DataZone
      3. Adding a data source to our DataZone project
      4. Adding business metadata
      5. Creating a project for data analysis
      6. Searching the data catalog and subscribing to data
      7. Approving the subscription request
    8. Summary
  21. Building a Modern Data Platform on AWS
    1. Technical requirements
    2. Goals of a modern data platform
      1. A flexible and agile platform
      2. A scalable platform
      3. A well-governed platform
      4. A secure platform
      5. An easy-to-use, self-serve platform
    3. Deciding whether to build or buy a data platform
      1. Choosing to buy a data platform
        1. When to buy a data platform
      2. Choosing to build a data platform
        1. When to build a data platform
      3. A third way – implementing an open-source data platform
        1. The Serverless Data Lake Framework (SDLF)
        2. Core SDLF concepts
    4. DataOps as an approach to building data platforms
      1. Automation and observability as a key for DataOps
        1. Automating infrastructure and code deployment
        2. Automating observability
      2. AWS services for implementing a DataOps approach
        1. AWS services for infrastructure deployment
        2. AWS code management and deployment services
    5. Hands-on – automated deployment of data platform components and data transformation code
      1. Setting up a Cloud9 IDE environment
      2. Setting up our AWS CodeCommit repository
      3. Adding a Glue ETL script and CloudFormation template into our repository
      4. Automating deployment of our Glue code
      5. Automating the deployment of our Glue job
      6. Testing our CodePipeline
    6. Summary
  22. Wrapping Up the First Part of Your Learning Journey
    1. Technical requirements
    2. Understanding the complexities of real-world data environments
    3. Examining examples of real-world data pipelines
      1. A decade of data wrapped up for Spotify users
      2. Ingesting and processing streaming files at Netflix scale
        1. Enriching VPC Flow Logs with application information
        2. Working around Amazon SQS quota limits
    4. Imagining the future – a look at emerging trends
      1. Increased adoption of a data mesh approach
      2. Requirement to work in a multi-cloud environment
      3. Migration to open table formats
      4. Managing costs with FinOps
      5. The merging of data warehouses and data lakes
      6. The application of generative AI to business intelligence and analytics
      7. The application of generative AI to building transformations
    5. Hands-on – cleaning up your AWS account
      1. Reviewing AWS Billing to identify the resources being charged for
      2. Closing your AWS account
    6. Summary
  23. Other Books You May Enjoy
  24. Index

Product information

  • Title: Data Engineering with AWS - Second Edition
  • Author(s): Gareth Eagar
  • Release date: October 2023
  • Publisher(s): Packt Publishing
  • ISBN: 9781804614426