Serverless ETL and Analytics with AWS Glue

Name: Serverless ETL and Analytics with AWS Glue
ISBN: 9781800564985

by Vishal Pathak, Subramanya Vajiraya, Noritaka Sekiyama, Tomohiro Tanaka, Albert Quiroga, Ishan Gaur

August 2022

Intermediate to advanced

434 pages

10h 34m

English

Packt Publishing

Serverless ETL and Analytics with AWS Glue
Contributors; About the authors; About the reviewers
Preface
Who this book is for; What this book covers; To get the most out of this book; Download the example code files; Download the color images; Conventions used; Get in touch; Share Your Thoughts
Section 1 – Introduction, Concepts, and the Basics of AWS Glue
Chapter 1: Data Management – Introduction and Concepts
Types of data processing – OLTP and OLAP; Data warehouses and data marts; Data lakes; Data lakehouse; Data mesh; Distributed computing for big data; Apache Spark; Apache Spark on the AWS cloud; AWS Glue; Querying data using AWS; Summary
Chapter 2: Introduction to Important AWS Glue Features
Data integration; Integrating data with AWS Glue; Data discovery; Data ingestion; Data preparation; Data replication; Features of AWS Glue; AWS Glue Data Catalog; Glue connections; AWS Glue crawlers; Custom classifiers; AWS Glue Schema Registry; AWS Glue ETL jobs; Glue development endpoints; AWS Glue interactive sessions; Triggers; Summary
Chapter 3: Data Ingestion
Technical requirements; Data ingestion from file/object stores; Data ingestion from Amazon S3; Data ingestion from HDFS data stores; Data ingestion from JDBC data stores; AWS Glue custom JDBC connectors; Data ingestion from streaming data sources; AWS Glue Schema Registry; Data ingestion from SaaS data stores; Summary
Section 2 – Data Preparation, Management, and Security
Chapter 4: Data Preparation
Technical requirements; Introduction to data preparation; Data preparation using AWS Glue; Visual data preparation using AWS Glue DataBrew; Source code-based approach to data preparation using AWS Glue; Selecting the right service/tool; Summary
Chapter 5: Data Layouts
Technical requirements; Why do we need to pay attention to data layout?; Key techniques to optimally storing data; Selecting a file format; Compressing your data; Splittable or unsplittable files; Partitioning; Bucketing; Optimizing the number of files and each file size; What is compaction?; Compaction with AWS Glue ETL Spark jobs; Automatic Compaction with AWS Lake Formation acceleration; Optimizing your storage with Amazon S3; Selecting suitable S3 storage classes for your data; Using S3 Lifecycle for managing object lifecycles; Summary; Further reading
Chapter 6: Data Management
Technical requirements; Normalizing data; Casting data types and map column names; Inferring schemas; Computing schemas on the fly; Enforcing schemas; Flattening nested schemas; Normalizing scale; Handling missing values and outliers; Normalizing date and time values; Handling error records; Deduplicating records; Denormalizing tables; Securing data content; Masking values; Hashing values; Managing data quality; AWS Glue DataBrew data quality rules; DeeQu; Summary
Chapter 7: Metadata Management
Technical requirements; Populating metadata; Glue Data Catalog API; DDL statements; Glue crawlers; Crawler configuration; Maintaining metadata; Glue crawlers; Updating Data Catalog tables from ETL jobs; Partition management; Partition indexes; Versioning and rollback; Table versioning; Lake Formation-governed tables; Lineage; Glue DataBrew; Summary
Chapter 8: Data Security
Technical requirements; Access control; IAM permissions; Glue dependencies on other AWS services; S3 bucket policies; S3 object ownership; Lake Formation permissions; Encryption; Encryption at rest; Encryption in transit; Network; Glue network architecture; Glue connections; Network configuration requirements and limitations; Connecting to resources on the public internet; Connecting to resources in your on-premise network; Summary
Chapter 9: Data Sharing
Technical requirements; Overview of data sharing strategies; Single tenant; Hub and spoke; Data mesh; Sharing data with multiple AWS accounts using S3 bucket policies and Glue catalog policies; Scenario 1 – sharing data from one account with another using S3 bucket policies and Glue catalog policies; Prerequisite – S3; Prerequisite – Glue; Configuring S3 bucket policies and Glue Catalog resource policies; Sharing data with multiple AWS accounts using AWS Lake Formation permissions; Lake Formation permission model; Lake Formation cross-account sharing; Lake Formation named resource-based access control; Lake Formation tag-based access control; Scenario 2 – sharing data from one account with another using Lake Formation Tag-based access control; Prerequisite – S3; Prerequisite – Glue; Prerequisite – Lake Formation and IAM; Step 1 – configuring Glue catalog policies; Step 2 – configuring Lake Formation permissions (producer); Step 3 – configuring Lake Formation permissions (consumer); Summary
Chapter 10: Data Pipeline Management
Technical requirements; What are data pipelines?; Why do we need data pipelines?; How do we build and manage data pipelines?; Selecting the appropriate data processing services for your analysis; AWS Batch; Amazon ECS; AWS Lambda; AWS Glue ETL jobs; Amazon EMR; Orchestrating your pipelines with workflow tools; Using AWS Glue workflows; Using AWS Step Functions; Using Amazon Managed Workflows for Apache Airflow; Automating how you provision your pipelines with provisioning tools; Provisioning resources with AWS CloudFormation; Provisioning AWS Glue workflows and resources with AWS Glue Blueprints; Developing and maintaining your data pipelines; Developing AWS Glue ETL jobs locally; Deploying AWS Glue ETL jobs; Deploying workflows and pipelines using provisioning tools such as IaC; Summary; Further reading
Section 3 – Tuning, Monitoring, Data Lake Common Scenarios, and Interesting Edge Cases
Chapter 11: Monitoring
Defining an SLA for a data platform; Monitoring the SLA of a data platform; Monitoring the components of a data platform; Monitoring state changes; Monitoring delay; Monitoring performance; Monitoring common failures; Monitoring log messages; Analyzing usage; Summary
Chapter 12: Tuning, Debugging, and Troubleshooting
Tuning AWS Glue workloads; Tuning AWS Glue crawlers; Tuning the performance of AWS Glue Spark ETL jobs; Troubleshooting and debugging common issues in AWS Glue ETL; ETL job failures; Summary
Chapter 13: Data Analysis
Creating Marketplace connections; Creating the Glue Hudi connection; Creating a Delta Lake connection; Creating an OpenSearch connection; Creating the CloudFormation stack; Prerequisites for creating the CloudFormation stack; The benefit of ad hoc analysis and how a data lake enables it; Amazon Athena; Amazon Redshift Spectrum; Creating and updating Hudi tables using Glue; Creating and updating Delta Lake tables using Glue; Inserting data into Lake Formation governed tables; Consuming streaming data using Glue; Creating chapter-data-analysis-msk-connection; Loading and consuming data from MSK using Glue; Glue streaming job as a consumer of a Kafka topic; Hudi DeltaStreamer streaming job as a consumer of a Kafka topic; Creating and consuming CDC data through streaming jobs on Glue; Glue’s integration with OpenSearch; Cleaning up; Summary
Chapter 14: Machine Learning Integration
Technical requirements; Glue ML transformations; Creating an ML transform; Training an ML transform; Using an ML transform; SageMaker integration; Developing ML pipelines with Glue; Summary
Chapter 15: Architecting Data Lakes for Real-World Scenarios and Edge Cases
Technical requirements; Running a highly selective query on a big fact table using AWS Glue; Hands-on tutorial; Dealing with Join performance issues with big fact and small dimension tables in ETL workloads; Solving Join problems involving big fact and big dimension tables using AWS Glue; Hands-on tutorial; Solution; Reducing time on read operations using AWS Glue grouping; Solving S3 eventual consistency problems using AWS Glue; Using glueparquet; S3-optimized output committer; Summary
Why subscribe?
Other Books You May Enjoy; Packt is searching for authors like you; Share Your Thoughts

Content preview from Serverless ETL and Analytics with AWS Glue

Chapter 15: Architecting Data Lakes for Real-World Scenarios and Edge Cases

We are now well versed in the concept of a data lake, a centralized repository that allows you to store all your structured and unstructured data at any scale. Since a data lake primarily focuses on storage, it does not require as much processing power as other methods (such as the data warehouse), making it easier, faster, and more cost-effective to scale up as data volumes grow.

The data lake is not just a repository – it requires a well-designed data architecture, along with proper planning and management. Because it follows a data-driven design, it lets you ingest raw data rapidly, before any business requirements come into the picture. There are a variety of tools ...
