Book description
Become well-versed in data engineering concepts and exam objectives to achieve Azure Data Engineer Associate certification
Key Features
- Understand and apply data engineering concepts to real-world problems and prepare for the DP-203 certification exam
- Explore the various Azure services for building end-to-end data solutions
- Gain a solid understanding of building secure and sustainable data solutions using Azure services
Book Description
Azure is one of the leading cloud providers in the world, providing numerous services for data hosting and data processing. Most companies today are either cloud-native or are migrating to the cloud faster than ever. This has led to an explosion of data engineering jobs, with aspiring and experienced data engineers trying to outshine each other.
Gaining the DP-203: Azure Data Engineer Associate certification is a sure-fire way of showing future employers that you have what it takes to become an Azure Data Engineer. This book will help you prepare for the DP-203 examination in a structured way, covering all the topics specified in the syllabus with detailed explanations and exam tips. The book starts by covering the fundamentals of Azure, and then takes the example of a hypothetical company and walks you through the various stages of building data engineering solutions. Throughout the chapters, you'll learn about the various Azure components involved in building the data systems and will explore them using a wide range of real-world use cases. Finally, you’ll work on sample questions and answers to familiarize yourself with the pattern of the exam.
By the end of this Azure book, you'll have gained the confidence you need to pass the DP-203 exam with ease and land your dream job in data engineering.
What you will learn
- Gain intermediate-level knowledge of the Azure data infrastructure
- Design and implement data lake solutions with batch and stream pipelines
- Identify the partition strategies available in Azure storage technologies
- Implement different table geometries in Azure Synapse Analytics
- Use the transformations available in T-SQL, Spark, and Azure Data Factory
- Use Azure Databricks or Synapse Spark to process data using Notebooks
- Design security using RBAC, ACL, encryption, data masking, and more
- Monitor and optimize data pipelines with debugging tips
Who this book is for
This book is for data engineers who want to take the DP-203: Azure Data Engineer Associate exam and are looking to gain in-depth knowledge of the Azure cloud stack. The book will also help engineers and product managers who are new to Azure, or who are interviewing with companies working on Azure technologies, get hands-on experience with Azure data technologies. A basic understanding of cloud technologies, extract, transform, and load (ETL), and databases will help you get the most out of this book.
Table of contents
- Azure Data Engineer Associate Certification Guide
- Contributors
- About the author
- About the reviewers
- Preface
- Part 1: Azure Basics
- Chapter 1: Introducing Azure Basics
- Part 2: Data Storage
- Chapter 2: Designing a Data Storage Structure
- Technical requirements
- Designing an Azure data lake
- Selecting the right file types for storage
- Choosing the right file types for analytical queries
- Designing storage for efficient querying
- Designing storage for data pruning
- Designing folder structures for data transformation
- Designing a distribution strategy
- Designing a data archiving solution
- Summary
- Chapter 3: Designing a Partition Strategy
- Understanding the basics of partitioning
- Designing a partition strategy for files
- Designing a partition strategy for analytical workloads
- Designing a partition strategy for efficiency/performance
- Designing a partition strategy for Azure Synapse Analytics
- Identifying when partitioning is needed in ADLS Gen2
- Summary
- Chapter 4: Designing the Serving Layer
- Technical requirements
- Learning the basics of data modeling and schemas
- Designing Star and Snowflake schemas
- Designing SCDs
- Designing a solution for temporal data
- Designing a dimensional hierarchy
- Designing for incremental loading
- Designing analytical stores
- Designing metastores in Azure Synapse Analytics and Azure Databricks
- Summary
- Chapter 5: Implementing Physical Data Storage Structures
- Technical requirements
- Getting started with Azure Synapse Analytics
- Implementing compression
- Implementing partitioning
- Implementing horizontal partitioning or sharding
- Implementing distributions
- Implementing different table geometries with Azure Synapse Analytics pools
- Implementing data redundancy
- Implementing data archiving
- Summary
- Chapter 6: Implementing Logical Data Structures
- Chapter 7: Implementing the Serving Layer
- Part 3: Design and Develop Data Processing (25-30%)
- Chapter 8: Ingesting and Transforming Data
- Technical requirements
- Transforming data by using Apache Spark
- Transforming data by using T-SQL
- Transforming data by using ADF
- Transforming data by using Azure Synapse pipelines
- Transforming data by using Stream Analytics
- Cleansing data
- Splitting data
- Shredding JSON
- Encoding and decoding data
- Configuring error handling for the transformation
- Normalizing and denormalizing values
- Transforming data by using Scala
- Performing Exploratory Data Analysis (EDA)
- Summary
- Chapter 9: Designing and Developing a Batch Processing Solution
- Technical requirements
- Designing a batch processing solution
- Developing batch processing solutions by using Data Factory, Data Lake, Spark, Azure Synapse Pipelines, PolyBase, and Azure Databricks
- Creating data pipelines
- Integrating Jupyter/Python notebooks into a data pipeline
- Designing and implementing incremental data loads
- Designing and developing slowly changing dimensions
- Handling duplicate data
- Handling missing data
- Handling late-arriving data
- Upserting data
- Regressing to a previous state
- Introducing Azure Batch
- Configuring the batch size
- Scaling resources
- Configuring batch retention
- Designing and configuring exception handling
- Handling security and compliance requirements
- Summary
- Chapter 10: Designing and Developing a Stream Processing Solution
- Technical requirements
- Designing a stream processing solution
- Developing a stream processing solution using ASA, Azure Databricks, and Azure Event Hubs
- Processing data using Spark Structured Streaming
- Monitoring for performance and functional regressions
- Processing time series data
- Designing and creating windowed aggregates
- Configuring checkpoints/watermarking during processing
- Replaying archived stream data
- Transformations using streaming analytics
- Handling schema drifts
- Processing across partitions
- Processing within one partition
- Scaling resources
- Handling interruptions
- Designing and configuring exception handling
- Upserting data
- Designing and creating tests for data pipelines
- Optimizing pipelines for analytical or transactional purposes
- Summary
- Chapter 11: Managing Batches and Pipelines
- Technical requirements
- Triggering batches
- Handling failed Batch loads
- Validating Batch loads
- Scheduling data pipelines in Data Factory/Synapse pipelines
- Managing data pipelines in Data Factory/Synapse pipelines
- Managing Spark jobs in a pipeline
- Implementing version control for pipeline artifacts
- Summary
- Part 4: Design and Implement Data Security (10-15%)
- Chapter 12: Designing Security for Data Policies and Standards
- Technical requirements
- Introducing the security and privacy requirements
- Designing and implementing data encryption for data at rest and in transit
- Designing and implementing a data auditing strategy
- Designing and implementing a data masking strategy
- Designing and implementing Azure role-based access control and a POSIX-like access control list for Data Lake Storage Gen2
- Designing and implementing row-level and column-level security
- Designing and implementing a data retention policy
- Designing to purge data based on business requirements
- Managing identities, keys, and secrets across different data platform technologies
- Implementing secure endpoints (private and public)
- Implementing resource tokens in Azure Databricks
- Loading a DataFrame with sensitive information
- Writing encrypted data to tables or Parquet files
- Designing for data privacy and managing sensitive information
- Summary
- Part 5: Monitor and Optimize Data Storage and Data Processing (10-15%)
- Chapter 13: Monitoring Data Storage and Data Processing
- Technical requirements
- Implementing logging used by Azure Monitor
- Configuring monitoring services
- Understanding custom logging options
- Interpreting Azure Monitor metrics and logs
- Measuring the performance of data movement
- Monitoring data pipeline performance
- Monitoring and updating statistics about data across a system
- Measuring query performance
- Interpreting a Spark DAG
- Monitoring cluster performance
- Scheduling and monitoring pipeline tests
- Summary
- Chapter 14: Optimizing and Troubleshooting Data Storage and Data Processing
- Technical requirements
- Compacting small files
- Rewriting user-defined functions (UDFs)
- Handling skews in data
- Handling data spills
- Tuning shuffle partitions
- Finding shuffling in a pipeline
- Optimizing resource management
- Tuning queries by using indexers
- Tuning queries by using cache
- Optimizing pipelines for analytical or transactional purposes
- Optimizing pipelines for descriptive versus analytical workloads
- Troubleshooting a failed Spark job
- Troubleshooting a failed pipeline run
- Summary
- Part 6: Practice Exercises
- Chapter 15: Sample Questions with Solutions
- Exploring the question formats
- Case study-based questions
- Scenario-based questions
- Direct questions
- Ordering sequence questions
- Code segment questions
- Sample questions from the Design and Implement Data Storage section
- Sample questions from the Design and Develop Data Processing section
- Sample questions from the Design and Implement Data Security section
- Sample questions from the Monitor and Optimize Data Storage and Data Processing section
- Summary
- Why subscribe?
- Other Books You May Enjoy
Product information
- Title: Azure Data Engineer Associate Certification Guide
- Author(s):
- Release date: February 2022
- Publisher(s): Packt Publishing
- ISBN: 9781801816069