Azure Data Engineer Associate Certification Guide

Become well-versed in data engineering concepts and exam objectives to achieve the Azure Data Engineer Associate certification

Key Features

  • Understand and apply data engineering concepts to real-world problems and prepare for the DP-203 certification exam
  • Explore the various Azure services for building end-to-end data solutions
  • Gain a solid understanding of building secure and sustainable data solutions using Azure services

Book Description

Azure is one of the world's leading cloud providers, offering numerous services for data hosting and data processing. Most companies today are either cloud-native or migrating to the cloud faster than ever, which has led to an explosion of data engineering jobs, with aspiring and experienced data engineers competing to stand out.

Gaining the DP-203: Azure Data Engineer Associate certification is a sure-fire way of showing future employers that you have what it takes to become an Azure data engineer. This book will help you prepare for the DP-203 exam in a structured way, covering all the topics specified in the syllabus with detailed explanations and exam tips. It starts with the fundamentals of Azure and then, using a hypothetical company as a running example, walks you through the various stages of building data engineering solutions. Throughout the chapters, you'll learn about the Azure components involved in building data systems and explore them through a wide range of real-world use cases. Finally, you'll work through sample questions and answers to familiarize yourself with the pattern of the exam.

By the end of this Azure book, you'll have gained the confidence you need to pass the DP-203 exam with ease and land your dream job in data engineering.

What you will learn

  • Gain intermediate-level knowledge of the Azure data infrastructure
  • Design and implement data lake solutions with batch and stream pipelines
  • Identify the partition strategies available in Azure storage technologies
  • Implement different table geometries in Azure Synapse Analytics
  • Use the transformations available in T-SQL, Spark, and Azure Data Factory
  • Use Azure Databricks or Synapse Spark to process data using notebooks (see the sketch after this list)
  • Design security using RBAC, ACL, encryption, data masking, and more
  • Monitor and optimize data pipelines with debugging tips
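
To give a flavor of the book's hands-on material, here is a minimal PySpark sketch (illustrative only, not code from the book) of the kind of partitioned data lake write the partitioning and Spark chapters cover. The sample data, column names, and output path are hypothetical placeholders.

    # Minimal sketch, assuming a working PySpark installation: cleanse a tiny
    # DataFrame and write it as Parquet partitioned by year/month so that
    # filtering queries can prune files. All names and data are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("partitioned-write-sketch").getOrCreate()

    # Hypothetical trip records standing in for a real raw-zone dataset.
    trips = spark.createDataFrame(
        [("t1", "2024-03-01", 12.5), ("t2", "2024-03-01", None), ("t3", "2024-04-02", 3.2)],
        ["trip_id", "trip_date", "fare"],
    )

    cleaned = (
        trips.dropna(subset=["fare"])                          # basic cleansing
             .withColumn("trip_date", F.to_date("trip_date"))  # string -> date
             .withColumn("year", F.year("trip_date"))          # derive partition columns
             .withColumn("month", F.month("trip_date"))
    )

    # Locally this writes to /tmp; on Azure the path would be an ADLS Gen2 URI
    # such as abfss://<container>@<account>.dfs.core.windows.net/trips/.
    cleaned.write.mode("overwrite").partitionBy("year", "month").parquet("/tmp/trips")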

Who this book is for

This book is for data engineers who want to take the DP-203: Azure Data Engineer Associate exam and are looking to gain in-depth knowledge of the Azure cloud stack.

The book will also help engineers and product managers who are new to Azure, or who are interviewing with companies working on Azure technologies, gain hands-on experience of Azure data technologies. A basic understanding of cloud technologies, extract, transform, and load (ETL) processes, and databases will help you get the most out of this book.

Table of contents

  1. Azure Data Engineer Associate Certification Guide
  2. Contributors
  3. About the author
  4. About the reviewers
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. Download the example code files
    4. Download the color images
    5. Get in touch
    6. Reviews
    7. Share Your Thoughts
  6. Part 1: Azure Basics
  7. Chapter 1: Introducing Azure Basics
    1. Technical requirements
    2. Introducing the Azure portal
    3. Exploring Azure accounts, subscriptions, and resource groups
      1. Azure account
      2. Azure subscription
      3. Resource groups
      4. Establishing a use case
    4. Introducing Azure Services
      1. Infrastructure as a Service (IaaS)
      2. Platform as a Service (PaaS)
      3. Software as a Service (SaaS)
      4. Function as a Service (FaaS)
    5. Exploring Azure VMs
      1. Creating a VM using the Azure portal
      2. Creating a VM using the Azure CLI
    6. Exploring Azure Storage
      1. Azure Blob storage
      2. Azure Data Lake Storage Gen2
      3. Azure Files
      4. Azure Queues
      5. Azure Tables
      6. Azure Managed Disks
    7. Exploring Azure Networking (VNet)
    8. Exploring Azure Compute
      1. VM Scale Sets
      2. Azure App Service
      3. Azure Kubernetes Service
      4. Azure Functions
      5. Azure Service Fabric
      6. Azure Batch
    9. Summary
  8. Part 2: Data Storage
  9. Chapter 2: Designing a Data Storage Structure
    1. Technical requirements
    2. Designing an Azure data lake
      1. How is a data lake different from a data warehouse?
      2. When should you use a data lake?
      3. Data lake zones
      4. Data lake architecture
      5. Exploring Azure technologies that can be used to build a data lake
    3. Selecting the right file types for storage
      1. Avro
      2. Parquet
      3. ORC
      4. Comparing Avro, Parquet, and ORC
    4. Choosing the right file types for analytical queries
    5. Designing storage for efficient querying
      1. Storage layer
      2. Application layer
      3. Query layer
    6. Designing storage for data pruning
      1. Dedicated SQL pool example with pruning
      2. Spark example with pruning
    7. Designing folder structures for data transformation
      1. Streaming and IoT scenarios
      2. Batch scenarios
    8. Designing a distribution strategy
      1. Round-robin tables
      2. Hash tables
      3. Replicated tables
    9. Designing a data archiving solution
      1. Hot access tier
      2. Cool access tier
      3. Archive access tier
      4. Data life cycle management
    10. Summary
  10. Chapter 3: Designing a Partition Strategy
    1. Understanding the basics of partitioning
      1. Benefits of partitioning
    2. Designing a partition strategy for files
      1. Azure Blob storage
      2. ADLS Gen2
    3. Designing a partition strategy for analytical workloads
      1. Horizontal partitioning
      2. Vertical partitioning
      3. Functional partitioning
    4. Designing a partition strategy for efficiency/performance
      1. Iterative query performance improvement process
    5. Designing a partition strategy for Azure Synapse Analytics
      1. Performance improvement while loading data
      2. Performance improvement for filtering queries
    6. Identifying when partitioning is needed in ADLS Gen2
    7. Summary
  11. Chapter 4: Designing the Serving Layer
    1. Technical requirements
    2. Learning the basics of data modeling and schemas
      1. Dimensional models
    3. Designing Star and Snowflake schemas
      1. Star schemas
      2. Snowflake schemas
    4. Designing SCDs
      1. Designing SCD1
      2. Designing SCD2
      3. Designing SCD3
      4. Designing SCD4
      5. Designing SCD5, SCD6, and SCD7
    5. Designing a solution for temporal data
    6. Designing a dimensional hierarchy
    7. Designing for incremental loading
      1. Watermarks
      2. File timestamps
      3. File partitions and folder structures
    8. Designing analytical stores
      1. Security considerations
      2. Scalability considerations
    9. Designing metastores in Azure Synapse Analytics and Azure Databricks
      1. Azure Synapse Analytics
      2. Azure Databricks (and Azure Synapse Spark)
    10. Summary
  12. Chapter 5: Implementing Physical Data Storage Structures
    1. Technical requirements
    2. Getting started with Azure Synapse Analytics
    3. Implementing compression
      1. Compressing files using Synapse Pipelines or ADF
      2. Compressing files using Spark
    4. Implementing partitioning
      1. Using ADF/Synapse pipelines to create data partitions
      2. Partitioning for analytical workloads
    5. Implementing horizontal partitioning or sharding
      1. Sharding in Synapse dedicated pools
      2. Sharding using Spark
    6. Implementing distributions
      1. Hash distribution
      2. Round-robin distribution
      3. Replicated distribution
    7. Implementing different table geometries with Azure Synapse Analytics pools
      1. Clustered columnstore indexing
      2. Heap indexing
      3. Clustered indexing
    8. Implementing data redundancy
      1. Azure storage redundancy in the primary region
      2. Azure storage redundancy in secondary regions
      3. Azure SQL Geo Replication
      4. Azure Synapse SQL Data Replication
      5. Azure Cosmos DB Data Replication
      6. Example of setting up redundancy in Azure Storage
    9. Implementing data archiving
    10. Summary
  13. Chapter 6: Implementing Logical Data Structures
    1. Technical requirements
    2. Building a temporal data solution
    3. Building a slowly changing dimension
      1. Updating new rows
      2. Updating the modified rows
    4. Building a logical folder structure
    5. Implementing file and folder structures for efficient querying and data pruning
      1. Deleting an old partition
      2. Adding a new partition
    6. Building external tables
    7. Summary
  14. Chapter 7: Implementing the Serving Layer
    1. Technical requirements
    2. Delivering data in a relational star schema
    3. Implementing a dimensional hierarchy
      1. Synapse SQL serverless
      2. Synapse Spark
      3. Azure Databricks
    4. Maintaining metadata
      1. Metadata using Synapse SQL and Spark pools
      2. Metadata using Azure Databricks
    5. Summary
  15. Part 3: Design and Develop Data Processing (25-30%)
  16. Chapter 8: Ingesting and Transforming Data
    1. Technical requirements
    2. Transforming data by using Apache Spark
      1. What are RDDs?
      2. What are DataFrames?
    3. Transforming data by using T-SQL
    4. Transforming data by using ADF
      1. Schema transformations
      2. Row transformations
      3. Multi-I/O transformations
      4. ADF templates
    5. Transforming data by using Azure Synapse pipelines
    6. Transforming data by using Stream Analytics
    7. Cleansing data
      1. Handling missing/null values
      2. Trimming inputs
      3. Standardizing values
      4. Handling outliers
      5. Removing duplicates/deduping
    8. Splitting data
      1. File splits
    9. Shredding JSON
      1. Extracting values from JSON using Spark
      2. Extracting values from JSON using SQL
      3. Extracting values from JSON using ADF
    10. Encoding and decoding data
      1. Encoding and decoding using SQL
      2. Encoding and decoding using Spark
      3. Encoding and decoding using ADF
    11. Configuring error handling for the transformation
    12. Normalizing and denormalizing values
      1. Denormalizing values using Pivot
      2. Normalizing values using Unpivot
    13. Transforming data by using Scala
    14. Performing Exploratory Data Analysis (EDA)
      1. Data exploration using Spark
      2. Data exploration using SQL
      3. Data exploration using ADF
    15. Summary
  17. Chapter 9: Designing and Developing a Batch Processing Solution
    1. Technical requirements
    2. Designing a batch processing solution
    3. Developing batch processing solutions by using Data Factory, Data Lake, Spark, Azure Synapse Pipelines, PolyBase, and Azure Databricks
      1. Storage
      2. Data ingestion
      3. Data preparation/data cleansing
      4. Transformation
      5. Using PolyBase to ingest the data into the analytics data store
      6. Using Power BI to display the insights
    4. Creating data pipelines
    5. Integrating Jupyter/Python notebooks into a data pipeline
    6. Designing and implementing incremental data loads
    7. Designing and developing slowly changing dimensions
    8. Handling duplicate data
    9. Handling missing data
    10. Handling late-arriving data
      1. Handling late-arriving data in the ingestion/transformation stage
      2. Handling late-arriving data in the serving stage
    11. Upserting data
    12. Regressing to a previous state
    13. Introducing Azure Batch
      1. Running a sample Azure Batch job
    14. Configuring the batch size
    15. Scaling resources
      1. Azure Batch
      2. Azure Databricks
      3. Synapse Spark
      4. Synapse SQL
    16. Configuring batch retention
    17. Designing and configuring exception handling
      1. Types of errors
      2. Remedial actions
    18. Handling security and compliance requirements
      1. The Azure Security Benchmark
      2. Best practices for Azure Batch
    19. Summary
  18. Chapter 10: Designing and Developing a Stream Processing Solution
    1. Technical requirements
    2. Designing a stream processing solution
      1. Introducing Azure Event Hubs
      2. Introducing Azure Stream Analytics (ASA)
      3. Introducing Spark Streaming
    3. Developing a stream processing solution using ASA, Azure Databricks, and Azure Event Hubs
      1. A streaming solution using Event Hubs and ASA
      2. A streaming solution using Event Hubs and Spark Streaming
    4. Processing data using Spark Structured Streaming
    5. Monitoring for performance and functional regressions
      1. Monitoring in Event Hubs
      2. Monitoring in ASA
      3. Monitoring in Spark Streaming
    6. Processing time series data
      1. Types of timestamps
      2. Windowed aggregates
      3. Checkpointing or watermarking
      4. Replaying data from a previous timestamp
    7. Designing and creating windowed aggregates
      1. Tumbling windows
      2. Hopping windows
      3. Sliding windows
      4. Session windows
      5. Snapshot windows
    8. Configuring checkpoints/watermarking during processing
      1. Checkpointing in ASA
      2. Checkpointing in Event Hubs
      3. Checkpointing in Spark
    9. Replaying archived stream data
    10. Transformations using Stream Analytics
      1. The COUNT and DISTINCT transformations
      2. CAST transformations
      3. LIKE transformations
    11. Handling schema drifts
      1. Handling schema drifts using Event Hubs
      2. Handling schema drifts in Spark
    12. Processing across partitions
      1. What are partitions?
      2. Processing data across partitions
    13. Processing within one partition
    14. Scaling resources
      1. Scaling in Event Hubs
      2. Scaling in ASA
      3. Scaling in Azure Databricks Spark Streaming
    15. Handling interruptions
      1. Handling interruptions in Event Hubs
      2. Handling interruptions in ASA
    16. Designing and configuring exception handling
    17. Upserting data
    18. Designing and creating tests for data pipelines
    19. Optimizing pipelines for analytical or transactional purposes
    20. Summary
  19. Chapter 11: Managing Batches and Pipelines
    1. Technical requirements
    2. Triggering batches
    3. Handling failed Batch loads
      1. Pool errors
      2. Node errors
      3. Job errors
      4. Task errors
    4. Validating Batch loads
    5. Scheduling data pipelines in Data Factory/Synapse pipelines
    6. Managing data pipelines in Data Factory/Synapse pipelines
      1. Integration runtimes
      2. ADF monitoring
    7. Managing Spark jobs in a pipeline
    8. Implementing version control for pipeline artifacts
      1. Configuring source control in ADF
      2. Integrating with Azure DevOps
      3. Integrating with GitHub
    9. Summary
  20. Part 4: Design and Implement Data Security (10-15%)
  21. Chapter 12: Designing Security for Data Policies and Standards
    1. Technical requirements
    2. Introducing the security and privacy requirements
    3. Designing and implementing data encryption for data at rest and in transit
      1. Encryption at rest
      2. Encryption in transit
    4. Designing and implementing a data auditing strategy
      1. Storage auditing
      2. SQL auditing
    5. Designing and implementing a data masking strategy
    6. Designing and implementing Azure role-based access control and a POSIX-like access control list for Data Lake Storage Gen2
      1. Restricting access using Azure RBAC
      2. Restricting access using ACLs
    7. Designing and implementing row-level and column-level security
      1. Designing row-level security
      2. Designing column-level security
    8. Designing and implementing a data retention policy
    9. Designing to purge data based on business requirements
      1. Purging data in Azure Data Lake Storage Gen2
      2. Purging data in Azure Synapse SQL
    10. Managing identities, keys, and secrets across different data platform technologies
      1. Azure Active Directory
      2. Azure Key Vault
      3. Access keys and Shared Access keys in Azure Storage
    11. Implementing secure endpoints (private and public)
    12. Implementing resource tokens in Azure Databricks
    13. Loading a DataFrame with sensitive information
    14. Writing encrypted data to tables or Parquet files
    15. Designing for data privacy and managing sensitive information
      1. Microsoft Defender
    16. Summary
  22. Part 5: Monitor and Optimize Data Storage and Data Processing (10-15%)
  23. Chapter 13: Monitoring Data Storage and Data Processing
    1. Technical requirements
    2. Implementing logging used by Azure Monitor
    3. Configuring monitoring services
    4. Understanding custom logging options
    5. Interpreting Azure Monitor metrics and logs
      1. Interpreting Azure Monitor metrics
      2. Interpreting Azure Monitor logs
    6. Measuring the performance of data movement
    7. Monitoring data pipeline performance
    8. Monitoring and updating statistics about data across a system
      1. Creating statistics for Synapse dedicated pools
      2. Updating statistics for Synapse dedicated pools
      3. Creating statistics for Synapse serverless pools
      4. Updating statistics for Synapse serverless pools
    9. Measuring query performance
      1. Monitoring Synapse SQL pool performance
      2. Spark query performance monitoring
    10. Interpreting a Spark DAG
    11. Monitoring cluster performance
      1. Monitoring overall cluster performance
      2. Monitoring per-node performance
      3. Monitoring YARN queue/scheduler performance
      4. Monitoring storage throttling
    12. Scheduling and monitoring pipeline tests
    13. Summary
  24. Chapter 14: Optimizing and Troubleshooting Data Storage and Data Processing
    1. Technical requirements
    2. Compacting small files
    3. Rewriting user-defined functions (UDFs)
      1. Writing UDFs in Synapse SQL pool
      2. Writing UDFs in Spark
      3. Writing UDFs in Stream Analytics
    4. Handling skews in data
      1. Fixing skews at the storage level
      2. Fixing skews at the compute level
    5. Handling data spills
      1. Identifying data spills in Synapse SQL
      2. Identifying data spills in Spark
    6. Tuning shuffle partitions
    7. Finding shuffling in a pipeline
      1. Identifying shuffles in a SQL query plan
      2. Identifying shuffles in a Spark query plan
    8. Optimizing resource management
      1. Optimizing Synapse SQL pools
      2. Optimizing Spark
    9. Tuning queries by using indexers
      1. Indexing in Synapse SQL
      2. Indexing in the Synapse Spark pool using Hyperspace
    10. Tuning queries by using cache
    11. Optimizing pipelines for analytical or transactional purposes
      1. OLTP systems
      2. OLAP systems
      3. Implementing HTAP using Synapse Link and Cosmos DB
    12. Optimizing pipelines for descriptive versus analytical workloads
      1. Common optimizations for descriptive and analytical pipelines
      2. Specific optimizations for descriptive and analytical pipelines
    13. Troubleshooting a failed Spark job
      1. Debugging environmental issues
      2. Debugging job issues
    14. Troubleshooting a failed pipeline run
    15. Summary
  25. Part 6: Practice Exercises
  26. Chapter 15: Sample Questions with Solutions
    1. Exploring the question formats
    2. Case study-based questions
      1. Case study – data lake
    3. Scenario-based questions
      1. Shared access signature
    4. Direct questions
      1. ADF transformation
    5. Ordering sequence questions
      1. ASA setup steps
    6. Code segment questions
      1. Column security
    7. Sample questions from the Design and Implement Data Storage section
      1. Case study – data lake
      2. Data visualization
      3. Data partitioning
      4. Synapse SQL pool table design – 1
      5. Synapse SQL pool table design – 2
      6. Slowly changing dimensions
      7. Storage tiers
      8. Disaster recovery
      9. Synapse SQL external tables
    8. Sample questions from the Design and Develop Data Processing section
      1. Data lake design
      2. ASA windows
      3. Spark transformation
      4. ADF – integration runtimes
      5. ADF triggers
    9. Sample questions from the Design and Implement Data Security section
      1. TDE/Always Encrypted
      2. Auditing Azure SQL/Synapse SQL
      3. Dynamic data masking
      4. RBAC – POSIX
      5. Row-level security
    10. Sample questions from the Monitor and Optimize Data Storage and Data Processing section
      1. Blob storage monitoring
      2. T-SQL optimization
      3. ADF monitoring
      4. Setting up alerts in ASA
    11. Summary
    12. Why subscribe?
  27. Other Books You May Enjoy
    1. Packt is searching for authors like you
    2. Share Your Thoughts

Product information

  • Title: Azure Data Engineer Associate Certification Guide
  • Author(s): Newton Alex
  • Release date: February 2022
  • Publisher(s): Packt Publishing
  • ISBN: 9781801816069