Data Engineering with Google Cloud Platform - Second Edition

Book description

Become a successful data engineer by building and deploying your own data pipelines on Google Cloud, including making key architectural decisions

Key Features

  • Get up to speed with data governance on Google Cloud
  • Learn how to use various Google Cloud products like Dataform, DLP, Dataplex, Dataproc Serverless, and Datastream
  • Boost your confidence with Google Cloud data engineering certification guidance drawn from real exam experience
  • Purchase of the print or Kindle book includes a free PDF eBook

Book Description

The second edition of Data Engineering with Google Cloud builds upon the success of the first edition, offering enhanced clarity and depth for data professionals navigating the intricate landscape of data engineering.

Beyond its foundational lessons, this new edition delves into data governance on Google Cloud, giving you practical insight into managing and optimizing data resources effectively. Written by a Strategic Cloud Data Engineer at Google, this book helps you stay ahead of the curve by guiding you through the latest advancements in the Google Cloud ecosystem, from Cloud Composer 2 to the evolution of Airflow 2.5. You’ll also explore how to work with tools such as Dataform, DLP, Dataplex, Dataproc Serverless, and Datastream to apply data governance to your datasets.

By the end of this book, you'll be equipped to navigate the ever-evolving world of data engineering on Google Cloud, from foundational principles to cutting-edge practices.

What you will learn

  • Load data into BigQuery and materialize its output (see the sketch after this list)
  • Focus on data pipeline orchestration using Cloud Composer
  • Formulate Airflow jobs to orchestrate and automate a data warehouse
  • Establish a Hadoop data lake, generate ephemeral clusters, and execute jobs on the Dataproc cluster
  • Harness Pub/Sub for messaging and ingestion for event-driven systems
  • Apply Dataflow to conduct ETL on streaming data
  • Implement data governance services on Google Cloud
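
To give you a feel for the hands-on exercises, here is a minimal, illustrative sketch of the first item above: loading a local CSV file into a BigQuery table with the google-cloud-bigquery Python client. The project ID, dataset, table, and file names are placeholders rather than values from the book.

    # Minimal sketch (not code from the book): load a local CSV file into
    # a BigQuery table. Assumes `pip install google-cloud-bigquery` and
    # Application Default Credentials (e.g., `gcloud auth application-default login`).
    from google.cloud import bigquery

    client = bigquery.Client(project="your-gcp-project")   # placeholder project ID
    table_id = "your-gcp-project.raw_bikesharing.trips"    # placeholder dataset.table

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,                   # skip the CSV header row
        autodetect=True,                       # let BigQuery infer the schema
        write_disposition="WRITE_TRUNCATE",    # replace any existing rows
    )

    with open("trips.csv", "rb") as source_file:           # placeholder local file
        load_job = client.load_table_from_file(
            source_file, table_id, job_config=job_config
        )
    load_job.result()                          # wait for the load job to finish

    print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")

Schema autodetection keeps the sketch short; the book's BigQuery chapter covers data types and partitioned tables in more depth.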

Who this book is for

Data analysts, IT practitioners, software engineers, and data enthusiasts looking to build a successful data engineering career will find this book invaluable. Experienced data professionals who want to start using Google Cloud to build data platforms will also find clear guidance on how to navigate that path. Whether you're a beginner exploring the fundamentals or a seasoned professional catching up on the latest data engineering concepts, this book is for you.

Table of contents

  1. Data Engineering with Google Cloud Platform
  2. Foreword
  3. Contributors
  4. About the author
  5. About the reviewers
  6. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Download the example code files
    5. Conventions used
    6. Get in touch
    7. Share Your Thoughts
    8. Download a free PDF copy of this book
  7. Part 1: Getting Started with Data Engineering with GCP
  8. Chapter 1: Fundamentals of Data Engineering
    1. Understanding the data life cycle
      1. Understanding the need for a data warehouse
    2. Start with knowing the roles of a data engineer
      1. A data engineer versus a data scientist
      2. The focus of data engineers
    3. Going through the foundational concepts for data engineering
      1. ETL concept in data engineering
      2. The difference between ETL and ELT
      3. What is not big data?
      4. A quick look at how big data technologies store data
      5. A quick look at how to process multiple files using MapReduce
    4. Summary
    5. Exercise
    6. Further reading
  9. Chapter 2: Big Data Capabilities on GCP
    1. Technical requirements
    2. Understanding what the cloud is
      1. The difference between the cloud and non-cloud era
      2. The on-demand nature of the cloud
    3. Getting started with GCP
      1. Introduction to the GCP console
      2. Practicing pinning services
    4. A quick overview of GCP services for data engineering
      1. Understanding the GCP serverless service
      2. Service mapping and prioritization
      3. The concept of quotas on GCP services
      4. User account versus service account
    5. Summary
  10. Part 2: Build Solutions with GCP Components
  11. Chapter 3: Building a Data Warehouse in BigQuery
    1. Technical requirements
    2. Introduction to GCS and BigQuery
      1. BigQuery data location
    3. Introduction to the BigQuery console
      1. Creating a dataset in BigQuery using the console
      2. Loading the local CSV file into the BigQuery table
      3. Using public data in BigQuery
      4. Data types in BigQuery compared to other databases
      5. Timestamp data in BigQuery compared to other databases
    4. Preparing the prerequisites before developing our data warehouse
      1. Step 1 – Accessing Cloud Shell
      2. Step 2 – Checking the current setup using the command line
      3. Step 3 – Running the gcloud init command
      4. Step 4 – Downloading example data from Git
      5. Step 5 – Uploading data to GCS from Git
    5. Practicing developing a data warehouse
      1. Data warehouse in BigQuery – Requirements for scenario 1
      2. Steps and planning for handling scenario 1
      3. Data warehouse in BigQuery – Requirements for scenario 2
      4. Using the GCP console versus the code-based approach
      5. Steps and planning for handling scenario 2
    6. BigQuery’s useful features
      1. BigQuery console sub-menu options
      2. BigQuery partitioned table
    7. Summary
    8. Exercise – Scenario 3
    9. See also
  12. Chapter 4: Building Workflows for Batch Data Loading Using Cloud Composer
    1. Technical requirements
    2. Introduction to Cloud Composer
    3. Understanding the working of Airflow
    4. Cloud Composer 1 versus Cloud Composer 2
    5. Provisioning Cloud Composer in a GCP project
      1. Introducing the Airflow web UI
      2. Cloud Composer bucket directories
    6. Exercise – build data pipeline orchestration using Cloud Composer
      1. Level 1 DAG – creating dummy workflows
      2. Deploying the DAG file into Cloud Composer
      3. Level 2 DAG – scheduling a pipeline from Cloud SQL to GCS and BigQuery datasets
      4. Level 3 DAG – parameterized variables
      5. Level 4 DAG – guaranteeing task idempotency in Cloud Composer
      6. Level 5 DAG – handling DAG dependency using an Airflow dataset
    7. Summary
  13. Chapter 5: Building a Data Lake Using Dataproc
    1. Technical requirements
    2. Introduction to Dataproc
      1. A brief history of the data lake and Hadoop ecosystem
      2. A deeper look into Hadoop components
      3. How much Hadoop-related knowledge do you need on GCP?
      4. Introducing the Spark RDD and DataFrame concepts
      5. Introducing the data lake concept
      6. Hadoop and Dataproc positioning on GCP
      7. Introduction to Dataproc Serverless
    3. Exercise – Building a data lake on a Dataproc cluster
      1. Creating a Dataproc cluster on GCP
      2. Using GCS as an underlying Dataproc filesystem
    4. Exercise – Creating and running jobs on a Dataproc cluster
      1. Preparing log data in GCS and HDFS
      2. Developing a Spark ETL job from HDFS to HDFS
      3. Developing a Spark ETL job from GCS to GCS
      4. Developing a Spark ETL job from GCS to BigQuery
    5. Understanding the concept of an ephemeral cluster
      1. Practicing using a workflow template on Dataproc
    6. Building an ephemeral cluster using Dataproc and Cloud Composer
      1. Submitting a Spark ETL job from GCS to BigQuery using Dataproc Serverless
    7. Summary
  14. Chapter 6: Processing Streaming Data with Pub/Sub and Dataflow
    1. Technical requirements
    2. Processing streaming data
    3. Introduction to Pub/Sub
    4. Introduction to Dataflow
    5. Exercise – publishing event streams to Pub/Sub
      1. Creating a Pub/Sub topic
      2. Creating and running a Pub/Sub publisher using Python
      3. Creating a Pub/Sub subscription
    6. Exercise – using Dataflow to stream data from Pub/Sub to GCS
      1. Creating a HelloWorld application using Apache Beam
      2. Creating a Dataflow streaming job without aggregation
      3. Creating a streaming job with aggregation
    7. Introduction to CDC and Datastream
      1. What is Datastream?
    8. Exercise – Datastream ETL streaming to BigQuery
      1. Step 1 – create a Cloud SQL MySQL table
      2. Step 2 – create a GCS bucket
      3. Step 3 – create a GCS notification to the Pub/Sub topic and subscription
      4. Step 4 – create a BigQuery dataset
      5. Step 5 – configure a Datastream job
      6. Step 6 – run a Dataflow job from the Dataflow template
      7. Step 7 – insert a value in MySQL and check the result in BigQuery
    9. Summary
  15. Chapter 7: Visualizing Data to Make Data-Driven Decisions with Looker Studio
    1. Technical requirements
    2. Unlocking the power of your data with Looker Studio
      1. Don’t confuse Looker Studio with Looker
    3. From data to metrics in minutes with an illustrative use case
      1. Understanding what BigQuery INFORMATION_SCHEMA is
      2. Exercise – accessing the BigQuery INFORMATION_SCHEMA table using Looker Studio
      3. Exercise – creating a Looker Studio report using data from a bike-sharing data warehouse
    4. Understanding how Looker Studio can impact the cost of BigQuery
      1. What kind of table could be 1 TB in size?
      2. How can a table be accessed 10,000 times in a month?
    5. Creating Materialized Views and understanding how BI Engine works
      1. Understanding BI Engine
    6. Summary
  16. Chapter 8: Building Machine Learning Solutions on GCP
    1. Technical requirements
    2. A quick look at ML
    3. Exercise – practicing ML code using Python
      1. Preparing the ML dataset by using a table from the BigQuery public dataset
      2. Training the ML model using Random Forest in Python
      3. Creating a batch prediction using the training dataset’s output
    4. The MLOps landscape in GCP
      1. Understanding the basic principles of MLOps
      2. Introducing GCP services related to MLOps
    5. Exercise – leveraging pre-built GCP models as a service
      1. Uploading the image to a GCS bucket
      2. Creating a detect text function in Python
    6. Exercise – using GCP in AutoML to train an ML model
    7. Exercise – deploying a dummy workflow with Vertex AI Pipelines
      1. Creating a dedicated regional GCS bucket
      2. Developing the pipeline in Python
      3. Monitoring the pipeline on the Vertex AI Pipelines console
    8. Exercise – deploying a scikit-learn model pipeline with Vertex AI
      1. Creating the first pipeline, which will result in an ML model file in GCS
      2. Running the first pipeline in Vertex AI Pipelines
      3. Creating the second pipeline, which will use the model file and store the prediction results as a CSV file in GCS
      4. Running the second pipeline in Vertex AI Pipelines
    9. Summary
  17. Part 3: Key Strategies for Architecting Top-Notch Solutions
  18. Chapter 9: User and Project Management in GCP
    1. Technical requirements
    2. Understanding IAM in GCP
    3. Planning a GCP project structure
    4. Understanding the GCP organization, folder, and project hierarchy
      1. Deciding how many projects we should have in a GCP organization
    5. Controlling user access to our data warehouse
      1. Use-case scenario – planning BigQuery ACLs for an e-commerce organization
    6. Practicing the concept of IaC using Terraform
      1. Exercise – creating and running basic Terraform scripts
      2. Self-exercise – managing a GCP project and resources using Terraform
    7. Summary
  19. Chapter 10: Data Governance in GCP
    1. Technical requirements
    2. Introduction to data governance
    3. A deeper understanding of data usability
      1. Exercise – implementing metadata tagging using Dataplex
      2. A deeper understanding of data security
      3. Example – BigQuery data masking
      4. Exercise – finding PII using SDP
    4. A deeper understanding of data accountability
      1. Clear traceability
      2. Clear data ownership
      3. Data lineage
      4. Clear data quality process
      5. Exercise – practicing data quality using Dataform
    5. Summary
  20. Chapter 11: Cost Strategy in GCP
    1. Technical requirements
    2. Estimating the cost of your end-to-end data solution in GCP
      1. Comparing BigQuery on-demand and editions
      2. An example – estimating a data engineering use case
    3. Tips to optimize BigQuery using partitioned and clustered tables
      1. Partitioned tables
      2. Clustered tables
      3. An exercise – optimizing BigQuery on-demand cost
    4. Summary
  21. Chapter 12: CI/CD on GCP for Data Engineers
    1. Technical requirements
    2. An introduction to CI/CD
      1. Understanding the data engineer’s relationship with CI/CD practices
    3. Understanding CI/CD components with GCP services
    4. Exercise – implementing CI using Cloud Build
      1. Creating a GitHub repository using a Cloud Source Repository
      2. Developing the code and Cloud Build scripts
      3. Creating a Cloud Build trigger
      4. Pushing the code to the GitHub repository
    5. Exercise – deploying Cloud Composer jobs using Cloud Build
      1. Preparing the CI/CD environment
      2. Preparing the cloudbuild.yaml configuration file
      3. Pushing the DAG to our GitHub repository
      4. Checking the CI/CD result in the GCS bucket and Cloud Composer
    6. CI/CD best practices in data engineering
    7. Summary
    8. Further reading
  22. Chapter 13: Boosting Your Confidence as a Data Engineer
    1. Overviewing the Google Cloud certification
      1. Exam preparation tips
      2. Extra GCP service materials
    2. Quiz – reviewing all the concepts you’ve learned about
      1. Questions
      2. Answers
    3. The past, present, and future of data engineering
    4. Boosting your confidence and final thoughts
    5. Summary
  23. Index
    1. Why subscribe?
  24. Other Books You May Enjoy
    1. Packt is searching for authors like you
    2. Share Your Thoughts
    3. Download a free PDF copy of this book

Product information

  • Title: Data Engineering with Google Cloud Platform - Second Edition
  • Author(s): Adi Wijaya
  • Release date: April 2024
  • Publisher(s): Packt Publishing
  • ISBN: 9781835080115