Book description
Become a successful data engineer by building and deploying your own data pipelines on Google Cloud, including making key architectural decisions
Key Features
- Get up to speed with data governance on Google Cloud
- Learn how to use various Google Cloud products like Dataform, DLP, Dataplex, Dataproc Serverless, and Datastream
- Boost your confidence by getting Google Cloud data engineering certification guidance from real exam experiences
- Purchase of the print or Kindle book includes a free PDF eBook
Book Description
The second edition of Data Engineering with Google Cloud builds upon the success of the first edition by offering enhanced clarity and depth to data professionals navigating the intricate landscape of data engineering.
Beyond its foundational lessons, this new edition delves into the essential realm of data governance within Google Cloud, providing you with invaluable insights into managing and optimizing data resources effectively. Written by a Data Strategic Cloud Engineer at Google, this book helps you stay ahead of the curve by guiding you through the latest technological advancements in the Google Cloud ecosystem. You’ll cover essential aspects, from exploring Cloud Composer 2 to the evolution of Airflow 2.5. Additionally, you’ll explore how to work with cutting-edge tools like Dataform, DLP, Dataplex, Dataproc Serverless, and Datastream to perform data governance on datasets.
By the end of this book, you'll be equipped to navigate the ever-evolving world of data engineering on Google Cloud, from foundational principles to cutting-edge practices.
What you will learn
- Load data into BigQuery and materialize its output
- Focus on data pipeline orchestration using Cloud Composer
- Formulate Airflow jobs to orchestrate and automate a data warehouse
- Establish a Hadoop data lake, generate ephemeral clusters, and execute jobs on the Dataproc cluster
- Harness Pub/Sub for messaging and ingestion for event-driven systems
- Apply Dataflow to conduct ETL on streaming data
- Implement data governance services on Google Cloud
Who this book is for
Data analysts, IT practitioners, software engineers, and data enthusiasts looking to build a successful data engineering career will find this book invaluable. Experienced data professionals who want to start using Google Cloud to build data platforms will also get clear guidance on how to do so. Whether you're a beginner who wants to explore the fundamentals or a seasoned professional seeking to learn the latest data engineering concepts, this book is for you.
Table of contents
- Data Engineering with Google Cloud Platform
- Foreword
- Contributors
- About the author
- About the reviewers
- Preface
- Part 1: Getting Started with Data Engineering with GCP
- Chapter 1: Fundamentals of Data Engineering
- Chapter 2: Big Data Capabilities on GCP
- Part 2: Build Solutions with GCP Components
- Chapter 3: Building a Data Warehouse in BigQuery
- Chapter 4: Building Workflows for Batch Data Loading Using Cloud Composer
- Technical requirements
- Introduction to Cloud Composer
- Understanding the working of Airflow
- Cloud Composer 1 vs Cloud Composer 2
- Provisioning Cloud Composer in a GCP project
- Exercise – build data pipeline orchestration using Cloud Composer
- Level 1 DAG – creating dummy workflows
- Deploying the DAG file into Cloud Composer
- Level 2 DAG – scheduling a pipeline from Cloud SQL to GCS and BigQuery datasets
- Level 3 DAG – parameterized variables
- Level 4 DAG – Guaranteeing task idempotency in Cloud Composer
- Level 5 DAG – handling DAG dependency using an Airflow dataset
- Summary
- Chapter 5: Building a Data Lake Using Dataproc
- Technical requirements
- Introduction to Dataproc
- A brief history of the data lake and Hadoop ecosystem
- A deeper look into Hadoop components
- How much Hadoop-related knowledge do you need on GCP?
- Introducing the Spark RDD and DataFrame concepts
- Introducing the data lake concept
- Hadoop and Dataproc positioning on GCP
- Introduction to Dataproc Serverless
- Exercise – Building a data lake on a Dataproc cluster
- Exercise – Creating and running jobs on a Dataproc cluster
- Understanding the concept of an ephemeral cluster
- Building an ephemeral cluster using Dataproc and Cloud Composer
- Summary
- Chapter 6: Processing Streaming Data with Pub/Sub and Dataflow
- Technical requirements
- Processing streaming data
- Introduction to Pub/Sub
- Introduction to Dataflow
- Exercise – publishing event streams to Pub/Sub
- Exercise – using Dataflow to stream data from Pub/Sub to GCS
- Introduction to CDC and Datastream
- Exercise – Datastream ETL streaming to BigQuery
- Step 1 – create a CloudSQL MySQL table
- Step 2 – create a GCS bucket
- Step 3 – create a GCS notification to the Pub/Sub topic and subscription
- Step 4 – create a BigQuery dataset
- Step 5 – configure a Datastream job
- Step 6 – run a Dataflow job from the Dataflow template
- Step 7 – insert a value in MySQL and check the result in BigQuery
- Summary
- Chapter 7: Visualizing Data to Make Data-Driven Decisions with Looker Studio
- Chapter 8: Building Machine Learning Solutions on GCP
- Technical requirements
- A quick look at ML
- Exercise – practicing ML code using Python
- The MLOps landscape in GCP
- Exercise – leveraging pre-built GCP models as a service
- Exercise – using GCP in AutoML to train an ML model
- Exercise – deploying a dummy workflow with Vertex AI Pipelines
- Exercise – deploying a scikit-learn model pipeline with Vertex AI
- Summary
- Part 3: Key Strategies for Architecting Top-Notch Solutions
- Chapter 9: User and Project Management in GCP
- Chapter 10: Data Governance in GCP
- Chapter 11: Cost Strategy in GCP
- Chapter 12: CI/CD on GCP for Data Engineers
- Chapter 13: Boosting Your Confidence as a Data Engineer
- Index
- Other Books You May Enjoy
Product information
- Title: Data Engineering with Google Cloud Platform - Second Edition
- Author(s):
- Release date: April 2024
- Publisher(s): Packt Publishing
- ISBN: 9781835080115