Live online training

GCP Certification Prep Crash Course: Professional Data Engineer

Get both the big picture and the important little details

Topic: System Administration
Janani Ravi

There’s no doubting Google’s might in data, and all of that heft has been brought to bear in the data offerings on Google Cloud Platform (GCP). If you want to prove your ability to identify the right GCP technologies and combine them meaningfully to solve real-world problems, you’ll need the Google Professional Data Engineer certification.

It’s certainly not the easiest certification out there—and this is precisely why technology professionals vie to secure this credential. But you don’t need years of experience working on the Google Cloud Platform to succeed. You can leverage your experience with the Hadoop ecosystem and ML technologies such as scikit-learn, TensorFlow, and PyTorch, along with your ability to draw analogies and parallels across other platforms like AWS, as you prepare.

Expert Janani Ravi guides you through what you need to know to pass the Google Professional Data Engineer certification exam. You’ll dive into the important conceptual and practical aspects of GCP that will help maximize your chances of success—and learn how to identify the right GCP data technologies for your use case and implement a solution using those technologies. Join in to build a strong foundation in GCP and better understand the linkages between different GCP technologies that are key to success on this test.

What you'll learn and how you can apply it

By the end of this live online course, you’ll understand:

  • GCP offerings for big data processing, machine learning, and AI model building
  • GCP’s support for Hadoop and its ecosystem technologies (Spark, Hive, and Pig)
  • How to cut through the clutter of similarly named storage technologies and choose the right OLAP, OLTP, SQL, or NoSQL technology for your use case

And you’ll be able to:

  • Design migration paths into GCP from on-premises Hadoop, proprietary data warehouses, and other cloud platforms
  • Implement every step of the traditional machine learning workflow on GCP
  • Design integrated batch and streaming architectures using Cloud Pub/Sub, Dataproc, and other GCP technologies
  • Appropriately integrate different GCP technologies so that real-time data handling is correctly implemented
  • Estimate pricing and usage so as to avoid sticker shock when your cloud bills come due
  • Choose between pretrained models, transfer learning, and from-scratch development for your ML models

This training course is for you because...

  • You’re an experienced data professional now looking to master GCP.
  • You have some exposure to at least one of the following: MySQL, Hadoop, Spark, scikit-learn, or Teradata.
  • You want to quickly secure the Professional Data Engineer certification.


Prerequisites

  • A GCP account with billing enabled (required for in-class exercises)
  • A basic working knowledge of Google Cloud Platform
  • Familiarity with cloud computing, database technologies such as MySQL, big data technologies such as Hadoop and Spark, and ML technologies such as scikit-learn and TensorFlow (useful but not required)

About your instructor

  • Janani Ravi is a cofounder of Loonycorn, a team dedicated to upskilling IT professionals. She’s been involved in more than 75 online courses in Azure and GCP. Previously, Janani worked at Google, Flipkart, and Microsoft. She completed her studies at Stanford.


Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Data offerings on Google Cloud Platform (20 minutes)

  • Presentation: Test format mechanics; GCP offerings for big data applications (Dataproc and managed Hadoop, streaming data, and real-time data handling); GCP offerings for machine learning applications (pretrained ML models such as the Speech, Vision, Text, and Translate APIs, plus AI Platform and ML Engine for the ML workflow); other offerings (AutoML and BQML)
  • Group discussion: How is cloud adoption influencing the design of big data architectures? How do the AI Platform and other offerings map to machine learning use cases?

Taxonomy of storage solutions on GCP (20 minutes)

  • Presentation: Block storage solutions for use from within VM instances; Cloud Storage buckets and their usage; Cloud SQL for small-scale RDBMS applications; Cloud Spanner for truly scalable RDBMS applications; BigQuery as the big attraction on GCP; Bigtable for very large data requiring very fast access; Memorystore, Cloud Firestore, Cloud Datastore, and other specialized solutions; pricing considerations
  • Hands-on exercise: Given a set of scenarios, pick the appropriate storage solution
  • Q&A

Break (5 minutes)

Cloud SQL as starter relational database on GCP (35 minutes)

  • Presentation: Cloud SQL as the starter relational database on GCP; Cloud SQL instances, connections, and the use of proxies; replication and high-availability configurations with Cloud SQL; cloud-first scenarios versus migration scenarios
  • Hands-on exercises: Given specific enterprise scenarios, determine when Cloud SQL would be a suitable choice of technology; create a Cloud SQL instance and connect to it using a MySQL client

Cloud Spanner as a specialized, high-end relational database (15 minutes)

  • Presentation: Features, limitations, and pricing of Cloud Spanner; data model, schema design, and secondary indexes; architecture (instances and replicas); interleaved tables and cascading delete; transaction support, including read-only and read-write transactions
  • Hands-on exercises: Given specific enterprise scenarios, determine when Cloud Spanner would be a suitable choice of technology; contrast Cloud Spanner with Bigtable and BigQuery
  • Q&A

Break (5 minutes)

BigQuery for data warehousing and OLAP use cases (45 minutes)

  • Presentation: The importance of BigQuery within the GCP suite of offerings; BigQuery versus Hive and BigQuery versus Teradata; BigQuery versus relational database technologies; BigQuery integrations with Bigtable, Dataproc, and Datalab; BigQuery connectors and visualization offerings; migration into GCP, comparing migration into Cloud Storage with migration into BigQuery
  • Hands-on exercises: Differentiate between OLAP and OLTP use cases; enumerate the advantages of BigQuery over other OLAP offerings; enumerate the advantages of BigQuery over GCP and non-GCP OLTP offerings; create a BigQuery dataset and table, load data into BigQuery, and query data from BigQuery

Bigtable, Memorystore, and Firebase: Specialized GCP data offerings (15 minutes)

  • Presentation: NoSQL offerings on GCP; Bigtable’s unique architecture and design; Bigtable use cases and potential pitfalls; key design and cluster optimization with Bigtable; Memorystore as managed Redis on GCP; Firebase, Cloud Firestore, and the erstwhile Cloud Datastore—document databases on GCP
  • Hands-on exercise: Choose the right NoSQL offering on GCP
  • Q&A

Break (5 minutes)

Dataproc and Cloud Pub/Sub for batch and stream processing (30 minutes)

  • Presentation: Mapping from the Hadoop ecosystem to GCP offerings; Dataproc as managed Hadoop (migration, cluster sizing, and pricing); using Spark, PySpark, Pig, and Hive on GCP; integrations between Spark, Hadoop, and GCP technologies; Cloud Pub/Sub for stream processing; Cloud Composer for orchestration use cases; integrated batch and stream architecture with Dataflow and Dataproc; the “reference architecture”: Cloud Pub/Sub into Dataproc into BigQuery
  • Hands-on exercises: Explore migration strategies from on-premises Hadoop to Dataproc; use Cloud Pub/Sub to publish and subscribe to messages

Taxonomy of ML and AI offerings on GCP (45 minutes)

  • Presentation: Understanding the machine learning workflow; Google ML APIs as pretrained models for standard use cases; traditional ML versus deep learning; AI Platform and Cloud ML Engine for traditional ML and deep learning; democratization of ML/AI with BigQuery ML and AutoML
  • Hands-on exercises: Enumerate the classic ML problems—classification, regression, clustering, dimensionality reduction; identify the steps in the classic machine learning workflow; use BigQuery ML to build a simple regression model
  • Q&A