Chapter 7. Connecting from Dataproc to GCP Services
Dataproc is a managed service for running Hadoop and Spark jobs, and those jobs can connect to and interact with other GCP services. In this chapter, we’ll explore ways to connect Dataproc with popular GCP services such as Cloud SQL, BigQuery, Bigtable, and Pub/Sub Lite. We will also see how to configure Delta Lake tables on Dataproc and read them through BigLake.
You’ll get hands-on experience with the following connectors and tools:
- Spark-BigQuery connector
A specialized connector for high-performance Dataproc-BigQuery transfers
- Spark JDBC interface
An interface to connect Dataproc to Cloud SQL and other relational databases
- Pub/Sub Lite–Spark connector
A connector to integrate Dataproc with Pub/Sub Lite’s real-time messaging
- Dataproc templates
Preconfigured templates for common data tasks
- Delta Lake on Dataproc
Used to create and write to Delta Lake tables on Dataproc
- BigLake integration
Used to query Delta Lake tables using BigLake
Reading from GCS and Writing to a BigQuery Table
Problem
You need a Spark job running on Dataproc to read CSV data from GCS, process it, and write the results to a BigQuery table.
Solution
To achieve this, use the Spark-BigQuery connector. It is preinstalled on Dataproc, so no additional setup is required.
Here is the code snippet to write a DataFrame to BigQuery in append mode from Spark:
outputdf.write \
    .format("bigquery") \
    .option("writeMethod" ...
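Because the snippet above is truncated, the following is a minimal sketch of how the whole recipe might look end to end, not the book's own listing. It assumes the connector's documented `table`, `writeMethod`, and `temporaryGcsBucket` options. The helper names (`bigquery_write_options`, `append_to_bigquery`), the bucket path, the dataset and table names, and the `dropna` processing step are all hypothetical placeholders.

```python
from typing import Dict, Optional


def bigquery_write_options(
    table: str,
    write_method: str = "direct",
    temporary_gcs_bucket: Optional[str] = None,
) -> Dict[str, str]:
    """Build the option map passed to the Spark-BigQuery connector."""
    if write_method not in ("direct", "indirect"):
        raise ValueError(f"unsupported writeMethod: {write_method}")
    options = {"table": table, "writeMethod": write_method}
    if write_method == "indirect":
        # Indirect writes stage files in GCS before loading them into BigQuery.
        # This sketch requires the staging bucket to be passed explicitly.
        if not temporary_gcs_bucket:
            raise ValueError("indirect writes need a temporary GCS bucket")
        options["temporaryGcsBucket"] = temporary_gcs_bucket
    return options


def append_to_bigquery(df, options: Dict[str, str]) -> None:
    """Append a DataFrame to the BigQuery table named in the options."""
    df.write.format("bigquery").options(**options).mode("append").save()


def main() -> None:
    # Imported here so the helpers above can be exercised without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gcs-csv-to-bigquery").getOrCreate()

    # Hypothetical input path; replace with your own bucket and prefix.
    inputdf = (
        spark.read.option("header", "true")
        .option("inferSchema", "true")
        .csv("gs://example-bucket/input/*.csv")
    )

    # Placeholder processing step: drop rows in which every column is null.
    outputdf = inputdf.dropna(how="all")

    # Hypothetical destination in dataset.table form.
    append_to_bigquery(
        outputdf, bigquery_write_options("example_dataset.example_table")
    )
    spark.stop()
```

To run this as a Dataproc job, add a call to `main()` at the bottom of the file before submitting it. Building the options in a separate function keeps the choice between direct and indirect writes in one place and lets it be checked without a cluster.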