Chapter 1. Creating a Dataproc Cluster
Dataproc is a paid Google Cloud service built on top of open source software, including Apache Hadoop, Apache Spark, and other big data technologies such as Apache Kafka, JupyterHub, and Apache Solr. As a managed service, Dataproc abstracts away the work of creating, updating, managing, and deleting the cloud services and resources that a cluster requires.
Dataproc can run in three different environments:
- Dataproc on Google Compute Engine (GCE)
- Dataproc on Google Kubernetes Engine (GKE)
- Dataproc Serverless
In this chapter, we focus on the first option: running Dataproc on GCE (see Figure 1-1).
Before you start using Dataproc, you should understand how it is billed. The service has two types of charges: a charge for the Dataproc software itself and a charge for the underlying components (Compute Engine, disks, Cloud Storage, network, etc.). Dataproc’s pay-as-you-go model means you pay only for the services you use. For more information on pricing, refer to the Dataproc pricing documentation. Dataproc Serverless has a different pricing model, which we will discuss in Chapter 4.
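To make the two charge types concrete, here is a minimal Python sketch that adds a software charge to an infrastructure charge. It assumes the software charge scales with the cluster's total vCPU count and its running time, and it lumps all underlying components into a single hourly rate. The names `ClusterUsage` and `estimate_cost` and every rate shown are hypothetical placeholders rather than published prices, so take real figures from the Dataproc pricing documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterUsage:
    """Hypothetical description of how much cluster capacity was used."""
    vcpus: int    # total vCPUs across all nodes in the cluster
    hours: float  # how long the cluster was running


def estimate_cost(usage: ClusterUsage,
                  software_rate_per_vcpu_hour: float,
                  infra_rate_per_hour: float) -> dict:
    """Split an estimated bill into its two charge types.

    Both rates are caller-supplied assumptions; this function does not
    know any real prices.
    """
    # Charge 1: the Dataproc software, assumed to scale with vCPUs and time.
    software = usage.vcpus * usage.hours * software_rate_per_vcpu_hour
    # Charge 2: underlying components (VMs, disks, storage, network),
    # simplified here to one combined hourly rate.
    infrastructure = usage.hours * infra_rate_per_hour
    return {
        "software": software,
        "infrastructure": infrastructure,
        "total": software + infrastructure,
    }


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    usage = ClusterUsage(vcpus=12, hours=2.0)
    bill = estimate_cost(usage,
                         software_rate_per_vcpu_hour=0.01,
                         infra_rate_per_hour=1.50)
    for charge, amount in bill.items():
        print(f"{charge}: {amount:.2f}")
```

Keeping the two charges separate in the result mirrors how they appear on a bill: deleting a cluster stops both charges, but any data left in Cloud Storage continues to accrue its own storage cost.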
Figure 1-1. Dataproc on GCE high-level architecture diagram
The first step in creating a Dataproc cluster is to get a Google Cloud account. If you don’t already have one, sign up at cloud.google.com.
Tip
Google encourages new users to try ...