Chapter 1. Creating a Dataproc Cluster
This chapter provides a basic understanding of the prerequisites for creating a Dataproc cluster, as well as the components that make up a cluster. We will discuss the various options for creating and customizing Dataproc clusters.
Dataproc is a paid Google Cloud service built on top of open source software (OSS) such as Apache Hadoop and Apache Spark, along with other big data technologies like Kafka, JupyterHub, and Solr. As a managed service, Dataproc abstracts the creation, updating, management, and deletion of all the required cloud services and resources.
Dataproc offers three different environments to run in:
- Dataproc on GCE (Google Compute Engine)
- Dataproc on GKE (Google Kubernetes Engine)
- Dataproc Serverless
In this chapter, we focus on the first option: running Dataproc on GCE.
Before you start using this product, understand the billing and charges. This service has two types of charges: a charge for the software and a charge for the underlying components (Compute Engine, disks, Cloud Storage, network, etc.). Dataproc’s pay-as-you-go model means you pay only for the services you use, for the time you use them. For more information on pricing, please refer to the documentation at https://cloud.google.com/dataproc/pricing. Dataproc Serverless has a different pricing model that we will discuss in later chapters of this book.
The first step in the process of creating a Dataproc cluster is to secure a Google Cloud account. If you don’t have one already, sign up for a new Google Cloud account at https://cloud.google.com/.
Tip
Google encourages new users to try out the GCP products by offering $300 in free credits across multiple services. Learn more about Google’s free tier products at https://cloud.google.com/free.
Throughout your learning journey, the product documentation at https://cloud.google.com/dataproc/docs will complement this book. To stay up to date on product releases, monitor https://cloud.google.com/dataproc/docs/release-notes. Google offers a paid support model to help you with any GCP-related issues; more details can be found at https://cloud.google.com/support. If you are in an enterprise, your organization might already have a support plan purchased. To participate in public discussions and ask questions, join the Google Group at cloud-dataproc-discuss@googlegroups.com.
Let’s get started with the key components to install before working with Dataproc.
1.1 Installing the Google Cloud CLI
Problem
You want to install the Google Cloud CLI on your machine to interact with GCP services from the command line.
Solution
You can download the gcloud CLI software from the Google Cloud SDK downloads repository. The installation instructions differ based on the type of machine (Mac/Windows/Linux) you have.
Discussion
Tip
Alternatively, you can use the browser-based Cloud Shell, which comes with the Cloud SDK, gcloud, Cloud Code, an online code editor, and other utilities pre-installed, fully authenticated, and up to date. To access Cloud Shell, open https://console.cloud.google.com?cloudshell=true in your browser.
Mac Users
- Open https://cloud.google.com/sdk/docs/install#mac in your browser.
- Check that a supported version of Python is installed on your machine. At the time of writing this book, the minimum required version is Python 3. Run python3 -V or python -V to find the version of Python you have.
- If Python is missing, download and install it from https://www.python.org/downloads/macos/.
- Find your Mac architecture by running the command uname -m.
- Download the package that matches your Mac machine architecture (x86_64/arm/x86) from the hyperlink in step #2 of https://cloud.google.com/sdk/docs/install#mac.
- The downloaded file is a compressed (tar.gz) archive. Extract it using the command tar -xvzf downloaded-file-name.
- Install the CLI with the command ./google-cloud-sdk/install.sh.
- After a successful installation, run the gcloud init command to authenticate with Google Cloud. It will ask you to configure the default user, project, and region/zone.
- You have now downloaded, installed, and set up authentication with the gcloud CLI and are ready to run commands on your machine. You can verify the installation as shown below.
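To confirm the installation succeeded, you can run a couple of quick checks (a minimal sketch; the component versions and defaults shown will differ on your machine):
gcloud --version
gcloud config list
The first command prints the installed SDK components and their versions; the second shows the default account, project, and region/zone that gcloud init configured.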
Windows Users
- Download the installer from https://dl.google.com/dl/cloudsdk/channels/rapid/GoogleCloudSDKInstaller.exe.
- Run the installer by double-clicking it and follow the prompts.
Linux Users
- Find your Linux machine architecture (32-bit or 64-bit) using the command getconf LONG_BIT.
- Check the Python version using the command python3 -V or python -V. The minimum required version is Python 3.
- If you are using a 64-bit machine, Python is bundled with the installer and there is no need to install it manually. Otherwise, download it from https://www.python.org/downloads/.
- Download the CLI package that matches your Linux machine architecture (32/64-bit) from the hyperlink in step #2 of https://cloud.google.com/sdk/docs/install#linux.
- The downloaded file is a compressed (tar.gz) archive. Extract it using the command tar -xvzf downloaded-file-name.
- Install the CLI with the command ./google-cloud-sdk/install.sh.
- Restart your shell/terminal.
- After a successful installation, run the gcloud init command to authenticate with Google Cloud. It will ask you to configure the default user, project, and region/zone. You can also set these defaults later, as shown below.
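If you skipped any of the defaults during gcloud init, or want to change them later, you can set them individually with gcloud config. This is a minimal sketch; the project ID and region values are placeholders to replace with your own:
gcloud config set project <PROJECT_ID>
gcloud config set compute/region us-central1
gcloud config set compute/zone us-central1-a
gcloud config set dataproc/region us-central1
Setting dataproc/region saves you from passing --region on every gcloud dataproc command.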
1.2 Granting IAM Privileges to a User
Problem
You want to grant IAM permissions to a user or service account for creating a Dataproc cluster.
Solution
Grant an IAM role at the service level:
gcloud projects add-iam-policy-binding <PROJECT_ID> \
  --member="user:<EMAIL_ADDRESS>" \
  --role=roles/dataproc.editor
Grant the service account user permission:
gcloud iam service-accounts add-iam-policy-binding <COMPUTE_ENGINE_DEFAULT_SERVICE_ACCOUNT> \
  --member="user:<EMAIL_ADDRESS>" \
  --role=roles/iam.serviceAccountUser
Discussion
The user or service account creating the cluster requires IAM privileges. If you are new to Google Cloud, IAM (Identity and Access Management) is the service that controls who can access which resource in Google Cloud. Accessing IAM itself requires an IAM role (Viewer/Editor/Owner). If you own the project, you will have the Owner role for accessing IAM; otherwise, check with your project/platform admin to get the required access.
Resources in Google Cloud are organized in a hierarchy, with the Organization as the parent, followed by Folders, Projects, and Services and Resources (GCS bucket, Compute Engine instance, Dataproc cluster, BigQuery table, etc.).
IAM policies created at the parent level (within a hierarchy) are inherited by child components. For instance, the Editor role assigned at the project level will be inherited by all services and resources created within that project. Similarly, an Editor role assigned at the Dataproc service level will be inherited by all clusters within that Dataproc service.
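To see which roles are already bound at the project level (and therefore inherited by the resources below it), you can inspect the project's IAM policy. A minimal sketch, assuming you have at least Viewer access on the project:
gcloud projects get-iam-policy <PROJECT_ID> \
  --flatten="bindings[].members" \
  --format="table(bindings.role, bindings.members)"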
IAM offers basic roles, shown in Table 1-1, that can be applied at the project level.
IAM Basic Role | Description |
Viewer | Grants read-only access. |
Editor | All Viewer permissions, plus access to create/modify/delete resources. |
Owner | All Editor permissions, plus additional high-level administrative permissions. Managing IAM permissions, setting up billing accounts, and deleting projects are a few examples. |
The gcloud command for granting Editor access at the project level:
export PROJECT_ID=<PROJECT_ID>
export SERVICE_ACCOUNT_EMAIL=dataprociamtest0827
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="user:${SERVICE_ACCOUNT_EMAIL}" \
  --role=roles/editor
Granting Editor access at the project level gives the user Editor access to all services in the project. To limit the user to the Dataproc service only, Dataproc provides predefined roles, as shown in Table 1-2.
IAM Role | Description |
Dataproc Administrator | Grants full control over Dataproc resources. |
Dataproc Editor | Grants permission to create and manage clusters and view the underlying resources. |
Dataproc Viewer | Grants read-only access to Dataproc resources. |
Dataproc Worker | Assigned to Compute Engine machines for performing cluster tasks. |
To assign the Dataproc Editor role to a user, run the gcloud command:
export PROJECT_ID=<PROJECT_ID>
export SERVICE_ACCOUNT_EMAIL=dataprociamtest0827
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="user:${SERVICE_ACCOUNT_EMAIL}" \
  --role=roles/dataproc.editor
Each IAM role is a collection of permissions. Granting the Dataproc Editor role to a user gives them permission to create clusters, along with additional privileges to manage and delete clusters. For fine-grained permissions specific to cluster creation, you can create a custom role.
Here is the gcloud command to create a custom role:
export PROJECT_ID=<PROJECT_ID>
gcloud iam roles create custom.dataprocEditor \
  --project=${PROJECT_ID} \
  --title="Custom Dataproc Editor" \
  --description="Custom role for creating and managing Dataproc clusters" \
  --permissions=dataproc.clusters.create
Now assign this custom role to the user:
export PROJECT_ID=<PROJECT_ID>
export SERVICE_ACCOUNT_EMAIL=<SERVICE_ACCOUNT_EMAIL>
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="user:${SERVICE_ACCOUNT_EMAIL}" \
  --role=projects/${PROJECT_ID}/roles/custom.dataprocEditor
Dataproc internally uses two types of service accounts:
- Dataproc VM Service Account
- Dataproc Service Agent Service Account
The Dataproc VM Service account is used to create underlying resources like Compute Engine instances and to perform data plane operations like reading and writing data to Google Cloud Storage (GCS). Dataproc uses the Compute Engine default service account as the Dataproc VM Service account, but this can be customized using the --service-account option.
Users creating clusters also need the Service Account User role on the Dataproc VM service account:
export SERVICE_ACCOUNT_EMAIL=<SERVICE_ACCOUNT_EMAIL>
gcloud iam service-accounts add-iam-policy-binding \
  <COMPUTE_ENGINE_DEFAULT_SERVICE_ACCOUNT> \
  --member="serviceAccount:${SERVICE_ACCOUNT_EMAIL}" \
  --role=roles/iam.serviceAccountUser
Tip
The project’s Compute Engine default service account can be listed using the gcloud command:
gcloud iam service-accounts list \
  --filter="displayName:Compute Engine default service account" \
  --project=<PROJECT_ID_HERE>
The Dataproc Service Agent service account is responsible for control plane operations such as creating, updating, and deleting clusters. Dataproc creates this service account automatically, and it cannot be replaced with a custom service account.
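You can confirm the service agent exists by filtering the project's IAM policy for it. This is a sketch under the assumption that the Dataproc API has already been enabled in the project; the service agent typically has the form service-<PROJECT_NUMBER>@dataproc-accounts.iam.gserviceaccount.com:
gcloud projects get-iam-policy <PROJECT_ID> \
  --flatten="bindings[].members" \
  --filter="bindings.members:dataproc-accounts.iam.gserviceaccount.com" \
  --format="table(bindings.role, bindings.members)"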
1.3 Configuring a Network and Firewall rules
Problem
You want to create a new Virtual Private Cloud (VPC) network for hosting virtual machines and attach firewall rules to allow communication between the machines.
Solution
Create a VPC network:
gcloud compute networks create <NETWORK_NAME> \
  --subnet-mode auto \
  --description "VPC network hosting dataproc resources"
Attach a firewall rule:
gcloud compute firewall-rules create <FIREWALL_NAME> \
  --network <NETWORK_NAME> \
  --allow [PROTOCOL[:PORT]] \
  --source-ranges <IP_RANGE>
Discussion
Compute Engine instances that are part of a Dataproc cluster must reside within a VPC network to communicate with each other and, when necessary, with external resources. The Dataproc service mandates that all VMs in the cluster be able to communicate with each other using the ICMP, TCP (all ports), and UDP (all ports) protocols.
The default project network typically has subnets created in the range of 10.128.0.0/9. It also includes the ‘default-allow-internal’ firewall rule, which permits communication within this subnet range. Table 1-3 illustrates an example of these rules. If you are creating a custom network, ensure you establish a rule aligned with Dataproc’s requirements to enable internal communication.
Direction | Priority | Source range | Protocols:Ports |
ingress | 65534 | For the default network: 10.128.0.0/9. For a custom VPC/subnet, use the custom subnet range. | tcp:0-65535, udp:0-65535, icmp |
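Before creating a custom rule, it can help to look at what the default network already allows. A quick sketch for listing the firewall rules attached to the default network (output columns may vary by gcloud version):
gcloud compute firewall-rules list --filter="network:default"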
To create a VPC network in auto mode:
gcloud compute networks create dataproc-vpc \
  --subnet-mode auto \
  --description "VPC network hosting dataproc resources"
Tip
Configuring the subnet mode as ‘auto’ creates multiple subnets (one for each GCP region). Because subnets exist in all available regions, you can create a Dataproc cluster in any region.
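If you want to see how many subnets an auto-mode VPC actually created, you can list them (a sketch, assuming the dataproc-vpc network from the example above):
gcloud compute networks subnets list --filter="network:dataproc-vpc"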
To prevent the automatic creation of too many (40+) subnets in auto mode, you can instead create subnets only in the regions you need. This is a two-step process: first create a VPC, and then add a subnet to it.
Creating a VPC with custom subnet mode creates an empty VPC without any subnets:
gcloud compute networks create dataproc-vpc \
  --subnet-mode custom \
  --description "VPC network hosting dataproc resources"
Create a subnet in the us-east1 region with a range of 10.120.0.0/20:
gcloud compute networks subnets create dataproc-vpc-us-east1-subnet \
  --network=dataproc-vpc \
  --region=us-east1 \
  --range=10.120.0.0/20
Tip
A subnet range determines how many IP addresses the subnet can hold. The range 10.120.0.0/20 has a capacity of 4,096 IP addresses, with 4,094 being usable (excluding the very first address, 10.120.0.0, which is the network address, and the last, 10.120.15.255, which is the broadcast address).
To choose a suitable subnet range for the maximum number of hosts on a Dataproc cluster, you will need to consider the expected number of hosts and allow room for growth.
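For example, a /20 range provides 2^(32-20) = 4,096 addresses, while a /24 provides only 2^(32-24) = 256. A cluster that might grow to a few hundred workers (plus any secondary workers and other VMs sharing the subnet) would quickly exhaust a /24, whereas a /20 leaves ample headroom.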
Resources in a VPC are not reachable until you create a firewall rule to allow communication. Attach a firewall rule matching the Dataproc service requirements.
Create a firewall rule for the custom subnet with the IP range 10.120.0.0/20:
gcloud compute firewall-rules create dataproc-allow-tcp-udp-icmp-all-ports \
  --network dataproc-vpc \
  --allow tcp:0-65535,udp:0-65535,icmp \
  --source-ranges "10.120.0.0/20"
See Also
Refer to the Google Cloud public documentation to learn more about VPC networks and firewall rules: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/network
1.4 Creating a Dataproc Cluster from the Web UI
Problem
You want to create a Dataproc cluster using the web UI.
Solution
The Google Cloud console is a web-based UI for accessing Google Cloud services. It offers the option to create and manage Dataproc clusters.
Discussion
To create a Dataproc cluster using the web UI, first log in to the Google Cloud console at https://console.cloud.google.com/, as shown in Figure 1-4.
Once logged in to the Google Cloud console, you will see the dashboard or home page, as shown in Figure 1-5.
In the search bar, enter the keyword Dataproc and select the service, as shown in Figure 1-6.
The Dataproc service is not enabled by default in a project. If the Dataproc API is not enabled, you might be prompted to enable it; click the Enable button, as shown in Figure 1-7.
Tip
If a billing account is not linked to the project, you might be asked to link an existing billing account or create a new one.
Selecting the Dataproc service from the search results takes you to the Dataproc service home page. Click Create Cluster to create a new cluster.
Dataproc clusters can be created on Compute Engine or Google Kubernetes Engine (GKE). Select the Cluster on Compute Engine option, as shown in Figure 1-9.
To create a basic cluster with all default values, just enter two values, the cluster name and the region, and click the Create button, as shown in Figure 1-10.
Creating a cluster takes up to 90 seconds. Once it is successfully created, you will see the cluster on the Dataproc service home page. Click the cluster name hyperlink to view the cluster details.
1.5 Creating a Dataproc Cluster using gcloud
Problem
Manually creating clusters from the UI is time-consuming when you need multiple clusters. You want to accelerate development and testing by creating clusters from the command line on your local machine.
Solution
Install the Google Cloud CLI on your machine and run the gcloud dataproc clusters create command.
Command to create cluster with basic configuration:
gcloud dataproc clusters create basic-cluster --region us-central1
Command to create a cluster with a custom configuration (machine types, disks, network, etc.):
gcloud dataproc clusters create basic-cluster \
  --region us-central1 \
  --zone "" \
  --image-version 2.0-debian10 \
  --master-machine-type n1-standard-4 \
  --worker-machine-type n1-standard-8 \
  --master-boot-disk-type pd-ssd \
  --master-boot-disk-size 100 \
  --worker-boot-disk-type pd-ssd \
  --worker-boot-disk-size 200 \
  --num-worker-local-ssds 2 \
  --network default \
  --enable-component-gateway
Discussion
To create a Dataproc cluster with gcloud, you provide the cluster name and the region where the cluster should be created. The following command creates a Dataproc cluster named basic-cluster in the region us-central1.
gcloud dataproc clusters create basic-cluster --region us-central1
The cluster created from the command will use the defaults shown in Table 1-4.
Property | Default Value |
Number of Master Nodes | 1 |
Number of Primary Workers | 2 |
Machine Type | Chooses a machine type based on Dataproc internal configuration. |
Network | When no network is specified, the cluster uses the network named "default" that is available in the project. |
Zone | Intelligently picks a zone within the specified region. |
Dataproc Version | Defaults to the latest available version. |
Disk Type | Standard Persistent Disk |
Disk Size | 1000 GB |
Component Gateway | Disabled |
Tip
The default values may be changed by Google over time. For the latest defaults, refer to the Google documentation (https://cloud.google.com/sdk/gcloud/reference/dataproc/clusters/create).
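One way to see the defaults that were actually applied is to describe the cluster you just created; the output includes the machine types, disk sizes, and image version that Dataproc chose. A sketch, assuming the basic-cluster example above in us-central1:
gcloud dataproc clusters describe basic-cluster --region us-central1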
To view the list of clusters available in a region and project:
gcloud dataproc clusters list --project {project_name_here} --region {region_name_here}
To delete a cluster:
gcloud dataproc clusters delete basic-cluster --region=us-central1
The minimum required arguments are the cluster name and the region where the cluster will be created. Additional customizations, such as machine types, secondary workers, disk type/size for primary workers and secondary workers, high availability, and component gateway, can also be configured using the gcloud command.
Let’s look at a command that customizes a few more components of the cluster:
gcloud dataproc clusters create basic-cluster \
  --region us-central1 \
  --zone "" \
  --image-version 2.0-debian10 \
  --master-machine-type n1-standard-4 \
  --worker-machine-type n1-standard-8 \
  --master-boot-disk-type pd-ssd \
  --master-boot-disk-size 100 \
  --worker-boot-disk-type pd-ssd \
  --worker-boot-disk-size 200 \
  --num-worker-local-ssds 2 \
  --network default \
  --enable-component-gateway
In this command, the following components are customized:
- region: The region where your cluster will be created.
- zone: Google Cloud has multiple zones in each region. For example, the us-central1 region has zones us-central1-a, us-central1-b, and us-central1-c. If you do not specify a zone when creating a Dataproc cluster, Dataproc will choose a zone for you in the specified region.
- image-version: The combination of operating system and Hadoop technology stack. Image version 2.0-debian10 comes with the Debian 10 operating system, Apache Hadoop 3.x, and Apache Spark 3.1. It is recommended to explicitly specify the image version when creating Dataproc clusters to ensure consistency in the cluster configuration. Refer to https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-version-clusters for available and supported Dataproc versions.
- master-machine-type: The Compute Engine machine type for the master node services.
- master-boot-disk-type: The disk type to be attached to the master node. Accepted values are pd-ssd, pd-standard, and pd-balanced.
- master-boot-disk-size: The size of the master boot disk. By default, values are assumed to be in GB; a value of 100 refers to a 100 GB disk attached to the master node.
- worker-boot-disk-type: The disk type to be attached to the worker nodes. Accepted values are pd-ssd, pd-standard, and pd-balanced.
Tip
A Dataproc cluster consists of VMs provided by the Google Cloud Compute Engine service. Choose a VM machine type that suits your data processing needs. GCP offers a variety of machine types:
- General purpose (N1, N2, N2D, T2D, T2A, etc.)
- Cost optimized (E2)
- Memory optimized (M1)
- CPU optimized (C2, C2D, C3)
- Custom machine types that can be created with custom memory and CPU configurations.
For data pipelines, the general-purpose machine types N2D and N2 are popular choices. Workloads that require high-performance compute use C3 machine types or GPUs.
- worker-boot-disk-size: The size of the worker boot disk. By default, values are assumed to be in GB; a value of 200 refers to a 200 GB disk attached to each worker node.
Tip
Dataproc clusters require storage attached to compute nodes for storing persistent or temporary data. Google Cloud offers the following storage options:
- PD Standard (Persistent Disk Standard)
- PD SSD (Persistent Disk SSD)
- Local SSD
Local SSDs are a recommended choice for Dataproc worker nodes because they offer greater performance than PD Standard and are less expensive than PD SSD.
- num-worker-local-ssds: The number of local SSDs to attach to each worker. Local SSDs are the recommended storage for worker nodes; they offer higher performance than standard disks with a good price-to-performance ratio.
- num-workers: The number of primary worker nodes. You can also add secondary worker nodes that do compute only and no storage (HDFS); see the example after this list.
Tip
Primary workers are the only worker machines that have a DataNode component for storing HDFS data. Choose the number of primary workers based on the amount of HDFS storage needed. Not all of your data goes to HDFS; in future chapters, we will cover what gets stored in HDFS vs. the local file system vs. GCS.
- network: The virtual network that pools cloud resources together. When you create a project, it comes with a default network.
- enable-component-gateway: Creates access to web endpoints for services like the Resource Manager, NameNode web UI, and Spark History Server.
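As a sketch of how these options combine, the following command adds two secondary (compute-only) workers alongside the two primary workers; the cluster name and flag values here are illustrative, not recommendations:
gcloud dataproc clusters create mixed-worker-cluster \
  --region us-central1 \
  --num-workers 2 \
  --num-secondary-workers 2 \
  --worker-machine-type n1-standard-4 \
  --enable-component-gateway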
We will learn more about these customizations in later chapters.
1.6 Creating a Dataproc Cluster using API Endpoints
Problem
You want to create a cluster using the REST API to make the cluster creation process platform independent.
Solution
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://dataproc.googleapis.com/v1/projects/{project_name}/regions/{region-name}/clusters"
Discussion
Create a JSON request file with all the required configuration:
request.json
{ "projectId": "dataproctest", "clusterName": "dataproc-test-cluster", "config": { "gceClusterConfig": { "networkUri": "default", "zoneUri": "us-central1-c" }, "masterConfig": { "numInstances": 1, "machineTypeUri": "n2-standard-4", "diskConfig": { "bootDiskType": "pd-standard", "bootDiskSizeGb": 500, "numLocalSsds": 0 } }, "softwareConfig": { "imageVersion": "2.1-debian11", }, "workerConfig": { "numInstances": 2, "machineTypeUri": "n2-standard-4", "diskConfig": { "bootDiskType": "pd-standard", "bootDiskSizeGb": 500, "numLocalSsds": 2, "localSsdInterface": "SCSI" } } }, "labels": { "billing_account": "test-account" } }
Making a REST API call to Google Cloud services requires the user to provide an authorization token. Authenticate from the command line before executing the curl command.
To authenticate with a personal account, run the command
gcloud auth login
To authenticate as a service account using a credentials JSON file, run the command
gcloud auth activate-service-account --key-file=<credential-json-file-location>
Execute the following curl command, replacing project_name and region-name (e.g., us-central1).
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://dataproc.googleapis.com/v1/projects/{project_name}/regions/{region-name}/clusters"
Successful execution of the curl command will produce output like the following:
{
  "name": "projects/{project-name}/regions/{region-name}/operations/b5706e31......",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.dataproc.v1.ClusterOperationMetadata",
    "clusterName": "cluster-name",
    "clusterUuid": "5fe882b2-...",
    "status": {
      "state": "PENDING",
      "innerState": "PENDING",
      "stateStartTime": "2019-11-21T00:37:56.220Z"
    },
    "operationType": "CREATE",
    "description": "Create cluster with 2 workers",
    "warnings": [
      "For PD-Standard without local SSDs, we strongly recommend provisioning 1TB ..."
    ]
  }
}
1.7 Creating a Dataproc Cluster using Terraform
Problem
You want to automate the provisioning and management of clusters with an Infrastructure as Code (IaC) framework.
Solution
Terraform is an Infrastructure as Code (IaC) tool that allows users to create and maintain cloud infrastructure using a declarative configuration language.
Discussion
Terraform is a widely used IaC tool for creating, maintaining, and managing cloud platform resources. It supports multiple cloud vendors, including AWS, GCP, and Azure.
Install Terraform by following the instructions in the public documentation at https://developer.hashicorp.com/terraform/downloads.
Terraform code execution involves the following commands:
- init: Initializes the state of resources.
- plan: Runs a preview and tells you which changes would be applied on top of the current state of resources.
- apply: Applies the changes to the resources.
- destroy: Deletes all the resources.
Following is sample Terraform code to create a basic Dataproc cluster with minimal customization. The code has the following configuration blocks:
- provider: The provider block enables interaction between Terraform and Google Cloud Platform. A service account credentials file in JSON format is configured for authentication.
- google_dataproc_cluster: This resource contains configuration specific to the Dataproc cluster being created.
- google_compute_network: This resource creates the VPC network that hosts the cluster.
- google_compute_firewall: This resource creates the firewall rules that allow communication between the cluster nodes.
provider "google" { credentials = file("service-account-credentials-file.json") project = "project-id" region = "us-central1" } resource "google_dataproc_cluster" "clusterCreationResource" { provider = google name = "basic-cluster" region = "us-central1" cluster_config { gce_cluster_config { network = google_compute_network.dataproc_network.name } master_config { num_instances = 1 machine_type = "n1-standard-4" } worker_config { num_instances = 2 machine_type = "n1-standard-8" } endpoint_config { enable_http_port_access = "true" } } } resource "google_compute_network" "dataproc_network" { name = "basic-cluster-network" auto_create_subnetworks = true } resource "google_compute_firewall" "firewall_rules" { name = "basic-cluster-firewall-rules" network = google_compute_network.dataproc_network.name // Allow ping allow { protocol = "icmp" } //Allow all TCP ports allow { protocol = "tcp" ports = ["1-65535"] } //Allow all UDP ports allow { protocol = "udp" ports = ["1-65535"] } source_ranges = ["0.0.0.0/0"] }
Save the sample Terraform code in a file named main.tf.
Navigate to the folder that contains your main.tf file and run the command to initialize Terraform:
terraform init
Run the plan command to preview the changes:
terraform plan
Run the apply command to apply the changes and create the cluster:
terraform apply
To destroy the cluster, run the destroy command:
terraform destroy
Tip
Terraform maintains the state of all resources it created in a file named terraform.tfstate. When run multiple times, Terraform compares against the state file and applies only the updates that are needed. The destroy option deletes all the resources it created and tracked in the state.
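To see which resources Terraform is currently tracking in that state file, you can list them (a quick sketch; the resource addresses shown will match the blocks in main.tf):
terraform state list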
1.8 Creating a Cluster using Python
Problem
You want to automate the creation of a cluster using the Python programming language.
Solution
Google Cloud Dataproc offers Python client libraries for interacting with Dataproc services. Here is code for creating a Dataproc cluster:
from google.cloud import dataproc_v1


def create_dataproc_cluster(project_id, region, cluster_name):
    """Creates a Dataproc cluster."""
    # Point the client at the regional endpoint that matches the cluster region.
    dataproc_cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Create the cluster config.
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }

    operation = dataproc_cluster_client.create_cluster(
        project_id=project_id, region=region, cluster=cluster
    )
    # Wait for the cluster creation to complete.
    result = operation.result()

    print(f"Created Dataproc cluster: {result.cluster_name}")


if __name__ == "__main__":
    project_id = "PROJECT-ID"
    region = "REGION"
    cluster_name = "CLUSTER-NAME"
    create_dataproc_cluster(project_id, region, cluster_name)
Discussion
Running the Python-based SDK requires the google-cloud-dataproc package to be installed. To install it using pip, execute the command:
pip install google-cloud-dataproc
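The client library also needs credentials. When running locally, one option is to create Application Default Credentials with gcloud (a sketch; alternatively, you can point the GOOGLE_APPLICATION_CREDENTIALS environment variable at a service account key file):
gcloud auth application-default login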
Let's walk through the Python code step by step to understand how it creates a Dataproc cluster.
The code first imports the dataproc_v1 module from the google.cloud package. This module provides the Python client library for the Google Cloud Dataproc API.
from google.cloud import dataproc_v1
The create_dataproc_cluster() function takes three arguments: the project ID, the region, and the cluster name.
def create_dataproc_cluster(project_id, region, cluster_name)
The function first creates a client object for the ClusterControllerClient class, pointing it at the regional endpoint that matches the region where the cluster will be created. This class provides methods for creating, managing, and monitoring Dataproc clusters.
dataproc_cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)
The function then creates a cluster configuration object. The cluster configuration object specifies the configuration of the cluster, such as the number of master and worker nodes, the machine types for the nodes, and the software that should be installed on the nodes.
# Create the cluster config.
cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}
The function then calls the create_cluster() method on the client object. The create_cluster() method creates a new Dataproc cluster and returns an operation object. The operation object can be used to track the progress of the cluster creation.
operation = dataproc_cluster_client.create_cluster(
    project_id=project_id, region=region, cluster=cluster
)
The function then waits for the operation to complete and finally prints a message that the cluster has been created.
result = operation.result()
print(f"Created Dataproc cluster: {result.cluster_name}")
The if __name__ == "__main__": block at the end of the code defines the main entry point for the program. When the program is run, this block assigns values to the variables project_id, region, and cluster_name, and then calls the create_dataproc_cluster() function with those values.
if __name__ == "__main__":
    project_id = "project-id"
    region = "us-central1"
    cluster_name = "basic-cluster"
    create_dataproc_cluster(project_id, region, cluster_name)
To run the code, you can save it as a Python file and then run it from the command line. For example, if you save the code as create_dataproc_cluster.py, you can run it by typing the following command into the command line:
python create_dataproc_cluster.py
1.9 Duplicating a Dataproc Cluster
Problem
Users reported an issue in production. You don’t have access to the production environment, so you want to create an exact replica of the production cluster and verify the issue.
Solution
Export the existing cluster configuration to a file:
gcloud dataproc clusters export <source-cluster-name> \
  --region=<region> \
  --destination=prod-cluster-config.yaml
Create a new cluster using the YAML configuration file:
gcloud dataproc clusters import <target-cluster-name> \
  --source=prod-cluster-config.yaml \
  --region=<region>
Discussion
When working with existing clusters, you may want to view cluster details such as worker details, labels, custom configurations, and component gateway URLs.
The gcloud command for viewing an existing cluster configuration:
gcloud dataproc clusters describe <cluster-name-here> --region <region>
Creating a new cluster with the same configuration as an existing cluster is a two-step process. First, export the existing cluster configuration to a file. Dataproc offers a gcloud command option to export the configuration in YAML format. At the time of this writing, configuration export is only supported from gcloud and can't be done from the web UI.
Run a command to export the configuration:
gcloud dataproc clusters export prod-cluster \
  --region=<region> \
  --destination=prod-cluster-config.yaml
Upon successful execution of the command, the cluster configuration will be stored in a file named prod-cluster-config.yaml. The cluster name and region are not included in the export because the name must be unique. When creating a new cluster using this configuration, the cluster name and region must be provided.
Run a command to create a new cluster using the configuration in the YAML file:
gcloud dataproc clusters import prod-cluster-duplicate \
  --source=prod-cluster-config.yaml \
  --region=<region>
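To double-check that the duplicate really matches production, you can export the new cluster's configuration and compare the two files. A minimal sketch, assuming the file names used above:
gcloud dataproc clusters export prod-cluster-duplicate \
  --region=<region> \
  --destination=duplicate-config.yaml
diff prod-cluster-config.yaml duplicate-config.yaml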