Chapter 1. Creating a Dataproc Cluster

This chapter provides a basic understanding of the prerequisites for creating a Dataproc cluster, as well as the components that make up a cluster. We will discuss the various options for creating and customizing Dataproc clusters.

Dataproc is a paid Google Cloud service built on top of open source software (OSS) such as Apache Hadoop and Apache Spark, along with other big data technologies like Kafka, JupyterHub, and Solr. As a managed service, Dataproc abstracts away the creation, updating, management, and deletion of all the required cloud services and resources.

Dataproc can run in three different environments:

  1. Dataproc on Google Compute Engine (GCE)

  2. Dataproc on Google Kubernetes Engine (GKE)

  3. Dataproc Serverless

In this chapter, we focus on the first option: running Dataproc on GCE.

Dataproc on GCE High Level Architecture Diagram
Figure 1-1. Dataproc on GCE High Level Architecture Diagram

Before you start using this product, understand the billing and charges. The service has two types of charges: a charge for the software and a charge for the underlying components (Compute Engine, disks, Cloud Storage, network, etc.). Dataproc’s pay-as-you-go model allows you to pay only for the services you use, for the time you use them. For more information on pricing, refer to the documentation at https://cloud.google.com/dataproc/pricing. Dataproc Serverless has a different pricing model that we will discuss in later chapters of this book.

The first step in the process of creating a Dataproc cluster is to secure a Google Cloud account. If you don’t have one already, sign up for a new Google Cloud account at https://cloud.google.com/.

Tip

Google encourages new users to try out GCP products by offering $300 in free credits across multiple services. Learn more about Google’s free tier products at https://cloud.google.com/free.

Along with this book, the product documentation at https://cloud.google.com/dataproc/docs will help you gain knowledge throughout your learning journey. To stay up to date on product releases, monitor https://cloud.google.com/dataproc/docs/release-notes. Google offers a paid support model to help you with any GCP-related issues; more details can be found at https://cloud.google.com/support. If you work in an enterprise, your organization might already have purchased a support plan. To participate in public discussions and ask questions, join the Google Group at cloud-dataproc-discuss@googlegroups.com.

Let’s get started with the key components to install before you begin working with Dataproc.

1.1 Installing Google Cloud CLI

Problem

You want to install Google Cloud CLI on your machine to interact with GCP services using the command line.

Solution

You can download the gcloud CLI software from the Google Cloud SDK downloads page; the installation instructions differ based on the type of machine (Mac/Windows/Linux) you have.

 

Discussion

Tip

Alternatively, you can use the browser-based Cloud Shell, which comes with the Cloud SDK, gcloud, Cloud Code, an online code editor, and other utilities pre-installed, fully authenticated, and up to date. To access Cloud Shell, open https://console.cloud.google.com?cloudshell=true in your browser.

Mac Users

  • Open https://cloud.google.com/sdk/docs/install#mac in your browser.

  • Check that a supported version of Python is installed on your machine. At the time of writing, the minimum requirement is Python 3. Run python3 -V or python -V to see which version you have.

  • If Python is missing, download and install it from https://www.python.org/downloads/macos/.

  • Determine your Mac’s architecture by running the command uname -m.

  • Download the package that matches your Mac’s architecture (x86_64/arm/x86).

  • The downloaded file is a compressed (tar.gz) archive. Extract it with the command:

    tar -xvzf downloaded-file-name

  • Install the CLI with the command ./google-cloud-sdk/install.sh.

  • After successful installation, run the gcloud init command to authenticate with Google Cloud.

    • It will ask you to configure the default user, project, and region/zone to be used.

You have now downloaded and installed the gcloud CLI and set up authentication, and you are ready to run commands on your machine. You can verify the setup as shown below.
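Once gcloud init completes, you can verify the installation and the defaults you configured:

gcloud --version
gcloud config list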

Windows Users
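  • Open https://cloud.google.com/sdk/docs/install#windows in your browser.

  • Download and run the Google Cloud CLI installer, accepting the option to install the bundled Python if a supported version is not already present on your machine.

  • After installation completes, open a new terminal and run gcloud init to authenticate with Google Cloud and configure the default user, project, and region/zone.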

Linux Users

  • Determine your Linux machine’s architecture (32-bit or 64-bit) using the command getconf LONG_BIT.

  • Check the Python version using the command python3 -V or python -V.

    • The minimum required version is Python 3.

    • If you are on a 64-bit machine, Python is bundled with the installer and there is no need to install it manually. Otherwise, download it from https://www.python.org/downloads/.

  • Download the CLI package that matches your Linux machine’s architecture (32/64 bit) from the hyperlink in step #2 of https://cloud.google.com/sdk/docs/install#linux.

  • The downloaded file is a compressed (tar.gz) archive. Extract it with the command:

    tar -xvzf downloaded-file-name

  • Install the CLI with the command ./google-cloud-sdk/install.sh.

  • Restart your shell/terminal.

  • After successful installation, run the gcloud init command to authenticate with Google Cloud.

    • It will ask you to configure the default user, project, and region/zone to be used.

1.2 Granting IAM Privileges to a User

Problem

You want to grant IAM permissions to a user or service account so that it can create a Dataproc cluster.

Solution

Grant the Dataproc Editor IAM role at the project level:

gcloud projects add-iam-policy-binding <PROJECT_ID> \
 --member="user:<EMAIL_ADDRESS>" \
 --role=roles/dataproc.editor

Grant the Service Account User permission on the Dataproc VM service account:

gcloud iam service-accounts add-iam-policy-binding <compute_engine_default_account> \
 --member="user:<EMAIL_ADDRESS>" \
 --role=roles/iam.serviceAccountUser

Discussion

User or service accounts creating the cluster require IAM privileges. If you are new to Google Cloud, IAM (Identity and Access Management) is the service that controls who can access which resource in Google Cloud. Accessing IAM itself also requires an IAM role (Viewer/Editor/Owner). If you own the project, you will have the Owner role for accessing IAM; otherwise, check with your project or platform admin to get the required access.

Resources in Google Cloud are organized hierarchically, with the Organization as the parent, followed by Folders, Projects, and Services & Resources (GCS buckets, Compute Engine instances, Dataproc clusters, BigQuery tables, etc.).

GCP resources hierarchy for IAM policy inheritance
Figure 1-2. GCP resources hierarchy for IAM policy inheritance

IAM policies created at the parent level (within a hierarchy) are inherited by child components. For instance, the Editor role assigned at the project level will be inherited by all services and resources created within that project. Similarly, an Editor role assigned at the Dataproc service level will be inherited by all clusters within that Dataproc service.
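To see which principals currently hold roles directly on your project, you can inspect the project’s IAM policy (the project ID is a placeholder):

gcloud projects get-iam-policy <PROJECT_ID> \
 --flatten="bindings[].members" \
 --format="table(bindings.role, bindings.members)"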

IAM offers basic roles that can be applied at the project level, as shown in Table 1-1.

Table 1-1. Basic IAM roles for Google Cloud Services
IAM Basic Role   Description
Viewer           Grants read-only access.
Editor           All Viewer permissions, plus access to create, modify, and delete resources.
Owner            All Editor permissions, plus additional high-level administrative permissions, such as managing IAM permissions, setting up billing accounts, and deleting projects.

gcloud command for granting Editor access at the project level:

export PROJECT_ID=<PROJECT_ID>
export USER_EMAIL=<USER_EMAIL>
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
 --member="user:${USER_EMAIL}" \
 --role=roles/editor

Granting Editor access at the project level gives editor access to all services in the project. To limit a user to only the Dataproc service, use one of the predefined Dataproc roles shown in Table 1-2.

Table 1-2. Predefined IAM roles for Dataproc Service
IAM Role                 Description
Dataproc Administrator   Grants full control over Dataproc resources.
Dataproc Editor          Grants permission to create and manage clusters and view the underlying resources.
Dataproc Viewer          Grants read-only access to Dataproc resources.
Dataproc Worker          Assigned to Compute Engine machines for performing cluster tasks.

To assign the Dataproc Editor role to a user, run the gcloud command:

export PROJECT_ID=<PROJECT_ID>
export USER_EMAIL=<USER_EMAIL>
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
 --member="user:${USER_EMAIL}" \
 --role=roles/dataproc.editor

Each IAM role is a collection of permissions. Granting the Dataproc Editor role to users gives them permission to create clusters, along with additional privileges to manage and delete clusters. For fine-grained permissions specific to cluster creation, you can create a custom role.

Here is the gcloud command to create a custom role:

export PROJECT_ID=<PROJECT_ID>
gcloud iam roles create custom.dataprocEditor \
 --project=${PROJECT_ID} \
 --title="Custom Dataproc Editor" \
 --description="Custom role for creating and managing Dataproc clusters" \
 --permissions=dataproc.clusters.create
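Before assigning it, you can optionally confirm the custom role was created and inspect its permissions:

gcloud iam roles describe custom.dataprocEditor --project=${PROJECT_ID}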

Now assign this custom role to the user. Note that project-level custom roles are referenced with the projects/PROJECT_ID/roles/ prefix:

export PROJECT_ID=<PROJECT_ID>
export USER_EMAIL=<USER_EMAIL>
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
 --member="user:${USER_EMAIL}" \
 --role="projects/${PROJECT_ID}/roles/custom.dataprocEditor"

Dataproc internally uses two types of service accounts:

  1. Dataproc VM Service account

  2. Dataproc Service Agent Service account

The Dataproc VM Service account is used to create underlying resources like Compute Engine instances and to perform dataplane operations like reading and writing data to Google Cloud Storage (GCS). Dataproc uses the Compute Engine default service account as the Dataproc VM Service account, but this can be customized using the --service-account option.

Users or service accounts creating clusters must be granted the Service Account User role on the Dataproc VM service account:

export SERVICE_ACCOUNT_EMAIL=<SERVICE_ACCOUNT_EMAIL>
gcloud iam service-accounts add-iam-policy-binding <COMPUTE_ENGINE_DEFAULT_SERVICE_ACCOUNT> \
 --member="serviceAccount:${SERVICE_ACCOUNT_EMAIL}" \
 --role=roles/iam.serviceAccountUser
Tip

The project’s Compute Engine default service account can be listed using the following gcloud command:

gcloud iam service-accounts list\
  --filter="displayName:Compute Engine default service account"\
  --project=<PROJECT_ID_HERE>

The Dataproc Service Agent service account is responsible for control plane operations such as creating, updating, and deleting clusters. Dataproc creates this service account automatically, and it cannot be replaced with a custom service account.
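You can confirm the service agent’s bindings by filtering the project’s IAM policy. The service agent email typically follows the pattern service-<project-number>@dataproc-accounts.iam.gserviceaccount.com; treat the pattern as an assumption and verify it in your project:

gcloud projects get-iam-policy <PROJECT_ID> \
 --flatten="bindings[].members" \
 --filter="bindings.members:dataproc-accounts.iam.gserviceaccount.com" \
 --format="table(bindings.role, bindings.members)"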

1.3 Configuring a Network and Firewall rules

Problem

You want to create a new virtual private cloud (VPC) network for hosting virtual machines and attach firewall rules to allow communication between the machines.

Solution

Create a VPC network

gcloud compute networks create <NETWORK_NAME>\
 --subnet-mode auto\
 --description "VPC network hosting dataproc resources"

Attach a firewall rule:

gcloud compute firewall-rules create <FIREWALL_NAME> \
 --network <NETWORK_NAME> \
 --allow [PROTOCOL[:PORT]] \
 --source-ranges <IP_RANGE>

Discussion

Compute Engine instances that are part of a Dataproc cluster must reside within a VPC network to communicate with each other and, when necessary, with external resources. The Dataproc service mandates that all VMs in the cluster be able to communicate with each other using the ICMP, TCP (all ports), and UDP (all ports) protocols.

The default project network typically has subnets created in the range of 10.128.0.0/9. It also includes the ‘default-allow-internal’ firewall rule, permitting communication within this subnet range. Table 1-3 illustrates these requirements. If you are creating a custom network, ensure you establish a rule aligned with Dataproc’s requirements to enable internal communication.

Table 1-3. Firewall rule requirements for Dataproc cluster Network
Direction          ingress
Priority           65534
Source range       10.128.0.0/9 for the default network; for a custom VPC/subnet, use the custom subnet range
Protocols:Ports    tcp:0-65535, udp:0-65535, icmp
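To check which firewall rules already exist on a network (for example, the default-allow-internal rule on the default network), list them with gcloud:

gcloud compute firewall-rules list --filter="network:default"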

To create a VPC network in auto mode

gcloud compute networks create dataproc-vpc\
 --subnet-mode auto\
 --description "VPC network hosting dataproc resources"
Tip

Configuring the subnet mode as ‘auto’ creates one subnet in each available GCP region, allowing you to create a Dataproc cluster in any region.

To prevent the automatic creation of a large number (40+) of subnets in auto mode, you can instead create subnets only in the regions you need. This is a two-step process: first create a VPC, and then add a subnet to it.

Creating a VPC in custom subnet mode creates an empty VPC without any subnets:

gcloud compute networks create dataproc-vpc\
 --subnet-mode custom\
 --description "VPC network hosting dataproc resources"

Creating a subnet in the us-east1 region with the range 10.120.0.0/20:

gcloud compute networks subnets create dataproc-vpc-us-east1-subnet\
 --network=dataproc-vpc\
 --region=us-east1\
 --range=10.120.0.0/20
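To confirm the subnet was created with the expected range, describe it:

gcloud compute networks subnets describe dataproc-vpc-us-east1-subnet \
 --region=us-east1 \
 --format="value(ipCidrRange)"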
Tip

A subnet range determines how many IP addresses the subnet can hold. The range 10.120.0.0/20 spans 4,096 IP addresses (10.120.0.0 through 10.120.15.255). A few of these are reserved, including the network address 10.120.0.0 and the broadcast address 10.120.15.255, so slightly fewer are usable by hosts.

To choose a suitable subnet range for the maximum number of hosts on a Dataproc cluster, you will need to consider the expected number of hosts and allow room for growth.

Resources in a VPC cannot communicate until you create a firewall rule that allows traffic. Attach a firewall rule matching the Dataproc service requirements.

Creating a firewall rule for the custom subnet with the IP range 10.120.0.0/20:

gcloud compute firewall-rules create dataproc-allow-tcp-udp-icmp-all-ports\
 --network dataproc-vpc\
 --allow tcp:0-65535,udp:0-65535,icmp\
 --source-ranges "10.120.0.0/20"
Successful firewall rule creation output
Figure 1-3. Successful firewall rule creation output

See Also

Refer to Google Cloud public documentation to learn more about VPC Network and Firewall rules https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/network

1.4 Creating a Dataproc Cluster from the UI

Problem

You want to create a Dataproc cluster using the web UI.

Solution

The Google Cloud Console is a web-based UI for accessing Google Cloud services. It offers the option to create and manage Dataproc clusters.

Discussion

To create a Dataproc cluster using the web UI, first log in to the Google Cloud Console at https://console.cloud.google.com/, as shown in Figure 1-4.

Sign in to Google Cloud Console
Figure 1-4. Sign in to Google Cloud Console

Once logged in to the Google Cloud Console, you will see the dashboard or home page, as shown in Figure 1-5.

Google Cloud Console Home Page
Figure 1-5. Google Cloud Console Home Page

In the search bar, enter the keyword Dataproc and select the service, as shown in Figure 1-6.

Searching for Dataproc service in console
Figure 1-6. Searching for Dataproc service in console

The Dataproc API is not enabled by default in a project. If it is not yet enabled, the console will prompt you to enable the service; click the Enable button, as shown in Figure 1-7.

Enabling Dataproc API service
Figure 1-7. Enabling Dataproc API service
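Alternatively, the Dataproc API can be enabled from the command line:

gcloud services enable dataproc.googleapis.com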
Tip

If a billing account is not linked to the project, you may be asked to link an existing billing account or create a new one.

Selecting the Dataproc service from the search results takes you to the Dataproc service home page. Click Create Cluster to create a new cluster.

Button to create a Dataproc cluster
Figure 1-8. Button to create a Dataproc cluster

Dataproc clusters can be created on Compute Engine or Google Kubernetes Engine (GKE). Select the Cluster on Compute Engine option, as shown in Figure 1-9.

Console UI showing different options for creating Dataproc Cluster
Figure 1-9. Console UI showing different options for creating Dataproc Cluster

To create a basic cluster with all default values, enter just two values, the cluster name and the region, and click the Create button, as shown in Figure 1-10.

Screenshot showing Cluster name and location region entered
Figure 1-10. Screenshot showing Cluster name and location/region entered

Creating a cluster will take up to 90 seconds. Once it is successfully created, you will see the cluster listed on the Dataproc service home page. Click the cluster name hyperlink to view the details of the cluster.

Dataproc service page listing available clusters
Figure 1-11. Dataproc service page listing available clusters

1.5 Creating Dataproc Cluster using gcloud

Problem

Creating clusters manually from the UI is time-consuming when you need multiple clusters. You want to accelerate development and testing by creating clusters from the command line on your local machine.

Solution

Install the Google Cloud CLI on your machine and run the gcloud dataproc clusters create command.

Command to create a cluster with basic configuration:

gcloud dataproc clusters create basic-cluster --region us-central1

Command to create a cluster with custom configuration (machine types, disks, network, etc.):

gcloud dataproc clusters create basic-cluster\
 --region us-central1\
 --zone ""\
 --image-version 2.0-debian10\
 --master-machine-type n1-standard-4\
 --worker-machine-type n1-standard-8\
 --master-boot-disk-type pd-ssd\
 --master-boot-disk-size 100\
 --worker-boot-disk-type pd-ssd\
 --worker-boot-disk-size 200\
 --num-worker-local-ssds 2\
 --network default\
 --enable-component-gateway

Discussion

To create a Dataproc cluster with gcloud, you provide the cluster name and the region where the cluster needs to be created. The following command creates a Dataproc cluster named basic-cluster in region us-central1.

gcloud dataproc clusters create basic-cluster --region us-central1

The cluster created from the command will use the defaults shown in Table 1-4.

Table 1-4. Default values when creating a cluster with only name and region
Property                    Default Value
Number of Master Nodes      1
Number of Primary Workers   2
Machine Type                Chooses a machine type based on Dataproc internal configuration.
Network                     When no network is specified, the cluster uses the network named "default" that is available in the project.
Zone                        Intelligently picks a zone within the specified region.
Dataproc Version            Defaults to the latest available version.
Disk Type                   Standard Persistent Disk
Disk Size                   1000 GB
Component Gateway           Disabled
Tip

These default values can be changed by Google over time. For the latest defaults, refer to the Google documentation (https://cloud.google.com/sdk/gcloud/reference/dataproc/clusters/create).
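To see the actual values Dataproc applied to a cluster created with defaults, describe it (using the cluster name and region from the example above):

gcloud dataproc clusters describe basic-cluster --region us-central1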

To view the list of clusters available in a region and project:

gcloud dataproc clusters list --project {project_name_here} --region {region_name_here}

To delete a cluster:

gcloud dataproc clusters delete basic-cluster --region=us-central1

 

The minimum required arguments are the cluster name and the region where the cluster will be created. Additional customizations, such as machine types, secondary workers, disk type/size for primary workers and secondary workers, high availability, and component gateway, can also be configured using the gcloud command.

Let’s look at the command to customize a few more components of the cluster:

gcloud dataproc clusters create basic-cluster\
 --region us-central1\
 --zone ""\
 --image-version 2.0-debian10\
 --master-machine-type n1-standard-4\
 --worker-machine-type n1-standard-8\
 --master-boot-disk-type pd-ssd\
 --master-boot-disk-size 100\
 --worker-boot-disk-type pd-ssd\
 --worker-boot-disk-size 200\
 --num-worker-local-ssds 2\
 --network default\
 --enable-component-gateway

In this command, the following components are customized:

region

The region is where your cluster will be created.

Zone

Google Cloud has multiple zones in each region. For example, the us-central1 region has zones us-central1-a, us-central1-b, and us-central1-c. If you do not specify a zone when creating a Dataproc cluster, Dataproc will choose a zone for you in the specified region.

Image-version

A combination of operating system and Hadoop technology stack. Image version 2.0-debian10 comes with the Debian 10 operating system, Apache Hadoop 3.x, and Apache Spark 3.1. It is recommended to explicitly specify the image version when creating Dataproc clusters to ensure consistency in the cluster configuration. Refer to https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-version-clusters for available and supported Dataproc versions.

Master-machine-type

Compute Engine machine type for the master node.

Master-boot-disk-type

Disk type to be attached to the master node. Accepted values are pd-ssd, pd-standard, and pd-balanced.

Master-boot-disk-size

Size of the master boot disk. By default, values are assumed to be in GB. The value 100 refers to a 100 GB disk attached to the master node.

worker-boot-disk-type

Disk type to be attached to the worker nodes. Accepted values are pd-ssd, pd-standard, and pd-balanced.

Tip

A Dataproc cluster consists of VMs provided by the Google Cloud Compute Engine service. Choose a VM machine type that suits your data processing needs. GCP offers a variety of machine types:

  • General purpose (N1, N2, N2D, T2D, T2A, etc.)

  • Cost optimized (E2)

  • Memory optimized (M1)

  • CPU optimized (C2, C2D, C3)

  • Custom machine types that can be created with custom memory and CPU configurations.

For data pipelines, the general-purpose machine types N2D and N2 are popular choices. Workloads requiring high-performance compute use C3 machine types or GPUs. You can browse the available machine types as shown below.
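The following gcloud command lists the machine types available in a zone; the zone and the name filter here are examples, not requirements:

gcloud compute machine-types list \
 --zones=us-central1-a \
 --filter="name~n2d" \
 --format="table(name, guestCpus, memoryMb)"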

worker-boot-disk-size

Size of the worker boot disk. By default, values are assumed to be in GB. The value 200 in the example refers to a 200 GB disk attached to each worker node.

Tip

Dataproc clusters require storage attached to compute nodes for storing persistent or temporary data. Google Cloud Platform offers the following storage options:

  • PD Standard (Persistent Disk Standard)

  • PD SSD (Persistent Disk SSD)

  • Local SSD

Local SSDs are a recommended choice for Dataproc worker nodes, as they offer greater performance than PD Standard and are less expensive than PD SSD.

num-worker-local-ssds

Local SSDs are recommended storage for worker nodes. They offer higher performance than standard disks with a good price to performance ratio.

num-worker

Number of primary worker nodes. You can also add secondary worker nodes that do compute only and no storage (HDFS).

Tip

Primary workers are the only worker machine types that have a DataNode component for storing HDFS data. Based on the amount of HDFS storage needed, choose the number of primary workers. Not all of your data goes to HDFS. In later chapters, we will cover what gets stored in HDFS versus the local file system versus GCS.

network

Virtual network that pools cloud resources together. When you create a project it comes with a default network.

enable-component-gateway

Creates access to web endpoints for services like the YARN Resource Manager, the NameNode web UI, and the Spark History Server.
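Once a cluster is created with the component gateway enabled, the web interface URLs can be read back from the cluster description. The field path below is a sketch based on the cluster resource’s endpointConfig; verify it against your gcloud version:

gcloud dataproc clusters describe basic-cluster \
 --region us-central1 \
 --format="value(config.endpointConfig.httpPorts)"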

We will learn more about these customizations in the next chapters.

1.6 Creating Dataproc Cluster using API Endpoints

Problem

You want to create a cluster using the REST API so that the cluster creation process is platform independent.

Solution

curl -X POST \
 -H "Authorization: Bearer $(gcloud auth print-access-token)" \
 -H "Content-Type: application/json; charset=utf-8" \
 -d @request.json \
 "https://dataproc.googleapis.com/v1/projects/{project_name}/regions/{region-name}/clusters"

Discussion

Create a JSON request file with all the required configuration.

request.json

{
  "projectId": "dataproctest",
  "clusterName": "dataproc-test-cluster",
  "config": {
    "gceClusterConfig": {
      "networkUri": "default",
      "zoneUri": "us-central1-c"
    },
    "masterConfig": {
      "numInstances": 1,
      "machineTypeUri": "n2-standard-4",
      "diskConfig": {
        "bootDiskType": "pd-standard",
        "bootDiskSizeGb": 500,
        "numLocalSsds": 0
      }
    },
    "softwareConfig": {
      "imageVersion": "2.1-debian11"
    },
    "workerConfig": {
      "numInstances": 2,
      "machineTypeUri": "n2-standard-4",
      "diskConfig": {
        "bootDiskType": "pd-standard",
        "bootDiskSizeGb": 500,
        "numLocalSsds": 2,
        "localSsdInterface": "SCSI"
      }
    }
  },
  "labels": {
    "billing_account": "test-account"
  }
}

Making a REST API call to Google Cloud services requires the user to provide an authorization token. Authenticate from the command line prior to executing the curl command.

To authenticate with a personal account, run the command:

gcloud auth login

To authenticate as a service account using a credentials JSON file, run the command:

gcloud auth activate-service-account \
 --key-file=<credential-json-file-location>

Execute the following curl command, replacing project_name and region-name (e.g., us-central1).

curl -X POST\
 -H "Authorization: Bearer $(gcloud auth print-access-token)"\
 -H "Content-Type: application/json; charset=utf-8"\
 -d @request.json\
 "https://dataproc.googleapis.com/v1/projects/{project_name}/regions/{region-name}/clusters"

Successful execution of the curl command will return output similar to the following:

{
  "name": "projects/{project-name}/regions/{region-name}/operations/b5706e31......",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.dataproc.v1.ClusterOperationMetadata",
    "clusterName": "cluster-name",
    "clusterUuid": "5fe882b2-...",
    "status": {
      "state": "PENDING",
      "innerState": "PENDING",
      "stateStartTime": "2019-11-21T00:37:56.220Z"
    },
    "operationType": "CREATE",
    "description": "Create cluster with 2 workers",
    "warnings": [
      "For PD-Standard without local SSDs, we strongly recommend provisioning 1TB ..."
    ]
  }
}
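The response references a long-running operation. You can poll its status with an authenticated GET request on the operation name returned in the response (placeholders as before):

curl -X GET \
 -H "Authorization: Bearer $(gcloud auth print-access-token)" \
 "https://dataproc.googleapis.com/v1/projects/{project_name}/regions/{region-name}/operations/{operation-id}"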

1.7 Creating Dataproc Cluster using Terraform

Problem

You want to automate the provisioning and management of clusters with an Infrastructure as Code (IaC) framework.

Solution

Terraform is an Infrastructure as Code (IaC) tool that allows users to create and maintain cloud infrastructure using a declarative configuration language.

Discussion

Terraform is a widely used IaC tool for creating, maintaining, and managing cloud platform resources. It supports multiple cloud vendors, including AWS, GCP, and Azure.

Install Terraform by following the instructions in the public documentation at https://developer.hashicorp.com/terraform/downloads.

Terraform code execution involves the following commands:

  • init - Initializes the working directory and the state of resources.

  • plan - Runs a preview and shows which changes will be applied on top of the current state of resources.

  • apply - Applies the changes to the resources.

  • destroy - Deletes all the resources.


Following is sample Terraform code to create a basic Dataproc cluster with minimal customization. The code has the following configuration blocks:

  • provider - The provider block configures the interaction between Terraform and Google Cloud Platform. A service account credentials file in JSON format is configured for authentication.

  • google_dataproc_cluster - This resource contains the configuration specific to the Dataproc cluster being created.

  • google_compute_network - Creates the VPC network that hosts the cluster.

  • google_compute_firewall - Creates the firewall rule that allows the cluster VMs to communicate.

provider "google" {
 credentials = file("service-account-credentials-file.json")
 project     = "project-id"
 region      = "us-central1"
}


resource "google_dataproc_cluster" "clusterCreationResource" {
 provider = google
 name     = "basic-cluster"
 region   = "us-central1"

 cluster_config {

    gce_cluster_config {
    network = google_compute_network.dataproc_network.name
  }


   master_config {
     num_instances     = 1
     machine_type      = "n1-standard-4"
   }

   worker_config {
     num_instances     = 2
     machine_type      = "n1-standard-8"

   }

   endpoint_config {
      enable_http_port_access = "true"
    }

 }
}

resource "google_compute_network" "dataproc_network" {
  name                    = "basic-cluster-network"
  auto_create_subnetworks = true
}

resource "google_compute_firewall" "firewall_rules" {
  name    = "basic-cluster-firewall-rules"
  network = google_compute_network.dataproc_network.name

  // Allow ping
  allow {
    protocol = "icmp"
  }
  //Allow all TCP ports
  allow {
    protocol = "tcp"
    ports    = ["1-65535"]
  }
  //Allow all UDP ports
  allow {
    protocol = "udp"
    ports    = ["1-65535"]
  }
  // Restrict the source range to the auto-mode subnet ranges, per the Dataproc requirements above
  source_ranges = ["10.128.0.0/9"]
}

Save the Terraform sample code in a file named main.tf.

Navigate to the folder that has your main.tf file and run the command to initialize Terraform:

terraform init

Run the plan command to preview the changes:

terraform plan

Run the apply command to apply the changes and create the cluster:

terraform apply 

To destroy the cluster, run the destroy command:

terraform destroy
Tip

Terraform maintains the state of all the resources it created in a file named terraform.tfstate. When run multiple times, it compares the configuration with the state file and applies only the updates needed. The destroy command deletes all the resources it created and tracked in the state.

1.8 Creating cluster using Python

Problem

You want to automate the creation of a cluster using the Python programming language.

Solution

Google Cloud Dataproc offers Python client libraries for interacting with Dataproc services. Here is sample code for creating a Dataproc cluster.

from google.cloud import dataproc_v1

def create_dataproc_cluster(project_id, region, cluster_name):
    """Creates a Dataproc cluster."""
    # The client must point to the regional Dataproc endpoint.
    dataproc_cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Create the cluster config.
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }

    operation = dataproc_cluster_client.create_cluster(
        project_id=project_id,
        region=region,
        cluster=cluster
    )

    # Wait for the cluster creation operation to complete.
    result = operation.result()

    print(f"Created Dataproc cluster: {result.cluster_name}")


if __name__ == "__main__":
    project_id = "PROJECT-ID"
    region = "REGION"
    cluster_name = "CLUSTER-NAME"

    create_dataproc_cluster(project_id, region, cluster_name)

Discussion

Running the Python-based SDK requires the google-cloud-dataproc package to be installed. To install google-cloud-dataproc using pip, execute the command:

pip install google-cloud-dataproc

Let’s walk through the Python code step by step to understand how it creates the Dataproc cluster.

The code first imports the dataproc_v1 module from the google.cloud package. This module provides the Python client library for the Google Cloud Dataproc API.

from google.cloud import dataproc_v1

The create_dataproc_cluster() function takes three arguments: the project ID, the region, and the cluster name.

def create_dataproc_cluster(project_id, region, cluster_name)

The function first creates a client object for the ClusterControllerClient class, pointing it at the regional Dataproc endpoint. This class provides methods for creating, managing, and monitoring Dataproc clusters.

dataproc_cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

The function then creates a cluster configuration object. The cluster configuration object specifies the configuration of the cluster, such as the number of master and worker nodes, the machine types for the nodes, and the software that should be installed on the nodes.

# Create the cluster config.
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }

The function then calls the create_cluster() method on the client object. The create_cluster() method creates a new Dataproc cluster and returns an operation object, which can be used to track the progress of the cluster creation.

operation = dataproc_cluster_client.create_cluster(
        project_id=project_id,
        region=region,
        cluster=cluster
    )

The function then waits for the operation to complete and prints a message confirming that the cluster has been created.

result = operation.result()

print(f"Created Dataproc cluster: {result.cluster_name}")

The if __name__ == "__main__": block at the end of the code defines the main entry point for the program. When the program is run, this block is executed first. It assigns values to the variables project_id, region, and cluster_name, and then calls the create_dataproc_cluster() function with these values.

if __name__ == "__main__":
    project_id = "project-id"
    region = "us-central1"
    cluster_name = "basic-cluster"

    create_dataproc_cluster(project_id, region, cluster_name)

To run the code, you can save it as a Python file and then run it from the command line. For example, if you save the code as create_dataproc_cluster.py, you can run it by typing the following command into the command line:

python create_dataproc_cluster.py

1.9 Duplicating a Dataproc Cluster

Problem

Users reported an issue in production. You don’t have access to the production environment, so you want to create an exact replica of the production cluster and verify the issue.

Solution

Export the existing cluster configuration to a file:

gcloud dataproc clusters export <source-cluster-name> \
 --region=<region> \
 --destination prod-cluster-config.yaml

Create a new cluster using the YAML configuration file:

gcloud dataproc clusters import <target-cluster-name> \
 --source prod-cluster-config.yaml \
 --region=<region>

 

Discussion

When working with existing clusters, you may want to view cluster details such as worker details, labels, custom configurations, and component gateway URLs.

gcloud command for viewing an existing cluster’s configuration:

gcloud dataproc clusters describe <cluster-name-here> --region <region>

Creating a new cluster with the same configuration as an existing cluster is a two-step process. First, export the existing cluster configuration to a file. Dataproc offers a gcloud command option to export the configuration in YAML format. At the time of this writing, configuration export is only supported from gcloud and cannot be done from the web UI.

Run a command to export the configuration:

gcloud dataproc clusters export prod-cluster \
 --region=<region> \
 --destination prod-cluster-config.yaml

Upon successful execution of the command, the cluster configuration will be stored in a file named prod-cluster-config.yaml. The cluster name and region are not included in the export because the name must be unique. When creating a new cluster using this configuration, the cluster name and region must be provided.

Run a command to create a new cluster using the configuration in the YAML file:

gcloud dataproc clusters import prod-cluster-duplicate \
 --source prod-cluster-config.yaml \
 --region=<region>
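To verify that the new cluster matches the source, you can export its configuration as well and compare the two files; the filenames here are examples:

gcloud dataproc clusters export prod-cluster-duplicate \
 --region=<region> \
 --destination duplicate-cluster-config.yaml

diff prod-cluster-config.yaml duplicate-cluster-config.yaml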
