Running the application on GCP Dataproc

This section walks through running the Apex application on a real Hadoop cluster in the cloud. Dataproc (https://cloud.google.com/dataproc/) is one of several options; Amazon EMR is another, and the instructions here can easily be adapted to it.

The general instructions for working on a cluster were already covered in Chapter 2, Getting Started with Application Development, where a Docker container was used. This section focuses on what is different when adding Apex to an existing multi-node cluster.

To start, we head over to the GCP console (https://console.cloud.google.com/dataproc/clusters) to create a new cluster.
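The cluster can also be created from a script instead of the console. The following is a minimal sketch, not the procedure used in this chapter: it only assembles and prints a `gcloud dataproc clusters create` command. The cluster name `apex-demo`, the region `us-central1`, the worker count of 2, and the helper `build_create_cluster_command` are all hypothetical values chosen for illustration.

```python
import shlex
from typing import List


def build_create_cluster_command(cluster_name: str, region: str, num_workers: int) -> List[str]:
    """Assemble the argument list for creating a Dataproc cluster with gcloud.

    All argument values are supplied by the caller; nothing here is
    prescribed by the chapter.
    """
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        f"--num-workers={num_workers}",
    ]


# Hypothetical example values (assumptions, not taken from the text).
command = build_create_cluster_command("apex-demo", "us-central1", 2)

# Print the command for review rather than executing it.
print(shlex.join(command))
```

Once the gcloud CLI is installed and authenticated against a project, the same list could be passed to `subprocess.run(command, check=True)` to actually create the cluster.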

For better illustration we ...
