Chapter 10. Running Single Workflows at Scale with Pipelines API

In Chapter 8, we started running workflows for the first time, working on a custom virtual machine in GCP. However, that single-machine setup didn’t allow us to take advantage of the biggest strength of the cloud: the availability of seemingly endless numbers of machines on demand. So in this chapter, we use a GCP service called the Genomics Pipelines API (PAPI), which functions as a sort of job scheduler for GCP Compute Engine instances, to do exactly that.

First, we try simply changing the Cromwell configuration on our VM to submit job execution to PAPI instead of the local machine. Then, we try out a tool called WDL_Runner that wraps Cromwell and manages submissions to PAPI, which makes it easier to “launch and forget” WDL executions. Both of these options, which we explore in the first half of this chapter, will open the door for us to run full-scale GATK pipelines that we could not have run on our single-VM setup in Chapter 9. Along the way, we also discuss important considerations such as runtime, cost, portability, and overall efficiency of running workflows in the cloud.

Introducing the GCP Genomics Pipelines API Service

The Genomics Pipelines API is a service operated by GCP that makes it easy to dispatch jobs for execution on GCP Compute Engine without having to manage VMs directly. Despite its name, the Genomics Pipelines API is not at all specific to genomics, so it can be used for ...
