Chapter 8. Automating Analysis Execution with Workflows
So far, we’ve been running individual commands manually in the terminal. However, the overwhelming majority of genomics work proper—the secondary analysis, in which we go from raw data to distilled information that we’ll then feed into downstream analysis to ultimately produce biological insight—involves running the same commands in the same order on all the data as it rolls off the sequencer. Look again at the GATK Best Practices workflows described in Chapters 6 and 7; imagine how tedious it would be to do all that manually for every sample. Newcomers to the field often find it beneficial to run through the commands involved step by step on test data, to gain an appreciation of the key steps, their requirements, and their quirks—that’s why we walked you through all that in Chapters 6 and 7—but at the end of the day, you’re going to want to automate all of that as much as possible. Automation not only reduces the amount of tedious manual work you need to do, but also increases the throughput of your analysis and reduces the opportunity for human error.
In this chapter, we cover that shift from individual commands or one-off scripts to repeatable workflows. We show you how to use a workflow management system (Cromwell), and language, WDL, that we selected primarily for their portability. We walk you through writing your first example workflow, executing it, and interpreting Cromwell’s output. Then, we do the same thing with ...
Get Genomics in the Cloud now with the O’Reilly learning platform.
O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.