Genomics in the Cloud

by Geraldine A. Van der Auwera, Brian D. O'Connor

April 2020

Beginner to intermediate

493 pages

15h 34m

English

O'Reilly Media, Inc.

Foreword
Preface
    Purpose, Scope, and Intended Audience of This Book
    What You Will Learn from This Book
    What Computational Experience Is Needed for the Exercises?
    Conventions Used in This Book
    Using Code Examples
    O’Reilly Online Learning
    How to Contact Us
    Acknowledgments
1. Introduction
    The Promises and Challenges of Big Data in Biology and Life Sciences
    Infrastructure Challenges
    Toward a Cloud-Based Ecosystem for Data Sharing and Analysis
    Cloud-Hosted Data and Compute
    Platforms for Research in the Life Sciences
    Standardization and Reuse of Infrastructure
    Being FAIR
    Wrap-Up and Next Steps
2. Genomics in a Nutshell: A Primer for Newcomers to the Field
    Introduction to Genomics
    The Gene as a Discrete Unit of Inheritance (Sort Of)
    The Central Dogma of Biology: DNA to RNA to Protein
    The Origins and Consequences of DNA Mutations
    Genomics as an Inventory of Variation in and Among Genomes
    The Challenge of Genomic Scale, by the Numbers
    Genomic Variation
    The Reference Genome as Common Framework
    Physical Classification of Variants
    Germline Variants Versus Somatic Alterations
    High-Throughput Sequencing Data Generation
    From Biological Sample to Huge Pile of Read Data
    Types of DNA Libraries: Choosing the Right Experimental Design
    Data Processing and Analysis
    Mapping Reads to the Reference Genome
    Variant Calling
    Data Quality and Sources of Error
    Functional Equivalence Pipeline Specification
    Wrap-Up and Next Steps
3. Computing Technology Basics for Life Scientists
    Basic Infrastructure Components and Performance Bottlenecks
    Types of Processor Hardware: CPU, GPU, TPU, FPGA, OMG
    Levels of Compute Organization: Core, Node, Cluster, and Cloud
    Addressing Performance Bottlenecks
    Parallel Computing
    Parallelizing a Simple Analysis
    From Cores to Clusters and Clouds: Many Levels of Parallelism
    Trade-Offs of Parallelism: Speed, Efficiency, and Cost
    Pipelining for Parallelization and Automation
    Workflow Languages
    Popular Pipelining Languages for Genomics
    Workflow Management Systems
    Virtualization and the Cloud
    VMs and Containers
    Introducing the Cloud
    Categories of Research Use Cases for Cloud Services
    Wrap-Up and Next Steps
4. First Steps in the Cloud
    Setting Up Your Google Cloud Account and First Project
    Creating a Project
    Checking Your Billing Account and Activating Free Credits
    Running Basic Commands in Google Cloud Shell
    Logging in to the Cloud Shell VM
    Using gsutil to Access and Manage Files
    Pulling a Docker Image and Spinning Up the Container
    Mounting a Volume to Access the Filesystem from Within the Container
    Setting Up Your Own Custom VM
    Creating and Configuring Your VM Instance
    Logging into Your VM by Using SSH
    Checking Your Authentication
    Copying the Book Materials to Your VM
    Installing Docker on Your VM
    Setting Up the GATK Container Image
    Stopping Your VM…to Stop It from Costing You Money
    Configuring IGV to Read Data from GCS Buckets
    Wrap-Up and Next Steps
5. First Steps with GATK
    Getting Started with GATK
    Operating Requirements
    Command-Line Syntax
    Multithreading with Spark
    Running GATK in Practice
    Getting Started with Variant Discovery
    Calling Germline SNPs and Indels with HaplotypeCaller
    Filtering Based on Variant Context Annotations
    Introducing the GATK Best Practices
    Best Practices Workflows Covered in This Book
    Other Major Use Cases
    Wrap-Up and Next Steps
6. GATK Best Practices for Germline Short Variant Discovery
    Data Preprocessing
    Mapping Reads to the Genome Reference
    Marking Duplicates
    Recalibrating Base Quality Scores
    Joint Discovery Analysis
    Overview of the Joint Calling Workflow
    Calling Variants per Sample to Generate GVCFs
    Consolidating GVCFs
    Applying Joint Genotyping to Multiple Samples
    Filtering the Joint Callset with Variant Quality Score Recalibration
    Refining Genotype Assignments and Adjusting Genotype Confidence
    Next Steps and Further Reading
    Single-Sample Calling with CNN Filtering
    Overview of the CNN Single-Sample Workflow
    Applying 1D CNN to Filter a Single-Sample WGS Callset
    Applying 2D CNN to Include Read Data in the Modeling
    Wrap-Up and Next Steps
7. GATK Best Practices for Somatic Variant Discovery
    Challenges in Cancer Genomics
    Somatic Short Variants (SNVs and Indels)
    Overview of the Tumor-Normal Pair Analysis Workflow
    Creating a Mutect2 PoN
    Running Mutect2 on the Tumor-Normal Pair
    Estimating Cross-Sample Contamination
    Filtering Mutect2 Calls
    Annotating Predicted Functional Effects with Funcotator
    Somatic Copy-Number Alterations
    Overview of the Tumor-Only Analysis Workflow
    Creating a Somatic CNA PoN
    Applying Denoising
    Performing Segmentation and Call CNAs
    Additional Analysis Options
    Wrap-Up and Next Steps
8. Automating Analysis Execution with Workflows
    Introducing WDL and Cromwell
    Installing and Setting Up Cromwell
    Your First WDL: Hello World
    Learning Basic WDL Syntax Through a Minimalist Example
    Running a Simple WDL with Cromwell on Your Google VM
    Interpreting the Important Parts of Cromwell’s Logging Output
    Adding a Variable and Providing Inputs via JSON
    Adding Another Task to Make It a Proper Workflow
    Your First GATK Workflow: Hello HaplotypeCaller
    Exploring the WDL
    Generating the Inputs JSON
    Running the Workflow
    Breaking the Workflow to Test Syntax Validation and Error Messaging
    Introducing Scatter-Gather Parallelism
    Exploring the WDL
    Generating a Graph Diagram for Visualization
    Wrap-Up and Next Steps
9. Deciphering Real Genomics Workflows
    Mystery Workflow #1: Flexibility Through Conditionals
    Mapping Out the Workflow
    Reverse Engineering the Conditional Switch
    Mystery Workflow #2: Modularity and Code Reuse
    Mapping Out the Workflow
    Unpacking the Nesting Dolls
    Wrap-Up and Next Steps
10. Running Single Workflows at Scale with Pipelines API
    Introducing the GCP Genomics Pipelines API Service
    Enabling Genomics API and Related APIs in Your Google Cloud Project
    Directly Dispatching Cromwell Jobs to PAPI
    Configuring Cromwell to Communicate with PAPI
    Running Scattered HaplotypeCaller via PAPI
    Monitoring Workflow Execution on Google Compute Engine
    Understanding and Optimizing Workflow Efficiency
    Granularity of Operations
    Balance of Time Versus Money
    Suggested Cost-Saving Optimizations
    Platform-Specific Optimization Versus Portability
    Wrapping Cromwell and PAPI Execution with WDL Runner
    Setting Up WDL Runner
    Running the Scattered HaplotypeCaller Workflow with WDL Runner
    Monitoring WDL Runner Execution
    Wrap-Up and Next Steps
11. Running Many Workflows Conveniently in Terra
    Getting Started with Terra
    Creating an Account
    Creating a Billing Project
    Cloning the Preconfigured Workspace
    Running Workflows with the Cromwell Server in Terra
    Running a Workflow on a Single Sample
    Running a Workflow on Multiple Samples in a Data Table
    Monitoring Workflow Execution
    Locating Workflow Outputs in the Data Table
    Running the Same Workflow Again to Demonstrate Call Caching
    Running a Real GATK Best Practices Pipeline at Full Scale
    Finding and Cloning the GATK Best Practices Workspace for Germline Short Variant Discovery
    Examining the Preloaded Data
    Selecting Data and Configuring the Full-Scale Workflow
    Launching the Full-Scale Workflow and Monitoring Execution
    Options for Downloading Output Data—or Not
    Wrap-Up and Next Steps
12. Interactive Analysis in Jupyter Notebook
    Introduction to Jupyter in Terra
    Jupyter Notebooks in General
    How Jupyter Notebooks Work in Terra
    Getting Started with Jupyter in Terra
    Inspecting and Customizing the Notebook Runtime Configuration
    Opening Notebook in Edit Mode and Checking the Kernel
    Running the Hello World Cells
    Using gsutil to Interact with Google Cloud Storage Buckets
    Setting Up a Variable Pointing to the Germline Data in the Book Bucket
    Setting Up a Sandbox and Saving Output Files to the Workspace Bucket
    Visualizing Genomic Data in an Embedded IGV Window
    Setting Up the Embedded IGV Browser
    Adding Data to the IGV Browser
    Setting Up an Access Token to View Private Data
    Running GATK Commands to Learn, Test, or Troubleshoot
    Running a Basic GATK Command: HaplotypeCaller
    Loading the Data (BAM and VCF) into IGV
    Troubleshooting a Questionable Variant Call in the Embedded IGV Browser
    Visualizing Variant Context Annotation Data
    Exporting Annotations of Interest with VariantsToTable
    Loading R Script to Make Plotting Functions Available
    Making Density Plots for QUAL by Using makeDensityPlot
    Making a Scatter Plot of QUAL Versus DP
    Making a Scatter Plot Flanked by Marginal Density Plots
    Wrap-Up and Next Steps
13. Assembling Your Own Workspace in Terra
    Managing Data Inside and Outside of Workspaces
    The Workspace Bucket as Data Repository
    Accessing Private Data That You Manage Outside of Terra
    Accessing Data in the Terra Data Library
    Re-Creating the Tutorial Workspace from Base Components
    Creating a New Workspace
    Adding the Workflow to the Methods Repository and Importing It into the Workspace
    Creating a Configuration Quickly with a JSON File
    Adding the Data Table
    Filling in the Workspace Resource Data Table
    Creating a Workflow Configuration That Uses the Data Tables
    Adding the Notebook and Checking the Runtime Environment
    Documenting Your Workspace and Sharing It
    Starting from a GATK Best Practices Workspace
    Cloning a GATK Best Practices Workspace
    Examining GATK Workspace Data Tables to Understand How the Data Is Structured
    Getting to Know the 1000 Genomes High Coverage Dataset
    Copying Data Tables from the 1000 Genomes Workspace
    Using TSV Load Files to Import Data from the 1000 Genomes Workspace
    Running a Joint-Calling Analysis on the Federated Dataset
    Building a Workspace Around a Dataset
    Cloning the 1000 Genomes Data Workspace
    Importing a Workflow from Dockstore
    Configuring the Workflow to Use the Data Tables
    Wrap-Up and Next Steps
14. Making a Fully Reproducible Paper
    Overview of the Case Study
    Computational Reproducibility and the FAIR Framework
    Original Research Study and History of the Case Study
    Assessing the Available Information and Key Challenges
    Designing a Reproducible Implementation
    Generating a Synthetic Dataset as a Stand-In for the Private Data
    Overall Methodology
    Retrieving the Variant Data from 1000 Genomes Participants
    Creating Fake Exomes Based on Real People
    Mutating the Fake Exomes
    Generating the Definitive Dataset
    Re-Creating the Data Processing and Analysis Methodology
    Mapping and Variant Discovery
    Variant Effect Prediction, Prioritization, and Variant Load Analysis
    Analytical Performance of the New Implementation
    The Long, Winding Road to FAIRness
    Final Conclusions
Glossary
Index

Overview

Data in the genomics field is booming. In just a few years, organizations such as the National Institutes of Health (NIH) will host 50+ petabytes (over 50 million gigabytes) of genomic data, and they're turning to cloud infrastructure to make that data available to the research community. How do you adapt analysis tools and protocols to access and analyze that volume of data in the cloud?

With this practical book, researchers will learn how to work with genomics algorithms using open source tools including the Genome Analysis Toolkit (GATK), Docker, WDL, and Terra. Geraldine Van der Auwera, longtime custodian of the GATK user community, and Brian O'Connor of the UC Santa Cruz Genomics Institute guide you through the process. You'll learn by working with real data and genomics algorithms from the field.

This book covers:

Essential genomics and computing technology background
Basic cloud computing operations
Getting started with GATK, plus three major GATK Best Practices pipelines
Automating analysis with scripted workflows using WDL and Cromwell
Scaling up workflow execution in the cloud, including parallelization and cost optimization
Interactive analysis in the cloud using Jupyter notebooks
Secure collaboration and computational reproducibility using Terra

ISBN: 9781491975183
