Book description
Data in the genomics field is booming. In just a few years, organizations such as the National Institutes of Health (NIH) will host 50+ petabytes (over 50 million gigabytes) of genomic data, and they're turning to cloud infrastructure to make that data available to the research community. How do you adapt analysis tools and protocols to access and analyze that volume of data in the cloud?
With this practical book, researchers will learn how to work with genomics algorithms using open source tools including the Genome Analysis Toolkit (GATK), Docker, WDL, and Terra. Geraldine Van der Auwera, longtime custodian of the GATK user community, and Brian O'Connor of the UC Santa Cruz Genomics Institute, guide you through the process. You'll learn by working with real data and genomics algorithms from the field.
This book covers:
- Essential genomics and computing technology background
- Basic cloud computing operations
- Getting started with GATK, plus three major GATK Best Practices pipelines
- Automating analysis with scripted workflows using WDL and Cromwell
- Scaling up workflow execution in the cloud, including parallelization and cost optimization
- Interactive analysis in the cloud using Jupyter notebooks
- Secure collaboration and computational reproducibility using Terra
Table of contents
- Foreword
- Preface
- 1. Introduction
- 2. Genomics in a Nutshell: A Primer for Newcomers to the Field
- 3. Computing Technology Basics for Life Scientists
- 4. First Steps in the Cloud
- 5. First Steps with GATK
- 6. GATK Best Practices for Germline Short Variant Discovery
  - Data Preprocessing
  - Joint Discovery Analysis
    - Overview of the Joint Calling Workflow
    - Calling Variants per Sample to Generate GVCFs
    - Consolidating GVCFs
    - Applying Joint Genotyping to Multiple Samples
    - Filtering the Joint Callset with Variant Quality Score Recalibration
    - Refining Genotype Assignments and Adjusting Genotype Confidence
    - Next Steps and Further Reading
  - Single-Sample Calling with CNN Filtering
  - Wrap-Up and Next Steps
- 7. GATK Best Practices for Somatic Variant Discovery
- 8. Automating Analysis Execution with Workflows
- 9. Deciphering Real Genomics Workflows
- 10. Running Single Workflows at Scale with Pipelines API
- 11. Running Many Workflows Conveniently in Terra
- 12. Interactive Analysis in Jupyter Notebook
  - Introduction to Jupyter in Terra
  - Getting Started with Jupyter in Terra
    - Inspecting and Customizing the Notebook Runtime Configuration
    - Opening Notebook in Edit Mode and Checking the Kernel
    - Running the Hello World Cells
    - Using gsutil to Interact with Google Cloud Storage Buckets
    - Setting Up a Variable Pointing to the Germline Data in the Book Bucket
    - Setting Up a Sandbox and Saving Output Files to the Workspace Bucket
  - Visualizing Genomic Data in an Embedded IGV Window
  - Running GATK Commands to Learn, Test, or Troubleshoot
  - Visualizing Variant Context Annotation Data
  - Wrap-Up and Next Steps
- 13. Assembling Your Own Workspace in Terra
  - Managing Data Inside and Outside of Workspaces
  - Re-Creating the Tutorial Workspace from Base Components
    - Creating a New Workspace
    - Adding the Workflow to the Methods Repository and Importing It into the Workspace
    - Creating a Configuration Quickly with a JSON File
    - Adding the Data Table
    - Filling in the Workspace Resource Data Table
    - Creating a Workflow Configuration That Uses the Data Tables
    - Adding the Notebook and Checking the Runtime Environment
    - Documenting Your Workspace and Sharing It
  - Starting from a GATK Best Practices Workspace
    - Cloning a GATK Best Practices Workspace
    - Examining GATK Workspace Data Tables to Understand How the Data Is Structured
    - Getting to Know the 1000 Genomes High Coverage Dataset
    - Copying Data Tables from the 1000 Genomes Workspace
    - Using TSV Load Files to Import Data from the 1000 Genomes Workspace
    - Running a Joint-Calling Analysis on the Federated Dataset
  - Building a Workspace Around a Dataset
  - Wrap-Up and Next Steps
- 14. Making a Fully Reproducible Paper
- Glossary
- Index
Product information
- Title: Genomics in the Cloud
- Author(s): Geraldine Van der Auwera, Brian O'Connor
- Release date: April 2020
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781491975190