Appendix C HPC Working Example
Using Parallelization Programs, such as GNU Parallel and OpenMP, with Serial Tools
Overview
This document provides examples and methods for applying parallel logic to process multiple data sets across multiple cores on one or more servers. It does not cover Message Passing Interface (MPI) batch processing, in which multiple nodes share a single processing job.
Basic knowledge of shell scripting is helpful but not required.
Beginners may find this page useful: http://linuxcommand.org/lc3_wss0020.php.
Key terms used in these scripts:
- Arrays
- Variables
- Arguments
- Functions
- For loops
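As a quick orientation, the following minimal bash sketch shows all five terms in one place. The file names, the `suffix` variable, and the `output_name` function are hypothetical and chosen only for illustration; no files are read or written.

```bash
#!/usr/bin/env bash
# Array: a list of (hypothetical) input file names.
samples=(sample01.bam sample02.bam sample03.bam)

# Variable: a single value reused below.
suffix=".sorted.bam"

# Function: takes one argument ($1) and prints a derived output name.
output_name() {
    local input="$1"
    # Strip the trailing ".bam" and append the suffix.
    echo "${input%.bam}${suffix}"
}

# For loop: call the function once per array element.
for bam in "${samples[@]}"; do
    echo "$bam -> $(output_name "$bam")"
done
```

The loop runs the function serially, one element at a time. This is the pattern that the parallel tools listed in this appendix are meant to speed up.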
Linux tools:
- bash
- sed
- split
- xargs
- vi
- GNU Parallel (http://www.gnu.org/software/parallel/)
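To illustrate how these tools can fan a serial command out over several cores, here is a small sketch that uses `xargs -P`. The input names are hypothetical, and the `sh -c 'echo ...'` command is only a stand-in for a real serial tool; it is not taken from the scripts in this appendix.

```bash
#!/usr/bin/env bash
# Hypothetical input names; nothing is read from disk.
inputs=(a.bam b.bam c.bam d.bam)

# Feed one name per line to xargs.
#   -P 2 : run up to two jobs at a time
#   -n 1 : pass one name to each job
# The 'sh -c' command is a placeholder for a real serial tool.
results=$(printf '%s\n' "${inputs[@]}" \
    | xargs -P 2 -n 1 sh -c 'echo "processed $1"' _ \
    | sort)   # completion order varies, so sort for stable output

echo "$results"

# Roughly equivalent with GNU Parallel, if it is installed:
#   parallel -j 2 echo processed {} ::: "${inputs[@]}"
```

Because jobs finish in no guaranteed order, any step that depends on ordering should sort or otherwise reconcile the results afterward, as the sketch does.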
Next-generation sequencing (NGS) tools used:
- Picard-tools (http://broadinstitute.github.io/picard/)
  - SamToFastq
  - AddOrReplaceReadGroups
  - SortSam
- Samtools
- Burrows-Wheeler Aligner (http://sourceforge.net/projects/bio-bwa/)
- Plink (http://pngu.mgh.harvard.edu/~purcell/plink/)
HPC Resource Manager:
BIO HPC Use Case 1
Biologist AF receives 24 Binary Alignment Map (BAM) files from a third-party lab. AF uses the samtools program to index these BAM files, but the resulting indexes are corrupt and unusable. AF contacts the lab and discovers that the files were not correctly processed (aligned, sorted, read groups added, and so on).
Solution:
Process BAM data using NGS tools. ...