Appendix C. HPC Working Example

Using Parallelization Programs, such as GNU Parallel and OpenMP, with Serial Tools

Overview

This document provides examples and methods for writing and using parallel logic to process multiple data sets across multiple cores on one or more servers. It does not cover Message Passing Interface (MPI) batch processing, in which multiple nodes share a single processing job.

Basic knowledge of shell scripting is helpful but not absolutely necessary.

This page is helpful for beginners: http://linuxcommand.org/lc3_wss0020.php.

Key terms used in these scripts:

  • Arrays
  • Variables
  • Arguments
  • Functions
  • For loops
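As a rough illustration of how these five terms fit together, the following bash sketch runs a placeholder function over a small list of files, a few at a time. It is not taken from the scripts in this appendix: the file names, the max_jobs value, and the process_sample function are all hypothetical, and a real script would call a serial tool where the echo appears.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; the file names and process_sample are hypothetical.

# Variable: maximum number of jobs to run at once (assumed value).
max_jobs=2

# Array: the data sets to process.
samples=(sample01.bam sample02.bam sample03.bam)

# Function: receives the file name as its first argument ($1).
process_sample() {
    local input="$1"
    # A real script would call a serial tool here.
    echo "processed ${input}"
}

# For loop: start each job in the background, pausing after every
# batch of max_jobs so no more than that many run at the same time.
count=0
for sample in "${samples[@]}"; do
    process_sample "${sample}" &
    count=$((count + 1))
    if (( count % max_jobs == 0 )); then
        wait
    fi
done
wait  # Wait for any jobs left over from the last, partial batch.
```

Because the jobs run concurrently, the order of the output lines can differ from one run to the next. Batching with wait is only one simple way to limit concurrency, chosen here because it relies on nothing beyond bash itself.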

Linux tools:

Next-generation sequencing (NGS) tools used:

HPC Resource Manager:

BIO HPC Use Case 1

Biologist AF receives 24 Binary Alignment Map (BAM) files from a third-party lab. AF uses the samtools program to index these BAM files, but the index is corrupt and unusable. AF contacts the lab and discovers that the files received were not correctly processed (aligned, sorted, read groups added, etc.).

Solution:

Process BAM data using NGS tools. ...
