Strategies in Biomedical Data Science by Ken Buetow, Jay A. Etchings

Appendix C HPC Working Example

Using Parallelization Programs, such as GNU Parallel and OpenMP, with Serial Tools

Overview

This document provides several examples of, and methods for, using parallel logic to process multiple data sets across multiple cores on one or more servers. It does not cover Message Passing Interface (MPI) batch processing, in which multiple nodes share a single processing job.
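As a minimal sketch of the idea of spreading independent data sets over several cores on one server, the script below uses plain bash background jobs and `wait` rather than GNU Parallel or OpenMP. The function `process_one`, the data set names, and the `max_jobs` limit are all hypothetical stand-ins chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: run one serial task per data set, several at a time, on one server.

# Hypothetical stand-in for a serial tool; it only reports what it was given.
process_one() {
    local dataset="$1"
    echo "processed ${dataset}"
}

# Hypothetical data set names, used purely for illustration.
datasets=(sample_01 sample_02 sample_03 sample_04 sample_05 sample_06)

max_jobs=3                 # assumed limit; tune to the number of cores available
workdir="$(mktemp -d)"     # each job writes its own output file here

running=0
for dataset in "${datasets[@]}"; do
    # The trailing & starts the task in the background so the loop continues.
    process_one "$dataset" > "${workdir}/${dataset}.out" &
    running=$((running + 1))
    # Once max_jobs tasks are in flight, wait for the whole batch to finish.
    if (( running >= max_jobs )); then
        wait
        running=0
    fi
done
wait                       # collect any jobs left in a final partial batch

# Count the result lines written by all jobs.
completed=$(( $(cat "${workdir}"/*.out | wc -l) ))
echo "completed ${completed} of ${#datasets[@]} data sets"
```

Batching with `wait` is simple but leaves cores idle while the slowest job in each batch finishes. A dedicated tool such as GNU Parallel keeps a steady number of jobs running instead.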

Basic knowledge of shell scripting is helpful but not absolutely necessary.

This page is helpful for beginners: http://linuxcommand.org/lc3_wss0020.php.

Key terms used in these scripts:

  • Arrays
  • Variables
  • Arguments
  • Functions
  • For loops
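For readers new to shell scripting, the short sketch below shows each of these terms in one place. The sample names and the `index_name` function are hypothetical and exist only to illustrate the syntax. The `.bai` suffix follows the usual naming of BAM index files.

```shell
#!/usr/bin/env bash
# Illustrates variables, arrays, arguments, functions, and for loops.

# Function: takes one argument (a file name) and prints the derived index name.
index_name() {
    local bam_file="$1"    # "$1" is the first argument passed to the function
    echo "${bam_file}.bai"
}

# Variable: a single value, referenced later as "$suffix".
suffix="bam"

# Array: hypothetical sample names, used only for illustration.
samples=(sampleA sampleB sampleC)

# For loop: visit every array element in turn and call the function on it.
for sample in "${samples[@]}"; do
    index_name "${sample}.${suffix}"
done
```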

Linux tools:

Next-generation sequencing (NGS) tools used:

HPC Resource Manager:

BIO HPC Use Case 1

Biologist AF receives 24 Binary Alignment Map (BAM) files from a third-party lab. AF uses the samtools program to index these BAM files, but the resulting indexes are corrupt and unusable. AF contacts the lab and discovers that the files received were not correctly processed (aligned, sorted, read groups added, etc.).

Solution:

Process BAM data using NGS tools. ...
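To make the shape of such a job concrete, here is a hedged dry-run sketch that only prints one command line per BAM file instead of running anything. It assumes, purely for illustration, that each file needs `samtools sort` followed by `samtools index`. The file names and the `plan_for` helper are hypothetical, and the real pipeline may need further steps such as alignment or adding read groups.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print, rather than execute, the per-file commands.

# Build the command line for one BAM file (assumed steps: sort, then index).
plan_for() {
    local input="$1"
    local sorted="${input%.bam}.sorted.bam"   # strip .bam, append .sorted.bam
    echo "samtools sort -o ${sorted} ${input} && samtools index ${sorted}"
}

# 24 hypothetical input names: sample_01.bam through sample_24.bam.
bam_files=()
for i in {1..24}; do
    printf -v name 'sample_%02d.bam' "$i"
    bam_files+=("$name")
done

# Emit one independent command line per file.
for bam in "${bam_files[@]}"; do
    plan_for "$bam"
done
```

Because each printed line is independent of the others, the output could be saved to a file and handed to a job runner, for example with `parallel < commands.txt` where GNU Parallel is installed.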
