IBM Spectrum Scale Best Practices for Genomics Medicine Workloads

Book description

Advancing the science of medicine by targeting a disease more precisely with treatment specific to each patient relies on access to that patient's genomics information and the ability to process massive amounts of genomics data quickly. Genomics data is becoming a critical source for precision medicine, and it is expected to create an expanding data ecosystem.

Therefore, hospitals, genome centers, medical research centers, and other clinical institutes need to explore new methods of storing, accessing, securing, managing, sharing, and analyzing significant amounts of data. Healthcare and life sciences organizations that are running data-intensive genomics workloads on an IT infrastructure that lacks scalability, flexibility, performance, management, and cognitive capabilities also need to modernize and transform their infrastructure to support current and future requirements.

IBM® offers an integrated solution for genomics that is based on composable infrastructure. This solution enables administrators to build an IT environment in a way that disaggregates the underlying compute, storage, and network resources. Such a composable, building-block-based solution for genomics addresses the most complex data management aspects and allows organizations to store, access, manage, and share huge volumes of genome sequencing data.

IBM Spectrum Scale™ is software-defined storage that is used to manage storage and provide massive scale, a global namespace, and high-performance data access with many enterprise features. IBM Spectrum Scale is used in clustered environments, provides unified access to data via file protocols (POSIX, NFS, and SMB) and object protocols (Swift and S3), and supports analytic workloads via HDFS connectors. Deploying IBM Spectrum Scale and IBM Elastic Storage™ Server (IBM ESS) as a composable storage building block in a genomics next-generation sequencing deployment offers key benefits of performance, scalability, analytics, and collaboration via multiple protocols.

This IBM Redpaper™ publication describes a composable solution with detailed architecture definitions for storage, compute, and networking services for genomics next-generation sequencing. These definitions enable solution architects to benefit from tried-and-tested deployments and to quickly plan and design an end-to-end infrastructure deployment. The preferred practices and fully tested recommendations described in this paper are derived from running the GATK Best Practices workflow from the Broad Institute.

The scenarios provide all that is required, including ready-to-use configuration and tuning templates for the different building blocks (compute, network, and storage), which can simplify deployment and increase the level of assurance in the performance of genomics workloads. The solution is designed to be elastic in nature, and the disaggregation of the building blocks allows IT administrators to easily and optimally configure the solution with maximum flexibility.

The intended audience for this paper is technical decision makers, IT architects, deployment engineers, and administrators who work in the healthcare domain on genomics-based workloads.

Table of contents

  1. Front cover
  2. Notices
    1. Trademarks
  3. Preface
    1. Authors
    2. Now you can become a published author, too
    3. Comments welcome
    4. Stay connected to IBM Redbooks
  4. Summary of changes
    1. April 2018, Second Edition
    2. December 2017, First Edition
  5. Chapter 1. The IBM Spectrum Scale Blueprint for Genomics Medicine Workloads
    1. 1.1 Genomics medicine
      1. 1.1.1 Genomics medicine overview
      2. 1.1.2 Genomics workload
    2. 1.2 Solution approach
      1. 1.2.1 Composable infrastructure
      2. 1.2.2 Composable building blocks
      3. 1.2.3 Driven by design thinking
      4. 1.2.4 Driven by agile development
    3. 1.3 Blueprint capabilities
      1. 1.3.1 Capabilities of the compute services
      2. 1.3.2 Capabilities of the storage services
      3. 1.3.3 Capabilities of the private network services
    4. 1.4 Example environment
      1. 1.4.1 Physical configuration
      2. 1.4.2 Logical configuration
  6. Chapter 2. The compute services
    1. 2.1 Overview
      1. 2.1.1 Capabilities and solution elements
      2. 2.1.2 Software levels
    2. 2.2 Application layer
      1. 2.2.1 The Broad Institute GATK
    3. 2.3 Orchestration layer
      1. 2.3.1 IBM Spectrum LSF
    4. 2.4 Data layer
    5. 2.5 General recommendations
      1. 2.5.1 Designation of the compute nodes
      2. 2.5.2 IBM Spectrum Scale node roles
      3. 2.5.3 IBM Spectrum LSF host types
      4. 2.5.4 IBM Spectrum LSF add-ons
      5. 2.5.5 External dependencies
      6. 2.5.6 Communication and security aspects
    6. 2.6 Tuning
      1. 2.6.1 Operating system
      2. 2.6.2 Network
      3. 2.6.3 IBM Spectrum Scale
    7. 2.7 Monitoring
  7. Chapter 3. The storage services
    1. 3.1 Overview
      1. 3.1.1 Capabilities and solution elements
      2. 3.1.2 Software levels
    2. 3.2 File storage layer
      1. 3.2.1 IBM Spectrum Scale file systems
      2. 3.2.2 Recommendations for genomics medicine workloads
      3. 3.2.3 IBM Spectrum Scale filesets
    3. 3.3 Block storage layer
      1. 3.3.1 IBM Elastic Storage Server
    4. 3.4 File access layer
      1. 3.4.1 NFS and SMB
    5. 3.5 General recommendations
      1. 3.5.1 Recommendations for IBM Spectrum Scale
      2. 3.5.2 External dependencies
      3. 3.5.3 Communication and security aspects
    6. 3.6 Data management
    7. 3.7 Tuning
      1. 3.7.1 IBM Elastic Storage Server
      2. 3.7.2 Protocol nodes
    8. 3.8 Monitoring
  8. Chapter 4. The private network services
    1. 4.1 Overview
      1. 4.1.1 Capabilities and solution elements
      2. 4.1.2 Shared network
    2. 4.2 High-speed data network
      1. 4.2.1 IBM Spectrum Scale network requirements
      2. 4.2.2 Recommendations for genomics medicine workloads
      3. 4.2.3 Miscellaneous comments
    3. 4.3 Management networks
      1. 4.3.1 Provisioning network
      2. 4.3.2 Service network
    4. 4.4 Network designs
      1. 4.4.1 Network design for small configuration
      2. 4.4.2 Network design for large configuration
  9. Appendix A. Profiling GATK
  10. Related publications
    1. Online resources
    2. Help from IBM
  11. Back cover

Product information

  • Title: IBM Spectrum Scale Best Practices for Genomics Medicine Workloads
  • Author(s): Joanna Wong, Kevin Gildea, Kumaran Rajaram, Luis Bolinches, Monica Lemay, Piyush Chaudhary, Sandeep R. Patil, Ulf Troppens
  • Release date: April 2018
  • Publisher(s): IBM Redbooks
  • ISBN: 9780738456751