Chapter 3 Data Management

Content contributed by Joe Arnold

In his 2015 keynote address at Datapalooza, National Institutes of Health (NIH) director Francis Collins noted that the cost of full genome sequencing might drop to $1,000 by the end of that year [1].

With such dramatically lower sequencing costs, we are experiencing a tidal wave of data that is catalyzing exciting collaborative efforts to share and manage that data. The Global Alliance for Genomics & Health is working on a framework for sharing genomic and clinical data. Collins also mentioned PCORnet, the National Patient-Centered Clinical Research Network, which is focused on sharing large amounts of health data with the aim of making it faster, easier, and less costly to conduct clinical research. He described PCORnet as “an unprecedented network of networks” that lets you conduct observational trials almost for free. He also said NIH is working on a “data commons,” explaining that “we ought to have a virtual place where people can find data not balkanized, but readily usable.”

To be sure, there is much to be excited about, but these advances also pose new challenges in data management. Next-generation sequencing (NGS) can generate data faster than compute and storage are growing, in both performance and scale. Storing and analyzing genomics data has become a quintessential big data challenge because of the high cost of owning and maintaining adequate compute resources.

This chapter ...
