Chapter 23. Biological Data Science: Saving Lives with Software

Matt Massie

It’s hard to believe a decade has passed since the MapReduce paper appeared at OSDI’04. It’s also hard to overstate the impact that paper had on the tech industry; the MapReduce paradigm opened distributed programming to nonexperts and enabled large-scale data processing on clusters built using commodity hardware. The open source community responded by creating open source MapReduce-based systems, like Apache Hadoop and Spark, that enabled data scientists and engineers to formulate and solve problems at a scale unimagined before.

While the tech industry was being transformed by MapReduce-based systems, biology was experiencing its own metamorphosis driven by second-generation (or “next-generation”) sequencing technology; see Figure 23-1. Sequencing machines are scientific instruments that read the chemical “letters” (A, C, T, and G) that make up your genome: your complete set of genetic material. To have your genome sequenced when the MapReduce paper was published cost about $20 million and took many months to complete; today, it costs just a few thousand dollars and takes only a few days. While the first human genome took decades to create, in 2014 alone an estimated 228,000 genomes were sequenced worldwide.[152] This estimate implies around 20 petabytes (PB) of sequencing data were generated in 2014 worldwide.

Figure 23-1. Timeline of big data technology and cost of sequencing a genome

The plummeting cost ...

Get Hadoop: The Definitive Guide, 4th Edition now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.