Skip to Content
Hadoop: The Definitive Guide, 4th Edition
book

Hadoop: The Definitive Guide, 4th Edition

by Tom White
April 2015
Beginner to intermediate
754 pages
21h 39m
English
O'Reilly Media, Inc.
Content preview from Hadoop: The Definitive Guide, 4th Edition

Chapter 23. Biological Data Science: Saving Lives with Software

Matt Massie

It’s hard to believe a decade has passed since the MapReduce paper appeared at OSDI’04. It’s also hard to overstate the impact that paper had on the tech industry; the MapReduce paradigm opened distributed programming to nonexperts and enabled large-scale data processing on clusters built using commodity hardware. The open source community responded by creating open source MapReduce-based systems, like Apache Hadoop and Spark, that enabled data scientists and engineers to formulate and solve problems at a scale unimagined before.

While the tech industry was being transformed by MapReduce-based systems, biology was experiencing its own metamorphosis driven by second-generation (or “next-generation”) sequencing technology; see Figure 23-1. Sequencing machines are scientific instruments that read the chemical “letters” (A, C, T, and G) that make up your genome: your complete set of genetic material. To have your genome sequenced when the MapReduce paper was published cost about $20 million and took many months to complete; today, it costs just a few thousand dollars and takes only a few days. While the first human genome took decades to create, in 2014 alone an estimated 228,000 genomes were sequenced worldwide.[152] This estimate implies around 20 petabytes (PB) of sequencing data were generated in 2014 worldwide.

Figure 23-1. Timeline of big data technology and cost of sequencing a genome

The plummeting cost ...

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.

Read now

Unlock full access

More than 5,000 organizations count on O’Reilly

AirBnbBlueOriginElectronic ArtsHomeDepotNasdaqRakutenTata Consultancy Services

QuotationMarkO’Reilly covers everything we've got, with content to help us build a world-class technology community, upgrade the capabilities and competencies of our teams, and improve overall team performance as well as their engagement.
Julian F.
Head of Cybersecurity
QuotationMarkI wanted to learn C and C++, but it didn't click for me until I picked up an O'Reilly book. When I went on the O’Reilly platform, I was astonished to find all the books there, plus live events and sandboxes so you could play around with the technology.
Addison B.
Field Engineer
QuotationMarkI’ve been on the O’Reilly platform for more than eight years. I use a couple of learning platforms, but I'm on O'Reilly more than anybody else. When you're there, you start learning. I'm never disappointed.
Amir M.
Data Platform Tech Lead
QuotationMarkI'm always learning. So when I got on to O'Reilly, I was like a kid in a candy store. There are playlists. There are answers. There's on-demand training. It's worth its weight in gold, in terms of what it allows me to do.
Mark W.
Embedded Software Engineer

You might also like

Hadoop: The Definitive Guide, 3rd Edition

Hadoop: The Definitive Guide, 3rd Edition

Tom White
Kafka: The Definitive Guide, 2nd Edition

Kafka: The Definitive Guide, 2nd Edition

Gwen Shapira, Todd Palino, Rajini Sivaram, Krit Petty
Kafka: The Definitive Guide

Kafka: The Definitive Guide

Neha Narkhede, Gwen Shapira, Todd Palino

Publisher Resources

ISBN: 9781491901687Errata Page