Chapter 18. DNA Sequencing

Today, genome sequencing machines (such as Illumina’s HiSeq 4000) can generate thousands of gigabases of DNA and RNA sequencing data in a few hours for less than US$1,000. A few years ago, the price was over US$100,000, and sequencing the first human genome cost about US$3 billion. Success in biology and the life sciences depends on our ability to properly analyze the big data sets that these technologies generate, which in turn requires us to adopt advances in informatics. MapReduce/Hadoop and Spark enable us to process and analyze terabytes or even petabytes of data in hours (rather than days or weeks). For example, Spark was recently used to sort 100 TB of data using 206 machines in 23 minutes.1

In simple terms, DNA sequencing is the process of reading the genetic information in a DNA molecule, up to and including whole genomes (such as human genomes). According to http://dnasequencing.com: “if finding DNA was the discovery of the exact substance holding our genetic makeup information, DNA sequencing is the discovery of the process that will allow us to read that information.” More precisely, the main function of DNA sequencing is to find the exact order of the nucleotides in a strand of DNA, that is, the order of its four bases: adenine (A), guanine (G), cytosine (C), and thymine (T).
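To make the idea of "the order of the four bases" concrete, here is a minimal Python sketch that treats a sequenced strand as a plain string over the letters A, C, G, and T and counts how often each base occurs. The function name `count_bases` and the sample strand are hypothetical, chosen only for illustration; real sequencing output carries more information than a bare string of bases.

```python
from collections import Counter

# The four DNA bases: adenine, cytosine, guanine, thymine.
BASES = "ACGT"


def count_bases(strand: str) -> dict:
    """Count how often each of the four bases occurs in a DNA strand.

    Characters other than A, C, G, and T are ignored in the result.
    """
    counts = Counter(strand.upper())
    return {base: counts.get(base, 0) for base in BASES}


# Hypothetical example strand, not real sequencing data.
strand = "GATTACA"
print(count_bases(strand))  # {'A': 3, 'C': 1, 'G': 1, 'T': 2}
```

The sketch counts bases on a single machine and in memory, so it only illustrates the data itself; it says nothing about how such counting would be distributed across a cluster.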

What are some of the challenges of DNA sequencing? There are many, but here are some of the most important ones:

  • There are several sequencing technologies to generate FASTQ ...
