Chapter 24. DNA Base Count

This chapter provides four solutions for DNA base counting:

  • A MapReduce/Hadoop solution using FASTA format

  • A MapReduce/Hadoop solution using FASTQ format

  • A Spark solution using FASTA format

  • A Spark solution using FASTQ format

The purpose of this chapter is to count DNA bases. Human DNA’s code is written using only four letters (A, C, T, and G), and when we cannot recognize a base, we label it as N. The meaning of this DNA code lies in the sequence of the letters A, T, C, and G, in the same way that the meaning of a word in the English language lies in the sequence of alphabet letters (A–Z).

In this chapter we’ll find the frequencies (or percentages) of A, T, C, G, and N in a given set of DNA sequences. We’ll also provide custom record readers for Hadoop’s input files.

So what do the letters ATCG stand for in the context of DNA? They refer to four of the nitrogenous bases associated with DNA:

  • A = Adenine

  • T = Thymine

  • C = Cytosine

  • G = Guanine

For example, ACGGGTACGAAT is a very small DNA sequence. DNA sequences can be huge. DNA base counting for our example will generate the results shown in Table 24-1.

Table 24-1. DNA base count example
Base  Count
a     4
t     2
c     2
g     4
n     0
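
The counts in Table 24-1 can be reproduced with a few lines of plain Python. This is only a minimal single-machine sketch, not the MapReduce/Hadoop or Spark solutions this chapter develops. The function names count_bases and base_percentages are hypothetical, and the sketch assumes that input is case-insensitive and that any character other than a, t, c, g, or n is ignored.

```python
from collections import Counter

# The five symbols we report on; n marks an unrecognized base.
BASES = "atcgn"


def count_bases(sequence):
    """Count occurrences of a, t, c, g, and n in a DNA sequence.

    Assumption: input is case-insensitive, and characters outside
    BASES are ignored rather than counted.
    """
    counts = Counter(sequence.lower())
    return {base: counts.get(base, 0) for base in BASES}


def base_percentages(counts):
    """Convert base counts into percentages of the total counted bases."""
    total = sum(counts.values())
    if total == 0:
        # Avoid division by zero for an empty sequence.
        return {base: 0.0 for base in counts}
    return {base: 100.0 * n / total for base, n in counts.items()}


counts = count_bases("ACGGGTACGAAT")
print(counts)  # {'a': 4, 't': 2, 'c': 2, 'g': 4, 'n': 0}
print(base_percentages(counts))
```

For the 12-base example, a and g each account for one third of the bases, and t and c for one sixth each. A distributed solution follows the same idea: emit a count per base for each record, then sum the counts per base across all records.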

DNA sequences can be represented in many different formats, including the popular FASTA and FASTQ text-based formats, which are what we’ll use in our solutions. Note that Hadoop’s default record reader reads records line by line, and therefore we cannot ...
