This chapter provides four solutions for DNA base counting:
A MapReduce/Hadoop solution using FASTA format
A MapReduce/Hadoop solution using FASTQ format
A Spark solution using FASTA format
A Spark solution using FASTQ format
The purpose of this chapter is to count DNA bases. Human DNA’s code is written using only four letters (A, T, C, and G), and when we cannot recognize a letter in the code, we label it as N. The meaning of this DNA code lies in the sequence of the letters A, T, C, and G in the same way that the meaning of a word in the English language lies in the sequence of alphabet letters (A–Z).
In this chapter we’ll find the frequencies (or percentages) of A, T, C, G, and N in a given set of DNA sequences. We’ll also provide custom record readers for Hadoop’s input files.
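Before turning to the distributed solutions, the counting task itself can be sketched in a few lines of plain Python. This is only an illustrative, single-machine sketch and not the chapter's MapReduce or Spark code: the function names `count_bases` and `base_percentages` and the sample sequences are hypothetical, and it assumes that input is case-insensitive and that any character other than A, T, C, G, or N is ignored.

```python
from collections import Counter

# The five symbols we report on (N marks an unrecognized base).
BASES = ("A", "T", "C", "G", "N")


def count_bases(sequences):
    """Return a dict mapping each base to its total count across all sequences.

    Assumption: lowercase letters are treated like uppercase, and any
    character outside BASES (e.g., whitespace) is simply not reported.
    """
    totals = Counter()
    for seq in sequences:
        totals.update(seq.upper())
    return {base: totals[base] for base in BASES}


def base_percentages(counts):
    """Convert raw counts into percentages of all counted bases."""
    total = sum(counts.values())
    if total == 0:
        return {base: 0.0 for base in counts}
    return {base: 100.0 * n / total for base, n in counts.items()}


# Hypothetical sample data, for illustration only.
sequences = ["ACGTN", "aacc", "GGTT"]
counts = count_bases(sequences)
percentages = base_percentages(counts)
```

Here `counts` holds the per-base totals and `percentages` the corresponding shares. In a distributed setting, the same per-sequence counting would happen in parallel on partitions of the input, with the partial counts summed per base afterward.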
So what do the letters ATCG stand for in the context of DNA? They refer to four of the nitrogenous bases associated with DNA:
A = Adenine
T = Thymine
C = Cytosine
G = Guanine
DNA sequences can be represented in many different formats, including the popular FASTA and FASTQ text-based formats, which are what we’ll use in our solutions. Note that Hadoop’s default record reader reads records line by line, and therefore we cannot ...