Data Algorithms by Mahmoud Parsian

Chapter 24. DNA Base Count

This chapter provides four solutions for DNA base counting:

  • A MapReduce/Hadoop solution using FASTA format

  • A MapReduce/Hadoop solution using FASTQ format

  • A Spark solution using FASTA format

  • A Spark solution using FASTQ format

The purpose of this chapter is to count DNA bases. Human DNA’s code is written using only four letters (A, C, T, and G), and when we cannot recognize the code, we label it as N. The meaning of this DNA code lies in the sequence of the letters A, T, C, and G in the same way that the meaning of a word in the English language lies in the sequence of alphabet letters (A–Z).

In this chapter we’ll find the frequencies (or percentages) of A, T, C, G, and N in a given set of DNA sequences. We’ll also provide custom record readers for Hadoop’s input files.

So what do the letters ATCG stand for in the context of DNA? They refer to four of the nitrogenous bases associated with DNA:

  • A = Adenine

  • T = Thymine

  • C = Cytosine

  • G = Guanine

For example, ACGGGTACGAAT is a very small DNA sequence. DNA sequences can be huge. DNA base counting for our example will generate the results shown in Table 24-1.

Table 24-1. DNA base count example
Base   Count
a      4
t      2
c      2
g      4
n      0
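
The counts in Table 24-1, and the percentages mentioned earlier, can be reproduced with a few lines of Python. This is only a minimal in-memory sketch, not the chapter's MapReduce or Spark solution; the function names count_bases and base_percentages are hypothetical, and it assumes that input is lowercased before counting and that any symbol other than a, t, c, g, or n is ignored.

```python
from collections import Counter

BASES = "atcgn"  # the five symbols reported in Table 24-1


def count_bases(sequence):
    """Count a, t, c, g, and n in a DNA string (case-insensitive).

    Assumption: symbols outside BASES are ignored rather than counted.
    """
    counts = Counter(sequence.lower())
    return {base: counts.get(base, 0) for base in BASES}


def base_percentages(counts):
    """Convert base counts into percentages of the counted bases."""
    total = sum(counts.values())
    if total == 0:
        return {base: 0.0 for base in counts}
    return {base: 100.0 * n / total for base, n in counts.items()}


example_counts = count_bases("ACGGGTACGAAT")
example_percentages = base_percentages(example_counts)

for base in BASES:
    print(base, example_counts[base], round(example_percentages[base], 2))
```

For sequences that do not fit in memory, the same per-symbol tally is what a distributed job would compute in pieces and then sum.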

DNA sequences can be represented in many different formats, including the popular FASTA and FASTQ text-based formats, which are what we’ll use in our solutions. Note that Hadoop’s default record reader reads records line by line, and therefore we cannot ...
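
To illustrate why line structure matters: in FASTA, each record begins with a description line starting with ">" and is followed by one or more lines of sequence data. The sketch below assumes that layout and counts bases from an iterable of lines while skipping description lines. It is a plain Python illustration, not a Hadoop record reader; the function name count_fasta_bases and the sample records are hypothetical.

```python
from collections import Counter


def count_fasta_bases(lines):
    """Count a, t, c, g, and n across FASTA sequence lines.

    Assumptions: lines starting with '>' are description lines and
    carry no sequence data; blank lines are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip description and blank lines
        counts.update(line.lower())
    return {base: counts.get(base, 0) for base in "atcgn"}


# Hypothetical two-record FASTA input; the first record spans two lines.
sample = [
    ">seq1 hypothetical record",
    "ACGGGT",
    "ACGAAT",
    ">seq2 hypothetical record",
    "NNAC",
]
fasta_counts = count_fasta_bases(sample)
print(fasta_counts)
```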
