Data Algorithms

by Mahmoud Parsian
July 2015
Intermediate to advanced
778 pages
17h 9m
English
O'Reilly Media, Inc.

Chapter 24. DNA Base Count

This chapter provides four solutions for DNA base counting:

  • A MapReduce/Hadoop solution using FASTA format

  • A MapReduce/Hadoop solution using FASTQ format

  • A Spark solution using FASTA format

  • A Spark solution using FASTQ format

The purpose of this chapter is to count DNA bases. Human DNA’s code is written using only four letters: A, C, T, and G. When a base cannot be recognized, it is labeled N. The meaning of this DNA code lies in the sequence of the letters A, T, C, and G, in the same way that the meaning of an English word lies in the sequence of its letters (A–Z).

In this chapter we’ll find the frequencies (or percentages) of A, T, C, G, and N in a given set of DNA sequences. We’ll also provide custom record readers for Hadoop’s input files.
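As a rough illustration of what "frequencies (or percentages)" means here, the following plain-Python sketch tallies bases across a set of sequences. It is not the chapter's Hadoop or Spark code. The function name `base_percentages` is hypothetical, and treating every unrecognized symbol as N is an assumption made for this example.

```python
from collections import Counter

BASES = "ATCGN"

def base_percentages(sequences):
    """Return {base: percentage} over all sequences.

    Assumption for this sketch: any symbol other than A, T, C, or G
    is counted as N.
    """
    counts = Counter()
    for seq in sequences:
        for ch in seq.upper():
            counts[ch if ch in "ATCG" else "N"] += 1
    total = sum(counts.values())
    if total == 0:
        # No input symbols at all: report zero for every base.
        return {b: 0.0 for b in BASES}
    return {b: 100.0 * counts[b] / total for b in BASES}
```

For instance, `base_percentages(["ACGT", "NNAA"])` covers eight symbols, three of which are A, so A accounts for 37.5 percent.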

So what do the letters ATCG stand for in the context of DNA? They refer to four of the nitrogenous bases associated with DNA:

  • A = Adenine

  • T = Thymine

  • C = Cytosine

  • G = Guanine

For example, ACGGGTACGAAT is a very small DNA sequence; real DNA sequences can be huge. Base counting for this example produces the results shown in Table 24-1.

Table 24-1. DNA base count example
Base   Count
a      4
t      2
c      2
g      4
n      0
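The counts in Table 24-1 can be reproduced with a small map-and-reduce style sketch. This is plain Python that only imitates the shape of a MapReduce job (emit a pair per base, then sum by key). The names `map_bases` and `reduce_counts` are hypothetical and are not taken from the book's solutions.

```python
from collections import defaultdict

def map_bases(sequence):
    """Map step: emit one (base, 1) pair per symbol, lowercased as in Table 24-1."""
    for ch in sequence.lower():
        yield (ch, 1)

def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each base."""
    totals = defaultdict(int)
    for base, value in pairs:
        totals[base] += value
    return dict(totals)

counts = reduce_counts(map_bases("ACGGGTACGAAT"))

# Print in the table's row order; bases that never occur default to 0.
for base in "atcgn":
    print(base, counts.get(base, 0))
```

In a real cluster job the map step would run on many input splits in parallel and the framework would group the pairs by key before reducing; here both steps run in a single process.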

DNA sequences can be represented in many different formats, including the popular FASTA and FASTQ text-based formats, which are what we’ll use in our solutions. Note that Hadoop’s default record reader reads records line by line, and therefore we cannot ...
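To make the record-boundary issue concrete: a FASTA record is a header line starting with ">" followed by one or more sequence lines, and a FASTQ record is conventionally four lines (an "@" identifier, the sequence, a "+" separator, and a quality string). The sketch below groups lines into records in plain Python. It does not use Hadoop's record reader API, the function names `read_fasta` and `read_fastq` are hypothetical, and it assumes the common four-line FASTQ layout with no wrapped sequence lines.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def read_fastq(lines):
    """Yield (identifier, sequence, quality) triples.

    Assumption for this sketch: every record is exactly four lines.
    Trailing lines that do not form a full record are ignored.
    """
    stripped = [ln.strip() for ln in lines if ln.strip()]
    for i in range(0, len(stripped) - 3, 4):
        ident, seq, plus, qual = stripped[i:i + 4]
        if not ident.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        yield ident[1:], seq, qual
```

Grouping FASTQ by position rather than by scanning for "@" is deliberate: a quality string may itself begin with "@", so the leading character alone does not reliably mark the start of a record.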

ISBN: 9781491906170