Chapter 17. K-mer Counting
A K-mer is a substring of length K (K > 0), and counting the occurrences of all such substrings is a central step in many analyses of DNA sequence data. Counting K-mers for a DNA sequence means finding frequencies of K-mers for the entire sequence. In bioinformatics, K-mer counting is used for genome and transcriptome assembly, metagenomic sequencing, and error correction of sequence reads. Although simple in principle, K-mer counting is a big data challenge, since a single DNA sample can contain several billion DNA sequences. The K-mer counting problem has been defined by the Schatz lab.
This chapter will provide a complete K-mer solution using MapReduce/Hadoop and Spark. Our implementation discovers all K-mers for a given K > 0 and finds the top N K-mers for a given N > 0. The complete MapReduce/Hadoop and Spark solutions are available on GitHub.
A K-mer is a DNA sequence of length K. For example, if a sample DNA sequence is CACACACAGT, then the following are 3-mers and 5-mers for that sequence:
original sequence: CACACACAGT
3-mers: CAC
ACA
CAC
ACA
CAC
ACA
CAG
AGT
original sequence: CACACACAGT
5-mers: CACAC
ACACA
CACAC
ACACA
CACAG
ACAGT
Given a DNA sequence (we’ll call it string S) comprising any of {A, C, G, T}, which represent the primary nucleobases, we are interested in counting the number of occurrences in S of every substring of length K.
Input Data for K-mer Counting
For input, we will use FASTQ, a concise and compact format that stores DNA ...