A K-mer is a substring of length *K* (*K >* 0), and counting the occurrences of all such substrings is a central step in many analyses of DNA sequence data. Counting K-mers for a DNA sequence means finding frequencies of K-mers for the entire sequence. In bioinformatics, K-mer counting is used for genome and transcriptome assembly, metagenomic sequencing, and error correction of sequence reads. Although simple in principle, K-mer counting is a big data challenge, since a single DNA sample can contain several billion DNA sequences. The K-mer counting problem has been defined by the Schatz lab.

This chapter will provide a complete K-mer solution using MapReduce/Hadoop and Spark. Our implementation discovers all K-mers for a given *K >* 0 and finds the top *N* K-mers for a given *N >* 0. The complete MapReduce/Hadoop and Spark solutions are available on GitHub.

A K-mer is a DNA sequence of length *K*. For example, if a sample DNA sequence is *CACACACAGT*, then the following are 3-mers and 5-mers for that sequence:

original sequence: CACACACAGT 3-mers: CAC ACA CAC ACA CAC ACA CAG AGT original sequence: CACACACAGT 5-mers: CACAC ACACA CACAC ACACA CACAG ACAGT

Given a DNA sequence (we’ll call it string *S*) comprising any of {*A, C, G, T*}, which represent the primary nucleobases, we are interested in counting the number of occurrences in *S* of every substring of length *K*.

For input, we will use FASTQ, a concise and compact format that stores DNA ...

Start Free Trial

No credit card required