O'Reilly logo

Data Algorithms by Mahmoud Parsian

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Chapter 17. K-mer Counting

A K-mer is a substring of length K (K > 0), and counting the occurrences of all such substrings is a central step in many analyses of DNA sequence data. Counting K-mers for a DNA sequence means finding frequencies of K-mers for the entire sequence. In bioinformatics, K-mer counting is used for genome and transcriptome assembly, metagenomic sequencing, and error correction of sequence reads. Although simple in principle, K-mer counting is a big data challenge, since a single DNA sample can contain several billion DNA sequences. The K-mer counting problem has been defined by the Schatz lab.

This chapter will provide a complete K-mer solution using MapReduce/Hadoop and Spark. Our implementation discovers all K-mers for a given K > 0 and finds the top N K-mers for a given N > 0. The complete MapReduce/Hadoop and Spark solutions are available on GitHub.

A K-mer is a DNA sequence of length K. For example, if a sample DNA sequence is CACACACAGT, then the following are 3-mers and 5-mers for that sequence:

original sequence: CACACACAGT
3-mers:            CAC
                    ACA
                     CAC
                      ACA
                       CAC
                        ACA
                         CAG
                          AGT

original sequence: CACACACAGT
5-mers:            CACAC
                    ACACA
                     CACAC
                      ACACA
                       CACAG
                        ACAGT

Given a DNA sequence (we’ll call it string S) comprising any of {A, C, G, T}, which represent the primary nucleobases, we are interested in counting the number of occurrences in S of every substring of length K.

Input Data for K-mer Counting

For input, we will use FASTQ, a concise and compact format that stores DNA ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required