Skip to Main Content
Data Algorithms
book

Data Algorithms

by Mahmoud Parsian
July 2015
Intermediate to advanced content levelIntermediate to advanced
778 pages
17h 9m
English
O'Reilly Media, Inc.
Content preview from Data Algorithms

Chapter 17. K-mer Counting

A K-mer is a substring of length K (K > 0), and counting the occurrences of all such substrings is a central step in many analyses of DNA sequence data. Counting K-mers for a DNA sequence means finding frequencies of K-mers for the entire sequence. In bioinformatics, K-mer counting is used for genome and transcriptome assembly, metagenomic sequencing, and error correction of sequence reads. Although simple in principle, K-mer counting is a big data challenge, since a single DNA sample can contain several billion DNA sequences. The K-mer counting problem has been defined by the Schatz lab.

This chapter will provide a complete K-mer solution using MapReduce/Hadoop and Spark. Our implementation discovers all K-mers for a given K > 0 and finds the top N K-mers for a given N > 0. The complete MapReduce/Hadoop and Spark solutions are available on GitHub.

A K-mer is a DNA sequence of length K. For example, if a sample DNA sequence is CACACACAGT, then the following are 3-mers and 5-mers for that sequence:

original sequence: CACACACAGT
3-mers:            CAC
                    ACA
                     CAC
                      ACA
                       CAC
                        ACA
                         CAG
                          AGT

original sequence: CACACACAGT
5-mers:            CACAC
                    ACACA
                     CACAC
                      ACACA
                       CACAG
                        ACAGT

Given a DNA sequence (we’ll call it string S) comprising any of {A, C, G, T}, which represent the primary nucleobases, we are interested in counting the number of occurrences in S of every substring of length K.

Input Data for K-mer Counting

For input, we will use FASTQ, a concise and compact format that stores DNA ...

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Start your free trial

You might also like

Data Algorithms with Spark

Data Algorithms with Spark

Mahmoud Parsian
Algorithms and Data Structures for Massive Datasets

Algorithms and Data Structures for Massive Datasets

Dzejla Medjedovic, Emin Tahirovic, Ines Schweigert
Data Mesh

Data Mesh

Zhamak Dehghani
Learning Algorithms

Learning Algorithms

George Heineman

Publisher Resources

ISBN: 9781491906170Errata PageSupplemental Content