Chapter 9. Overlap Graphs: Sequence Assembly Using Shared K-mers

A graph is a structure used to represent pairwise relationships between objects. As described in the Rosalind GRPH challenge, the goal of this exercise is to find pairs of sequences that can be joined using an overlap from the end of one sequence to the beginning of another. The practical application of this would be to join short DNA reads into longer contiguous sequences (contigs) or even whole genomes. To begin, I’ll only be concerned about joining two sequences, but a second version of the program will use a graph structure that can join any number of sequences to approximate a complete assembly. In this implementation, the overlapping regions used to join sequences are required to be exact matches. Real-world assemblers must allow for variation in the size and composition of the overlapping sequences.

You will learn:

  • How to use k-mers to create overlap graphs

  • How to log runtime messages to a file

  • How to use collections.defaultdict()

  • How to use set intersection to find common elements between collections

  • How to use itertools.product() to create Cartesian products of lists

  • How to use the iteration_utilities.starfilter() function

  • How to use Graphviz to model and visualize graph structures

Getting Started

The code and tests for this exercise are in the 09_grph directory. Start by copying one of the solutions to the program grph.py and requesting the usage:

$ cd 09_grph/ $ cp solution1.py grph.py ...

Get Mastering Python for Bioinformatics now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.