Skip to Content
Mastering Python for Bioinformatics
book

Mastering Python for Bioinformatics

by Ken Youens-Clark
May 2021
Intermediate to advanced
454 pages
10h 42m
English
O'Reilly Media, Inc.
Content preview from Mastering Python for Bioinformatics

Chapter 9. Overlap Graphs: Sequence Assembly Using Shared K-mers

A graph is a structure used to represent pairwise relationships between objects. As described in the Rosalind GRPH challenge, the goal of this exercise is to find pairs of sequences that can be joined using an overlap from the end of one sequence to the beginning of another. The practical application of this would be to join short DNA reads into longer contiguous sequences (contigs) or even whole genomes. To begin, I’ll only be concerned about joining two sequences, but a second version of the program will use a graph structure that can join any number of sequences to approximate a complete assembly. In this implementation, the overlapping regions used to join sequences are required to be exact matches. Real-world assemblers must allow for variation in the size and composition of the overlapping sequences.

You will learn:

  • How to use k-mers to create overlap graphs

  • How to log runtime messages to a file

  • How to use collections.defaultdict()

  • How to use set intersection to find common elements between collections

  • How to use itertools.product() to create Cartesian products of lists

  • How to use the iteration_utilities.starfilter() function

  • How to use Graphviz to model and visualize graph structures

Getting Started

The code and tests for this exercise are in the 09_grph directory. Start by copying one of the solutions to the program grph.py and requesting the usage:

$ cd 09_grph/ $ cp solution1.py grph.py ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Start your free trial

You might also like

Bioinformatics Programming Using Python

Bioinformatics Programming Using Python

Mitchell L Model

Publisher Resources

ISBN: 9781098100872Errata PageSupplemental Content