Chapter 10. Finding the Longest Shared Subsequence: Finding K-mers, Writing Functions, and Using Binary Search

As described in the Rosalind LCSM challenge, the goal of this exercise is to find the longest substring that is shared by all sequences in a given FASTA file. In Chapter 8, I was searching for a given motif in some sequences. In this challenge, I don't know a priori that any shared motif is present, much less its size or composition, so I'll just look for any length of sequence that is present in every sequence. This is a challenging exercise that brings together many ideas I've shown in earlier chapters. I'll use the solutions to explore algorithm design, functions, tests, and code organization.
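To make the goal concrete, here is a minimal sketch of one way to find the longest substrings shared by every sequence. It is not necessarily how the chapter's solutions are written: the function names find_kmers(), common_kmers(), and longest_shared() are hypothetical, and the sequences are passed in as plain strings rather than parsed from a FASTA file.

```python
from typing import List, Set


def find_kmers(seq: str, k: int) -> List[str]:
    """Return every substring of length k (k-mer) in seq."""
    n = len(seq) - k + 1
    return [] if n < 1 else [seq[i:i + k] for i in range(n)]


def common_kmers(seqs: List[str], k: int) -> Set[str]:
    """Return the k-mers that occur in every sequence."""
    kmer_sets = [set(find_kmers(seq, k)) for seq in seqs]
    return set.intersection(*kmer_sets) if kmer_sets else set()


def longest_shared(seqs: List[str]) -> List[str]:
    """Try k from the length of the shortest sequence down to 1."""
    if not seqs:
        return []
    for k in range(min(map(len, seqs)), 0, -1):
        shared = common_kmers(seqs, k)
        if shared:
            # The first k with any shared k-mers is the longest possible
            return sorted(shared)
    return []


if __name__ == '__main__':
    print(longest_shared(['GATTACA', 'TAGACCA', 'ATACA']))
```

This version checks every candidate length from longest to shortest, which is simple but can be slow for long sequences. No shared substring can be longer than the shortest sequence, so that length is a safe starting point.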

You will learn:

How to use k-mers to find shared subsequences
How to use itertools.chain() to concatenate lists of lists
How and why to use a binary search
One way to maximize a function
How to use the key option with min() and max()
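As a preview of these ideas, the following sketch shows itertools.chain flattening a list of lists, the key option with min() and max(), and a binary search that maximizes a function. The helper names binary_search_max() and has_shared() and the sample sequences are assumptions made for illustration. The binary search relies on one property: if the sequences share a substring of length k, they also share one of every shorter length.

```python
from itertools import chain
from typing import Callable

# Flatten a list of lists; chain(*nested) would behave the same way here
nested = [['AC', 'CA'], ['TA'], []]
flat = list(chain.from_iterable(nested))

# The key option makes min() and max() compare by length, not alphabetically
words = ['A', 'GATT', 'AC']
shortest = min(words, key=len)
longest = max(words, key=len)


def binary_search_max(pred: Callable[[int], bool], low: int, high: int) -> int:
    """
    Return the largest k in [low, high] for which pred(k) is True,
    assuming pred is True up to some point and False afterward.
    Return low - 1 if pred is never True.
    """
    best = low - 1
    while low <= high:
        mid = (low + high) // 2
        if pred(mid):
            best = mid       # mid works, so look for something larger
            low = mid + 1
        else:
            high = mid - 1   # mid fails, so look for something smaller
    return best


seqs = ['GATTACA', 'TAGACCA', 'ATACA']


def has_shared(k: int) -> bool:
    """True if some k-mer occurs in every sequence."""
    sets = [{s[i:i + k] for i in range(len(s) - k + 1)} for s in seqs]
    return bool(set.intersection(*sets))


longest_k = binary_search_max(has_shared, 1, min(map(len, seqs)))

if __name__ == '__main__':
    print(flat, shortest, longest, longest_k)
```

Compared with testing every length in turn, the binary search needs only about log2(n) calls to the predicate, where n is the length of the shortest sequence.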

Getting Started

All the code and tests for this challenge are in the 10_lcsm directory. Start by copying the first solution to the lcsm.py program and asking for help:

$ cp solution1_kmers_imperative.py lcsm.py
$ ./lcsm.py -h
usage: lcsm.py [-h] FILE

Longest Common Substring

positional arguments:
  FILE        Input FASTA

optional arguments:
  -h, --help  show this help message and exit
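A usage message like the one above could be produced by an argparse parser along these lines. This is only a sketch: the get_parser() function name is hypothetical, and the file is accepted as a plain string here, which may differ from how the actual program validates its input. Newer Python versions label the last section "options:" rather than "optional arguments:".

```python
import argparse


def get_parser() -> argparse.ArgumentParser:
    """Build a parser with one required positional FILE argument."""
    parser = argparse.ArgumentParser(
        prog='lcsm.py',
        description='Longest Common Substring')

    # Assumption: take the path as a string; validation is left out
    parser.add_argument('file', metavar='FILE', help='Input FASTA')

    return parser


if __name__ == '__main__':
    args = get_parser().parse_args()
    print(args.file)
```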

The only required argument is a single positional file of FASTA-formatted DNA sequences. As with other programs that accept files, the program will ...

Mastering Python for Bioinformatics by Ken Youens-Clark
