Chapter 10. Finding the Longest Shared Subsequence: Finding K-mers, Writing Functions, and Using Binary Search

As described in the Rosalind LCSM challenge, the goal of this exercise is to find the longest substring that is shared by all sequences in a given FASTA file. In Chapter 8, I was searching for a given motif in some sequences. In this challenge, I don’t know a priori that any shared motif is present—much less the size or composition of it—so I’ll just look for any length of sequence that is present in every sequence. This is a challenging exercise that brings together many ideas I’ve shown in earlier chapters. I’ll use the solutions to explore algorithm design, functions, tests, and code organization.

You will learn:

  • How to use k-mers to find shared subsequences

  • How to use itertools.chain() to concatenate lists of lists

  • How and why to use a binary search

  • One way to maximize a function

  • How to use the key option with min() and max()

Getting Started

All the code and tests for this challenge are in the 10_lcsm directory. Start by copying the first solution to the lcsm.py program and asking for help:

$ cp solution1_kmers_imperative.py lcsm.py
$ ./lcsm.py -h
usage: lcsm.py [-h] FILE

Longest Common Substring

positional arguments:
  FILE        Input FASTA

optional arguments:
  -h, --help  show this help message and exit

The only required argument is a single positional file of FASTA-formatted DNA sequences. As with other programs that accept files, the program will reject invalid ...

Get Mastering Python for Bioinformatics now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.