Chapter 10. Finding the Longest Shared Subsequence: Finding K-mers, Writing Functions, and Using Binary Search
As described in the Rosalind LCSM challenge, the goal of this exercise is to find the longest substring that is shared by all sequences in a given FASTA file. In Chapter 8, I was searching for a given motif in some sequences. In this challenge, I don’t know a priori that any shared motif is present—much less the size or composition of it—so I’ll just look for any length of sequence that is present in every sequence. This is a challenging exercise that brings together many ideas I’ve shown in earlier chapters. I’ll use the solutions to explore algorithm design, functions, tests, and code organization.
You will learn:
How to use k-mers to find shared subsequences
How to use
itertools.chain()to concatenate lists of lists
How and why to use a binary search
One way to maximize a function
How to use the
All the code and tests for this challenge are in the 10_lcsm directory.
Start by copying the first solution to the
lcsm.py program and asking for help:
$ cp solution1_kmers_imperative.py lcsm.py $ ./lcsm.py -h usage: lcsm.py [-h] FILE Longest Common Substring positional arguments: FILE Input FASTA optional arguments: -h, --help show this help message and exit
The only required argument is a single positional file of FASTA-formatted DNA sequences. As with other programs that accept files, the program will reject invalid ...