Chapter 10. Finding the Longest Shared Subsequence: Finding K-mers, Writing Functions, and Using Binary Search
As described in the Rosalind LCSM challenge, the goal of this exercise is to find the longest substring that is shared by all sequences in a given FASTA file. In Chapter 8, I was searching for a given motif in some sequences. In this challenge, I don't know a priori that any shared motif is present, much less the size or composition of it, so I'll just look for any length of sequence that is present in every sequence. This is a challenging exercise that brings together many ideas I've shown in earlier chapters. I'll use the solutions to explore algorithm design, functions, tests, and code organization.
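To make the problem statement concrete, here is a minimal sketch of one way to look for a substring present in every sequence: collect the k-mers (substrings of length k) of each sequence, intersect them, and try successively smaller values of k. The function names `find_kmers`, `shared_kmers`, and `longest_shared` are hypothetical, the sequences are hardcoded rather than read from a FASTA file, and this is not necessarily how the solutions in the 10_lcsm directory are written.

```python
from typing import List, Set


def find_kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all substrings of length k found in seq."""
    n = len(seq) - k + 1  # number of k-mers; zero or negative if k > len(seq)
    return {seq[i:i + k] for i in range(max(n, 0))}


def shared_kmers(seqs: List[str], k: int) -> Set[str]:
    """Return the k-mers that occur in every sequence."""
    if not seqs:
        return set()
    common = find_kmers(seqs[0], k)
    for seq in seqs[1:]:
        common &= find_kmers(seq, k)  # keep only k-mers also found here
    return common


def longest_shared(seqs: List[str]) -> str:
    """Try k from the length of the shortest sequence down to 1."""
    if not seqs:
        return ''
    for k in range(min(map(len, seqs)), 0, -1):
        common = shared_kmers(seqs, k)
        if common:
            # Several substrings may tie; pick one deterministically
            return min(common)
    return ''


# Illustrative sequences only; a real program would parse a FASTA file
seqs = ['GATTACA', 'TAGACCA', 'ATACA']
print(longest_shared(seqs))  # AC
```

With these three sequences, no 3-mer is common to all of them, while the 2-mers AC, CA, and TA are, so any of the three would be an acceptable answer. The sketch returns the alphabetically first one only to keep the result reproducible.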
You will learn:
- How to use k-mers to find shared subsequences
- How to use itertools.chain() to concatenate lists of lists
- How and why to use a binary search
- One way to maximize a function
- How to use the key option with min() and max()
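The ideas in this list can be combined in a short sketch. It assumes that if all sequences share a substring of length k, they also share one of length k - 1, which is what makes a binary search over k valid. The names `find_kmers`, `common_kmers`, and `longest_common` are hypothetical and chosen for illustration; this is a sketch under those assumptions, not the book's solution code.

```python
from collections import Counter
from itertools import chain
from typing import List


def find_kmers(seq: str, k: int) -> List[str]:
    """Return every substring of length k in seq, in order."""
    n = len(seq) - k + 1
    return [seq[i:i + k] for i in range(max(n, 0))]


def common_kmers(seqs: List[str], k: int) -> List[str]:
    """Return the sorted k-mers that occur in every sequence."""
    # One set per sequence, so each k-mer counts once per sequence
    per_seq = [set(find_kmers(seq, k)) for seq in seqs]
    # chain.from_iterable() flattens the collection of sets into one stream
    counts = Counter(chain.from_iterable(per_seq))
    return sorted(kmer for kmer, num in counts.items() if num == len(seqs))


def longest_common(seqs: List[str]) -> str:
    """Binary search for the largest k that yields a shared k-mer."""
    if not seqs:
        return ''
    # key=len makes min() compare the sequences by length, not alphabetically
    shortest = min(seqs, key=len)
    low, high = 1, len(shortest)
    hits: List[List[str]] = []
    while low <= high:
        mid = (low + high) // 2
        found = common_kmers(seqs, mid)
        if found:
            hits.append(found)
            low = mid + 1   # something is shared; try a longer k
        else:
            high = mid - 1  # nothing is shared; try a shorter k
    # Flatten all hits and keep the longest; max() returns the first on ties
    return max(chain.from_iterable(hits), key=len, default='')


seqs = ['GATTACA', 'TAGACCA', 'ATACA']
print(longest_common(seqs))  # AC
```

For these sequences the search tests k=3 (nothing shared), then k=1 and k=2 (both succeed), so only three values of k are examined instead of all five. The savings grow with sequence length because each step halves the remaining range of k.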
Getting Started
All the code and tests for this challenge are in the 10_lcsm directory.
Start by copying the first solution to the lcsm.py program and asking for help:
$ cp solution1_kmers_imperative.py lcsm.py
$ ./lcsm.py -h
usage: lcsm.py [-h] FILE

Longest Common Substring

positional arguments:
  FILE        Input FASTA

optional arguments:
  -h, --help  show this help message and exit
The only required argument is a single positional file of FASTA-formatted DNA sequences. As with other programs that accept files, the program will ...
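A command-line interface matching the usage statement above could be sketched with argparse roughly as follows. The function name `get_args` and the choice to store the argument as a `pathlib.Path` are assumptions made for illustration; the sketch does not validate or open the file, and the actual program may handle the argument differently.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def get_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse a single positional FILE argument (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        prog='lcsm.py',
        description='Longest Common Substring')

    # metavar controls how the argument is displayed in the usage text
    parser.add_argument('file',
                        metavar='FILE',
                        type=Path,
                        help='Input FASTA')

    # With argv=None, argparse reads the arguments from sys.argv
    return parser.parse_args(argv)


# Passing an explicit list makes the parser easy to exercise in a test
args = get_args(['seqs.fa'])
print(args.file)  # seqs.fa
```

Accepting an optional argument list is a design choice that lets the parsing be tested without touching the real command line.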