Chapter 8. Find a Motif in DNA: Exploring Sequence Similarity
In the Rosalind SUBS challenge, I’ll be searching for any occurrences of one sequence inside another.
A shared subsequence might represent a conserved element such as a marker, gene, or regulatory sequence.
Conserved sequences between two organisms might suggest some inherited or convergent trait.
I’ll explore how to write a solution using the str (string) class in Python and will compare strings to lists.
Then I’ll explore how to express these ideas using higher-order functions and will continue the discussion of k-mers I started in Chapter 7.
Finally, I’ll show how regular expressions can find patterns and will point out problems with overlapping matches.
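As a quick preview of the overlap problem, here is a minimal sketch rather than one of the chapter's solutions. The function names `find_plain` and `find_overlapping` are hypothetical, the sequences are illustrative inputs, and I'm assuming 1-based positions as Rosalind reports them. A plain pattern consumes the text it matches, so a second hit that starts inside the first is skipped. Wrapping the pattern in a zero-width look-ahead avoids consuming any characters:

```python
import re


def find_plain(seq: str, subseq: str) -> list[int]:
    """1-based starts of non-overlapping matches (hypothetical helper)."""
    # Each match consumes its characters, so overlapping hits are missed
    return [m.start() + 1 for m in re.finditer(re.escape(subseq), seq)]


def find_overlapping(seq: str, subseq: str) -> list[int]:
    """1-based starts of all matches, including overlapping ones."""
    # (?=...) is a look-ahead: it matches at a position without consuming text
    pattern = f'(?={re.escape(subseq)})'
    return [m.start() + 1 for m in re.finditer(pattern, seq)]


print(find_plain('GATATATGCATATACTT', 'ATAT'))        # [2, 10]
print(find_overlapping('GATATATGCATATACTT', 'ATAT'))  # [2, 4, 10]
```

The match at position 4 overlaps the one at position 2, which is why only the look-ahead version reports it.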
In this chapter, I’ll demonstrate:
- How to use str.find(), str.index(), and string slices
- How to use sets to create unique collections of elements
- How to combine higher-order functions
- How to find subsequences using k-mers
- How to find possibly overlapping sequences using regular expressions
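To make the first and fourth items concrete, here is a rough sketch of two of these approaches. It is not the code from the 08_subs directory. The names `find_with_str` and `find_with_kmers` are hypothetical, and I'm again assuming 1-based positions. The first function calls str.find() repeatedly, moving the start position one character past each hit. The second slices the sequence into k-mers and keeps the positions of those equal to the subsequence:

```python
def find_with_str(seq: str, subseq: str) -> list[int]:
    """Find 1-based start positions using str.find() (hypothetical helper)."""
    positions = []
    start = seq.find(subseq)  # str.find() returns -1 when nothing is found
    while start != -1:
        positions.append(start + 1)
        # Resume one character later so overlapping hits are not skipped
        start = seq.find(subseq, start + 1)
    return positions


def find_with_kmers(seq: str, subseq: str) -> list[int]:
    """Find 1-based start positions by comparing every k-mer to subseq."""
    k = len(subseq)
    # String slices give each k-length window of the sequence
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return [i + 1 for i, kmer in enumerate(kmers) if kmer == subseq]


print(find_with_str('GATATATGCATATACTT', 'ATAT'))    # [2, 4, 10]
print(find_with_kmers('GATATATGCATATACTT', 'ATAT'))  # [2, 4, 10]
```

I chose str.find() over str.index() here because str.index() raises a ValueError when the subsequence is missing, which would require a try/except block around the loop.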
Getting Started
The code and tests for this chapter are in 08_subs.
I suggest you start by copying the first solution to the program subs.py and requesting help:
$ cd 08_subs/
$ cp solution1_str_find.py subs.py
$ ./subs.py -h
usage: subs.py [-h] seq subseq

Find subsequences

positional arguments:
  seq         Sequence
  subseq      subsequence

optional arguments:
  -h, --help  show this help message and exit
The program should report the starting locations where the subsequence can be ...