Chapter 11. Finding a Protein Motif: Fetching Data and Using Regular Expressions
We've spent quite a bit of time now looking for sequence motifs. As described in the Rosalind MPRT challenge, shared or conserved sequences in proteins imply shared functions. In this exercise, I need to identify protein sequences that contain the N-glycosylation motif. The input to the program is a list of protein IDs that will be used to download the sequences from the UniProt website. After demonstrating how to manually and programmatically download the data, I'll show how to find the motif using a regular expression and by writing a manual solution.
You will learn:
- How to programmatically fetch data from the internet
- How to write a regular expression to find the N-glycosylation motif
- How to manually find the N-glycosylation motif
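As a preview of the last two goals, here is a minimal sketch of both approaches. It assumes the motif as the Rosalind MPRT challenge states it, N{P}[ST]{P}: an N, then any residue except P, then S or T, then any residue except P. The function names find_motif_regex and find_motif_manual are hypothetical and are not taken from the book's solutions. Because motifs can overlap, the regular expression uses a lookahead so that a match consumes no characters.

```python
import re


def find_motif_regex(seq):
    """Return 1-based start positions of N{P}[ST]{P}, including overlaps."""
    # The lookahead (?=...) matches at a position without consuming text,
    # so a motif starting inside another motif is still reported.
    pattern = re.compile(r'(?=N[^P][ST][^P])')
    return [match.start() + 1 for match in pattern.finditer(seq)]


def find_motif_manual(seq):
    """Return the same positions by checking every 4-residue window."""
    positions = []
    for i in range(len(seq) - 3):
        first, second, third, fourth = seq[i:i + 4]
        if (first == 'N' and second != 'P'
                and third in 'ST' and fourth != 'P'):
            positions.append(i + 1)
    return positions


if __name__ == '__main__':
    # Two overlapping motifs: NNTS at position 1 and NTSY at position 2.
    print(find_motif_regex('NNTSYS'))
    print(find_motif_manual('NNTSYS'))
```

Both functions report 1-based positions, which is the convention Rosalind uses for its expected output.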
Getting Started
All the code and tests for this program are located in the 11_mprt directory.
To begin, copy the first solution to the program mprt.py:
$ cd 11_mprt
$ cp solution1_regex.py mprt.py
Inspect the usage:
$ ./mprt.py -h
usage: mprt.py [-h] [-d DIR] FILE

Find locations of N-glycosylation motif

positional arguments:
  FILE                  Input text file of UniProt IDs

optional arguments:
  -h, --help            show this help message and exit
  -d DIR, --download_dir DIR
                        Directory ...
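The usage shows that the program takes a file of UniProt IDs and a download directory. The following is a rough sketch of how one sequence might be fetched and saved there, not the book's implementation. The URL pattern, the names fasta_url and fetch_fasta, and the opener parameter are all assumptions made for illustration. The opener defaults to urllib.request.urlopen and exists only so the function can be exercised without network access.

```python
import os
import urllib.request

# Assumed URL pattern for UniProt's REST service; not taken from the chapter.
UNIPROT_URL = 'https://rest.uniprot.org/uniprotkb/{}.fasta'


def fasta_url(uniprot_id):
    """Build the (assumed) download URL for one UniProt ID."""
    return UNIPROT_URL.format(uniprot_id)


def fetch_fasta(uniprot_id, download_dir, opener=urllib.request.urlopen):
    """Download a FASTA record into download_dir unless it is already there.

    Returns the path to the local file.
    """
    os.makedirs(download_dir, exist_ok=True)
    path = os.path.join(download_dir, uniprot_id + '.fasta')

    # Skip the request when a previous run already saved this record.
    if not os.path.isfile(path):
        with opener(fasta_url(uniprot_id)) as response:
            text = response.read().decode('utf-8')
        with open(path, 'w') as out_fh:
            out_fh.write(text)

    return path
```

Saving each record to disk means repeated runs do not need to contact the server again for IDs that were already downloaded.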