Chapter 11. Finding a Protein Motif: Fetching Data and Using Regular Expressions

We’ve spent quite a bit of time now looking for sequence motifs. As described in the Rosalind MPRT challenge, shared or conserved sequences in proteins imply shared functions. In this exercise, I need to identify protein sequences that contain the N-glycosylation motif. The input to the program is a list of protein IDs that will be used to download the sequences from the UniProt website. After demonstrating how to manually and programmatically download the data, I’ll show how to find the motif using a regular expression and by writing a manual solution.

You will learn:

  • How to programmatically fetch data from the internet

  • How to write a regular expression to find the N-glycosylation motif

  • How to manually find the N-glycosylation motif
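As a preview of the last two points, here is a minimal sketch of both approaches. It assumes the motif as the Rosalind MPRT problem states it, N{P}[ST]{P}: an N, then any residue except P, then S or T, then any residue except P. The function names are hypothetical and are not taken from the book's solutions. Positions are 1-based, as Rosalind expects, and the regular expression uses a lookahead so that overlapping matches are not skipped.

```python
import re

# N{P}[ST]{P} written as a regex. The lookahead (?=...) consumes no
# characters, so overlapping occurrences are all reported.
MOTIF = re.compile(r'(?=N[^P][ST][^P])')


def find_motif_regex(seq: str) -> list:
    """Return 1-based start positions of the motif using a regex."""
    return [match.start() + 1 for match in MOTIF.finditer(seq)]


def is_motif(kmer: str) -> bool:
    """Check a single 4-mer against N{P}[ST]{P} without a regex."""
    return (len(kmer) == 4 and kmer[0] == 'N' and kmer[1] != 'P'
            and kmer[2] in 'ST' and kmer[3] != 'P')


def find_motif_manual(seq: str) -> list:
    """Return 1-based start positions by testing every 4-mer."""
    return [i + 1 for i in range(len(seq) - 3) if is_motif(seq[i:i + 4])]


if __name__ == '__main__':
    # 'NNTSYS' contains two overlapping occurrences: NNTS and NTSY.
    print(find_motif_regex('NNTSYS'))
    print(find_motif_manual('NNTSYS'))
```

Without the lookahead, a plain pattern would consume NNTS and miss the NTSY that starts one residue later, which is why the two functions are worth comparing on overlapping input.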

Getting Started

All the code and tests for this program are located in the 11_mprt directory. To begin, copy the first solution to the program mprt.py:

$ cd 11_mprt
$ cp solution1_regex.py mprt.py

Inspect the usage:

$ ./mprt.py -h
usage: mprt.py [-h] [-d DIR] FILE

Find locations of N-glycosylation motif

positional arguments:
  FILE                  Input text file of UniProt IDs

optional arguments:
  -h, --help            show this help message and exit
  -d DIR, --download_dir DIR
                        Directory ...
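The usage shows the two inputs the program works from: a file of UniProt IDs and a download directory. A rough sketch of how one ID might be fetched into that directory follows. Everything here is an assumption rather than the book's code: the URL pattern (UniProt's REST endpoint), the helper names, and the handling of Rosalind-style IDs such as P07204_TRBM_HUMAN, where only the part before the first underscore is taken as the accession.

```python
import os
import urllib.request

# Assumed URL pattern; the text above does not specify one.
UNIPROT_URL = 'https://rest.uniprot.org/uniprotkb/{}.fasta'


def fasta_url(prot_id: str) -> str:
    """Build the download URL for one protein ID (hypothetical helper)."""
    # Assumption: IDs like 'P07204_TRBM_HUMAN' carry the accession first.
    accession = prot_id.strip().split('_')[0]
    return UNIPROT_URL.format(accession)


def fetch_fasta(prot_id: str, download_dir: str) -> str:
    """Download one FASTA record unless a local copy already exists.

    Returns the path of the local file.
    """
    os.makedirs(download_dir, exist_ok=True)
    path = os.path.join(download_dir, prot_id.strip() + '.fasta')
    if not os.path.isfile(path):
        # Network access happens only when the file is not cached.
        with urllib.request.urlopen(fasta_url(prot_id)) as response:
            text = response.read().decode('utf-8')
        with open(path, 'w') as out_fh:
            out_fh.write(text)
    return path
```

Checking for an existing file before downloading keeps repeated runs from hitting the server again for the same ID, which is one plausible reason to have a download directory option at all.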
