Skip to Content
Mastering Python for Bioinformatics
book

Mastering Python for Bioinformatics

by Ken Youens-Clark
May 2021
Intermediate to advanced
454 pages
10h 42m
English
O'Reilly Media, Inc.
Content preview from Mastering Python for Bioinformatics

Chapter 16. FASTX grep: Creating a Utility Program to Select Sequences

A colleague asked me once to find all the RNA sequences in a FASTQ file that had a description or name containing the string LSU (for long subunit RNA). Although it’s possible to solve this problem for FASTQ files by using the grep program1 to find all the lines of a file matching some pattern, writing a solution in Python allows you to create a program that could be expanded to handle other formats, like FASTA, as well as to select records based on other criteria, such as length or GC content. Additionally, you can add options to change the output sequence format and introduce conveniences for the user like guessing the input file’s format based on the file extension.

In this chapter, you will learn:

  • About the structure of a FASTQ file

  • How to perform a case-insensitive regular expression match

  • About DWIM (Do What I Mean) and DRY (Don’t Repeat Yourself) ideas in code

  • How to use and and or operations to reduce Boolean values and bits

Finding Lines in a File Using grep

The grep program can find all the lines in a file matching a given pattern. If I search for LSU in one of the FASTQ files, it finds two header lines containing this pattern:

$ grep LSU tests/inputs/lsu.fq
@ITSLSUmock2p.ITS_M01380:138:000000000-C9GKM:1:1101:14440:2042 2:N:0
@ITSLSUmock2p.ITS_M01384:138:000000000-C9GKM:1:1101:14440:2043 2:N:0

If the goal were only to find how many sequences contain this string, I could pipe this into ...

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Start your free trial

You might also like

Bioinformatics Programming Using Python

Bioinformatics Programming Using Python

Mitchell L Model

Publisher Resources

ISBN: 9781098100872Errata PageSupplemental Content