Chapter 16. FASTX grep: Creating a Utility Program to Select Sequences

A colleague asked me once to find all the RNA sequences in a FASTQ file that had a description or name containing the string LSU (for long subunit RNA). Although it’s possible to solve this problem for FASTQ files by using the grep program1 to find all the lines of a file matching some pattern, writing a solution in Python allows you to create a program that could be expanded to handle other formats, like FASTA, as well as to select records based on other criteria, such as length or GC content. Additionally, you can add options to change the output sequence format and introduce conveniences for the user like guessing the input file’s format based on the file extension.

In this chapter, you will learn:

  • About the structure of a FASTQ file

  • How to perform a case-insensitive regular expression match

  • About DWIM (Do What I Mean) and DRY (Don’t Repeat Yourself) ideas in code

  • How to use and and or operations to reduce Boolean values and bits

Finding Lines in a File Using grep

The grep program can find all the lines in a file matching a given pattern. If I search for LSU in one of the FASTQ files, it finds two header lines containing this pattern:

$ grep LSU tests/inputs/lsu.fq
@ITSLSUmock2p.ITS_M01380:138:000000000-C9GKM:1:1101:14440:2042 2:N:0
@ITSLSUmock2p.ITS_M01384:138:000000000-C9GKM:1:1101:14440:2043 2:N:0

If the goal were only to find how many sequences contain this string, I could pipe this ...

Get Mastering Python for Bioinformatics now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.