Chapter 16. FASTX grep: Creating a Utility Program to Select Sequences

A colleague asked me once to find all the RNA sequences in a FASTQ file that had a description or name containing the string LSU (for long subunit RNA). Although itâs possible to solve this problem for FASTQ files by using the grep program¹ to find all the lines of a file matching some pattern, writing a solution in Python allows you to create a program that could be expanded to handle other formats, like FASTA, as well as to select records based on other criteria, such as length or GC content. Additionally, you can add options to change the output sequence format and introduce conveniences for the user like guessing the input fileâs format based on the file extension.

In this chapter, you will learn:

About the structure of a FASTQ file
How to perform a case-insensitive regular expression match
About DWIM (Do What I Mean) and DRY (Donât Repeat Yourself) ideas in code
How to use and and or operations to reduce Boolean values and bits

Finding Lines in a File Using grep

The grep program can find all the lines in a file matching a given pattern. If I search for LSU in one of the FASTQ files, it finds two header lines containing this pattern:

$ grep LSU tests/inputs/lsu.fq
@ITSLSUmock2p.ITS_M01380:138:000000000-C9GKM:1:1101:14440:2042 2:N:0
@ITSLSUmock2p.ITS_M01384:138:000000000-C9GKM:1:1101:14440:2043 2:N:0

If the goal were only to find how many sequences contain this string, I could pipe this ...

Get Mastering Python for Bioinformatics now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.

Start your free trial

Mastering Python for Bioinformatics by Ken Youens-Clark

Chapter 16. FASTX grep: Creating a Utility Program to Select Sequences

Finding Lines in a File Using grep

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly