How to do it...

Before we start coding, let's take a look at the FASTQ file, which contains many records like the following one:

@SRR003258.1 30443AAXX:1:1:1053:1999 length=51
ACCCCCCCCCACCCCCCCCCCCCCCCCCCCCCCCCCCACACACACCAACAC
+
=IIIIIIIII5IIIIIII>IIII+GIIIIIIIIIIIIII(IIIII01&III

Line 1 starts with @, followed by a sequence identifier and a description string. The description string varies with the sequencer or database source, but is normally amenable to automated parsing.
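To make the structure of the first line concrete, here is a minimal sketch that splits a header into its identifier and description. The function name parse_header is hypothetical, and the sketch assumes that the identifier ends at the first space, which holds for the record shown above but may not hold for every source.

```python
def parse_header(line):
    """Split a FASTQ header line into (identifier, description).

    Assumption: the identifier is everything between the leading '@'
    and the first space; the remainder is the description.
    """
    line = line.rstrip("\n")
    if not line.startswith("@"):
        raise ValueError("a FASTQ header line must start with '@'")
    # partition() returns an empty description when there is no space.
    identifier, _, description = line[1:].partition(" ")
    return identifier, description


# Header taken from the record shown above.
identifier, description = parse_header(
    "@SRR003258.1 30443AAXX:1:1:1053:1999 length=51"
)
print(identifier)
print(description)
```

Because the description format depends on where the file came from, any further parsing of it (for example, reading the length field) should be treated as source-specific.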

The second line contains the DNA sequence, just as in a FASTA file. The third line is a + sign, sometimes followed by a repeat of the description from the first line.

The fourth line contains a quality value for each base of the read on line two. Each letter ...
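The four-line layout can be illustrated with a small standard-library sketch, which is not the recipe's own code. It makes several assumptions: every record occupies exactly four lines with no blank lines in between, the quality string uses the common Phred+33 (Sanger) encoding, and the names read_fastq and decode_quality are hypothetical. The record in the example is made up for illustration.

```python
import io

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger) quality encoding


def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from a FASTQ file handle.

    Assumes exactly four lines per record and no blank lines.
    """
    while True:
        header = handle.readline()
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Drop the leading '@' and the trailing newline from the header.
        yield header.rstrip("\n")[1:], sequence, quality


def decode_quality(quality, offset=PHRED_OFFSET):
    """Convert each quality character to an integer score."""
    return [ord(ch) - offset for ch in quality]


# A made-up record, used only to exercise the functions.
example = io.StringIO("@read1 made-up example\nACGT\n+\nII5+\n")
records = list(read_fastq(example))
scores = decode_quality(records[0][2])
print(records[0])
print(scores)
```

Checking that the sequence and quality lines have the same length is a cheap way to catch truncated or corrupted records early. Older files may use a different offset, so the encoding should be confirmed before the scores are interpreted.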
