6 Parsing DNA Data Files

Several institutes are collecting large databases of DNA information. In the United States, a large repository is GenBank, which is sponsored by the National Institutes of Health (http://www.ncbi.nlm.nih.gov/Genbank/index.html). This chapter develops programs capable of reading files stored in three of the most popular formats: FASTA, GenBank, and ASN.1.

6.1 FASTA Files

The FASTA format is extremely simple, but it contains very little information aside from the sequence. A typical FASTA file is shown in Figure 6-1.

The first line contains a small header that may vary in content. In this case, the accession number, species name, and chromosome number are given. ...
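As a rough illustration of how a file of this shape might be read, the following sketch splits FASTA text into header and sequence pairs. It assumes the usual convention that a header line begins with ">" and that the sequence may wrap over several lines. The function name parse_fasta and the sample record are hypothetical placeholders, not the contents of Figure 6-1.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Assumes each record starts with a header line beginning with '>'
    and that the following lines, up to the next header, hold the sequence.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)


# Made-up record for illustration only (not real database content).
example = """>XX000000.1 Example species chromosome 1
ACGTACGTAC
GTTAGC
""".splitlines()

records = list(parse_fasta(example))
print(records)
```

Here records holds a single tuple: the header text without the leading ">" and the sequence joined into one string. Because the function accepts any iterable of lines, an open file object could be passed in place of the list.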
