Bioinformatics Data Skills

by Vince Buffalo
July 2015
Intermediate to advanced
538 pages
15h 29m
English
O'Reilly Media, Inc.

Chapter 10. Working with Sequence Data

One of the core issues of Bioinformatics is dealing with a profusion of (often poorly defined or ambiguous) file formats. Some ad hoc simple human readable formats have over time attained the status of de facto standards.

Peter Cock et al. (2010)

Good programmers know what to write. Great ones know what to rewrite (and reuse).

The Cathedral and the Bazaar, Eric S. Raymond

Nucleotide (and protein) sequences are stored in two plain-text formats widespread in bioinformatics: FASTA and FASTQ, pronounced fast-ah (or fast-A) and fast-Q, respectively. We’ll discuss each format and its limitations in this section, and then look at some tools for working with data in these formats. This is a short chapter, but one with an important lesson: beware of common pitfalls when working with ad hoc bioinformatics formats. Small mistakes in handling file formats can take a disproportionate amount of time and energy to discover and fix, so mind these details early on.

The FASTA Format

The FASTA format originates from the FASTA alignment suite, created by William R. Pearson and David J. Lipman. It is used to store any sort of sequence data that does not require per-base quality scores, including reference genome files, protein sequences, coding DNA sequences (CDS), transcript sequences, and so on. FASTA can also be used to store multiple alignment data, but we won’t discuss this specialized variant of the format here. We’ve encountered ...
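
To make the kind of data described above concrete, here is a minimal Python sketch of reading FASTA records. It assumes the common layout in which each record starts with a description line beginning with ">" and is followed by one or more lines of sequence. The function name read_fasta and the two example records are hypothetical, made up for illustration; they are not taken from the book.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from a FASTA-formatted text stream.

    Assumes each record starts with a '>' description line followed by
    one or more sequence lines (hypothetical helper, for illustration only).
    """
    description = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:].strip()
            chunks = []
        elif description is None:
            # Fail loudly on malformed input rather than guess.
            raise ValueError("sequence data found before the first '>' line")
        else:
            # Sequences are often wrapped across lines; join them back up.
            chunks.append(line.strip())
    if description is not None:
        yield description, "".join(chunks)


# Made-up example data: two records, the first wrapped over two lines.
example = io.StringIO(
    ">seq1 hypothetical example record\n"
    "ACGTAC\n"
    "GGTA\n"
    ">seq2\n"
    "TTGCA\n"
)
records = list(read_fasta(example))
```

The sketch raises an error on sequence data that appears before any description line instead of silently skipping it, in keeping with the chapter's advice to mind file format details early. For real work, an established parser is usually a safer choice than a hand-rolled one.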

ISBN: 9781449367480