This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
Chapter 11: BLAST Databases
sequences may not have been processed to remove sequencing vectors or low-quality
reads, so the quality of the sequences varies. Finally, the htg database contains all
high-throughput genomic sequences that correspond to large genomic fragments
from various organisms.
Custom BLAST Databases
In many cases, using a custom database rather than a standard database is more
efficient. For example, if you’re only interested in searching against human
sequences, there’s no point in including the rest of the public database. But the
total cost of a BLAST search also includes creating the database, so it isn’t always
more efficient to use a custom database. Suppose you’ve just cloned a dog gene with
mutations that resemble those seen in a human disease, and you want to know which
human proteins correspond to the dog protein. You can build a human protein database
and search it, or you can search against nr and ignore anything that isn’t human. If
this is the only experiment you’re going to perform, it’s probably more efficient to
search nr than to build a custom database.
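The trade-off described above is simple arithmetic, and a rough sketch can make it concrete. The function name and every number below are hypothetical, chosen only for illustration; real build and search times depend on the hardware and the databases involved.

```python
import math

def break_even_searches(build_cost, custom_search_cost, full_search_cost):
    """Return the smallest number of searches at which building a custom
    database costs no more than searching the full database every time.

    All costs are in the same (arbitrary) unit, e.g. minutes.
    Returns None if the custom database saves nothing per search.
    """
    saving_per_search = full_search_cost - custom_search_cost
    if saving_per_search <= 0:
        return None
    # build_cost + n * custom <= n * full  =>  n >= build_cost / saving
    return math.ceil(build_cost / saving_per_search)

# Hypothetical figures: 30 minutes to build the custom database,
# 1 minute per search against it, 4 minutes per search against nr.
print(break_even_searches(30, 1, 4))
```

With these made-up figures the custom database only pays for itself after ten searches, which is why a one-off experiment is usually better served by searching nr directly.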
If you occasionally want to make custom databases, it’s worth getting to know one of
the batch retrieval systems available on the Internet. You can make custom databases
easily using these web-based systems. If you find yourself making custom databases
frequently and running into the limitations of web-based systems, you will probably
want an in-house database. Setting one up can be a nontrivial task involving many
hours of work and expensive computers, or it can be a relatively simple operation,
depending on the kind of performance and features you want. You will learn more about
this topic shortly, but let’s first discuss sequence databases in general.
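At the simple end of that range, the raw material for a custom database can be nothing more than a filtered FASTA file. The sketch below is one possible approach, not a description of any particular retrieval system: it assumes the organism name appears in square brackets in each header line (a common convention in protein FASTA files, but an assumption here), and the function names and sample records are invented for illustration.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def filter_fasta(handle, keyword):
    """Yield only the records whose header contains the keyword."""
    for header, sequence in read_fasta(handle):
        if keyword in header:
            yield header, sequence

# Invented sample records; the bracketed organism tag is an assumption.
SAMPLE = """>prot1 hypothetical protein A [Homo sapiens]
MKTAYIAKQR
QISFVKSHFS
>prot2 hypothetical protein B [Canis lupus familiaris]
MSDNELLKKA
>prot3 hypothetical protein C [Homo sapiens]
MGSSHHHHHH
"""

human = list(filter_fasta(io.StringIO(SAMPLE), "[Homo sapiens]"))
for header, sequence in human:
    print(f">{header}\n{sequence}")
```

The filtered file would still need to be converted into a BLAST database with a formatting tool such as formatdb or makeblastdb before it could be searched.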
Sequence Databases
The sequences in BLAST databases come from sequence databases. But what are
sequence databases and where do you get them? The answers to these simple questions
are surprisingly complex. Sequence databases come in many shapes and sizes.
Some are just collections of raw sequence data from genome sequencing projects,
while others contain comprehensive information about the origin and function of the
sequences. Unfortunately, there is no one-stop shop for all the information you may
want, but one service is worth mentioning above all others: the International
Nucleotide Sequence Database.
International Nucleotide Sequence Database
Probably the most important molecular biology resource is the public sequence
database maintained by the International Nucleotide Sequence Database (INSD). It is
composed of three parties: the DNA Data Bank of Japan (DDBJ, http://www.ddbj.nig.),
the European Molecular Biology Laboratory (EMBL), and GenBank from the National
Center for Biotechnology Information (NCBI, http://). This consortium collaborates
to form the largest public repository for DNA and protein sequences in the world.
Because it is such an important resource, this chapter spends some time exploring it.
Database Growth
The amount of publicly available sequence has been growing geometrically, doubling
approximately every 14 months (see Figure 11-2). Fortunately, computer technology
has kept pace. While it seems scary that GenBank is currently approaching 100 GB and
will be half a terabyte in a few years, it’s nice to know that this isn’t going to
be a problem. Not every database grows so fast, though. Organism-specific databases
such as the Saccharomyces Genome Database, WormBase, and FlyBase are growing at a
more moderate pace, principally because the sequences of their genomes are complete.
But many new genome projects are just getting started, and their databases will
probably grow very quickly.
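The doubling figure can be turned into a rough projection. The sketch below assumes idealized exponential growth with a fixed 14-month doubling time, which real databases only approximate; the function names are illustrative.

```python
import math

def projected_size(size_now, months, doubling_months=14.0):
    """Size after `months`, assuming a constant doubling time."""
    return size_now * 2 ** (months / doubling_months)

def months_to_reach(size_now, target, doubling_months=14.0):
    """Months needed to grow from size_now to target at that doubling time."""
    return doubling_months * math.log2(target / size_now)

# From roughly 100 GB to half a terabyte (500 GB):
months = months_to_reach(100, 500)
print(f"{months:.1f} months")  # about 32.5 months
```

Under that assumption, growing from 100 GB to 500 GB takes a little under three years, consistent with the "few years" estimate in the text.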
Flat Files
Sequence databases usually offer their data in several different formats. The FASTA
format is universally accepted for operating on sequences, but many sequence
databases record a lot more data than just the sequence. Such extra information is
commonly presented in a human-readable format called a flat file. The INSD uses two
kinds of flat files. The DDBJ and GenBank flat file formats are identical, while the
EMBL format is slightly different. The following DDBJ/GenBank record corresponds to
a fragment of the Hoxa-11 gene from the coelacanth (the ancient fish on the cover of
the book):
LOCUS       AF287139                 606 bp    DNA     linear   VRT 10-DEC-2000
DEFINITION  Latimeria chalumnae Hoxa-11 gene, partial cds.
VERSION     AF287139.1  GI:11611818
SOURCE      Latimeria chalumnae.
  ORGANISM  Latimeria chalumnae
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Coelacanthiformes; Coelacanthidae; Latimeria.
REFERENCE   1  (bases 1 to 606)
  AUTHORS   Chiu,C.H., Nonaka,D., Xue,L., Amemiya,C.T. and Wagner,G.P.
  TITLE     Evolution of Hoxa-11 in lineages phylogenetically positioned along
            the fin-limb transition
  JOURNAL   Mol. Phylogenet. Evol. 17 (2), 305-316 (2000)
  MEDLINE   20538275
   PUBMED   11083943
REFERENCE   2  (bases 1 to 606)
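Because each line of the record begins with a keyword or continues the previous field, the portion shown above can be read with very little code. The following is a minimal sketch rather than a full flat-file parser: the keyword set covers only the keywords visible in this excerpt, the function names are hypothetical, and it would misread a continuation line that happened to begin with one of those keywords. For complete records, a maintained library such as Biopython is the more usual choice.

```python
# Keywords seen in the excerpt above; a real record uses more than these.
KEYWORDS = {
    "LOCUS", "DEFINITION", "VERSION", "SOURCE", "ORGANISM",
    "REFERENCE", "AUTHORS", "TITLE", "JOURNAL", "MEDLINE", "PUBMED",
}

def parse_flat_file(text):
    """Return a list of (keyword, value) pairs in record order.

    A line whose first word is a known keyword starts a new field;
    any other non-empty line is appended to the previous field.
    """
    fields = []
    for raw in text.splitlines():
        line = " ".join(raw.split())  # collapse column padding
        if not line:
            continue
        head, _, rest = line.partition(" ")
        if head in KEYWORDS:
            fields.append([head, rest])
        elif fields:
            fields[-1][1] = (fields[-1][1] + " " + line).strip()
    return [(keyword, value) for keyword, value in fields]

def first_value(fields, keyword):
    """Return the first value recorded for a keyword, or None."""
    for key, value in fields:
        if key == keyword:
            return value
    return None

RECORD = """LOCUS AF287139 606 bp DNA linear VRT 10-DEC-2000
DEFINITION Latimeria chalumnae Hoxa-11 gene, partial cds.
VERSION AF287139.1 GI:11611818
SOURCE Latimeria chalumnae.
ORGANISM Latimeria chalumnae
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Coelacanthiformes; Coelacanthidae; Latimeria.
REFERENCE 1 (bases 1 to 606)
AUTHORS Chiu,C.H., Nonaka,D., Xue,L., Amemiya,C.T. and Wagner,G.P.
TITLE Evolution of Hoxa-11 in lineages phylogenetically positioned along
the fin-limb transition
JOURNAL Mol. Phylogenet. Evol. 17 (2), 305-316 (2000)
MEDLINE 20538275
PUBMED 11083943
REFERENCE 2 (bases 1 to 606)
"""

fields = parse_flat_file(RECORD)
print(first_value(fields, "DEFINITION"))
print(first_value(fields, "PUBMED"))
```

Fields are kept as an ordered list rather than a dictionary because keywords such as REFERENCE repeat within a single record. Note that this sketch folds the taxonomic lineage into the ORGANISM value, since those lines arrive as continuations.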
