BLAST

This is the Title of the Book, eMatter Edition


Chapter 11: BLAST Databases

sequences may not have been processed to remove sequencing vectors or low-quality reads, so the quality of the sequences varies. Finally, the htg database contains all high-throughput genomic sequences that correspond to large genomic fragments from various organisms.

Custom BLAST Databases

In many cases, using a custom database rather than a standard database is more efficient. For example, if you’re only interested in searching against human sequences, there’s no point in including the rest of the public database. But the total cost of a BLAST search also includes creating the database, so it isn’t always more efficient to use a custom database. Suppose you’ve just cloned a dog gene with mutations that bear some similarity to a human disease, and you want to know which human proteins correspond to the dog protein. You can build a human protein database and search it, or you could search against nr and ignore anything that isn’t human. If this is the only experiment you’re going to perform, it’s probably more efficient to search nr than build a custom database.
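As a rough illustration of what "a human protein database" means at the file level, the following sketch pulls human entries out of a FASTA collection before a database is built from them. It assumes that the organism name appears in each definition line, which is a common convention but not a guarantee; the records and identifiers (`example1` and so on) are invented for the example, and `read_fasta` and `filter_by_keyword` are hypothetical helper names, not part of BLAST.

```python
def read_fasta(lines):
    """Yield (definition, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_keyword(records, keyword):
    """Keep records whose definition line contains keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [(h, s) for h, s in records if keyword in h.lower()]


# Invented sample data; real definition lines may be formatted differently.
SAMPLE = """\
>example1 hypothetical protein [Homo sapiens]
MKTAYIAKQR
QISFVKSHFS
>example2 hypothetical protein [Canis lupus familiaris]
MKTAYLAKQR
>example3 hypothetical protein [Mus musculus]
MSTAYIAKQK
"""

records = list(read_fasta(SAMPLE.splitlines()))
human = filter_by_keyword(records, "Homo sapiens")

# Write the subset back out as FASTA text, ready to be formatted as a database.
human_fasta = "".join(f">{h}\n{s}\n" for h, s in human)
print(human_fasta, end="")
```

Filtering by a substring of the definition line is crude. It misses entries that omit the organism name and can match entries that merely mention it, so treat the result as a starting point rather than a complete human set.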

If you occasionally want to make custom databases, it’s worth getting to know one of the batch retrieval systems available on the Internet. You can make custom databases easily using these web-based systems. If you find yourself making custom databases frequently and find limitations with using web-based systems, you will probably want to have an in-house database. This can be a nontrivial task involving many hours of work and expensive computers, or it can be a relatively simple operation. It depends on the kind of performance and features you want. You will learn more about this topic in just a bit, but let’s first discuss sequence databases in general.

Sequence Databases

The sequences in BLAST databases come from sequence databases. But what are sequence databases and where do you get them? The answers to these simple questions are surprisingly complex. Sequence databases come in many shapes and sizes. Some are just collections of raw sequence data from genome sequencing projects, while others contain comprehensive information about the origin and function of the sequences. Unfortunately, there isn’t a one-stop shopping place to get all the information you may want, but there is one particular service worth mentioning above all others: the International Nucleotide Sequence Database.

International Nucleotide Sequence Database

Probably the most important molecular biology resource is the public sequence database maintained by the International Nucleotide Sequence Database (INSD). It is composed of three parties: the DNA Data Bank of Japan (DDBJ, http://www.ddbj.nig.ac.jp), the European Molecular Biology Laboratory (EMBL, http://www.embl.org), and GenBank from the National Center for Biotechnology Information (NCBI, http://ncbi.nlm.nih.gov/GenBank). This consortium collaborates to form the largest public repository for DNA and protein sequences in the world. Because it is such an important resource, this chapter spends some time exploring it.

Database Growth

The amount of publicly available sequence has been growing geometrically, doubling approximately every 14 months (see Figure 11-2). Fortunately, computer technology has also kept pace. While it seems scary that GenBank is currently approaching 100 GB and will be half a terabyte in a few years, it’s nice to know that this isn’t going to be a problem. Not every database grows so fast, though. Organism-specific databases such as the Saccharomyces Genome Database, WormBase, and FlyBase are growing at a more moderate pace, principally because the sequence of their genomes is complete. But many new genome projects are just getting started, and they will probably grow very quickly.
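The "half a terabyte in a few years" estimate follows directly from the doubling time. The sketch below works through the arithmetic, assuming the 14-month doubling period holds steady and taking 100 GB and 500 GB as round starting and target sizes; the function names are chosen only for this illustration.

```python
import math


def months_to_reach(current_size, target_size, doubling_months):
    """Months needed to grow from current_size to target_size
    when the size doubles every doubling_months."""
    return doubling_months * math.log2(target_size / current_size)


def projected_size(current_size, months, doubling_months):
    """Size after the given number of months of steady doubling."""
    return current_size * 2 ** (months / doubling_months)


months = months_to_reach(100, 500, 14)  # sizes in GB
print(f"{months:.1f} months (about {months / 12:.1f} years)")
# prints: 32.5 months (about 2.7 years)
```

Under these assumptions, a fivefold increase takes a little under three years, which is consistent with the estimate in the text.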

Flat Files

Sequence databases usually offer their data in several different formats. The FASTA format is universally accepted for operating on sequences, but many sequence databases record a lot more data than just the sequence. Such extra information is commonly presented in a human-readable format called a flat file. The INSD uses two kinds of flat files. The DDBJ and GenBank flat file formats are identical, while the EMBL format is slightly different. The following DDBJ/GenBank record corresponds to a fragment of the Hoxa-11 gene from the coelacanth (the ancient fish on the cover of the book):

LOCUS       AF287139                 606 bp    DNA     linear   VRT 10-DEC-2000
DEFINITION  Latimeria chalumnae Hoxa-11 gene, partial cds.
ACCESSION   AF287139
VERSION     AF287139.1  GI:11611818
KEYWORDS    .
SOURCE      Latimeria chalumnae.
  ORGANISM  Latimeria chalumnae
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Coelacanthiformes; Coelacanthidae; Latimeria.
REFERENCE   1  (bases 1 to 606)
  AUTHORS   Chiu,C.H., Nonaka,D., Xue,L., Amemiya,C.T. and Wagner,G.P.
  TITLE     Evolution of Hoxa-11 in lineages phylogenetically positioned along
            the fin-limb transition
  JOURNAL   Mol. Phylogenet. Evol. 17 (2), 305-316 (2000)
  MEDLINE   20538275
   PUBMED   11083943
REFERENCE   2  (bases 1 to 606)
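Although flat files are meant for human readers, their fixed layout also makes the header easy to read programmatically. The following is a minimal sketch, not a complete parser: it assumes the conventional DDBJ/GenBank layout in which the keyword occupies the first 12 columns and the value starts at column 13, with continuation lines left blank in the keyword columns. The function name `parse_flat_file_header` is hypothetical, and the sample is a shortened copy of the record above.

```python
def parse_flat_file_header(text):
    """Map each header keyword to a list of values (one per occurrence).

    Assumes the keyword sits in columns 1-12 and the value begins at
    column 13; lines with blank keyword columns continue the previous value.
    """
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line[:12].strip(), line[12:].strip()
        if key:
            current = key
            fields.setdefault(current, []).append(value)
        elif current is not None:
            fields[current][-1] += " " + value
    return fields


SAMPLE = """\
LOCUS       AF287139                 606 bp    DNA     linear   VRT 10-DEC-2000
DEFINITION  Latimeria chalumnae Hoxa-11 gene, partial cds.
ACCESSION   AF287139
VERSION     AF287139.1  GI:11611818
SOURCE      Latimeria chalumnae.
  ORGANISM  Latimeria chalumnae
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Coelacanthiformes; Coelacanthidae; Latimeria.
"""

fields = parse_flat_file_header(SAMPLE)
print(fields["ACCESSION"][0])
print(fields["DEFINITION"][0])
print(fields["ORGANISM"][0])
```

Storing a list per keyword keeps repeated entries such as REFERENCE apart. The sketch flattens subkeywords like ORGANISM and AUTHORS to the top level, so a real parser would need to track which parent keyword each one belongs to.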


BLAST by Ian Korf, Mark Yandell, Joseph Bedell
