This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
The mechanics of creating BLAST databases are quite simple; just run formatdb or
xdformat with the proper syntax. Chapter 10 discussed this topic, and you’ll find the
command summaries in Chapters 13 and 14. There are, however, subtleties that
make this process more complicated than it may appear.
One of the most common database complications occurs with large files. Most
computers today use 32-bit operating systems and 32-bit filesystems. This puts a hard
limit of 4 GB (2^32 bytes) on the amount of addressable RAM and 4 GB on the size of any particular file.
(You may find that you are actually limited to less than 4 GB in both cases, and a 2-
GB limit is quite common.) Most computers these days don’t have or need 4-GB
RAM. However, most hard disks are quite a bit larger than 4 GB, and files can
sometimes exceed these limits. Therefore, many operating systems have the option of using
64-bit filesystems. Unfortunately, you can’t just change the filesystem and expect
everything to work. Making software applications aware of large files often means
recompiling them with special flags, and the process of migrating to a 64-bit filesys-
tem can be painful because the applications don’t tell you useful things like “I’m not
large-file-aware.” Instead, they just sit there quietly burning CPU time while they run
in endless loops.
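One low-tech way to avoid that kind of silent failure is to check a file's size against the suspected limit before handing it to a tool. The following is a minimal sketch, not part of the BLAST distribution: the function name exceeds_limit and the 2-GB default are assumptions chosen for illustration, and the check itself only works if wc on your system is large-file-aware.

```shell
#!/bin/sh
# Hypothetical helper (not part of NCBI-BLAST): succeed if FILE is larger
# than LIMIT bytes. LIMIT defaults to 2 GB (2147483648 bytes), the common
# ceiling on systems without large-file support.
exceeds_limit() {
    file=$1
    limit=${2:-2147483648}
    # Reading from stdin makes wc print only the byte count; the
    # arithmetic expansion strips any padding some wc versions add.
    size=$(( $(wc -c < "$file") ))
    [ "$size" -gt "$limit" ]
}

# Example use: warn before formatting a suspiciously large FASTA file.
# if exceeds_limit fasta_db; then
#     echo "fasta_db is larger than 2 GB; check large-file support" >&2
# fi
```

A second argument overrides the default, so exceeds_limit fasta_db 4294967296 would test against 4 GB instead.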
Large NCBI databases
The standard protocol for formatting a database is to run formatdb on a FASTA file:
formatdb -p F -i fasta_db -o
NCBI-BLAST databases are physically limited to 4 GB of sequence, which corre-
sponds to about 4 billion amino acids or 16 billion nucleotides (nucleotides are com-
pressed 4:1). On a 32-bit filesystem, the previous approach won’t let you use all this
space because the FASTA file can’t contain more than 2 or 4 billion letters. Creating
a database larger than 2 or 4 billion letters requires piping the sequences to formatdb:
cat fasta1 fasta2 fasta3 | formatdb -p F -i stdin -n my_db -o
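Before piping several files together, it can help to know how many letters they contain in total. This is a minimal sketch, assuming plain uncompressed FASTA input in which header lines begin with ">"; count_letters is a hypothetical helper name, not a BLAST tool.

```shell
#!/bin/sh
# Hypothetical helper: print the total number of sequence letters in the
# given FASTA files. Header lines (starting with ">") and whitespace are
# not counted.
count_letters() {
    # Drop headers, strip whitespace and line endings, count what is left.
    # The arithmetic expansion normalizes any padding in wc's output.
    echo $(( $(cat "$@" | grep -v '^>' | tr -d ' \t\r\n' | wc -c) ))
}

# Example use with the files from the command above:
# total=$(count_letters fasta1 fasta2 fasta3)
# echo "about to format $total letters"
```

Because the function reads every file once, it roughly doubles the I/O cost of formatting, which may or may not be acceptable for very large collections.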
But what if you happen to have more than 16 billion letters? This isn’t a problem
because formatdb automatically segments individual BLAST databases into files
containing up to 16 billion nucleotides each and creates something called an alias database that
stitches them all together. This is really convenient because it means that you can
search enormous databases even on 32-bit filesystems. Alias databases are discussed
in more detail later in this chapter.
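The arithmetic behind the segmentation is ordinary ceiling division. The sketch below assumes the figure quoted above, 16 billion nucleotides per file, taken here as exactly 16,000,000,000; the cutoff formatdb really uses may differ, and volumes_needed is a hypothetical name used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: estimate how many database files a given number of
# letters would be split into. PER_VOLUME defaults to 16 billion, the
# nucleotide figure quoted in the text (an assumption, not formatdb's
# exact cutoff).
volumes_needed() {
    letters=$1
    per_volume=${2:-16000000000}
    # Ceiling division using integer arithmetic only.
    echo $(( ($letters + $per_volume - 1) / $per_volume ))
}
```

For example, volumes_needed 40000000000 prints 3. For protein databases, a different second argument (such as 4000000000, following the 4 billion amino acids mentioned earlier) would be the analogous assumption.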
It’s still possible to run into file size issues when piping FASTA files to formatdb because
the filesystem maximum may be 2 GB while the implicit BLAST maximum is 4 GB.