This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
CHAPTER 12
Hardware and Software Optimizations
This chapter explores how to optimize BLAST searches for maximum throughput
and will help you get the most out of your current and future hardware and
software. The first rule of BLAST performance is to optimize your BLAST
parameters. Incorrect settings can cause BLAST to run slowly, and you can often
achieve surprising increases in speed by adjusting a parameter or two. Chapter 9
can help you choose the correct parameters for a particular experiment. If you’re
already running BLAST efficiently and want to get the most BLAST performance
possible, read on.
The Persistence of Memory
Modern operating systems cache files. You may hear this referred to as RAM cache
or disk cache, but we’ll just call it cache. Once a file is read from the
filesystem (e.g., a hard disk), it is kept in memory even after it is no longer
in use, assuming there’s enough free RAM to do so. Why cache files? The same file
is often requested repeatedly. Retrieving a file from memory is much faster than
reading it from a disk, so keeping it in memory can save a lot of time. Caching
can be very important in sequential BLAST searches if the database is located on
a slow disk or across a network. While the first search may be limited by the
speed at which the database can be read, subsequent searches can be much faster.
The advantage of caching is most appreciable for insensitive BLAST searches, such as
BLASTN with a large word size. In more sensitive searches, retrieving sequences
from the database becomes a smaller fraction of the total elapsed time. In Table 12-1,
note how the speed increase from caching is a function of sensitivity (here, word
size).
Table 12-1. How caching benefits insensitive searches
Program   Word size   Search 1   Search 2   Speed increase
BLASTN    W=12        12 sec     7 sec      1.71x
BLASTN    W=10        33 sec     28 sec     1.18x
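
The arithmetic behind Table 12-1 can be sketched in a few lines of Python. The
timings are the four rows of the table (including the rows with W=8 and W=6);
the function name caching_effect is made up for this illustration. Under those
assumptions, the sketch suggests that caching saves about the same 5 seconds in
every row, which is consistent with a roughly fixed database read cost that
becomes a smaller fraction of the total as the search becomes more sensitive.

```python
# Timings copied from Table 12-1: (word size, search 1, search 2), in seconds.
TIMINGS = [
    (12, 12, 7),
    (10, 33, 28),
    (8, 57, 52),
    (6, 243, 238),
]


def caching_effect(first_sec, second_sec):
    """Return (seconds saved, speed increase) for an uncached vs. cached search.

    Hypothetical helper for illustration; it is not part of BLAST.
    """
    saved = first_sec - second_sec
    speedup = first_sec / second_sec
    return saved, speedup


for word_size, first, second in TIMINGS:
    saved, speedup = caching_effect(first, second)
    print(f"W={word_size:<2}  saved {saved} sec  speed increase {speedup:.2f}x")
```

Each line of output reports a saving of 5 seconds, while the speed increase
falls from 1.71x at W=12 to 1.02x at W=6.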
BLAST itself doesn’t take much memory, but having a lot of memory assists
caching. Look at the amount of RAM in your current systems and the size of your
BLAST databases. As a rule, your RAM should be at least 20 percent greater than
the size of your largest database. If it isn’t and you do a lot of insensitive
searches, a simple memory upgrade may boost your throughput by 50 percent or
more. However, if most of your searches are sensitive or involve small
databases, adding RAM to all your machines may be less cost-effective than
purchasing a few more servers.
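
The 20 percent rule is easy to check with a short calculation. The following is
a minimal sketch, not a sizing tool: the function names and the example database
sizes are hypothetical, and it compares total RAM against the largest database
only, ignoring memory used by the operating system and other processes.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def required_ram(database_sizes, headroom_percent=20):
    """RAM (bytes) suggested by the rule: largest database plus 20 percent."""
    largest = max(database_sizes)
    return largest * (100 + headroom_percent) // 100


def ram_shortfall(ram_bytes, database_sizes, headroom_percent=20):
    """Bytes of RAM missing under the rule, or 0 if the rule is satisfied."""
    return max(0, required_ram(database_sizes, headroom_percent) - ram_bytes)


# Hypothetical example: a 2 GiB and a 10 GiB database on an 8 GiB machine.
sizes = [2 * GIB, 10 * GIB]
print(required_ram(sizes) / GIB)            # 12.0
print(ram_shortfall(8 * GIB, sizes) / GIB)  # 4.0
print(ram_shortfall(16 * GIB, sizes))       # 0
```

In this made-up case, the 8 GiB machine falls 4 GiB short of the rule, while a
16 GiB machine satisfies it.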
BLAST Pipelines and Caching
If you’re running BLAST as part of a sequence analysis pipeline involving several
BLAST searches and multiple databases, you may want to consider how caching will
affect the execution of the pipeline. For example, look at the typical BLAST-based
sequence analysis pipeline for ESTs depicted in Figure 12-1. The most obvious
approach is to take each EST and pass it through each step. But is this the most
efficient way?
It’s common to design sequence analysis pipelines with the following structure:
for each sequence to analyze {
    for each BLAST search in the pipeline {
        execute BLAST search
    }
}
However, you can switch the inner and outer loops to achieve this structure:
for each BLAST search in the pipeline {
    for each sequence to analyze {
        execute BLAST search
    }
}
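
Given this section’s focus on caching, one plausible reading is that the second
structure keeps the same database in cache across consecutive searches. The
following toy simulation illustrates that idea under loud assumptions: the cache
is modeled as holding a fixed number of whole databases with least-recently-used
eviction (real operating systems cache pages, not whole files), the function
count_database_loads and the EST identifiers are hypothetical, and no sequences
are filtered out between steps. The database names follow Figure 12-1.

```python
def count_database_loads(search_order, cache_slots=1):
    """Count uncached database reads for a list of (query, database) searches.

    Toy model: the cache holds `cache_slots` whole databases and evicts the
    least recently used one when full.
    """
    cache = []  # most recently used database is last
    loads = 0
    for _query, database in search_order:
        if database in cache:
            cache.remove(database)  # cache hit: refresh its position below
        else:
            loads += 1              # cache miss: database is read from disk
            if len(cache) >= cache_slots:
                cache.pop(0)        # evict the least recently used database
        cache.append(database)
    return loads


queries = [f"est_{i}" for i in range(100)]  # hypothetical EST identifiers
databases = ["vector", "ecoli", "nr"]       # databases named in Figure 12-1

# First structure: outer loop over sequences, inner loop over searches.
sequence_major = [(q, d) for q in queries for d in databases]
# Second structure: outer loop over searches, inner loop over sequences.
search_major = [(q, d) for d in databases for q in queries]

print(count_database_loads(sequence_major))  # 300
print(count_database_loads(search_major))    # 3
```

When the cache can hold only one database, the first structure rereads a
database for every search, while the second reads each database once. If all
three databases fit in the cache at the same time (cache_slots=3), both orders
give 3 loads in this model, so the loop order matters most when memory is tight.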
Table 12-1. How caching benefits insensitive searches (continued)
Program   Word size   Search 1   Search 2   Speed increase
BLASTN    W=8         57 sec     52 sec     1.10x
BLASTN    W=6         243 sec    238 sec    1.02x

Figure 12-1. EST annotation pipeline
[Figure: flowchart with the labels "EST sequencing project", "BLASTN vs. vector", "BLASTN vs. E. coli", "BLASTX vs. nr", "contaminants", and "Tentative functional classification"]
