Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
Chapter 12: Hardware and Software Optimizations
One obvious improvement is to employ faster networking. Doing so increases the
cost of each compute node a little and significantly increases the cost of network
switches because gigabit network switches are still quite expensive. However, it is
possible to use a hybrid solution in which the database server is connected to a
mixed-speed network switch via a gigabit line and the compute nodes are connected
to the switch via the more common 100-Mb interface. This is much cheaper than using
gigabit everywhere, and, because a single node rarely needs more than 12.5 MBps
(the nominal limit of a 100-Mb link), it doesn’t hinder performance too much.
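The arithmetic behind the hybrid layout can be sketched in a few lines of Python. This is only a rough illustration: the function names are made up for this example, the figures are nominal link rates that ignore protocol overhead, and the server link is assumed to be shared equally among the nodes that are reading at the same moment.

```python
def link_capacity_mbytes(megabits_per_second):
    """Convert a nominal link speed in megabits/s to megabytes/s (8 bits per byte)."""
    return megabits_per_second / 8.0


def per_node_share(server_link_mbit, node_link_mbit, busy_nodes):
    """Estimate the MB/s each busy node can pull from the database server.

    Assumption: the server link is divided equally among the busy nodes,
    and no node can exceed the capacity of its own link.
    """
    if busy_nodes < 1:
        raise ValueError("busy_nodes must be at least 1")
    server = link_capacity_mbytes(server_link_mbit)
    node = link_capacity_mbytes(node_link_mbit)
    return min(node, server / busy_nodes)


FAST_ETHERNET = 100   # megabits per second
GIGABIT = 1000        # megabits per second

print(link_capacity_mbytes(FAST_ETHERNET))            # 12.5
print(per_node_share(GIGABIT, FAST_ETHERNET, 8))      # 12.5
print(per_node_share(FAST_ETHERNET, FAST_ETHERNET, 8))  # 1.5625
```

Under these assumptions, eight nodes reading at once from a server on a gigabit line can each still fill their own 100-Mb link, while the same eight nodes sharing a 100-Mb server link drop to less than 2 MBps apiece.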
When building file servers, people often neglect to put in enough RAM. For BLAST
database servers, though, you really want as much RAM as possible. Caching applies
on the file-server end, too, and if several computers request data from the file server,
it’s much better if it can be served from memory rather than from disk. If you’re
thinking of using a standalone network-attached storage server as a BLAST database
server, think again. Most don’t have gigabit networking or enough RAM.
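One quick sanity check is to compare the total size of the database files with the memory available for caching. The following Python sketch is illustrative only: the function names are hypothetical, the amount of RAM is passed in by hand rather than detected, and the reserved fraction of 25 percent for the operating system and running programs is an assumed figure, not a measured one.

```python
import os


def database_size_bytes(directory):
    """Add up the sizes of all regular files under a database directory."""
    total = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path):
                total += os.path.getsize(path)
    return total


def fits_in_cache(db_bytes, ram_bytes, reserved_fraction=0.25):
    """Return True if the databases fit in the RAM left over for caching.

    reserved_fraction is an assumed share of memory set aside for the
    operating system and running programs; adjust it for your own machines.
    """
    return db_bytes <= ram_bytes * (1.0 - reserved_fraction)


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Hypothetical numbers: 3 GiB of databases on a server with 4 GiB of RAM.
    print(fits_in_cache(3 * GIB, 4 * GIB))
    # The same databases on a server with only 2 GiB of RAM.
    print(fits_in_cache(3 * GIB, 2 * GIB))
```

If the check fails, some requests will have to be served from disk, and either more RAM or smaller databases are worth considering.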
Keeping local copies of your BLAST databases on each node of the cluster will make
access to the data very fast. Most hard disks can read data at 20 to 30 MB per
second, or about double what you could get from common 100-Mb networking. If your
network is slow, your cluster is large, or your searches are really insensitive (and
therefore fast enough to be limited by data access rather than computation), it’s
much better to have local copies of databases. The main concern with this approach is keeping
the files synchronized and updated with respect to a master copy. This can be done
via rsync or other means. However, if all the nodes update their databases at the
same time across a thin pipe, this operation could take a long time, and the compute
nodes may sit idle.
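One way to soften this problem is to update only a few nodes at a time so that the rest can keep working. The Python sketch below shows the idea under several assumptions: the master pushes the database directory to each node with rsync over the default remote shell, the directory has the same path on every machine, and the node names, the path, and the batch size of two are all hypothetical. By default it only builds the plan; nothing is copied unless dry_run is set to False.

```python
import subprocess


def batches(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def push_command(db_dir, node):
    """Build an rsync command that mirrors db_dir from the master to one node."""
    src = db_dir.rstrip("/") + "/"
    # -a preserves file attributes; --delete removes files the master no longer has.
    return ["rsync", "-a", "--delete", src, f"{node}:{src}"]


def update_nodes(db_dir, nodes, at_once=2, dry_run=True):
    """Update nodes a few at a time and return the planned commands per batch."""
    plan = []
    for group in batches(nodes, at_once):
        commands = [push_command(db_dir, node) for node in group]
        plan.append(commands)
        if not dry_run:
            # Start this batch's transfers together, then wait for all of them
            # before moving on to the next batch.
            running = [subprocess.Popen(cmd) for cmd in commands]
            for proc in running:
                proc.wait()
    return plan


if __name__ == "__main__":
    # Hypothetical node names and database path.
    for batch in update_nodes("/data/blast", ["node01", "node02", "node03"]):
        for cmd in batch:
            print(" ".join(cmd))
```

Staggering the transfers doesn’t make the total update any faster across a thin pipe, but it limits how many nodes are waiting on the network at any one moment.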
A lesser concern is the disks themselves. They cost money and are a potential source
of hardware failure (for this reason, some people advocate running the compute
nodes diskless). When discussing disks, there’s a great deal of debate over IDE
versus SCSI. Drives using the IDE interface are generally slower and less reliable, but are
much less expensive. Experts on both sides of the debate will argue convincingly that
buying one type of drive makes more sense than buying the other. However, for
optimal performance, you really should access the database from cache rather than disk,
and therefore the disk shouldn’t really matter. Those who choose IDE or SCSI aren’t
necessarily fools, but people who fail to put enough RAM in their boxes are.
Distributed Resource Management
If you’re running a lot of BLAST jobs, one problem to consider is how to manage
them to minimize idle time without overloading your computers. Being organized is
the simplest way to schedule jobs. If you’re the only user, you can use simple scripts
to iterate over the various searches and keep your computer comfortably busy. The