
This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
20 Tips to Improve Your BLAST Searches
8.5 Use the Karlin-Altschul Equation to Design Experiments
The Karlin-Altschul equation is very useful for predicting the outcome of a BLAST
experiment, especially in large search spaces. Suppose you want to find exons in the
human genome by looking for similarities in the pufferfish genome. These genomes
last shared a common ancestor about 450 million years ago. You might assume that
any similarities at this distance must be due to evolutionary conservation.
Recall from Chapter 4 that the number of alignments expected by chance (E) is a
function of the search space (M, N), the normalized score (λS), and a minor
constant (K):
E = KMNe^(-λS)
The typical cross-species scoring scheme, +1/-1 match/mismatch, has a target frequency
of 75 percent identity and yields 0.55 nats per aligned letter on average (H). A 50-bp
alignment therefore contains about 27.5 nats. Substituting this normalized score into the
Karlin-Altschul equation with K=0.334, M=1.5 Gbp (assuming half of the human
genome contains repeats), and N=450 Mbp (the size of the repeat-poor pufferfish
genome), you expect about 230,000 alignments by chance. That’s roughly the same
as the number of exons in the human genome. If you want to look for 50-bp exons,
you’ll have to sift through a lot of false positives.
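The arithmetic above can be checked with a short Python sketch. It is only an illustration: the function names expected_alignments and length_for_expect are hypothetical, M and N are taken as plain base-pair counts, and the normalized score is assumed to be simply the alignment length times H. With these rounded inputs the result lands in the low hundreds of thousands, the same order of magnitude as the estimate quoted above.

```python
import math

def expected_alignments(k, m, n, normalized_score):
    """Karlin-Altschul expectation E = K*M*N*exp(-lambda*S).

    normalized_score is lambda*S, expressed in nats.
    """
    return k * m * n * math.exp(-normalized_score)

def length_for_expect(k, m, n, h, target_e):
    """Alignment length at which E falls to target_e.

    Assumes every aligned letter contributes h nats on average,
    so lambda*S = length * h. This inverts the equation above.
    """
    return math.log(k * m * n / target_e) / h

# Values taken from the text; base-pair counts are an assumed reading of the sizes.
K = 0.334   # minor constant for +1/-1 scoring
M = 1.5e9   # non-repeat half of the human genome, in bp
N = 4.5e8   # pufferfish genome, in bp
H = 0.55    # average nats per aligned letter

score_50bp = 50 * H  # 27.5 nats for a 50-bp alignment
e_50bp = expected_alignments(K, M, N, score_50bp)
min_len = length_for_expect(K, M, N, H, 1.0)

print(f"Expected chance alignments at 50 bp: {e_50bp:,.0f}")
print(f"Length at which E drops to 1: {min_len:.1f} bp")
```

Inverting the equation this way is one possible use of it for experiment design: under the same assumptions, it suggests how long an average alignment would need to be before chance hits become rare in this search space.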
To