
This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
Note that the expected HSP (high-scoring pair) length depends on the search
space (m*n) and the relative entropy of the scoring scheme, H, so it varies from
search to search.
To take edge effects into account when calculating an Expect, the expected HSP
length is subtracted from the actual length of the query, m, and the actual number of
residues in the database, n, to give their effective lengths, usually denoted by m´ and
n´, respectively (see Equations 4-12 and 4-13).
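The edge-effect correction can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code BLAST runs: the expected HSP length is assumed to take the standard Karlin-Altschul form ln(k*m*n)/H, the function names are hypothetical, and subtracting the expected length once per database sequence (the `num_sequences` parameter) is an assumption borrowed from common BLAST practice. The values of k and H are the figures commonly quoted for gapped BLOSUM62 searches and serve only as examples, as does the database size.

```python
import math

def expected_hsp_length(k, m, n, h):
    """Expected HSP length, assumed here to be ln(k*m*n) / H (H in nats)."""
    return math.log(k * m * n) / h

def effective_lengths(k, m, n, h, num_sequences=1):
    """Return (expected length, m', n') after the edge-effect correction.

    Assumption: the expected length is subtracted once from the query and
    once per database sequence from the database total.
    """
    length = expected_hsp_length(k, m, n, h)
    m_eff = m - length
    n_eff = n - length * num_sequences
    return length, m_eff, n_eff

# Illustrative parameters only (commonly quoted for gapped BLOSUM62).
k, h = 0.041, 0.14
length, m_eff, n_eff = effective_lengths(
    k, m=300, n=1_000_000_000, h=h, num_sequences=3_000_000
)
print(f"expected HSP length: {length:.1f}")
print(f"effective query length m': {m_eff:.1f}")
print(f"effective database length n': {n_eff:.0f}")
```

With a 300-residue query and a billion-residue database, the expected HSP length under these assumptions already consumes more than half of the query.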
In a large search space, the expected HSP length may be greater than the length of
the query, resulting in a negative effective length, m´. In practice, if the effective
length is less than 1/k, it is set to 1/k, as doing so cancels the contribution of the
short sequence to the Expect; setting $m' = 1/k$, for example, gives
$E = n' e^{-\lambda S}$, a formulation independent of m´.
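The floor on the effective length is easy to check numerically. The sketch below assumes the Karlin-Altschul form of the Expect, E = k * m´ * n´ * exp(-lambda * S); the function names are hypothetical, and the values of k, lambda, the score, and the lengths are purely illustrative. Only m´ is clamped here, although the same floor could equally be applied to n´.

```python
import math

def clamp_effective_length(effective_length, k):
    """Raise an effective length to the floor of 1/k if it falls below it."""
    return max(effective_length, 1.0 / k)

def expect(k, lam, score, m_eff, n_eff):
    """Expect under the assumed form E = k * m' * n' * exp(-lambda * S)."""
    m_eff = clamp_effective_length(m_eff, k)
    return k * m_eff * n_eff * math.exp(-lam * score)

# Illustrative values only.
k, lam, score, n_eff = 0.041, 0.267, 100, 5.0e8

# Two different sub-floor effective query lengths, one of them negative.
e_negative = expect(k, lam, score, m_eff=-50.0, n_eff=n_eff)
e_small = expect(k, lam, score, m_eff=10.0, n_eff=n_eff)

# Both collapse to n' * exp(-lambda * S): the k and 1/k factors cancel.
e_reference = n_eff * math.exp(-lam * score)
print(math.isclose(e_negative, e_reference), math.isclose(e_small, e_reference))
```

Once the floor is applied, every query whose effective length falls below 1/k receives the same Expect for a given score, regardless of how short it actually is.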
Unfortunately, effective lengths of less than 1/k aren’t uncommon today. Because
the expected HSP length grows with the logarithm of the search space, as
$\ln(kmn)/H$, the large size of many sequence databases can result in large
expected HSP lengths. In fact, it’s not uncommon to see expected HSP lengths
approaching 200 when searching some of the larger protein databases. Keep in mind
that the average protein is ~300 amino acids long; thus, for many searches, m´ is
routinely being set to 1/k. Recent work by S.F. Altschul