
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
Chapter 4: Sequence Similarity
Sequence Similarity
Sequence similarity is a simple extension of amino acid or nucleotide similarity. To
determine it, sum up the individual pair-wise scores in an alignment. For example,
the raw score of the following BLAST alignment under the BLOSUM62 matrix is 72.
Converting 72 to a normalized score is as simple as multiplying by lambda (λ). (Note
that for BLAST statistical calculations, the normalized score is λS − ln K.)
Query: 885 QCPVCHKKYSNALVLQQHIRLHTGE 909
           +C VC K ++    L++H RLHTGE
Sbjct: 267 ECDVCSKSFTTKYFLKKHKRLHTGE 291
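The sum described above can be sketched in a few lines of Python. This is only an illustration: the names BLOSUM62_SUBSET, raw_score, and normalized_score are not from the text, and the table holds only the BLOSUM62 entries this particular alignment needs. The λ and K values in the usage lines are assumed, illustrative figures for ungapped BLOSUM62 scoring; in practice, take them from your own BLAST report.

```python
import math

# Only the BLOSUM62 entries needed for this one alignment (not the full matrix).
# Keys are residue pairs in alphabetical order, because the matrix is symmetric.
BLOSUM62_SUBSET = {
    ("C", "C"): 9, ("E", "E"): 5, ("G", "G"): 6, ("H", "H"): 8,
    ("K", "K"): 5, ("L", "L"): 4, ("R", "R"): 5, ("T", "T"): 5,
    ("V", "V"): 4,
    ("E", "Q"): 2, ("D", "P"): -1, ("H", "S"): -1, ("K", "S"): 0,
    ("F", "Y"): 3, ("S", "T"): 1, ("N", "T"): 0, ("A", "K"): -1,
    ("L", "Y"): -1, ("F", "V"): -1, ("K", "Q"): 1, ("I", "K"): -3,
}

QUERY = "QCPVCHKKYSNALVLQQHIRLHTGE"
SBJCT = "ECDVCSKSFTTKYFLKKHKRLHTGE"


def raw_score(query, sbjct, matrix=BLOSUM62_SUBSET):
    """Sum the pairwise scores of an ungapped alignment, column by column."""
    if len(query) != len(sbjct):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(query, sbjct):
        # Sort the pair so (Q, E) and (E, Q) find the same entry.
        total += matrix[tuple(sorted((a, b)))]
    return total


def normalized_score(raw, lam, k=None):
    """Return lam * raw, or lam * raw - ln(k) when k is supplied."""
    scaled = lam * raw
    if k is not None:
        scaled -= math.log(k)
    return scaled


if __name__ == "__main__":
    s = raw_score(QUERY, SBJCT)
    print("raw score:", s)  # 72, matching the text
    # Assumed, illustrative parameters; use the values your BLAST run reports.
    print("lambda * S:", normalized_score(s, lam=0.318))
    print("lambda * S - ln K:", normalized_score(s, lam=0.318, k=0.134))
```

Because each column is looked up on its own, the function never consults neighboring residues, which is exactly the independence assumption the text goes on to discuss.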
Recall from Chapter 3 that the score of each pair of letters is considered
independently from the rest of the alignment. Sequence similarity rests on the same
idea, and it creates a convenient synergy between alignment algorithms and alignment
scores: because the total is a sum of independent pair scores, an algorithm can
optimize it piece by piece. However, when treating the letters independently of one
another, you lose contextual information. Can you assume that the probability of A
followed by G is the same as the probability of G followed by A? In a natural
language such as English, you know that this doesn’t make sense. In English, Q is
almost always followed by U. If you treat these letters independently, you lose this
restriction. The context rules for biological sequences aren’t as strict as for
English, but there are tendencies. For example, low