This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
So far in this chapter, we’ve walked through the most basic operations of Karlin-Altschul
statistics to give you the knowledge necessary to calculate bit
scores, effective lengths, and Expects. We’ve explained that BLAST uses one statistical
measure to calculate the Expect of an HSP and another to calculate the aggregate
Expect of a group of HSPs. Hopefully, you’ve gained a better understanding of how
all of these operations fit into the larger picture of Karlin-Altschul statistics.
You have also seen that it’s possible to use Karlin-Altschul statistics to recover statistical
measures that are calculated by BLAST internally, but not included in the
report—principally, sum scores and the individual Expect for an HSP for which an
Expect(n) has been reported. Learning to calculate these values is the first step
toward becoming a power user of BLAST statistics. The remaining sections of this
chapter will show you how to use what you’ve learned to deal with critical questions
about BLAST results.
Using Statistics to Understand BLAST Results
Karlin-Altschul statistics is much more than a way to determine the statistical significance
of a sequence alignment in the context of a database search. It also provides a
framework with which to probe the complex relationships that exist between BLAST
parameters and results. Using Karlin-Altschul statistics to ask and answer questions
about a BLAST search is much like using stoichiometry at the lab bench; it doesn’t
require theoretical savvy, just a little algebra. It’s also useful; you no longer need to
be frustrated when confronted with an inexplicable BLAST result.
Now let’s look at a practical application of Karlin-Altschul statistics: using BLASTN
to map a PCR primer to a genome. The application is a simple but striking example
of how to use Karlin-Altschul statistics to understand the way parameter choice
determines BLAST results. Finally, Karlin-Altschul statistics reveal much about
BLASTN’s strengths and weaknesses and its potential as a tool to detect the conserved,
cis-regulatory regions of genes.
Where Did My Oligo Go?
First, try to identify the position of the following oligonucleotide in the Drosophila
melanogaster genome using WU-BLASTN with its default parameters:
Example 7-4 shows that the oligo isn’t found in the Drosophila melanogaster genome
when you search with WU-BLASTN and its default parameters.
Chapter 7: A BLAST Statistics Tutorial
There are, of course, many reasons why you might not be able to identify an oligo in
the Drosophila melanogaster genome. First, the oligo might contain repetitive
sequence and thus be masked out. However, because WU-BLAST doesn’t mask by
default, that can’t be the reason. Second, the assembled genome may be incomplete.
Every sequenced genome to date is incomplete to some degree. In fact, a 99 percent
complete 124 Mb genome is still missing 1.24 megabases of euchromatic (nonrepetitive
DNA) sequence, leaving plenty of space for an oligo to go missing in. The
incompleteness of the genome is a possible explanation for our WU-BLAST result,
but is it the correct one? Before concluding that the oligo falls into a sequencing gap,
let’s try to run NCBI-BLASTN with its default parameters. Aha! The NCBI-BLASTN
results in Example 7-5 show that the oligo is present in the Drosophila melanogaster
genome and the HSP is assigned a significant Expect.
Example 7-4. The oligo isn’t found
Reference: Gish, W. (1996-2000)
Notice: this program and its default parameter settings are optimized to find
nearly identical sequences rapidly. To identify weak similarities encoded in
nucleic acid, use BLASTX, TBLASTN or TBLASTX.
Query= oligo
(25 letters)
Database: na_whole-genome_genomic_dmel_RELEASE3.FASTA
7 sequences; 124,181,667 total letters.
Searching....10....20....30....40....50....60....70....80....90....100% done
High Probability
Sequences producing High-scoring Segment Pairs: Score P(N) N
*** NONE ***
Example 7-5. Using NCBI-BLASTN to find the oligo
Sequences producing significant alignments: (bits) Value
2R 2R.3 assembled 23-11-2001 50 1e-06
X X release:2 length:21666217bp Assembled X chromosome arm seque... 32 0.25
3R 3R.3 32 0.25
U GenomicInterval:U 30 0.99
3L 3L.3 v.3e 23351213bp BCM HGSC guide:3l-mtp-eval.08apr02 28 3.9
2L 2L release:3 length:22217931bp Assembled 2L chromosome arm se... 28 3.9
>2R 2R.3 assembled 23-11-2001
Length = 20302755
Score = 50.1 bits (25), Expect = 1e-06
Identities = 25/25 (100%)
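
The bit score and Expect reported for the 2R hit can be roughly checked with the Karlin-Altschul equations covered earlier in this chapter. The following Python sketch is only an illustration, not BLAST's actual implementation. The values of lambda, K, and H are assumptions (figures commonly quoted for ungapped +1/-3 nucleotide scoring) and don't appear in the report excerpt. The database size is taken from the header in Example 7-4 on the assumption that both searches used the same database. The function names are hypothetical, and the effective-length correction is a simplified one.

```python
import math

# ASSUMED Karlin-Altschul parameters for ungapped +1/-3 nucleotide scoring.
# They are not printed in the report excerpt; treat them as illustrative.
LAMBDA = 1.374
K = 0.711
H = 1.31


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Convert a raw score to a normalized (bit) score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expected_hsp_length(m, n, k=K, h=H):
    """Approximate expected HSP length, truncated to a whole number."""
    return int(math.log(k * m * n) / h)


def expect(raw_score, m, n, num_seqs, lam=LAMBDA, k=K, h=H):
    """Expect of a single HSP, using a simplified effective-length correction."""
    length = expected_hsp_length(m, n, k, h)
    m_eff = max(m - length, 1)             # effective query length
    n_eff = max(n - num_seqs * length, 1)  # effective database length
    return k * m_eff * n_eff * math.exp(-lam * raw_score)


# Values read from the reports: a 25-letter query, a database of 7 sequences
# and 124,181,667 letters, and a raw score of 25 for the 2R hit.
query_len = 25
db_letters = 124_181_667
db_seqs = 7
raw = 25

bits = bit_score(raw)
e_value = expect(raw, query_len, db_letters, db_seqs)
print(f"bit score = {bits:.1f}")
print(f"Expect    = {e_value:.0e}")
```

With these assumed constants, the results land close to the 50.1 bits and 1e-06 shown in Example 7-5. Small differences are to be expected because the constants are rounded and the length correction is simplified.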
