This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
Chapter 8: 20 Tips to Improve Your BLAST Searches
8.17 Perform Pilot Experiments
Before embarking on a large BLAST experiment, first try some pilot experiments. For
example, if you want to compare all human proteins to all nonhuman proteins, try
100 proteins first. Or, if you want to annotate a 5 Mb chromosomal region with
BLASTX similarities, search 100 Kb first. If you’re unsure of which parameters to
use, try several and see which ones give you the kinds of results you’re looking for. It
may seem like a waste of time, but performing pilot experiments will actually save
you time in the end.
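As a minimal sketch of the first suggestion, the following Python draws a random pilot set of records from a FASTA file. The function names `read_fasta` and `pilot_sample` are hypothetical helpers written for this illustration, not part of any BLAST distribution, and the fixed seed is an assumption made so that the same pilot set can be regenerated later.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def pilot_sample(records, n=100, seed=0):
    """Return up to n records chosen at random, repeatably for a given seed."""
    records = list(records)
    if len(records) <= n:
        return records
    return random.Random(seed).sample(records, n)
```

Writing the sampled records back out as FASTA and searching only that file gives a quick estimate of run time and output size before committing to the full data set.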
8.18 Examine Statistical Outliers
In a high-throughput setting, individual BLAST reports may be huge, and there may
be thousands of them. There’s no way you can look at all of them, but for quality
control, you should examine some of them. Keep global statistics on BLAST reports,
such as the number of hits per Kb of query sequence. Statistical outliers may point to
general problems that become more apparent in certain sequences.
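One way to track such a statistic is sketched below in Python. It assumes you have already extracted a hit count and a query length for each report; the helper names, the median-based modified z-score, and the 3.5 cutoff are choices made for this illustration, not something BLAST provides.

```python
from statistics import median


def hits_per_kb(num_hits, query_length_bp):
    """Number of hits per 1,000 bases (or residues) of query."""
    return num_hits / (query_length_bp / 1000.0)


def flag_outliers(rates, threshold=3.5):
    """Return the names in `rates` whose value is far from the median.

    `rates` maps a report name to its hits-per-Kb value. The distance is
    measured as a modified z-score based on the median absolute deviation
    (MAD), which is less distorted by the outliers themselves than a mean
    and standard deviation would be.
    """
    values = list(rates.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median; flag anything that differs.
        return [name for name, v in rates.items() if v != med]
    return [
        name
        for name, v in rates.items()
        if abs(0.6745 * (v - med) / mad) > threshold
    ]
```

The flagged reports are the ones worth opening by hand. A query with far more hits per Kb than its neighbors often contains a repeat or low-complexity region, and one with far fewer may point to a truncated search.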
8.19 Use links and topcomboN to Make Sense of Alignment Groups
WU-BLAST has two very useful parameters for displaying alignment groupings.
topcomboN sorts alignments into groups and labels them. The links parameter shows
the order of alignments in a group, which is much like the order of a gene’s exons.
Figure 8-9 displays these features.
8.20 How to Lie with BLAST Statistics
Several techniques can help you massage BLAST statistics to either hide significant
alignments or make meaningless alignments appear highly significant. Why would
you want to do this? If you have to ask, you’re not the intended audience. Dishonest
evildoers, read on.
The easiest method to adjust the significance of all scores is to set the effective size of
the search space either higher or lower. Command-line parameters for this are
available in both NCBI-BLAST (-Y) and WU-BLAST (Y and Z). You can also alter the
scoring scheme by editing the scoring matrices. A more involved approach is to hack
the source code to set your own values for λ, K, and H. WU-BLAST makes it all too
easy because you can alter scores or set Karlin-Altschul parameters on the command
line. Whatever approach you take, you will, of course, want to edit the footer to
cover your tracks. The easiest way to do this is to run the search twice and diff the
footers to determine what needs fixing.
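To see why the search space matters so much, and why a reader should always check the search space reported in the footer, recall the Karlin-Altschul equation for the expected number of chance alignments, E = K·m·n·e^(−λS), where m·n is the effective search space. The short Python sketch below evaluates it for one score at two search-space sizes. The function name `expect` is hypothetical, and the default λ and K are illustrative values chosen for this example, not taken from any particular report.

```python
import math


def expect(raw_score, search_space, lam=0.318, K=0.134):
    """Karlin-Altschul expectation: E = K * (m * n) * exp(-lambda * S).

    `search_space` is the effective search space m * n. The default
    lam and K are illustrative assumptions only.
    """
    return K * search_space * math.exp(-lam * raw_score)


# The same raw score evaluated against two different search-space sizes.
honest = expect(raw_score=60, search_space=1e9)
shrunk = expect(raw_score=60, search_space=1e6)

# E scales linearly with the search space, so shrinking the space by a
# factor of 1,000 makes the same alignment look 1,000 times more significant.
print(honest / shrunk)
```

Because the alignment and its score are unchanged, the only visible trace of this manipulation is the search-space size recorded in the report footer.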
With low gap penalties, you can make alignments between just about anything. For
BLASTN, NCBI-BLAST always uses ungapped statistics, so you don’t have to do
much work to lie. Just hope that nobody notices all the gaps. This works best if you
have a supervisor who is either too busy to look at alignments or wouldn’t know a
decent alignment if it bit him. NCBI-BLAST is very restrictive about which gap
penalties you can employ for the protein-based BLAST programs. Your only choice here
is to hack and recompile. WU-BLAST makes it very easy: set your gap costs low and
include the warnings option on the command line to suppress messages about ungapped
statistics.
Another way to trick the unobservant is to remove complexity filters. This works
especially well when claiming that some anonymous low-complexity region or tran-
script is a cool gene. You can almost always find a small ORF that has a poor match
to something with an interesting definition line. A poor match is only poor if you
don’t know how to fix the statistics. This approach even works when fooling scien-
tific journals. (It really does. We’ve seen it happen.)
Figure 8-9. WU-BLAST topcomboN and links. The top-to-bottom order of alignments in the
graphic (a) is the same as the order of the statistics lines in the BLASTX report (b).
