
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
20 Tips to Improve Your BLAST Searches
for nearly identical sequences) isn’t expected to answer the question; too many
sequences have changed in the 500 million years that separate worms and humans.
8.3 Perform Controls, Especially in the Twilight Zone
Controls are crucial to any scientific experiment. The random model underlying
BLAST statistics provides one kind of control, but performing an explicit control can
give you greater confidence in your results. This is especially true when looking for
weak similarities, commonly called the twilight zone. One of the simplest and most
effective ways to determine if an alignment is believable is to shuffle your query
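sequence and repeat the search. Shuffling preserves the length and composition of
the query but destroys the order of its letters. If the shuffled sequence returns
similar results, the original alignment probably reflects compositional bias, or the
search parameters aren’t specific enough. The following Perl script shuffles the
sequence in a single-record FASTA file: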
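#!/usr/bin/perl -w
use strict;

# Read a single-record FASTA file: definition line first, then sequence lines.
my ($def, @seq) = <>;
print $def;
chomp @seq;

# Flatten the sequence into an array of single letters.
@seq = split(//, join("", @seq));

# Draw letters at random without replacement, printing 60 per line.
my $count = 0;
while (@seq) {
    my $index = int(rand(@seq));
    my $base = splice(@seq, $index, 1);
    print $base;
    print "\n" if ++$count % 60 == 0;
}
print "\n" if $count % 60;  # terminate a final partial line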
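A shuffled query works as a control only because it keeps the length and letter composition of the original while scrambling the order. The following is a minimal sketch of a check on that property, assuming the core List::Util module is available. The names shuffle_seq and composition, and the short example sequence, are made up for illustration and are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Shuffle the letters of one sequence; length and composition are preserved.
sub shuffle_seq {
    my ($seq) = @_;
    return join("", shuffle(split(//, $seq)));
}

# Count how often each letter occurs; returns a hash reference.
sub composition {
    my ($seq) = @_;
    my %count;
    $count{$_}++ for split(//, $seq);
    return \%count;
}

# Hypothetical example sequence, for illustration only.
my $query    = "ACGTACGTTTGACCAGGT";
my $shuffled = shuffle_seq($query);

# Print each letter with its count before and after shuffling.
my $before = composition($query);
my $after  = composition($shuffled);
for my $letter (sort keys %$before) {
    printf "%s\t%d\t%d\n", $letter, $before->{$letter}, $after->{$letter};
}
```

The two count columns should agree for every letter. If they do, any difference between the real search and the control search comes from the order of the letters, not from the composition of the query.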
Now let’s put this script into action. Let’s make the dubious hypothesis that ALU
repeats aren’t specific to primates but are present in all genomes. They haven’t ...