
This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
20 Tips to Improve Your BLAST Searches
8.10 Be Skeptical of Hypothetical Proteins
Amino acid sequencing is more difficult than nucleic acid sequencing, so the
sequences of most proteins are inferred from DNA translations. Some inferences
come from gene predictions and others come from transcript translations. Finding
the correct structure of genes in genomic DNA is very difficult; algorithms are
incomplete approximations, and people make mistakes. Some research groups are
conservative and only report proteins when there is good evidence. Others submit
hypothetical proteins and hope that they will be useful (and they often are). As a
result, many proteins in the public databases are slightly incorrect or even fictitious.
Unfortunately, hypothetical gene structures aren’t always clearly labeled.
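
One practical way to act on this advice is to scan the definition lines of your BLAST hits for wording that suggests a predicted rather than an experimentally supported protein. The following is a minimal sketch in Python. The keyword list, the function name flag_suspect, and the sample definition lines are all assumptions made up for illustration; they are not taken from any database or tool. Because hypothetical gene structures aren't always clearly labeled, a hit that passes this check is not thereby trustworthy.

```python
# Terms that often mark inferred or predicted proteins.
# This list is an assumption for illustration, not an official vocabulary.
SUSPECT_TERMS = (
    "hypothetical",
    "predicted",
    "putative",
    "unnamed protein product",
    "similar to",
)


def flag_suspect(defline):
    """Return the suspect terms found in a definition line (case-insensitive)."""
    lowered = defline.lower()
    return [term for term in SUSPECT_TERMS if term in lowered]


if __name__ == "__main__":
    # Invented definition lines, used only to exercise the function.
    deflines = [
        "hypothetical protein XYZ [Example organism]",
        "PREDICTED: kinase-like protein [Example organism]",
        "alcohol dehydrogenase [Example organism]",
    ]
    for defline in deflines:
        hits = flag_suspect(defline)
        label = "check evidence" if hits else "no keyword match"
        print(f"{label}\t{defline}")
```

A keyword match is only a prompt to look at the underlying evidence for the record, and the absence of a match says nothing about how the sequence was derived.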
The most accurate protein sequences come from translating full-length cDNAs. But
determining the protein encoded by a transcript isn’t as simple as it sounds. While
there is usually only one long open reading frame (ORF), the longest ORF won’t
necessarily correspond to a real protein. Be suspicious of all short proteins. Even in a
full-length cDNA with a very large ORF, determining the start of translation isn’t
straightforward. The first methionine in the longest ORF is usually