This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
Chapter 2: Biological Sequences
total triplet combinations), long open reading frames (ORFs) should be very rare, at
least from an unbiased random model. On average, proteins are 300 amino acids
long, so finding an ORF that is 900 nucleotides long is really unexpected and a pretty
clear signal that the ORF codes for a real protein. Of course, some genes encode
small proteins, and finding these is a bit more difficult.
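The rarity argument above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the standard genetic code (3 stop codons out of 64) and an unbiased random model in which every codon is equally likely; the function names are made up for illustration.

```python
# Sketch: how likely is a long open reading frame in random sequence?
# Assumes all 64 codons are equally likely and that 3 of them are stops
# (the standard genetic code). Real genomes have biased composition.

def prob_open(n_codons, n_stops=3, n_total=64):
    """Probability that n_codons consecutive random codons contain no stop."""
    return ((n_total - n_stops) / n_total) ** n_codons

def mean_codons_to_stop(n_stops=3, n_total=64):
    """Average number of codons read up to and including the first stop."""
    return n_total / n_stops

if __name__ == "__main__":
    print(f"mean spacing of stops: {mean_codons_to_stop():.1f} codons")
    for n in (50, 100, 300):
        print(f"{n:4d} codons ({3 * n:4d} nt): P(no stop) = {prob_open(n):.2e}")
```

Under these assumptions a stop codon turns up about once every 21 codons, and the chance that a given frame stays open for 300 codons (900 nucleotides) is well under one in a million. That is why an ORF of that length is such a strong signal.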
Eukaryotic gene structure is more complicated than prokaryotic gene structure.
Unlike prokaryotic genes, eukaryotic genes are often broken into pieces that are
assembled before they are translated. Like prokaryotes, eukaryotes also have promoters
to regulate when genes are turned on, but eukaryotic promoters are often much larger
and may lie a great distance from the start of translation. In addition, many genes respond
to additional sequences called enhancers and suppressors that aren’t necessarily
upstream of a gene and may be many kilobases away.
In eukaryotes, mRNAs are processed before they are translated (Figure 2-6). Two
kinds of processing are common: splicing and poly-adenylation. Splicing brings
together the sequences that are kept and throws out the intervening stuff. The sequences
that end up in the mature mRNA are called exons, and the intervening sequences that
are removed are called introns. The part of the mRNA that codes for protein is called the
coding sequence (CDS), and the parts at either end are called untranslated regions (UTRs).
The other common post-transcriptional modification is poly-adenylation. In this
process, 50 or more adenine nucleotides are added to the 3' end of the mRNA. This
stretch of adenines is called a poly-A tail.
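The two processing steps described above can be sketched in a few lines of code. This is an illustration only: the toy sequence, the coordinates, and the function names are all invented for the example, and the sequence is written in the DNA alphabet (T rather than U) for convenience.

```python
# Sketch of eukaryotic mRNA processing on a made-up transcript.
# Coordinates are 0-based and end-exclusive, like Python slices.

def splice(pre_mrna, exons):
    """Join the exon intervals in order; everything between them (introns) is dropped."""
    return "".join(pre_mrna[start:end] for start, end in exons)

def polyadenylate(mrna, tail_length=50):
    """Append a poly-A tail of the given length to the 3' end."""
    return mrna + "A" * tail_length

def split_utrs(mrna, cds_start, cds_end):
    """Return (5' UTR, CDS, 3' UTR) for a spliced mRNA."""
    return mrna[:cds_start], mrna[cds_start:cds_end], mrna[cds_end:]

if __name__ == "__main__":
    # Toy transcript: exon 1 (5' UTR + start of CDS), one intron, exon 2.
    exon1 = "GGC" + "ATGGCC"        # 5' UTR, then ATG GCC
    intron = "GTAAGTTTTCAG"         # hypothetical intron
    exon2 = "TTTTAA" + "GCGC"       # TTT TAA (stop), then 3' UTR
    pre_mrna = exon1 + intron + exon2

    exons = [(0, len(exon1)), (len(exon1) + len(intron), len(pre_mrna))]
    mature = splice(pre_mrna, exons)
    utr5, cds, utr3 = split_utrs(mature, 3, 15)
    print("spliced:", mature)
    print("5' UTR:", utr5, "CDS:", cds, "3' UTR:", utr3)
    print("with tail:", polyadenylate(mature))
```

Note that the UTRs are defined on the spliced mRNA, so the CDS coordinates passed to split_utrs refer to the mature sequence, not to the original transcript.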
To many people, the most interesting parts of a genome are its genes. However,
genes may account for only a small fraction of a genome. In the human genome, for example,
only 1 to 2 percent of the sequence codes for proteins. So why not just sequence
the proteins? Sequencing proteins turns out to be much more difficult than sequencing
nucleotides, but you can sequence the transcripts that code for proteins. Using some
clever molecular biology techniques, it’s possible to separate mRNAs from the rest of
[Figure 2-6. Eukaryotic mRNA processing. The figure shows two exons separated by an intron, the ATG start codon, the stop codon, and the poly-A tail.]