This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
teria. So the next time you munch on a carrot, you might consider how many
genomes are really in there.
So far, this chapter has neglected viruses. Where do they fit in? By most definitions,
viruses aren’t even alive; they don’t grow or have repair processes. Viruses seem to
break every rule of biology. Some viruses infect prokaryotes, and others parasitize
eukaryotes. Viruses come in many different shapes and have wildly different
lifestyles. Some have genomes made from RNA instead of DNA, and others have
single-stranded rather than double-stranded genomes.
Genomes and Genes
In general, the genomic structure of prokaryotes is very different from that of
eukaryotes (Figure 2-5). Genomes are organized into chromosomes. Prokaryotes often
have a single circular chromosome, and eukaryotes usually have multiple linear
chromosomes. People are sometimes surprised to find that genome size and chromosome
number don’t track organismal complexity. For example, the single-celled
Amoeba dubia has a genome that is about 200 times larger than the human genome.
Although dogs and cats have very similar genome sizes, dogs have about twice as many
chromosomes (78 versus 38). One rule to keep in mind when thinking about genomic
organization is that genomes of viruses and prokaryotic organisms generally contain
little noncoding sequence, whereas the genomes of more complex organisms usually
contain a much higher percentage of noncoding sequence.
Figure 2-5. Prokaryote and eukaryote cells (labels: prokaryotic gene, coding sequence, eukaryotic gene)
Prokaryotic Genes
Prokaryotic genes are relatively simple. They contain a promoter that determines
when the gene is transcribed and a coding region that contains the DNA sequence
for a protein. It is relatively easy to find genes in prokaryotic genomes. Since stop
codons are expected about every 21 triplets (there are three stop codons out of 64
total triplet combinations), long open reading frames (ORFs) should be very rare, at
least under an unbiased random model. On average, proteins are about 300 amino acids
long, so an ORF that is 900 nucleotides (300 codons) long is highly unexpected by
chance and a fairly clear signal that the ORF codes for a real protein. Of course, some
genes encode small proteins, and finding these is a bit more difficult.
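The rarity argument for long ORFs can be checked with a little arithmetic. The following Python sketch is an illustration only, not code from this book. It assumes that every triplet is equally likely and independent of its neighbors (the unbiased random model), it scans only the forward strand, and the names prob_open and find_orfs are hypothetical.

```python
# Stop codons of the standard genetic code, written in the DNA alphabet.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def prob_open(n_codons, p_stop=3 / 64):
    """Chance that n_codons consecutive random triplets contain no stop codon.

    Assumes all 64 triplets are equally likely and independent.
    """
    return (1 - p_stop) ** n_codons


def find_orfs(seq, min_codons=100):
    """Return (start, end) pairs for ATG...stop stretches on the forward strand.

    Coordinates are 0-based and end-exclusive, and the stop codon is included.
    min_codons counts the codons from the ATG up to, but not including, the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


if __name__ == "__main__":
    print("expected triplets between stops:", 1 / (3 / 64))
    print("chance that 300 random codons stay open:", prob_open(300))
    toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
    print("ORFs in toy sequence:", find_orfs(toy, min_codons=5))
```

Under these assumptions, the chance that 300 random codons contain no stop is well below one in a million, which is why a 900-nucleotide ORF stands out. Real genomes have biased base composition, so the true background rate differs from this estimate.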
Eukaryotic Genes
Eukaryotic gene structure is more complicated than prokaryotic gene structure.
Unlike prokaryotic genes, eukaryotic genes are often broken into pieces that are
assembled before they are translated. Like prokaryotes, eukaryotes also have
promoters to regulate when genes are turned on, but eukaryotic promoters are often
much larger and may lie a great distance from the start of translation. In addition,
many genes respond to additional sequences called enhancers and suppressors that
aren’t necessarily upstream of a gene and may be many kilobases away.
In eukaryotes, mRNAs are processed before they are translated (Figure 2-6). Two
kinds of processing are common: splicing and poly-adenylation. Splicing joins the
sequences that are retained and discards the sequences between them. The sequences
that end up in the mature mRNA are called exons, and the intervening sequences are
called introns. The part of the mRNA that codes for protein is called the coding
sequence (CDS), and the parts at either end are called untranslated regions (UTRs).
The other common post-transcriptional modification is poly-adenylation. In this
process, 50 or more adenine nucleotides are added to the 3' end of the mRNA, forming
what is called a poly-A tail.
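The processing steps described here can be mimicked on a toy sequence. The following Python sketch is a simplified illustration, not code from this book. It assumes exon coordinates are already known, writes the transcript in the DNA alphabet (T rather than U), and takes the first ATG as the start of translation, which is not always true of real transcripts. The names splice, polyadenylate, and coding_sequence are hypothetical, as is the toy sequence.

```python
# Stop codons of the standard genetic code, written in the DNA alphabet.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice(primary, exons):
    """Join exon slices in order, discarding the introns between them.

    exons is a list of (start, end) pairs, 0-based and end-exclusive.
    """
    return "".join(primary[start:end] for start, end in exons)


def polyadenylate(mrna, tail_length=50):
    """Append a poly-A tail of tail_length adenines to the 3' end."""
    return mrna + "A" * tail_length


def coding_sequence(mrna):
    """Return the CDS from the first ATG through the first in-frame stop codon.

    Returns an empty string if no ATG or no in-frame stop is found.
    """
    start = mrna.find("ATG")
    if start == -1:
        return ""
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return mrna[start:i + 3]
    return ""


if __name__ == "__main__":
    # Toy primary transcript: exon 1, a 12-nucleotide intron, exon 2.
    primary = "GGCATGGCC" + "GTAAGTTTTCAG" + "AAATGATTT"
    exons = [(0, 9), (21, 30)]

    spliced = splice(primary, exons)
    mature = polyadenylate(spliced)
    cds = coding_sequence(mature)
    cds_start = mature.find(cds)

    print("spliced:", spliced)
    print("CDS:    ", cds)
    print("5' UTR: ", mature[:cds_start])
    print("3' UTR: ", spliced[cds_start + len(cds):])
```

In this toy transcript, the UTRs are simply whatever remains of the exons on either side of the CDS, which matches the definitions given above.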
To many people, the most interesting parts of a genome are its genes. However,
genes may account for only a small fraction of a genome. In the human genome, for
example, only 1 to 2 percent of the sequence codes for proteins. So why not just
sequence the proteins? Sequencing proteins turns out to be much more difficult than
sequencing nucleotides, but you can sequence the transcripts that code for proteins.
Using some clever molecular biology techniques, it’s possible to separate mRNAs from
the rest of
Figure 2-6. Eukaryotic mRNA processing (labels: primary transcript, mature transcript, exon, intron, 5' UTR, 3' UTR, ATG, stop)
