This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
1˚
The abbreviation for primary. 1˚ sequence
refers to the letters of DNA, RNA, or
protein. 1˚ transcript refers to an
unprocessed RNA that still contains its
introns.
2˚
The abbreviation for secondary. Most
frequently used for generalizing protein
and RNA structures; for example, the
α-helix and hairpin are common 2˚
structures.
3´
The end of a nucleic acid sequence; often
used with UTR.
5´
The start of a nucleic acid (DNA or RNA)
sequence; often used in conjunction with
UTR (e.g., 5´UTR). Nucleotide sequences
are conventionally written with the 5´ end
at the left. DNA molecules are usually
double-stranded, but when written,
usually only the 5´ to 3´ strand is
displayed. The complementary strand has
reversed polarity (3´ to 5´).
aa
The abbreviation for an amino acid that is
often used when describing the length of a
protein (e.g., the average protein is about
300 aa long).
allele
A form of a gene. Typically, the most
common form is called wild-type, and
each allele is given a specific (and often
obscure) name.
amino acid
The basic building block for all proteins.
There are 20 common amino acids.
Arabidopsis thaliana
Known by its common name, thale cress,
this mustard weed is a favorite organism
for plant genetics and molecular biology.
It was the first plant with a complete
genomic sequence. For more information,
bit
The contraction for binary digit. The
base-2 logarithm of a number is in units of
bits.
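As a small illustration of this definition, the sketch below converts a
number of equally likely outcomes into bits. Perl's built-in log is the
natural logarithm, so the helper divides by log(2). The name log2 is a
hypothetical helper chosen here, not something the text defines.

```perl
use strict;
use warnings;

# log2 is a hypothetical helper name; Perl's built-in log() is base e.
sub log2 {
    my ($n) = @_;
    return log($n) / log(2);
}

# One of 4 equally likely nucleotides carries 2 bits.
printf "%.1f bits\n", log2(4);

# One of 20 equally likely amino acids carries about 4.3 bits.
printf "%.1f bits\n", log2(20);
```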
BLOSUM
The abbreviation for a blocks substitution
matrix. Matrix names are followed by a
number (e.g., BLOSUM62) that indicates
the minimum percent identity between
any two aligned sequences.
bp
The abbreviation for base pair. The length
of DNA is usually given in bp or nt.
Common measures include Kb, Mb, and Gb
for thousands, millions, and billions of bp,
respectively.
C-terminus
The end of a protein. In text form, the
C-terminus of the protein is always at the
right.
Caenorhabditis elegans
A nematode (also called a roundworm)
that is about 1 mm long and has about
1,000 cells as an adult. C. elegans was the
first animal to have its complete genome
sequenced. See
CDS
The abbreviation for a coding sequence.
CDS isn’t synonymous with exon, since
exons may contain noncoding sequence.
codon
Three contiguous letters of DNA or RNA.
Each of the 64 codons specifies either an
amino acid or a translation stop.
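The sketch below illustrates both statements: it splits a sequence into
contiguous three-letter codons and enumerates the 64 possible DNA codons.
The helper names codons and all_codons are hypothetical, and the choice to
start at the first letter and drop a trailing partial codon is an
assumption made for this example.

```perl
use strict;
use warnings;

# Split a sequence into contiguous three-letter codons, starting at the
# first letter. A trailing partial codon (1 or 2 letters) is dropped.
sub codons {
    my ($seq) = @_;
    my @codons = $seq =~ /(...)/g;
    return @codons;
}

# Enumerate all 4 x 4 x 4 = 64 possible DNA codons.
sub all_codons {
    my @nt = qw(A C G T);
    my @all;
    for my $x (@nt) {
        for my $y (@nt) {
            for my $z (@nt) {
                push @all, "$x$y$z";
            }
        }
    }
    return @all;
}

my @c = codons("ATGAAATAGC");
print "@c\n";                        # prints ATG AAA TAG
print scalar(all_codons()), "\n";    # prints 64
```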
complement
The complement of a DNA sequence is
the sequence on the other strand. For
example, the complement of ACCCGT is
TGGGCA. To complement a sequence in
Perl, use either of the following:
# 4-letter alphabet
$dna =~ tr/ACGT/TGCA/;
# 15-letter alphabet
$dna =~ tr/ACGTRYMKWSBDHVN/TGCAYRKMWSVHDBN/;
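Because the complementary strand has reversed polarity, reading it in the
conventional 5´ to 3´ direction means reversing the letters as well as
complementing them. A minimal sketch follows, using the 4-letter alphabet;
revcomp is a hypothetical helper name.

```perl
use strict;
use warnings;

# revcomp is a hypothetical helper name used for illustration.
sub revcomp {
    my ($dna) = @_;
    my $rc = reverse $dna;      # scalar context: reverse the letters
    $rc =~ tr/ACGT/TGCA/;       # then complement each letter
    return $rc;
}

print revcomp("ACCCGT"), "\n";  # prints ACGGGT
```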
Drosophila melanogaster
The common fruit fly. This is one of the
most famous organisms for genetic
research and was one of the first animals
whose complete genomic sequence was
determined. See
dynamic programming
A common technique that reduces the
computational complexity of a problem
by finding and extending a partial
optimization.
E. coli
Escherichia coli. A common bacterium
normally found in your gut and a favorite
organism for molecular biology research.
Some variants cause food poisoning.
effective length
Karlin-Altschul statistics assume
sequences of infinite length. To adjust for
edge effects in real sequences, the search
space is reduced by adjusting the true
lengths of the sequences to effective
lengths.
entropy
Randomness; disorder; unpredictability.
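One common way to put a number on this idea is Shannon entropy, measured in
bits per letter. The sketch below applies it to the letter composition of a
sequence. The formula choice and the helper name entropy are assumptions
made for illustration, not something this entry specifies.

```perl
use strict;
use warnings;

# Shannon entropy, in bits per letter, of the letter composition of a
# sequence: H = -sum(p * log2(p)) over the letters that occur.
sub entropy {
    my ($seq) = @_;
    my %count;
    $count{$_}++ for split //, $seq;
    my $len = length $seq;
    my $h = 0;
    for my $n (values %count) {
        my $p = $n / $len;
        $h -= $p * log($p) / log(2);
    }
    return $h;
}

printf "%.2f\n", entropy("ACGTACGT");   # four equally common letters
printf "%.2f\n", entropy("AAAAAAAA");   # one letter only
```

A sequence that uses all four nucleotides equally often is the least
predictable (2 bits per letter); a sequence of a single repeated letter is
completely predictable (0 bits per letter).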
eukaryote
Organisms with intracellular membranous
organelles such as the nucleus and
mitochondria are called eukaryotes.
frame-shift mutation
A mutation that causes an insertion or
deletion of nucleotides that isn’t a
multiple of three, and therefore causes the
reading frame to change.
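The multiple-of-three rule can be sketched as follows. The helper name
is_frameshift and the example sequence are hypothetical; the second part
prints the codons before and after a one-letter deletion to show how every
downstream codon changes.

```perl
use strict;
use warnings;

# An insertion or deletion shifts the reading frame unless its length
# is a multiple of three. is_frameshift is a hypothetical helper name.
sub is_frameshift {
    my ($indel_length) = @_;
    return $indel_length % 3 != 0 ? 1 : 0;
}

# Made-up example sequence: delete one A right after the ATG.
my $seq = "ATGAAACCCGGG";
(my $mutant = $seq) =~ s/^(ATG)A/$1/;

print join(" ", $seq    =~ /(...)/g), "\n";   # prints ATG AAA CCC GGG
print join(" ", $mutant =~ /(...)/g), "\n";   # prints ATG AAC CCG
```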
gene
A functional unit of the genome. When
not specifically stated, “gene” is usually
considered a “protein-coding” gene, but
many genes don’t contain the instructions
for proteins (e.g., various RNA genes).
genetic code
The mapping of codons to amino acids.
See Table 2-3.
genetic drift
The tendency of sequences to change over
time by accumulating random mutations.
genome
The complete genetic material for an
organism. For eukaryotes, the genome
refers to the nuclear genome and doesn’t
include organelles.
global alignment
An alignment algorithm that requires
every letter of each sequence to appear in
the alignment. Globally aligning
sequences of different lengths may lead to
very strange alignments.
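A minimal sketch of a global alignment score, computed by dynamic
programming in the style of the Needleman-Wunsch algorithm, is shown below.
The scoring values (+1 match, -1 mismatch, -1 per gap letter) and the name
global_score are illustrative assumptions. Only the score is returned; the
alignment itself is not traced back.

```perl
use strict;
use warnings;

# Global alignment score by dynamic programming. Scoring values are
# illustrative assumptions: +1 match, -1 mismatch, -1 per gap letter.
sub global_score {
    my ($s, $t) = @_;
    my ($match, $mismatch, $gap) = (1, -1, -1);
    my @a = split //, $s;
    my @b = split //, $t;
    my ($n, $m) = (scalar @a, scalar @b);
    my @score;

    # First column and row: a prefix aligned against nothing is all gaps.
    $score[$_][0] = $_ * $gap for 0 .. $n;
    $score[0][$_] = $_ * $gap for 0 .. $m;

    for my $i (1 .. $n) {
        for my $j (1 .. $m) {
            my $diag = $score[$i - 1][$j - 1]
                     + ($a[$i - 1] eq $b[$j - 1] ? $match : $mismatch);
            my $up   = $score[$i - 1][$j] + $gap;
            my $left = $score[$i][$j - 1] + $gap;
            my $best = $diag;
            $best = $up   if $up   > $best;
            $best = $left if $left > $best;
            $score[$i][$j] = $best;
        }
    }
    # Every letter of both sequences has been consumed at this cell.
    return $score[$n][$m];
}

print global_score("ACGT", "AGT"), "\n";   # prints 2
```

Because the score is read from the final cell, every letter of both
sequences is accounted for, which is why sequences of very different
lengths are forced to absorb many gaps.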
homologous
In sequence analysis, homologous means
derived from a common ancestor.
Sequences are either homologous or they
aren’t. It is incorrect to say that sequences
are 80 percent homologous unless you
mean that there is an 80 percent chance of
common ancestry. Use percent identity to
describe the similarity of alignments.
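Percent identity can be computed directly from an alignment. The sketch
below takes two aligned strings of equal length with '-' marking gaps. The
helper name percent_identity is hypothetical, and dividing by the full
alignment length is one assumed convention; others exist.

```perl
use strict;
use warnings;

# Percent identity of a pairwise alignment given as two equal-length
# strings with '-' for gaps. The denominator is the alignment length
# (an assumption; other conventions exist).
sub percent_identity {
    my ($x, $y) = @_;
    die "aligned strings must have equal length\n"
        unless length($x) == length($y);
    my $len = length $x;
    return 0 unless $len;
    my $same = 0;
    for my $i (0 .. $len - 1) {
        my ($p, $q) = (substr($x, $i, 1), substr($y, $i, 1));
        $same++ if $p eq $q && $p ne '-';
    }
    return 100 * $same / $len;
}

# 8 identical columns out of 10.
printf "%.0f%%\n", percent_identity("ACGT-ACGTA", "ACGTTACCTA");  # prints 80%
```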
hydrophilic
Literally, “likes water.” Water is a polar
molecule that mixes well with other polar
molecules. The charged amino acids K, R,
D, and E are examples of hydrophilic
amino acids.
