This is the Title of the Book, eMatter Edition
Copyright © 2012 O’Reilly & Associates, Inc. All rights reserved.
Glossary
1˚
The abbreviation for primary. 1˚ sequence
refers to the letters of DNA, RNA, or
protein. 1˚ transcript refers to an
unprocessed RNA that still contains its
introns.
2˚
The abbreviation for secondary. Most fre-
quently used for generalizing protein and
RNA structures; for example, the α-helix
and hair-pin are common 2˚ structures.
3´
The end of a nucleic acid sequence; often
used with UTR.
5´
The start of a nucleic acid (DNA or RNA)
sequence; often used in conjunction with
UTR (e.g., 5´UTR). Nucleotide sequences
are conventionally written with the 5´ end
at the left. DNA molecules are usually
double-stranded, but when written, usually
only the 5´ to 3´ strand is displayed.
The complementary strand has reversed
polarity (3´ to 5´).
aa
The abbreviation for an amino acid that is
often used when describing the length of a
protein (e.g., the average protein is about
300 aa long).
allele
A form of a gene. Typically, the most
common form is called wild-type, and
each allele is given a specific (and often
obscure) name.
amino acid
The basic building block for all proteins.
There are 20 common amino acids.
Arabidopsis thaliana
Known by its common name, thale cress,
this mustard weed is a favorite organism
for plant genetics and molecular biology.
It was the first plant with a complete
genomic sequence. For more information,
see http://www.arabidopsis.org.
bit
The contraction for binary digit. The
base-2 logarithm of a number is in units of
bits.
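Perl's built-in log() returns the natural logarithm, so a value in bits
needs a change of base. A minimal sketch (the helper name log2 is our own,
not a Perl built-in):

```perl
use strict;
use warnings;

# Perl's log() is the natural logarithm; dividing by log(2)
# converts the result to base 2, i.e., to bits.
sub log2 {
    my ($x) = @_;
    return log($x) / log(2);
}

# Bits needed to specify one letter when all letters are equally likely
printf "DNA letter: %.2f bits\n", log2(4);    # 4 nucleotides
printf "amino acid: %.2f bits\n", log2(20);   # 20 common amino acids
```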
BLOSUM
The abbreviation for a blocks substitution
matrix. Matrix names are followed by a
number (e.g., BLOSUM62) that indicates
the percent identity at or above which
aligned sequences were clustered when the
matrix was built.
bp
The abbreviation for base pair. The length
of DNA is usually given in bp or nt.
Common measures include Kb, Mb, and Gb
for thousands, millions, and billions of bp,
respectively.
C-terminus
The end of a protein. In text form, the
C-terminus of the protein is always at the
right.
Caenorhabditis elegans
A nematode (also called a roundworm)
that is about 1 mm long and has about
1,000 cells as an adult. C. elegans was the
first animal to have its complete genome
sequenced. See http://www.wormbase.org.
CDS
The abbreviation for a coding sequence.
CDS isn’t synonymous with exon, since
exons may contain noncoding sequence.
codon
Three contiguous letters of DNA or RNA.
Each of the 64 codons specifies either an
amino acid or a translation stop.
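As a rough illustration, the sketch below splits a sequence into codons and
enumerates all 64 of them. The function name codons and the frame argument
are assumptions made for this example.

```perl
use strict;
use warnings;

# Split a sequence into consecutive three-letter codons, starting at
# offset $frame (0, 1, or 2). Leftover letters that don't fill a codon
# are dropped.
sub codons {
    my ($seq, $frame) = @_;
    $frame = 0 unless defined $frame;
    return () if $frame >= length($seq);
    my @codons = substr($seq, $frame) =~ /(.{3})/g;
    return @codons;
}

# Enumerate every possible DNA codon: 4 x 4 x 4 = 64
my @bases = qw(A C G T);
my @all;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            push @all, "$b1$b2$b3";
        }
    }
}

print join(' ', codons('ATGGCCTAAGG')), "\n";
print scalar(@all), " possible codons\n";
```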
complement
The complement of a DNA sequence is
the sequence on the other strand. For
example, the complement of ACCCGT is
TGGGCA. To complement a sequence in
Perl, use either of the following:
# 4-letter alphabet
$dna =~ tr/ACGT/TGCA/;
# 15-letter alphabet (IUPAC codes, including N);
# W, S, and N are their own complements
$dna =~ tr[ACGTRYKMBDHVWSN]
          [TGCAYRMKVHDBWSN];
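Because sequences are conventionally written 5´ to 3´, the other strand is
usually reported as the reverse complement rather than the plain complement.
A minimal sketch, assuming uppercase input in the 4-letter alphabet (the
name revcomp is our own):

```perl
use strict;
use warnings;

# Reverse-complement a DNA sequence so the other strand also reads
# 5' to 3'. Letters outside ACGT pass through unchanged.
sub revcomp {
    my ($dna) = @_;
    my $rc = reverse $dna;     # scalar context reverses the string
    $rc =~ tr/ACGT/TGCA/;      # then complement each letter
    return $rc;
}

print revcomp('ACCCGT'), "\n";
```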
Drosophila melanogaster
The common fruit fly. This is one of the
most famous organisms for genetic
research and was one of the first animals
whose complete genomic sequence was
determined. See http://www.fruitfly.org.
dynamic programming
A common technique that reduces the
computational complexity of a problem
by finding and extending a partial optimi-
zation.
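One way to picture this is the classic edit-distance calculation, where each
table cell extends the best answer to a smaller subproblem. This is a generic
illustration of the technique, not code from the BLAST programs.

```perl
use strict;
use warnings;

# Edit distance by dynamic programming. Each cell reuses the optimal
# answers for shorter prefixes, so the work is proportional to
# length(s) x length(t) rather than to the number of possible edit paths.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));       # distance from '' to each prefix of $t
    for my $i (1 .. length($s)) {
        my @cur = ($i);                 # distance from a prefix of $s to ''
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;        # match or substitution
            my $del  = $prev[$j] + 1;                # delete a letter of $s
            my $ins  = $cur[$j - 1] + 1;             # insert a letter of $t
            $best = $del if $del < $best;
            $best = $ins if $ins < $best;
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

print edit_distance('kitten', 'sitting'), "\n";
```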
E. coli
Escherichia coli. A common bacterium
normally found in your gut and a favorite
organism for molecular biology research.
Some variants cause food poisoning.
effective length
Karlin-Altschul statistics assume
sequences of infinite length. To adjust for
edge effects in real sequences, the search
space is reduced by adjusting the true
lengths of the sequences to effective
lengths.
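A common form of this adjustment subtracts the expected HSP length,
l = ln(Kmn)/H, from each sequence length. The sketch below assumes that
form. The K and H values in the usage lines are illustrative placeholders,
and real programs may apply further corrections.

```perl
use strict;
use warnings;

# Effective lengths under the approximation l = ln(K*m*n) / H, where
# m and n are the true lengths and K and H are Karlin-Altschul
# parameters (assumed inputs here, not computed from a scoring matrix).
sub effective_lengths {
    my ($m, $n, $K, $H) = @_;
    my $l     = log($K * $m * $n) / $H;    # expected HSP length
    my $m_eff = $m - $l;
    my $n_eff = $n - $l;
    $m_eff = 1 if $m_eff < 1;              # guard for very short sequences
    $n_eff = 1 if $n_eff < 1;
    return ($m_eff, $n_eff, $l);
}

# Illustrative values only: a 300-letter query against 1,000,000 letters
my ($m_eff, $n_eff, $l) = effective_lengths(300, 1_000_000, 0.134, 0.401);
printf "expected HSP length %.1f, effective query length %.1f\n", $l, $m_eff;
```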
entropy
Randomness; disorder; unpredictability.
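For sequences, unpredictability is often quantified as Shannon entropy, the
average number of bits per letter. A minimal sketch that measures the letter
composition of a single sequence (the function name is our own):

```perl
use strict;
use warnings;

# Shannon entropy of a sequence's letter composition, in bits per letter:
# H = -sum(p * log2(p)) over the observed letter frequencies p.
sub shannon_entropy {
    my ($seq) = @_;
    my %count;
    $count{$_}++ for split //, $seq;
    my $total = length($seq);
    my $h = 0;
    for my $n (values %count) {
        my $p = $n / $total;
        $h -= $p * log($p) / log(2);
    }
    return $h;
}

printf "%.2f\n", shannon_entropy('AAAAAAAA');   # one letter only: fully predictable
printf "%.2f\n", shannon_entropy('ACGTACGT');   # four equally common letters
```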
eukaryote
Organisms with intracellular membra-
nous organelles such as the nucleus and
mitochondria are called eukaryotes.
frame-shift mutation
An insertion or deletion of a number of
nucleotides that isn’t a multiple of
three, which therefore changes the reading
frame of the downstream sequence.
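The effect is easy to see by reading a sequence as codons before and after
a one-letter deletion. The example sequence and helper names below are
invented for illustration.

```perl
use strict;
use warnings;

# An indel shifts the reading frame when its length isn't a multiple of three.
sub is_frameshift {
    my ($indel_length) = @_;
    return $indel_length % 3 != 0 ? 1 : 0;
}

# Read a sequence as codons from its first letter (a partial codon is dropped)
sub reading_frame {
    my ($seq) = @_;
    my @codons = $seq =~ /(.{3})/g;
    return join ' ', @codons;
}

my $original = 'ATGGCCAAATGA';
my $deleted  = substr($original, 0, 4) . substr($original, 5);   # remove one letter

print reading_frame($original), "\n";
print reading_frame($deleted),  "\n";   # every codon after the deletion differs
```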
gene
A functional unit of the genome. When
not specifically stated, “gene” is usually
considered a “protein-coding” gene, but
many genes don’t contain the instructions
for proteins (e.g., various RNA genes).
genetic code
The mapping of codons to amino acids.
See Table 2-3.
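The table isn't reproduced in this glossary, but the standard genetic code
can be sketched in Perl as a hash keyed by codon. The compact amino acid
string below lists the 64 codons in T, C, A, G order. Alternative codes
(e.g., mitochondrial) are not handled, and translate is our own helper name.

```perl
use strict;
use warnings;

# Build the standard genetic code. The amino acid string follows the
# codons in T, C, A, G order at each position; '*' marks a translation stop.
my @bases = qw(T C A G);
my $aas   = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %code;
my $index = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $code{"$b1$b2$b3"} = substr($aas, $index++, 1);
        }
    }
}

# Translate from the first letter; unknown codons become 'X' and a
# trailing partial codon is ignored.
sub translate {
    my ($seq) = @_;
    $seq = uc $seq;
    $seq =~ tr/U/T/;                      # accept RNA as well as DNA
    my $protein = '';
    for (my $pos = 0; $pos + 3 <= length($seq); $pos += 3) {
        my $codon = substr($seq, $pos, 3);
        $protein .= exists $code{$codon} ? $code{$codon} : 'X';
    }
    return $protein;
}

print translate('ATGGCCAAATGA'), "\n";
```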
genetic drift
The tendency of sequences to change over
time by accumulating random mutations.
genome
The complete genetic material for an
organism. For eukaryotes, the genome
refers to the nuclear genome and doesn’t
include organelles.
global alignment
An alignment algorithm that requires
every letter of each sequence to appear in
the alignment. Globally aligning
sequences of different lengths may lead to
very strange alignments.
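A minimal Needleman-Wunsch sketch shows the requirement in action: the
traceback starts at the bottom-right corner of the table and runs to the
top-left, so every letter of both sequences ends up in the output. The
scoring values in the usage line are illustrative assumptions, and linear
gap costs are used for brevity.

```perl
use strict;
use warnings;

# Global alignment with linear gap costs. Returns the score and the two
# aligned strings, with '-' marking gaps.
sub global_align {
    my ($s, $t, $match, $mismatch, $gap) = @_;
    my ($m, $n) = (length($s), length($t));
    my @score;
    # Aligning a prefix against nothing costs one gap per letter
    $score[$_][0] = $_ * $gap for 0 .. $m;
    $score[0][$_] = $_ * $gap for 0 .. $n;
    for my $i (1 .. $m) {
        for my $j (1 .. $n) {
            my $pair = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? $match : $mismatch;
            my $best = $score[$i - 1][$j - 1] + $pair;
            my $up   = $score[$i - 1][$j] + $gap;
            my $left = $score[$i][$j - 1] + $gap;
            $best = $up   if $up   > $best;
            $best = $left if $left > $best;
            $score[$i][$j] = $best;
        }
    }
    # Trace back from the bottom-right corner to the top-left corner
    my ($as, $at) = ('', '');
    my ($i, $j) = ($m, $n);
    while ($i > 0 || $j > 0) {
        my $pair;
        if ($i > 0 && $j > 0) {
            $pair = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? $match : $mismatch;
        }
        if (defined $pair && $score[$i][$j] == $score[$i - 1][$j - 1] + $pair) {
            $as = substr($s, --$i, 1) . $as;       # letters aligned to each other
            $at = substr($t, --$j, 1) . $at;
        }
        elsif ($i > 0 && $score[$i][$j] == $score[$i - 1][$j] + $gap) {
            $as = substr($s, --$i, 1) . $as;       # gap in $t
            $at = '-' . $at;
        }
        else {
            $as = '-' . $as;                       # gap in $s
            $at = substr($t, --$j, 1) . $at;
        }
    }
    return ($score[$m][$n], $as, $at);
}

my ($score, $top, $bottom) = global_align('ACGT', 'AT', 1, -1, -2);
print "$score\n$top\n$bottom\n";
```

With sequences of different lengths, the shorter one is padded with gaps,
which is where the strange-looking alignments come from.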
homologous
In sequence analysis, homologous means
derived from a common ancestor.
Sequences are either homologous or they
aren’t. It is incorrect to say that sequences
are 80 percent homologous unless you
mean that there is an 80 percent chance of
common ancestry. Use percent identity to
describe the similarity of alignments.
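Percent identity is simple to compute from two aligned strings. The sketch
below divides identical columns by the alignment length, gapped columns
included. Other denominators are also in use, so treat this convention as
an assumption.

```perl
use strict;
use warnings;

# Percent identity of two aligned strings of equal length ('-' marks a gap).
# Identical columns are divided by the total number of alignment columns.
sub percent_identity {
    my ($top, $bottom) = @_;
    die "aligned strings must have equal length\n"
        unless length($top) == length($bottom);
    my ($same, $cols) = (0, 0);
    for my $k (0 .. length($top) - 1) {
        my ($x, $y) = (substr($top, $k, 1), substr($bottom, $k, 1));
        $cols++;
        $same++ if $x eq $y && $x ne '-';
    }
    return $cols ? 100 * $same / $cols : 0;
}

printf "%.1f%% identity\n", percent_identity('ACGT-ACGT', 'ACGTTACGA');
```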
hydrophilic
Literally, “likes water.” Water is a polar
molecule that mixes well with other polar
molecules. The charged amino acids K, R,
D, and E are examples of hydrophilic
amino acids.
