BUY THIS BOOK
Add to Cart

Print Book $39.95


Safari Books Online

What is this?

Add to UK Cart

Print Book £28.50

What is this?

Looking to Reprint this content?


BLAST
BLAST By Ian Korf, Mark Yandell, Joseph Bedell
July 2003
Pages: 360

Cover | Table of Contents | Colophon


Table of Contents

Chapter 1: Hello BLAST
Welcome to BLAST! This chapter offers a quick start guide to BLAST by exploring some Internet search pages. Throughout the chapter, you may encounter unfamiliar (or even frightening) terms. Don't panic. The terms are fully explained in later chapters or in the Glossary. You don't need to understand all the concepts to get the most out of this chapter. If you're already a seasoned BLAST user, feel free to skip this introduction and dive right into the later sections.
BLAST is an acronym for Basic Local Alignment Search Tool. Despite the adjective "Basic" in its name, BLAST is a sophisticated software package that has become the single most important piece of software in the field of bioinformatics. There are several reasons for this. First, sequence similarity is a powerful tool for identifying the unknowns in the sequence world. Second, BLAST is fast. The sequence world is big and growing rapidly, so speed is important. Third, BLAST is reliable, from both a rigorous statistical standpoint and a software development point of view. Fourth, BLAST is flexible and can be adapted to many sequence analysis scenarios. Finally, BLAST is entrenched in the bioinformatics culture to the extent that the word "blast" is often used as a verb. There are other BLAST-like algorithms with some useful features, but the historical momentum of BLAST maintains its popularity above all others.
Although BLAST originated at the National Center for Biotechnology Information (NCBI), its development continues at various institutions, both academic and commercial. This can be a little confusing, especially because people often put prefixes or suffixes on the acronym to come up with names like XYZ-BLAST-PDQ. We have aimed to keep this book as simple as possible, and therefore we concentrate on the two most popular versions: NCBI-BLAST and WU-BLAST (pronounced "woo blast"). NCBI-BLAST, as the name suggests, is the version available from the NCBI. WU-BLAST comes from Washington University in St. Louis and is developed by Warren Gish, one of the original authors of BLAST.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
What Is BLAST?
BLAST is an acronym for Basic Local Alignment Search Tool. Despite the adjective "Basic" in its name, BLAST is a sophisticated software package that has become the single most important piece of software in the field of bioinformatics. There are several reasons for this. First, sequence similarity is a powerful tool for identifying the unknowns in the sequence world. Second, BLAST is fast. The sequence world is big and growing rapidly, so speed is important. Third, BLAST is reliable, from both a rigorous statistical standpoint and a software development point of view. Fourth, BLAST is flexible and can be adapted to many sequence analysis scenarios. Finally, BLAST is entrenched in the bioinformatics culture to the extent that the word "blast" is often used as a verb. There are other BLAST-like algorithms with some useful features, but the historical momentum of BLAST maintains its popularity above all others.
Although BLAST originated at the National Center for Biotechnology Information (NCBI), its development continues at various institutions, both academic and commercial. This can be a little confusing, especially because people often put prefixes or suffixes on the acronym to come up with names like XYZ-BLAST-PDQ. We have aimed to keep this book as simple as possible, and therefore we concentrate on the two most popular versions: NCBI-BLAST and WU-BLAST (pronounced "woo blast"). NCBI-BLAST, as the name suggests, is the version available from the NCBI. WU-BLAST comes from Washington University in St. Louis and is developed by Warren Gish, one of the original authors of BLAST.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Using NCBI-BLAST
This book begins by exploring the BLAST pages on the NCBI web site. The NCBI, part of the National Institutes of Health, is a U.S. government-funded center for the curation and presentation of public biological knowledge. The NCBI is a public repository for DNA and protein sequences (GenBank), but it's far more than just a data storehouse. The NCBI also maintains a comprehensive medical publication archive (PubMed), distributes many tools for biological analyses (NCBI toolbox), and puts together its own tools for making the most use of the data that it stores (LocusLink, UniGene, RefSeq, Taxonomy browser). Most importantly, for our purposes, it's where the BLAST algorithm was first developed (Altschul et al., 1990) and where it can be obtained, distributed, and used for free without restrictions. Anyone with access to the Internet can run a BLAST search and explore the plethora of genetic resources that have been amassed and curated by the NCBI over the years.
You'll get the most out of this chapter if you follow along with a web browser. Begin by going to the BLAST homepage at http://www.ncbi.nlm.nih.gov/BLAST.
Without explaining all of the options presented on the homepage, let's get right into it with a default BLASTN search. Choose "Standard nucleotide-nucleotide BLAST [blastn]" as shown in Figure 1-1. BLASTN is a program that compares a nucleotide query sequence to a database of nucleotide sequences.
Figure 1-1: NCBI BLAST home page
After choosing the kind of search you want to perform, the next step is to define the sequence with which to search. There are three options for this: paste in the bare sequence, paste in a file in FASTA format, or enter a valid NCBI identifier. You can just start typing a sequence in the search box; however, when the search is done, there will be no identifier to describe the sequence you entered. After several such searches, the lack of an identifier will make it difficult to keep track of which results go with which sequence. The second option allows you to define the sequence using the FASTA format. The FASTA format is described in detail in Chapter 11, but the basic specifications are that it's a text file beginning with a greater than sign (>) followed by an identifier and a definition line, which is then proceeded by the one-letter nucleotide or peptide sequence on subsequent lines. Let's use the following sequence:
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Alternate Output Formats
This chapter showed the default HTML format, which is obviously best for viewing in a web browser. But what if you wanted to parse the output or store it in a database? HTML is not the best format for these choices. The NCBI also supports Plain Text, eXtensible Markup Language (XML), and ASN.1 formats. To see these different formats, just scroll back to the top of the report, choose another format under the Format option, and then resubmit using the Format! button. You can try this for all the formats, and then just hit the browser Back button to return to the HTML formatted page.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Alternate Alignment Views
The default Pairwise view shown in Figure 1-10 is the classic BLAST output style, but other options are available for other purposes. These options, described in the NCBI reference section and in Appendix A, include pairwise, query-anchored with identities, query-anchored without identities, flat query-anchored with identities, flat query-anchored without identities, and Hit Table. The most friendly option for text parsers is the Hit Table, which is viewed in plaintext format. This displays all the results in a tab-delimited table, which can be parsed easily. You can select this at the top of the page by changing "Format" to "Plain text" and "Alignment view" to "Hit Table" (Figure 1-12).
Figure 1-12: Changing format options
The Hit Table alignment view is shown in Figure 1-13. The first five lines start with # and are comments about the BLAST program, the query, and the database, followed by a description of the reported fields. The lines after the comments are the alignments in table format. The Hit Table contains all the necessary data to judge a hit without displaying the actual sequence being aligned.
Figure 1-13: Hit Table alignment
The other available alignment options allow a multiple sequence alignment view of the BLAST hits. One of these multiple alignment options, query-anchored with identities, is shown in Figure 1-14. In this view, the full sequence of the query is shown on the top line with a unique identifier (1_18852, in this case). Subsequently, each line shows the alignment for one database hit. Identical residues are represented with a dot (.), while nucleotide differences are shown explicitly. This alignment option is useful for quickly identifying changes common to a group of sequences. For example, you can see from the part of the alignment shown in Figure 1-14 that the bottom four sequences (6754225, 664837, 664835, and 664831) have common shared differences. A deeper look into these sequences reveals that they are actually different database entries for the same mouse Hoxa11 gene, which is homologous to the coelacanth Hoxa11 gene.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
The Next Step
This chapter has taken you through a simple BLASTN search at the NCBI database; however, more than two dozen specialized BLAST pages are available, and they let you do anything—from screening for vector sequence, to identifying protein family members, to mapping a sequence to the human genome. For a quick guide to these specialized pages, the NCBI presents a convenient reference to these tools at http://www.ncbi.nlm.nih.gov/BLAST/producttable.html.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Further Reading
Altschul, S.F., T.L. Madden, A.A. Schaeffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman (1997). "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Research, 25, pp. 3389-3402.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 2: Biological Sequences
Sequence similarity is a powerful tool for discovering biological function. Just as the ancient Greeks used comparative anatomy to understand the human body and linguists used the Rosetta stone to decipher Egyptian hieroglyphs, today we can use comparative sequence analysis to understand genomes, RNAs, and proteins. But why are biological sequences similar to one another in the first place? The answer to this question isn't simple and requires an understanding of molecular and evolutionary biology.
Most courses in molecular biology begin with the Central Dogma of Molecular Biology, which describes the path by which information contained in DNA is converted to protein molecules with specific functions. Stated simply, the Central Dogma is: "from DNA to RNA to protein." Figure 2-1 shows a more complete diagram of this process and will be referenced throughout this section.
Figure 2-1: The Central Dogma of Molecular Biology: DNA to RNA to protein
The hereditary material that carries the blueprint for an organism from one generation to the next is called deoxyribonucleic acid. It is much more commonly referred to by its acronym, DNA. Every time cells divide, the DNA is duplicated in a process called DNA replication. The entire DNA of an organism is called its genome, and genomes are sometimes called "the book of life" (especially with respect to the human genome). Reading and understanding the various books of life is one of the most important quests of the genomic age. Modern medicine, agriculture, and industry will increasingly depend on an intimate knowledge of genomes to develop individualized medicines, select and modify the most desirable traits in plants and animals, and understand the relationships among species.
The language of DNA is complicated. Over the last 50 years, scientists have begun to decipher it, but it is still largely a mystery. Although the language is elusive, the alphabet is simple, consisting of just four nucleotides: adenine, cytosine, guanine, and thymine. For simplicity in both speech and on the computer, they are usually abbreviated as A, C, G, and T. DNA usually exists as a double-stranded molecule, but we generally talk about just one strand at a time. Here's an example of a DNA sequence that is six nucleotides (nt) long:
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
The Central Dogma of Molecular Biology
Most courses in molecular biology begin with the Central Dogma of Molecular Biology, which describes the path by which information contained in DNA is converted to protein molecules with specific functions. Stated simply, the Central Dogma is: "from DNA to RNA to protein." Figure 2-1 shows a more complete diagram of this process and will be referenced throughout this section.
Figure 2-1: The Central Dogma of Molecular Biology: DNA to RNA to protein
The hereditary material that carries the blueprint for an organism from one generation to the next is called deoxyribonucleic acid. It is much more commonly referred to by its acronym, DNA. Every time cells divide, the DNA is duplicated in a process called DNA replication. The entire DNA of an organism is called its genome, and genomes are sometimes called "the book of life" (especially with respect to the human genome). Reading and understanding the various books of life is one of the most important quests of the genomic age. Modern medicine, agriculture, and industry will increasingly depend on an intimate knowledge of genomes to develop individualized medicines, select and modify the most desirable traits in plants and animals, and understand the relationships among species.
The language of DNA is complicated. Over the last 50 years, scientists have begun to decipher it, but it is still largely a mystery. Although the language is elusive, the alphabet is simple, consisting of just four nucleotides: adenine, cytosine, guanine, and thymine. For simplicity in both speech and on the computer, they are usually abbreviated as A, C, G, and T. DNA usually exists as a double-stranded molecule, but we generally talk about just one strand at a time. Here's an example of a DNA sequence that is six nucleotides (nt) long:
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Evolution
BLAST works because evolution is happening. Biological sequences show complex patterns of similarity to one another. In this regard, they mirror the external morphologies of the organisms in which they reside. You'll notice that birds, for example, show natural groupings. You don't have to be a biologist to see that ducks, geese, and swans comprise a reasonably natural group called the waterfowl, and that the similarities between ducks and geese seem too great to explain by mere coincidence. Biological sequences are no different. After all, the reason why ducks look like ducks and geese look like geese is because of their genes. Many molecular biologists are convinced that understanding sequence evolution is tantamount to understanding evolution itself.
Sequences change over time due to three forces: mutation, natural selection, and genetic drift. If you use BLAST, it's important to understand these forces because they form the biological foundation of similarity searches. The biological and mathematical foundations aren't the same, and are sometimes at odds. You need to understand both theories in order to knowledgeably interpret the sequence alignments in a BLAST report.
A mutation is simply a change in a DNA sequence. What causes mutation? Many chemicals and conditions damage DNA, so its sequence either changes or ceases to be recognizable. Mutagenic agents are often called carcinogens because cancer is caused by the accumulation of mutations in genes that control cell division. But even in a world without carcinogens there would still be mutation because the process of DNA replication isn't perfect. Every time a cell divides, it must duplicate its DNA. The human genome is about three billion letters long, and the error rate of DNA replication is about one error in every 300 million letters, so you can expect about 10 mutations per genome duplication. Genome size varies, as does the replication error rate, so don't take the 10 mutations per genome replication as any kind of biological truth. Human beings are composed of about a trillion cells, and you might take a moment now and consider just how much mutation is going on in your own body. Whatever that large number is, it's infinitesimal compared to what's happening in the biosphere as a whole.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Genomes and Genes
In general, the genomic structure of prokaryotes is very different from that of eukaryotes (Figure 2-5). Genomes are organized into chromosomes. Prokaryotes often have a single circular chromosome, and eukaryotes usually have multiple linear chromosomes. People are sometimes surprised to find that genome size and chromosome number aren't reflected in organismal complexity. For example, the single-celled Amoeba dubia has a genome that is about 200 times larger than the human genome. Although dogs and cats have very similar genome sizes, dogs have twice as many chromosomes. One rule to keep in mind when thinking about genomic organization is that genomes of viruses and prokaryotic organisms generally contain little noncoding sequence, whereas the genomes of more complex organisms usually contain a much higher percentage of noncoding sequence.
Figure 2-5: Prokaryote and eukaryote cells
Prokaryotic genes are relatively simple. They contain a promoter that determines when the gene is transcribed and a coding region that contains the DNA sequence for a protein. It is relatively easy to find genes in prokaryotic genomes. Since stop codons are expected about every 21 triplets (there are three stop codons out of 64 total triplet combinations), long open reading frames (ORFs) should be very rare, at least from an unbiased random model. On average, proteins are 300 amino acids long, so finding an ORF that is 900 nucleotides long is really unexpected and a pretty clear signal that the ORF codes for a real protein. Of course, some genes encode small proteins, and finding these is a bit more difficult.
Eukaryotic gene structure is more complicated than prokaryotic gene structure. Unlike prokaryotic genes, eukaryotic genes are often broken into pieces that are assembled before they are translated. Like prokaryotes, eukaryotes also have promoters to regulate when genes are turned on, but they are often much larger and may exist a great distance from the start of translation. In addition, many genes respond to additional sequences called
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Biological Sequences and Similarity
The beginning of this chapter asked why biological sequences are similar to one another. Let's answer that question now. You've seen that biological sequences like proteins may have important functions necessary for the survival of an organism. You've also seen that DNA sequence can mutate randomly, and this may change how a sequence functions. Over time, both functional constraints and random processes impact the course of sequence evolution. The degree to which a sequence follows a functional or random path depends on natural selection and neutral evolution. So the reason why sequences are similar to one another is because they start out similar to one another and follow different paths. When you read a BLAST report, you will find that your knowledge of molecular and evolutionary biology will help you interpret the similarities and differences you see.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Further Reading
Genetics, molecular biology, and evolution aren't especially difficult topics, but they are filled with many potentially unfamiliar terms. The following books are recommended for those just getting started in these fields. They are informative and entertaining, and can help more experienced readers communicate effectively with novices.
Clark, David P. and Lonnie D. Russell, Molecular Biology Made Simple and Fun (Cache River Press).
Gonick, Larry and Mark Wheelis, The Cartoon Guide to Genetics (Perennial).
Tagliaferro, Linda and Mark Vincent Bloom, The Complete Idiot's Guide to Decoding Your Genes (Alpha Books).
The following are typical textbooks for college-level courses in molecular biology, genetics, and evolution:
Alberts, Brooks et al., Molecular Biology of the Cell (Garland).
Futuyma, Douglas J., Evolutionary Biology (Sinauer Associates, Inc.).
Graur, Dan and Wen-Hsiung Li, Molecular Evolution (Sinauer).
Hartl, Daniel L. and Elizabeth W. Jones, Genetics: Analysis of Genes and Genomes, (Jones & Bartlett).
Lewin, Benjamin, Genes VII (Oxford University Press).
Lodish, Harvey et. al., Molecular Cell Biology (W.H. Freeman & Co.).
Page, Roderic D. M. and Edward C. Holmes, Molecular Evolution: A Phylogenetic Approach (Blackwell Science).
Watson, James D. and Joan Steitz. Molecular Biology of the Gene (Addison-Wesley).
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 3: Sequence Alignment
BLAST finds statistically significant similarities between sequences by evaluating alignments, but how are sequences aligned? In principle, there are many ways to align two sequences, but in practice, one method is used more often than any other. This chapter explains this technique with the biologist in mind, without using the mathematical notation and jargon that is usually employed to describe such algorithm. Divested of unfamiliar language and notation, these algorithms are quite simple.
Finding the optimal alignment between two sequences can be a computationally complex task. Fortunately, a technique called dynamic programming (DP) makes sequence alignment tractable as long as you follow a few rules. Rather than have you struggle with a confusing definition of DP, this chapter demonstrates how the technique works for sequence alignment and then gets back to the generalities. There are fundamentally two kinds of alignment: global and local. In global alignment, both sequences are aligned along their entire lengths and the best alignment is found. In local alignment, the best subsequence alignment is found. For example, if you want to find the two most similar sentences between two books, you use local alignment. If you want to compare the sentences end to end, use global alignment. This chapter describes global alignment, then local alignment. The example uses English words instead of biological sequences and the algorithms are quite general.
The global alignment algorithm described here is called the Needleman-Wunsch algorithm. We will explain it in a way that seems natural to biologists, that is, it tells the end of the story first, and then fills in the details. (This is why biologists make terrible comedians; they always tell the punch line first.) We will align the words COELANCANTH and PELICAN using a simple scoring scheme: +1 for letters that match, -1 for mismatches, and -1 for gaps. The alignment will eventually look like one of the following, which are equivalent given our scoring scheme:
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Global Alignment: Needleman-Wunsch
The global alignment algorithm described here is called the Needleman-Wunsch algorithm. We will explain it in a way that seems natural to biologists, that is, it tells the end of the story first, and then fills in the details. (This is why biologists make terrible comedians; they always tell the punch line first.) We will align the words COELANCANTH and PELICAN using a simple scoring scheme: +1 for letters that match, -1 for mismatches, and -1 for gaps. The alignment will eventually look like one of the following, which are equivalent given our scoring scheme:
COELACANTH      COELACANTH
P-ELICAN--      -PELICAN--
Note that every letter of each sequence is aligned to a letter or a gap. In local alignments, discussed later, this isn't the case.
The alignment takes place in a two-dimensional matrix in which each cell corresponds to a pairing of one letter from each sequence. To get an intuitive understanding of the alignment algorithm, look at Figure 3-1, which shows where the maximum scoring alignment lies in the matrix. The alignment starts at the upper left and follows a mostly diagonal path down and to the right. When two letters are aligned, the path follows a diagonal trajectory. There are several places in which the letters from COELACANTH are paired to gap characters. In this case, the graph is followed horizontally. Although not shown here, the path may be also be followed vertically when the letters from PELICAN are paired with gap characters. Gap characters can never be paired to each other. Note that the first row and column are blank. The reason for this will become clear shortly.
Figure 3-1: Example of an alignment matrix
In reality, you don't store letters in the matrix as shown in Figure 3-1. Each cell of the matrix actually contains two values: a score and a pointer. The score is derived from the scoring scheme. Here, this means +1 or -1, but when aligning biological sequences, the values come from a scoring matrix (a topic of the next chapter). The pointer is a directional indicator (an arrow) that points up, left, or diagonally up and left. The pointer navigates the matrix, and its use will become clearer later in the chapter. Now, let's look at the algorithm in detail. There are three major phases: initialization, fill, and trace-back.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Local Alignment: Smith-Waterman
The local alignment algorithm we describe here, the Smith-Waterman algorithm, is a very simple modification of Needleman-Wunsch. There are only three changes:
  • The edges of the matrix are initialized to 0 instead of increasing gap penalties.
  • The maximum score is never less than 0, and no pointer is recorded unless the score is greater than 0.
  • The trace-back starts from the highest score in the matrix (rather than at the end of the matrix) and ends at a score of 0 (rather than the start of the matrix).
These simple changes have a profound effect on the behavior of algorithm. Using the same words and scoring scheme as you did in global alignment, look at the filled matrix in Figure 3-6.
Figure 3-6: Filled Smith-Waterman alignment matrix
As you can see, there are a lot of zeroes. That's because there are a lot of places where you can't get a positive score. Also note that one cell is circled. This is the cell with the maximum score in the matrix. There may be more than one place in the matrix with the same maximum score, and in this case, some kind of arbitrary decision must be made. Start your trace-back from the maximum score and follow it to a score of zero, creating your alignment as you step backward, just as you did with global alignment. At the end, you have an alignment with a score of 4 that looks like this:
ELACAN
ELICAN
Example 3-2 shows the Perl code.
Example 3-2. Local alignment with the Smith-Waterman algorithm
# Smith-Waterman  Algorithm

# usage statement
die "usage: $0 <sequence 1> <sequence 2>\n" unless @ARGV == 2;

# get sequences from command line
my ($seq1, $seq2) = @ARGV;

# scoring scheme
my $MATCH    =  1; # +1 for letters that match
my $MISMATCH = -1; # -1 for letters that mismatch
my $GAP      = -1; # -1 for any gap

# initialization
my @matrix;
$matrix[0][0]{score}   = 0;
$matrix[0][0]{pointer} = "none";
for(my $j = 1; $j <= length($seq1); $j++) {
    $matrix[0][$j]{score}   = 0;
    $matrix[0][$j]{pointer} = "none";
}
for (my $i = 1; $i <= length($seq2); $i++) {
    $matrix[$i][0]{score}   = 0;
    $matrix[$i][0]{pointer} = "none";
}

# fill
my $max_i     = 0;
my $max_j     = 0;
my $max_score = 0;

for(my $i = 1; $i <= length($seq2); $i++) {
    for(my $j = 1; $j <= length($seq1); $j++) {
        my ($diagonal_score, $left_score, $up_score);
        
        # calculate match score
        my $letter1 = substr($seq1, $j-1, 1);
        my $letter2 = substr($seq2, $i-1, 1);       
        if ($letter1 eq $letter2) {
            $diagonal_score = $matrix[$i-1][$j-1]{score} + $MATCH;
        }
        else {
            $diagonal_score = $matrix[$i-1][$j-1]{score} + $MISMATCH;
        }
        
        # calculate gap scores
        $up_score   = $matrix[$i-1][$j]{score} + $GAP;
        $left_score = $matrix[$i][$j-1]{score} + $GAP;
        
        if ($diagonal_score <= 0 and $up_score <= 0 and $left_score <= 0) {
            $matrix[$i][$j]{score}   = 0;
            $matrix[$i][$j]{pointer} = "none";
            next; # terminate this iteration of the loop
        }
        
        # choose best score
        if ($diagonal_score >= $up_score) {
            if ($diagonal_score >= $left_score) {
                $matrix[$i][$j]{score}   = $diagonal_score;
                $matrix[$i][$j]{pointer} = "diagonal";
            }
            else {
                $matrix[$i][$j]{score}   = $left_score;
                $matrix[$i][$j]{pointer} = "left";
            }
        } else {
            if ($up_score >= $left_score) {
                $matrix[$i][$j]{score}   = $up_score;
                $matrix[$i][$j]{pointer} = "up";
            }
            else {
                $matrix[$i][$j]{score}   = $left_score;
                $matrix[$i][$j]{pointer} = "left";
            }
        }
        
        # set maximum score
        if ($matrix[$i][$j]{score} > $max_score) {
            $max_i     = $i;
            $max_j     = $j;
            $max_score = $matrix[$i][$j]{score};
        }
    }
}

# trace-back

my $align1 = "";
my $align2 = "";

my $j = $max_j;
my $i = $max_i;

while (1) {
    last if $matrix[$i][$j]{pointer} eq "none";
    
    if ($matrix[$i][$j]{pointer} eq "diagonal") {
        $align1 .= substr($seq1, $j-1, 1);
        $align2 .= substr($seq2, $i-1, 1);
        $i--; $j--;
    }
    elsif ($matrix[$i][$j]{pointer} eq "left") {
        $align1 .= substr($seq1, $j-1, 1);
        $align2 .= "-";
        $j--;
    }
    elsif ($matrix[$i][$j]{pointer} eq "up") {
        $align1 .= "-";
        $align2 .= substr($seq2, $i-1, 1);
        $i--;
    }   
}

$align1 = reverse $align1;
$align2 = reverse $align2;
print "$align1\n";
print "$align2\n";
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Dynamic Programming
Now that you've seen the typical approach to global and local alignment, consider the generality of dynamic programming. The advantage of DP can be seen in the fill phase. Each cell represents the maximum scoring alignment between the two sequences up to that point. When you calculate the next cell, you use previously stored values. In other words, DP is an optimizing function whose definition is extended as the algorithm proceeds.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Algorithmic Complexity
The complexity of algorithms is often described in big-O notation, a shorthand for the asymptotic behavior of the algorithm. For example, searching for a name in a phone book by starting at the beginning and going name-by-name takes on average, n/2 operations, where n is the number of names in the phone book. Such a search has O(n) time complexity (constants are dropped from the notation). It scales linearly in time; a phone book twice as long takes twice as long to search. An approach that scales more efficiently successively splits the phone book in half based on the alphabetical order. This is called a binary search and has O(log2 n) complexity. For example, a phone book eight times longer takes only three times longer to search. The alignment algorithms as described have O(nm) complexity in both time and memory, where n and m are the lengths of the sequences. It's also common to label such algorithms as O(n2) and speak of them as scaling quadratically.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Global Versus Local
Although global and local alignment are mechanistically similar, they have very different properties. Consider the alignment between the genomic sequence of two eukaryotic genes from distantly related organisms. You'd expect the exons to remain the same because their coding sequences are evolutionarily constrained, but the introns may no longer be recognizably similar, especially if they have acquired many insertions or deletions. The problem is that exons may account for only 1 to 2 percent of the sequence. As a result, a global alignment between these sequences is an alignment of mostly random letters. In such a scenario, it's very likely (especially if introns change size, as they often do) that the exons will not align to one another because their score contribution is very small compared to the rest of the sequence. In contrast, local alignment can pick out conserved exons, but unfortunately, just the maximum scoring one. The shortcomings of the standard alignment algorithms have been addressed by numerous variants.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Variations
The algorithms as described and implemented earlier are rarely used. Over the years, important innovations have made the general algorithms more applicable to aligning biological sequences and running efficiently in a computer.
Most alignment algorithms employ affine gaps. The previous description used a simple gap-scoring scheme in which every letter inserted or deleted cost the same amount. This isn't a very good model of biology because large gaps tend to occur fairly often. To better model this phenomenon, affine gaps are used where there is a greater penalty for opening a gap than extending the gap. Figure 3-7 shows a graphical view.
Figure 3-7: Comparison of simple and affine gap scoring schemes
Affine gap penalties are a simple modification to either algorithm. All you need to do is to record a third value in each cell of the matrix that keeps track of whether a gap has already been opened or not and then assign the appropriate gap penalty. Some algorithms even allow multiple gap scoring schemes so that very long gaps are not penalized as much. You can visualize how this works by following the minimal penalty in Figure 3-7. These algorithms are slightly more complicated because scores for each affine gap must be tracked. Usually two affine penalties are employed, and the algorithms are labeled double affine.
You shouldn't attempt to align long sequences (e.g., genomic DNA) with the versions of Needleman-Wunsch and Smith-Waterman described in Examples Example 3-1 and Example 3-2. The number of cells in the alignment matrix is the product of the sequence lengths, and each cell may take 8 bytes or more of memory. Therefore, aligning two 100,000-bp DNA sequences requires approximately 80 gigabytes (GB) of RAM. Fortunately, there are linear-memory versions of the algorithms.
You may have noticed during the fill phase of the algorithms that you use only two rows at a time. The other rows just sit there, either waiting to be calculated or waiting for the trace-back. Why not just use two rows at a time and not allocate the whole matrix? If you do this, you can calculate the score of the maximum scoring alignment and record its ending coordinates. By reversing direction, you can then compute the starting coordinates. Unfortunately, you lose the ability to perform a trace-back. But the memory required is now just two times the length of one of the sequences rather than the product of the two sequences. Using this approach, you can compare two 100,000-bp DNA sequences in just 1.6 megabytes (MB) of RAM. The alignment algorithm is now
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Final Thoughts
When you use the Needleman-Wunsch or Smith-Waterman algorithms to find the maximum scoring alignment, you're playing by computational, not biological, rules. As a result, the maximum scoring alignment only approximates the truth. However, even if all the nuances of biology were clear and you could code this in a computer algorithm, you might still favor the approximation because the computational cost of the correct algorithm can be excessive. In any case, the fact that you can align the unrelated words pelican and coelacanth merits consideration. It's possible to align any sequence; finding proper meaning in alignments is up to the user, not the algorithm.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Further Reading
For more information on the Perl programming language, consider these books:
Christiansen, Tom and Nathan Torkington, Perl Cookbook (O'Reilly & Associates).
Schwartz, Randal L. and Tom Phoenix, Learning Perl (O'Reilly).
Tisdall, James, Beginning Perl for Bioinformatics (O'Reilly).
Wall, Larry, Tom Christiansen, and Jon Orwant, Programming Perl (O'Reilly).
The following texts are indispensable resources for information on sequence alignment and algorithms in general:
Cormen, Thomas H. et al., Introduction to Algorithms (MIT Press).
Gusfield, Dan, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology (Cambridge University Press).
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Chapter 4: Sequence Similarity
The fact that the human genome is often referred to as the Book of Life is an apt description because nucleic acids and proteins are often represented (and manipulated) as text files. Chapter 3 described common algorithms for aligning sequences of letters, and score is the metric used to determine the best alignment. This chapter shows what scores really are. Some of the introduced terms come from information theory, so the chapter begins with a brief introduction to this branch of mathematics. It then explores the typical ways to measure sequence similarity. You'll see that this approach fits well with the sequence-alignment algorithms described in Chapter 3. The last part of the chapter focuses on the statistical significance of sequence similarity in a database search. The theories discussed in this chapter apply only to local alignment. There is currently no theory for global alignment.
In common usage, the word information conveys many things. Forget everything you know about this word because you're going to learn the most precise definition. Information is a decrease in uncertainty. You can also think of information as a degree of surprise.
Suppose you're taking care of a child and the response to every question you ask is "no." The child is very predictable, and you are pretty certain of the answer the next time you ask a question. There's no surprise, no information, and no communication. If another child answers "yes" or "no" to some questions, you can communicate a little, but you can communicate more if her vocabulary was greater. If you ask "do you like ice cream," which most children do, you would be informed by either answer, but more surprised if the answer was "no." Qualitatively, you expect more information to be conveyed by a greater vocabulary and from surprising answers. Thus, the information or surprise of an answer is inversely proportional to its probability. Quantitatively, information is represented by either one of the following equivalent formulations shown in Figure 4-1.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Introduction to Information Theory
In common usage, the word information conveys many things. Forget everything you know about this word because you're going to learn the most precise definition. Information is a decrease in uncertainty. You can also think of information as a degree of surprise.
Suppose you're taking care of a child and the response to every question you ask is "no." The child is very predictable, and you are pretty certain of the answer the next time you ask a question. There's no surprise, no information, and no communication. If another child answers "yes" or "no" to some questions, you can communicate a little, but you can communicate more if her vocabulary was greater. If you ask "do you like ice cream," which most children do, you would be informed by either answer, but more surprised if the answer was "no." Qualitatively, you expect more information to be conveyed by a greater vocabulary and from surprising answers. Thus, the information or surprise of an answer is inversely proportional to its probability. Quantitatively, information is represented by either one of the following equivalent formulations shown in Figure 4-1.
Figure 4-1: Equation 4-1
The information, H, associated with some probability p, is by convention the base 2 logarithm of the inverse of p. Values converted to base 2 logarithms are given the unit bits, which is a contraction of the words binary and digit (it is also common to use base e, and the corresponding unit is nats). For example, if the probability that a child doesn't like ice cream is 0.25, this answer has 2 bits of information (liking ice cream has 0.41 bits).
It is typical to describe information as a message of symbols emitted from a source. For example, tossing a coin is a source of head and tail symbols, and a message of such symbols might be:
tththttt
Similarly, the numbers 1, 2, 3, 4, 5, and 6 are symbols emitted from a six-sided die source, and the letters A, C, G, and T are emitted from a DNA source. The symbols emitted by a source have a frequency distribution. If there are
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Amino Acid Similarity
Molecular biologists usually think of amino acid similarity in terms of chemical similarity (see Table 2-1). Figure 4-3 depicts a rough qualitative categorization. From an evolutionary standpoint, you expect mutations that radically change chemical properties to be rare because they may end up destroying the protein's three-dimensional structure. Conversely, changes between similar amino acids should happen relatively frequently.
Figure 4-3: Amino acid chemical relationships
In the late '60s and early '70s, Margaret Dayhoff pioneered quantitative techniques for measuring amino acid similarity. Using sequences that were available at the time, she constructed multiple alignments of related proteins and compared the frequencies of amino acid substitutions. As expected, there is quite a bit of variation in amino acid substitution frequency, and the patterns are generally what you'd expect from the chemical properties. For example, phenylalanine (F) is most frequently paired to itself. It is also found relatively frequently with tyrosine (Y) and tryptophan (W), which share similar aromatic ring structures (see Table 2-1), and to a lesser extent with the other hydrophobic amino acids (M, V, I, and L). Phenylalanine is infrequently paired with hydrophilic amino acids (R, K, D, E, and others). You can see some of these patterns in the following multiple alignment, which corresponds to a portion of the cytochrome b protein from various organisms.
PGNPFATPLEILPEWYLYPVFQILRVLPNKLLGIACQGAIPLGLMMVPFIE
PANPFATPLEILPEWYFYPVFQILRTVPNKLLGVLAMAAVPVGLLTVPFIE
PANPMSTPAHIVPEWYFLPVYAILRSIPNKLGGVAAIGLVFVSLLALPFIN
PANPLVTPPHIKPEWYFLFAYAILRSIPNKLGGVLALLFSILMLLLVPFLH
PANPLSTPAHIKPEWYFLFAYAILRSIPNKLGGVLALLLSILVLIFIPMLQ
PANPLSTPPHIKPEWYFLFAYAILRSIPNKLGGVLALLLSILILIFIPMLQ
IANPMNTPTHIKPEWYFLFAYSILRAIPNKLGGVIGLVMSILIL..YIMIF
ESDPMMSPVHIVPEWYFLFAYAILRAIPNKVLGVVSLFASILVL..VVFVL
IVDTLKTSDKILPEWFFLYLFGFLKAIPDKFMGLFLMVILLFSL..FLFIL
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Scoring Matrices
A two-dimensional matrix containing all possible pair-wise amino acid scores is called a scoring matrix. Scoring matrices are also called substitution matrices because the scores represent relative rates of evolutionary substitutions. Scoring matrices are evolution in a nutshell. Take a moment now to peruse the scoring matrix in Figure 4-5 and compare it to the chemical groupings in Figure 4-3.
Figure 4-5: BLOSUM62 scoring matrix
Lod scores are real numbers but are usually represented as integers in text files and computer programs. To retain precision, the scores are generally multiplied by some scaling factor before converting them to integers. For example, a lod score of -1.609 nats may be scaled by a factor of two and then rounded off to an integer value of -3. Scores that have been scaled and converted to integers have a unitless quantity and are called raw scores.
Two different kinds of amino acid scoring matrices, PAM (Percent Accepted Mutation) and BLOSUM (BLOcks SUbstitution Matrix), are in wide use. The PAM matrices were created by Margaret Dayhoff and coworkers and are thus sometimes referred to as the Dayhoff matrices. These scoring matrices have a strong theoretical component and make a few evolutionary assumptions. The BLOSUM matrices, on the other hand, are more empirical and derive from a larger data set. Most researchers today prefer to use BLOSUM matrices because in silico experiments indicate that searches employing BLOSUM matrices have higher sensitivity.
There are several PAM matrices, each one with a numeric suffix. The PAM1 matrix was constructed with a set of proteins that were all 85 percent or more identical to one another. The other matrices in the PAM set were then constructed by multiplying the PAM1 matrix by itself: 100 times for the PAM100; 160 times for the PAM160; and so on, in an attempt to model the course of sequence evolution. Though highly theoretical (and somewhat suspect), it is certainly a reasonable approach. There was little protein sequence data in the 1970s when these matrices were created, so this approach was a good way to extrapolate to larger distances.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Target Frequencies, lambda, and H
The most important property of a scoring matrix is its target frequencies and the expected frequencies of the individual amino acid pairs. Target frequencies represent the underlying evolutionary model. While scoring matrices don't actually contain the target frequencies, they are implicit in the scores.
The Karlin-Altschul statistical theory on which BLAST is based (discussed in the next section) states that all scoring schemes for which a positive score is possible (and the expected score is negative) have implicit target frequencies. Thus they are lod-odds scoring schemes; even a simple "+1 match -1 mismatch" scheme is implicitly a log-odds scoring scheme and has target frequencies. You'll learn how to calculate those frequencies in just a bit, but you first need to understand two additional concepts associated with scoring schemes: lambda and relative entropy.
Raw score can be a misleading quantity because scaling factors are arbitrary. A normalized score, corresponding to the original lod score, is therefore a more useful measure. Converting a raw score to a normalized score requires a matrix-specific constant called lambda (or λ ). Lambda is approximately the inverse of the original scaling factor, but its value may be slightly different due to integer rounding errors. Let's now derive lambda.
When calculating target frequencies from multiple alignments, the sum of all target frequencies naturally sums to 1 (see Figure 4-6).
Figure 4-6: Equation 4-4
Recall from Figure 4-4 that the score of two amino acids is the log-odds ratio of the observed and expected frequencies. The same equation is presented in Figure 4-7, but the lod score is replaced by the product of lambda and the raw score (in other words, lambda had a value of 1 in Figure 4-4).
Figure 4-7: Equation 4-5
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Sequence Similarity
Sequence similarity is a simple extension of amino acid or nucleotide similarity. To determine it, sum up the individual pair-wise scores in an alignment. For example, the raw score of the following BLAST alignment under the BLOSUM62 matrix is 72. Converting 72 to a normalized score is as simple as multiplying by lambda. (Note that for BLAST statistical calculations, the normalized score is λS - lnk.)
Query:   885 QCPVCHKKYSNALVLQQHIRLHTGE 909
             +C VC K ++    L++H RLHTGE
Sbjct:   267 ECDVCSKSFTTKYFLKKHKRLHTGE 291
Recall from Chapter 3 that the score of each pair of letters is considered independently from the rest of the alignment. This is the same idea. There is a convenient synergy between alignment algorithms and alignment scores. However, when treating the letters independen