Mastering Perl for Bioinformatics

By James Tisdall
Price: $39.95 USD
£28.50 GBP



Table of Contents

Chapter 1: Modular Programming with Perl
Perl modules are essential to any Perl programmer. They are a great way to organize code into logical collections of interacting parts. They collect useful Perl subroutines and provide them to other programs (and programmers) in an organized and convenient fashion.
This chapter begins with a discussion of the reasons for organizing Perl code into modules. Modules are comparable to subroutines: both organize Perl code in convenient, reusable "chunks."
Later in this chapter, I'll introduce a small module, Geneticcode.pm. This example shows how to create a simple module, and I'll give examples of programs that use it.
I'll also demonstrate how to find, install, and use modules taken from the all-important CPAN collection. A familiarity with searching and using CPAN is an essential skill for Perl programmers; it will help you avoid lots of unnecessary work. With CPAN, you can easily find and use code written by excellent programmers and road-tested by the Perl community. Using proven code and writing less of your own, you'll save time, money, and headaches.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
What Is a Module?
A Perl module is a library file that uses package declarations to create its own namespace. Perl modules provide an extra level of protection from name collisions beyond that provided by my and use strict. They also serve as the basic mechanism for defining object-oriented classes.
Why Perl Modules?
Building a medium- to large-sized program usually requires you to divide tasks into several smaller, more manageable, and more interactive pieces. (A rule of thumb is that each "piece" should be about one or two printed pages in length, but this is just a general guideline.) An analogy can be made to building a microarray machine, which requires that you construct separate interacting pieces such as housing, temperature sensors and controls, robot arms to position the pipettes, hydraulic injection devices, and computer guidance for all these systems.
Subroutines divide a large programming job into more manageable pieces. Modern programming languages all provide subroutines, which are also called functions, coroutines, or macros in other programming languages.
A subroutine lets you write a piece of code that performs some part of a desired computation (e.g., determining the length of a DNA sequence). This code is written once and then can be called frequently throughout the main program. Using subroutines shortens the time it takes to write the main program, makes it more reliable by avoiding duplicated sections (which can get out of sync and make the program longer), and makes the entire program easier to test. A useful subroutine can be used by other programs as well, saving you development time in the future. As long as the inputs and outputs to the subroutine remain the same, its internal workings can be altered and improved without worrying about how the changes will affect the rest of the program. This is known as encapsulation.
The benefits of subroutines that I've just outlined also apply to other approaches in software engineering. Perl modules are a technique within a larger umbrella of techniques known as software encapsulation and reuse. Software encapsulation and reuse are fundamental to object-oriented programming.
A related design principle is abstraction, which involves writing code that is usable in many different situations. Let's say you write a subroutine that adds the fragment TTTTT to the end of a string of DNA. If you then want to add the fragment AAAAA to the end of a string of DNA, you have to write another subroutine. To avoid writing two subroutines, you can write one that's more abstract and adds to the end of a string of DNA whatever fragment you give it as an argument. Using the principle of abstraction, you've saved yourself half the work.
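The abstraction described above can be sketched in a few lines. The subroutine name append_fragment and the sample sequence are assumptions made for illustration; they do not come from the book.

```perl
use strict;
use warnings;

# Append any fragment to a DNA string. The fragment is an argument
# rather than being hard-coded into the subroutine.
# (append_fragment is a hypothetical name used for illustration.)
sub append_fragment {
    my ($dna, $fragment) = @_;
    return $dna . $fragment;
}

my $dna = 'ACGT';
print append_fragment($dna, 'TTTTT'), "\n";   # adds a poly-T tail
print append_fragment($dna, 'AAAAA'), "\n";   # adds a poly-A tail
```

One subroutine now covers both cases, and any other fragment as well.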
Namespaces
A namespace is implemented as a table containing the names of the variables and subroutines in a program. The table itself is called a symbol table and is used by the running program to keep track of variable values and subroutine definitions as the program evolves. A namespace and a symbol table are essentially the same thing. A namespace exists under the hood for many programs, especially those in which only one default namespace is used.
Large programs often accidentally use the same variable name for different variables in different parts of the program. These identically named variables may unintentionally interact with each other and cause serious, hard-to-find errors. This situation is called namespace collision. Separate namespaces are one way to avoid namespace collision.
The package declaration described in the next section is one way to assign separate namespaces to different parts of your code. It gives strong protection against accidentally using a variable name that's used in another part of the program and having the two identically-named variables interact in unwanted ways.
The unintentional interaction between variables with the same name is enough of a problem that Perl provides more than one way to avoid it. You are probably already familiar with the use of my to restrict the scope of a variable to its enclosing block (between matching curly braces {}) and should be accustomed to using the directive use strict, which requires every variable to be declared (usually with my) before it is used. use strict and my are a great way to protect your program from unintentional reuse of variable names. Make a habit of using my and working under use strict.
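A minimal sketch of the protection that my and use strict give. The variable names and values are chosen for illustration only.

```perl
use strict;
use warnings;

my $dna = 'AAAAAAAAAA';          # visible in the rest of this file

{
    my $dna = 'CCCCCCCCCC';      # a separate variable, visible only in this block
    print "Inside the block:  $dna\n";
}

print "Outside the block: $dna\n";

# Under use strict, an undeclared variable is a compile-time error.
# Uncommenting the next line stops the program before it runs:
# $rna = 'UUUUUUUUUU';
```

The inner $dna disappears at the closing brace, so the outer variable still holds the poly-A fragment.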
Packages
Packages are a different way to protect a program's variables from interacting unintentionally. In Perl, you can easily assign separate namespaces to entire sections of your code, which helps prevent namespace collisions and lets you create modules.
Packages are very easy to use. A one-line package declaration puts a new namespace in effect. Here's a simple example:
$dna = 'AAAAAAAAAA';
package Mouse;
$dna = 'CCCCCCCCCC';
package Celegans;
$dna = 'GGGGGGGGGG';
In this snippet, there are three variables, each with the same name, $dna. However, they are in three different packages, so they appear in three different symbol tables and are managed separately by the running Perl program.
The first line of the code is an assignment of a poly-A DNA fragment to a variable $dna. Because no package is explicitly named, this $dna variable appears in the default namespace main.
The second line of code introduces a new namespace for variable and subroutine definitions by declaring package Mouse;. At this point, the main namespace is no longer active, and the Mouse namespace is brought into play. Note that the name of the namespace is capitalized; it's a well-established convention you should follow. The only noncapitalized namespace you should use is the default main.
Now that the Mouse namespace is in effect, the third line of code, which declares a variable, $dna, is actually declaring a separate variable unrelated to the first. It contains a poly-C fragment of DNA.
Finally, the last two lines of code declare a new package called Celegans and a new variable, also called $dna, that stores a poly-G DNA fragment.
To use these three $dna variables, you need to explicitly state which packages you want the variables from, as the following code fragment demonstrates:
print "The DNA from the main package:\n\n";
print $main::dna, "\n\n";

print "The DNA from the Mouse package:\n\n";
print $Mouse::dna, "\n\n";

print "The DNA from the Celegans package:\n\n";
print $Celegans::dna, "\n\n";
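The snippets above leave out use strict. As a sketch of how the same three variables might be written when strict is in effect, the version below declares each package variable with our and wraps each package in a block so that the main namespace comes back into effect afterward. The block layout is an assumption for illustration, not the book's listing.

```perl
use strict;
use warnings;

our $dna = 'AAAAAAAAAA';         # $main::dna

{
    package Mouse;
    our $dna = 'CCCCCCCCCC';     # $Mouse::dna
}

{
    package Celegans;
    our $dna = 'GGGGGGGGGG';     # $Celegans::dna
}

# Back in package main; the fully qualified names reach all three variables
print "$main::dna\n";
print "$Mouse::dna\n";
print "$Celegans::dna\n";
```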
Defining Modules
To begin, take a file of subroutine definitions and call it something like Newmodule.pm. Now, edit the file and give it a new first line:
package Newmodule;
and a new last line 1;. You've now created a Perl module.
To make a Celegans module, place subroutines in a file called Celegans.pm, and add a first line:
package Celegans;
Add a last line 1;, and you've defined a Celegans module. This last line just ensures that the library returns a true value when it's read in. It's annoying, but necessary.
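Put together, a complete module file might look like the following sketch. The subroutine dna_length is a hypothetical example added for illustration; only the package line and the final 1; come from the description above.

```perl
# Celegans.pm (dna_length is a hypothetical example subroutine)
package Celegans;

use strict;
use warnings;

# Return the number of bases in a DNA string
sub dna_length {
    my ($dna) = @_;
    return length $dna;
}

1;   # a module file must end with a true value
```

A program that loads this file can then call the subroutine by its full name, Celegans::dna_length('ACGT').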
Storing Modules
Where you store your .pm module files on your computer affects the name of the module, so let's take a moment to sort out the most important points. For all the details, consult the perlmod and the perlmodlib parts of the Perl documentation at http://www.perldoc.org. You can also type perldoc perlmod or perldoc perlmodlib at a shell prompt or in a command window.
Once you start using multiple files for your program code, which happens if you're defining and using modules, Perl needs to be able to find these various files; it provides a few different ways to do so.
The simplest method is to put all your program files, including your modules, in the same directory and run your programs from that directory. Here's how the module file Celegans.pm is loaded from another program:
use Celegans;
However, it's often not so simple. Perl uses modules extensively; many are built-in when you install Perl, and many more are available from CPAN, as you'll see later. Some modules are used frequently, some rarely; many modules call other modules, which in turn call still other modules.
To organize the many modules a Perl program might need, you should place them in certain standard directories or in your own development directories. Perl needs to know where these directories are so that when a module is called in a program, it can search the directories, find the file that contains the module, and load it in.
When Perl was installed on your computer, a list of directories in which to find modules was configured. Every time a Perl program on your computer refers to a module, Perl looks in those directories. To see those directories, you only need to run a Perl program and examine the built-in array @INC, like so:
print join("\n", @INC), "\n";
On my Linux computer, I get the following output from that statement:
/usr/local/lib/perl5/5.8.0/i686-linux
/usr/local/lib/perl5/5.8.0
/usr/local/lib/perl5/site_perl/5.8.0/i686-linux
/usr/local/lib/perl5/site_perl/5.8.0
/usr/local/lib/perl5/site_perl/5.6.1
/usr/local/lib/perl5/site_perl/5.6.0
/usr/local/lib/perl5/site_perl
.
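If your own modules live in a development directory that is not on this list, one common approach is the standard lib pragma, which places a directory at the front of @INC at compile time. The path below is a placeholder, not a real location; the PERL5LIB environment variable is another way to achieve a similar effect.

```perl
use strict;
use warnings;

# Add a private module directory to the front of @INC.
# '/home/you/perllib' is a placeholder path used for illustration.
use lib '/home/you/perllib';

# The new directory is now the first place Perl looks for modules
print join("\n", @INC), "\n";
```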
Writing Your First Perl Module
Now that you've been introduced to the basic ideas of modules, it's time to actually examine a working example of a module.
In this section, we'll write a module called Geneticcode.pm, which implements the genetic code that maps DNA codons to amino acids and then translates a string of DNA sequence data to a protein fragment.
Let's start by creating a file called Geneticcode.pm and using it to define the mapping of codons to amino acids in a hash variable called %genetic_code. We'll also discuss a subroutine called codon2aa that uses the hash to translate its codon arguments into amino acid return values.
Here are the contents of the first module file Geneticcode.pm:
package Geneticcode;

use strict;
use warnings;

my(%genetic_code) = (
   
    'TCA' => 'S',    # Serine
    'TCC' => 'S',    # Serine
    'TCG' => 'S',    # Serine
    'TCT' => 'S',    # Serine
    'TTC' => 'F',    # Phenylalanine
    'TTT' => 'F',    # Phenylalanine
    'TTA' => 'L',    # Leucine
    'TTG' => 'L',    # Leucine
    'TAC' => 'Y',    # Tyrosine
    'TAT' => 'Y',    # Tyrosine
    'TAA' => '_',    # Stop
    'TAG' => '_',    # Stop
    'TGC' => 'C',    # Cysteine
    'TGT' => 'C',    # Cysteine
    'TGA' => '_',    # Stop
    'TGG' => 'W',    # Tryptophan
    'CTA' => 'L',    # Leucine
    'CTC' => 'L',    # Leucine
    'CTG' => 'L',    # Leucine
    'CTT' => 'L',    # Leucine
    'CCA' => 'P',    # Proline
    'CCC' => 'P',    # Proline
    'CCG' => 'P',    # Proline
    'CCT' => 'P',    # Proline
    'CAC' => 'H',    # Histidine
    'CAT' => 'H',    # Histidine
    'CAA' => 'Q',    # Glutamine
    'CAG' => 'Q',    # Glutamine
    'CGA' => 'R',    # Arginine
    'CGC' => 'R',    # Arginine
    'CGG' => 'R',    # Arginine
    'CGT' => 'R',    # Arginine
    'ATA' => 'I',    # Isoleucine
    'ATC' => 'I',    # Isoleucine
    'ATT' => 'I',    # Isoleucine
    'ATG' => 'M',    # Methionine
    'ACA' => 'T',    # Threonine
    'ACC' => 'T',    # Threonine
    'ACG' => 'T',    # Threonine
    'ACT' => 'T',    # Threonine
    'AAC' => 'N',    # Asparagine
    'AAT' => 'N',    # Asparagine
    'AAA' => 'K',    # Lysine
    'AAG' => 'K',    # Lysine
    'AGC' => 'S',    # Serine
    'AGT' => 'S',    # Serine
    'AGA' => 'R',    # Arginine
    'AGG' => 'R',    # Arginine
    'GTA' => 'V',    # Valine
    'GTC' => 'V',    # Valine
    'GTG' => 'V',    # Valine
    'GTT' => 'V',    # Valine
    'GCA' => 'A',    # Alanine
    'GCC' => 'A',    # Alanine
    'GCG' => 'A',    # Alanine
    'GCT' => 'A',    # Alanine
    'GAC' => 'D',    # Aspartic Acid
    'GAT' => 'D',    # Aspartic Acid
    'GAA' => 'E',    # Glutamic Acid
    'GAG' => 'E',    # Glutamic Acid
    'GGA' => 'G',    # Glycine
    'GGC' => 'G',    # Glycine
    'GGG' => 'G',    # Glycine
    'GGT' => 'G',    # Glycine
);


#
# codon2aa
#
# A subroutine to translate a DNA 3-character codon to an amino acid
#   Version 3, using hash lookup

sub codon2aa {
        my($codon) = @_;

        $codon = uc $codon;
 
        if(exists $genetic_code{$codon}) {
                return $genetic_code{$codon};
        }else{
                die "Bad codon '$codon'!!\n";
        }
}

1;
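The following sketch shows how a calling program might translate a DNA string with such a module. To stay runnable as a single file, it does not load Geneticcode.pm; instead it inlines a stand-in package, MiniCode, with a four-codon table. The package name, the shortened table, and the sample sequence are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Stand-in for Geneticcode.pm so this sketch runs as one file.
# MiniCode and its four-codon table are hypothetical; the real
# module holds all 64 codons.
{
    package MiniCode;

    my %genetic_code = (
        'ATG' => 'M',    # Methionine
        'GCA' => 'A',    # Alanine
        'TGG' => 'W',    # Tryptophan
        'TAA' => '_',    # Stop
    );

    sub codon2aa {
        my ($codon) = @_;
        $codon = uc $codon;
        exists $genetic_code{$codon} or die "Bad codon '$codon'!!\n";
        return $genetic_code{$codon};
    }
}

my $dna     = 'atggcatggtaa';
my $protein = '';

# Translate each three-base codon and append the amino acid
for (my $i = 0; $i < length($dna) - 2; $i += 3) {
    $protein .= MiniCode::codon2aa(substr($dna, $i, 3));
}

print "$protein\n";
```

Because %genetic_code is declared with my inside the package block, the calling code cannot reach the hash directly; it can only go through codon2aa.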
Using Modules
So far, the benefit of modules may seem questionable. You may be wondering what the advantage is over simple libraries (without package declarations), since the main result seems to be the necessity to refer to subroutines in the modules with longer names!
You can avoid the lengthy package-qualified names and still use the short ones: place a call to the special Exporter module in the module code and modify the use MODULE declaration in the calling code.
Going back to the first example Geneticcode.pm module, recall it began with this line:
package Geneticcode;
and included the definition for the hash genetic_code and the subroutine codon2aa.
If you add these lines to the beginning of the file, you can export the symbol names of variables or subroutines in the module into the namespace of the calling program. You can then use the convenient short names for things (e.g., codon2aa instead of Geneticcode::codon2aa). Here's a short example of how it works (try typing perldoc Exporter to see the whole story):
package Geneticcode;

require Exporter;

# "our" declares these package variables so the code compiles under use strict
our @ISA = qw(Exporter);

our @EXPORT_OK = qw(...);         # symbols to export on request
Here's how to export the name codon2aa from the module only when explicitly requested:
our @EXPORT_OK = qw(codon2aa);    # symbols to export on request
The calling program then has to explicitly request the codon2aa symbol like so:
use Geneticcode qw(codon2aa);
If you use this approach, the calling program can just say:
codon2aa($codon);
instead of:
Geneticcode::codon2aa($codon);
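Here is a single-file sketch of the export-on-request mechanism. It uses a hypothetical stand-in package, MiniCode, with a two-codon table instead of the real module file. Because there is no MiniCode.pm to load, the sketch wraps the package in a BEGIN block and calls import by hand; with a real module file, the single line use MiniCode qw(codon2aa); would do the same job.

```perl
use strict;
use warnings;

# Stand-in for a module file; MiniCode is a hypothetical name.
# The BEGIN block sets the package up at compile time, as would
# happen if it were loaded from a .pm file with "use".
BEGIN {
    package MiniCode;

    require Exporter;
    our @ISA       = qw(Exporter);
    our @EXPORT_OK = qw(codon2aa);    # exported only on request

    my %genetic_code = ('ATG' => 'M', 'TAA' => '_');

    sub codon2aa {
        my ($codon) = @_;
        $codon = uc $codon;
        exists $genetic_code{$codon} or die "Bad codon '$codon'!!\n";
        return $genetic_code{$codon};
    }
}

# For a file-based module this would be: use MiniCode qw(codon2aa);
BEGIN { MiniCode->import('codon2aa') }

print codon2aa('atg'), "\n";              # short name, now in package main
print MiniCode::codon2aa('taa'), "\n";    # the long name still works
```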
The Exporter module that's included in the standard Perl distribution has several other optional behaviors, but the way just shown is the safest and most useful. As you'll see, the object-oriented programming style of using modules doesn't use the
CPAN Modules
The Comprehensive Perl Archive Network (CPAN, http://www.cpan.org) is an impressively large collection of Perl code (mostly Perl modules). CPAN is easily accessible and searchable on the Web, and you can use its modules for a variety of programming tasks.
By now you should have the basic idea of how modules are defined and used, so let's take some time to explore CPAN to see what goodies are available.
There are two important points about CPAN. First, a large number of the things you might want your programs to do have already been programmed and are easily obtained in downloadable modules. You just have to go find them at CPAN, install them on your computer, and call them from your program. We'll take a look at an example of exactly that in this section.
Second, all code on CPAN is free of charge and available for use by a very unrestrictive copyright declaration. Sound good? Keep reading.
CPAN includes convenient ways to search for useful modules, and there's a CPAN.pm module bundled with Perl that makes downloading and installing modules quite easy (when things work well, which they usually do). If you can't find CPAN.pm, you should consider updating your current version of Perl.
You can find more information by typing the following at the command line:
perldoc CPAN
You can also check the Frequently Asked Questions (FAQ) available at the CPAN web site.
The CPAN web site offers several "views" of the CPAN collection of modules and several alternate ways of searching (by module name, category, full text search of the module documentation, etc.). Here is the top-level organization of the modules by overall category:
Development Support
Operating System Interfaces
Networking Devices IPC
Data Type Utilities
Database Interfaces
User Interfaces
Language Interfaces
File Names Systems Locking
String Lang Text Proc
Opt Arg Param Proc
Internationalization Locale
Security and Encryption
World Wide Web HTML HTTP CGI
Server and Daemon Utilities
Archiving and Compression
Images Pixmaps Bitmaps
Mail and Usenet News
Control Flow Utilities
File Handle Input Output
Microsoft Windows Modules
Miscellaneous Modules
Commercial Software Interfaces
Not In Modulelist
Exercises
Exercise 1.1
What are the problems that might arise when dividing program code into separate module files?
Exercise 1.2
What are the differences between libraries, modules, packages, and namespaces?
Exercise 1.3
Write a module that finds modules on your computer.
Exercise 1.4
Where do the standard Perl distribution modules live on your computer?
Exercise 1.5
Research how Perl manages its namespaces.
Exercise 1.6
When might it be necessary to export names from a module? When might it be useful? When might it be convenient? When might it be a very bad idea?
Exercise 1.7
The program testGeneticcode contains the following loop:
# Translate each three-base codon to an amino acid, and append to a protein 
for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
        $protein .= Geneticcode::codon2aa( substr($dna,$i,3) );
}
Here's another way to accomplish that loop:
# Translate each three-base codon to an amino acid, and append to a protein 
my $i=0;
while (my $codon = substr($dna, $i += 3, 3) ) {
        $protein .= Geneticcode::codon2aa( $codon );
}
Compare the two methods. Which is easier to understand? Which is easier to maintain? Which is faster? Why?
Exercise 1.8
Chapter 2: Data Structures and String Algorithms
So far in this book, I've used the standard Perl data structures of scalars, arrays, and hashes. However, it is often necessary to handle data with a more complex structure than what those basics allow. For instance, it is frequently useful to have a two-dimensional array.
In this chapter, you'll learn how to define and use references and complex data structures. After you learn the fundamentals, you'll apply the new techniques to implement a biologically important algorithm. These techniques are also fundamental to the implementation of object-oriented programming, as you'll see in Chapter 3.
The algorithm we'll study is called approximate string matching. It lets you find the closest match for a peptide fragment in a protein, for instance. It uses an algorithmic technique called dynamic programming, an essential tool for many similar biological tasks, such as aligning biological sequences. In this chapter, you'll see how Perl references can be used to write programs for data problems with more complex relationships. References are also used for the objects of object-oriented programming.
Basic Perl Data Types
Before tackling references, let's review the basic Perl data types:
Scalar
A scalar value is a string or any one of several kinds of numbers such as integers, floating-point (decimal) numbers, or numbers in scientific notation such as 2.3E23. A scalar variable begins with the dollar sign $, as in $dna.
Array
An array is an ordered collection of scalar values. An array variable begins with an at sign @, as in @peptides. An array can be initialized by a list such as @peptides = ('zeroth', 'first', 'second'). Individual scalar elements of an array are referred to by first preceding the array name with a dollar sign (an individual element of an array is a scalar value) and then following the array name with the position of the desired element in square brackets. Thus the first element of the @peptides array is referenced by $peptides[0] and has the value 'zeroth'. (Note that array elements are given the positions 0, 1, 2, ..., n-1, where n is the number of elements in the array.)
Recall that printing an array within double quotes causes the elements to be separated by spaces; without the double quotes, the elements are printed one after the other without separations. This snippet:
@pentamers = ('cggca', 'tgatc', 'ttggc');

print "@pentamers", "\n";
print @pentamers, "\n";
produces the output:
cggca tgatc ttggc
cggcatgatcttggc
Hash
A hash is an unordered collection of key-value pairs of scalar values. Each scalar key is associated with a scalar value. A hash variable begins with the percent sign
References
Many computer languages provide variables that allow you to refer to, or point at, other values. So, instead of a variable containing data such as a string or number of interest, the variable contains the location of the data; it tells you where to go to get the value you want. In Perl, the use of a scalar variable to refer to another value is called a reference, and the value being pointed at is called a referent.
References allow you to do many useful things in Perl; you can define multidimensional arrays and other more complex data structures and avoid copying large amounts of data (for instance, when passing arguments into subroutines). Using references can make your programs faster, more efficient, and shorter. References have a number of uses, as you'll see in the next sections.
Here's an example of a reference:
$peptide = 'EIQADEVRL';

$peptideref = \$peptide;

print "Here is what's in the reference:\n";
print $peptideref, "\n";

print "Here is what the reference is pointing to:\n";
print ${$peptideref}, "\n";
print $$peptideref, "\n";
This Perl code produces the following output:
Here is what's in the reference:
SCALAR(0x80fe4ac)
Here is what the reference is pointing to:
EIQADEVRL
EIQADEVRL
What's going on here?
First, a string value of EIQADEVRL is assigned to the scalar variable $peptide. Next, a backslash operator is used before the $peptide variable to return a reference to the variable. This reference is saved in the scalar variable $peptideref.
The next lines of code show what this example really does. When you print out the (actual) value of the reference variable $peptideref, you get the value:
SCALAR(0x80fe4ac)
This says that the reference variable $peptideref is pointing to a scalar value (which is the value of the scalar variable $peptide). It also gives a hexadecimal number that specifies where in the computer memory the value for that variable resides.
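References work the same way for arrays. The sketch below passes a reference to an array into a subroutine, so the subroutine receives a single scalar rather than a flattened list of every element. The subroutine name total_length and the two extra peptide strings are assumptions made for illustration.

```perl
use strict;
use warnings;

# Only the first peptide comes from the example above;
# the other two strings are made up for illustration.
my @peptides = ('EIQADEVRL', 'MKVLAA', 'GGH');

# Add up the lengths of all peptides in the referenced array
# (total_length is a hypothetical name)
sub total_length {
    my ($listref) = @_;
    my $total = 0;
    $total += length $_ for @{$listref};   # @{...} dereferences the array
    return $total;
}

my $ref = \@peptides;                      # reference to the whole array

print "Number of peptides: ", scalar @{$ref}, "\n";
print "First peptide:      ", $ref->[0], "\n";
print "Total residues:     ", total_length($ref), "\n";
```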
Matrices
Perl matrices are built from simpler data structures using references. Recall that a matrix is a set of values that can be uniquely referenced by indexes. If only one index is required, the matrix is one-dimensional (this is exactly how an array works in Perl). If n indexes are required, the matrix is n-dimensional.
A two-dimensional matrix is one of the simplest complex data structures. It can be conceptualized as a table of rows and columns, in which each element of the table is uniquely identified by its particular row and column.
There are several ways to build matrices in Perl. We'll look at some of the most useful.
Because there is no built-in matrix data structure, you have to build a matrix from other data structures. The most straightforward way to do this is with an array of arrays:
@probes = (
    [1, 3, 2, 9],
    [2, 0, 8, 1],
    [5, 4, 6, 7],
    [1, 9, 2, 8]
);

print "The probe at row 1, column 2 has value ", $probes[1][2], "\n";
This prints out:
The probe at row 1, column 2 has value 8
Recall that in Perl the first element of an array is indexed 0; so row 1 in this program is actually the second row, and column 2 is actually the third column. Sometimes you may want to refer to the 0th row as row 1; you have to adjust your code and your interactions with the user accordingly.
This matrix is implemented as an array (in parentheses), each element of which is a reference to an anonymous array [in square brackets], which itself is a list of integers.
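A short sketch that makes the row references visible, using the same @probes data. The variable $row is introduced here for illustration.

```perl
use strict;
use warnings;

my @probes = (
    [1, 3, 2, 9],
    [2, 0, 8, 1],
    [5, 4, 6, 7],
    [1, 9, 2, 8],
);

my $row = $probes[1];               # a reference to the second row

print ref($row), "\n";              # prints ARRAY
print "@{$row}\n";                  # prints 2 0 8 1
print $probes[1]->[2], "\n";        # prints 8; same element as $probes[1][2]
print scalar(@probes), " rows, ", scalar(@{ $probes[0] }), " columns\n";
```

The arrow between adjacent subscripts is optional, which is why $probes[1][2] and $probes[1]->[2] name the same element.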
Another good way to build a matrix is to declare a reference to an anonymous array. In the following example, I declare an empty anonymous array and then populate it as desired. This is, in effect, an anonymous array of anonymous arrays:
# Declare reference to (empty) anonymous array
$array = [  ];

# Initialize the array
for($i=0; $i < 4 ; ++$i) {
  for($j=0; $j < 4 ; ++$j) {
      $array->[$i][$j] = $i * $j;
  }
}

# Reset one of the elements of the array
$array->[3][2] = 99;

# Print the array
for($i=0; $i < 4 ; ++$i) {
  for($j=0; $j < 4 ; ++$j) {
      printf("%3d ", $array->[$i][$j]);
  }
  print "\n";
}
Complex Data Structures
Different algorithms require different data structures. Using references in Perl, it is possible to build very complex data structures.
This section gives a short introduction to some of the possibilities, such as a hash with array values and a two-dimensional array of hashes. See the recommended reading in Section 2.9 of this chapter for books and sections of the Perl manual that are very helpful.
Perl uses the basic data types of scalar, array, and hash, plus the ability to declare scalar references to those basic data types, to build more complex structures. For instance, an array must have scalar elements, but those scalar elements can be references to hashes, in which case you have effectively created an array of hashes.
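As a minimal sketch of an array of hashes, each element below is a reference to an anonymous hash. The field names and the records themselves are placeholder data chosen for illustration.

```perl
use strict;
use warnings;

# Each array element is a reference to an anonymous hash.
# The records are placeholder data for illustration.
my @genes = (
    { name => 'stromelysin', organism => 'Homo sapiens' },
    { name => 'obesity',     organism => 'Mus musculus' },
);

# Add one more record to the end of the array
push @genes, { name => 'unc-22', organism => 'C.elegans' };

for my $gene (@genes) {
    print "$gene->{name} ($gene->{organism})\n";
}

print "Second gene: $genes[1]{name}\n";
```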
A common example of a complex data structure is a hash with array values. Using such a data structure, you can associate a list of items with each keyword. The following code shows an example of how to build and manage such a data structure. Assume you have a set of human genes, and for each human gene, you want to manage an array of organisms that are known to have closely related genes. Of course, each such array of related organisms can be a different length:
use Data::Dumper;

%relatedgenes = (  );

$relatedgenes{'stromelysin'} = [
    'C.elegans',
    'Arabidopsis thaliana'
];
$relatedgenes{'obesity'} = [
    'Drosophila',
    'Mus musculus'
];

# Now add a new related organism to the entry for 'stromelysin'

push( @{$relatedgenes{'stromelysin'}}, 'Canis' );

print Dumper(\%relatedgenes);
This program prints out the following (the very useful Data::Dumper module is described in more detail later; try typing perldoc Data::Dumper for the details of this useful way to print out complex data structures):
$VAR1 = {
        'stromelysin' => [
                           'C.elegans',
                           'Arabidopsis thaliana',
                           'Canis'
                         ],
        'obesity' => [
                      'Drosophila',
                      'Mus musculus'
                     ]
};
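Once the structure is built, you will often want to walk through it rather than dump it. This sketch starts from the finished %relatedgenes data shown above and prints one line per gene; the output format is an assumption made for illustration.

```perl
use strict;
use warnings;

my %relatedgenes = (
    'stromelysin' => [ 'C.elegans', 'Arabidopsis thaliana', 'Canis' ],
    'obesity'     => [ 'Drosophila', 'Mus musculus' ],
);

# Sort the keys for a repeatable order, because hashes are unordered
for my $gene (sort keys %relatedgenes) {
    my @organisms = @{ $relatedgenes{$gene} };   # dereference the array value
    print "$gene (", scalar @organisms, "): ", join(', ', @organisms), "\n";
}
```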
Printing Complex Data Structures
Sometimes you need to look inside your complex data structures to see what the settings are. One of the most useful ways to examine a data structure is by means of the Data::Dumper module. This module comes standard with all recent versions of Perl.
Here is the summary and part of the synopsis and description as output from the perldoc Data::Dumper command:
NAME
       Data::Dumper - stringified perl data structures, suitable
       for both printing and "eval"

SYNOPSIS
           use Data::Dumper;

           # simple procedural interface
           print Dumper($foo, $bar);
(...)

DESCRIPTION
       Given a list of scalars or reference variables, writes out
       their contents in perl syntax. The references can also be
       objects.  The contents of each variable is output in a
       single Perl statement.  Handles self-referential structures correctly.

       The return value can be "eval"ed to get back an identical
       copy of the original reference structure.
(...)
The following program, which prints a two-dimensional array both by hand and with Data::Dumper, illustrates the module's use:
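use strict;
use warnings;
use Data::Dumper;

# A reference to an (initially empty) anonymous array
my $array = [  ];

# Initialize the array so that element [i][j] holds the product i * j
for (my $i = 0; $i < 4; ++$i) {
  for (my $j = 0; $j < 4; ++$j) {
      $array->[$i][$j] = $i * $j;
  }
}

# Print the array "by hand", one row per line
for (my $i = 0; $i < 4; ++$i) {
  for (my $j = 0; $j < 4; ++$j) {
      printf("%3d ", $array->[$i][$j]);
  }
  print "\n";
}

# Print the array using Data::Dumper
print Dumper($array);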
This produces the output:
  0   0   0   0 
  0   1   2   3 
  0   2   4   6 
  0   3   6   9 
$VAR1 = [
          [
            0,
            0,
            0,
            0
          ],
          [
            0,
            1,
            2,
            3
          ],
          [
            0,
            2,
            4,
            6
          ],
          [
            0,
            3,
            6,
            9
          ]
        ];
As the hand-printed version shows, you can make a nicer, more compact display when you know exactly what the data is and in what form you want to write it out.
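The perldoc description above also mentions that the output can be "eval"ed to get back a copy of the structure. Here is a minimal sketch of that round trip. It relies on two configuration variables documented in perldoc Data::Dumper, $Data::Dumper::Terse and $Data::Dumper::Indent; the small 2-by-2 array is just an illustrative value:

```perl
use strict;
use warnings;
use Data::Dumper;

# Settings documented in perldoc Data::Dumper
$Data::Dumper::Terse  = 1;    # omit the leading "$VAR1 = "
$Data::Dumper::Indent = 1;    # use a more compact indentation

my $array = [ [ 0, 0 ], [ 0, 1 ] ];

# The structure as a string of Perl code
my $text = Dumper($array);
print $text;

# Evaluating the string builds a new, independent copy
my $copy = eval $text;
die "eval failed: $@" if $@;

# Changing the copy leaves the original alone
$copy->[1][1] = 99;
print "original: $array->[1][1], copy: $copy->[1][1]\n";
```

Because eval runs whatever Perl code the string contains, this technique is best reserved for strings your own program produced, not for data from an untrusted source.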
Data Structures in Action
The previous sections introduced a fair amount of new Perl syntax and capabilities. Now, let's see some of these new capabilities in action.
It is frequently important in biology to find the best possible match for a short sequence in a longer sequence; for example, between an oligonucleotide and the sequence of DNA that has been cloned in a YAC or BAC. This match need not always be perfect; frequently, what is important is to find the closest match available. This problem is known in computer science as approximate string matching, and dynamic programming is a popular technique used to compute the solution.
The problem of string matching is to find a pattern, such as a nucleotide or peptide fragment, in a longer text such as a chromosome or protein.
The problem of approximate string matching is to find a pattern in a text in which the match might not be perfect. Perhaps a few of the characters are different or missing; the problem is to find the best match possible.
Biologically, approximate matches are of commanding importance. Evolutionary changes between species can make genes with essentially the same function collect a fair number of individual base changes; they may even have acquired differences in exon structure. Even within a species, individual base changes among groups in the population (single nucleotide polymorphisms) are important causes of disease and important clues in the discovery of disease-causing genes.
Mutations tend to accumulate over time in noncoding regions of DNA; mutations in coding regions tend to avoid altering critical regions essential for the functioning of the gene (where mutations may be fatal to the organism). Even a noncoding region may be critical to the regulation of a gene and thus tend to resist mutations. As a result, studying where mutations are not accumulating is often an important clue to discerning the function and control of a gene and its associated protein.
Dynamic Programming
Dynamic programming computes the values for small subproblems and stores those values in a matrix. The stored values are then used to solve larger subproblems (without incurring the cost of recomputing the smaller subproblems) and so on until the solution to the overall problem is found. The term "dynamic programming" is a bit of a misnomer since it doesn't involve changing values over time as the word "dynamic" suggests.
This approach relies on having a data structure available to store the intermediate values as the algorithm proceeds. The data structure may require a fair amount of computer memory, but the overall speed of the algorithm often makes the memory cost worthwhile. In this section, we'll use a Perl multidimensional array, namely a simple two-dimensional matrix, to solve an approximate string matching problem.
Our algorithm will find a (shorter) pattern in a (longer) text. We'll start with a two-dimensional array, or matrix. The rows of the matrix will be associated with the (shorter) pattern, and the columns of the matrix will be associated with the (longer) text. The zeroth row and the zeroth column will be initialized to the appropriate starting values. We'll then calculate each value in the matrix by examining adjacent, already calculated values in conjunction with the characters of the pattern and the text. After the entire matrix has been filled in, we'll have the answer to our problem. That is, we'll find the position(s) in the text that most closely match the pattern, and we'll do so by simply examining the values in the last row of the matrix, which corresponds to the end of the pattern.
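The matrix method just described can be sketched as follows. This is one possible implementation under stated assumptions, not necessarily the program developed in the full chapter: every mismatched or unmatched character is charged a cost of 1, the row index follows the pattern and the column index follows the text (so the finished scores sit in the last row), and the subroutine name approximate_match and the sample sequences are hypothetical. The min function comes from the standard List::Util module:

```perl
use strict;
use warnings;
use List::Util qw(min);

# Return the best score and a reference to the list of (1-based) text
# positions at which a best match of the pattern ends.
sub approximate_match {
    my ($pattern, $text) = @_;
    my $plen = length($pattern);
    my $tlen = length($text);
    my @D;    # the matrix: $D[$p][$t]

    # Zeroth row: a match may begin anywhere in the text at no cost.
    $D[0][$_] = 0 for 0 .. $tlen;
    # Zeroth column: $p pattern characters matched against nothing cost $p.
    $D[$_][0] = $_ for 0 .. $plen;

    # Fill in each entry from its three already-calculated neighbors.
    for my $p (1 .. $plen) {
        for my $t (1 .. $tlen) {
            my $cost = substr($pattern, $p - 1, 1) eq substr($text, $t - 1, 1) ? 0 : 1;
            $D[$p][$t] = min(
                $D[$p - 1][$t - 1] + $cost,    # characters match or are substituted
                $D[$p - 1][$t] + 1,            # pattern character left unmatched
                $D[$p][$t - 1] + 1,            # text character left unmatched
            );
        }
    }

    # The last row holds the score of the best match ending at each position.
    my $best = $plen;
    my @ends;
    for my $t (1 .. $tlen) {
        if ($D[$plen][$t] < $best) {
            $best = $D[$plen][$t];
            @ends = ($t);
        }
        elsif ($D[$plen][$t] == $best) {
            push @ends, $t;
        }
    }
    return ($best, \@ends);
}

my ($score, $ends) = approximate_match('ACGT', 'TTACCTGG');
print "best score $score, ending at position(s) @$ends\n";
```

For this sample the closest match is ACCT, which differs from the pattern by one substitution and ends at position 6 of the text. The matrix needs (pattern length + 1) times (text length + 1) entries, which is the memory cost mentioned above.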
Approximate String Matching
You've most likely learned how to use regular expressions to find any of a set of possible patterns in a string. Approximate string matching is similar: an approximate string matching algorithm finds any of a set of possible patterns in a string. However, the two approaches are quite different in their capabilities and their ease of use. Simply stated, approximate string matching can find many close matches to a pattern that would be very tedious to specify using regular expressions.
There are several ways to measure the distance between two strings, and our algorithm will use one such measure. Some variants of this measure are considered in the exercises at the end of the chapter.
Our algorithm uses the idea of edit distance to measure the similarity between two strings. The idea is quite simple. Assume that there are three things you can do to alter a string:
Substitution
        Change any character to a different character
Deletion
        Delete any character
Insertion
        Insert a new character at any position
Now, let's say that every time you make any of these three edits, you incur an edit cost of 1. Then define the edit distance between two strings as the minimum total edit cost needed to change one string into the other.
For instance, let's say there are two strings portend and profound. You can apply the following edits to portend:
portend
        (delete o)
prtend
        (insert o)
protend
        (change t to f)
profend
        (change e to o)
profond
        (insert u)
profound
You can see that five edits were applied, so the edit distance between the two strings is at most 5. It is exactly 5 only if you can't find a quicker way to change one string into the other.
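Whether a quicker way exists can be checked mechanically. The following minimal sketch computes the edit distance between two whole strings with a dynamic programming matrix, charging a cost of 1 for each substitution, deletion, or insertion as defined above. The subroutine name edit_distance is hypothetical, and min comes from the standard List::Util module:

```perl
use strict;
use warnings;
use List::Util qw(min);

# Return the minimum number of substitutions, deletions, and insertions
# needed to change $from into $to.
sub edit_distance {
    my ($from, $to) = @_;
    my $flen = length($from);
    my $tlen = length($to);
    my @D;    # $D[$i][$j]: distance between the first $i characters of $from
              # and the first $j characters of $to

    $D[$_][0] = $_ for 0 .. $flen;    # delete all $i characters
    $D[0][$_] = $_ for 0 .. $tlen;    # insert all $j characters

    for my $i (1 .. $flen) {
        for my $j (1 .. $tlen) {
            my $cost = substr($from, $i - 1, 1) eq substr($to, $j - 1, 1) ? 0 : 1;
            $D[$i][$j] = min(
                $D[$i - 1][$j - 1] + $cost,    # keep or substitute a character
                $D[$i - 1][$j] + 1,            # delete a character of $from
                $D[$i][$j - 1] + 1,            # insert a character of $to
            );
        }
    }
    return $D[$flen][$tlen];
}

print edit_distance('portend', 'profound'), "\n";
```

For portend and profound this reports 4, not 5: inserting r after the p and then changing r to f, t to o, and e to u also produces profound, so the five-step sequence shown above is not the shortest one.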
Resources
I recommend the following O'Reilly sources for more details on data structures in Perl:
  • Programming Perl, by Larry Wall, Tom Christiansen, and Jon Orwant. This is the bible of Perl programming. Everything about data structures is explained in detail.
  • Advanced Perl Programming by Sriram Srinivasan. An excellent book that covers references and data structures.
  • Mastering Algorithms with Perl by Jon Orwant, Jarkko Hietaniemi, and John Macdonald. A marvelous book, especially if you like this chapter. Many interesting data structures and algorithms are explained and implemented in Perl.
  • The Perl Cookbook by Tom Christiansen and Nathan Torkington. As the title implies, this book is composed of fairly short recipes that accomplish particular tasks, grouped according to application area.
Here's where to go for Perl documentation:
  • The perlreftut tutorial page from the Perl documentation gives a short introduction to Perl references (type perldoc perlreftut at your command line if Perl is installed, or visit the web page http://www.perldoc.com).
  • The perlref tutorial page from the Perl documentation discusses Perl references in detail.
  • The perldata tutorial page from the Perl documentation gives an introduction to Perl data structures.
  • The perldsc tutorial page from the Perl documentation presents a "cookbook" overview of Perl data structures.
  • The perllol tutorial page from the Perl documentation gives an introduction to arrays of arrays.
Exercises
Exercise 2.1
Suggest a programming situation in which it would make sense to have several scalar references to one scalar variable that contains a peptide fragment.
Exercise 2.2
When might you want to use a reference to an (anonymous) scalar constant?
Exercise 2.3
Is $$arr[0] the same as $arr->[0]? Why or why not?
Exercise 2.4
Write a subroutine that returns a reference to a hash. Declare a reference to this subroutine and call it using the reference, then print out the hash whose reference is returned from the subroutine.
Exercise 2.5
Write a subroutine that returns a new anonymous subroutine based on its arguments, which are passed to it as references. Call the subroutine and then call the new subroutine that is returned.
Exercise 2.6
Write a subroutine to multiply two matrices.
Exercise 2.7
Develop a data structure that is a hash at the top level and can be used to record the data from microarray runs.
Exercise 2.8
Write a min subroutine that returns the minimum of two integers. Rewrite min3 using it.
Exercise 2.9
Make a subroutine that prints the distance matrix. Make it handle the display of longer numbers appropriately.
Chapter 3: Object-Oriented Programming in Perl
In Chapter 1, you saw how modules are defined and used, and in Chapter 2, how references and data structures work. Now, it's time to introduce the important concepts and techniques of object-oriented programming in Perl that are based on modules and references.
Object-oriented (OO) programming is one of the most important approaches to writing programs, and it is an approach that has been well supported by Perl for quite a while. Other OO languages of interest include Java, C++, and Smalltalk. Many Perl modules are written in an OO style, and their proper use requires some fundamental understanding of the OO approach. Luckily, the key concepts are fairly simple.
Perl easily supports both declarative and OO programming. (Perl was originally a declarative language only; the OO style was added fairly early on.) Declarative programming, as the term is used in this book, is what is often called procedural or imperative programming elsewhere. It is characterized by code that declares variables and subroutines and uses conditional tests, if-else branches, loops, and various arithmetic, logical, and string operators. It is up to you to manage the definition and use of the variables and subroutines so that they interact in appropriate ways. (You'll see shortly how object-oriented programming imposes additional constraints that help you create well-behaved programs.) Many declarative programming languages are well established, including Perl and such stalwarts as C, FORTRAN, and BASIC, to name just a few. By this point, assuming you have some experience programming in Perl, you should be fairly comfortable with the declarative style.
The first part of this chapter is an overview of OO programming and how OO Perl modules are used. If you're a beginning Perl programmer, you'll find them easy to use because they rarely require you to know how to write OO Perl code. Depending on your needs and goals, this might be all the information you'll require from this chapter.
As a more advanced programmer, you'll sometimes need to write your own OO bioinformatics software. If you're such a programmer, the second part of this chapter will be of greatest interest to you. However, because the material is developed incrementally, you will most likely want to read the chapter in order from beginning to end.
What Is Object-Oriented Programming?
Object-oriented programming is a way to organize code so it interacts in certain prescribed ways, obeying certain rules about how the data and subroutines are organized. In other words, it imposes a certain programming discipline that can lead to better and more reliable code.
The key idea of OO programming is that all data is stored and modified with special data structures called objects, and each kind of object can be accessed only by its defined subroutines called methods. The user of an OO class is typically spared the effort of directly manipulating data, and can use class methods for this instead.
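To make the idea of objects and methods concrete, here is a minimal sketch of a class and a program that uses it. The class name SequenceRecord and its methods are hypothetical, invented only for this illustration. Note also that Perl does not itself prevent a program from reaching into an object's data directly; using only the methods is a discipline the programmer follows:

```perl
use strict;
use warnings;

# A hypothetical class for illustration; all names here are assumptions.
package SequenceRecord;

# Constructor: store the data in a hash and bless it into the class.
sub new {
    my ($class, %args) = @_;
    my $self = {
        name => $args{name},
        dna  => uc( defined $args{dna} ? $args{dna} : '' ),
    };
    return bless $self, $class;
}

# Methods: the defined ways to read and modify the object's data.
sub name       { my ($self) = @_; return $self->{name}; }
sub dna        { my ($self) = @_; return $self->{dna}; }
sub dna_length { my ($self) = @_; return length( $self->{dna} ); }

sub append {
    my ($self, $more) = @_;
    $self->{dna} .= uc($more);    # keep the stored sequence in uppercase
    return $self;
}

package main;

# The user of the class works only through its methods.
my $rec = SequenceRecord->new( name => 'fragment1', dna => 'acgt' );
$rec->append('ttaa');
print $rec->name, ': ', $rec->dna, ' (', $rec->dna_length, " bases)\n";
```

Because all changes go through append, the class can guarantee that the stored sequence is always uppercase, which is the kind of reliability the OO discipline is meant to provide.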
The promise of this OO structure of program code is that it makes the resulting programs cleanly designed, more reliable, easier to reuse in other programs, and easier to modify and improve. In essence, the approach imposes certain restrictions on what a programmer can do with the data and subroutines at hand.
Proponents of the OO approach cite the benefits this extra discipline provides. It is certainly true that you can follow good programming practices without using an OO approach. However, OO does provide a well-defined framework for encouraging discipline and good programming practices. In a very flexible language such as Perl, good practices can sometimes be easier to enforce in the framework of OO. We'll see how this comes about in the examples that follow.
In an applied engineering discipline such as programming, it is often necessary to weigh the costs and benefits of a given system against the alternatives. The decision to use OO programming, declarative programming, or some other paradigm is often the subject of religious debates, with some enthusiasts promoting their favorite approach against all comers. This is especially relevant to the Perl programmer, because Perl allows you to write in either the declarative or the OO style. You should know that OO programming isn't always the correct choice for a programming project: despite the real benefits it can confer upon a software development project, it also has costs, and the two should be weighed against each other.
Using Perl Classes (Without Writing Them)
Before you actually start writing classes, it's helpful to know how to use them. This section shows you how to use OO Perl classes, even if the syntax is new to you and you've never written one yourself.
Thanks to the large and active community of Perl programmers, there are many useful Perl classes already written and freely available to use in your programs. Very often, the class you want already exists. All you need to do is obtain it and use it.
First, you need to find the appropriate module or modules (CPAN is the most common source for modules), install it, and examine the documentation to learn how to use the class. Finding and installing OO modules employs the same process covered in Chapter 1.
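As a small taste of what using an existing class looks like, the Data::Dumper module from Chapter 2 also offers an OO interface, described in perldoc Data::Dumper. The following minimal sketch uses its new, Indent, Sortkeys, and Dump methods; the sample hash is just illustrative data:

```perl
use strict;
use warnings;
use Data::Dumper;

my %relatedgenes = (
    'obesity' => [ 'Drosophila', 'Mus musculus' ],
);

# Class method: create a new Data::Dumper object.
# The first argument is a reference to a list of values to dump;
# the optional second argument is a reference to a list of names for them.
my $dumper = Data::Dumper->new( [ \%relatedgenes ], [ 'relatedgenes' ] );

# Object methods: configure the object, then ask it for its output.
$dumper->Indent(1);      # use a more compact indentation
$dumper->Sortkeys(1);    # print hash keys in sorted order
my $output = $dumper->Dump;
print $output;
```

The program never looks inside the $dumper object; it only calls the methods that the documentation describes, which is typical of how OO modules are used.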
What's different about OO modules is how they create data structures and call and pass arguments to subroutines. In short, there's some new syntax to lea