Chapter 1. Modular Programming with Perl
Perl modules are essential to any Perl programmer. They are a great way to organize code into logical collections of interacting parts. They collect useful Perl subroutines and provide them to other programs (and programmers) in an organized and convenient fashion.
This chapter begins with a discussion of the reasons for organizing Perl code into modules. Modules are comparable to subroutines: both organize Perl code in convenient, reusable “chunks.”
Later in this chapter, I’ll introduce a small module, Geneticcode.pm. This example shows how to create simple modules, and I’ll give examples of programs that use this module.
I’ll also demonstrate how to find, install, and use modules taken from the all-important CPAN collection. A familiarity with searching and using CPAN is an essential skill for Perl programmers; it will help you avoid lots of unnecessary work. With CPAN, you can easily find and use code written by excellent programmers and road-tested by the Perl community. Using proven code and writing less of your own, you’ll save time, money, and headaches.
What Is a Module?
A Perl module is a library file that uses package declarations to create its own namespace. Perl modules provide an extra level of protection from name collisions beyond that provided by my and use strict. They also serve as the basic mechanism for defining object-oriented classes.
Why Perl Modules?
Building a medium- to large-sized program usually requires you to divide tasks into several smaller, more manageable, and more interactive pieces. (A rule of thumb is that each “piece” should be about one or two printed pages in length, but this is just a general guideline.) An analogy can be made to building a microarray machine, which requires that you construct separate interacting pieces such as housing, temperature sensors and controls, robot arms to position the pipettes, hydraulic injection devices, and computer guidance for all these systems.
Subroutines and Software Engineering
Subroutines divide a large programming job into more manageable pieces. Modern programming languages all provide subroutines, which go by other names, such as functions, procedures, or methods, in other programming languages.
A subroutine lets you write a piece of code that performs some part of a desired computation (e.g., determining the length of a DNA sequence). This code is written once and can then be called as often as needed throughout the main program. Using subroutines shortens the time it takes to write the main program, makes it more reliable by avoiding duplicated sections (which can get out of sync and make the program longer), and makes the entire program easier to test. A useful subroutine can be used by other programs as well, saving you development time in the future. As long as the inputs and outputs to the subroutine remain the same, its internal workings can be altered and improved without worrying about how the changes will affect the rest of the program. This is known as encapsulation.
The benefits of subroutines that I’ve just outlined also apply to other approaches in software engineering. Perl modules are a technique within a larger umbrella of techniques known as software encapsulation and reuse. Software encapsulation and reuse are fundamental to object-oriented programming.
A related design principle is abstraction, which involves writing code that is usable in many different situations. Let’s say you write a subroutine that adds the fragment TTTTT to the end of a string of DNA. If you then want to add the fragment AAAAA to the end of a string of DNA, you have to write another subroutine. To avoid writing two subroutines, you can write one that’s more abstract and adds to the end of a string of DNA whatever fragment you give it as an argument. Using the principle of abstraction, you’ve saved yourself half the work.
Here is an example of a Perl subroutine that takes two strings of DNA as inputs and returns the second one appended to the end of the first:
    sub DNAappend {
        my ($dna, $tail) = @_;
        return($dna . $tail);
    }
This subroutine can be used as follows:
    my $dna = 'ACCGGAGTTGACTCTCCGAATA';
    my $polyT = 'TTTTTTTT';
    print DNAappend($dna, $polyT);
If you wish, you can also define subroutines polyT and polyA like so:
    sub polyT {
        my ($dna) = @_;
        return DNAappend($dna, 'TTTTTTTT');
    }

    sub polyA {
        my ($dna) = @_;
        return DNAappend($dna, 'AAAAAAAA');
    }
At this point, you should think about how to divide a problem into interacting parts; that is, an optimal (or at least good) way to define a set of subroutines that can cooperate to solve a particular problem.
Modules and Libraries
In my projects, I gather subroutine definitions into separate files called libraries,[1] or modules, which let me collect subroutine definitions for use in other programs. Then, instead of copying the subroutine definitions into the new program (and introducing the potential for inaccurate copies or for alternate versions proliferating), I can just insert the name of the library or module into a program, and all the subroutines are available in their original unaltered form. This is an example of software reuse in action.
To fully understand and use modules, you need to understand the simple concepts of namespaces and packages. From here on, think of a Perl module as any Perl library file that uses package declarations to create its own namespace. These simple concepts are examined in the next sections.
Namespaces
A namespace is implemented as a table containing the names of the variables and subroutines in a program. The table itself is called a symbol table and is used by the running program to keep track of variable values and subroutine definitions as the program evolves. A namespace and a symbol table are essentially the same thing. A namespace exists under the hood for many programs, especially those in which only one default namespace is used.
Large programs often accidentally use the same variable name for different variables in different parts of the program. These identically named variables may unintentionally interact with each other and cause serious, hard-to-find errors. This situation is called namespace collision. Separate namespaces are one way to avoid namespace collision.
The package declaration described in the next section is one way to assign separate namespaces to different parts of your code. It gives strong protection against accidentally using a variable name that’s used in another part of the program and having the two identically named variables interact in unwanted ways.
Namespaces Compared with Scoping: my and use strict
The unintentional interaction between variables with the same name is enough of a problem that Perl provides more than one way to avoid it. You are probably already familiar with the use of my to restrict the scope of a variable to its enclosing block (between matching curly braces {}) and should be accustomed to using the directive use strict to require the use of my for all variables.
use strict and my are a great way to protect your program from unintentional reuse of variable names. Make a habit of using my and working under use strict.
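As a small illustration of the scoping just described, the following sketch declares two variables named $dna, one at file level and one inside a block. The sequences are made up for the example. Under use strict, a misspelled and therefore undeclared name such as $dan would be reported as an error at compile time.

```perl
use strict;
use warnings;

my $dna = 'AAAAAAAAAA';        # visible from here to the end of the file

{
    # A separate variable that exists only inside this block
    my $dna = 'CCCCCCCCCC';
    print "Inside the block:  $dna\n";
}

# The block's variable is gone; this is the first $dna again
print "Outside the block: $dna\n";
```

The inner declaration hides the outer one only until the closing brace, so the two variables never interfere with each other.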
Packages
Packages are a different way to protect a program’s variables from interacting unintentionally. In Perl, you can easily assign separate namespaces to entire sections of your code, which helps prevent namespace collisions and lets you create modules.
Packages are very easy to use. A one-line package declaration puts a new namespace in effect. Here’s a simple example:
    $dna = 'AAAAAAAAAA';
    package Mouse;
    $dna = 'CCCCCCCCCC';
    package Celegans;
    $dna = 'GGGGGGGGGG';
In this snippet, there are three variables, each with the same name, $dna. However, they are in three different packages, so they appear in three different symbol tables and are managed separately by the running Perl program.
The first line of the code is an assignment of a poly-A DNA fragment to a variable $dna. Because no package is explicitly named, this $dna variable appears in the default namespace main.
The second line of code introduces a new namespace for variable and subroutine definitions by declaring package Mouse;. At this point, the main namespace is no longer active, and the Mouse namespace is brought into play. Note that the name of the namespace is capitalized; it’s a well-established convention you should follow. The only noncapitalized namespace you should use is the default main.
Now that the Mouse namespace is in effect, the third line of code, which declares a variable, $dna, is actually declaring a separate variable unrelated to the first. It contains a poly-C fragment of DNA.
Finally, the last two lines of code declare a new package called Celegans and a new variable, also called $dna, that stores a poly-G DNA fragment.
To use these three $dna variables, you need to explicitly state which packages you want the variables from, as the following code fragment demonstrates:
    print "The DNA from the main package:\n\n";
    print $main::dna, "\n\n";
    print "The DNA from the Mouse package:\n\n";
    print $Mouse::dna, "\n\n";
    print "The DNA from the Celegans package:\n\n";
    print $Celegans::dna, "\n\n";
This gives the following output:
    The DNA from the main package:

    AAAAAAAAAA

    The DNA from the Mouse package:

    CCCCCCCCCC

    The DNA from the Celegans package:

    GGGGGGGGGG
As you can see, a variable name can be tied to a particular package by putting the package name and two colons before the variable name (but after the $, @, or % that specifies the type of variable). If you don’t specify a package in this way, Perl assumes you want the current package, which may not necessarily be the main package, as the following example shows:
    #
    # Define the variables in the packages
    #
    $dna = 'AAAAAAAAAA';
    package Mouse;
    $dna = 'CCCCCCCCCC';

    #
    # Print the values of the variables
    #
    print "The DNA from the current package:\n\n";
    print $dna, "\n\n";
    print "The DNA from the Mouse package:\n\n";
    print $Mouse::dna, "\n\n";
This produces the following output:
    The DNA from the current package:

    CCCCCCCCCC

    The DNA from the Mouse package:

    CCCCCCCCCC
Both print $dna and print $Mouse::dna reference the same variable. This is because the last package declaration was package Mouse;, so the print $dna statement prints the value of the variable $dna as defined in the current package, which is Mouse.
The rule is, once a package has been declared, it becomes the current package until the next package declaration or until the end of the file. (You can also declare packages within blocks, evals, or subroutine definitions, in which case the package stays in effect until the end of the block, eval, or subroutine definition.)
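The rule for packages declared inside a block can be checked with the special token __PACKAGE__, which holds the name of the current package. The following sketch is only an illustration; it declares the package variable with our so that the code also runs under use strict.

```perl
use strict;
use warnings;

print "Before the block: ", __PACKAGE__, "\n";    # main

{
    package Mouse;                   # in effect only until the closing brace
    our $dna = 'CCCCCCCCCC';         # this is $Mouse::dna
    print "Inside the block: ", __PACKAGE__, "\n";
}

print "After the block:  ", __PACKAGE__, "\n";    # main again
print "Mouse DNA: $Mouse::dna\n";    # still reachable by its full name
```

After the closing brace the current package reverts to main, yet the variable created in the Mouse namespace remains available under its fully qualified name.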
By far the most common use of package is to call it once near the top of a file and have it stay in effect for all the code in the file. This is how modules are defined, as the next section shows.
Defining Modules
To begin, take a file of subroutine definitions and call it something like Newmodule.pm. Now, edit the file and give it a new first line:
    package Newmodule;
and a new last line:
    1;
You’ve now created a Perl module.
To make a Celegans module, place subroutines in a file called Celegans.pm, and add a first line:
    package Celegans;
Add a last line consisting of 1; and you’ve defined a Celegans module. This last line just ensures that the library returns a true value when it’s read in. It’s annoying, but necessary.
Storing Modules
Where you store your .pm module files on your computer affects the name of the module, so let’s take a moment to sort out the most important points. For all the details, consult the perlmod and the perlmodlib parts of the Perl documentation at http://www.perldoc.org. You can also type perldoc perlmod or perldoc perlmodlib at a shell prompt or in a command window.
Once you start using multiple files for your program code, which happens if you’re defining and using modules, Perl needs to be able to find these various files; it provides a few different ways to do so.
The simplest method is to put all your program files, including your modules, in the same directory and run your programs from that directory. Here’s how the module file Celegans.pm is loaded from another program:
use Celegans;
However, it’s often not so simple. Perl uses modules extensively; many are built-in when you install Perl, and many more are available from CPAN, as you’ll see later. Some modules are used frequently, some rarely; many modules call other modules, which in turn call still other modules.
To organize the many modules a Perl program might need, you should place them in certain standard directories or in your own development directories. Perl needs to know where these directories are so that when a module is called in a program, it can search the directories, find the file that contains the module, and load it in.
When Perl was installed on your computer, a list of directories in which to find modules was configured. Every time a Perl program on your computer refers to a module, Perl looks in those directories. To see those directories, you only need to run a Perl program and examine the built-in array @INC, like so:
print join("\n", @INC), "\n";
On my Linux computer, I get the following output from that statement:
    /usr/local/lib/perl5/5.8.0/i686-linux
    /usr/local/lib/perl5/5.8.0
    /usr/local/lib/perl5/site_perl/5.8.0/i686-linux
    /usr/local/lib/perl5/site_perl/5.8.0
    /usr/local/lib/perl5/site_perl/5.6.1
    /usr/local/lib/perl5/site_perl/5.6.0
    /usr/local/lib/perl5/site_perl
    .
These are all locations in which the standard Perl modules live on my Linux computer. @INC is simply an array whose entries are directories on your computer. The way it looks depends on how your computer is configured and your operating system (for instance, Unix computers handle directories a bit differently than Windows).
Note that the last line of that list of directories is a solitary period. This is shorthand for “the current directory,” that is, whatever directory you happen to be in when you run your Perl program. If this directory is on the list, and you run your program from that directory as well, Perl will find the .pm files.
When you develop Perl software that uses modules, you should put all the modules together in a certain directory. In order for Perl to find this directory, and load the modules, you need to add a line before the use MODULE directives, telling Perl to additionally search your own module directory for any modules requested in your program. For instance, if I put a module I’m developing for my program into a file named Celegans.pm, and put the Celegans.pm file into my Linux directory /home/tisdall/MasteringPerlBio/development/lib, I need to add a use lib directive to my program, like so:
    use lib "/home/tisdall/MasteringPerlBio/development/lib";
    use Celegans;
Perl then adds my development module directory to the @INC array and searches there for the Celegans.pm module file. The following code demonstrates this:
    use lib "/home/tisdall/MasteringPerlBio/development/lib";
    print join("\n", @INC), "\n";
This produces the output:
    /home/tisdall/MasteringPerlBio/development/lib
    /usr/local/lib/perl5/5.8.0/i686-linux
    /usr/local/lib/perl5/5.8.0
    /usr/local/lib/perl5/site_perl/5.8.0/i686-linux
    /usr/local/lib/perl5/site_perl/5.8.0
    /usr/local/lib/perl5/site_perl/5.6.1
    /usr/local/lib/perl5/site_perl/5.6.0
    /usr/local/lib/perl5/site_perl
    .
Thanks to the use lib directive, Perl can now find the Celegans.pm file in the @INC list of directories.
A problem with this approach to finding libraries is that the directory pathnames are hardcoded into each program. If you then want to move your own library directory somewhere else or move the programs to another computer where different pathnames are used, you need to change the pathnames in all the program files where they occur.
If, for instance, you download several programs from this book’s web site, and you don’t want to edit each one to change pathnames, you can use the PERL5LIB environment variable. To do so, put all the modules under the directory /my/perl/modules (for example). Now set the PERL5LIB variable. In sh-compatible shells such as bash, append the directory to any existing value and export the result so that Perl can see it:
    export PERL5LIB=$PERL5LIB:/my/perl/modules
In csh or tcsh, you can set it this way:
    setenv PERL5LIB /my/perl/modules
If you have “taint” security checks enabled in your version of Perl, you still have to hardcode the pathname into the program. This, of course, behaves differently on different operating systems.
You can also specify an additional directory on the command line:
perl -I/my/perl/modules myprogram.pl
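One further option, not among the methods described above, is to compute the module directory relative to the program itself with the standard FindBin module, so that no absolute pathname appears in the code. The sketch below assumes a hypothetical layout in which the modules live in a lib subdirectory next to the program; that layout is an assumption made for the example.

```perl
use strict;
use warnings;

use FindBin qw($Bin);    # $Bin holds the directory that contains the running program
use lib "$Bin/lib";      # assumed layout: modules are kept in lib/ beside the program

# The computed directory now appears at the front of the search list
print join("\n", @INC), "\n";
```

Because the path is worked out at run time, the program and its lib directory can be moved together to another location or another computer without editing the code.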
There’s one other detail about modules that’s important. You’ll sometimes see modules in Perl programs with names such as Genomes::Modelorganisms::Celegans, in which the name is two or more words separated by two colons. This is how Perl looks into subdirectories of directories named in the @INC built-in array. In the example, Perl looks for a subdirectory named Genomes in one of the @INC directories; then for a subdirectory named Modelorganisms within the Genomes subdirectory; finally, for a file named Celegans.pm within the Modelorganisms subdirectory. That is, my module is in the file:
    /home/tisdall/MasteringPerlBio/development/lib/Genomes/Modelorganisms/Celegans.pm
and it’s called in my Perl program like so:
    use lib "/home/tisdall/MasteringPerlBio/development/lib";
    use Genomes::Modelorganisms::Celegans;
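The mapping from a module name to a file path can be sketched in a few lines. The subroutine module_to_relative_path below is a hypothetical helper written only for illustration; Perl performs this conversion internally and then looks for the resulting relative path under each directory in @INC.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only
sub module_to_relative_path {
    my ($module) = @_;

    # Copy the name, then turn each pair of colons into a directory separator
    (my $path = $module) =~ s{::}{/}g;

    return "$path.pm";
}

print module_to_relative_path('Genomes::Modelorganisms::Celegans'), "\n";
```

This prints Genomes/Modelorganisms/Celegans.pm, the path that Perl appends to each @INC directory in turn until it finds the file.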
There are more details you can learn about storing and finding modules on your computer, but these are the most useful facts. See the perlmod, perlrun, and perlmodlib sections of the Perl manual for more details if and when you need them.
Writing Your First Perl Module
Now that you’ve been introduced to the basic ideas of modules, it’s time to actually examine a working example of a module.
In this section, we’ll write a module called Geneticcode.pm, which implements the genetic code that maps DNA codons to amino acids and then translates a string of DNA sequence data to a protein fragment.
An Example: Geneticcode.pm
Let’s start by creating a file called Geneticcode.pm and using it to define the mapping of codons to amino acids in a hash variable called %genetic_code. We’ll also discuss a subroutine called codon2aa that uses the hash to translate its codon arguments into amino acid return values.
Here are the contents of the first module file Geneticcode.pm:
    package Geneticcode;

    use strict;
    use warnings;

    my(%genetic_code) = (

        'TCA' => 'S',    # Serine
        'TCC' => 'S',    # Serine
        'TCG' => 'S',    # Serine
        'TCT' => 'S',    # Serine
        'TTC' => 'F',    # Phenylalanine
        'TTT' => 'F',    # Phenylalanine
        'TTA' => 'L',    # Leucine
        'TTG' => 'L',    # Leucine
        'TAC' => 'Y',    # Tyrosine
        'TAT' => 'Y',    # Tyrosine
        'TAA' => '_',    # Stop
        'TAG' => '_',    # Stop
        'TGC' => 'C',    # Cysteine
        'TGT' => 'C',    # Cysteine
        'TGA' => '_',    # Stop
        'TGG' => 'W',    # Tryptophan
        'CTA' => 'L',    # Leucine
        'CTC' => 'L',    # Leucine
        'CTG' => 'L',    # Leucine
        'CTT' => 'L',    # Leucine
        'CCA' => 'P',    # Proline
        'CCC' => 'P',    # Proline
        'CCG' => 'P',    # Proline
        'CCT' => 'P',    # Proline
        'CAC' => 'H',    # Histidine
        'CAT' => 'H',    # Histidine
        'CAA' => 'Q',    # Glutamine
        'CAG' => 'Q',    # Glutamine
        'CGA' => 'R',    # Arginine
        'CGC' => 'R',    # Arginine
        'CGG' => 'R',    # Arginine
        'CGT' => 'R',    # Arginine
        'ATA' => 'I',    # Isoleucine
        'ATC' => 'I',    # Isoleucine
        'ATT' => 'I',    # Isoleucine
        'ATG' => 'M',    # Methionine
        'ACA' => 'T',    # Threonine
        'ACC' => 'T',    # Threonine
        'ACG' => 'T',    # Threonine
        'ACT' => 'T',    # Threonine
        'AAC' => 'N',    # Asparagine
        'AAT' => 'N',    # Asparagine
        'AAA' => 'K',    # Lysine
        'AAG' => 'K',    # Lysine
        'AGC' => 'S',    # Serine
        'AGT' => 'S',    # Serine
        'AGA' => 'R',    # Arginine
        'AGG' => 'R',    # Arginine
        'GTA' => 'V',    # Valine
        'GTC' => 'V',    # Valine
        'GTG' => 'V',    # Valine
        'GTT' => 'V',    # Valine
        'GCA' => 'A',    # Alanine
        'GCC' => 'A',    # Alanine
        'GCG' => 'A',    # Alanine
        'GCT' => 'A',    # Alanine
        'GAC' => 'D',    # Aspartic Acid
        'GAT' => 'D',    # Aspartic Acid
        'GAA' => 'E',    # Glutamic Acid
        'GAG' => 'E',    # Glutamic Acid
        'GGA' => 'G',    # Glycine
        'GGC' => 'G',    # Glycine
        'GGG' => 'G',    # Glycine
        'GGT' => 'G',    # Glycine
    );

    #
    # codon2aa
    #
    # A subroutine to translate a DNA 3-character codon to an amino acid
    #   Version 3, using hash lookup

    sub codon2aa {
        my($codon) = @_;

        $codon = uc $codon;

        if(exists $genetic_code{$codon}) {
            return $genetic_code{$codon};
        }else{
            die "Bad codon '$codon'!!\n";
        }
    }

    1;
Now, let’s examine the code. First, the module declares its package with a name (Geneticcode) that is the same as the file it is in (Geneticcode.pm), but minus the file extension .pm.
The directives:
    use strict;
    use warnings;
will appear in all the code. The use strict directive enforces the use of my declarations for all variables. The use warnings directive produces useful messages about potential problems in your code. (It is possible to turn both directives off when required, for instance to avoid annoying warnings in your program output. See the perldiag, perllexwarn, and perlmodlib sections of the Perl manual.)
Finally, there is a subroutine definition for codon2aa. As an argument, this subroutine takes a codon represented as a string of three DNA bases and returns the amino acid code corresponding to the codon. It accomplishes this by a simple lookup in the hash %genetic_code and returns the result from the subroutine using the return built-in function.
The codon2aa subroutine calls die and exits the program when it encounters an undefined codon. See the exercises at the end of this chapter for a discussion of the pros and cons of this behavior.
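A calling program that prefers not to exit on a bad codon can trap the die with Perl’s built-in eval block. The sketch below is one possible approach, not part of the module itself: it repeats codon2aa so that the example stands alone, and it uses a two-entry stand-in for the full %genetic_code hash.

```perl
use strict;
use warnings;

# Two-entry stand-in for the full %genetic_code hash (for illustration only)
my %genetic_code = ( 'ATG' => 'M', 'TAA' => '_' );

sub codon2aa {
    my ($codon) = @_;
    $codon = uc $codon;
    if (exists $genetic_code{$codon}) {
        return $genetic_code{$codon};
    } else {
        die "Bad codon '$codon'!!\n";
    }
}

# Wrap each call in eval so that a bad codon does not end the program
foreach my $codon ('atg', 'NNN') {
    my $aa = eval { codon2aa($codon) };
    if (defined $aa) {
        print "$codon => $aa\n";
    } else {
        print "Could not translate $codon: $@";   # $@ holds the message passed to die
    }
}
```

With this pattern the module keeps its strict behavior, and each calling program decides for itself whether a bad codon is fatal.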
In my earlier book, I defined the hash %genetic_code within the subroutine codon2aa. That meant that every time the subroutine was called, the hash would have to be initialized, which took a bit of time. In this version, the hash only has to be initialized once, when the module is loaded, which results in a significant speedup. The definition of the hash is outside of the subroutine definition, but in the namespace of the Geneticcode package. The hash is initialized when the Geneticcode.pm module is loaded by this statement:
    use Geneticcode;
Every subsequent call to the codon2aa subroutine simply accesses the hash without having to initialize it each time.
Here’s an example program that uses the new Geneticcode module. It is saved in a file called testGeneticcode and run by typing perl testGeneticcode:
    use strict;
    use warnings;
    use lib "/home/tisdall/MasteringPerlBio/development/lib";
    use Geneticcode;

    my $dna = 'AACCTTCCTTCCGGAAGAGAG';

    # Initialize variables
    my $protein = '';

    # Translate each three-base codon to an amino acid, and append to a protein
    for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
        $protein .= Geneticcode::codon2aa( substr($dna,$i,3) );
    }

    print $protein, "\n";
Recall that the Perl built-in function substr can extract a portion of a string. In this case, substr extracts from $dna the three characters beginning at the position given in the counter variable $i; this three-character codon is then passed as the argument to the subroutine codon2aa.
This program produces the output:
NLPSGRE
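The loop condition in the test program, $i < (length($dna) - 2), is what keeps substr from returning an incomplete codon when the sequence length is not a multiple of three. The following sketch uses a made-up eight-base sequence to show the effect; the array @codons is introduced here only to collect the results.

```perl
use strict;
use warnings;

my $dna = 'AACCTTCC';    # eight bases: two complete codons plus two leftover bases

my @codons = ();

# Stop as soon as fewer than three bases remain
for (my $i = 0; $i < (length($dna) - 2); $i += 3) {
    push @codons, substr($dna, $i, 3);
}

print join(' ', @codons), "\n";
```

Only AAC and CTT are collected; the trailing CC is skipped, so codon2aa would never be handed a two-base fragment.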
Expanding Geneticcode.pm
Now, let’s expand our Geneticcode module example. This new version of the module includes a few short helper subroutines. The interest here lies in how the subroutines interact with each other in the module’s namespace, and how to access the code within the module from a Perl program that uses the module.
Modules are a great way to organize code into logical collections of interacting parts. When you create modules, you need to decide how to organize your code into the appropriate collection of modules. Here, we have some subroutines that translate codons into amino acids; others read sequence data from files and print it to the screen. This is a fairly obvious division of functionality, so let’s create two modules for this code.
We’ll expand the Geneticcode module; let’s also create a SequenceIO module. Of course, the new module will be created in a file called SequenceIO.pm, and that file will be placed in a directory that Perl can find (in this case, the same directory in which we’ve placed the Geneticcode module).
Here’s the code for Geneticcode.pm:
    package Geneticcode;

    use strict;
    use warnings;

    my(%genetic_code) = (

        'TCA' => 'S',    # Serine
        'TCC' => 'S',    # Serine
        'TCG' => 'S',    # Serine
        'TCT' => 'S',    # Serine
        'TTC' => 'F',    # Phenylalanine
        ... as before ...
        'GAG' => 'E',    # Glutamic Acid
        'GGA' => 'G',    # Glycine
        'GGC' => 'G',    # Glycine
        'GGG' => 'G',    # Glycine
        'GGT' => 'G',    # Glycine
    );

    #
    # codon2aa
    #
    # A subroutine to translate a DNA 3-character codon to an amino acid
    #   Version 3, using hash lookup

    sub codon2aa {
        my($codon) = @_;

        $codon = uc $codon;

        if(exists $genetic_code{$codon}) {
            return $genetic_code{$codon};
        }else{
            die "Bad codon '$codon'!!\n";
        }
    }

    #
    # dna2peptide
    #
    # A subroutine to translate DNA sequence into a peptide

    sub dna2peptide {
        my($dna) = @_;

        # Initialize variables
        my $protein = '';

        # Translate each three-base codon to an amino acid, and append to a protein
        for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
            $protein .= codon2aa( substr($dna,$i,3) );
        }

        return $protein;
    }

    # translate_frame
    #
    # A subroutine to translate a frame of DNA

    sub translate_frame {
        my($seq, $start, $end) = @_;

        my $protein;

        # To make the subroutine easier to use, you won't need to specify
        #  the end point; it will just go to the end of the sequence
        #  by default.
        unless($end) {
            $end = length($seq);
        }

        # Finally, calculate and return the translation
        return dna2peptide ( substr ( $seq, $start - 1, $end -$start + 1) );
    }

    1;
Now, we have in one module the code that accomplishes a translation from the genetic code. However, we also need to read sequence in from FASTA sequence files, and print out sequence (the translated protein) to the screen. Because these needs are likely to recur in many programs, it makes sense to make a separate module for just the file reading, sequence extraction, and sequence printing operations. (This may even be too much in one module; maybe there should be separate modules for each need? See the exercises at the end of the chapter.)
Here’s the code for the second module SequenceIO.pm, which handles reading from a file, extracting FASTA sequence data, and printing sequence data:
    package SequenceIO;

    use strict;
    use warnings;

    # get_file_data
    #
    # A subroutine to get data from a file given its filename

    sub get_file_data {
        my($filename) = @_;

        # Initialize variables
        my @filedata = ( );

        # Open for reading, using a lexical filehandle and three-argument open
        open(my $fh, '<', $filename)
            or die "Cannot open file '$filename': $!\n\n";

        @filedata = <$fh>;

        close $fh;

        return @filedata;
    }

    # extract_sequence_from_fasta_data
    #
    # A subroutine to extract FASTA sequence data from an array

    sub extract_sequence_from_fasta_data {
        my(@fasta_file_data) = @_;

        # Declare and initialize variables
        my $sequence = '';

        foreach my $line (@fasta_file_data) {

            # discard blank line
            if ($line =~ /^\s*$/) {
                next;

            # discard comment line
            } elsif($line =~ /^\s*#/) {
                next;

            # discard fasta header line
            } elsif($line =~ /^>/) {
                next;

            # keep line, add to sequence string
            } else {
                $sequence .= $line;
            }
        }

        # remove non-sequence data (in this case, whitespace) from $sequence string
        $sequence =~ s/\s//g;

        return $sequence;
    }

    # print_sequence
    #
    # A subroutine to format and print sequence data

    sub print_sequence {
        my($sequence, $length) = @_;

        # Print sequence in lines of $length
        for ( my $pos = 0 ; $pos < length($sequence) ; $pos += $length ) {
            print substr($sequence, $pos, $length), "\n";
        }
    }

    1;
Before we discuss the code, let’s see a small program that uses it:
    # Translate a DNA sequence into one of the six reading frames

    use strict;
    use warnings;

    use lib "/home/tisdall/MasteringPerlBio/development/lib";
    use Geneticcode;
    use SequenceIO;

    # Initialize variables
    my @file_data = ( );
    my $dna = '';
    my $revcom = '';
    my $protein = '';

    # Read in the contents of the file "sample.dna"
    @file_data = SequenceIO::get_file_data("sample.dna");

    # Extract the sequence data from the contents of the file "sample.dna"
    $dna = SequenceIO::extract_sequence_from_fasta_data(@file_data);

    # Translate the DNA to protein in one of the six reading frames
    #   and print the protein in lines 70 characters long
    print "\n -------Reading Frame 1--------\n\n";
    $protein = Geneticcode::translate_frame($dna, 1);
    SequenceIO::print_sequence($protein, 70);

    exit;
Here’s the input file:
    > sample dna (This is a typical fasta header.)
    agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
    tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
    gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
    tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
    cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
    cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
    gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat
    cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca
    ctgagaagatggccaaggccatccgggagtggtactgtcgggagtgcaga
    gagaaagaccccaagctagagattcgctatcggcacaagaagtcacggga
    gcgggatggcaatgagcgggacagcagtgagccccgggatgagggtggag
    ggcgcaagaggcctgtccctgatccagacctgcagcgccgggcagggtca
    gggacaggggttggggccatgcttgctcggggctctgcttcgccccacaa
    atcctctccgcagcccttggtggccacacccagccagcatcaccagcagc
    agcagcagcagatcaaacggtcagcccgcatgtgtggtgagtgtgaggca
    tgtcggcgcactgaggactgtggtcactgtgatttctgtcgggacatgaa
    gaagttcgggggccccaacaagatccggcagaagtgccggctgcgccagt
    gccagctgcgggcccgggaatcgtacaagtacttcccttcctcgctctca
    ccagtgacgccctcagagtccctgccaaggccccgccggccactgcccac
    ccaacagcagccacagccatcacagaagttagggcgcatccgtgaagatg
    agggggcagtggcgtcatcaacagtcaaggagcctcctgaggctacagcc
    acacctgagccactctcagatgaggaccta
Here’s the output of the program:
     -------Reading Frame 1--------

    RWRR_GVLGALGRPPTGLQRRRRMGPAQ_EYAAWEA_LEAEVVVGAFATAWDAAEWSVQVRGSLAGVVRE
    CAGSGDMEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCDNCNEWFHGDCIRITEKMAKA
    IREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEPRDEGGGRKRPVPDPDLQRRAGSGTGVGAMLAR
    GSASPHKSSPQPLVATPSQHHQQQQQQIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCR
    LRQCQLRARESYKYFPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATA
    TPEPLSDEDL
A few comments are in order. First, the subroutines for translating codons are in the Geneticcode module. They include the hash %genetic_code and the subroutines codon2aa, dna2peptide, and translate_frame, which are involved with translating DNA data to peptides. The subroutines for reading sequence data in from files, and for formatting and printing it to the screen, are in the SequenceIO module. They are the subroutines get_file_data, extract_sequence_from_fasta_data, and print_sequence.
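The substr arithmetic inside translate_frame is easy to check on a short sequence. The sketch below applies the same expression, substr($seq, $start - 1, $end - $start + 1), to a made-up twelve-base sequence for the three forward frames; the hash %frame_of is introduced here only to hold the results, and no translation is performed.

```perl
use strict;
use warnings;

my $seq = 'AACCTTCCTTCC';    # made-up twelve-base sequence
my $end = length($seq);      # default end point, as in translate_frame

my %frame_of = ();

foreach my $start (1, 2, 3) {
    # Positions are counted from 1, but substr counts from 0, hence $start - 1
    $frame_of{$start} = substr($seq, $start - 1, $end - $start + 1);
    print "Frame $start: $frame_of{$start}\n";
}
```

Frame 2 drops the first base and frame 3 drops the first two, which is why the same sequence can be read as three different series of codons in the forward direction.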
Now, we have two modules and code that exercises them; let’s look at some more facets of using modules.
Using Modules
So far, the benefit of modules may seem questionable. You may be wondering what the advantage is over simple libraries (without package declarations), since the main result seems to be the necessity to refer to subroutines in the modules with longer names!
Exporting Names
There’s a way to avoid lengthy module names and still use the short ones if you place a call to the special Exporter module in the module code and modify the use MODULE declaration in the calling code.
Going back to the first example Geneticcode.pm module, recall it began with this line:
    package Geneticcode;
and included the definition for the hash %genetic_code and the subroutine codon2aa.
If you add the following lines to the beginning of the file, you can export the symbol names of variables or subroutines in the module into the namespace of the calling program. You can then use the convenient short names for things (e.g., codon2aa instead of Geneticcode::codon2aa). Here’s a short example of how it works (try typing perldoc Exporter to see the whole story):
    package Geneticcode;
    require Exporter;
    @ISA = qw(Exporter);
    @EXPORT_OK = qw(...);      # symbols to export on request
Here’s how to export the name codon2aa from the module only when explicitly requested:
    @EXPORT_OK = qw(codon2aa);      # symbols to export on request
The calling program then has to explicitly request the codon2aa symbol like so:
    use Geneticcode qw(codon2aa);
If you use this approach, the calling program can just say:
    codon2aa($codon);
instead of:
    Geneticcode::codon2aa($codon);
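To try the export mechanism without creating separate files, the following single-file sketch places a cut-down Geneticcode package and the calling code together. Several details are assumptions made for the example: the two-entry hash, the use Exporter 'import' form (a variant of the require Exporter and @ISA lines that is compatible with use strict when combined with our), and the explicit call to import, which stands in for use Geneticcode qw(codon2aa) because there is no Geneticcode.pm file to load here.

```perl
use strict;
use warnings;

{
    # Cut-down stand-in for Geneticcode.pm, kept in the same file for illustration
    package Geneticcode;

    use Exporter 'import';             # borrow Exporter's import method
    our @EXPORT_OK = qw(codon2aa);     # symbols to export on request

    my %genetic_code = ( 'ATG' => 'M', 'TAA' => '_' );

    sub codon2aa {
        my ($codon) = @_;
        return $genetic_code{ uc $codon };
    }
}

# With a real Geneticcode.pm file you would write: use Geneticcode qw(codon2aa);
Geneticcode->import('codon2aa');

# The short name now works in the main package
print codon2aa('atg'), "\n";
```

Requesting a name that is not listed in @EXPORT_OK makes the import fail with an error, which is what keeps the module from placing unexpected names in the caller’s namespace.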
The Exporter module that’s included in the standard Perl distribution has several other optional behaviors, but the way just shown is the safest and most useful. As you’ll see, the object-oriented programming style of using modules doesn’t use the Exporter facility, but it is a useful thing to have in your bag of tricks. For more information about exporting (such as why exporting is also known as “polluting your namespace”), see the Perl documentation for the Exporter module (by typing perldoc Exporter at a command line or by going to the http://www.perldoc.com web page).
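To see the pieces of this section working together, here is a minimal single-file sketch. It makes two assumptions that are not in the chapter's code: the Geneticcode package is a stand-in defined inline with a two-codon table (not the full module), and import is called directly, which is what use Geneticcode qw(codon2aa) does once the module lives in its own file.

```perl
use strict;
use warnings;

# Stand-in for Geneticcode.pm, defined inline so the sketch runs as one file.
# The two-entry codon table is illustrative only.
BEGIN {
    package Geneticcode;
    use Exporter;
    our @ISA       = qw(Exporter);
    our @EXPORT_OK = qw(codon2aa);    # symbols to export on request

    my %genetic_code = ( ATG => 'M', TGG => 'W' );

    sub codon2aa {
        my ($codon) = @_;
        return $genetic_code{ uc $codon };
    }
}

# Stands in for: use Geneticcode qw(codon2aa);
BEGIN { Geneticcode->import('codon2aa') }

print codon2aa('ATG'), "\n";                 # short, imported name
print Geneticcode::codon2aa('TGG'), "\n";    # fully qualified name still works
```

Because codon2aa is listed in @EXPORT_OK rather than @EXPORT, a calling program that does not ask for it by name gets nothing imported and must keep using the fully qualified name.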
CPAN Modules
The Comprehensive Perl Archive Network (CPAN, http://www.cpan.org) is an impressively large collection of Perl code (mostly Perl modules). CPAN is easily accessible and searchable on the Web, and you can use its modules for a variety of programming tasks.
By now you should have the basic idea of how modules are defined and used, so let’s take some time to explore CPAN to see what goodies are available.
There are two important points about CPAN. First, a large number of the things you might want your programs to do have already been programmed and are easily obtained in downloadable modules. You just have to go find them at CPAN, install them on your computer, and call them from your program. We’ll take a look at an example of exactly that in this section.
Second, all code on CPAN is free of charge and available for use by a very unrestrictive copyright declaration. Sound good? Keep reading.
CPAN includes convenient ways to search for useful modules, and there’s a CPAN.pm module included with Perl that makes downloading and installing modules quite easy (when things work well, which they usually do). If you can’t find CPAN.pm, you should consider updating your version of Perl.
You can find more information by typing the following at the command line:
perldoc CPAN
You can also check the Frequently Asked Questions (FAQ) available at the CPAN web site.
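If you are not sure whether CPAN.pm is present on your system, a short check like the following sketch can tell you. It only tries to load the module and report its version and location; the variable name $have_cpan is chosen here for illustration.

```perl
use strict;
use warnings;

# Try to load CPAN.pm without dying if it is absent.
my $have_cpan = eval { require CPAN; 1 } ? 1 : 0;

if ($have_cpan) {
    # %INC records the file each loaded module came from.
    print "CPAN.pm version ", CPAN->VERSION, " loaded from $INC{'CPAN.pm'}\n";
}
else {
    print "CPAN.pm could not be loaded: $@";
}
```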
What’s Available at CPAN?
The CPAN web site offers several “views” of the CPAN collection of modules and several alternate ways of searching (by module name, category, full text search of the module documentation, etc.). Here is the top-level organization of the modules by overall category:
Development Support
Operating System Interfaces
Networking Devices IPC
Data Type Utilities
Database Interfaces
User Interfaces
Language Interfaces
File Names Systems Locking
String Lang Text Proc
Opt Arg Param Proc
Internationalization Locale
Security and Encryption
World Wide Web HTML HTTP CGI
Server and Daemon Utilities
Archiving and Compression
Images Pixmaps Bitmaps
Mail and Usenet News
Control Flow Utilities
File Handle Input Output
Microsoft Windows Modules
Miscellaneous Modules
Commercial Software Interfaces
Not In Modulelist
Searching CPAN
CPAN’s main web page has a few ways to search the contents. Let’s say you need to perform some statistics and are looking for code that’s already available. We’ll go through the steps necessary to search for the code, download and install it, and use the module in a program.
At the main CPAN page, look for “searching” and click on search.cpan.org. If you search for “statistics” in all locations, you’ll get over 300 hits, so you should restrict your search to modules with the pull-down menu. You’ll get 25 hits (more by the time you read this); here’s what you’ll see:
1. Statistics::Candidates
   Statistics-MaxEntropy-0.9 - 26 Nov 1998 - Hugo WL ter Doest
2. Statistics::ChiSquare
   How random is your data?
   Statistics-ChiSquare-0.3 - 23 Nov 2001 - Jon Orwant
3. Statistics::Contingency
   Calculate precision, recall, F1, accuracy, etc.
   Statistics-Contingency-0.03 - 09 Aug 2002 - Ken Williams
4. Statistics::DEA
   Discontiguous Exponential Averaging
   Statistics-DEA-0.04 - 17 Aug 2002 - Jarkko Hietaniemi
5. Statistics::Descriptive
   Module of basic descriptive statistical functions.
   Statistics-Descriptive-2.4 - 26 Apr 1999 - Colin Kuskie
6. Statistics::Distributions
   Perl module for calculating critical values of common statistical distributions
   Statistics-Distributions-0.07 - 22 Jun 2001 - Michael Kospach
7. Statistics::Frequency
   simple counting of elements
   Statistics-Frequency-0.02 - 24 Apr 2002 - Jarkko Hietaniemi
8. Statistics::GaussHelmert
   General weighted least squares estimation
   Statistics-GaussHelmert-0.05 - 18 Apr 2002 - Stephan Heuel
9. Statistics::LTU
   An implementation of Linear Threshold Units
   Statistics-LTU-2.8 - 27 Feb 1997 - Tom Fawcett
10. Statistics::Lite
    Small stats stuff.
    Statistics-Lite-1.02 - 15 Apr 2002 - Brian Lalonde
11. Statistics::MaxEntropy
    Statistics-MaxEntropy-0.9 - 26 Nov 1998 - Hugo WL ter Doest
12. Statistics::OLS
    perform ordinary least squares and associated statistics, v 0.07.
    Statistics-OLS-0.07 - 13 Oct 2000 - Sanford Morton
13. Statistics::ROC
    receiver-operator-characteristic (ROC) curves with nonparametric confidence bounds
    Statistics-ROC-0.01 - 22 Jul 1998 - Hans A. Kestler
14. Statistics::Regression
    weighted linear regression package (line+plane fitting)
    StatisticsRegression - 26 May 2001 - ivo welch
15. Statistics::SparseVector
    Perl5 extension for manipulating sparse bitvectors
    Statistics-MaxEntropy-0.9 - 26 Nov 1998 - Hugo WL ter Doest
16. Statistics::Descriptive::Discrete
    Compute descriptive statistics for discrete data sets.
    Statistics-Descriptive-Discrete-0.07 - 13 Jun 2002 - Rhet Turnbull
17. Bio::Tree::Statistics
    Calculate certain statistics for a Tree
    bioperl-1.0.2 - 16 Jul 2002 - Ewan Birney
18. Device::ISDN::OCLM::Statistics
    OCLM statistics superclass
    Device-ISDN-OCLM-0.40 - 02 Jan 2000 - Merlin Hughes
19. Device::ISDN::OCLM::CurrentStatistics
    OCLM current call statistics
    Device-ISDN-OCLM-0.40 - 02 Jan 2000 - Merlin Hughes
20. Device::ISDN::OCLM::ISDNStatistics
    OCLM ISDN statistics
    Device-ISDN-OCLM-0.40 - 02 Jan 2000 - Merlin Hughes
21. Device::ISDN::OCLM::Last10Statistics
    OCLM Last10 call statistics
    Device-ISDN-OCLM-0.40 - 02 Jan 2000 - Merlin Hughes
22. Device::ISDN::OCLM::LastStatistics
    OCLM last call statistics
    Device-ISDN-OCLM-0.40 - 02 Jan 2000 - Merlin Hughes
23. Device::ISDN::OCLM::ManualStatistics
    OCLM manual call statistics
    Device-ISDN-OCLM-0.40 - 02 Jan 2000 - Merlin Hughes
24. Device::ISDN::OCLM::SPStatistics
    OCLM service provider statistics
    Device-ISDN-OCLM-0.40 - 02 Jan 2000 - Merlin Hughes
25. Device::ISDN::OCLM::SystemStatistics
    OCLM system statistics
    Device-ISDN-OCLM-0.40 - 02 Jan 2000 - Merlin Hughes
Let’s check out the Statistics::ChiSquare module. First, click on the link to Statistics::ChiSquare; you’ll see a summary of the module, complete with a description, overview, discussion of the method, examples of use, and information about the author.
This module looks interesting; let’s download and install it. How big is the source code? If you click on the source link, you’ll find that the module is really just one short subroutine with the documentation defined right in the module. Here’s the subroutine definition part of the module:
package Statistics::ChiSquare;

# ChiSquare.pm
#
# Jon Orwant, orwant@media.mit.edu
#
# 31 Oct 95, revised Mon Oct 18 12:16:47 1999, and again November 2001
# to fix an off-by-one error
#
# Copyright 1995, 1999, 2001 Jon Orwant. All rights reserved.
# This program is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#
# Version 0.3. Module list status is "Rdpf"

use strict;
use vars qw($VERSION @ISA @EXPORT);

require Exporter;
require AutoLoader;

@ISA = qw(Exporter AutoLoader);
# Items to export into callers namespace by default. Note: do not export
# names by default without a very good reason. Use EXPORT_OK instead.
# Do not simply export all your public functions/methods/constants.
@EXPORT = qw(chisquare);

$VERSION = '0.3';

my @chilevels = (100, 99, 95, 90, 70, 50, 30, 10, 5, 1);
my %chitable = ( );

# assume the expected probability distribution is uniform
sub chisquare {
    my @data = @_;
    @data = @{$data[0]} if @data == 1 and ref($data[0]);
    my $degrees_of_freedom = scalar(@data) - 1;
    my ($chisquare, $num_samples, $expected, $i) = (0, 0, 0, 0);
    if (! exists($chitable{$degrees_of_freedom})) {
        return "I can't handle ", scalar(@data), " choices without a better table.";
    }
    foreach (@data) { $num_samples += $_ }
    $expected = $num_samples / scalar(@data);
    return "There's no data!" unless $expected;
    foreach (@data) {
        $chisquare += (($_ - $expected) ** 2) / $expected;
    }
    foreach (@{$chitable{$degrees_of_freedom}}) {
        if ($chisquare < $_) {
            return "There's a >$chilevels[$i+1]% and <$chilevels[$i]% chance that this data is random.";
        }
        $i++;
    }
    return "There's a <$chilevels[$#chilevels]% chance that this data is random.";
}

$chitable{1} = [0.00016, 0.0039, 0.016, 0.15, 0.46, 1.07, 2.71, 3.84, 6.64];
$chitable{2} = [0.020, 0.10, 0.21, 0.71, 1.39, 2.41, 4.60, 5.99, 9.21];
$chitable{3} = [0.12, 0.35, 0.58, 1.42, 2.37, 3.67, 6.25, 7.82, 11.34];
$chitable{4} = [0.30, 0.71, 1.06, 2.20, 3.36, 4.88, 7.78, 9.49, 13.28];
$chitable{5} = [0.55, 1.14, 1.61, 3.00, 4.35, 6.06, 9.24, 11.07, 15.09];
$chitable{6} = [0.87, 1.64, 2.20, 3.83, 5.35, 7.23, 10.65, 12.59, 16.81];
$chitable{7} = [1.24, 2.17, 2.83, 4.67, 6.35, 8.38, 12.02, 14.07, 18.48];
$chitable{8} = [1.65, 2.73, 3.49, 5.53, 7.34, 9.52, 13.36, 15.51, 20.09];
$chitable{9} = [2.09, 3.33, 4.17, 6.39, 8.34, 10.66, 14.68, 16.92, 21.67];
$chitable{10} = [2.56, 3.94, 4.86, 7.27, 9.34, 11.78, 15.99, 18.31, 23.21];
$chitable{11} = [3.05, 4.58, 5.58, 8.15, 10.34, 12.90, 17.28, 19.68, 24.73];
$chitable{12} = [3.57, 5.23, 6.30, 9.03, 11.34, 14.01, 18.55, 21.03, 26.22];
$chitable{13} = [4.11, 5.89, 7.04, 9.93, 12.34, 15.12, 19.81, 22.36, 27.69];
$chitable{14} = [4.66, 6.57, 7.79, 10.82, 13.34, 16.22, 21.06, 23.69, 29.14];
$chitable{15} = [5.23, 7.26, 8.55, 11.72, 14.34, 17.32, 22.31, 25.00, 30.58];
$chitable{16} = [5.81, 7.96, 9.31, 12.62, 15.34, 18.42, 23.54, 26.30, 32.00];
$chitable{17} = [6.41, 8.67, 10.09, 13.53, 16.34, 19.51, 24.77, 27.59, 33.41];
$chitable{18} = [7.00, 9.39, 10.87, 14.44, 17.34, 20.60, 25.99, 28.87, 34.81];
$chitable{19} = [7.63, 10.12, 11.65, 15.35, 18.34, 21.69, 27.20, 30.14, 36.19];
$chitable{20} = [8.26, 10.85, 12.44, 16.27, 19.34, 22.78, 28.41, 31.41, 37.57];

1;
Some of this code will look familiar; some may not. Check out the use of package, use strict, and require Exporter; they’re parts of Perl you’ve just seen. You’ll also see references to $VERSION, AutoLoader, use vars, and an initialization of chitable, a hash whose values are arrays; these will be covered later. For now, you may want to take a quick read-through of the code and get some personal satisfaction at how much of it makes sense.
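To make the arithmetic inside chisquare concrete, the following sketch repeats its two summing loops by hand, without loading the module. The input is an assumption chosen for illustration: 70 heads and 30 tails from 100 coin flips.

```perl
use strict;
use warnings;

# Observed counts per category (illustrative: 70 heads, 30 tails).
my @data = (70, 30);

# Total number of samples, as in the module's first loop.
my $num_samples = 0;
$num_samples += $_ foreach @data;

# Under a uniform distribution every category expects the same count.
my $expected = $num_samples / scalar(@data);

# Sum of squared deviations from the expected count, scaled by it.
my $chisquare = 0;
$chisquare += (($_ - $expected) ** 2) / $expected foreach @data;

printf "expected per category: %g\n", $expected;
printf "chi-square statistic:  %g\n", $chisquare;
```

With two categories there is one degree of freedom, and the statistic computed here (16) is larger than every entry in $chitable{1}, so the subroutine would fall through to its final return statement.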
Indeed, one of the really nice things about most modules is that you don’t really have to read the code very often. Usually you can just install the module, read enough of the documentation to see how to call it from your program, and you’re off and running. Let’s take that approach now.
Installing Modules Using CPAN.pm
Our next task is to install the module using CPAN.pm. This section contains a log from when I installed Statistics::ChiSquare on my Linux computer using CPAN.pm.
In fact, to make things easy, here’s the section of the CPAN FAQ that addresses installing modules:
How do I install Perl modules?
Installing a new module can be as simple as typing perl -MCPAN -e 'install Chocolate::Belgian'. The CPAN.pm documentation has more complete instructions on how to use this convenient tool. If you are uncomfortable with having something take that much control over your software installation, or it otherwise doesn't work for you, the perlmodinstall documentation covers module installation for UNIX, Windows and Macintosh in more familiar terms.
Finally, if you're using ActivePerl on Windows, the PPM (Perl Package Manager) has much of the same functionality as CPAN.pm.
The following is my install log. Notice that all I have to do is type a couple of lines, and everything else that follows is automatic!
[tisdall@coltrane tisdall]$ perl -MCPAN -e 'install Statistics::ChiSquare'
CPAN: Storable loaded ok
mkdir /root/.cpan: Permission denied at /usr/local/lib/perl5/5.6.1/CPAN.pm line 2218
[tisdall@coltrane tisdall]$ su
Password:
[root@coltrane tisdall]# perl -MCPAN -e 'install Statistics::ChiSquare'
CPAN: Storable loaded ok
Going to read /root/.cpan/Metadata
Database was generated on Wed, 20 Mar 2002 00:39:29 GMT
CPAN: LWP::UserAgent loaded ok
Fetching with LWP:
ftp://cpan.cse.msu.edu/authors/01mailrc.txt.gz
Going to read /root/.cpan/sources/authors/01mailrc.txt.gz
CPAN: Compress::Zlib loaded ok
Fetching with LWP:
ftp://cpan.cse.msu.edu/modules/02packages.details.txt.gz
Going to read /root/.cpan/sources/modules/02packages.details.txt.gz
Database was generated on Mon, 26 Aug 2002 00:22:07 GMT
There's a new CPAN.pm version (v1.62) available!
[Current version is v1.59_54]
You might want to try
install Bundle::CPAN
reload cpan
without quitting the current session. It should be a seamless upgrade
while we are running...
Fetching with LWP:
ftp://cpan.cse.msu.edu/modules/03modlist.data.gz
Going to read /root/.cpan/sources/modules/03modlist.data.gz
Going to write /root/.cpan/Metadata
Running install for module Statistics::ChiSquare
Running make for J/JO/JONO/Statistics-ChiSquare-0.3.tar.gz
Fetching with LWP:
ftp://cpan.cse.msu.edu/authors/id/J/JO/JONO/Statistics-ChiSquare-0.3.tar.gz
CPAN: MD5 loaded ok
Fetching with LWP:
ftp://cpan.cse.msu.edu/authors/id/J/JO/JONO/CHECKSUMS
Checksum for /root/.cpan/sources/authors/id/J/JO/JONO/Statistics-ChiSquare-0.3.tar.gz ok
Scanning cache /root/.cpan/build for sizes
Deleting from cache: /root/.cpan/build/IO-stringy-2.108 (21.4>20.0 MB)
Deleting from cache: /root/.cpan/build/XML-Node-0.11 (20.8>20.0 MB)
Deleting from cache: /root/.cpan/build/bioperl-0.7.2 (20.7>20.0 MB)
Statistics/ChiSquare-0.3/
Statistics/ChiSquare-0.3/ChiSquare.pm
Statistics/ChiSquare-0.3/Makefile.PL
Statistics/ChiSquare-0.3/test.pl
Statistics/ChiSquare-0.3/Changes
Statistics/ChiSquare-0.3/MANIFEST
Package seems to come without Makefile.PL.
(The test -f "/root/.cpan/build/Statistics/Makefile.PL" returned false.)
Writing one on our own (setting NAME to StatisticsChiSquare)
CPAN.pm: Going to build J/JO/JONO/Statistics-ChiSquare-0.3.tar.gz
Checking if your kit is complete...
Looks good
Writing Makefile for Statistics::ChiSquare
Writing Makefile for StatisticsChiSquare
make[1]: Entering directory `/root/.cpan/build/Statistics/ChiSquare-0.3'
cp ChiSquare.pm ../blib/lib/Statistics/ChiSquare.pm
AutoSplitting ../blib/lib/Statistics/ChiSquare.pm (../blib/lib/auto/Statistics/ChiSquare)
Manifying ../blib/man3/Statistics::ChiSquare.3
make[1]: Leaving directory `/root/.cpan/build/Statistics/ChiSquare-0.3'
/usr/bin/make -- OK
Running make test
make[1]: Entering directory `/root/.cpan/build/Statistics/ChiSquare-0.3'
make[1]: Leaving directory `/root/.cpan/build/Statistics/ChiSquare-0.3'
make[1]: Entering directory `/root/.cpan/build/Statistics/ChiSquare-0.3'
PERL_DL_NONLAZY=1 /usr/bin/perl -I../blib/arch -I../blib/lib -I/usr/local/lib/perl5/5.6.1/i686-linux -I/usr/local/lib/perl5/5.6.1 test.pl
1..2
ok 1
ok 2
make[1]: Leaving directory `/root/.cpan/build/Statistics/ChiSquare-0.3'
/usr/bin/make test -- OK
Running make install
make[1]: Entering directory `/root/.cpan/build/Statistics/ChiSquare-0.3'
make[1]: Leaving directory `/root/.cpan/build/Statistics/ChiSquare-0.3'
Installing /usr/local/lib/perl5/site_perl/5.6.1/Statistics/ChiSquare.pm
Installing /usr/local/lib/perl5/site_perl/5.6.1/auto/Statistics/ChiSquare/autosplit.ix
Installing /usr/local/man/man3/Statistics::ChiSquare.3
Writing /usr/local/lib/perl5/site_perl/5.6.1/i686-linux/auto/StatisticsChiSquare/.packlist
Appending installation info to /usr/local/lib/perl5/5.6.1/i686-linux/perllocal.pod
/usr/bin/make install UNINST=1 -- OK
[root@coltrane tisdall]#
This may seem like a confusing amount of output, but, again, all you have to do is type a couple of lines, and the installation follows automatically.
You may get something like the following message when you try to install a CPAN module:
[tisdall@coltrane tisdall]$ perl -MCPAN -e 'install Statistics::ChiSquare'
CPAN: Storable loaded ok
mkdir /root/.cpan: Permission denied at /usr/local/lib/perl5/5.6.1/CPAN.pm line 2218
As you can see, it didn’t work, and it produced an error message. On Unix machines, it’s often necessary to become root to install things.[2] In that case, use the Unix su command and try the CPAN command again:
[tisdall@coltrane tisdall]$ su
Password:
[root@coltrane tisdall]# perl -MCPAN -e 'install Statistics::ChiSquare'
Great, it worked. If you look over the rather verbose output, you’ll see that it finds the module, builds it, tests it, installs it, and logs the installation.
Pretty easy, huh?
It’s usually this easy, but not always. Occasionally, errors result, and the module may not be installed. In that case, the error messages may be enough to explain the problem; for instance, the module may depend on another module you have to install first. Another problem is that some modules haven’t been tested on, or even designed to work on, all operating systems; if you try to install a Windows-specific module on Linux, it is likely to complain. In extreme cases, you can contact the author, whose email address is usually given in the module documentation.
Using the Newly Installed CPAN Module
Now comes the payoff. Let’s look again at the documentation for the module and see if we can use it from our own Perl code.
Now that the module is installed, you can see the documentation by typing:
perldoc Statistics::ChiSquare
You can also simply go back to the web documentation found at http://search.cpan.org. Either way, you’ll find the following example using this ChiSquare module:
NAME
    "Statistics::ChiSquare" - How random is your data?
SYNOPSIS
    use Statistics::Chisquare;
    print chisquare(@array_of_numbers);
    Statistics::ChiSquare is available at a CPAN site near you.
DESCRIPTION
    Suppose you flip a coin 100 times, and it turns up heads 70 times. Is the coin fair? Suppose you roll a die 100 times, and it shows 30 sixes. Is the die loaded?
    In statistics, the chi-square test calculates "how random" a series of numbers is. But it doesn't simply say "yes" or "no". Instead, it gives you a confidence interval, which sets upper and lower bounds on the likelihood that the variation in your data is due to chance. See the examples below.
    ...
The documentation continues with more discussion and some concrete examples that use the module and interpret the results.
Very often, the SYNOPSIS part of the documentation is all you need to look at. It shows you specific examples of how to call the code in the module. In this case, because it’s a very simple module, there is just one subroutine that can be used. As you see from the documentation excerpt, you just need to pass the chisquare subroutine an array of numbers and print out the return value to use the code. Let’s try it. We’ll take as our input an array of numbers that corresponds to the stops of the Broadway-7th Avenue local subway train on the west side of Manhattan, from 14th Street up to 137th Street in Harlem. (We’ll assume you didn’t run fast enough and missed the A train.) Let’s see how random these stops really are:
use strict;
use warnings;
use Statistics::ChiSquare;

my(@subwaystops) = (14, 18, 23, 28, 34, 42, 50, 59, 66, 72, 79, 86, 96, 103, 110, 116, 125, 137);

print chisquare(@subwaystops);
This produces the output:
There's a <1% chance that this data is random.
(Knowing firsthand the feelings of long-suffering New York City Subway riders, I predict that this result might provoke some spirited discussion. Nevertheless, we seem to have working code.)
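The chisquare subroutine works on counts per category, so a more typical input in a bioinformatics setting might be the number of times each base occurs in a sequence. The sketch below is only an illustration: the DNA string is made up, and the module is loaded with require inside eval so that the script still runs, and says so, on a machine where Statistics::ChiSquare has not been installed.

```perl
use strict;
use warnings;

# Made-up sequence, for illustration only.
my $dna = 'ACGTACGTAACCGGTTAAAA';

# Count how often each base occurs.
my %count = ( A => 0, C => 0, G => 0, T => 0 );
$count{$_}++ foreach split //, $dna;

# chisquare expects a list of counts, one per category.
my @counts = map { $count{$_} } qw(A C G T);
print "Base counts (A C G T): @counts\n";

# Load the module at runtime; use the fully qualified name because
# require, unlike use, does not import chisquare into this namespace.
if ( eval { require Statistics::ChiSquare; 1 } ) {
    print Statistics::ChiSquare::chisquare(@counts), "\n";
}
else {
    print "Statistics::ChiSquare is not installed; skipping the test.\n";
}
```

A sequence this short is far too small for a meaningful test; the point is only the shape of the data handed to the subroutine.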
Problems with CPAN Modules
Actually, the sharp-eyed reader may have noticed a problem in our mad dash uptown. In the first line of the SYNOPSIS section, there’s the following:
use Statistics::Chisquare;
The name of the module is spelled Chisquare, whereas in all other places in the documentation the module is spelled ChiSquare with a capital S. In Perl, the case of a letter, uppercase or lowercase, is important, and this looks suspiciously like a typographical error in the documentation. If you try use Statistics::Chisquare, you’ll discover that the module can’t be found, whereas if you try use Statistics::ChiSquare, the module is there.
This is a minor bug, but some modules have poor documentation, and it can be a time-consuming problem, especially if you are forced to wade into the module code or try various tests, to figure out how the module works.
Apart from bugs, I’ve also mentioned the problem that some modules are not tested, or designed, for all operating systems. In addition, many modules require other modules to be present. It’s possible to configure CPAN to automatically install all the required modules a requested module uses, as described in the CPAN documentation, but you may need to intervene personally. It’s useful to remember that if you have a program that uses a certain module running on one computer, and you move the program to another computer, you may have to install the required modules on the new computer as well.
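Before moving a program to another computer, it can help to check which of its modules are actually available there. The following is a minimal sketch of such a check; the helper name missing_modules and the list in @required are assumptions made for this example, not part of any standard tool.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names from @_ that cannot be loaded here.
sub missing_modules {
    my @missing;
    foreach my $module (@_) {
        # Turn Some::Module into Some/Module.pm, the form require expects
        # when it is given a string rather than a bareword.
        ( my $file = "$module.pm" ) =~ s{::}{/}g;
        push @missing, $module unless eval { require $file; 1 };
    }
    return @missing;
}

# Modules this program depends on (illustrative list).
my @required = qw(Exporter Statistics::ChiSquare);
my @missing  = missing_modules(@required);

if (@missing) {
    print "Install these modules before running the program: @missing\n";
}
else {
    print "All required modules are available.\n";
}
```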
Saving the worst for last, it’s also important to remember that contributing to CPAN is open to one and all, and not all the code there is well-written or well-tested. The heavily used modules are, but counterexamples can be found. So, don’t bet the farm on your code just because it uses a CPAN module; you should still carefully read the documentation for the module and test your program.
The CPAN FAQ explains in detail the way to be a good citizen when it comes to testing and reporting bugs that you discover in CPAN code.
Exercises
- Exercise 1.1
What are the problems that might arise when dividing program code into separate module files?
- Exercise 1.2
What are the differences between libraries, modules, packages, and namespaces?
- Exercise 1.3
Write a module that finds modules on your computer.
- Exercise 1.4
Where do the standard Perl distribution modules live on your computer?
- Exercise 1.5
Research how Perl manages its namespaces.
- Exercise 1.6
When might it be necessary to export names from a module? When might it be useful? When might it be convenient? When might it be a very bad idea?
- Exercise 1.7
The program testGeneticcode contains the following loop:
# Translate each three-base codon to an amino acid, and append to a protein
for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
    $protein .= Geneticcode::codon2aa( substr($dna,$i,3) );
}
Here’s another way to accomplish that loop:
# Translate each three-base codon to an amino acid, and append to a protein
my $i=0;
while (my $codon = substr($dna, $i += 3, 3) ) {
    $protein .= Geneticcode::codon2aa( $codon );
}
Compare the two methods. Which is easier to understand? Which is easier to maintain? Which is faster? Why?
- Exercise 1.8
The subroutine codon2aa causes the entire program to halt when it encounters a “bad” codon in the data. Often (usually) it is best for a subroutine to return some indication that it encountered a problem and let the calling program decide how to handle it. It makes the subroutine more generally useful if it isn’t always halting the program (although that is what you want to do sometimes).
Rewrite codon2aa and the calling program testGeneticcode so that the subroutine returns some error, perhaps the value undef, and the calling program checks for that error and performs some action.
- Exercise 1.9
Write a separate module for each of the following: reading a file, extracting FASTA sequence data, and printing sequence data to the screen.
- Exercise 1.10
[1] Perl libraries were traditionally put in files ending with .pl, which stands for perl library; the term library is also used to refer to a collection of Perl modules. The common denominator is that a library is a collection of reusable subroutines.
[2] You may need to contact your system administrator about getting root permission. The CPAN documentation discusses how to do a non-root installation. If you’re not on a Unix or Linux machine and are using ActiveState’s Perl on a Windows machine, for instance, you need to consult that documentation.