A large part of what you, the Perl bioinformatics programmer, will spend your time doing amounts to variations on the same theme as Examples 4-1 and 4-2. You'll get some data, be it DNA, proteins, GenBank entries, or what have you; you'll manipulate the data; and you'll print out some results.
Example 4-3 is another program that manipulates DNA; it transcribes DNA to RNA. In the cell, this transcription of DNA to RNA is the outcome of the workings of a delicate, complex, and error-correcting molecular machinery. Here it's a simple substitution. When DNA is transcribed to RNA, all the T's are changed to U's, and that's all that our program needs to know.
Example 4-3. Transcribing DNA into RNA
#!/usr/bin/perl -w
# Transcribing DNA into RNA

# The DNA
$DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

# Print the DNA onto the screen
print "Here is the starting DNA:\n\n";

print "$DNA\n\n";

# Transcribe the DNA to RNA by substituting all T's with U's.
$RNA = $DNA;

$RNA =~ s/T/U/g;

# Print the RNA onto the screen
print "Here is the result of transcribing the DNA to RNA:\n\n";

print "$RNA\n";

# Exit the program.
exit;
Here's the output of Example 4-3:
Here is the starting DNA:

ACGGGAGGACGGGAAAATTACTACGGCATTAGC

Here is the result of transcribing the DNA to RNA:

ACGGGAGGACGGGAAAAUUACUACGGCAUUAGC
This short program introduces an important part of Perl: the ability to easily manipulate text data such as a string of DNA. The manipulations can be of many different sorts: translation, reversal, substitution, deletion, reordering, and so on. This facility of Perl is one of the main reasons for its success in bioinformatics and among programmers in general.
$RNA = $DNA;
Note that after this statement is executed, there's a variable called $RNA that actually contains DNA. Remember this is perfectly legal (you can call variables anything you like), but it is potentially confusing to have inaccurate variable names. Now in this case, the copy is preceded with informative comments and followed immediately with a statement that indeed causes the variable $RNA to contain RNA, so it's all right. Here's a way to prevent $RNA from containing anything except RNA:
($RNA = $DNA) =~ s/T/U/g;
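As a minimal sketch of how this one-line form behaves, the parentheses make the assignment happen first, and the substitution is then applied to the copy rather than to the original. The short sequence below is made up for illustration, and `use strict`, `use warnings`, and `my` are additions that Example 4-3 does not use.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A short, made-up DNA fragment used only for this illustration
my $DNA = 'ATTGCAT';

# Copy $DNA into $RNA, then apply the substitution to $RNA only.
# The parentheses force the assignment to happen before the binding.
(my $RNA = $DNA) =~ s/T/U/g;

print "DNA: $DNA\n";    # the original string is left unchanged
print "RNA: $RNA\n";    # in the copy, every T has become a U
```

Written this way, there is no point in the program at which $RNA holds untranscribed DNA.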
In Example 4-3, the transcription happens in this statement:
$RNA =~ s/T/U/g;
There are two new items in this statement: the binding operator (=~) and the substitute command s/T/U/g.
The binding operator =~ is used, obviously enough, on variables containing strings; here the variable $RNA contains DNA sequence data. The binding operator means "apply the operation on the right to the string in the variable on the left."
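To see the binding operator at work, here is a small sketch (the sequence and the variable names $sequence and $count are hypothetical, chosen only for this illustration). It binds a substitution to one particular variable and also captures the value the substitution returns, which is the number of replacements made.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A made-up sequence for illustration
my $sequence = 'GATTACA';

# Apply the substitution on the right to the string in $sequence on the left.
# The substitution changes $sequence in place and returns the number of
# replacements it made.
my $count = ($sequence =~ s/T/U/g);

print "Sequence is now: $sequence\n";
print "Replacements made: $count\n";
```

Only the variable named on the left of =~ is changed; any other variable holding sequence data is unaffected.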
The substitution operator, shown in Figure 4-1, requires a little more explanation. The different parts of the command are separated (or delimited) by the forward slash. First, the s indicates this is a substitution. After the first / comes a T, which represents the element in the string that will be replaced. After the second / comes a U, which represents the element that's going to replace the T. Finally, after the third / comes a g. The g stands for "global" and is one of several possible modifiers that can appear in this part of the statement. Global means "make this substitution throughout the entire string," that is to say, everywhere possible in the string.
Thus, the meaning of the statement is: "replace every T with a U in the string data stored in the variable $RNA."
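The effect of the g modifier is easiest to see by running the same substitution with and without it. The following is a sketch using a made-up sequence; the variable names $first_only and $all are hypothetical and do not appear in Example 4-3.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A made-up DNA fragment containing three T's
my $DNA = 'TTAGT';

# Without g, only the first T found in the string is replaced
(my $first_only = $DNA) =~ s/T/U/;

# With g, every T in the string is replaced
(my $all = $DNA) =~ s/T/U/g;

print "Without g: $first_only\n";    # UTAGT
print "With g:    $all\n";           # UUAGU
```

For transcription the g modifier is essential, since leaving it off would convert only the first T in the sequence.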
The substitution operator is an example of the use of regular expressions. Regular expressions are the key to text manipulation, one of the most powerful features of Perl as you'll see in later chapters.
Briefly, the coding DNA strand is the reverse complement of the other strand, which is used as a template to synthesize its reverse complement as RNA, with T's replaced by U's. With the two reverse complements, this is the same as the coding strand with the T→U replacement.
 We're ignoring the mechanism of the splicing out of introns, obviously.
T stands for thymine; U stands for uracil.
Recall the discussion in Section 18.104.22.168 about the importance of the order of the parts in an assignment statement. Here, the value of $DNA, that is, the DNA sequence data that has been stored in the $DNA variable, is being assigned to $RNA. If you had written $DNA = $RNA;, the value of the $RNA variable (which is empty) would have been assigned to the $DNA variable, in effect wiping out the DNA sequence data in that variable and leaving two