Beginning Perl for Bioinformatics
by James Tisdall

The unconfirmed error reports are from readers. They have not yet been approved or disproved by the author or editor and represent solely the opinion of the reader.

Here's a key to the markup:

[page-number]: serious technical mistake
{page-number}: minor technical mistake
<page-number>: important language/formatting problem
(page-number): language change or minor formatting problem
?page-number?: reader question or request for clarification

This page was updated March 23, 2007.

UNCONFIRMED errors and comments from readers:

[1] 1;
Can't download the examples and answers, Bro!

[55] Exercise 4.5;
It seems to me that there is a misunderstanding of how transcription occurs (DNA to RNA or RNA to DNA). I would appreciate your feedback. Thanks, Hemant

    # Perl: Exercise 4-5 Reverse Transcribing RNA into DNA

    # The RNA
    $RNA = 'ACGGGAGGACGGGAAAAUUACUACGGCAUUAGC';
    print "\n$RNA\n";

    # Transcribe RNA to DNA - replace 'U' with 'T'.
    # However, transcription occurs A -> T, U -> A, C -> G, G -> C
    # So the correct answer is as below given the above RNA structure.
    $DNA = $RNA;
    $DNA =~ tr/ACGU/TGCA/;
    print "\n$DNA\n";
    # The correct DNA seq is TGCCCTCCTGCCCTTTTAATGATGCCGTAATCG

    # The code below is incorrect
    $DNA = $RNA;
    $DNA =~ s/U/T/g;
    print "\n$DNA\n";
    # result is ACGGGAGGACGGGAAAATTACTACGGCATTAGC

    exit;

[107] foreach loop in the code;
When I ran example 6-4.pl after fixing the two bugs described in the text, the program still did not generate the correct output. It seems that the variable $receivingcommittment was never set to 1. It turned out that the variable name was misspelt as "$recieving..." whereas it should be "$receiving...". Correcting the variable name as well fixes the problem.
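A misspelt variable name like the one in the [107] report is the kind of slip that `use strict` reports at compile time, provided the variable was declared with `my` under its intended spelling. The following is only a sketch: `compile_error_for` is a hypothetical helper, and the two-line snippet stands in for the example's code rather than reproducing the book's listing.

```perl
use strict;
use warnings;

# Hypothetical helper: compile and run a snippet under strict, returning the
# error text, or an empty string if the snippet compiled and ran cleanly.
sub compile_error_for {
    my ($code) = @_;
    my $ok = eval "use strict; use warnings; $code; 1";
    return $ok ? '' : $@;
}

# Stand-in snippet (an assumption, not the book's code): the variable is
# declared under one spelling and assigned under another.
my $snippet = q{
    my $receivingcommittment = 0;
    $recievingcommittment = 1;   # misspelt: "ie" instead of "ei"
};

my $error = compile_error_for($snippet);
print $error ? "strict caught it: $error" : "compiled without complaint\n";
```

Without `use strict`, the misspelt name would silently create a second, unrelated global variable, which matches the symptom described in the report.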

[117] exercise 6.5;
No argument is passed when the subroutine is called, so the print statement is always executed, even if the file doesn't exist. It should be:

    if (file_passes_tests($file)) {
        print "File $file exists, is a regular file, and is nonzero in size\n";
    }

{132} first subroutine (second line);
As the list of nucleotides (A/C/G/T) is specifically stated in the subroutine 'randomnucleotide' (on page 133), it seems superfluous to also name them in this subroutine ('mutate') and to pass them to the second subroutine as a parameter which isn't used.

(143) last paragraph;
Hi, when I run the subroutine, the error message shows:

    syntax error at c7_s4.pl line 76, near ") {"
    Global symbol "$count" requires explicit package name at c7_s4.pl line 78.
    Global symbol "$length" requires explicit package name at c7_s4.pl line 78.
    syntax error at c7_s4.pl line 79, near "}"
    Execution of c7_s4.pl aborted due to compilation errors.

    #####################################
    sub match_percentage {
        my ($string1,$string2) =@_;
        # assume the two strings have the same length
        my $length=length($string1);
        my ($position);
        my ($count) =0;
        for ($position=0; $position < $length; ++$position) {
            if(substr($string1, $position, 1) eq (substr($string2, $position, 1))
            {++$count;}
        }
        return $count/$length;
    }

{146} 3rd paragraph;
The output of example 7-4 contains "matching positions is 0.24%" and the accompanying text says "a quarter of the positions match". This would be true if it said 24% or 0.24. 0.24% is a quarter of a percent, not 25 percent. Something is wrong here.

{185} 2nd paragraph;
it will look for restriction enzymes .... the restriction enzymes appear.
->
it will look for restriction sites .... the restriction sites appear.
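The two reports above on the matching subroutine (pages 143 and 146) can be illustrated together. In the posted code the `if` condition opens one more parenthesis than it closes, which appears to be the source of the syntax error. The ratio the subroutine returns is a fraction, so it needs multiplying by 100 before a `%` sign is printed. The sketch below is not the book's code: `match_fraction` is a hypothetical name, and the equal-length check is an added assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not the book's subroutine): fraction of positions at
# which two equal-length strings carry the same character.
sub match_fraction {
    my ($string1, $string2) = @_;
    my $length = length $string1;
    die "strings must be non-empty and of equal length\n"
        unless $length > 0 && $length == length $string2;

    my $count = 0;
    for my $position (0 .. $length - 1) {
        # The condition has no stray opening parenthesis.
        ++$count
            if substr($string1, $position, 1) eq substr($string2, $position, 1);
    }
    return $count / $length;
}

# Three of the four positions agree, so the fraction is 0.75.
my $fraction = match_fraction('ACGT', 'ACGA');
printf "fraction of matching positions: %.2f\n", $fraction;
printf "percent of matching positions:  %.0f%%\n", 100 * $fraction;
```

Keeping the returned value as a fraction and converting only at print time makes it harder to label 0.24 as "0.24%".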

{191} Example 9.2;
The (errata) correction of this example changes a foreach loop over an array which has been read in to a while loop which reads in the array, so that the range statement will work. The correction uses the open_file() subroutine. I didn't remember seeing this subroutine, and it isn't mentioned in the Index (either under its name, or under subroutines). It is on page 218. The location should be mentioned both where it is used, and in the Index.

{198} Exercise 9.6;
On line 95 of the original answer:

    for ( my $i = 1, my $j = shift(@locations) ; @locations ; $i = $j, $j = shift(@locations) ) {
        push(@digest, substr($dna, $i-1, $j-$i));
    }

This for loop misses the last restriction digest: after the last enzyme site has been shifted off, @locations is empty, so the loop stops. The right for loop should look like this:

    for ( my $i = 1, my $j = shift(@locations), my $k = 0; $k <= scalar(@locations)+2 ; $i = $j, $j = shift(@locations) ) {
        $k++;
        push(@digest, substr($dna, $i-1, $j-$i));
    }

{203} 3;
ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt is given as the location of gbrel.txt, the GenBank release notes. It is not correct (or at least not working at the moment). ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt does work.

[211] near bottom of page;
The following code in Example 10-2

    ($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);

generates an error (uninitialized value chunk 1) on my Mac, using MacPerl.

{219} sub get_annotation_and_dna;
The final statement return ($annotation, $dna) needs a ';'

(221) 6;
Using a hash for annotations is a great idea except in cases where an annotation type occurs more than once in a GenBank record. I have seen many cases of GenBank records with multiple REFERENCE annotations. I was hoping that the author would point this out and have another example showing a hash whose values were arrays of strings.
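One way to keep repeated fields such as REFERENCE, as the report above asks, is a hash whose values are array references. The sketch below is not the author's solution: `parse_annotation_multi` is a hypothetical name, the regular expression is a simplified variant of the one quoted for page 222, and the sample record is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical variant of an annotation parser: each key maps to an array
# reference, so repeated fields are appended instead of overwritten.
sub parse_annotation_multi {
    my ($annotation) = @_;
    my %results;

    # A top-level field starts with capital letters in column one and
    # continues over any following lines that begin with a space or tab.
    while ($annotation =~ /^([A-Z]+)(.*\n(?:[ \t].*\n)*)/gm) {
        my ($key, $value) = ($1, $2);
        push @{ $results{$key} }, $key . $value;
    }
    return %results;
}

# Made-up sample in the general shape of a GenBank annotation.
my $annotation = <<'END';
LOCUS       EXAMPLE01    100 bp    mRNA
DEFINITION  Example record for illustration only,
            complete cds.
REFERENCE   1  (sites)
  AUTHORS   Author,A.
REFERENCE   2  (bases 1 to 100)
  AUTHORS   Author,B.
END

my %fields = parse_annotation_multi($annotation);
for my $key (sort keys %fields) {
    printf "%-12s %d occurrence(s)\n", $key, scalar @{ $fields{$key} };
}
```

Callers that expect a single string per field can still join the array elements; the array form simply avoids losing the earlier occurrences.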

(221) example 10.5;
I have spent extraordinary effort trying to parse the elements of the Features of GenBank files ... a proper answer to Exercise 10.5 would have been wonderfully helpful ... it's disingenuous to fail to provide an answer and to say that "it makes a good class project" when this book should be designed for individuals who have no teacher; and to state that it is "straightforward but challenging" is a contradiction in terms ... in fact, it is exactly what I want to be able to do, and have not yet succeeded with after a great deal of effort.

    # Answer to Exercise 10.5
    #
    # The answer to this exercise is left to the student, as it makes a good class project. It is a straightforward but challenging extension of material already presented in the text; it also can be the basis of interesting and biologically focused projects.
    #
    # Good luck with it!

[222] bottom;
This code:

    while ( $annotation =~ /^[A-Z].*\n(^\s.*\n)*/gm)

generates a segmentation fault when the code runs on any real GenBank file, such as hs_ref_chr22.gbs or hs_ref_chr22.gbk.

[223] 1;
Example 10-6, Parsing GenBank Annotation, which begins on page 221, produces incorrect results on pages 223 and 224. In particular, the parse_annotation() subroutine does not check whether the 'field' ($key:$value) it is about to store in the hash table has already been stored. As a result, previous occurrences of a particular field are clobbered and only the last occurrence is recorded. In the example given, with the input taken from page 201, only the second "REFERENCE" field is displayed (page 224). Interestingly, the very next section, on parsing the "FEATURES" table, warns on page 228 about the possibility of running into this scenario when parsing the FEATURES table's multiple fields, some of which have the same name. The same coding solution should have been applied to the entire GenBank record.

[241] last paragraph;
. .. 3c 44 pdb1a4o.ent
->
. .. 3c 44 c1 c4 pdb1a4o.ent
The same change also has to be made on pp. 243, 244, 246, and 247.

{288} code at bottom of page;
As noted in another "confirmed" error report, there is an error in the code found at the bottom of page 288. However, I believe the solution is still in error. The proposed solution adds parentheses to the regular expressions, e.g. changing /^Query.*\n/ to /^Query(.*)\n/ and /^Sbjct.*\n/ to /^Sbjct(.*)\n/. That may correct an error (I have not tested the code, so I do not know if there are other errors), but I do not think it will fix the extraneous "ct" being prepended to the "Subject String" lines in the output at the top of page 289. That error, I believe, is caused by another faulty regular expression at the very end of the code, in the line:

    $subject =~ s/[^acgt]//g;

As you can see, this line will NOT remove c's and t's from the long, concatenated "Sbjct:" line created from the HSP hash table. Hence the multiple occurrences of "ct" in the output.
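The effect described in the {288} report can be reproduced in isolation. The sketch below uses a made-up line in the general shape of a BLAST alignment row, not the book's data. `sequence_from_alignment_line` is a hypothetical helper showing one possible way around the problem: capture only the sequence column instead of stripping characters from the whole line.

```perl
use strict;
use warnings;

# Made-up line in the shape of a BLAST alignment row (not from the book).
my $line = "Sbjct: 235 ggagatggctcagacctgaa 254\n";

# Stripping every character outside [acgt] from the whole line keeps the
# lower-case 'c' and 't' of the label "Sbjct".
(my $naive = $line) =~ s/[^acgt]//g;

# Hypothetical alternative: capture only the sequence column between the
# start and end coordinates.
sub sequence_from_alignment_line {
    my ($row) = @_;
    my ($sequence) =
        $row =~ /^(?:Query|Sbjct):?\s*\d+\s+([acgtn-]+)\s+\d+\s*$/i;
    return defined $sequence ? lc $sequence : '';
}

print "stripped whole line: $naive\n";
print "captured column:     ", sequence_from_alignment_line($line), "\n";
```

With this input the stripped string begins with "ct" ahead of the real sequence, while the captured column contains only the bases.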