Beginning Perl for Bioinformatics

Name: Beginning Perl for Bioinformatics
Author: James Tisdall
ISBN: 9780596000806

October 2001

Beginner

386 pages

12h 43m

English

O'Reilly Media, Inc.

Special Upgrade Offer
A Note Regarding Supplemental Files
Preface
  What Is Bioinformatics?
  What Bioinformatics Can Do
  About This Book
  Who This Book Is For
  Why Should I Learn to Program?
  Structure of This Book
  Conventions Used in This Book
  Comments and Questions
  Acknowledgments
1. Biology and Computer Science
  1.1. The Organization of DNA
  1.2. The Organization of Proteins
  1.3. In Silico
  1.4. Limits to Computation
2. Getting Started with Perl
  2.1. A Low and Long Learning Curve
  2.2. Perl’s Benefits
    2.2.1. Ease of Programming
    2.2.2. Rapid Prototyping
    2.2.3. Portability, Speed, and Program Maintenance
    2.2.4. Versions of Perl
  2.3. Installing Perl on Your Computer
    2.3.1. Perl May Already Be Installed!
    2.3.2. No Internet Access?
    2.3.3. Downloading
    2.3.4. Binary Versus Source Code
    2.3.5. Installation
      2.3.5.1. Unix and Linux
      2.3.5.2. Macintosh
      2.3.5.3. Windows
  2.4. How to Run Perl Programs
    2.4.1. Unix or Linux
    2.4.2. Macs
    2.4.3. Windows
  2.5. Text Editors
  2.6. Finding Help
3. The Art of Programming
  3.1. Individual Approaches to Programming
  3.2. Edit—Run—Revise (and Save)
    3.2.1. Saves and Backups
    3.2.2. Error Messages
    3.2.3. Debugging
  3.3. An Environment of Programs
    3.3.1. Open Source Programs
  3.4. Programming Strategies
  3.5. The Programming Process
    3.5.1. The Design Phase
    3.5.2. Algorithms
    3.5.3. Pseudocode and Code
    3.5.4. Comments
4. Sequences and Strings
  4.1. Representing Sequence Data
  4.2. A Program to Store a DNA Sequence
    4.2.1. Control Flow
    4.2.2. Comments Revisited
    4.2.3. Command Interpretation
    4.2.4. Statements
      4.2.4.1. Variables
      4.2.4.2. Strings
      4.2.4.3. Assignment
      4.2.4.4. Print
      4.2.4.5. Exit
  4.3. Concatenating DNA Fragments
  4.4. Transcription: DNA to RNA
  4.5. Using the Perl Documentation
  4.6. Calculating the Reverse Complement in Perl
  4.7. Proteins, Files, and Arrays
  4.8. Reading Proteins in Files
  4.9. Arrays
  4.10. Scalar and List Context
  4.11. Exercises
5. Motifs and Loops
  5.1. Flow Control
    5.1.1. Conditional Statements
      5.1.1.1. Conditional tests and matching braces
    5.1.2. Loops
      5.1.2.1. open and unless
  5.2. Code Layout
  5.3. Finding Motifs
    5.3.1. Getting User Input from the Keyboard
    5.3.2. Turning Arrays into Scalars with join
    5.3.3. do-until Loops
    5.3.4. Regular Expressions
      5.3.4.1. Regular expressions and character classes
      5.3.4.2. Pattern matching with =~ and regular expressions
  5.4. Counting Nucleotides
  5.5. Exploding Strings into Arrays
  5.6. Operating on Strings
  5.7. Writing to Files
  5.8. Exercises
6. Subroutines and Bugs
  6.1. Subroutines
    6.1.1. Advantages of Subroutines
    6.1.2. Writing Subroutines
  6.2. Scoping and Subroutines
    6.2.1. Arguments
    6.2.2. Scoping
  6.3. Command-Line Arguments and Arrays
  6.4. Passing Data to Subroutines
    6.4.1. Subroutines: Pass by Value
    6.4.2. Subroutines: Pass by Reference
  6.5. Modules and Libraries of Subroutines
  6.6. Fixing Bugs in Your Code
    6.6.1. use warnings; and use strict;
    6.6.2. Fixing Bugs with Comments and Print Statements
    6.6.3. The Perl Debugger
      6.6.3.1. A program with bugs
      6.6.3.2. How to start and stop the debugger
      6.6.3.3. Debugger command summary
      6.6.3.4. Stepping through statements with the debugger
      6.6.3.5. Setting breakpoints
      6.6.3.6. Fixing another bug
      6.6.3.7. use warnings; and use strict; redux
  6.7. Exercises
7. Mutations and Randomization
  7.1. Random Number Generators
  7.2. A Program Using Randomization
    7.2.1. Seeding the Random Number Generator
    7.2.2. Control Flow
    7.2.3. Making a Sentence
    7.2.4. Randomly Selecting an Element of an Array
    7.2.5. Formatting
    7.2.6. Another Way to Calculate the Random Position
  7.3. A Program to Simulate DNA Mutation
    7.3.1. Pseudocode Design
      7.3.1.1. Select a random position in a string
      7.3.1.2. Choose a random nucleotide
      7.3.1.3. Place a random nucleotide into a random position
    7.3.2. Improving the Design
    7.3.3. Combining the Subroutines to Simulate Mutation
    7.3.4. A Bug in Your Program?
  7.4. Generating Random DNA
    7.4.1. Bottom-up Versus Top-down
    7.4.2. Subroutines for Generating a Set of Random DNA
    7.4.3. Turning the Design into Code
  7.5. Analyzing DNA
    7.5.1. Some Notes About the Code
  7.6. Exercises
8. The Genetic Code
  8.1. Hashes
  8.2. Data Structures and Algorithms for Biology
    8.2.1. A Gene Expression Database
    8.2.2. Gene Expression Data Using Unsorted Arrays
    8.2.3. Gene Expression Data Using Sorted Arrays and Binary Search
    8.2.4. Gene Expression Data Using Hashes
    8.2.5. Relational Databases
    8.2.6. DBM
  8.3. The Genetic Code
    8.3.1. Background
    8.3.2. Translating Codons to Amino Acids
    8.3.3. The Redundancy of the Genetic Code
    8.3.4. Using Hashes for the Genetic Code
  8.4. Translating DNA into Proteins
  8.5. Reading DNA from Files in FASTA Format
    8.5.1. FASTA Format
    8.5.2. A Design to Read FASTA Files
    8.5.3. A Subroutine to Read FASTA Files
    8.5.4. Writing Formatted Sequence Data
    8.5.5. A Main Program for Reading DNA and Writing Protein
  8.6. Reading Frames
    8.6.1. What Are Reading Frames?
    8.6.2. Translating Reading Frames
  8.7. Exercises
9. Restriction Maps and Regular Expressions
  9.1. Regular Expressions
  9.2. Restriction Maps and Restriction Enzymes
    9.2.1. Background
    9.2.2. Planning the Program
    9.2.3. Restriction Enzyme Data
    9.2.4. Logical Operators and the Range Operator
    9.2.5. Finding the Restriction Sites
  9.3. Perl Operations
    9.3.1. Precedence of Operations and Parentheses
  9.4. Exercises
10. GenBank
  10.1. GenBank Files
  10.2. GenBank Libraries
  10.3. Separating Sequence and Annotation
    10.3.1. Using Arrays
    10.3.2. Using Scalars
      10.3.2.1. Pattern modifiers
      10.3.2.2. Examples of pattern modifiers
      10.3.2.3. Separating annotations from sequence
  10.4. Parsing Annotations
    10.4.1. Using Arrays
    10.4.2. When to Use Regular Expressions
    10.4.3. Main Program
    10.4.4. Parsing Annotations at the Top Level
    10.4.5. Parsing the FEATURES Table
      10.4.5.1. Features
      10.4.5.2. Parsing
  10.5. Indexing GenBank with DBM
    10.5.1. DBM Essentials
    10.5.2. A DBM Database for GenBank
  10.6. Exercises
11. Protein Data Bank
  11.1. Overview of PDB
  11.2. Files and Folders
    11.2.1. Opening Directories
    11.2.2. Recursion
    11.2.3. Processing Many Files
  11.3. PDB Files
    11.3.1. PDB File Format
    11.3.2. SEQRES
  11.4. Parsing PDB Files
    11.4.1. Extracting Primary Sequence
    11.4.2. Finding Atomic Coordinates
  11.5. Controlling Other Programs
    11.5.1. The Stride Secondary Structure Predictor
    11.5.2. Parsing Stride Output
  11.6. Exercises
12. BLAST
  12.1. Obtaining BLAST
  12.2. String Matching and Homology
  12.3. BLAST Output Files
  12.4. Parsing BLAST Output
    12.4.1. Extracting Annotation and Alignments
    12.4.2. Parsing BLAST Alignments
  12.5. Presenting Data
    12.5.1. The printf Function
    12.5.2. here Documents
    12.5.3. format and write
  12.6. Bioperl
    12.6.1. Sample Modules
    12.6.2. Bioperl Tutorial Script
  12.7. Exercises
13. Further Topics
  13.1. The Art of Program Design
  13.2. Web Programming
  13.3. Algorithms and Sequence Alignment
  13.4. Object-Oriented Programming
  13.5. Perl Modules
    13.5.1. Bioperl
  13.6. Complex Data Structures
  13.7. Relational Databases
  13.8. Microarrays and XML
  13.9. Graphics Programming
  13.10. Modeling Networks
  13.11. DNA Computers
A. Resources
  A.1. Perl
    A.1.1. Web Site
    A.1.2. CPAN: Comprehensive Perl Archive Network
    A.1.3. FAQs: Frequently Asked Questions
      A.1.3.1. Beginners
    A.1.4. Online Manuals
    A.1.5. Books
    A.1.6. Conference
    A.1.7. Newsgroups
  A.2. Computer Science
    A.2.1. Algorithms
    A.2.2. Software Engineering
    A.2.3. Theory of Computer Science
    A.2.4. General Programming
  A.3. Linux
  A.4. Bioinformatics
    A.4.1. Books
    A.4.2. Governmental Organizations
    A.4.3. Conferences
  A.5. Molecular Biology
B. Perl Summary
  B.1. Command Interpretation
  B.2. Comments
  B.3. Scalar Values and Scalar Variables
    B.3.1. Strings
    B.3.2. Numbers
    B.3.3. Scalar Variables
  B.4. Assignment
  B.5. Statements and Blocks
  B.6. Arrays
  B.7. Hashes
  B.8. Operators
  B.9. Operator Precedence
  B.10. Basic Operators
    B.10.1. Arithmetic Operators
    B.10.2. Bitwise Operators
    B.10.3. String Operators
    B.10.4. File Test Operators
  B.11. Conditionals and Logical Operators
    B.11.1. true and false
    B.11.2. Logical Operators
    B.11.3. Using Logical Operators for Control Flow
    B.11.4. The if Statement
  B.12. Binding Operators
  B.13. Loops
  B.14. Input/Output
    B.14.1. Input from Files
    B.14.2. Input from STDIN
    B.14.3. Input from Files Named on the Command Line
    B.14.4. Output Commands
      B.14.4.1. Output to STDOUT, STDERR, and Files
  B.15. Regular Expressions
    B.15.1. Overview
    B.15.2. Metacharacters
      B.15.2.1. Escaping with \
      B.15.2.2. Alternation with |
      B.15.2.3. Grouping with ( )
      B.15.2.4. Character classes
      B.15.2.5. Matching any character with .
      B.15.2.6. Beginning and end of strings with ^ and $
      B.15.2.7. Quantifiers: * + {MIN,} {MIN,MAX} ?
      B.15.2.8. Making quantifiers match minimally with ?
    B.15.3. Capturing Matched Patterns
    B.15.4. Metasymbols
    B.15.5. Extending Regular-Expression Sequences
    B.15.6. Pattern Modifiers
  B.16. Scalar and List Context
  B.17. Subroutines and Modules
  B.18. Built-in Functions
Index
About the Author
Colophon
Special Upgrade Offer
Copyright

Content preview from Beginning Perl for Bioinformatics

Chapter 1. Biology and Computer Science

One of the most exciting things about being involved in computer programming and biology is that both fields are rich in new techniques and results.

Of course, biology is an old science, but many of the most interesting directions in biological research are based on recent techniques and ideas. The modern science of genetics, which has earned a prominent place in modern biology, is just about 100 years old, dating from the widespread acknowledgement of Mendel’s work. The elucidation of the structure of deoxyribonucleic acid (DNA) and the first protein structure are about 50 years old, and the polymerase chain reaction (PCR) technique of cloning DNA is almost 20 years old. The last decade saw the launching and completion of the Human Genome Project that revealed the totality of human genes and much more. Today, we’re in a golden age of biological research—a point in human history of great medical, scientific, and philosophical importance.

Computer science is relatively new. Algorithms have been around since ancient times (Euclid), and the interest in computing machinery is also antique (Pascal’s mechanical calculator, for instance, or Babbage’s steam-driven inventions of the 19th century). But programming was really born about 50 years ago, at the same time as construction of the first large-scale, programmable, digital, electronic computers (such as ENIAC). Programming has grown very rapidly to the present day. The Internet is about 20 years ...

Publisher Resources

ISBN: 0596000804
