Python for Bioinformatics

Author: Jason Kinser
ISBN: 9781449613075

June 2008

Beginner to intermediate

417 pages

10h 41m

English

Jones & Bartlett Learning

Cover
Title
Copyright
Dedication
Preface
Brief Contents
Contents
Chapter 1 Introduction
1.1 The Purpose of This Book; 1.2 Use of Third-Party Software; 1.3 Required Background of Readers; 1.4 Object-Oriented Programming; 1.5 Presentation Convention; 1.6 Conversion from C/C++ to Python; 1.6.1 Similarities; 1.6.2 Fundamental Python Commands that Differ from C/C++; 1.7 The Environment; 1.8 Biopython; Bibliography
Chapter 2 NumPy and SciPy
2.1 Introduction to NumPy and SciPy; 2.2 Basic Array Manipulations; 2.3 Basic Math; 2.4 More on Multiplication; 2.5 More Math; 2.5.1 Equals or Copy; 2.5.2 Comparisons; 2.5.3 More on Slicing; 2.5.4 Sorting and Shaping; 2.5.5 Random Numbers; 2.5.6 Statistical Methods; 2.6 Thinking About Problems; 2.7 Array Conversions; 2.8 SciPy; 2.9 Summary; Bibliography; Problems
Chapter 3 Image Manipulation
3.1 The Image Module; 3.2 Colors and Conversions; 3.3 Digital Image Formats; 3.4 Simple Image Manipulations; 3.5 Conversions to and from Arrays; 3.6 Summary; Bibliography; Problems

Chapter 4 The Akando and Dancer Modules
4.1 The Akando Module; 4.1.1 Plotting Routines; 4.1.2 Algebraic and Geometric Functions; 4.1.3 Correlation; 4.1.4 Image Conversions; 4.2 The Dancer Module; 4.3 Summary; Problems
Chapter 5 Statistics
5.1 Simple Statistics; 5.2 Distributions; 5.3 Normalization; 5.4 Multivariate Statistics; 5.5 Probabilities; 5.6 Odds; 5.7 Decisions from Distributions; 5.8 Summary; Problems
Chapter 6 Parsing DNA Data Files
6.1 FASTA Files; 6.2 Genbank Files; 6.2.1 File Overview; 6.2.2 Parsing the DNA; 6.2.3 Gene and Protein Information; 6.2.4 Gene Locations; 6.2.5 Normal and Complement; 6.2.6 Splices; 6.2.7 Extracting All Gene Locations; 6.2.8 Coding DNA; 6.2.9 Proteins; 6.2.10 Extracting Translations; 6.3 ASN.1 File Format; 6.4 Summary; Bibliography; Problems
Chapter 7 Sequence Alignment
7.1 Alphabets; 7.2 Matching Sequences; 7.2.1 Perfect Matches; 7.2.2 Insertions and Deletions; 7.2.3 Rearrangements; 7.2.4 Global Versus Local Alignments; 7.2.5 Sequence Length; 7.3 Simple Alignments; 7.3.1 Direct Alignment; 7.3.2 Statistical Alignment; 7.3.3 Brute Force Alignment; 7.4 Summary; Bibliography; Problems
Chapter 8 Dynamic Programming
8.1 The Problem with the Brute Force Approach; 8.2 The Dynamic Programming Algorithm; 8.2.1 The Scoring Matrix; 8.2.2 The Arrow Matrix; 8.2.3 Extracting the Aligned Sequences; 8.3 Efficient Programming; 8.3.1 Flowing along the Diagonals; 8.3.2 Slicing Matrices; 8.3.3 Extracting Diagonal Element Locations; 8.3.4 Extracting Values from the Substitution Matrix; 8.3.5 Computing the Scoring Matrix Values for a Single Diagonal; 8.3.6 An Efficient Computation of the Scoring Matrix; 8.4 Global Versus Local Alignments; 8.5 Gap Penalties; 8.6 Does Dynamic Programming Find the Best Alignments?; 8.7 Summary; Problems
Chapter 9 Tandem Repeats
9.1 Tandem Repeats; 9.2 Hauth’s Solution; 9.2.1 Foundation; 9.2.2 Multiple Words; 9.2.3 Tandem Repeats; 9.3 Summary; Bibliography; Problems
Chapter 10 Hidden Markov Models
10.1 The Emission HMM; 10.2 The Transition HMM; 10.3 The Recurrent HMM; 10.4 Constructing a Transition HMM; 10.5 Considerations; 10.5.1 Assuming Data; 10.5.2 Spurious Strings; 10.5.3 Recurrent Probabilities; 10.6 Summary; Problems
Chapter 11 Genetic Algorithms
11.1 Simulated Annealing; 11.2 The Genetic Algorithm; 11.2.1 Energy Surfaces; 11.2.2 The Genetic Algorithm Approach; 11.2.3 Checking the Solution; 11.3 Nonnumerical Genetic Algorithms; 11.3.1 Notes on Copying; 11.3.2 Creating Random Arrangements; 11.3.3 The Genetic Algorithm; 11.4 Summary; Problems
Chapter 12 Multiple Sequence Alignment
12.1 The Greedy Approach; 12.1.1 Sequence Comparison; 12.1.2 Assembly; 12.2 Nongreedy Approach; 12.2.1 Creating Genes; 12.2.2 Steps in the Genetic Algorithm; 12.2.3 The Test Run; 12.2.4 Improvements; 12.3 Summary; Problems
Chapter 13 Gapped Alignments
13.1 Theory of Gapped Alignments; 13.2 Chopping the Data; 13.3 Pairwise Alignments; 13.4 Building the Assembly; 13.4.1 Creating New Contigs; 13.4.2 Adding to a Contig; 13.4.3 Joining Contigs; 13.4.4 Performing the Assembly; 13.5 Summary; Bibliography; Problems
Chapter 14 Trees
14.1 Basic Tree Theory; 14.2 Python and Trees; 14.3 An Example Using UPGMA; 14.4 Examples of Trees; 14.4.1 Sorting Trees; 14.4.2 Dictionary Trees; 14.4.3 Percolation Trees; 14.4.4 Suffix Trees; 14.5 Decision Trees and Random Forests; 14.6 Summary; Problems
Chapter 15 Text Mining
15.1 An Introduction to Text Mining; 15.2 Collecting Bioinformatic Textual Data; 15.3 Creating Dictionaries; 15.4 Methods of Finding Root Words; 15.4.1 Porter Stemming; 15.4.2 Suffix Trees; 15.4.3 Combining Simplified Porter Stemming with Slicing; 15.5 Document Analysis; 15.5.1 Text Mining Ten Documents; 15.5.2 Word Frequency; 15.5.3 Indicative Words; 15.5.4 Document Classification; 15.6 Summary; Bibliography; Problems
Chapter 16 Measuring Complexity
16.1 Linguistic Complexity; 16.2 Suffix Trees; 16.3 Superstrings; 16.4 Summary; Bibliography; Problems
Chapter 17 Clustering
17.1 The Purpose of Clustering; 17.2 k-Means Clustering; 17.3 Solving More Difficult Problems; 17.3.1 Preprocessing Data; 17.3.2 Modifications of k-Means; 17.4 Dynamic k-Means; 17.5 Comments on k-Means; 17.6 Summary; Bibliography; Problems
Chapter 18 Self-Organizing Maps
18.1 SOM Theory; 18.2 An SOM Example; 18.2.1 Reading an Image; 18.2.2 Initializing the SOM; 18.2.3 The Best Matching Unit (BMU); 18.2.4 Updating the SOM; 18.2.5 SOM Iterations; 18.2.6 Interpreting the SOM; 18.3 Summary; Bibliography; Problems
Chapter 19 Principal Component Analysis
19.1 The Purpose of PCA; 19.2 Eigenvectors; 19.3 The PCA Process; 19.3.1 Case 1: More Dimensions than Vectors; 19.3.2 Case 2: Linear Combinations in the Data; 19.3.3 Case 3: Imperfect Dimensionality Reductions; 19.3.4 Coordinate Selection; 19.4 Using SVD to Compute PCA; 19.5 Describing Systems with Eigenvectors; 19.6 Eigenimages; 19.7 Summary; Bibliography; Problems
Chapter 20 Species Identification
20.1 Data Collection; 20.2 The First Clustering; 20.3 Using Principal Component Analysis; 20.4 The Second Clustering; 20.5 Using a Self-Organizing Map; 20.6 Summary; Bibliography; Problems
Chapter 21 Fourier Transforms
21.1 Fourier Theory; 21.2 Digital Fourier Transform; 21.2.1 DFT Theory; 21.2.2 Example with a Simple Sawtooth Signal; 21.2.3 Features of the DFT; 21.2.4 Power Spectrum; 21.3 Fast Fourier Transform; 21.3.1 Duplicate Computations; 21.3.2 The FFT Method; 21.3.3 FFTs in SciPy; 21.3.4 The Swap Function; 21.4 Frequency Analysis; 21.4.1 Simple Signals; 21.4.2 DNA Coding Regions; 21.5 Summary; Bibliography; Problems
Chapter 22 Correlations
22.1 Correlation Theory; 22.2 Random Signal Correlation; 22.3 Structured Signal Correlation; 22.4 Correlation of DNA Strings; 22.5 Higher Dimensions; 22.5.1 Two-Dimensional FFTs in SciPy; 22.5.2 Image Frequencies; 22.6 The Onset of Image Processing; 22.7 Two-Dimensional Correlations; Summary; Bibliography; Problems
Chapter 23 Numerical Sequence Alignment
23.1 Alternate Encodings; 23.1.1 Hydrophobicity; 23.1.2 GC Content; 23.1.3 Numerical Methods; 23.2 Numerical Alignments; 23.3 Measuring the Hurst Exponent; 23.4 Chaos Representation; 23.4.1 Representing the Data; 23.4.2 A Simpler Method; 23.4.3 Comparing Chaos Images of Different Species; 23.4.4 Organizing the Data; 23.5 Summary; Bibliography; Problems
Chapter 24 Gene Expression Array Files
24.1 Raw Data; 24.1.1 Reading Raw Data in Python; 24.1.2 Dealing with 16-Bit Data; 24.2 GEL Files; 24.2.1 TIFF Headers; 24.2.2 The Image File Directory; 24.2.3 Reading the Data; 24.3 Summary; Bibliography; Problems
Chapter 25 Spot Finding and Measurement
25.1 Spot Finding; 25.1.1 Intensity Variations; 25.1.2 Block Location; 25.1.3 The Coarse Grid; 25.1.4 Fine-Tuning the Spot Locations; 25.2 Spot Measurements; 25.3 Summary; Bibliography; Problems
Chapter 26 Spreadsheet Arrays and Data Displays
26.1 Reading Spreadsheets; 26.1.1 The Platform File; 26.1.2 The Z-Ratio File; 26.1.3 Reading Two Channel Files; 26.2 Displaying the Data; 26.2.1 The Heat Map; 26.2.2 The R Versus G Graph; 26.2.3 The R/G Versus I Graph; 26.2.4 M Versus A Graph; 26.3 Summary; Bibliography; Problems
Chapter 27 Applications with Expression Arrays
27.1 LOESS Normalization; 27.2 Expressed Genes; 27.3 Multiple Slides; 27.3.1 Normalization; 27.3.2 Extracting Outliers; 27.4 Summary; Bibliography; Problems
Index

Content preview from Python for Bioinformatics

6 Parsing DNA Data Files

Several institutes maintain large databases of DNA information. In the United States, a large repository is Genbank, which is sponsored by the National Institutes of Health (http://www.ncbi.nlm.nih.gov/Genbank/index.html). This chapter develops programs that read files stored in three of the most popular formats: FASTA, Genbank, and ASN.1.

6.1 FASTA Files

The FASTA format is extremely simple, but it contains very little information aside from the sequence. A typical FASTA file is shown in Figure 6-1.

The first line contains a small header that may vary in content. In this case, it gives the accession number, the species name, and the chromosome number. ...
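The preview stops before the book's own parsing code, so the following is only a minimal sketch of reading one FASTA record, not the author's implementation. It assumes the common FASTA convention that the header line starts with `>` and that the sequence follows on one or more lines. The function name `read_fasta` and the sample header are hypothetical and are used for illustration only.

```python
def read_fasta(lines):
    """Return (header, sequence) for the first record in an iterable of lines.

    Assumes the header line starts with '>' and that every following
    non-empty line, up to the next header, is part of the sequence.
    """
    header = None
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                break  # a second record begins here; stop after the first
            header = line[1:]  # drop the leading '>'
        elif header is not None:
            parts.append(line)
    return header, "".join(parts)


# Hypothetical data: a made-up header and a short made-up sequence.
example = [
    ">XX000000.0 Example species chromosome 1",
    "ACGTAC",
    "GGTTAA",
]
header, dna = read_fasta(example)
print(header)
print(dna)
```

To read from disk instead, an open file object can be passed in place of the list, for example `with open(path) as fh: header, dna = read_fasta(fh)`, where `path` is whatever FASTA file is at hand.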

Publisher Resources

ISBN: 9780763751869
