Algorithms in Computational Molecular Biology: Techniques, Approaches and Applications

by Mourad Elloumi, Albert Y. Zomaya

February 2011

Intermediate to advanced

1080 pages

33h 7m

English

Wiley

Cover
Half Title page
Title page
Copyright page
Dedication
Preface
Contributors
Series page
Part I: Strings Processing and Application to Biological Sequences
Chapter 1: String Data Structures for Computational Molecular Biology
1.1 Introduction
1.2 Main String Indexing Data Structures
1.3 Index Structures for Weighted Strings
1.4 Index Structures for Indeterminate Strings
1.5 String Data Structures in Memory Hierarchies
1.6 Conclusions
References
Chapter 2: Efficient Restricted-Case Algorithms for Problems in Computational Biology
2.1 The Need for Special Cases
2.2 Assessing Efficient Solvability Options for General Problems and Special Cases
2.3 String and Sequence Problems
2.4 Shortest Common Superstring
2.5 Longest Common Subsequence
2.6 Common Approximate Substring
2.7 Conclusion
References
Chapter 3: Finite Automata in Pattern Matching
3.1 Introduction
3.2 Direct Use of DFA in Stringology
3.3 NFA Simulation
3.4 Finite Automaton as Model of Computation
3.5 Finite Automata Composition
3.6 Summary
References
Chapter 4: New Developments in Processing of Degenerate Sequences
4.1 Introduction
4.2 Background
4.3 Basic Definitions
4.4 Repetitive Structures in Degenerate Strings
4.5 Conservative String Covering in Degenerate Strings
4.6 Conclusion
References
Chapter 5: Exact Search Algorithms for Biological Sequences
5.1 Introduction
5.2 Single Pattern Matching Algorithms
5.3 Algorithms for Multiple Patterns
5.4 Application of Exact Set Pattern Matching for Read Mapping and Comparison with Mapping Tools
5.5 Conclusions
References
Chapter 6: Algorithmic Aspects of Arc-Annotated Sequences
6.1 Introduction
6.2 Preliminaries
6.3 Longest Arc-Preserving Common Subsequence
6.4 Arc-Preserving Subsequence
6.5 Maximum Arc-Preserving Common Subsequence
6.6 Edit Distance
References
Chapter 7: Algorithmic Issues in DNA Barcoding Problems
7.1 Introduction
7.2 Test Set Problems: A General Framework for Several Barcoding Problems
7.3 A Synopsis of Biological Applications of Barcoding
7.4 Survey of Algorithmic Techniques on Barcoding
7.5 Information Content Approach
7.6 Set-Covering Approach
7.7 Experimental Results and Software Availability
7.8 Concluding Remarks
Acknowledgments
References
Chapter 8: Recent Advances in Weighted DNA Sequences
8.1 Introduction
8.2 Preliminaries
8.3 Indexing
8.4 Pattern Matching
8.5 Approximate Pattern Matching
8.6 Repetitions, Covers, and Tandem Repeats
8.7 Motif Discovery
8.8 Conclusions
References
Chapter 9: DNA Computing for Subgraph Isomorphism Problem and Related Problems
9.1 Introduction
9.2 Definitions of Subgraph Isomorphism Problem and Related Problems
9.3 DNA Computing Models
9.4 The Sticker-Based Solution Space
9.5 Algorithms for Solving Problems
9.6 Experimental Data
9.7 Conclusion
References
Part II: Analysis of Biological Sequences
Chapter 10: Graphs in Bioinformatics
10.1 Graph Theory—Origin
10.2 Graphs and the Biological World
10.3 Conclusion
References
Chapter 11: A Flexible Data Store for Managing Bioinformatics Data
11.1 Introduction
11.2 Data Model and System Overview
11.3 Replication and Load Balancing
11.4 Evaluation
11.5 Related Work
11.6 Summary
References
Chapter 12: Algorithms for the Alignment of Biological Sequences
12.1 Introduction
12.2 Alignment Algorithms
12.3 Score Functions
12.4 Benchmarks
12.5 Conclusion
Acknowledgments
References
Chapter 13: Algorithms for Local Structural Alignment and Structural Motif Identification
13.1 Introduction
13.2 Problem Definition of Local Structural Alignment
13.3 Variable-Length Alignment Fragment Pair (VLAFP) Algorithm
13.4 Structural Alignment Based on Center of Gravity: SACG
13.5 Searching Structural Motifs
13.6 Using SACG Algorithm for Classification of New Protein Structures
13.7 Experimental Results
13.8 Accuracy Results
13.9 Conclusion
Acknowledgments
References
Chapter 14: Evolution of the Clustal Family of Multiple Sequence Alignment Programs
14.1 Introduction
14.2 Clustal-ClustalV
14.3 ClustalW
14.4 ClustalX
14.5 ClustalW and ClustalX 2.0
14.6 DbClustal
14.7 Perspectives
References
Chapter 15: Filters and Seeds Approaches for Fast Homology Searches in Large Datasets
15.1 Introduction
15.2 Methods Framework
15.3 Lossless Filters
15.4 Lossy Seed-Based Filters
15.5 Conclusion
15.6 Acknowledgments
References
Chapter 16: Novel Combinatorial and Information-Theoretic Alignment-Free Distances for Biological Data Mining
16.1 Introduction
16.2 Information-Theoretic Alignment-Free Methods
16.3 Combinatorial Alignment-Free Methods
16.4 Alignment-Free Compositional Methods
16.5 Alignment-Free Exact Word Matches Methods
16.6 Domains of Biological Application
16.7 Datasets and Software for Experimental Algorithmics
16.8 Conclusions
References
Chapter 17: In Silico Methods for the Analysis of Metabolites and Drug Molecules
17.1 Introduction
17.2 Molecular Descriptors
17.3 Databases
17.4 Methods and Data Analysis Algorithms
17.5 Conclusions
Acknowledgments
References
Part III: Motif Finding and Structure Prediction
Chapter 18: Motif Finding Algorithms in Biological Sequences
18.1 Introduction
18.2 Preliminaries
18.3 The Planted (l, d)-Motif Problem
18.4 The Extended (l, d)-Motif Problem
18.5 The Edited Motif Problem
18.6 The Simple Motif Problem
18.7 Conclusion
References
Chapter 19: Computational Characterization of Regulatory Regions
19.1 The Genome Regulatory Landscape
19.2 Qualitative Models of Regulatory Signals
19.3 Quantitative Models of Regulatory Signals
19.4 Detection of Dependencies in Sequences
19.5 Repositories of Regulatory Information
19.6 Using Predictive Models to Annotate Sequences
19.7 Comparative Genomics Characterization
19.8 Sequence Comparisons
19.9 Combining Motifs and Alignments
19.10 Experimental Validation
19.11 Summary
References
Chapter 20: Algorithmic Issues in the Analysis of ChIP-Seq Data
20.1 Introduction
20.2 Mapping Sequences on the Genome
20.3 Identifying Significantly Enriched Regions
20.4 Deriving Actual Transcription Factor Binding Sites
20.5 Conclusions
References
Chapter 21: Approaches and Methods for Operon Prediction Based on Machine Learning Techniques
21.1 Introduction
21.2 Datasets, Features, and Preprocesses for Operon Prediction
21.3 Machine Learning Prediction Methods for Operon Prediction
21.4 Conclusions
21.5 Acknowledgments
References
Chapter 22: Protein Function Prediction with Data-Mining Techniques
22.1 Introduction
22.2 Protein Annotation Based on Sequence
22.3 Protein Annotation Based on Protein Structure
22.4 Protein Function Prediction Based on Gene Expression Data
22.5 Protein Function Prediction Based on Protein Interactome Map
22.6 Protein Function Prediction Based on Data Integration
22.7 Conclusions and Perspectives
References
Chapter 23: Protein Domain Boundary Prediction
23.1 Introduction
23.2 Profiling Technique
23.3 Results
23.4 Discussion
23.5 Conclusions
References
Chapter 24: An Introduction to RNA Structure and Pseudoknot Prediction
24.1 Introduction
24.2 RNA Secondary Structure Prediction
24.3 RNA Pseudoknots
24.4 Conclusions
References
Part IV: Phylogeny Reconstruction
Chapter 25: Phylogenetic Search Algorithms for Maximum Likelihood
25.1 Introduction
25.2 Computing the Likelihood
25.3 Accelerating the PLF by Algorithmic Means
25.4 Alignment Shapes
25.5 General Search Heuristics
25.6 Computing the Robinson Foulds Distance
25.7 Convergence Criteria
25.8 Future Directions
References
Chapter 26: Heuristic Methods for Phylogenetic Reconstruction with Maximum Parsimony
26.1 Introduction
26.2 Definitions and Formal Background
26.3 Methods
26.4 Conclusion
References
Chapter 27: Maximum Entropy Method for Composition Vector Method
27.1 Introduction
27.2 Models and Entropy Optimization
27.3 Application and Discussion
27.4 Concluding Remarks
References
Part V: Microarray Data Analysis
Chapter 28: Microarray Gene Expression Data Analysis
28.1 Introduction
28.2 DNA Microarray Technology and Experiment
28.3 Image Analysis and Expression Data Extraction
28.4 Data Processing
28.5 Missing Value Imputation
28.6 Temporal Gene Expression Profile Analysis
28.7 Cyclic Gene Expression Profiles Detection
28.8 Summary
Acknowledgments
References
Chapter 29: Biclustering of Microarray Data
29.1 Introduction
29.2 Types of Biclusters
29.3 Groups of Biclusters
29.4 Evaluation Functions
29.5 Systematic and Stochastic Biclustering Algorithms
29.6 Biological Validation
29.7 Conclusion
References
Chapter 30: Computational Models for Condition-Specific Gene and Pathway Inference
30.1 Introduction
30.2 Condition-Specific Pathway Identification
30.3 Disease Gene Prioritization and Genetic Pathway Detection
30.4 Module Networks
30.5 Summary
Acknowledgements
References
Chapter 31: Heterogeneity of Differential Expression in Cancer Studies: Algorithms and Methods
31.1 Introduction
31.2 Notations
31.3 Differential Mean of Expression
31.4 Differential Variability of Expression
31.5 Differential Expression in Compendium of Tumors
31.6 Differential Expression by Chromosomal Aberrations: The Local Properties
31.7 Differential Expression in Gene Interactome
31.8 Differential Coexpression: Global Multidimensional Interactome
Acknowledgments
References
Part VI: Analysis of Genomes
Chapter 32: Comparative Genomics: Algorithms and Applications
32.1 Introduction
32.2 Notations
32.3 Ortholog Assignment
32.4 Gene Cluster and Synteny Detection
32.5 Conclusions
References
Chapter 33: Advances in Genome Rearrangement Algorithms
33.1 Introduction
33.2 Preliminaries
33.3 Sorting by Reversals
33.4 Sorting by Transpositions
33.5 Other Operations
33.6 Sorting by More Than One Operation
33.7 Future Research Directions
33.8 Notes on Software
References
Chapter 34: Computing Genomic Distances: An Algorithmic Viewpoint
34.1 Introduction
34.2 Interval-Based Criteria
34.3 Character-Based Criteria
34.4 Conclusion
References
Chapter 35: Wavelet Algorithms for DNA Analysis
35.1 Introduction
35.2 DNA Representation
35.3 Statistical Correlations in DNA
35.4 Wavelet Analysis
35.5 Haar Wavelet Coefficients and Statistical Parameters
35.6 Algorithm of the Short Haar Discrete Wavelet Transform
35.7 Clusters of Wavelet Coefficients
35.8 Conclusion
References
Chapter 36: Haplotype Inference Models and Algorithms
36.1 Introduction
36.2 Problem Statement and Notations
36.3 Combinatorial Methods
36.4 Statistical Methods
36.5 Pedigree Methods
36.6 Evaluation
36.7 Discussion
References
Part VII: Analysis of Biological Networks
Chapter 37: Untangling Biological Networks Using Bioinformatics
37.1 Introduction
37.2 Types of Biological Networks
37.3 Network Dynamic, Evolution and Disease
37.4 Future Challenges and Scope
Acknowledgments
References
Chapter 38: Probabilistic Approaches for Investigating Biological Networks
38.1 Probabilistic Models for Biological Networks
38.2 Interpretation and Quantitative Analysis of Probabilistic Models
38.3 Conclusion
Acknowledgments
References
Chapter 39: Modeling and Analysis of Biological Networks with Model Checking
39.1 Introduction
39.2 Preliminaries
39.3 Analyzing Genetic Networks with Model Checking
39.4 Probabilistic Model Checking for Biological Systems
References
Appendix
Chapter 40: Reverse Engineering of Molecular Networks from a Common Combinatorial Approach
40.1 Introduction
40.2 Reverse-Engineering of Biological Networks
40.3 Classical Combinatorial Algorithms: A Case Study
40.4 Concluding Remarks
Acknowledgments
References
Chapter 41: Unsupervised Learning for Gene Regulation Network Inference from Expression Data: A Review
41.1 Introduction
41.2 Gene Networks: Definition and Properties
41.3 Gene Expression: Data and Analysis
41.4 Network Inference as an Unsupervised Learning Problem
41.5 Correlation-Based Methods
41.6 Probabilistic Graphical Models
41.7 Constraint-Based Data Mining
41.8 Validation
41.9 Conclusion and Perspectives
References
Chapter 42: Approaches to Construction and Analysis of MicroRNA-Mediated Networks
42.1 Introduction
42.2 Fundamental Component Interaction Research: Predicting miRNA Genes, Regulators, and Targets
42.3 Identifying miRNA-Mediated Networks
42.4 Global and Local Architecture Analysis in miRNA-Containing Networks
42.5 Conclusion
References
Index

Content preview from Algorithms in Computational Molecular Biology: Techniques, Approaches and Applications

CHAPTER 35

WAVELET ALGORITHMS FOR DNA ANALYSIS

Carlo Cattani

35.1 INTRODUCTION

One of the main tasks of the genome project is to understand completely the underlying biological function from a possible interpretation of the given sequence of nucleotides, that is, from the distribution of the four symbols A, C, G, T along the sequence [21, 24, 25]. The main hypotheses of this project are as follows:

1. The functional activity of the organism is a result of the distribution of nucleotides.

2. The distribution of nucleotides should follow some hidden rules.

3. It should be possible to discover these rules by singling out some regular features like periodicity, typical patterns, trends, sequence evolution, and so on.

In recent years, the analysis of DNA sequences has focused mainly on the existence of hidden laws, periodicities, and autocorrelation [14, 17, 24, 34]. The main task is to find some kind of mathematical rule or meaningful statistic, if any exists, in the nucleotide distribution. This would help us to characterize each DNA sequence and so construct a possible classification. From a mathematical point of view, the DNA sequence is a symbolic sequence (of nucleotides) with some empty spaces (noncoding regions). To get some numerical information from this sequence, it must be transformed into a digital sequence. It follows that the symbolic sequence is transformed into a very large time series (from one half of a million digits for the primitive organisms such as fungus, eukaryotes, ...
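
The step from a symbolic sequence to a digital one can be illustrated with a minimal sketch. It assumes a binary indicator encoding (one 0/1 sequence per nucleotide), which is one common choice and not necessarily the representation the chapter goes on to use. The function names `indicator_sequences` and `autocorrelation` are hypothetical, introduced only for this illustration.

```python
from typing import Dict, List

NUCLEOTIDES = "ACGT"


def indicator_sequences(dna: str) -> Dict[str, List[int]]:
    """Map a DNA string to four 0/1 sequences, one per nucleotide.

    Assumption: binary indicator encoding. Any symbol other than
    A, C, G, T (for example a gap) is 0 in all four sequences.
    """
    dna = dna.upper()
    return {n: [1 if ch == n else 0 for ch in dna] for n in NUCLEOTIDES}


def autocorrelation(x: List[int], lag: int) -> float:
    """Mean of x[i] * x[i + lag] over all positions where both exist."""
    if lag < 0 or lag >= len(x):
        raise ValueError("lag must satisfy 0 <= lag < len(x)")
    n = len(x) - lag
    return sum(x[i] * x[i + lag] for i in range(n)) / n


# A toy sequence with period 4: A recurs every fourth position.
seq = "ACGTACGTACGT"
ind = indicator_sequences(seq)
for lag in (1, 2, 3, 4):
    print(lag, autocorrelation(ind["A"], lag))
```

On this toy input the autocorrelation of the A indicator is nonzero only at lags that are multiples of 4, which is the kind of periodicity signal the text refers to. Real genomic sequences are far longer and noisier, so this is an illustration of the idea rather than an analysis method.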

Publisher Resources

ISBN: 9781118101988
