CHAPTER 16
NOVEL COMBINATORIAL AND INFORMATION-THEORETIC ALIGNMENT-FREE DISTANCES BIOLOGICAL DATA MINING
16.1 INTRODUCTION
Sequence comparison plays a central role in many scientific areas [70] and, in particular, for computational molecular biology [41, 81]. In fact, biological molecules can be seen as linear sequences of discrete units similar to linguistic representations, despite their physical nature as a three-dimensional (3D) structure and the dynamic nature of molecular evolution.
The problem of defining good mathematical functions to measure similarity between biological sequences is fundamental for their analysis. Indeed, a high similarity among biological sequences, as measured by mathematical functions, usually gives a strong indication of functional relatedness and/or common ancestry [41]. However, there is not a bijective correspondence between sequence and structure or sequence and function because, in general, the converse of the previous statement is not true.
Classically, the notions of similarity and distance hinge on sequence alignment [41]. Algorithmically, these methods usually are implemented by using dynamic programming and target specific goals such as global and local alignments, with many of their variants (e.g., [40, 41, 60, 72). Because of their worst case superlinear running time in the input parameters, these methods are considered inadequate for the analysis of long ...