CHAPTER 1

STRING DATA STRUCTURES FOR COMPUTATIONAL MOLECULAR BIOLOGY

Christos Makris and Evangelos Theodoridis

1.1 INTRODUCTION

The topic of the chapter is string data structures with applications in the field of computational molecular biology. Let Σ be a finite alphabet consisting of a set of characters (or symbols). The cardinality of the alphabet denoted by |Σ| expresses the number of distinct characters in the alphabet. A string or word is an ordered list of zero or more characters drawn from the alphabet. A word or string w of length n is represented by , where and |w| denotes the length of w. The empty word is the empty sequence (of zero length) and is denoted by ε. A list of characters of w, appearing in consecutive positions, is called a substring of w, denoted by w[i···j], where i and j are the starting and ending positions, respectively. If the substring starts at position 1, then it is called a prefix, whereas if it ends at position n, then it is called a suffix of w. However, an ordered list of characters of w that are not necessarily consecutive is called a subsequence of w.

Strings and subsequences appear in a plethora of computational molecular biology problems because the basic types of DNA, RNA, and protein molecules can be represented as strings—pieces of DNA as ...

Get Algorithms in Computational Molecular Biology: Techniques, Approaches and Applications now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.