CHAPTER 1

STRING DATA STRUCTURES FOR COMPUTATIONAL MOLECULAR BIOLOGY

1.1 INTRODUCTION

The topic of the chapter is string data structures with applications in the field of computational molecular biology. Let Σ be a finite alphabet consisting of a set of characters (or symbols). The cardinality of the alphabet denoted by |Σ| expresses the number of distinct characters in the alphabet. A string or word is an ordered list of zero or more characters drawn from the alphabet. A word or string w of length n is represented by , where and |w| denotes the length of w. The empty word is the empty sequence (of zero length) and is denoted by ε. A list of characters of w, appearing in consecutive positions, is called a substring of w, denoted by w[i···j], where i and j are the starting and ending positions, respectively. If the substring starts at position 1, then it is called a prefix, whereas if it ends at position n, then it is called a suffix of w. However, an ordered list of characters of w that are not necessarily consecutive is called a subsequence of w.

Strings and subsequences appear in a plethora of computational molecular biology problems because the basic types of DNA, RNA, and protein molecules can be represented as strings—pieces of DNA as ...