Chapter 6
Rough Fuzzy c-Medoids and Amino Acid Sequence Analysis
6.1 Introduction
Recent advances in high-throughput technologies for biological research, and their wide use, are producing enormous volumes of biological data. Data mining techniques and machine learning methods provide useful tools for analyzing these data. The successful analysis of biological sequences relies on the efficient coding of the biological information contained in sequences or subsequences. For example, to recognize functional sites within a biological sequence, the subsequences obtained by moving a fixed-length sliding window along the sequence are generally analyzed. The problem with using most pattern recognition algorithms to analyze these subsequences is that they cannot handle nonnumerical features such as the biochemical codes of amino acids. Investigating a proper encoding process before modeling the amino acids is therefore critical.
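The sliding-window step mentioned above can be illustrated with a minimal Python sketch. The function name `sliding_windows`, the example peptide, and the window width of 4 are assumptions chosen for illustration only; they are not taken from the text.

```python
def sliding_windows(sequence, width):
    """Return every contiguous subsequence of length `width`, moving one residue at a time."""
    if width <= 0:
        raise ValueError("width must be positive")
    # A sequence shorter than the window yields no subsequences.
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]


# Hypothetical peptide written with one-letter amino acid codes.
peptide = "MKVLAAGI"
windows = sliding_windows(peptide, 4)
print(windows)
```

Each window is still a string of symbols, which is why an encoding step is needed before most numerical pattern recognition algorithms can be applied.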
The most commonly used method for coding a subsequence is distributed encoding, which encodes each of the 20 amino acids using a 20-bit binary vector [1]. However, this method expands the input space unnecessarily, and it may not encode the biological content of sequences efficiently. On the other hand, various mutation matrices define different distances for different amino acid pairs, and these distances have been validated [2–4]. However, they cannot be used directly to encode an amino acid as a unique numerical value.
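To make the growth of the input space concrete, the following is a minimal sketch of distributed encoding in Python. It assumes a particular ordering of the 20 one-letter amino acid codes (alphabetical); the ordering and the helper names `encode_residue` and `encode_subsequence` are illustrative choices, not part of the method as cited.

```python
# The 20 standard amino acids as one-letter codes; the ordering is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_residue(residue):
    """Return a 20-bit binary vector with a single 1 marking the given amino acid."""
    vector = [0] * len(AMINO_ACIDS)
    # str.index raises ValueError for symbols outside the 20-letter alphabet.
    vector[AMINO_ACIDS.index(residue)] = 1
    return vector


def encode_subsequence(subsequence):
    """Concatenate the 20-bit vectors of all residues in the subsequence."""
    encoded = []
    for residue in subsequence:
        encoded.extend(encode_residue(residue))
    return encoded


features = encode_subsequence("ACD")  # hypothetical 3-residue window
print(len(features))
```

Under this scheme a subsequence of n residues becomes a vector of 20n bits, of which only n are nonzero. Every pair of distinct amino acids also lies at the same Hamming distance (2), so the encoding carries no information about biochemical similarity.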
In this background, Yang and Thomson ...