CHAPTER 22
PROTEIN FUNCTION PREDICTION WITH DATA-MINING TECHNIQUES
22.1 INTRODUCTION
One of the most challenging problems in the postgenomic era is annotating uncharacterized proteins with biological functions. Over the past decades, a huge number of protein sequences have accumulated in public databases. However, the pace at which proteins are annotated lags far behind the pace at which new sequences accumulate.
Currently, about 25% of genes remain uncharacterized in the well-studied Saccharomyces cerevisiae, whereas only about 20% of genes are not annotated in Homo sapiens. Determining the functions of all proteins experimentally would be time-consuming and expensive. Computational biology based on data-mining techniques provides an alternative way to predict the functions of proteins from their sequences, structures, gene expression profiles, and so on. For instance, a straightforward approach is to use PSI-BLAST [1] or FASTA [63] to find homologous proteins and transfer their annotations to the target protein, on the assumption that proteins with similar sequences carry out similar functions. However, alignment-based methods may not work well when the sequence similarity between the known proteins and the query protein is very low (e.g., below 30%). In such cases, alignment-free methods provide an alternative solution by using data-mining techniques [37, 47, 90, 98], in which the alignment-free methods can detect ...