18
THE CATH DOMAIN STRUCTURE DATABASE
INTRODUCTION
Protein sequences change during evolution through both mutations of their residues and the insertion and deletion of residues. These changes give rise to families of related proteins. The earliest protein family resources were established in the 1970s through the pioneering work of Dayhoff, and many other sequence databases have been established since then. These resources are derived solely from sequence data, and relationships are often detected using alignment methods based on powerful dynamic programming algorithms adapted from computer science. Such methods handle the residue insertions and deletions that occur between distant evolutionary relatives very efficiently.
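As a minimal illustration of how dynamic programming accommodates insertions and deletions, the sketch below computes a global (Needleman-Wunsch style) alignment score between two sequences. The function name and the match, mismatch, and gap values are arbitrary assumptions chosen for this example; practical tools typically use residue substitution matrices and more elaborate gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of sequences a and b.

    Uses a linear gap penalty; the scoring values are illustrative
    assumptions, not parameters taken from any particular resource.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against an empty sequence costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            # Diagonal move: pair residue a[i-1] with residue b[j-1].
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diag = score[i - 1][j - 1] + pair
            # Vertical or horizontal move: a residue is paired with a gap,
            # which models an insertion or deletion.
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(diag, up, left)

    return score[-1][-1]


if __name__ == "__main__":
    # The second sequence lacks one residue relative to the first.
    print(global_alignment_score("ACDE", "ADE"))
```

Each cell of the table depends only on its three neighbors, so the optimal score over all possible placements of gaps is found in time proportional to the product of the two sequence lengths.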
Structural data have always been sparser than sequence data because of the technical challenges of structure determination. The sequence and structure resources currently differ in size by about three orders of magnitude: while the Protein Data Bank (PDB) contains about 42,500 structural entries, the sequence databank at the NCBI (GenBank) contains over 60 million entries.
Although the first crystal structures were solved in the early 1970s, it was not until the mid-1990s that structural classifications began to emerge, primarily with the SCOP (Murzin et al., 1995; Andreeva et al., 2004), DALI (Holm and Sander, 1996), and CATH (Orengo et al., 1997; Greene et al., 2007) databases and data ...