| General information | |||||||||||||||||||||||
| Accession Number | PCB00029 | ||||||||||||||||||||||
| Record Name | CATH95_Topology_5fold_Filtered; | ||||||||||||||||||||||
| Created | 12-DEC-2006 | ||||||||||||||||||||||
| Updated | 12-DEC-2006 | ||||||||||||||||||||||
| Description | Classification of protein domain sequences and structures into topology (T) groups, based on 5-fold cross-validation (CATH95 v. 3.0.0) | ||||||||||||||||||||||
| Data | |||||||||||||||||||||||
| Data Description | Protein sequences and structures from CATH (> 95% sequence identity) | ||||||||||||||||||||||
| Download | click here for the fasta file containing the sequences CATH95.fasta | ||||||||||||||||||||||
| Download | click here for the zipped file containing the structures CATH95.pdb.tar.gz | ||||||||||||||||||||||
| Subdivision into training and test groups | |||||||||||||||||||||||
| Subdivision Description | 47 topology (T) groups were subdivided into 5 classification tasks to give a total of 235 classification tasks. | ||||||||||||||||||||||
| Positive Set | Topology (T) groups, randomly subdivided into 5 equal groups | ||||||||||||||||||||||
| Negative Set | The rest of the database outside the topology (T) group was first randomly subdivided into 5 equal groups. Then 10% of the resulting two sets were randomly selected to give the final -train and -test sets. | ||||||||||||||||||||||
| Statistics | Number of tasks | 235 | |||||||||||||||||||||
| | Min | Max | Average | ||||||||||||||||||||
| Positive Train | 13 | 1045 | 529 | ||||||||||||||||||||
| Positive Test | 3 | 262 | 133 | ||||||||||||||||||||
| Negative Train | 804 | 908 | 856 | ||||||||||||||||||||
| Negative Test | 200 | 226 | 213 | ||||||||||||||||||||
| Full statistics | click here to download the full statistics file CATH95_T_H_kfold_10_0.1_filt_29.stats or click view to view the file in a WEB layout | ||||||||||||||||||||||
| Cast Matrix | click here to download the cast matrix CATH95_T_H_kfold_10_0.1_filt_29.cast | ||||||||||||||||||||||
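The subdivision described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to build this record: the function names (`split_folds`, `subsample`, `make_tasks`), the `labels` input, the seed, and the reading that 10% is drawn separately from the negative training and negative test sets are all hypothetical.

```python
import random

def split_folds(items, k, rng):
    """Shuffle items and deal them into k folds of near-equal size."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def subsample(items, fraction, rng):
    """Randomly keep the given fraction of items (at least one if non-empty)."""
    n = max(1, round(len(items) * fraction)) if items else 0
    return rng.sample(items, n)

def make_tasks(labels, group, k=5, neg_fraction=0.1, seed=0):
    """Build k train/test tasks for one topology group.

    labels: hypothetical dict mapping domain id -> topology group label.
    """
    rng = random.Random(seed)
    pos = sorted(d for d, g in labels.items() if g == group)
    neg = sorted(d for d, g in labels.items() if g != group)
    pos_folds = split_folds(pos, k, rng)
    neg_folds = split_folds(neg, k, rng)
    tasks = []
    for i in range(k):
        # Fold i is held out for testing; the other k-1 folds form the training set.
        pos_train = [d for j in range(k) if j != i for d in pos_folds[j]]
        neg_train = [d for j in range(k) if j != i for d in neg_folds[j]]
        tasks.append({
            "pos_train": pos_train,
            "pos_test": pos_folds[i],
            "neg_train": subsample(neg_train, neg_fraction, rng),
            "neg_test": subsample(neg_folds[i], neg_fraction, rng),
        })
    return tasks
```

Calling `make_tasks` once per topology group yields 5 tasks per group, so 47 groups give the 235 tasks listed in the statistics.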
| Distance Matrix | |||||||||||||||||||||||
| Blast | download matrix file CATH95_BLAST.dmx | ||||||||||||||||||||||
| Smith-Waterman | download matrix file CATH95_SW.dmx | ||||||||||||||||||||||
| Needleman-Wunsch | download matrix file CATH95_NW.dmx | ||||||||||||||||||||||
| Local Alignment Kernel | download matrix file CATH95_LA.dmx | ||||||||||||||||||||||
| Pride structure similarity | download matrix file CATH95_PRIDE.dmx | ||||||||||||||||||||||
| PSI-BLAST | download matrix file CATH95_PsiBLAST.dmx | ||||||||||||||||||||||
| Results | |||||||||||||||||||||||
| Summary | Average AUC values for the 235 classification tasks in this record (benchmark test) |
||||||||||||||||||||||
| Detailed view | |||||||||||||||||||||||
| Methods Used | |||||||||||||||||||||||
| [1] CATH Sequences | The sequences were taken from the CATH database v. 3.0.0 (Pearl, et al., 2005). The entries of the CATH95 (>95% identity) selection were downloaded from the ftp://ftp.biochem.ucl.ac.uk/pub/cathdata/v3.0.0/ site. The 1648 non-contiguous domains were discarded and 11373 were retained. The sequences were stored in concatenated FASTA format. Pearl, F., Todd, A., Sillitoe, I., Dibley, M., Redfern, O., Lewis, T., Bennett, C., Marsden, R., Grant, A., Lee, D., Akpor, A., Maibaum, M., Harrison, A., Dallman, T., Reeves, G., Diboun, I., Addou, S., Lise, S., Johnston, C., Sillero, A., Thornton, J. and Orengo, C. (2005) The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis, Nucleic Acids Res, 33, D247-251. |
||||||||||||||||||||||
| [2] CATH Structures | The 3D structures were taken from the CATH database v. 3.0.0 (Pearl, et al., 2005). The entries of the CATH95 (>95% identity) selection were downloaded from the http://cathwww.biochem.ucl.ac.uk/staticdata/v3_0_0/dompdb/ site. The 1648 non-contiguous domains were discarded and 11373 were retained. The structures were deposited as a compressed archive. Pearl, F., Todd, A., Sillitoe, I., Dibley, M., Redfern, O., Lewis, T., Bennett, C., Marsden, R., Grant, A., Lee, D., Akpor, A., Maibaum, M., Harrison, A., Dallman, T., Reeves, G., Diboun, I., Addou, S., Lise, S., Johnston, C., Sillero, A., Thornton, J. and Orengo, C. (2005) The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis, Nucleic Acids Res, 33, D247-251. |
||||||||||||||||||||||
| [3] BLAST distance matrix | An all against all comparison was carried out using BLAST (Altschul, et al., 1990) version 2.2.13, downloaded from http://www.ncbi.nlm.nih.gov/BLAST/download.shtml. The BLOSUM62 matrix was used with a gap opening penalty of 11 and a gap extension penalty of 1 (default parameters). The results were stored in a compressed, tab-delimited ASCII file. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool, J Mol Biol, 215, 403-410. |
||||||||||||||||||||||
| [4] Smith-Waterman | An all against all comparison was carried out using the Smith-Waterman algorithm (Smith and Waterman, 1981) as implemented in the water program of EMBOSS (Rice, et al., 2000). The program was downloaded from ftp://ftp.bioinformatics.org/pub/biobrew/. The BLOSUM62 matrix was used with a gap opening penalty of 10 and a gap extension penalty of 0.5 (default parameters). The results were stored in a compressed, tab-delimited ASCII file. Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences, J Mol Biol, 147, 195-197. Rice, P., Longden, I. and Bleasby, A. (2000) EMBOSS: the European Molecular Biology Open Software Suite, Trends Genet, 16, 276-277. |
||||||||||||||||||||||
| [5] Needleman-Wunsch | An all against all comparison was carried out using the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970) as implemented in the needle program of EMBOSS (Rice, et al., 2000). The program was downloaded from ftp://ftp.bioinformatics.org/pub/biobrew/. The BLOSUM62 matrix was used with a gap opening penalty of 10 and a gap extension penalty of 0.5 (default parameters). The results were stored in a compressed, tab-delimited ASCII file. Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins, J Mol Biol, 48, 443-453. Rice, P., Longden, I. and Bleasby, A. (2000) EMBOSS: the European Molecular Biology Open Software Suite, Trends Genet, 16, 276-277. |
||||||||||||||||||||||
| [6] Local Alignment Kernel | The Local Alignment Kernel program version 0.3 of Saigo and associates (Saigo, et al., 2004) was downloaded from http://cg.ensmp.fr/~vert/. The following run parameters were used: the default comparison matrix found in the parameters.h file, gap opening penalty = 11 (default), gap extension penalty = 1 (default), scaling parameter = 0.5. Saigo, H., Vert, J.P., Ueda, N. and Akutsu, T. (2004) Protein homology detection using string alignment kernels, Bioinformatics, 20, 1682-1689. |
||||||||||||||||||||||
| [7] PRIDE | An all against all comparison was carried out using the PRIDE algorithm (Gaspari, et al., 2005). The program was provided by Z. Gaspari. Gaspari, Z., Vlahovicek, K. and Pongor, S. (2005) Efficient recognition of folds in protein 3D structures by the improved PRIDE algorithm, Bioinformatics, 21, 3322-3323. |
||||||||||||||||||||||
| [8] Nearest neighbour classification | Nearest neighbour (1NN) classification is a technique whereby a query sequence is assigned to the a priori known class of the database entry that was found most similar to it in terms of a distance/similarity measure (for an introduction see Duda, et al., 2000). Duda, R.O., Hart, P.E. and Stork, D.G. (2000) Pattern Classification. John Wiley & Sons, New York. |
||||||||||||||||||||||
| [9] Performance Evaluation | The evaluation of classification performance was carried out by standard receiver operating characteristic (ROC) analysis (for an introduction see Duda, et al., 2000). This method is designed to test the ranking ability of a given classifier based on a real-valued ranking parameter. In the case of nearest neighbour classification, the ranking parameter was a similarity/distance parameter calculated between an object and the nearest member of the positive training set (outlier detection). Briefly, the analysis is carried out by plotting sensitivity vs 1-specificity at various threshold values; the resulting curve is then integrated to give an “area under curve” or AUC value. These values are determined for each classification experiment. For a perfect ranking, AUC=1.0; for a random ranking, AUC=0.5 (Egan, 1975). Duda, R.O., Hart, P.E. and Stork, D.G. (2000) Pattern Classification. John Wiley & Sons, New York. Egan, J.P. (1975) Signal Detection Theory and ROC Analysis. Academic Press, New York. |
||||||||||||||||||||||
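The nearest-neighbour ranking and AUC evaluation described in [8] and [9] can be sketched as below. This is an illustrative sketch, not the code used to produce the results in this record. The names `nearest_positive_score` and `roc_auc`, and the `dist` dictionary keyed by pairs of identifiers, are assumptions. Instead of explicitly integrating the ROC curve, the sketch uses the equivalent pairwise-comparison (Mann-Whitney) form of the AUC, counting ties as one half.

```python
def nearest_positive_score(query, pos_train, dist):
    """Score a query by its distance to the nearest positive training member.

    dist: hypothetical dict keyed by (id_a, id_b) holding pairwise distances.
    The distance is negated so that larger scores mean "more likely positive".
    """
    return -min(dist[(query, p)] for p in pos_train)

def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute one half, which matches trapezoidal integration of the
    ROC curve (sensitivity vs 1-specificity).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

For one classification task, the positive and negative test objects would each be scored with `nearest_positive_score`, and the two score lists passed to `roc_auc`. The quadratic loop is kept for readability; a rank-based version would scale better on large test sets.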