
Protein Classification Benchmark Collection

General information
Accession Number PCB00024
Record Name SCOP95_Class_Fold_Filtered
Created 12-DEC-2006
Updated 12-DEC-2006
Description Classification of protein domain sequences and structures into structural classes, based on folds (SCOP95 v.1.69)
Data
Data Description Protein sequences and structures from SCOP (< 95% sequence identity)
Download click here for the fasta file containing the sequences SCOP95.fasta
Download click here for the compressed file containing the structures SCOP95.pdb.tar.gz
Subdivision into training and test groups
Subdivision Description A fold was included as a positive test set only if it has at least 5 members and its structural class has at least 10 members outside the fold. This selection resulted in 390 classification tasks.
Positive Set Structural classes, subdivided into folds 
Negative Set The rest of the database, outside the structural class, was first divided in such a way that the members of a family are either all in -test or all in -train. Then 30% of each of the two resulting sets was randomly selected to give the final -train and -test sets.
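The subdivision described above can be sketched in a few lines of Python. This is only one reading of the description, not the script used to build the record: the tuple layout of `domains`, the function name `build_task`, the half-and-half assignment of families to the two negative pools, and the random seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_task(domains, target_class, target_fold, neg_fraction=0.3, seed=0):
    """Build one classification task.

    domains: list of (domain_id, class_id, fold_id, family_id) tuples
             (hypothetical layout, not the format of the benchmark files).
    Returns a dict of four id lists, or None if the fold is too small.
    """
    rng = random.Random(seed)

    # +test: members of the target fold; +train: rest of the same class.
    pos_test = [d for d, c, f, _ in domains
                if c == target_class and f == target_fold]
    pos_train = [d for d, c, f, _ in domains
                 if c == target_class and f != target_fold]

    # Selection rule from the record: >= 5 fold members, >= 10 others in class.
    if len(pos_test) < 5 or len(pos_train) < 10:
        return None

    # Group everything outside the class by family, so that no family
    # contributes to both the -train and the -test pool.
    families = defaultdict(list)
    for d, c, _, fam in domains:
        if c != target_class:
            families[fam].append(d)

    fam_ids = sorted(families)
    rng.shuffle(fam_ids)
    half = len(fam_ids) // 2  # assumption: families are split roughly in half
    neg_train_pool = [d for fam in fam_ids[:half] for d in families[fam]]
    neg_test_pool = [d for fam in fam_ids[half:] for d in families[fam]]

    def subsample(pool):
        # Keep a random 30% (by default) of the pool.
        return rng.sample(pool, round(neg_fraction * len(pool)))

    return {
        "pos_train": pos_train,
        "pos_test": pos_test,
        "neg_train": subsample(neg_train_pool),
        "neg_test": subsample(neg_test_pool),
    }
```

Calling `build_task` once per (class, fold) pair and discarding the `None` results would give one task per qualifying fold.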
Statistics Number of tasks 377
                 Min    Max    Average
Positive Train   167    3072   1620
Positive Test    5      1018   512
Negative Train   1335   1727   1531
Negative Test    1323   1793   1558
Full statistics click here to download the full statistics file SCOP95_class_fold_5_0.3_filt_new_24.stats or click view to view the file in a WEB layout
Cast Matrix click here to download the cast matrix SCOP95_class_fold_5_0.3_filt_new_24.cast
Distance Matrix
Blast download matrix file SCOP95_BLAST.dmx
Smith-Waterman download matrix file SCOP95_SW.dmx
Needleman-Wunsch download matrix file SCOP95_NW.dmx
Local Alignment Kernel download matrix file SCOP95_LA.dmx
Pride structure similarity download matrix file SCOP95_PRIDE.dmx
Results
Summary
Method\Comparison   BLAST    SW       NW       LA       PRIDE
1NN                 0.6016   0.6103   0.6292   0.5692   0.7833
RF                  -        0.7632   -        -        -
SVM                 -        0.6959   -        -        0.9278

Average AUC values for the 377 classification tasks in this record (benchmark test)


Methods Used
[1] SCOP Sequences

The sequences were taken from the SCOP database 1.69 (Andreeva, et al., 2004). The entries of the SCOP95 (<95% identity) were downloaded from the ASTRAL site http://astral.berkeley.edu/site. The 121 non-contiguous domains were discarded and 11944 entries were retained. The sequences were stored in concatenated FASTA format.
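For readers who want to load the sequence file, a minimal FASTA reader is sketched below. It assumes only the generic FASTA layout (a `>` header line followed by one or more sequence lines) and takes the first whitespace-delimited token of the header as the identifier. The function name `read_fasta` and the example headers are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA text from an iterable of lines into {identifier: sequence}."""
    records = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:].split()[0]
            chunks = []
        elif name is not None:
            chunks.append(line)
    # Store the last record in the file.
    if name is not None:
        records[name] = "".join(chunks)
    return records


# Made-up two-record example; real headers in SCOP95.fasta may differ.
example = [">dom1 a.1.1.1", "MKV", "LAA", ">dom2 b.1.1.1", "GSH"]
sequences = read_fasta(example)
```

An open file handle can be passed in place of the list. If the file matches the description above, the resulting dictionary should hold 11944 entries.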


Andreeva, A., Howorth, D., Brenner, S.E., Hubbard, T.J., Chothia, C. and Murzin, A.G. (2004) SCOP database in 2004: refinements integrate structure and sequence family data, Nucleic Acids Res, 32, D226-229.


[2] SCOP Structures

The 3D structures were taken from the SCOP database 1.69 (Andreeva, et al., 2004). The entries of the SCOP95 (<95% identity) were downloaded from the ASTRAL site http://astral.berkeley.edu/pdbstyle-1.69.html. The 121 non-contiguous domains were discarded and 11944 entries were retained. The structures were deposited as a compressed archive.


Andreeva, A., Howorth, D., Brenner, S.E., Hubbard, T.J., Chothia, C. and Murzin, A.G. (2004) SCOP database in 2004: refinements integrate structure and sequence family data, Nucleic Acids Res, 32, D226-229.


[3] BLAST distance matrix.

An all against all comparison was carried out using BLAST (Altschul, et al., 1990) version 2.2.13, downloaded from http://www.ncbi.nlm.nih.gov/BLAST/download.shtml. The BLOSUM62 matrix was used with a gap opening penalty of 11 and a gap extension penalty of 1 (default parameters). The results were stored in a compressed, tab-delimited ASCII file.


Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool, J Mol Biol, 215, 403-410.


[4] Smith-Waterman

An all against all comparison was carried out using the Smith-Waterman algorithm (Smith and Waterman, 1981) as implemented in the water program of EMBOSS (Rice, et al., 2000). The program was downloaded from ftp://ftp.bioinformatics.org/pub/biobrew/. The BLOSUM62 matrix was used with a gap opening penalty of 10 and a gap extension penalty of 0.5 (default parameters). The results were stored in a compressed, tab-delimited ASCII file.
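To illustrate the kind of score behind this matrix, the sketch below computes a local alignment score with affine gap penalties (the Gotoh formulation of Smith-Waterman). It is not the EMBOSS water code. The function `simple_score` is a hypothetical match/mismatch stand-in for BLOSUM62, and the gap convention (a gap of length k costs open + (k-1) x extend) is an assumption of this sketch.

```python
def simple_score(x, y):
    # Hypothetical stand-in for a BLOSUM62 lookup.
    return 5.0 if x == y else -4.0


def smith_waterman_score(a, b, score=simple_score, gap_open=10.0, gap_extend=0.5):
    """Best local alignment score of sequences a and b with affine gaps."""
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # H: best score of a local alignment ending at (i, j).
    # E: same, but ending with a gap in a; F: ending with a gap in b.
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either open a new gap or extend an existing one.
            E[i][j] = max(H[i][j - 1] - gap_open, E[i][j - 1] - gap_extend)
            F[i][j] = max(H[i - 1][j] - gap_open, F[i - 1][j] - gap_extend)
            # The 0.0 option restarts the alignment (local behaviour).
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          E[i][j],
                          F[i][j])
            best = max(best, H[i][j])
    return best


# Ten matches (10 x 5) minus one gap of length two (10 + 0.5) gives 39.5.
example_score = smith_waterman_score("AAAAACCCCC", "AAAAATTCCCCC")
```

An all against all run would call this function for every pair of sequences, which is quadratic in the number of entries. A production tool such as water is far faster than this pure Python loop.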


Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences, J. Mol. Biol., 147, 195-197.

Rice, P., Longden, I. and Bleasby, A. (2000) EMBOSS: the European Molecular Biology Open Software Suite, Trends Genet, 16, 276-277.


[5] Needleman-Wunsch

An all against all comparison was carried out using the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970) as implemented in the needle program of EMBOSS (Rice, et al., 2000). The program was downloaded from ftp://ftp.bioinformatics.org/pub/biobrew/. The BLOSUM62 matrix was used with a gap opening penalty of 10 and a gap extension penalty of 0.5 (default parameters). The results were stored in a compressed, tab-delimited ASCII file.
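A global alignment score with affine gaps can be sketched as follows. This is a textbook three-state dynamic programme, not the EMBOSS needle code. It charges end gaps like any other gap, which may differ from how needle treats them. The function `simple_score` is a hypothetical stand-in for BLOSUM62, and a gap of length k is assumed to cost open + (k-1) x extend.

```python
def simple_score(x, y):
    # Hypothetical stand-in for a BLOSUM62 lookup.
    return 5.0 if x == y else -4.0


def needleman_wunsch_score(a, b, score=simple_score, gap_open=10.0, gap_extend=0.5):
    """Best global alignment score of sequences a and b with affine gaps."""
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a gap;
    # Y: b[j-1] aligned to a gap.
    M = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    X = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    Y = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    # Leading gaps are penalised in full (assumption of this sketch).
    for i in range(1, n + 1):
        X[i][0] = -gap_open - (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = -gap_open - (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = (max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
                       + score(a[i - 1], b[j - 1]))
            X[i][j] = max(M[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


# Three matches (3 x 5) minus one gap of length one (10) gives 5.0.
example_score = needleman_wunsch_score("ACGT", "AGT")
```

Unlike a local alignment, the global score can be negative, because every residue of both sequences must be accounted for.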


Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins, J Mol Biol, 48, 443-453.

Rice, P., Longden, I. and Bleasby, A. (2000) EMBOSS: the European Molecular Biology Open Software Suite, Trends Genet, 16, 276-277.


[6] Local Alignment kernel

The Local Alignment Kernel program version 0.3 of Saigo and associates (Saigo, et al., 2004) was downloaded from http://cg.ensmp.fr/~vert/. The following run parameters were used: the default comparison matrix found in the parameters.h file, gap opening penalty = 11 (default), gap extension penalty = 1 (default), and scaling parameter = 0.5.


Saigo, H., Vert, J.P., Ueda, N. and Akutsu, T. (2004) Protein homology detection using string alignment kernels, Bioinformatics, 20, 1682-1689.


[7] PRIDE

An all against all comparison was carried out using the PRIDE algorithm (Gaspari, et al., 2005). The program was provided by Z. Gaspari.


Gaspari, Z., Vlahovicek, K. and Pongor, S. (2005) Efficient recognition of folds in protein 3D structures by the improved PRIDE algorithm, Bioinformatics, 21, 3322-3323.


[8] Nearest neighbour classification

Nearest neighbour (1NN) classification is a technique whereby a query sequence is assigned to the a priori known class of the database entry found to be most similar to it in terms of a distance/similarity measure (for an introduction see Duda, et al., 2000).
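A minimal sketch of the 1NN rule is given below. It assumes the comparison values for one query are available as a dictionary keyed by training entry; this layout, the function name and the `larger_is_closer` flag are illustrative choices, not the format of the .dmx files.

```python
def nearest_neighbour_label(values, labels, larger_is_closer=False):
    """Return the label of the training entry closest to the query.

    values: {train_id: distance or similarity between query and train_id}
    labels: {train_id: known class label}
    larger_is_closer: set to True for similarity scores (for example
                      alignment scores), leave False for distances.
    """
    pick = max if larger_is_closer else min
    nearest = pick(values, key=values.get)
    return labels[nearest]


# Made-up distances from one query to three training entries.
values = {"t1": 0.9, "t2": 0.2, "t3": 0.5}
labels = {"t1": "+", "t2": "-", "t3": "+"}
predicted = nearest_neighbour_label(values, labels)  # nearest entry is t2
```

The flag matters because the measures in this record mix distances and similarities, and the nearest entry is the minimum in one case and the maximum in the other.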


Duda, R.O., Hart, P.E. and Stork, D.G. (2000) Pattern Classification. John Wiley & Sons, New York.


[9] Performance Evaluation

Classification performance was evaluated by standard receiver operating characteristic (ROC) analysis (for an introduction see Duda, et al., 2000). This method tests the ranking ability of a classifier based on a real-valued ranking parameter. For nearest neighbour classification, the ranking parameter was the similarity/distance calculated between an object and the nearest member of the positive training set (outlier detection). Briefly, sensitivity is plotted against 1-specificity at various threshold values, and the resulting curve is integrated to give an “area under curve” (AUC) value. These values are determined for each classification experiment. For a perfect ranking AUC=1.0; for a random ranking AUC=0.5 (Egan, 1975).
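The evaluation can be sketched as below. The AUC is computed here through its rank interpretation: the fraction of (positive, negative) test pairs in which the positive object is ranked higher, with ties counted as one half. This equals the trapezoidal area under the ROC curve, but it is a swapped-in shortcut and not necessarily how the values in this record were computed. The function names and the dictionary layout of the distances are assumptions.

```python
def nn_ranking_score(distances_to_pos_train):
    """Ranking parameter for 1NN: closeness to the nearest positive
    training member. Distances are negated so that larger means
    'more likely positive'.

    distances_to_pos_train: {pos_train_id: distance from the test object}
    """
    return -min(distances_to_pos_train.values())


def auc(pos_scores, neg_scores):
    """Area under the ROC curve from the scores of positive and negative
    test objects (higher score = ranked as more positive)."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up scores: 3 of the 4 positive/negative pairs are ordered correctly.
example_auc = auc([0.9, 0.4], [0.5, 0.1])
```

The pairwise loop is quadratic in the test set size, which is acceptable for sets of the size listed in the statistics above. A sort-based rank sum gives the same value faster.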


Duda, R.O., Hart, P.E. and Stork, D.G. (2000) Pattern Classification. John Wiley & Sons, New York.

Egan, J.P. (1975) Signal Detection Theory and ROC Analysis. Academic Press, New York.


 



©2006 ICGEBNet