small logo

Protein Classification Benchmark Collection

General information
Accession Number PCB00019
Record Name SCOP40_Minidatabase;
Created 12-DEC-2006
Updated 12-DEC-2006
Description Classification of protein domain sequences and structures into superfamilies, based on families (SCOP40 v.169 dataset of 1357 sequences and structures )
Data
Data Description A subset of the SCOP40 protein database
Download click here for the fasta file containing the sequences SCOP40mini.fasta
Download click here for the ziped file containing the structures SCOP40mini.pdb.tar.gz
Subdivision into training and test groups
Subdivision Description Protein families, with at least 5 members and at least 10 members outside the family but within the same superfamily in SCOP95 were selected. From this selection those families were selected as positive test that had at least 10 members outside the family but within the same superfamily in the selection (total: 1357 proteins/55 classification tasks). 
Positive Set Superfamilies, subdivided into families 
Negative Set The rest of the selected dataset divided into -test or -train 
Statistics Number of tasks 55
  Min Max Average  
Positive Train 10 168 89  
Positive Test 5 52 29  
Negative Train 592 670 631  
Negative test 592 671 632  
Full statistics click here to download the full statistics file SCOP40mini_sequence_minidatabase_19.stats or click view to view the file in a WEB layout
Cast Matrix click here to download the cast matrix SCOP40mini_sequence_minidatabase_19.cast
Distance Matrix
Blast download matrix file SCOP40mini_BLAST.dmx
Smith-Waterman download matrix file SCOP40mini_SW.dmx
Needleman-Wunsch download matrix file SCOP40mini_NW.dmx
Local Alignment Kernel download matrix file SCOP40mini_LA.dmx
Pride structure similarity download matrix file SCOP40mini_PRIDE.dmx
Lempel-Ziv-Welch download matrix file SCOP40mini_LZW.dmx
Prediction by Partial Match download matrix file SCOP40mini_PPMZ.dmx
DaliLite structural alignment download matrix file SCOP40mini_DALI.dmx
PSI-BLAST download matrix file SCOP40mini_PsiBLAST.dmx
Results
Summary
Method\Comparison BLASTSWNW LA PRIDELZW PPMZ DALI
1nn0.75770.81540.82520.73430.86440.71740.56440.9892
RF0.69650.82300.80300.83440.91050.73960.72530.9941
SVM0.90470.94190.93760.93960.93610.82880.85510.9946
ANN0.79880.88750.88340.9022-0.83460.8254-
LogReg0.87150.90630.91750.87660.90290.74870.83080.9636

Average AUC values for the 55 classification tasks in this record (benchmark test)
Detailed view

Select the methods using multiple select (Ctrl +Mouse)
Select the dinstance measures
Group by Method
Distance Measure
view in a web layout
donwload the result file


Methods Used
[1] SCOP Minidatabase Sequences

The sequences were taken from the SCOP database 1.69 (Andreeva, et al.,2004). The entries of the SCOP40 (<40% identity) were downloaded from the ASTRAL site <http://astral.berkeley.edu/>site. The 53 non-contiguous domains were discarded and 7237 entries were retained. The sequences were stored in concatenated FASTA format.


Andreeva, A., Howorth, D., Brenner, S.E., Hubbard, T.J., Chothia, C. and Murzin, A.G. (2004) SCOP database in 2004: refinements integrate structure and sequence family data, Nucleic Acids Res, 32, D226-229.


[2] SCOP Minidatabase Structures

The 3D structures were taken from the SCOP database 1.69 (Andreeva, et al.,2004). The entries of the SCOP40 (<40% identity) were downloaded from the ASTRAL sitehttp://astral.berkeley.edu/pdbstyle-1.69.html. The 53 non-contiguous domains were discarded and 7237 entries were retained. The structues were deposited as compressed archive.


Andreeva, A., Howorth, D., Brenner, S.E., Hubbard, T.J., Chothia, C. and Murzin, A.G. (2004) SCOP database in 2004: refinements integrate structure and sequence family data, Nucleic Acids Res, 32, D226-229.


[3] BLAST distance matrix.

An all against all comparison was carried out using BLAST (Altschul, et al., 1990) version 2.2.13 downloaded from http://www.ncbi.nlm.nih.gov/BLAST/download.shtml The BLOSUM62 matrix was used with a gap opening penalty of 11 and a gap extension penalty of 1 (default parameters). The results were stored in a compressed, tab-delimited ASCII file.


Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool, J Mol Biol, 215, 403-410.


[4] Smith-Waterman

An all against all comparison was carried out using the Smith-Waterman algorithm (Smith and Waterman, 1981) as implemented in the water program of EMBOSS (Rice, et al., 2000). The program was downloaded from ftp://ftp.bioinformatics.org/pub/biobrew/. The BLOSUM62 matrix was used with a gap opening penalty of 10 and a gap extension penalty of 0.5 (default parameters). The results were stored in a compressed, tab-delimited ASCII file.


Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences, J. Mol. Biol., 147, 195-197.

Rice, P., Longden, I. and Bleasby, A. (2000) EMBOSS: the European Molecular Biology Open Software Suite, Trends Genet, 16, 276-277.


[5] Needleman-Wunsch

An all against all comparision was carried out using the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970) as implemented in the needle program of EMBOSS (Rice, et al., 2000). The program was downloaded from ftp://ftp.bioinformatics.org/pub/biobrew/. The BLOSUM62 matrix was used with a gap opening penalty of 10 and a gap extension penalty of 0.5 (default). The results were stored in a compressed, tab-delimited ASCII file.


Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins, J Mol Biol, 48, 443-453.

Rice, P., Longden, I. and Bleasby, A. (2000) EMBOSS: the European Molecular Biology Open Software Suite, Trends Genet, 16, 276-277.


[6] Local Alignment kernel

The Local Alignment Kernel program version 0.3 of Saigo and associates (Saigo, et al., 2004) was downloaded from http://cg.ensmp.fr/~vert/. The following run parameters were used: Default comparison matrix found in the parameters.h file. Gap opening penalty = 11 (default), Gap extension penalty = 1 (default), Scaling parameter = 0.5.


Saigo, H., Vert, J.P., Ueda, N. and Akutsu, T. (2004) Protein homology detection using string alignment kernels, Bioinformatics, 20, 1682-1689.


[7] PRIDE

An all against all comparison was carried out using the PRIDE algorithm (Gaspari, et al., 2005). The program was provided by Z. Gaspari.


Gaspari, Z., Vlahovicek, K. and Pongor, S. (2005) Efficient recognition of folds in protein 3D structures by the improved PRIDE algorithm, Bioinformatics, 21, 3322-3323.


[8] LZW and PPMZ distance functions

All vs. all comparison or the dataset was performed using a compression distance function D(X,Y) that can be calculated between two sequences, X and Y, as follows (Cilibrasi and Vitanyi, 2005, Kocsor et al, 2006): D(X,Y) = (C(xy) – min{C(X),C(Y)}) / max{C(X),C(Y)}, where C(.) denotes the length of a compressed string, compressed by the LZW (Lempel and Ziv, 1976) or the PPMZ algorithm (Bloom, 1998). The LZW algorithm was implemented in C, while the PPMZ2 algorithm was downloaded from Charles Bloom’s homepage (http://www.cbloom.com/src/ ppmz.html).


Bloom, C. (1998) Solving the Problems of Context Modelling. California Institute of Technology, 1-11. http://www.cbloom.com/papers/ppmz.zip

Cilibrasi, R. and Vitányi, P.M.B. (2005) Clustering by compression, IEEE Transactions on Information Theory, 51, 1523-1542.

Kocsor, A., Kertész-Farkas, A., Kaján, L. and Pongor, S. (2006) Application of compression-based distance measures to protein sequence classification: a methodological study. Bioinformatics, 22, 407-412

Lempel, A. and Ziv, J. (1976) On the complexity of finite sequences, IEEE Transactions on Information Theory, 22, 75-81.


[9] DaliLite

The DALI similarity score (raw score) was calculated by the DALI-lite program for pairwise structure comparison and database searching, version 2.4.2 (Holm and Park, 2000). The program was downloaded from http://ekhidna.biocenter.helsinki.fi/dali_lite/downloads/.


Holm L. and Park J. (2000) DaliLite workbench for protein structure comparison, Bioinformatics, 16, 566-567.


[10] Nearest negihbour classification

Nearest neighbour (1NN) classification is a technique whereby a query sequence is assigned to the a priori known class of the database entry that was found most similar to it in terms of a distance/similarity measure (for an introduction see Duda, et al., 2001).


Duda, R.O., Hart, P.E. and Stork, D.G. (2000) Pattern Classification. John Wiley & Sons, New York.


[11] Support Vector Machines

The Support Vector Machines (SVM) classifier computes a hyperplane with the largest margin between the +train and –train groups (Vapnik, 1998) and the classification is based on the distance of the test objects from this hyperplane. SVMLight package (Joachims, 1999, Version 2.51, January 2002, downloaded from http://svmlight.joachims.org/ ) was used for the evaluation. The capacity parameter C was set to 100, the RBF kernel (also called Gaussian kernel) was used with its sigma parameter set to the median Euclidean distance from any positive training example to the nearest negative example. Each protein was represented by a feature vector whose components were similarity/distance scores computed between the protein and the proteins of the training set (Liao and Noble, 2003). The ROC analysis and the calculation of the AUC score was based on the distance of the test proteins from the hyperplane, used as the ranking variable.


Joachims,T. (1999) Making large-scale SVM learning practical. In Schoelkopf, B., Burges, C. and Smola, A. (eds), Advances in Kernel Methods - Support Vector Learning. MIT Press, Boston, MA

Liao, L. and Noble, W.S. (2003) Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships, J Comput Biol, 10, 857-868.

Vapnik,V.N. (1998) Statistical Learning Theory. John Wiley & Sons, New York.


[12] Random Forest

The Random Forest (RF) technique uses a combination of decision trees (Breiman, 2001). The RF algorithm was used as implemented in the WEKA software package (Witten and Frank, 1999, Version 3.4.7, December 2005, downloaded from http://www.cs.waikato.ac.nz/ml/weka/). 10 trees were used and the number of variables in each node was set to log(D+1), where D is the number of inputs. Each protein was represented by a feature vector whose components were similarity/distance scores computed between the protein and the proteins of the training set. The output of the RF for a test protein was the probability of the protein belonging to the positive class. This value was then used as a ranking variable for ROC analysis and for the calculation of the AUC value.


Breiman,L. (2001) Random forests. Machine Learning, 45, 5-32.

Witten,I.H. and Frank, E. (1999) Data Mining: Practical Machine Learning tools and Techniques with JAVA implementations, Morgan Kaufman.


[13] Artificial Neural Networks

The Artificial Neural Network (ANN) classifier consists of connected artificial neurons built in a multi-layer structure (Bishop, 1995). A three layer ANN was used and the number of neurons within the hidden layer was set to 20. The ANN algorithm was used as implemented in the WEKA software package (Witten and Frank, 1999, Version 3.4.7, December 2005, downloaded from http://www.cs.waikato.ac.nz/ml/weka/).


Each protein was represented by a feature vector whose components were similarity/distance scores computed between the protein and the proteins of the training set. The output of the ANN for a test protein was the probability of the protein belonging to the positive class. This value was then used as a ranking variable for ROC analysis and for the calculation of the AUC value.

Bishop,C.M. (1995) Neural Networks for Pattern Recognition. Clarendon Press, Oxford.

Witten,I.H. and Frank, E. (1999) Data Mining: Practical Machine Learning tools and Techniques with JAVA implementations, Morgan Kaufman.


[14] Logistic Regression

The Logistic Regression (LogReg) is a variation of ordinary regression which is used when the response variable is a dichotomous (e.g. -1 and 1) and the input variable are continuous, categorical or both (Rice, 1994). The LogReg algorithm was used as implemented in the WEKA software package (Witten and Frank, 1999, Version 3.4.7, December 2005, downloaded from http://www.cs.waikato.ac.nz/ml/weka/). Each protein was represented by a feature vector whose components were similarity/distance scores computed between the protein and the proteins of the training set. The output of the LogReg for a test protein was the probability of the test protein belonging to the positive class. This value was then used as a ranking variable for ROC analysis and for the calculation of the AUC value.


Rice, J. C. (1994). Logistic regression: An introduction. In B. Thompson, ed., Advances in social science methodology, Vol. 3: 191-245. Greenwich, CT: JAI Press. Popular introduction.

Witten,I.H. and Frank, E. (1999) Data Mining: Practical Machine Learning tools and Techniques with JAVA implementations, Morgan Kaufman.


[15] Performance Evaluation

The evaluation of classification performance was carried out by the standard receiver operator characteristic (ROC) analysis (for an introduction see (Duda, et al., 2000)). This method is designed to test the ranking ability of a given classifier based on a real-valued ranking parameter. In the case of nearest neighbour classification, the ranking parameter was a similarity/distance parameter calculated between an object and the nearest member of the positive training set (outlier detection). Briefly, the analysis is carried out by plotting sensitivity vs 1-specificity at various threshold values, then the resulting curve is integrated to give an “area under curve” or AUC value. These values are determined for each classification experiment. For a perfect ranking, AUC=1.0, for random ranking AUC=0.5 (Egan, 1975).


Duda, R.O., Hart, P.E. and Stork, D.G. (2000) Pattern Classification. John Wiley & Sons, New York.

Egan, J.P. (1975) Signal Detection theory and ROC Analysis. New York.


 



©2006 ICGEBNet