small logo

Protein Classification Benchmark Collection

General information
Accession Number PCB00015
Record Name 3PGK_Protein_Kingdom_Phylum;
Created 12-DEC-2006
Updated 12-DEC-2006
Description Classification of 3PGK protein sequences into kingdoms of life (Archaea, Bacteria, Eukaryota) based on phyla.
Data
Data Description 3-phosphoglycerate kinase (3PGK) protein sequences
Download click here for the fasta file containing the sequences 3PGK_PROTEIN.fasta
Subdivision into training and test groups
Subdivision Description Only phyla with at least 5 members were included as positive test. This selection resulted in 10 classification tasks. 
Positive Set A kingdom of life (Archea, Bacteria, Eukaryota), subdivided into phyla 
Negative Set The rest of the dataset outside the kingdom divided in such a way that members of a phylum can be either -test or -train 
Statistics Number of tasks 10
  Min Max Average  
Positive Train 4 68 36  
Positive Test 4 35 20  
Negative Train 27 63 45  
Negative test 31 53 42  
Full statistics click here to download the full statistics file 3PGK_15.stats or click view to view the file in a WEB layout
Cast Matrix click here to download the cast matrix 3PGK_15.cast
Distance Matrix
Blast download matrix file 3PGK_PROTEIN_BLAST.dmx
Smith-Waterman download matrix file 3PGK_PROTEIN_SW.dmx
Local Alignment Kernel download matrix file 3PGK_PROTEIN_LA.dmx
Prediction by Partial Match download matrix file 3PGK_PROTEIN_PPMZ.dmx
Results
Summary
Method\Comparison BLASTSWNW LA LZW PPMZ
1nn0.86330.86050.86210.85960.78330.8117
RF0.85170.86590.85480.87550.84630.9152
SVM0.95330.95270.95420.95490.92420.9476
ANN0.95840.95480.95470.95640.92780.9597
LogReg0.95370.94760.94940.95930.91540.9398

Average AUC values for the 10 classification tasks in this record (benchmark test)
Detailed view

Select the methods using multiple select (Ctrl +Mouse)
Select the dinstance measures
Group by Method
Distance Measure
view in a web layout
donwload the result file


Methods Used
[1] 3PGK Protein

The protein sequences of 3-phosphoglycerate kinase (3PGK) were taken from (Pollack, et al., 2005) kindly provided by the authors.


Pollack, J.D., Li, Q. and Pearl, D.K. (2005) Taxonomic utility of a phylogenetic analysis of phosphoglycerate kinase proteins of Archaea, Bacteria, and Eukaryota: insights by Bayesian analyses, Mol Phylogenet Evol, 35, 420-430.


[2] BLAST distance matrix.

An all against all comparison was carried out using BLAST (Altschul, et al., 1990) version 2.2.13 downloaded from http://www.ncbi.nlm.nih.gov/BLAST/download.shtml The BLOSUM62 matrix was used with a gap opening penalty of 11 and a gap extension penalty of 1 (default parameters). The results were stored in a compressed, tab-delimited ASCII file.


Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool, J Mol Biol, 215, 403-410.


[3] Smith-Waterman

An all against all comparison was carried out using the Smith-Waterman algorithm (Smith and Waterman, 1981) as implemented in the water program of EMBOSS (Rice, et al., 2000). The program was downloaded from ftp://ftp.bioinformatics.org/pub/biobrew/. The BLOSUM62 matrix was used with a gap opening penalty of 10 and a gap extension penalty of 0.5 (default parameters). The results were stored in a compressed, tab-delimited ASCII file.


Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences, J. Mol. Biol., 147, 195-197.

Rice, P., Longden, I. and Bleasby, A. (2000) EMBOSS: the European Molecular Biology Open Software Suite, Trends Genet, 16, 276-277.


[4] Needleman-Wunsch

An all against all comparision was carried out using the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970) as implemented in the needle program of EMBOSS (Rice, et al., 2000). The program was downloaded from ftp://ftp.bioinformatics.org/pub/biobrew/. The BLOSUM62 matrix was used with a gap opening penalty of 10 and a gap extension penalty of 0.5 (default). The results were stored in a compressed, tab-delimited ASCII file.


Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins, J Mol Biol, 48, 443-453.

Rice, P., Longden, I. and Bleasby, A. (2000) EMBOSS: the European Molecular Biology Open Software Suite, Trends Genet, 16, 276-277.


[5] Local Alignment kernel

The Local Alignment Kernel program version 0.3 of Saigo and associates (Saigo, et al., 2004) was downloaded from http://cg.ensmp.fr/~vert/. The following run parameters were used: Default comparison matrix found in the parameters.h file. Gap opening penalty = 11 (default), Gap extension penalty = 1 (default), Scaling parameter = 0.5.


Saigo, H., Vert, J.P., Ueda, N. and Akutsu, T. (2004) Protein homology detection using string alignment kernels, Bioinformatics, 20, 1682-1689.


[6] LZW and PPMZ distance functions

All vs. all comparison or the dataset was performed using a compression distance function D(X,Y) that can be calculated between two sequences, X and Y, as follows (Cilibrasi and Vitanyi, 2005, Kocsor et al, 2006): D(X,Y) = (C(xy) – min{C(X),C(Y)}) / max{C(X),C(Y)}, where C(.) denotes the length of a compressed string, compressed by the LZW (Lempel and Ziv, 1976) or the PPMZ algorithm (Bloom, 1998). The LZW algorithm was implemented in C, while the PPMZ2 algorithm was downloaded from Charles Bloom’s homepage (http://www.cbloom.com/src/ ppmz.html).


Bloom, C. (1998) Solving the Problems of Context Modelling. California Institute of Technology, 1-11. http://www.cbloom.com/papers/ppmz.zip

Cilibrasi, R. and Vitányi, P.M.B. (2005) Clustering by compression, IEEE Transactions on Information Theory, 51, 1523-1542.

Kocsor, A., Kertész-Farkas, A., Kaján, L. and Pongor, S. (2006) Application of compression-based distance measures to protein sequence classification: a methodological study. Bioinformatics, 22, 407-412

Lempel, A. and Ziv, J. (1976) On the complexity of finite sequences, IEEE Transactions on Information Theory, 22, 75-81.


[7] Nearest negihbour classification

Nearest neighbour (1NN) classification is a technique whereby a query sequence is assigned to the a priori known class of the database entry that was found most similar to it in terms of a distance/similarity measure (for an introduction see Duda, et al., 2001).


Duda, R.O., Hart, P.E. and Stork, D.G. (2000) Pattern Classification. John Wiley & Sons, New York.


[8] Support Vector Machines

The Support Vector Machines (SVM) classifier computes a hyperplane with the largest margin between the +train and –train groups (Vapnik, 1998) and the classification is based on the distance of the test objects from this hyperplane. SVMLight package (Joachims, 1999, Version 2.51, January 2002, downloaded from http://svmlight.joachims.org/ ) was used for the evaluation. The capacity parameter C was set to 100, the RBF kernel (also called Gaussian kernel) was used with its sigma parameter set to the median Euclidean distance from any positive training example to the nearest negative example. Each protein was represented by a feature vector whose components were similarity/distance scores computed between the protein and the proteins of the training set (Liao and Noble, 2003). The ROC analysis and the calculation of the AUC score was based on the distance of the test proteins from the hyperplane, used as the ranking variable.


Joachims,T. (1999) Making large-scale SVM learning practical. In Schoelkopf, B., Burges, C. and Smola, A. (eds), Advances in Kernel Methods - Support Vector Learning. MIT Press, Boston, MA

Liao, L. and Noble, W.S. (2003) Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships, J Comput Biol, 10, 857-868.

Vapnik,V.N. (1998) Statistical Learning Theory. John Wiley & Sons, New York.


[9] Random Forest

The Random Forest (RF) technique uses a combination of decision trees (Breiman, 2001). The RF algorithm was used as implemented in the WEKA software package (Witten and Frank, 1999, Version 3.4.7, December 2005, downloaded from http://www.cs.waikato.ac.nz/ml/weka/). 10 trees were used and the number of variables in each node was set to log(D+1), where D is the number of inputs. Each protein was represented by a feature vector whose components were similarity/distance scores computed between the protein and the proteins of the training set. The output of the RF for a test protein was the probability of the protein belonging to the positive class. This value was then used as a ranking variable for ROC analysis and for the calculation of the AUC value.


Breiman,L. (2001) Random forests. Machine Learning, 45, 5-32.

Witten,I.H. and Frank, E. (1999) Data Mining: Practical Machine Learning tools and Techniques with JAVA implementations, Morgan Kaufman.


[10] Artificial Neural Networks

The Artificial Neural Network (ANN) classifier consists of connected artificial neurons built in a multi-layer structure (Bishop, 1995). A three layer ANN was used and the number of neurons within the hidden layer was set to 20. The ANN algorithm was used as implemented in the WEKA software package (Witten and Frank, 1999, Version 3.4.7, December 2005, downloaded from http://www.cs.waikato.ac.nz/ml/weka/).


Each protein was represented by a feature vector whose components were similarity/distance scores computed between the protein and the proteins of the training set. The output of the ANN for a test protein was the probability of the protein belonging to the positive class. This value was then used as a ranking variable for ROC analysis and for the calculation of the AUC value.

Bishop,C.M. (1995) Neural Networks for Pattern Recognition. Clarendon Press, Oxford.

Witten,I.H. and Frank, E. (1999) Data Mining: Practical Machine Learning tools and Techniques with JAVA implementations, Morgan Kaufman.


[11] Logistic Regression

The Logistic Regression (LogReg) is a variation of ordinary regression which is used when the response variable is a dichotomous (e.g. -1 and 1) and the input variable are continuous, categorical or both (Rice, 1994). The LogReg algorithm was used as implemented in the WEKA software package (Witten and Frank, 1999, Version 3.4.7, December 2005, downloaded from http://www.cs.waikato.ac.nz/ml/weka/). Each protein was represented by a feature vector whose components were similarity/distance scores computed between the protein and the proteins of the training set. The output of the LogReg for a test protein was the probability of the test protein belonging to the positive class. This value was then used as a ranking variable for ROC analysis and for the calculation of the AUC value.


Rice, J. C. (1994). Logistic regression: An introduction. In B. Thompson, ed., Advances in social science methodology, Vol. 3: 191-245. Greenwich, CT: JAI Press. Popular introduction.

Witten,I.H. and Frank, E. (1999) Data Mining: Practical Machine Learning tools and Techniques with JAVA implementations, Morgan Kaufman.


[12] Performance Evaluation

The evaluation of classification performance was carried out by the standard receiver operator characteristic (ROC) analysis (for an introduction see (Duda, et al., 2000)). This method is designed to test the ranking ability of a given classifier based on a real-valued ranking parameter. In the case of nearest neighbour classification, the ranking parameter was a similarity/distance parameter calculated between an object and the nearest member of the positive training set (outlier detection). Briefly, the analysis is carried out by plotting sensitivity vs 1-specificity at various threshold values, then the resulting curve is integrated to give an “area under curve” or AUC value. These values are determined for each classification experiment. For a perfect ranking, AUC=1.0, for random ranking AUC=0.5 (Egan, 1975).


Duda, R.O., Hart, P.E. and Stork, D.G. (2000) Pattern Classification. John Wiley & Sons, New York.

Egan, J.P. (1975) Signal Detection theory and ROC Analysis. New York.


 



©2006 ICGEBNet