| General information | |||||||||||||||||||||||||||||||||||||||||
| Accession Number | PCB00016 | ||||||||||||||||||||||||||||||||||||||||
| Record Name | 3PGK_DNA_Kingdom_Phylum; | ||||||||||||||||||||||||||||||||||||||||
| Created | 12-DEC-2006 | ||||||||||||||||||||||||||||||||||||||||
| Updated | 12-DEC-2006 | ||||||||||||||||||||||||||||||||||||||||
| Description | Classification of 3PGK DNA sequences (reading frames) into kingdoms of life (Archaea, Bacteria, Eukaryota) based on phyla. | ||||||||||||||||||||||||||||||||||||||||
| Data | |||||||||||||||||||||||||||||||||||||||||
| Data Description | 3-phosphoglycerate kinase (3PGK) reading frame DNA sequences | ||||||||||||||||||||||||||||||||||||||||
| Download | click here for the fasta file containing the sequences 3PGK_DNA.fasta | ||||||||||||||||||||||||||||||||||||||||
| Subdivision into training and test groups | |||||||||||||||||||||||||||||||||||||||||
| Subdivision Description | Only phyla with at least 5 members were included as positive test. This selection resulted in 10 classification tasks. | ||||||||||||||||||||||||||||||||||||||||
| Positive Set | A kingdom of life (Archea, Bacteria, Eukaryota), subdivided into phyla | ||||||||||||||||||||||||||||||||||||||||
| Negative Set | The rest of the dataset outside the kingdom divided in such a way that members of a phylum can be either -test or -train | ||||||||||||||||||||||||||||||||||||||||
| Statistics | Number of tasks | 10 | |||||||||||||||||||||||||||||||||||||||
| Min | Max | Average | |||||||||||||||||||||||||||||||||||||||
| Positive Train | 4 | 68 | 36 | ||||||||||||||||||||||||||||||||||||||
| Positive Test | 4 | 35 | 20 | ||||||||||||||||||||||||||||||||||||||
| Negative Train | 27 | 63 | 45 | ||||||||||||||||||||||||||||||||||||||
| Negative test | 31 | 53 | 42 | ||||||||||||||||||||||||||||||||||||||
| Full statistics | click here to download the full statistics file 3PGK_16.stats or click view to view the file in a WEB layout | ||||||||||||||||||||||||||||||||||||||||
| Cast Matrix | click here to download the cast matrix 3PGK_16.cast | ||||||||||||||||||||||||||||||||||||||||
| Distance Matrix | |||||||||||||||||||||||||||||||||||||||||
| Blast | download matrix file 3PGK_DNA_BLAST.dmx | ||||||||||||||||||||||||||||||||||||||||
| Smith-Waterman | download matrix file 3PGK_DNA_SW.dmx | ||||||||||||||||||||||||||||||||||||||||
| Needleman-Wunsch | download matrix file 3PGK_DNA_NW.dmx | ||||||||||||||||||||||||||||||||||||||||
| Lempel-Ziv-Welch | download matrix file 3PGK_DNA_LZW.dmx | ||||||||||||||||||||||||||||||||||||||||
| Prediction by Partial Match | download matrix file 3PGK_DNA_PPMZ.dmx | ||||||||||||||||||||||||||||||||||||||||
| Results | |||||||||||||||||||||||||||||||||||||||||
| Summary |
Average AUC values for the 10 classification tasks in this record (benchmark test) |
||||||||||||||||||||||||||||||||||||||||
| Detailed view | |||||||||||||||||||||||||||||||||||||||||
| Methods Used | |||||||||||||||||||||||||||||||||||||||||
| [1] 3PGK DNA | The 3-phosphoglycerate kinase (3PGK) DNA sequences were retrieved from the public databases using the respective protein sequences (Pollack, et al., 2005). Pollack, J.D., Li, Q. and Pearl, D.K. (2005) Taxonomic utility of a phylogenetic analysis of phosphoglycerate kinase proteins of Archaea, Bacteria, and Eukaryota: insights by Bayesian analyses, Mol Phylogenet Evol, 35, 420-430. |
||||||||||||||||||||||||||||||||||||||||
| [2] BLAST distance matrix. | An all against all comparison was carried out using BLAST (Altschul, et al., 1990) version 2.2.13 downloaded from http://www.ncbi.nlm.nih.gov/BLAST/download.shtml The BLOSUM62 matrix was used with a gap opening penalty of 11 and a gap extension penalty of 1 (default parameters). The results were stored in a compressed, tab-delimited ASCII file. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool, J Mol Biol, 215, 403-410. |
||||||||||||||||||||||||||||||||||||||||
| [3] Smith-Waterman | An all against all comparison was carried out using the Smith-Waterman algorithm (Smith and Waterman, 1981) as implemented in the water program of EMBOSS (Rice, et al., 2000). The program was downloaded from ftp://ftp.bioinformatics.org/pub/biobrew/. The BLOSUM62 matrix was used with a gap opening penalty of 10 and a gap extension penalty of 0.5 (default parameters). The results were stored in a compressed, tab-delimited ASCII file. Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences, J. Mol. Biol., 147, 195-197. Rice, P., Longden, I. and Bleasby, A. (2000) EMBOSS: the European Molecular Biology Open Software Suite, Trends Genet, 16, 276-277. |
||||||||||||||||||||||||||||||||||||||||
| [4] Needleman-Wunsch | An all against all comparision was carried out using the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970) as implemented in the needle program of EMBOSS (Rice, et al., 2000). The program was downloaded from ftp://ftp.bioinformatics.org/pub/biobrew/. The BLOSUM62 matrix was used with a gap opening penalty of 10 and a gap extension penalty of 0.5 (default). The results were stored in a compressed, tab-delimited ASCII file. Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins, J Mol Biol, 48, 443-453. Rice, P., Longden, I. and Bleasby, A. (2000) EMBOSS: the European Molecular Biology Open Software Suite, Trends Genet, 16, 276-277. |
||||||||||||||||||||||||||||||||||||||||
| [5] LZW and PPMZ distance functions | All vs. all comparison or the dataset was performed using a compression distance function D(X,Y) that can be calculated between two sequences, X and Y, as follows (Cilibrasi and Vitanyi, 2005, Kocsor et al, 2006): D(X,Y) = (C(xy) – min{C(X),C(Y)}) / max{C(X),C(Y)}, where C(.) denotes the length of a compressed string, compressed by the LZW (Lempel and Ziv, 1976) or the PPMZ algorithm (Bloom, 1998). The LZW algorithm was implemented in C, while the PPMZ2 algorithm was downloaded from Charles Bloom’s homepage (http://www.cbloom.com/src/ ppmz.html). Bloom, C. (1998) Solving the Problems of Context Modelling. California Institute of Technology, 1-11. http://www.cbloom.com/papers/ppmz.zip Cilibrasi, R. and Vitányi, P.M.B. (2005) Clustering by compression, IEEE Transactions on Information Theory, 51, 1523-1542. Kocsor, A., Kertész-Farkas, A., Kaján, L. and Pongor, S. (2006) Application of compression-based distance measures to protein sequence classification: a methodological study. Bioinformatics, 22, 407-412 Lempel, A. and Ziv, J. (1976) On the complexity of finite sequences, IEEE Transactions on Information Theory, 22, 75-81. |
||||||||||||||||||||||||||||||||||||||||
| [6] Nearest negihbour classification | Nearest neighbour (1NN) classification is a technique whereby a query sequence is assigned to the a priori known class of the database entry that was found most similar to it in terms of a distance/similarity measure (for an introduction see Duda, et al., 2001). Duda, R.O., Hart, P.E. and Stork, D.G. (2000) Pattern Classification. John Wiley & Sons, New York. |
||||||||||||||||||||||||||||||||||||||||
| [7] Support Vector Machines | The Support Vector Machines (SVM) classifier computes a hyperplane with the largest margin between the +train and –train groups (Vapnik, 1998) and the classification is based on the distance of the test objects from this hyperplane. SVMLight package (Joachims, 1999, Version 2.51, January 2002, downloaded from http://svmlight.joachims.org/ ) was used for the evaluation. The capacity parameter C was set to 100, the RBF kernel (also called Gaussian kernel) was used with its sigma parameter set to the median Euclidean distance from any positive training example to the nearest negative example. Each protein was represented by a feature vector whose components were similarity/distance scores computed between the protein and the proteins of the training set (Liao and Noble, 2003). The ROC analysis and the calculation of the AUC score was based on the distance of the test proteins from the hyperplane, used as the ranking variable. Joachims,T. (1999) Making large-scale SVM learning practical. In Schoelkopf, B., Burges, C. and Smola, A. (eds), Advances in Kernel Methods - Support Vector Learning. MIT Press, Boston, MA Liao, L. and Noble, W.S. (2003) Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships, J Comput Biol, 10, 857-868. Vapnik,V.N. (1998) Statistical Learning Theory. John Wiley & Sons, New York. |
||||||||||||||||||||||||||||||||||||||||
| [8] Random Forest | The Random Forest (RF) technique uses a combination of decision trees (Breiman, 2001). The RF algorithm was used as implemented in the WEKA software package (Witten and Frank, 1999, Version 3.4.7, December 2005, downloaded from http://www.cs.waikato.ac.nz/ml/weka/). 10 trees were used and the number of variables in each node was set to log(D+1), where D is the number of inputs. Each protein was represented by a feature vector whose components were similarity/distance scores computed between the protein and the proteins of the training set. The output of the RF for a test protein was the probability of the protein belonging to the positive class. This value was then used as a ranking variable for ROC analysis and for the calculation of the AUC value. Breiman,L. (2001) Random forests. Machine Learning, 45, 5-32. Witten,I.H. and Frank, E. (1999) Data Mining: Practical Machine Learning tools and Techniques with JAVA implementations, Morgan Kaufman. |
||||||||||||||||||||||||||||||||||||||||
| [9] Artificial Neural Networks | The Artificial Neural Network (ANN) classifier consists of connected artificial neurons built in a multi-layer structure (Bishop, 1995). A three layer ANN was used and the number of neurons within the hidden layer was set to 20. The ANN algorithm was used as implemented in the WEKA software package (Witten and Frank, 1999, Version 3.4.7, December 2005, downloaded from http://www.cs.waikato.ac.nz/ml/weka/). Each protein was represented by a feature vector whose components were similarity/distance scores computed between the protein and the proteins of the training set. The output of the ANN for a test protein was the probability of the protein belonging to the positive class. This value was then used as a ranking variable for ROC analysis and for the calculation of the AUC value. Bishop,C.M. (1995) Neural Networks for Pattern Recognition. Clarendon Press, Oxford. Witten,I.H. and Frank, E. (1999) Data Mining: Practical Machine Learning tools and Techniques with JAVA implementations, Morgan Kaufman. |
||||||||||||||||||||||||||||||||||||||||
| [10] Logistic Regression | The Logistic Regression (LogReg) is a variation of ordinary regression which is used when the response variable is a dichotomous (e.g. -1 and 1) and the input variable are continuous, categorical or both (Rice, 1994). The LogReg algorithm was used as implemented in the WEKA software package (Witten and Frank, 1999, Version 3.4.7, December 2005, downloaded from http://www.cs.waikato.ac.nz/ml/weka/). Each protein was represented by a feature vector whose components were similarity/distance scores computed between the protein and the proteins of the training set. The output of the LogReg for a test protein was the probability of the test protein belonging to the positive class. This value was then used as a ranking variable for ROC analysis and for the calculation of the AUC value. Rice, J. C. (1994). Logistic regression: An introduction. In B. Thompson, ed., Advances in social science methodology, Vol. 3: 191-245. Greenwich, CT: JAI Press. Popular introduction. Witten,I.H. and Frank, E. (1999) Data Mining: Practical Machine Learning tools and Techniques with JAVA implementations, Morgan Kaufman. |
||||||||||||||||||||||||||||||||||||||||
| [11] Performance Evaluation | The evaluation of classification performance was carried out by the standard receiver operator characteristic (ROC) analysis (for an introduction see (Duda, et al., 2000)). This method is designed to test the ranking ability of a given classifier based on a real-valued ranking parameter. In the case of nearest neighbour classification, the ranking parameter was a similarity/distance parameter calculated between an object and the nearest member of the positive training set (outlier detection). Briefly, the analysis is carried out by plotting sensitivity vs 1-specificity at various threshold values, then the resulting curve is integrated to give an “area under curve” or AUC value. These values are determined for each classification experiment. For a perfect ranking, AUC=1.0, for random ranking AUC=0.5 (Egan, 1975). Duda, R.O., Hart, P.E. and Stork, D.G. (2000) Pattern Classification. John Wiley & Sons, New York. Egan, J.P. (1975) Signal Detection theory and ROC Analysis. New York. |
||||||||||||||||||||||||||||||||||||||||