
Protein Classification Benchmark Collection

General Information

 

The Protein Classification Benchmark Collection was created to provide standard datasets on which the performance of machine learning methods can be compared.

 

The collection contains datasets of sequences and structures, each subdivided into positive/negative training/test sets. Such a subdivision is called a classification task. Typical tasks include the classification of structural domains in the SCOP and CATH databases based on their sequences, as well as various functional and taxonomic classification tasks.

 

A performance evaluation on an entire database can involve many different classification tasks. These ensembles of classification tasks are encoded in a simple matrix format, called the cast matrix or membership table, that specifies the role of each sequence (or structure) in the different calculations. Each column of this matrix is a subdivision of the objects (rows) into positive/negative training/test sets. Typically, a database record contains such an ensemble of classification tasks, encoded in a single cast matrix.
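
The cast matrix idea can be illustrated with a minimal Python sketch. The numeric role codes, object identifiers, and the helper split_task below are hypothetical and chosen only for illustration; the encoding used in the collection's actual files may differ.

```python
# Hypothetical role codes; the collection's actual encoding may differ.
ROLES = {
    0: "unused",
    1: "positive_train",
    2: "negative_train",
    3: "positive_test",
    4: "negative_test",
}

# Toy cast matrix: rows are objects (sequences or structures),
# columns are classification tasks. All identifiers are made up.
cast_matrix = {
    "seq_A": [1, 2],
    "seq_B": [1, 4],
    "seq_C": [3, 2],
    "seq_D": [2, 1],
    "seq_E": [4, 3],
}


def split_task(cast, task_index):
    """Group object identifiers by their role in one task (one column)."""
    groups = {name: [] for name in ROLES.values()}
    for obj_id, row in cast.items():
        groups[ROLES[row[task_index]]].append(obj_id)
    return groups


task0 = split_task(cast_matrix, 0)
task1 = split_task(cast_matrix, 1)
print(task0["positive_train"], task0["positive_test"])
```

Reading one column at a time in this way yields the four sets that define a single classification task, so one matrix can describe as many tasks as it has columns.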

 

In addition, there is a collection of distance matrices that contain all vs. all comparisons of the datasets, computed with methods such as BLAST, Smith-Waterman, and 3D structure comparison.

 

Evaluating a method on a given database consists of calculating a performance measure, such as the area under the receiver operating characteristic (ROC) curve (AUC). Evaluation results are deposited along with the data; each dataset is evaluated by at least one classification method, such as 1NN (nearest neighbour), SVM (support vector machines), ANN (artificial neural networks), or RF (random forests).
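
As a rough illustration of such an evaluation, the sketch below scores test objects with a simple 1NN rule on a precomputed distance matrix and then computes the ROC AUC by pairwise comparison. The distance values, object names, and the particular scoring rule (distance to the nearest negative minus distance to the nearest positive) are assumptions made for this example, not necessarily the procedure used for the deposited results.

```python
def one_nn_scores(dist, train_pos, train_neg, test_ids):
    """Score each test object; higher means nearer to a positive
    training object than to any negative training object."""
    scores = {}
    for t in test_ids:
        d_pos = min(dist[t][p] for p in train_pos)
        d_neg = min(dist[t][n] for n in train_neg)
        scores[t] = d_neg - d_pos
    return scores


def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive scores higher; ties count as one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy distances from each test object to each training object (made up).
dist = {
    "e": {"a": 0.2, "c": 0.9},
    "g": {"a": 0.6, "c": 0.5},
    "f": {"a": 0.8, "c": 0.3},
    "h": {"a": 0.4, "c": 0.45},
}
train_pos, train_neg = ["a"], ["c"]
test_pos, test_neg = ["e", "g"], ["f", "h"]

scores = one_nn_scores(dist, train_pos, train_neg, test_pos + test_neg)
auc = roc_auc([scores[t] for t in test_pos], [scores[t] for t in test_neg])
print(auc)
```

The pairwise AUC is quadratic in the number of test objects; for large test sets a rank-based computation would be the usual choice, but the pairwise form keeps the definition visible.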

 

There are small datasets meant for program developers, as well as downloadable programs for various classification algorithms.


©2006 ICGEBNet