
Protein Classification Benchmark Collection


Record Formats

The Protein Classification Benchmark Collection

The Protein Classification Benchmark Collection consists of records. The record structure approximately follows the conventions of the EMBL/SwissProt/UniProt databases.

The content of each record is a benchmark test, which is an ensemble of several (10 to 250) classification tasks. A classification task is a subdivision of a dataset into +train, +test, -train and -test groups. Each record comprises a) a dataset (sequences or structures); b) a cast matrix containing the benchmark test; c) at least one distance matrix (an all vs. all comparison of the dataset); and d) at least one results file containing the evaluation of the benchmark test.
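
To make the notion of a classification task concrete, the following Python sketch models one task as an assignment of each dataset item to one of the four groups. The class name, the item identifiers and the string encoding of the groups are illustrative assumptions; the actual layout of the cast matrix file is not specified here.

```python
from dataclasses import dataclass
from typing import Dict, List

# The four groups named in the text; encoding them as strings is an assumption.
GROUPS = ("+train", "+test", "-train", "-test")


@dataclass
class ClassificationTask:
    """One classification task: a group label for every dataset item (hypothetical model)."""
    name: str
    labels: Dict[str, str]  # item identifier -> one of GROUPS

    def members(self, group: str) -> List[str]:
        """Return the sorted identifiers of all items assigned to `group`."""
        if group not in GROUPS:
            raise ValueError(f"unknown group: {group}")
        return sorted(item for item, g in self.labels.items() if g == group)


# Made-up item identifiers, for illustration only.
task = ClassificationTask(
    name="task_1",
    labels={
        "seqA": "+train",
        "seqB": "+test",
        "seqC": "-train",
        "seqD": "-test",
        "seqE": "-train",
    },
)
negative_training = task.members("-train")
```

Under this model a benchmark test would simply be a list of such tasks over the same dataset.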

Each record consists of fields (corresponding to EMBL/SwissProt “lines”). For an easier overview, the fields are grouped into sections.

General Information section
Accession Number: A stable identifier of the record in the form PCB0001, PCB0002, etc.
Record Name: A short mnemonic description of the record content.
Created: The creation date of the record.
Updated: The date of the last update.
Description: A detailed definition of the record content.
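
As a small illustration of the accession number format, the check below accepts identifiers that look like the examples PCB0001 and PCB0002. The pattern (the prefix "PCB" followed by exactly four digits) is inferred from those two examples only and is an assumption, as is the function name.

```python
import re

# Assumed pattern: "PCB" plus exactly four digits, inferred from PCB0001 and PCB0002.
ACCESSION_RE = re.compile(r"PCB\d{4}")


def is_valid_accession(identifier: str) -> bool:
    """Return True if the whole string matches the assumed accession pattern."""
    return ACCESSION_RE.fullmatch(identifier) is not None


checks = {s: is_valid_accession(s) for s in ("PCB0001", "PCB1", "pcb0001")}
```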

Data section
Data description: The exact definition of the sequences/structures used in the record.
Download: A link to the sequence/structure file.
Subdivision into training and test groups section
Positive set: Narrative description of the positive set in terms of groups and subgroups, e.g. “Superfamilies subdivided into families” or “Folds randomly subdivided into 5 equal groups”.
Negative set: Narrative description of the negative set, e.g. “The rest of the database outside the superfamily divided in such a way that members of a family can be either -test or -train”.
Statistics field: Short statistics containing the number of classification tasks and the minimum, maximum and average size of the +train, +test, -train and -test groups.
Cast matrix: A link to the cast matrix file that contains the exact subdivision of the dataset.
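
The summary statistics described above could be computed along the following lines. This is a sketch under assumptions: each classification task is given as a plain list of group labels (one per dataset item), and the function name and the toy data are hypothetical rather than taken from the collection.

```python
from collections import Counter
from statistics import mean

GROUPS = ("+train", "+test", "-train", "-test")


def group_statistics(tasks):
    """Summarise group sizes over a list of tasks (each a list of group labels)."""
    stats = {"tasks": len(tasks)}
    for group in GROUPS:
        # Size of this group in every task; Counter returns 0 for absent labels.
        sizes = [Counter(labels)[group] for labels in tasks]
        stats[group] = {"min": min(sizes), "max": max(sizes), "mean": mean(sizes)}
    return stats


# Two toy tasks over a six-item dataset.
tasks = [
    ["+train", "+train", "+test", "-train", "-test", "-test"],
    ["+train", "+test", "+test", "-train", "-train", "-test"],
]
summary = group_statistics(tasks)
```

Here `summary["tasks"]` is the number of classification tasks, and each group entry holds the minimum, maximum and average group size across tasks.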
Distance matrix section

This section contains links to distance matrices available for the dataset. The distance matrices are named after the methods used to generate them, e.g. “BLAST - download matrix file SCOP95_BLAST.dmx”.

Results section

This section contains various table views of the evaluation results, grouped according to distance/similarity measures (e.g. BLAST, Smith-Waterman, Needleman-Wunsch, etc.) or to learning algorithms (e.g. Nearest Neighbour, Support Vector Machines, etc.).
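
To show how a distance matrix and a classification task combine into an evaluation, here is a minimal nearest-neighbour sketch. It assumes the distance matrix is already loaded as a square list of lists and that the task is a list of group labels in the same item order; accuracy is used only for simplicity and is not necessarily the measure reported in the results files. The function name and the toy data are hypothetical.

```python
def nearest_neighbour_accuracy(dist, labels):
    """Classify each test item by the sign (+/-) of its nearest training item."""
    train = [i for i, g in enumerate(labels) if g.endswith("train")]
    test = [i for i, g in enumerate(labels) if g.endswith("test")]
    correct = 0
    for t in test:
        # Training item with the smallest distance to the test item.
        nearest = min(train, key=lambda j: dist[t][j])
        # The first character of a label is its class sign.
        if labels[nearest][0] == labels[t][0]:
            correct += 1
    return correct / len(test)


# Toy all vs. all distance matrix for four items.
dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
labels = ["+train", "+test", "-train", "-test"]
accuracy = nearest_neighbour_accuracy(dist, labels)
```

In this toy case each test item lies closest to the training item of its own class, so both are classified correctly.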

Methods section

This section contains a detailed description of each method used to generate the data in the record, written in a form meant to facilitate the work of non-specialist users. Each field in a record contains a methods description, with references provided wherever appropriate.

References section

A list of references, each with a short tag hinting at the contents of the reference.

 
©2006 ICGEBNet