Record Formats
| The Protein Classification Benchmark Collection | |
The Protein Classification Benchmark Collection consists of records whose structure approximately follows the conventions of the EMBL/SwissProt/UniProt databases. The content of each record is a benchmark test, which is an ensemble of several (10 to 250) classification tasks. A classification task is a subdivision of a dataset into +train, +test, -train and -test groups. Each record comprises a) a dataset (sequences or structures); b) a cast matrix containing the benchmark test; c) at least one distance matrix (an all vs. all comparison of the dataset); and d) at least one results file containing the evaluation of the benchmark test. Each record consists of fields (corresponding to EMBL/SwissProt “lines”). For an easier overview, the fields are grouped into sections. |
|
| General Information section | |
| Accession Number | A stable identifier of the record in the form of PCB0001, PCB0002, etc. |
| Record Name | A short mnemonic description of the record content |
| Created | The creation date of the record |
| Updated | The date of the last update |
| Description | A detailed definition of the record content |
| Data section | |
| Data description | The exact definition of the sequences/structures used in the record. |
| Download | A link to the sequence/structure file. |
| Subdivision into training and test groups section | |
| Positive set | Narrative description of the positive set in terms of groups and subgroups, e.g. “Superfamilies subdivided into families” or “Folds randomly subdivided into 5 equal groups”. |
| Negative set | Narrative description of the negative set, e.g. “The rest of the database outside the superfamily, divided in such a way that members of a family can be either -test or -train”. |
| Statistics field | Summary statistics: the number of classification tasks and the minimum, maximum and average sizes of the +train, +test, -train and -test groups. |
| Cast matrix | Contains a link to the cast matrix file that contains the exact subdivision of the dataset. |
| Distance matrix section | |
This section contains links to the distance matrices available for the dataset. The distance matrices are named after the methods used to generate them, e.g. “BLAST - download matrix file SCOP95_BLAST.dmx”. |
|
| Results section | |
This section contains various table views of the evaluation results, grouped either by distance/similarity measure (e.g. BLAST, Smith-Waterman, Needleman-Wunsch, etc.) or by learning algorithm (e.g. Nearest Neighbour, Support Vector Machines, etc.). |
|
| Methods section | |
This section contains a detailed description of each method used to generate the data in the record, written in a form meant to facilitate the work of non-specialist users. Each field in a record contains a methods description provided with references whenever appropriate. |
|
| References section | |
A list of references with a short tag containing a hint to the contents of the reference. |
|
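The Statistics field described above summarises the group sizes across all classification tasks in a record. The following is a minimal sketch of how such a summary could be computed. It does not read the actual cast matrix file, whose on-disk layout is not specified here. Instead it assumes a hypothetical in-memory form in which each classification task is a dictionary mapping an object identifier to one of the four group labels. The names `GROUPS`, `group_sizes` and `summarize`, and the identifiers `d1` to `d5`, are illustrative only.

```python
from statistics import mean

# The four groups of a classification task, as named in the record description.
GROUPS = ("+train", "+test", "-train", "-test")


def group_sizes(task):
    """Count the members of each group in one classification task.

    `task` is an assumed representation: a dict mapping an object
    identifier to one of the four group labels.
    """
    sizes = {group: 0 for group in GROUPS}
    for label in task.values():
        if label not in sizes:
            raise ValueError(f"unknown group label: {label!r}")
        sizes[label] += 1
    return sizes


def summarize(tasks):
    """Return the number of tasks and the min/max/average size of each group."""
    if not tasks:
        raise ValueError("at least one classification task is required")
    per_task = [group_sizes(task) for task in tasks]
    summary = {"tasks": len(per_task)}
    for group in GROUPS:
        counts = [sizes[group] for sizes in per_task]
        summary[group] = {
            "min": min(counts),
            "max": max(counts),
            "avg": mean(counts),
        }
    return summary


# Two toy classification tasks over the same five (hypothetical) objects.
example_tasks = [
    {"d1": "+train", "d2": "+test", "d3": "-train", "d4": "-train", "d5": "-test"},
    {"d1": "+train", "d2": "+train", "d3": "+test", "d4": "-train", "d5": "-test"},
]

if __name__ == "__main__":
    for key, value in summarize(example_tasks).items():
        print(key, value)
```

A per-task dictionary keeps the sketch independent of any particular file layout; a real cast matrix would first have to be parsed into this form.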