Data-file formats
The database provides data files that are accessible via links from the database site. There are separate files for sequences (*.fasta), structures (*.pdb), cast matrices (*.cm) and distance matrices (*.dst). The files are given mnemonic names; e.g. 3PGK.SW.dst denotes a distance matrix of 3-phosphoglycerate kinase (3PGK) sequences, calculated with the Smith-Waterman algorithm.
Note that the serial order of the entries in the sequence/structure files is the same as the serial order of the rows in the cast matrix file and of the rows and columns in the distance matrix file.
Sequences: Concatenated FASTA
Structures: PDB format
Cast matrices: Tab-delimited ASCII files, with headers. The header line contains the names of the classification experiments, each of which is represented by one column of the cast matrix. The classification experiments are named according to the group used as the positive set and the subgroup used as the positive test set, using the general form "group_subgroup". For example, a.1.1_a.1.1.1 denotes a classification experiment where the positive set is the a.1.1 group of the database and the positive test set is the a.1.1.1 group. Similarly, Archaea_Euryarchaeota denotes a classification experiment in which the positive set consists of the Archaea sequences and the positive test set consists of the Euryarchaeota sequences. The first name in the header line is "ID".
Each line of the cast matrix corresponds to a sequence or structure specified by the row name (first column). The row names are those used in the corresponding sequence (*.fasta) or structure (*.pdb) file, and the serial order of the rows is identical to that used in those files.
The values stored in the cells of each column (classification experiment,
specified by the column header) are integers that denote the role that a
sequence or structure plays in the given experiment: "0" = no role in the
classification experiment; "1" = +train (positive training set); "2" =
-train (negative training set); "3" = +test (positive test set); "4" =
-test (negative test set).
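The cast matrix layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of the database distribution: the function names (read_cast_matrix, split_by_role), the label "unused" for code 0, and the sample IDs and values are all assumptions made for illustration.

```python
import csv
import io

# Role codes as documented for cast matrix cells.
# "unused" is our own label for code 0 (no role in the experiment).
ROLES = {0: "unused", 1: "+train", 2: "-train", 3: "+test", 4: "-test"}


def read_cast_matrix(handle):
    """Return (row_ids, experiments); experiments maps each column name
    to a list of integer role codes in row order."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    names = header[1:]  # header[0] is the "ID" label
    row_ids = []
    experiments = {name: [] for name in names}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        row_ids.append(row[0])
        for name, cell in zip(names, row[1:]):
            experiments[name].append(int(cell))
    return row_ids, experiments


def split_by_role(row_ids, codes):
    """Group row IDs by the role they play in one experiment."""
    groups = {label: [] for label in ROLES.values()}
    for row_id, code in zip(row_ids, codes):
        groups[ROLES[code]].append(row_id)
    return groups


# Tiny made-up example (IDs and values are illustrative, not real data).
sample = (
    "ID\ta.1.1_a.1.1.1\tArchaea_Euryarchaeota\n"
    "seq1\t1\t0\n"
    "seq2\t3\t2\n"
    "seq3\t2\t4\n"
    "seq4\t4\t1\n"
)
row_ids, experiments = read_cast_matrix(io.StringIO(sample))
groups = split_by_role(row_ids, experiments["a.1.1_a.1.1.1"])
print(groups["+train"], groups["+test"])
```

Because the row order is the same as in the *.fasta or *.pdb file, the position of an ID in row_ids can also be used as an index into the corresponding distance matrix.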
Distance matrices: Tab-delimited ASCII files, without headers [3PGK_PROTEIN_BLAST.dmx]
The serial order of the rows and columns is the same as the order of sequences or structures in the corresponding *.fasta or *.pdb files, respectively. There are no headers; the cells are either integers or real values, depending on the method used.
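Since the distance matrix has no headers, row and column labels have to be taken from the sequence or structure file. The sketch below reads a matrix and pairs it with FASTA record IDs. It rests on assumptions that the text does not confirm: that the record ID is the first whitespace-delimited token after ">" in each FASTA header, and that the matrix is square. The function names and the sample data are hypothetical.

```python
import csv
import io


def read_distance_matrix(handle):
    """Read a headerless, tab-delimited square matrix as rows of floats."""
    matrix = []
    for row in csv.reader(handle, delimiter="\t"):
        cells = [c for c in row if c.strip()]  # ignore empty trailing cells
        if cells:
            matrix.append([float(c) for c in cells])
    n = len(matrix)
    if any(len(r) != n for r in matrix):
        raise ValueError("distance matrix is not square")
    return matrix


def fasta_ids(handle):
    """Collect record IDs in file order.

    Assumption: the ID is the first token after '>' on each header line.
    """
    return [
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and line[1:].strip()
    ]


# Made-up data for illustration only.
matrix_text = "0\t2.5\t7\n2.5\t0\t4\n7\t4\t0\n"
fasta_text = ">seq1 demo\nMKV\n>seq2\nMKI\n>seq3\nMRV\n"

matrix = read_distance_matrix(io.StringIO(matrix_text))
ids = fasta_ids(io.StringIO(fasta_text))
if len(ids) != len(matrix):
    raise ValueError("sequence file and distance matrix differ in size")

# The shared serial order lets an ID be translated into a matrix index.
index = {seq_id: i for i, seq_id in enumerate(ids)}
print(matrix[index["seq1"]][index["seq3"]])  # 7.0
```

Integer cells are converted to floats here so that both kinds of file can be handled the same way.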
Results:
The results files contain the name of the record, the names of the distance
matrix and the cast matrix, and the performance measures (e.g. ROC AUC
values) for each classification experiment. The classification experiments
(rows) are listed in the same serial order as the columns of the cast
matrix, and are given the same names.
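To illustrate how a ROC AUC value relates to one cast matrix column and one distance matrix, here is a rough sketch. The scoring rule is an assumption chosen for simplicity (each test item is scored by its distance to the nearest +train item, with smaller distances counting as more similar); it is not necessarily the method used to produce the results files. For matrices that hold similarity scores instead of distances, the sign of the score would have to be flipped. All names and numbers are made up.

```python
def roc_auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as 0.5 (equivalent to the Mann-Whitney U statistic).
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def nearest_positive_scores(matrix, codes):
    """Score +test (3) and -test (4) items by closeness to the nearest
    +train (1) item. Assumed scoring rule, for illustration only."""
    train_pos = [i for i, c in enumerate(codes) if c == 1]
    pos, neg = [], []
    for i, c in enumerate(codes):
        if c in (3, 4):
            # Negate so that a smaller distance gives a higher score.
            score = -min(matrix[i][j] for j in train_pos)
            (pos if c == 3 else neg).append(score)
    return pos, neg


# One cast matrix column (role codes) and a matching 6 x 6 distance matrix.
codes = [1, 2, 3, 4, 3, 4]
matrix = [
    [0, 9, 1, 8, 6, 3],
    [9, 0, 8, 2, 5, 4],
    [1, 8, 0, 7, 4, 5],
    [8, 2, 7, 0, 3, 6],
    [6, 5, 4, 3, 0, 2],
    [3, 4, 5, 6, 2, 0],
]
pos, neg = nearest_positive_scores(matrix, codes)
auc = roc_auc(pos, neg)
print(auc)  # 0.75
```

Repeating this for every column of the cast matrix, in column order, would yield one value per classification experiment, which matches the row layout described for the results files.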