Probability of Identity #2
Protein fold similarity
Web server
 

PRIDE2 Server description

The PRIDE2 Server uses the improved PRIDE method to compare protein structures to each other or to those in a database. The principle of the method is to compare the distributions of Cα(i)-Cα(i+n) distances in the two structures for n=3..30 (28 distributions). The average of the 28 resulting probabilities gives the PRIDE score. The first version of the method uses histograms and is described in more detail in the paper:

Carugo O and Pongor S, Protein fold similarity estimated by a probabilistic approach based on Cα-Cα distance comparison, J. Mol. Biol. (2002) 315:887-898.

The PRIDE2 approach compares unbinned distance distributions using the Kuiper variant of the Kolmogorov-Smirnov test. This is described in:

Gaspari Z, Vlahovicek K and Pongor S, Efficient recognition of folds in protein 3-D structures by the improved PRIDE algorithm, Bioinformatics (2005) 21:3322-3323.
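
The comparison described above can be sketched in a few lines of Python. This is only an illustration, not the server's C++ code: the function names are hypothetical, the input is assumed to be a list of already extracted Cα coordinates per structure, and the significance formula is the standard asymptotic approximation for the two-sample Kuiper statistic, which the server may not use in exactly this form.

```python
import math

def ca_distances(coords, n):
    """Distances between Cα atom i and Cα atom i+n for every valid i."""
    return [math.dist(coords[i], coords[i + n]) for i in range(len(coords) - n)]

def kuiper_statistic(a, b):
    """Two-sample Kuiper statistic V = D+ + D- computed on unbinned samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d_plus = d_minus = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        diff = i / len(a) - j / len(b)  # difference of the two empirical CDFs at x
        d_plus = max(d_plus, diff)
        d_minus = max(d_minus, -diff)
    return d_plus + d_minus

def kuiper_probability(v, n1, n2):
    """Asymptotic probability that both samples come from the same distribution
    (assumed formula, not confirmed to be the one used by the server)."""
    ne = n1 * n2 / (n1 + n2)
    lam = (math.sqrt(ne) + 0.155 + 0.24 / math.sqrt(ne)) * v
    if lam < 0.4:
        return 1.0  # the series is numerically 1 in this range
    total = sum(2.0 * (4.0 * j * j * lam * lam - 1.0) * math.exp(-2.0 * j * j * lam * lam)
                for j in range(1, 101))
    return min(1.0, max(0.0, total))

def pride_score(coords_a, coords_b, n_min=3, n_max=30):
    """Average of the per-n probabilities for n = 3..30 (28 distributions)."""
    if min(len(coords_a), len(coords_b)) <= n_max:
        raise ValueError("both structures need more than n_max Cα atoms")
    probabilities = []
    for n in range(n_min, n_max + 1):
        da, db = ca_distances(coords_a, n), ca_distances(coords_b, n)
        probabilities.append(kuiper_probability(kuiper_statistic(da, db), len(da), len(db)))
    return sum(probabilities) / len(probabilities)
```

Comparing a structure with itself gives a Kuiper statistic of zero for every n and therefore a score of 1, while two structures with non-overlapping distance distributions score close to 0.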


Available search modes in the PRIDE2 Server

The PRIDE2 Server supports three basic search modes:
  1. A set of user-submitted structures can be compared to each other
  2. Fold identification: search against the CATHselect database
  3. PDBselect search
Detailed description of the basic modes:
  1. All-against-all comparison of user-submitted structures. This mode allows the user to submit a set of PDB structures for comparison with each other using the PRIDE2 method (i.e. the Kuiper variant of the Kolmogorov-Smirnov test). This mode requires user input preparation: at least 2 but no more than 50 single-chain PDB files should be concatenated (including HEADER and END records for each structure) and submitted as a single file. After preprocessing, an all-against-all PRIDE2 search is performed. The output consists of a distance matrix, a PRIDE2 matrix and a tree built from the distance matrix by the NEIGHBOR program of the PHYLIP package.

  2. Fold identification means searching the query/queries against the CATHselect database. The server reports the best hits at different 'H' levels in the CATH hierarchy. Two search methods are available: H-level filtered and full database search:
    The H-level filtered search is performed in two steps. First, a histogram-based PRIDE search is performed with the query against average histograms derived from CATH domains at the same 'H' level (1572 average histograms). Second, a PRIDE2 search is run on the CATHselect database considering only those CATH domains that fall into one of the 10 best 'H' levels identified in the first step. This search mode typically requires 1572 + 100-500 ≈ 2000 comparisons, considerably fewer than the 17844 needed for a full database search. However, this mode assumes that the group average histogram gives a high score with a query structure belonging to that group, which is not always the case.
    In the full database search mode, all CATHselect domains are compared to the query, meaning 17844 comparisons per query structure. This mode is considerably slower than the H-level filtered search but is reliable regardless of the usability of the group averages for the particular query.
    The user may submit a single (one- or multichain) PDB structure or a file containing concatenated PDB structures. In the former case, a sliding window approach with adjustable parameters ('Window' and 'Slide') is available to identify multiple domains on the polypeptide chains. The output consists of a graphical overview of the distribution of the best hits along each chain of the query and a listing of the best hits with the corresponding PRIDE2 values. For the CATHselect database, only the best hit from those at the same CATH 'H' level is shown. The reliability shown is calculated from the PRIDE2 values of hits at the same H level and the PRIDE of the query with the corresponding group average histogram (H-level filtered search only). (This can also be switched off.) When submitting a single structure, generally no specific input preparation is required, as extensive preprocessing is done by the server (a PDB file directly obtainable from the PDB database is acceptable in most cases; for more details, see the notes on PDB format).
    Multistructure search requires user input preparation: no more than 50 single-chain PDB files should be concatenated (including HEADER and END records for each structure) and submitted as a single file. No graphics and reliability values are output in this mode. This feature may be useful when the user has an idea of the domain composition of the input structure and has split the structure accordingly before submission to the PRIDE2 server.

  3. In PDBselect search, the query/queries are scanned against the PDBselect database. The single- and multistructure options are available as in the fold identification mode. The sliding window option can be used for single structure search. No graphics and reliability values are output in this mode.
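
The two-step logic of the H-level filtered search in fold identification can be sketched as follows. This is a minimal sketch under stated assumptions: the function name, the data layout (a dictionary of group averages and a list of domain tuples) and the two scoring callables `pride_hist` and `pride2` are hypothetical placeholders, not the server's actual interface.

```python
def h_level_filtered_search(query, group_averages, domains, pride_hist, pride2, top_levels=10):
    """Two-step search: rank H levels by group average, then score their members.

    group_averages: {h_level: average_histogram}            (hypothetical layout)
    domains:        [(domain_id, h_level, domain_data)]     (hypothetical layout)
    pride_hist, pride2: callables returning a score in [0, 1] (assumed)
    """
    # Step 1: one histogram-based comparison per H-level group average.
    ranked = sorted(group_averages,
                    key=lambda h: pride_hist(query, group_averages[h]),
                    reverse=True)
    selected = set(ranked[:top_levels])

    # Step 2: PRIDE2 comparison only against domains in the selected H levels.
    hits = [(pride2(query, data), domain_id, h)
            for domain_id, h, data in domains if h in selected]
    hits.sort(reverse=True)
    return hits
```

The number of comparisons is the number of group averages plus the number of domains in the selected H levels, which is where the saving over a full database search comes from. A domain whose group average scores poorly against the query is never examined in step 2, which is the limitation noted above.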



Searchable databases

Currently, two databases can be selected in the PRIDE2 Server: CATHselect (for fold identification) and PDBselect.
  1. The CATHselect database is derived from the CATH database (version 3.0.0). CATHselect contains only one domain at each 'I' level: when possible, a high-resolution one with a length close to the average of the 'I'-level group. Thus, each domain in CATHselect has a unique 'CATHSNI' classification. Domains were downloaded from the CATH web site and processed using the same principles as for input structures. The 'clean' domains were then used to generate the two database files, one with arrays for each structure for Kolmogorov-Smirnov search and one with average histograms for domains at the same 'H' level (used for H-level filtered search only). The database contains 23928 structures in 2147 H-level groups. Update history:

    • August 24, 2006: CATHselect version 3.0.0 replaced 2.6.0
    • June 21, 2005: CATHselect version 2.6.0 replaced 2.5.1

  2. The PDBselect database is based on the PDB SELECT list (2006 March release) and contains 3080 structures (chains). This database can be searched only by full search. Update history:

    • September 21, 2006: the 2006 March release replaced the 2005 July one
    • November 25, 2005: the 2005 July release replaced the 2004 October one



The sliding window

The sliding window approach is designed to allow identification of domains constituting only a segment of the input structure. The 'Window' parameter defines the length of the segments to be used (the number of consecutive amino acid residues) and the 'Slide' parameter determines the start point of the next segment. The default values are 160 for Window and 80 for Slide. In the CATHselect database, the average domain size is 160 ± 92. If the 'Slide' value is set to zero, it will be automatically reset to Window/2. If the 'Window' parameter is set to zero, no segment search will be performed (i.e. each chain of the query will be treated as a whole).
The 'Window' parameter may be crucial in finding domain similarities as it may be too small or too large to represent the domain actually present in the given region of the query chain. It is important to emphasise that the sliding window approach can yield only approximate results and it is advised that the user splits the query protein into domains by independent methods and submits the resulting domains as a 'Multi scan' query.
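
The segment boundaries produced by the two parameters can be sketched as below. The function name is hypothetical, and three details are assumptions that the text does not specify: Window/2 is taken as integer division, only full-length windows are generated, and a chain no longer than the window is treated as a whole.

```python
def window_segments(chain_length, window=160, slide=80):
    """Return (start, end) residue index pairs, end exclusive, for one chain."""
    if window == 0 or chain_length <= window:
        return [(0, chain_length)]      # no segment search: whole chain
    if slide == 0:
        slide = window // 2             # Slide = 0 is reset to Window/2
    return [(start, start + window)
            for start in range(0, chain_length - window + 1, slide)]
```

With the default values, a 400-residue chain yields four segments starting at residues 0, 80, 160 and 240. Counting the segments over all chains before submission gives a rough idea of how many searches a query will trigger, if each segment counts as one search (an assumption).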



Input file preparation guidelines

For the Single scan mode, basically no input preparation is required. However, the number of searches is limited to 100, which may be exceeded when a large input structure is combined with small Window/Slide parameters. Moreover, the character set allowed in the input file is restricted to uppercase letters and other common characters in PDB files, thus 'exotic' chain identifiers ("|", "#" etc.) cause input file rejection by the server. This should not be a common problem even with files directly obtained from the PDB.
If prepared by the user, input files for Single scan should start with a 'HEADER' and end with an 'END' record. It is important to note that input file reading stops at the first 'END'/'ENDMDL' line to ensure that only one structure is actually processed.
For the Multi scan and Cluster modes, user input preparation is required, as these modes accept a single file containing concatenated PDB structures, each starting with a 'HEADER' and ending with an 'END' record. No multi-chain structures are allowed: only one chain identifier may be present between each 'HEADER'-'END' pair.
As only the lines containing the coordinates of Cα atoms are used in the comparison, CA-only structures are accepted and the side-chain coordinates do not affect the PRIDE2 calculations.
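
A concatenated input file for the Multi scan and Cluster modes could be assembled along the following lines. This is a sketch only: the function names and the `minimum` parameter are hypothetical, each input is assumed to be a single-model, single-chain PDB file given as a list of lines, and the minimal 'HEADER' line generated for files that lack one is an assumption rather than a format required by the server.

```python
MAX_STRUCTURES = 50  # upper limit stated for concatenated input files

def chain_ids(lines):
    """Chain identifiers (column 22 of the PDB format) found in ATOM records."""
    return {line[21] for line in lines if line.startswith("ATOM") and len(line) > 21}

def build_multi_input(structures, minimum=1):
    """Concatenate (label, lines) pairs into one HEADER ... END delimited text.

    Use minimum=2 for all-against-all comparison, which needs at least 2 structures.
    """
    if not minimum <= len(structures) <= MAX_STRUCTURES:
        raise ValueError(f"expected {minimum} to {MAX_STRUCTURES} structures")
    blocks = []
    for label, lines in structures:
        lines = [line.rstrip("\n") for line in lines]
        headers = [line for line in lines if line.startswith("HEADER")]
        body = [line for line in lines if not line.startswith(("HEADER", "END"))]
        if len(chain_ids(body)) != 1:
            raise ValueError(f"{label}: exactly one chain identifier is required")
        header = headers[0] if headers else f"HEADER    {label}"  # assumed minimal header
        blocks.append("\n".join([header, *body, "END"]))
    return "\n".join(blocks) + "\n"
```

Each structure ends up between exactly one 'HEADER' and one 'END' line, and inputs with zero or several chain identifiers are refused before submission instead of being rejected by the server.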



Input preprocessing

Input preprocessing is intended to ensure error-free operation of the C++ server program by filtering input PDB files. This includes standardization of residue numbering (i.e. renumbering the CA atoms in the PDB structure when negative residue numbers or non-numeric characters in the residue number field are found [e.g. 22, 22A, 23 etc.]), elimination of multiple conformers (only the first of these is retained) and passing only the HEADER, END and Cα ATOM lines to the C++ server for each structure. In the 'Multi scan' and 'Cluster' modes, input files with more than one chain in any of the input structures and queries with more than 50 structures are rejected. The server is designed to report on input filtering (if filtering was necessary, a file and a corresponding link 'Messages generated during input processing' is created) and on the types of errors that prevented the actual comparison.
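
The filtering steps listed above can be approximated for a single structure as follows. This is a hedged sketch, not the server's preprocessing code: the function names are hypothetical, the column positions follow the standard PDB ATOM record layout, residues are renumbered sequentially from 1 (the actual renumbering scheme is not described), and reading stops at the first 'END'/'ENDMDL' line as in Single scan mode.

```python
def needs_renumbering(ca_lines):
    """True if any residue number is negative or carries a non-numeric character."""
    for line in ca_lines:
        field = line[22:27]             # residue number plus insertion code
        try:
            if int(field) < 0:
                return True
        except ValueError:              # e.g. '22A'
            return True
    return False

def preprocess(lines):
    """Keep HEADER, one Cα ATOM line per residue, and END for one structure."""
    header, ca_lines, seen = None, [], set()
    for line in lines:
        line = line.rstrip("\n")
        record = line[:6].strip()
        if record == "HEADER" and header is None:
            header = line
        elif record in ("END", "ENDMDL"):
            break                       # only the first structure/model is read
        elif record == "ATOM" and line[12:16].strip() == "CA":
            residue = (line[21], line[22:27])   # chain, residue number + insertion code
            if residue not in seen:     # later conformers of the same residue are dropped
                seen.add(residue)
                ca_lines.append(line)
    if needs_renumbering(ca_lines):
        ca_lines = [line[:22] + f"{number:4d} " + line[27:]
                    for number, line in enumerate(ca_lines, start=1)]
    return [header or "HEADER"] + ca_lines + ["END"]
```

For the example in the text, Cα lines numbered 22, 22A and 23 come out renumbered as 1, 2 and 3, while a structure that is already numbered with plain non-negative integers passes through unchanged.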



PRIDE2 output interpretation

The interpretation of server output should primarily be based on the PRIDE2 values reported for the structure pairs. As its name (PRobability of IDEntity) indicates, the PRIDE2 score should be treated as a probability of the two structures being identical. A PRIDE2 score > 0.9 generally indicates close structural similarity. Values in the range of 0.6 - 0.8 may also indicate similarities, though these should be treated more carefully. In general, if a database search does not yield a hit with a PRIDE2 score > 0.85, then the query should not be regarded as having an identified fold before further investigation. Note that the H level-filtered search is not always capable of finding the closest similarities; thus, if no hits with PRIDE2 > 0.9 are found with this method, it is recommended to run a full search on the CATHselect database. It is also possible that a fold is represented in only one of the CATHselect/PDBselect databases (which is why both are available for searching). The fact that PDBselect contains full polypeptide chains rather than domains, as CATHselect does, and the release dates of the two databases should also be taken into account when interpreting database hits.
In the H level-filtered mode a reliability score is also calculated and is presented as a set of stars. The reliability is calculated by summing the PRIDE2 values obtained with the hits belonging to the same H level and multiplying the sum by the PRIDE value obtained with the corresponding H level average histogram. Thus, if the average PRIDE and individual PRIDE2 scores at a given H level are high, reliability is also high. Note that reliability depends on the number of structures found (and in the database) at the given H level. The reliability value may be an indication of structural similarity if the associated PRIDE2 score is relatively low, but a high reliability on its own does not prove close structural similarity (it is the PRIDE2 score that does!).
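
The reliability calculation described above amounts to a one-line formula. The sketch below uses a hypothetical function name and made-up example scores. It does not cover how the numeric value is converted into stars, since that mapping is not described here.

```python
def reliability(pride2_hits, group_average_pride):
    """Sum of PRIDE2 values of hits at one H level times the PRIDE value
    obtained with that H level's average histogram."""
    return sum(pride2_hits) * group_average_pride

# Made-up scores: more hits at the same H level raise the value even when the
# individual PRIDE2 scores are not higher.
few_hits = reliability([0.9, 0.8], 0.5)
many_hits = reliability([0.9, 0.8, 0.8, 0.7], 0.5)
```

Because the value grows with the number of hits, it is not comparable between H levels of very different sizes, which is why the PRIDE2 score remains the primary criterion.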
The plain server output is also available in the fold identification and PDBselect search modes. It contains the sorted PRIDE2 values (the number of which is determined by the best hits option) in the order of the input structures.