Information

SBASE

Underlying Theory


The data collection

SBASE is a collection of protein domain sequences gathered from the literature, from protein sequence databases and from genomic databases (Vlahovicek et al, 2002). The protein domains are defined by their sequence boundaries, as given by the publishing authors or in one of the primary sequence databases (Swiss-Prot, PIR, TREMBL etc.). Domain groups are included if they have well-defined sequence boundaries and if they can be distinguished from other sequences using a similarity search technique.

The SBASE database uses a set-theoretical approach for representing similarities, which in practical terms is extremely simple. Sequences are considered similar if they are members of a similarity group in which all or most sequences are similar to each other and less similar to other members of the database. The sequences that have an above-threshold BLAST similarity score to at least one member of the group are called the neighbourhood of the group. The sketch below shows such a neighbourhood; the similarities within the group (self-similarities) and those pointing to non-member neighbours (non-self similarities) are shown in different colours.
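As an illustration of the neighbourhood definition above, here is a minimal Python sketch. The function name `neighbourhood`, the `hits` dictionary layout, the sequence identifiers and the threshold value are all hypothetical and chosen only for this example; they are not part of SBASE.

```python
def neighbourhood(group, hits, threshold):
    """Return every sequence that scores above `threshold` against
    at least one member of `group`.

    `hits` maps an unordered pair of sequence IDs to a similarity score
    (a hypothetical layout, not the SBASE storage format).
    """
    members = set(group)
    result = set()
    for (a, b), score in hits.items():
        if score <= threshold:
            continue  # below-threshold similarities are ignored
        if a in members:
            result.add(b)
        if b in members:
            result.add(a)
    return result


# Toy data: d1..d3 form a domain group, x1..x3 are other database sequences.
hits = {
    ("d1", "d2"): 120.0,  # self-similarity
    ("d2", "d3"): 95.0,   # self-similarity
    ("d3", "x1"): 60.0,   # non-self similarity
    ("x1", "x2"): 80.0,   # does not involve the group
    ("d1", "x3"): 20.0,   # below threshold
}
group = {"d1", "d2", "d3"}

hood = neighbourhood(group, hits, threshold=50.0)
non_member_neighbours = hood - group
print(sorted(hood), sorted(non_member_neighbours))
```

In this toy case the neighbourhood contains the three group members plus `x1`, and `x1` is the only non-member neighbour.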

The numeric representation of similarity scores for each domain group is based on two quantities. NSD is the number of within-group similarities, or self-similarities (scores above the default threshold of BLAST). AVS is the average of these scores. One can also determine the respective quantities for the non-member neighbours of the domain group. If we plot the AVS and NSD values against each other, we obtain a 2-D graph, which is in fact a local representation of the similarity space around the domain group in question. An example is shown here:
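The two quantities can be sketched in a few lines of Python. This sketch assumes that NSD and AVS are computed per sequence against the members of one group, so that each sequence becomes one point in the AVS/NSD plot; the function name `nsd_avs`, the `hits` layout and the numbers are hypothetical.

```python
from statistics import mean


def nsd_avs(seq_id, group, hits, threshold):
    """Return (NSD, AVS) for one sequence relative to one domain group.

    NSD: number of above-threshold similarities to group members.
    AVS: average of those scores (0.0 when there are none).
    """
    scores = []
    for (a, b), score in hits.items():
        if score <= threshold:
            continue
        if a == seq_id:
            partner = b
        elif b == seq_id:
            partner = a
        else:
            continue  # this similarity does not involve seq_id
        if partner in group and partner != seq_id:
            scores.append(score)
    nsd = len(scores)
    avs = mean(scores) if scores else 0.0
    return nsd, avs


# Toy data (assumed): d1..d3 are group members, x1 and x2 are not.
hits = {
    ("d1", "d2"): 120.0,
    ("d2", "d3"): 95.0,
    ("d3", "x1"): 60.0,
    ("x1", "x2"): 80.0,
}
group = {"d1", "d2", "d3"}

member_point = nsd_avs("d2", group, hits, threshold=50.0)     # a group member
neighbour_point = nsd_avs("x1", group, hits, threshold=50.0)  # a non-member neighbour
print(member_point, neighbour_point)
```

Plotting such (NSD, AVS) pairs for all members and all non-member neighbours would give the kind of 2-D picture described above, with members expected to lie at higher NSD and AVS values.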

SBASE contains the domain sequences as well as various statistical parameters of the domain groups. In terms of pattern recognition, this approach is similar to the memory-based computing paradigm (Stanfill and Waltz, 1986). Since the neighbourhoods of the domain groups are represented as a network of their similarities, it can also be called a similarity network based approach (Murvai et al, 2001a).

Similarity search on SBASE

Similarity search is based on the BLAST program. The results of the BLAST search are processed to give domain similarities. When a protein sequence query is compared with SBASE, different parts of the protein may be similar to different domain groups.

The server analyzes each of the local similarities and compares them with a knowledge base of the self- and non-self similarities of the known domain groups, using various simple statistical comparisons. If the distribution of the local similarities (dashed lines) is similar to that of the self-similarities of a group (blue arrows), the domain is assigned to the query. If it is more similar to that of the non-self similarities (white arrows), the domain similarity is rejected.

The evaluation is based on the two quantities described above, NSD and AVS. SBASE uses threshold values in NSD as well as in AVS for each group, and only those domain similarities that fall above these thresholds are evaluated. In addition, the distributions of NSD and AVS values are stored for all groups (self-similarities) as well as for their non-member neighbours (non-self similarities). Using these precalculated distributions, one can determine an approximate probability for a score to be a self or a non-self value. Using these probability values, a cumulative score is calculated (Murvai et al, 2000), and the query is assigned to the group with the highest cumulative score. Alternatively, we can feed the NSD and AVS values, as well as the probability values, into a neural network that decides about the group membership (Murvai et al, 2001b).

Why is this analysis fast? First, only one BLAST search is required; all subsequent calculations are in fact quite short. Second, a subsequence of a query will be similar to only a very few groups, so the detailed comparison will affect only a few groups.
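The evaluation steps can be illustrated with a rough Python sketch. It is not the published scoring scheme: the text does not give the formula of the cumulative score, so the sketch assumes a sum of log-odds, and it assumes a smoothed histogram estimate for the self probability. All names (`self_probability`, `cumulative_score`, `classify`), the bin widths, the group labels and the numbers are hypothetical.

```python
import math


def self_probability(value, self_values, nonself_values, bin_width):
    """Approximate P(self | value) from stored samples by counting how many
    self and non-self values fall into the same histogram bin as `value`."""
    lo = math.floor(value / bin_width) * bin_width
    hi = lo + bin_width
    n_self = sum(lo <= v < hi for v in self_values)
    n_non = sum(lo <= v < hi for v in nonself_values)
    # Add-one smoothing keeps the estimate strictly between 0 and 1.
    return (n_self + 1) / (n_self + n_non + 2)


def cumulative_score(query, stats):
    """Return an assumed log-odds score, or None if the query does not
    fall above the group's NSD and AVS thresholds."""
    if query["nsd"] <= stats["min_nsd"] or query["avs"] <= stats["min_avs"]:
        return None
    p_nsd = self_probability(query["nsd"], stats["self_nsd"], stats["nonself_nsd"], bin_width=2)
    p_avs = self_probability(query["avs"], stats["self_avs"], stats["nonself_avs"], bin_width=25)
    return math.log(p_nsd / (1 - p_nsd)) + math.log(p_avs / (1 - p_avs))


def classify(queries, knowledge_base):
    """Score the query against every candidate group and pick the best one."""
    scored = {}
    for name, query in queries.items():
        score = cumulative_score(query, knowledge_base[name])
        if score is not None:
            scored[name] = score
    best = max(scored, key=scored.get) if scored else None
    return best, scored


# Hypothetical precalculated statistics for three domain groups.
knowledge_base = {
    "groupA": {"min_nsd": 4, "min_avs": 100.0,
               "self_nsd": [8, 9, 10, 10, 11], "nonself_nsd": [1, 2, 2, 3],
               "self_avs": [180.0, 200.0, 210.0, 220.0], "nonself_avs": [60.0, 70.0, 75.0]},
    "groupB": {"min_nsd": 3, "min_avs": 80.0,
               "self_nsd": [5, 6, 6, 7], "nonself_nsd": [3, 4, 4, 5],
               "self_avs": [120.0, 130.0, 140.0], "nonself_avs": [85.0, 90.0, 110.0]},
    "groupC": {"min_nsd": 5, "min_avs": 150.0,
               "self_nsd": [7, 8], "nonself_nsd": [2, 3],
               "self_avs": [160.0, 170.0], "nonself_avs": [90.0]},
}
# NSD and AVS of the query's local similarities to each candidate group.
queries = {
    "groupA": {"nsd": 10, "avs": 205.0},
    "groupB": {"nsd": 4, "avs": 90.0},
    "groupC": {"nsd": 3, "avs": 95.0},
}

best, scored = classify(queries, knowledge_base)
print(best, scored)
```

In this toy example the query falls below the thresholds of `groupC`, so that group is never scored, and the query is assigned to `groupA`, whose stored self values resemble the query far better than those of `groupB`. Only the few groups hit by the single BLAST search appear in `queries`, which is why the detailed comparison stays cheap.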


Mail your comments and suggestions to the SBASE team
© 2006 Mircea Pacurar, Laszlo Kajan, Sandor Pongor, ICGEB Trieste