TranScout - Appendix

ABSTRACT
Motivation: The advent of genomics has yielded thousands of reading frames in search of a function. Identifying conserved functional motifs in protein sequences can help predict function.
Results: A database and a classification of reported DNA-binding protein motifs have been designed. A programme (“TranScout”) has been developed to detect and evaluate conserved motifs in prokaryotic and eukaryotic sequences of proteins with a gene regulatory function. The efficiency of the programme is shown in a benchmark against a database obtained from SWISS-PROT, excluding the protein sequences used to train the programme. All motifs were detected with a mean sensitivity of 0.98 and a mean specificity of 0.92.
Availability: The programme is freely available for use on the Internet at http://bioinf.uab.es/transcout/. Additional information can be found at this site.

BUILDING THE PROFILES
We have an "initial set" of 70 aligned sequences corresponding to a zinc finger of the CCHC type, and we wish to create a PSSM (position-specific score matrix) from them. The steps are as follows:

a) Randomly choose a sample of 10% of the initial set (7 sequences). We will call this sample the "training set" from now on.

CRCWICNIEGHYANECPN
PVCFNCKKPGHLARQCRD
QPCFRCGKAGHWSRDCTQ
DQCAYCKEKGHWAKDCPK
DQCAYCKEKGHWAKDCPK
CKCYICGQEGHYANQCRN
IKCFNCGKEGHLARNCKA
b) The Voronoi method is applied to the training set to obtain a weight for each sequence. Since the Voronoi method relies on random sampling, results may differ each time the programme is run on the same set of sequences. For this example, with 1000 iterations, the results are:
      sequence      weight
CRCWICNIEGHYANECPN    0.167
PVCFNCKKPGHLARQCRD    0.177
QPCFRCGKAGHWSRDCTQ    0.164
DQCAYCKEKGHWAKDCPK    0.101
DQCAYCKEKGHWAKDCPK    0.101
CKCYICGQEGHYANQCRN    0.141
IKCFNCGKEGHLARNCKA    0.147
From the weights we can see that sequence 2 is the most "uncommon" sequence in the set (it is given the highest weight). Sequences 4 and 5, which are identical, are the most "common" ones (they are given the lowest weight). In a set of identical, aligned sequences, all of them would (ideally) have been given a weight of 1.
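The weighting in Step b) can be illustrated with a minimal Python sketch. The text only states that the Voronoi method relies on random sampling, so the details here are assumptions: random sequences are drawn column by column from the residues observed in the alignment, each one votes for its nearest training sequence by Hamming distance, ties share the vote, and the weights are normalised to sum to 1 (as the weights in the table above approximately do). The function name `voronoi_weights` is hypothetical and is not part of TranScout.

```python
import random

def voronoi_weights(sequences, iterations=1000, seed=None):
    """Monte Carlo estimate of Voronoi-style weights for aligned sequences.

    Assumptions (not taken from the TranScout text): Hamming distance,
    column-wise random sampling, weights normalised to sum to 1.
    """
    rng = random.Random(seed)
    length = len(sequences[0])
    # Residues observed in each alignment column; probes are drawn from these.
    columns = [sorted({seq[i] for seq in sequences}) for i in range(length)]
    counts = [0.0] * len(sequences)
    for _ in range(iterations):
        probe = [rng.choice(column) for column in columns]
        # Hamming distance from the random probe to every training sequence.
        distances = [sum(a != b for a, b in zip(probe, seq)) for seq in sequences]
        nearest = min(distances)
        winners = [k for k, d in enumerate(distances) if d == nearest]
        # Tied sequences share the vote, so duplicates split one region.
        for k in winners:
            counts[k] += 1.0 / len(winners)
    return [count / iterations for count in counts]
```

Because two identical sequences always tie with each other, they receive exactly the same weight under this scheme, and a sequence that is far from the rest collects the votes of a larger region of sequence space. Different seeds give slightly different weights, which mirrors the run-to-run variation mentioned in Step b).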
c) A PSSM is built from the weights above (Matrix 1). Notice that its frequencies are not those of a raw frequency matrix, and that the gap is treated as a 21st amino acid. This implies that the BLOSUM62 substitution matrix, which the programme uses when a sequence is submitted, must contain an additional row and column holding the score of aligning an amino acid against a gap. This score, amongst other variables, is assigned its value in the next step.
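As a rough illustration of Step c), the following Python sketch accumulates sequence weights instead of raw counts in each alignment column, with the gap as a 21st symbol. The name `weighted_frequency_matrix`, the list-of-dictionaries layout and the division by the total weight are assumptions made for the example; the text does not say how TranScout stores or normalises Matrix 1.

```python
# 20 amino acids plus the gap character as the 21st symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"

def weighted_frequency_matrix(sequences, weights):
    """Return one {symbol: weighted frequency} dictionary per alignment column."""
    length = len(sequences[0])
    matrix = [{symbol: 0.0 for symbol in ALPHABET} for _ in range(length)]
    for seq, weight in zip(sequences, weights):
        for position, symbol in enumerate(seq):
            # Each sequence contributes its weight, not a count of 1.
            matrix[position][symbol] += weight
    total = sum(weights)
    for column in matrix:
        for symbol in column:
            # Assumed normalisation: every column sums to 1.
            column[symbol] /= total
    return matrix
```

With the seven training sequences and the weights listed in Step b), the third column gives C a frequency of 1, since every sequence has a cysteine there, while in the first column C gets (0.167 + 0.141) divided by the total weight rather than the raw 2/7.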
d) To assess how good this PSSM is, a benchmark is performed: the remaining 90% of the sequences of the initial set are submitted, and TranScout tries to find the CCHC motif in them using the PSSM created in Step c). We know a priori that the CCHC motif is present in every sequence submitted, so the expected number of true positives is known. The threshold, the gap-to-amino-acid score and the gap penalty all start at 0. The benchmark is repeated, independently increasing the threshold, the gap-to-amino-acid score and the gap penalty, until an optimal combination of all three is found. An algorithm for the process is:
Initialise gap-to-amino-acid score
DO
{
     Initialise gap penalty
     DO
     {
          Initialise threshold
          DO
          {
               check number of positives, max and min scores
               store threshold value
               store number of positives, max and min scores
               increase threshold
          }
          UNTIL (no positives found)
          store gap penalty
          increase gap penalty
     }
     UNTIL (gap penalty >= standard gap penalty for gap-opening)
     store gap-to-amino-acid score
     increase gap-to-amino-acid score
}
UNTIL (one or more false positives are found)
By storing the three variables at each iteration of each loop, together with the number of true positives found, the combination of values that gives the best results can be identified, i.e. the combination that finds the maximum number of true positives with the highest possible threshold and gap penalty and the lowest possible gap-to-amino-acid score.
e) If the results of the previous step are satisfactory (i.e. all or almost all of the motifs are detected), the PSSM and the other values obtained are stored and become part of the programme's database. If the best results are unsatisfactory, Step a) is performed again with the random sample increased by 10%.
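The parameter search of Step d) and the selection rule that follows it can be sketched in Python as below. This is a simplified illustration, not the TranScout implementation: it scans fixed grids of values instead of the DO...UNTIL stopping conditions, it discards any combination that produces a false positive (an assumption), and `count_hits` is a hypothetical callable standing in for a full benchmark run that returns the number of true and false positives.

```python
from itertools import product

def best_parameters(count_hits, thresholds, gap_penalties, gap_scores):
    """Pick the parameter combination described after the pseudocode.

    count_hits(threshold, gap_penalty, gap_score) is a hypothetical
    benchmark function returning (true_positives, false_positives).
    """
    records = []
    for gap_score, gap_penalty, threshold in product(
        gap_scores, gap_penalties, thresholds
    ):
        true_pos, false_pos = count_hits(threshold, gap_penalty, gap_score)
        if false_pos == 0:  # assumption: false positives are not tolerated
            # Negating gap_score makes "lowest gap score" sort as "largest".
            records.append((true_pos, threshold, gap_penalty, -gap_score))
    if not records:
        return None
    # Most true positives first, then highest threshold, highest gap
    # penalty and lowest gap-to-amino-acid score.
    true_pos, threshold, gap_penalty, neg_gap_score = max(records)
    return {
        "true_positives": true_pos,
        "threshold": threshold,
        "gap_penalty": gap_penalty,
        "gap_score": -neg_gap_score,
    }
```

Sorting on a tuple keeps the priority order explicit: the number of true positives dominates, and the other three values only break ties. If the returned number of true positives is too low, Step e) would restart from Step a) with a larger training sample.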

WHAT HAPPENS WHEN A QUERY SEQUENCE IS SUBMITTED?
Suppose the query sequence is QQEKTCYACGTAGHLVRDCPSSPN. The programme must determine whether the CCHC motif is present in it. The process is:

a) The query sequence is transformed into a matrix. A raw frequency matrix built from a single sequence is trivial and carries no useful information, so the matrix for the query sequence is built from the BLOSUM62 substitution matrix instead. See Matrix 2.
b) The next step is to compute the matrix product of the matrix obtained in Step a) and the PSSM of the CCHC motif stored in the database. The result is Matrix 3.
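Steps a) and b) can be pictured with the short Python sketch below. It assumes that each query residue is replaced by its row of the substitution matrix and that the product is taken against the weighted frequencies of each motif column, so that every entry is the frequency-weighted substitution score of one query position against one motif position. The orientation of the matrices and the names `score_matrix`, `pssm` and `substitution` are assumptions for the example; the substitution matrix is passed in as a plain dictionary rather than loaded from any particular library.

```python
def score_matrix(query, pssm, substitution):
    """Multiply the query's substitution rows by the motif frequencies.

    query:        the query sequence as a string
    pssm:         one {symbol: frequency} dictionary per motif position
    substitution: {residue: {symbol: score}}, e.g. BLOSUM62 extended with
                  a gap row and column
    Returns a list of rows (query positions) by columns (motif positions).
    """
    scores = []
    for residue in query:
        row = substitution[residue]  # this residue's row of Matrix 2
        scores.append([
            sum(row[symbol] * freq for symbol, freq in column.items())
            for column in pssm
        ])
    return scores
```

For a 24-residue query and an 18-position motif this yields a 24 by 18 table of scores, which is the kind of input a dynamic programming step can then work on.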
c) The last step is dynamic programming with the Smith-Waterman algorithm. Since transcription factors may contain several DNA-binding motifs (i.e. several alignments with a significant score), the algorithm is modified to return all local alignments in the protein sequence instead of the single best one. The result is Matrix 4. The alignment is:
CYACGTAGHLVRDCP
|+ | + ||++++|+
ChpCpppGHhAppCp
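The idea of reporting every significant local alignment, rather than only the best one, can be sketched as follows. This is a simplified illustration under stated assumptions, not the modification used in TranScout: it uses a single linear gap penalty, takes a precomputed table of scores (query positions by motif positions) as input, and picks hits greedily from the highest score down, skipping any hit whose traceback shares query positions with a hit already reported. The function name `local_alignments` is hypothetical.

```python
def local_alignments(scores, gap_penalty, threshold):
    """Return (score, start, end) for non-overlapping local alignments.

    scores[i][j] is the score of query position i against motif position j;
    query[start:end] is the aligned stretch of the query.
    """
    rows, cols = len(scores), len(scores[0])
    H = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    back = [[None] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            candidates = [
                (0.0, None),  # start a new local alignment
                (H[i - 1][j - 1] + scores[i - 1][j - 1], (i - 1, j - 1)),
                (H[i - 1][j] - gap_penalty, (i - 1, j)),
                (H[i][j - 1] - gap_penalty, (i, j - 1)),
            ]
            H[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
    # Every cell reaching the threshold is a candidate end point.
    cells = sorted(
        ((H[i][j], i, j)
         for i in range(1, rows + 1)
         for j in range(1, cols + 1)
         if H[i][j] >= threshold and H[i][j] > 0),
        reverse=True,
    )
    used_rows, hits = set(), []
    for score, i, j in cells:
        path_rows, cell = set(), (i, j)
        # Trace back until the running score drops to zero.
        while H[cell[0]][cell[1]] > 0:
            path_rows.add(cell[0])
            cell = back[cell[0]][cell[1]]
        if path_rows & used_rows:
            continue  # overlaps a better alignment already reported
        used_rows |= path_rows
        hits.append((score, min(path_rows) - 1, max(path_rows)))
    return hits
```

A protein carrying two copies of a motif then produces two entries in the returned list, one per copy, provided both score at least `threshold`. A stricter treatment would recompute the matrix after each reported alignment instead of skipping overlapping candidates.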
