Critical Assessment of Fully Automated Structure Prediction


CAFASP2 ALIGNMENT ACCURACY EVALUATION

This document presents the main Fold Recognition evaluation description and general conclusions, but discloses no information about the actual targets, models, or servers. Due to CASP regulations, the latter will only be made available at a later date.

This document contains four sections:

  • 1. Evaluation Description.
  • 2. Main Results.
  • 3. Additional Automatic Evaluations.
  • 4. Conclusions.

DISCLAIMER: As of today, it is widely believed that in general, human-expert, computer-aided protein structure prediction can be more powerful than fully automated predictions. Similarly, we believe that human-expert assessment can be significantly more accurate than automatic evaluation.

All CAFASP-2 evaluations were produced using fully automated methods. Consequently, there are a number of limitations and shortcomings in the evaluations presented below, including the lack of an expert interpretation of the numerical data. However, this was one of the goals of CAFASP: not only to evaluate fully automated predictions, but also to carry out the evaluation using fully automated methods. These methods were described before the experiment began, and participation in CAFASP implied that participants agreed to these methods.

One of the advantages of running CAFASP in parallel with CASP is that CAFASP participants can also benefit from the human-expert assessment provided by the CASP assessors. We believe that, as of today, the human-expert assessment will in most cases be a better indicator of the servers' performance than the assessment presented below.

The results of the automatic Fold Recognition evaluation are presented here and are independent of any other human assessment that may be carried out.

1. Evaluation Description

In this section we briefly summarize the evaluation procedure applied:

  • The Scoring Functions used, together with representative galleries of superpositions of model-target pairs at different scores, for publicly released targets and for CASP4 targets.
  • Multiple models considered for each target.
  • Multi-domain partitioning.
  • Target classification into Homology Modeling Targets, Easy Fold Recognition Targets, and Hard Fold Recognition Targets.
  • The "N-1" ranking procedure.
  • The categories of predictions: on-time and late.
  • Links to the targets, models and all the data used in the evaluation.

2. Main Results


WARNING: As with any automated method, our evaluation requires thresholds. Consequently, we expect that a small number of predictions on the borderline of being considered correct may not score any points because they fall just below the threshold (see examples in our representative galleries of MaxSub assignments for publicly released targets and for CASP4 targets). Conversely, it is likely that a small number of predictions that could be considered incorrect may receive a score because they fall just above the threshold (the same galleries contain examples of this case as well). However, the number of such cases is expected to be low, as evidenced by the consistency of the evaluation results across various methods and thresholds. Although our automated methods may have difficulty discriminating these borderline cases, we are convinced that our evaluation succeeds in discriminating the best models and servers from the poorer ones. Because of the relatively small number of targets, this boundary problem may play a larger role than desired. Thus, to partially correct for these limitations, all rankings were performed using the "N-1" rule described above.
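As an illustration only, the following Python sketch implements one plausible reading of such a rule (the authoritative definition is the one given in the evaluation description page): a server is ranked above another only if its total score remains higher when any single target is left out, so that a ranking difference cannot hinge on one borderline prediction. The per-target scores shown are hypothetical.

    def outranks_n_minus_1(scores_a, scores_b):
        # One plausible reading of the "N-1" rule (an assumption, not the
        # official definition): server A outranks server B only if A's total
        # stays ahead of B's after removing any single target.
        assert len(scores_a) == len(scores_b)
        total_a, total_b = sum(scores_a), sum(scores_b)
        for a_i, b_i in zip(scores_a, scores_b):
            if total_a - a_i <= total_b - b_i:
                return False
        return True

    # Hypothetical per-target scores for two servers over five targets:
    server_a = [0.0, 0.6, 0.4, 0.0, 0.7]
    server_b = [0.0, 0.1, 0.5, 0.3, 0.2]
    print(outranks_n_minus_1(server_a, server_b))   # True only if A is robustly ahead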

In the results below, the servers are identified by their CASP id numbers. The name and CASP id number of each server is given here. These results correspond to the MaxSub evaluation applied to the on-time first models only. The results are shown in a number of partitions.

The main result of the evaluation is based on the MaxSub evaluation on the 26 Fold Recognition Targets, but we also present the evaluation on the 15 Homology Modeling Targets.
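For readers unfamiliar with MaxSub, the sketch below (Python/NumPy) computes a simplified MaxSub-like score under stated assumptions: the residue pairing is taken from the sequence-dependent alignment, a single rigid-body superposition (Kabsch) is used, and residue pairs within a distance threshold d (3.5 A here; 5.0 A in one of the additional evaluations below) contribute 1/(1 + (d_i/d)^2) to a sum normalized by the target length. The real MaxSub program additionally searches over seed segments and iteratively refines the superimposed subset, so this is an illustration rather than a reimplementation.

    import numpy as np

    def kabsch_superpose(mobile, target):
        # Least-squares rigid-body superposition (Kabsch algorithm) of two
        # Nx3 coordinate arrays; returns the transformed mobile coordinates.
        mc, tc = mobile.mean(axis=0), target.mean(axis=0)
        m, t = mobile - mc, target - tc
        u, _, vt = np.linalg.svd(m.T @ t)
        d = np.sign(np.linalg.det(vt.T @ u.T))
        rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
        return (rot @ m.T).T + tc

    def maxsub_like_score(model_ca, target_ca, threshold=3.5):
        # Simplified, single-superposition MaxSub-like score in [0, 1].
        # model_ca / target_ca: Nx3 arrays of aligned C-alpha coordinates;
        # the normalization is by the number of target residues.
        superposed = kabsch_superpose(model_ca, target_ca)
        dist = np.linalg.norm(superposed - target_ca, axis=1)
        close = dist <= threshold
        contributions = 1.0 / (1.0 + (dist[close] / threshold) ** 2)
        return float(contributions.sum()) / len(target_ca)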

Set                                          # Targets  Maximum Correct  Best Servers  Second Best Servers  Third Best Servers  Summary
Homology Modeling Targets (All)              15         15               xxx           xxx                  xxx                 HM Table
Fold Recognition Targets (All), Main Result  26         5                xxx           xxx                  xxx                 FR Table


We subdivide the FR Targets into 5 easy targets and 21 hard targets, and present the subtotals for the easy and hard targets separately:

Set                                   # Targets  Maximum Correct               Best Servers  Second Best Servers  Third Best Servers  Summary
Easy Fold Recognition Targets (Easy)  5          4                             xxx           xxx                  xxx                 Easy FR
Hard Fold Recognition Targets (Hard)  21         2 (results not significant)   xxx           xxx                  xxx                 Hard FR

As this subdivision shows, it is hard to derive significant conclusions from the evaluation on the hard FR targets alone, because our evaluation awards credit to very few models. To gain better insight into the servers' performance on the hard FR targets, other evaluation methods are needed, including methods with more lenient thresholds, sequence-independent methods, and others (see below).

A further subdivision of the HM targets into 4 harder HM targets and 11 easy HM targets, and a partition combining the 4 harder HM targets with the 5 easy FR targets, are presented here.

In addition, for each target we present the detailed results of the MaxSub evaluation, which include individual statistics for all models (including models 2 to 5 and late submissions).


Specificity Analysis. Fold recognition servers usually report a list of top hits for each prediction, along with an assigned score. We used this information to compute the specificity, or selectivity, of the servers along the lines of CAFASP-1. We classified the hits into correct and incorrect ones using the MaxSub scores (FR Table above). Then, we computed the number of correct hits that a server produced with scores above that of its highest-scoring incorrect hit (its first false positive). We did the same for the scores of the subsequent incorrect hits. The specificity results show that the servers with the best specificity were able to identify most of the easy FR targets with better scores than their first false positive. The most selective server had 4 correct predictions before its first false positive.

Using the classification of correct and incorrect predictions according to the MaxSub scores at the larger threshold of 5.0 A (see the Additional Automatic Evaluations section below), a similar result is obtained, but at this threshold two other servers ranked higher. (See also the specificity analysis using a CAFASP1-like scoring system below.)
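As a concrete illustration of this count, the sketch below (Python; the data are hypothetical) sorts a server's hits by the server's own reported score and counts the number of correct hits, as classified by MaxSub, that appear before the first, second, and third false positives.

    def correct_hits_before_false_positives(hits, max_false_positives=3):
        # hits: list of (server_score, is_correct) pairs, one per target.
        # Returns a list whose k-th entry is the number of correct hits whose
        # server score exceeds that of the (k+1)-th false positive.
        ranked = sorted(hits, key=lambda h: h[0], reverse=True)
        counts, correct_so_far, false_positives = [], 0, 0
        for score, is_correct in ranked:
            if is_correct:
                correct_so_far += 1
            else:
                false_positives += 1
                counts.append(correct_so_far)
                if false_positives == max_false_positives:
                    break
        return counts

    # Hypothetical server output: (reported score, correct according to MaxSub)
    hits = [(8.1, True), (7.4, True), (6.9, False), (6.5, True),
            (5.2, True), (4.8, False), (3.1, False)]
    print(correct_hits_before_false_positives(hits))   # -> [2, 4, 4]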


3. Additional Automatic Evaluations


In addition to the above "official" evaluation, we have performed several additional evaluations, including:

  • MaxSub evaluation using:
    • the best of the top 5 models,
    • normalization by structural alignment, and
    • first models only, but at a MaxSub threshold of 5.0 A
  • lgscore evaluation
  • CAFASP1-like evaluation
  • touch evaluation (experimental)

The results from these additional evaluations were very similar to the official result above, showing only minor differences in the final ranking of the servers.
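As an illustration of the first of these variants, the sketch below (Python; target identifiers and scores are hypothetical) scores each target by the best MaxSub value among a server's top five models instead of model 1 alone; the other variants change only the threshold, the normalization, or the scoring function. The sum of the per-target best scores is shown here as one natural ranking quantity; the exact criterion is the one defined in the evaluation description.

    def best_of_top5_scores(scores_by_target):
        # scores_by_target: dict mapping a target id to the MaxSub scores of a
        # server's models 1-5 for that target (missing models simply omitted).
        # Returns the best score per target and their sum.
        best = {target: (max(models) if models else 0.0)
                for target, models in scores_by_target.items()}
        return best, sum(best.values())

    # Hypothetical per-model MaxSub scores for one server on three targets:
    scores = {"target_01": [0.00, 0.42, 0.00],   # model 2 was the best
              "target_02": [0.63],               # only the first model scored
              "target_03": [0.00, 0.00, 0.00]}   # nothing correct
    per_target_best, total = best_of_top5_scores(scores)
    print(per_target_best, total)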

4. Conclusions


What did the servers succeed in predicting? It is clear that with the strict MaxSub evaluation criteria, which considered only on-time first models, good models were predicted for all 15 HM targets and for the 5 easy FR targets. However, success among the 21 hard FR targets was much lower. Once it is clearly determined which of these 21 targets correspond to new folds, we will know how many of them could not possibly have been predicted by fold recognition. Nevertheless, even for a target with a known fold, the fact that MaxSub scored a prediction with a zero does not necessarily imply that the prediction is totally wrong. In some cases, a prediction may have identified the correct parent, but because of alignment errors only small regions were modeled accurately.

Because of the relatively low sensitivity of our automated evaluation, a "human-expert" evaluation is required to learn more about the prediction capabilities of the servers on the 21 hard FR targets. The CASP human assessor may provide additional insight when assessing the servers' models of these hard FR targets. In addition, we searched for any predictions that may have captured something about the true fold, and found that:

  • No potentially valuable prediction could be found for 8 targets.
  • Valuable predictions that may have captured (at least in part) the correct fold or a correct motif were found for the other 13 targets.

Here is a list of the main conclusions that we can draw from CAFASP2:

  • It is hard to assess the servers beyond the 15 HM targets and the 5 easy FR targets.
  • It is hard to apply automatic evaluation to hard cases, especially when only a few hard targets exist. To discriminate borderline predictions, more accurate automatic evaluation methods are needed, although it is not clear how useful such borderline predictions might be.
  • At this point, the conclusions listed here are mainly based on the evaluation of the easier FR targets.
  • Four servers performed better than the rest, but two others are not far behind. These servers are significantly better than pdbblast, even on the HM targets alone.
  • HM servers were not better than FR servers on the HM targets.
  • The additional automatic evaluations generally confirmed the above findings, with very minor exceptions.
  • Selectivities are as poor as in CAFASP1, but the difficulty of the targets has increased significantly. Selectivity on the 5 easy FR targets is good.
  • Of the new servers that did not participate in CAFASP1, one is approaching the performance of the top 4 servers.
  • One ab initio server appears to give interesting, promising models for the targets where FR fails.
  • For future CAFASP experiments, the raw output will be required to be in PDB format containing at least C-alpha atoms.
  • Taken together, the servers as a group identified roughly double the number of targets correctly identified by the best individual server (a sketch of this comparison follows this list). To determine how useful the servers as a group might have been to a human predictor, it would be interesting to evaluate the human CASP participants who used the servers' results, as well as the CAFASP-consensus group predictions filed at CASP.
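A minimal sketch of how that group-versus-best comparison can be made, assuming only that the set of correctly predicted targets is known for each server (the data below are hypothetical): the group's coverage is the union of the per-server sets, compared with the size of the largest individual set.

    def group_vs_best_server(correct_by_server):
        # correct_by_server: dict mapping a server id to the set of targets it
        # predicted correctly.  Returns (number of targets covered by at least
        # one server, number covered by the best single server).
        union = set().union(*correct_by_server.values())
        best_single = max(len(targets) for targets in correct_by_server.values())
        return len(union), best_single

    # Hypothetical sets of correctly predicted targets:
    correct = {"server_A": {"t01", "t02", "t03"},
               "server_B": {"t02", "t04"},
               "server_C": {"t03", "t05", "t06"}}
    print(group_vs_best_server(correct))   # -> (6, 3): the group covers twice as many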

CAFASP2 url: http://www.cs.bgu.ac.il/~dfischer/CAFASP2