Critical Assessment of Fully Automated Structure Prediction


CAFASP2 ALIGNMENT ACCURACY EVALUATION

This document contains four sections:

  • 1. Evaluation Description.
  • 2. Main Results.
  • 3. Additional Automatic Evaluations.
  • 4. Conclusions.

DISCLAIMER

The results of the automatic Fold Recognition evaluation are presented here and are independent of any other human assessment that may be carried out.

1. Evaluation Description

In this section we briefly summarize the evaluation procedure that was applied:

  • The scoring functions used, with representative galleries (for publicly released targets and for CASP4 targets) showing superpositions of model-target pairs at different scores.
  • Multiple models considered for each target.
  • Multi-domain partitioning.
  • Target classification into Homology Modeling Targets, Easy, and Hard Fold Recognition Targets.
  • The "N-1" ranking procedure.
  • The categories of predictions: on-time and late.
  • Links to the targets, models and all the data used in the evaluation.

2. Main Results


WARNING

In the results below, the servers are identified by their casp-id number. The name and casp-id number of each server are given here. These results correspond to the MaxSub evaluation applied to the on-time first models only. The results are shown in a number of partitions.

The main result of the evaluation is based on the MaxSub evaluation on the 26 Fold Recognition Targets, but we also present the evaluation on the 15 Homology Modeling Targets.

Set | # Targets | Maximum Correct | Best / Second Best / Third Best Servers | Summary
Homology Modeling Targets (All) | 15 | 15 | 093 106 107 132 260 108 111 259 103 395B | HM Table
Fold Recognition Targets (All), Main Result | 26 | 5 | 093 106 108 395B 103 259 132 260 | FR Table


Servers Appearing at the top ranks:

  • 093 106 107 108 132 260 395B.
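The partitions above report, for each set of targets, the maximum number of correct first models and the servers at the top ranks. A minimal sketch of how such a tally could be computed is given below. It assumes that a first model counts as correct when its score is above zero and that servers are grouped simply by their number of correct models; the function name `rank_groups`, the server labels, the target labels and the scores are all hypothetical and are not taken from the CAFASP2 data.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple


def rank_groups(
    scores: Dict[str, Dict[str, float]],
    targets: Optional[Set[str]] = None,
    top: int = 3,
) -> List[Tuple[int, List[str]]]:
    """Group servers by their number of correct first models.

    scores[server][target] is the score of the server's first model.
    ASSUMPTION: a model counts as correct when its score is above zero.
    If `targets` is given, only those targets are tallied (one partition).
    Returns up to `top` (n_correct, servers) pairs, best group first.
    """
    by_count: Dict[int, List[str]] = defaultdict(list)
    for server, per_target in scores.items():
        n_correct = sum(
            1
            for target, score in per_target.items()
            if (targets is None or target in targets) and score > 0.0
        )
        by_count[n_correct].append(server)
    ordered = sorted(by_count.items(), key=lambda item: item[0], reverse=True)
    return [(n, sorted(servers)) for n, servers in ordered[:top]]


# Hypothetical servers, targets and scores, for illustration only.
example = {
    "srvA": {"t1": 3.2, "t2": 0.0, "t3": 1.1},
    "srvB": {"t1": 2.0, "t2": 0.0, "t3": 0.0},
    "srvC": {"t1": 4.0, "t2": 0.0, "t3": 2.5},
    "srvD": {"t1": 0.0, "t2": 0.0, "t3": 0.0},
}
print(rank_groups(example))                  # all targets
print(rank_groups(example, targets={"t3"}))  # a single-target partition
```

The `targets` argument mirrors the idea of reporting the same tally on different partitions (for example, easy and hard targets) without recomputing the scores.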

We subdivide the FR Targets into 5 easy targets (T0100, T0101, T0104, T0109 and T0127) and 21 hard targets, and present the subtotals for the easy and hard targets separately:

Set | # Targets | Maximum Correct | Best / Second Best / Third Best Servers | Summary
Easy Fold Recognition Targets (Easy) | 5 | 4 | 093 106 108 395B 259 260 395 | Easy FR
Hard Fold Recognition Targets (Hard) | 21 | 2 | (results not significant) 103 127 132 216 108 280 | Hard FR

As this subdivision shows, it is hard to derive significant conclusions from the evaluation on the hard FR targets alone, because our evaluation awards credit to very few models. To gain better insight into the servers' performance on the hard FR targets, other evaluation methods are needed, including those with more lenient thresholds, sequence-independent ones, or others (see below).

A further subdivision of the HM targets into 4 harder HM targets (T0089, T0090, T0092 and T0103) and 11 easy HM targets, and a partition consisting of the 4 harder HM targets plus the 5 easy FR targets, are presented here.

In addition, for each target we present the detailed results of the MaxSub evaluation, which include individual statistics on all models (including models 2 to 5 and late submissions).


Specificity Analysis. Fold recognition servers usually report a list of top hits for each prediction, each with an assigned score. We used this information to compute the specificity (or selectivity) of the servers along the lines of CAFASP1. We classified the hits as correct or incorrect using the MaxSub scores (FR Table above). Then, for each server, we counted the correct hits that scored better than the server's highest-scoring incorrect hit. We repeated the count for the second and subsequent incorrect hits. The results show that the servers with the best specificity (106, 107, 259 and 260) identified most of the easy FR targets with better scores than their first false positive. The most selective server had 4 correct predictions before its first false positive.
Using the classification of correct and incorrect predictions according to the MaxSub scores at the larger threshold of 5.0 Å (see the Additional Automatic Evaluations section below), a similar result is obtained, although at this threshold servers 108 and 395 ranked higher. Allowing one false positive, servers 132, 260 and 395 are at the top with 6 correct answers. (See also the specificity analysis using a CAFASP1-like scoring system below.)
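The counting procedure described in the specificity analysis can be sketched as follows. This is only an illustration under stated assumptions: each hit is represented as a (score, is_correct) pair, higher scores are taken to mean more confident predictions, and a correct hit that ties with an incorrect hit is not counted as ranking above it. The function name and the sample hits are hypothetical and are not taken from the CAFASP2 data.

```python
from typing import List, Tuple

Hit = Tuple[float, bool]  # (server-assigned score, judged correct?)


def correct_before_false_positives(hits: List[Hit], max_fp: int = 3) -> List[int]:
    """Count correct hits ranked above the 1st, 2nd, ... incorrect hit.

    ASSUMPTION: higher scores are better. Element k-1 of the result is the
    number of correct hits scoring strictly better than the k-th
    highest-scoring incorrect hit. The list is shorter than `max_fp`
    when the server has fewer incorrect hits than that.
    """
    # Sort by descending score; on ties, incorrect hits (False) come first,
    # so a tied correct hit is never counted as "above" a false positive.
    ranked = sorted(hits, key=lambda h: (-h[0], h[1]))
    counts: List[int] = []
    correct_so_far = 0
    for _score, is_correct in ranked:
        if is_correct:
            correct_so_far += 1
        else:
            counts.append(correct_so_far)
            if len(counts) == max_fp:
                break
    return counts


# Hypothetical hit list for one server, for illustration only.
hits = [(9.1, True), (8.7, True), (7.9, False), (7.5, True), (6.0, False), (5.2, True)]
print(correct_before_false_positives(hits))
```

The first element of the result corresponds to "correct predictions before the first false positive"; the second element corresponds to the count obtained when one false positive is allowed.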


3. Additional Automatic Evaluations


In addition to the above "official" evaluation, we have performed several other evaluations, including:

  • MaxSub evaluation using
    • the best of the top 5 models,
    • normalization by structural alignment, and
    • first models only, but at a MaxSub threshold of 5.0 Å
  • lgscore evaluation
  • CAFASP1-like evaluation
  • touch evaluation (experimental)

The results from these additional evaluations were very similar to the official result above, showing only minor differences in the final ranking of the servers.
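The difference between scoring first models only and scoring the best of the top 5 models can be illustrated with a short sketch. It assumes that a server's models for each target are stored in submission order (model 1 first) and that a server's total is the sum of its per-target scores; both the summation and the sample scores are assumptions made for illustration, not a description of the evaluation code.

```python
from typing import Dict, List


def first_model_total(model_scores: Dict[str, List[float]]) -> float:
    """Sum the score of model 1 over all targets that have a model."""
    return sum(scores[0] for scores in model_scores.values() if scores)


def best_of_top5_total(model_scores: Dict[str, List[float]]) -> float:
    """Sum, over all targets, the best score among models 1 to 5."""
    return sum(max(scores[:5]) for scores in model_scores.values() if scores)


# Hypothetical scores for one server: target -> scores of models 1, 2, ...
example = {"t1": [0.0, 2.5, 1.0], "t2": [3.0, 1.0], "t3": []}
print(first_model_total(example), best_of_top5_total(example))
```

A server whose best model is often not its first one gains under the best-of-5 variant, so comparing the two totals gives a rough indication of how well a server ranks its own models.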

4. Conclusions


What did the servers succeed in predicting? Under the strict MaxSub evaluation criteria, which considered only on-time first models, good models were predicted for all 15 HM targets and for the 5 easy FR targets. However, success was much lower among the 21 hard FR targets. Once it has been clearly determined which of these 21 targets correspond to new folds, we will know how many of them could not possibly have been predicted by fold recognition. Nevertheless, even for a target with a known fold, a MaxSub score of zero does not necessarily imply that the prediction is totally wrong. In some cases, a prediction may have identified the correct parent but, because of alignment errors, only small regions were modeled accurately.

Because of the relatively low sensitivity of our automated evaluation, a "human-expert" evaluation is required to learn more about the prediction capabilities of the servers on the 21 hard FR targets. The CASP human assessor may provide additional insights when assessing the servers' models of these hard FR targets. In addition, we searched for any predictions that may have captured something about the true fold, and found that:

  • No potentially valuable prediction could be found for targets T0086, T0094, T0096.2, T0105, T0106, T0118, T0120, T0126.
  • Valuable predictions that may have captured (at least in part) the correct fold or a correct motif were found for targets T0087.1, T0087.2, T0091, T0097, T0098, T0102, T0107, T0108, T0110, T0114, T0115, T0116, T0121.2.
  • Some of the especially interesting predictions not scored in the above tables were:

Here is a list of main conclusions that we can draw from CAFASP2:

  • Hard to assess beyond the 15 HM targets and the 5 easy FR targets.
  • Hard to use automatic evaluation on hard cases and especially when only a few hard targets exist. To discriminate borderline predictions, more accurate automatic evaluation methods are needed, although it is not clear how useful such predictions might be.
  • At this point, the conclusions listed here are mainly based on the evaluation within the easier FR targets.
  • 4 servers better than the rest: ffas (395B), threaders, 3dpssm and inbgus, with fugue not far behind. These servers are significantly better than pdbblast, even within the HM targets alone. sam-t99 also appears to follow closely behind the top 4 servers and shows excellent performance on the HM targets, although with lgscore it ranked second.
  • HM servers not better than FR servers on HM targets.
  • The additional automatic evaluations generally confirmed the above findings, with very minor exceptions.
  • Selectivities as bad as in CAFASP1, but the difficulty of the targets has increased significantly. Selectivity on the 5 easy FR targets is good.
  • Among the new servers that did not participate in CAFASP1, fugue is approaching the performance of the top 4 servers.
  • The ab initio server isites appears to give interesting, promising models for the targets where FR fails.
  • For future CAFASP experiments, the raw output will be required to be in PDB format containing at least C-alpha atoms.
  • Taken together, the servers as a group identified roughly twice as many correct targets as the best single server. To determine how useful the servers as a group might have been for a human predictor, it would be interesting to evaluate the human participants at CASP who used the servers' results, as well as the CAFASP-consensus group predictions filed at CASP.
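The comparison in the last conclusion, between the servers taken as a group and the best single server, amounts to comparing the union of the correctly predicted targets with the largest per-server count. A minimal sketch follows; it assumes the set of correctly predicted targets is already known for each server, and the function name, server labels and target labels are hypothetical.

```python
from typing import Dict, Set, Tuple


def group_vs_best(correct: Dict[str, Set[str]]) -> Tuple[int, int]:
    """Return (targets solved by at least one server,
               targets solved by the best single server)."""
    solved_by_group: Set[str] = set().union(*correct.values())
    best_single = max((len(targets) for targets in correct.values()), default=0)
    return len(solved_by_group), best_single


# Hypothetical sets of correctly predicted targets, for illustration only.
example = {
    "srvA": {"t1", "t2"},
    "srvB": {"t2", "t3"},
    "srvC": {"t4"},
}
print(group_vs_best(example))
```

The gap between the two numbers is an upper bound on what a consensus method or a human predictor could gain by picking the right server for each target; it says nothing about whether the right server can be identified in advance.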

CAFASP2 url: http://www.cs.bgu.ac.il/~dfischer/CAFASP2