CAFASP4 Evaluation Results
Critical Assessment of Fully Automated Structure Prediction


CAFASP4 EVALUATION

The results of the automatic CAFASP4 evaluation are presented here and are independent of any other human assessment that may be carried out. The evaluation was carried out following the pre-announced rules, using all targets submitted to the servers for which native structures were available by November 22, 2004.


This summary contains 6 parts:

  • HM Sensitivity Evaluation
  • FR Sensitivity Evaluation
  • Specificity Evaluation
  • Performance using the "best of top 5" model.
  • Performance comparison of CAFASP servers vs CASP humans
  • What's new?
The evaluation of CAFASP-DP is presented separately.

In the evaluations below, except where otherwise indicated, only the rank-1 model was used.

In the tables below, we include the performance of CAFASP-CONSENSUS. The CAFASP-CONSENSUS predictions, published on the CAFASP web pages during the prediction session, provide a baseline comparison point that any human CASP predictor could have used.

  • HM Sensitivity Evaluation

    Best Servers - 34 Homology Modeling Targets. See Detailed Table
    N-1 Rank Servers Sum MaxSubDom Score Rank 30 no-canceled
    1 3djuryC1 20.92 1
    2 expm 3djuryC0 20.48-20.58 2-3
    3 CAFASP-CONSENSUS 20.35 3
    4 VICTOR 20.01 5
    5 inub bnmx shgu 3ds3 3djuryB1 spk2 19.60-20.17 5-9 + 17
    6 shub MQAP-CONSENSUS 19.83-19.60 5-6
    7 sfst ace robetta sp_3 19.44-19.99 5-10
    8 raptor 3djuryB0 19.83-19.92 4-10
    11 SOLVX 19.71 11
    ...
    43 pdbblast 18.59 35
    ...
    Meta-servers are listed in italics; pdbblast's rank is included for reference. Notice that, due to the N-1 rule, more than one server can occupy the same rank, and the Score column may show larger values for servers at lower ranks. The Rank 30 no-canceled column lists the ranks obtained when considering only the non-canceled targets. The rank differences indicate how sensitive the evaluation is to the removal of only 4 targets, a consequence of the relatively small test set.
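
    As a rough illustration of how servers can share a rank, the sketch below assumes one common reading of the N-1 rule: a server is placed in a strictly worse tier than another only if the other server's summed score, after discarding its own single best-scoring target, still exceeds this server's full sum. The function name and data layout are illustrative only and are not part of the CAFASP pipeline.

      # Illustrative sketch only: group servers into rank tiers under an
      # assumed reading of the N-1 rule.  Scores are per-target sum scores.
      def n1_rank_groups(per_target_scores):
          """per_target_scores: dict mapping server -> list of per-target scores.
          Returns servers grouped into rank tiers, best tier first."""
          totals = {s: sum(v) for s, v in per_target_scores.items()}
          # "N-1" score: the total with the server's single best target removed.
          n1 = {s: totals[s] - max(v) for s, v in per_target_scores.items()}
          order = sorted(totals, key=totals.get, reverse=True)
          groups = [[order[0]]]
          for server in order[1:]:
              placed = [s for g in groups for s in g]
              # Start a new, worse tier only if every server already placed
              # still beats this one after the N-1 reduction.
              if all(n1[other] > totals[server] for other in placed):
                  groups.append([server])
              else:
                  groups[-1].append(server)
          return groups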

    The results of the HM evaluation show that the differences among the servers on the HM targets are very small (the top 10 ranks lie within a score difference of 5%). This evaluation considers C-alpha atoms only, and thus gives no indication of how well the full-atom-producing servers model the complete proteins.
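
    To make concrete what a C-alpha-only comparison involves, the sketch below superimposes two sets of corresponding C-alpha coordinates (Kabsch algorithm) and reports the fraction that land within 3.5 Å of each other. This is an illustration of the idea only; the MaxSub-based score used in the tables instead searches for the largest well-superimposable substructure and normalizes differently.

      # Illustrative sketch only: superimpose corresponding C-alpha coordinates
      # and report the fraction within a distance cutoff.  The real MaxSub
      # score searches for the largest superimposable subset instead.
      import numpy as np

      def calpha_fraction(model_ca, native_ca, cutoff=3.5):
          """model_ca, native_ca: (N, 3) arrays of corresponding C-alpha coords."""
          P = model_ca - model_ca.mean(axis=0)     # center both coordinate sets
          Q = native_ca - native_ca.mean(axis=0)
          H = P.T @ Q                              # 3x3 covariance matrix
          U, _, Vt = np.linalg.svd(H)
          d = np.sign(np.linalg.det(Vt.T @ U.T))   # avoid improper rotations
          R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # optimal rotation of P onto Q
          dists = np.linalg.norm((R @ P.T).T - Q, axis=1)
          return float((dists <= cutoff).mean())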

    • Improvement of the best server over CAFASP-CONSENSUS: 3%
    • Improvement of the best server over pdbblast: 12%
    • What are the best autonomous, non-meta servers? expm, inub, bnmx, shgu, spk2 and shub. However, expm and bnmx are not publicly available.
    • Surprisingly, three MQAPs, VICTOR, MQAP-CONSENSUS and SOLVX, appear high in the table above, at ranks 4, 6 and 11, respectively.


  • FR Sensitivity Evaluation

    Best Servers - 36 Fold Recognition Targets. See Detailed Table
    N-1 Rank Servers Sum MaxSubDom Score # Correct Rank 33 no-canceled
    1 3djuryC0 7.83 22 1
    2 3djuryA1 3djuryC1 3djuryB0 7.47-7.56 20-21 1-3
    3 CAFASP-CONSENSUS 7.49 21 4
    4 3djuryB1 sfst 7.01-7.35 20-21 2-7
    5 pmodel3 7.07 21 1
    6 expm 6.92 19 1
    7 3djuryA0 bnmx VERIFY3D 6.67-6.95 19-20 1-11
    9 pmodel5 6.69 19 13
    11 ace POTR 6.53-6.58 17-19 13-14
    12 pmodel4 6.46 19 10
    ...
    79 pdbblast 0.84 3 81
    ...
    Meta-servers are listed in italics; pdbblast's rank is included for reference. Notice that, due to the N-1 rule, more than one server can occupy the same rank, and the Score column may show larger values for servers at lower ranks. The Rank 33 no-canceled column lists the ranks obtained when considering only the non-canceled targets. The rank differences indicate how sensitive the evaluation is to the removal of only 3 targets, a consequence of the relatively small test set.

    The results of the FR evaluation show that the test set is too small for a reliable evaluation. At least 11 of the 36 targets had no good predictions. A single lucky (or unlucky) guess, or a different evaluation method, can change the rankings significantly. Thus, these results should be interpreted with care.

    • Improvement of the best server over CAFASP-CONSENSUS: 5%
    • Because pdbblast scores only 0.84 on the FR targets (3 correct), the improvement of the best server over it is roughly nine-fold.
    • What are the best autonomous, non-meta servers? sfst, expm and bnmx, but none of them is publicly available. The next publicly available autonomous server is at a much lower rank.
    • Surprisingly again, two MQAPs, VERIFY3D and POTR, score at ranks 7 and 11 in the table above, respectively.



  • Specificity Evaluation (accuracy of confidence estimates)

    Servers usually report the list of top hits for each prediction, along with an assigned score. We used this information to compute the specificity (selectivity) of the servers as in LiveBench, using all 70 targets.

    Most Specific Servers. See Detailed Table
    Servers Avg Spec. Score # Correct before 1st False Positive (@ its score) # Correct before 3rd False Positive (@ its score)
    pmodel4 50.17 47 @ 1.66 51 @ 1.15
    3djuryC1 3djuryA1 3djuryC0 3djuryB1 47.67-48.33 41-44 @ various 47-49 @ various
    3ds3 47.5 45 @ 49.72 47 @ 28.47
    expm 46.67 35 @ 25.0 47 @ 6.40
    shub 46.50 44 @ 23.92 47 @ 17.54
    ...
    pdbblast 34.33 33 @ 0.002 35 @ 0.03
    ...

    The scores of the 1st and 3rd false positives of each server are indicated after the @-sign. The best specificity was obtained by the meta-predictors, which were able to identify up to 67% of the targets (47 out of 70) with better scores than their first false positive, and up to 73% of the targets (51 out of 70) when allowing up to 3 false positives.

    The table in the link above contains the scores of the first 5 false positives for all servers. These scores can help users interpret each server's results and gauge its reliability.
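
    As an illustration of how these counts can be derived from a server's ranked hit list, the sketch below counts the correct top hits that score better than the server's 1st and 3rd false positives. The data layout (one (score, is_correct) pair per target) and the assumption that larger scores mean higher confidence are illustrative; servers reporting E-value-style scores would be sorted in ascending order instead.

      # Illustrative sketch: count correct predictions scoring above a server's
      # 1st and 3rd false positives, as in the specificity table above.
      def correct_above_false_positives(predictions, fp_marks=(1, 3)):
          """predictions: list of (score, is_correct) pairs, one per target,
          where score is the server's own confidence for its top hit."""
          ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
          result, correct, fp_seen = {}, 0, 0
          for score, is_correct in ranked:
              if is_correct:
                  correct += 1
              else:
                  fp_seen += 1
                  if fp_seen in fp_marks:
                      # Correct hits ranked above this false positive, plus the
                      # false positive's own score (shown after the @-sign).
                      result[fp_seen] = (correct, score)
          return result  # e.g. {1: (47, 1.66), 3: (51, 1.15)} for pmodel4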


  • Performance using the "best of top 5" model.

    An alternative view of the performance of servers and MQAPs can be obtained by using the best of the top 5 models reported, instead of only the number 1 model.
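
    A minimal sketch of this scoring mode, assuming per-target lists of model scores are available (the function name and data layout are illustrative):

      # Illustrative sketch: "best of top 5" scoring.  For each target, take the
      # highest score among the first five submitted models, then sum over targets.
      def best_of_top5_sum(model_scores_per_target, k=5):
          """model_scores_per_target: list of per-target lists of model scores,
          ordered as submitted (model no. 1 first)."""
          return sum(max(scores[:k]) for scores in model_scores_per_target if scores)

    With k=1 this reduces to the rank-1 ("model no. 1") evaluation used in the sections above.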

    Best Servers, Best-of-5 evaluation on 34 Homology Modeling Targets. See Detailed Table
    N-1 Rank Servers Sum MaxSubDom Score
    1 MQAP-CONSENSUS VERIFY3D 21.61-22.02
    2-4 VICTOR PRQU MODCHECK SOLVX PROSA 21.27-21.82
    6 3djuryC0 3djuryC1 PHYRE POTR PROC 21.29-21.42
    7 BALA SIFR 21.11-21.30
    9 expm 21.28
    13 robetta 21.14

    Best Servers, Best-of-5 evaluation on 36 Fold Recognition Targets. See Detailed Table
    N-1 Rank Servers Sum MaxSubDom Score # Correct
    1 MQAP-CONSENSUS SOLVX VERIFY3D 9.45-9.80 27-28
    2-3 VICTOR PROSA 9.19-9.26 26-28
    4 robetta 9.02 27
    6-7 PHYRE POTR BALA MODC 8.61-8.99 25-28
    9 pmodel4 3djuryC0 8.41-8.43 22-23
    10 MODCHECK pmodel3 PROC 3djuryA1 sfst 8.29-8.49 23-26
    11 expm 8.24 23

    MQAPs are shown in UPPERCASE. Surprisingly, the MQAPs' performance at selecting good models from among the top 5 submitted is outstanding, outperforming all the servers and meta-servers. Note also that the number of correct targets increases to up to 28. On the HM targets, the best MQAP performance is almost comparable to that of the best human predictors.

    In the HM targets, the best performer on the "best of 5" evaluation (MQAP-CONSENSUS), scores 5% higher than the best performer in the "model no. 1" evaluation (3djuryC1). In the FR targets, the best performer on the "best of 5" evaluation (MQAP-CONSENSUS), scores 25% higher than the best performer in the "model no. 1" evaluation (3djuryC0).

  • Performance comparison of CAFASP servers vs CASP humans.

    The CASP human predictions are not publicly available, so only a preliminary comparison can be carried out at this point, using a handful of human models that were sent to us (as "late" submissions). To carry out the comparison, we consider only the subset of targets that were not canceled at CASP. So far, only 2 or 3 human CASP participants perform better than the best of the CAFASP servers (2 of these correspond to the best CASP performers). The Tables below list the percentage difference between each human CASP participant's performance and that of CAFASP-CONSENSUS.
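
    Assuming the percentage differences below are relative differences of summed scores over the shared, non-canceled targets (an assumption; the exact formula is not restated here), a minimal sketch:

      # Illustrative sketch (assumed formula): percentage difference of a human
      # group's summed score relative to the CAFASP-CONSENSUS score.
      def percent_difference(human_score, consensus_score):
          return 100.0 * (human_score - consensus_score) / consensus_score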

    Top Human CASP participants' performance vs that of CAFASP-CONSENSUS (in percentage difference) among the HM targets. See the full table.
    Human % difference
    krgi +12%
    skyz + 8%
    mqap-con2 + 1%
    3dbe + 1%
    marc - 1%
    sp3later - 1%
    samu - 2%
    ...

    Top Human CASP participants' performance vs that of CAFASP-CONSENSUS (in percentage difference) among the FR targets. See the full table.
    Human % difference
    krgi +71%
    skyz +21%
    marc +14%
    sake +13%
    3dbe + 0%
    mqap-con2 + 0%
    chke - 3%
    ...

  • What is new in the CAFASP4 results?

    • Meta-predictors continue to dominate at the top, but a number of autonomous, non-meta servers perform almost as well.
    • New, excellent autonomous servers have been developed.
    • Unfortunately, some of the best autonomous servers are not publicly available.
    • The best servers significantly outperform pdbblast, even on the easiest targets.
    • The MQAP meta-selectors perform competitively when considering the no. 1 model, but excel when considering the "best-of-top-5" models (see table).
    • So far we have found only a handful of humans who outperform the best of the servers. However, the best human outperforms the best server by a large margin: ~10% in HM and ~70% in FR (see table).
    • The test set in CAFASP is too small to be robust, especially among the FR targets.





