CAFASP4 EVALUATION
The results of the automatic CAFASP4 evaluation are presented here.
They are independent of any human assessment that may be carried out.
The evaluation followed the pre-announced rules and used all the targets
submitted to the servers for which native structures were available
on November 22, 2004.
This summary contains 6 parts:
- HM Sensitivity Evaluation
- FR Sensitivity Evaluation
- Specificity Evaluation
- Performance using the "best of top 5" model.
- Performance comparison of CAFASP servers vs CASP humans
- What's new?
The evaluation of
CAFASP-DP is presented separately.
In the evaluations below, except where otherwise indicated, only the rank-1 model was used.
In the tables below, we include the performance of CAFASP-CONSENSUS.
The CAFASP-CONSENSUS predictions were published on the CAFASP web pages during
the prediction session, so they provide a
baseline comparison point which any human CASP predictor could have used.
- HM Sensitivity Evaluation
Best Servers - 34 Homology Modeling Targets. See Detailed Table
| N-1 Rank | Servers | Sum MaxSubDom Score | Rank 30 no-canceled |
| 1 | 3djuryC1 | 20.92 | 1 |
| 2 | expm 3djuryC0 | 20.48-20.58 | 2-3 |
| 3 | CAFASP-CONSENSUS | 20.35 | 3 |
| 4 | VICTOR | 20.01 | 5 |
| 5 | inub bnmx shgu 3ds3 3djuryB1 spk2 | 19.60-20.17 | 5-9 + 17 |
| 6 | shub MQAP-CONSENSUS | 19.83-19.60 | 5-6 |
| 7 | sfst ace robetta sp_3 | 19.44-19.99 | 5-10 |
| 8 | raptor 3djuryB0 | 19.83-19.92 | 4-10 |
| 11 | SOLVX | 19.71 | 11 |
| ... | | | |
| 43 | pdbblast | 18.59 | 35 |
| ... | | | |
Meta-servers are listed in italics;
pdbblast's rank is included for reference.
Notice that due to the N-1 rule, more than one server can occupy the same rank and
the Score column may show larger values for servers at lower ranks.
The Rank 30 no-canceled column lists the ranks obtained when considering only the
non-canceled targets. The rank differences indicate how sensitive the evaluation is
to the removal of only 4 targets. This is due to the relatively small test set used.
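The N-1 rule is not defined on this page. Assuming it works as in earlier CAFASP rounds (the ranking is recomputed N times, each time leaving out one target, and each server keeps the best rank it reaches in any of those N subsets), a minimal sketch shows why several servers can share a rank and why a server with a lower summed score can still appear at the same rank. The server names and per-target scores below are hypothetical.

```python
def n_minus_1_ranks(scores):
    """Best leave-one-target-out rank per server (assumed N-1 rule).

    scores: {server: [per-target score, ...]}, all lists the same length.
    """
    servers = list(scores)
    n_targets = len(next(iter(scores.values())))
    best = {s: len(servers) for s in servers}
    for left_out in range(n_targets):
        # Sum each server's scores over the N-1 remaining targets.
        totals = {
            s: sum(v for i, v in enumerate(scores[s]) if i != left_out)
            for s in servers
        }
        for s in servers:
            # Rank = 1 + number of servers with a strictly higher subset total.
            rank = 1 + sum(totals[o] > totals[s] for o in servers)
            best[s] = min(best[s], rank)
    return best


# Hypothetical servers and scores, chosen only to illustrate ties.
toy = {
    "srvA": [1.0, 0.0, 0.5],     # full sum 1.5
    "srvB": [0.5, 0.75, 0.5],    # full sum 1.75
    "srvC": [0.25, 0.25, 0.25],  # full sum 0.75
}
ranks = n_minus_1_ranks(toy)
print(ranks)
```

In this toy case srvA and srvB both end up at rank 1, although srvA has the lower full sum: leaving out the second target is enough to put srvA on top once.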
The HM evaluation shows that the
differences among the servers on the HM targets are very
small (the top 10 ranks lie within a score difference of 5%). This evaluation considers C-alpha atoms only, and thus
it gives no indication of how well the servers that produce full-atom models build the complete proteins.
- Improvement of the best server over CAFASP-CONSENSUS: 3%
- Improvement of the best server over pdbblast: 12%
- What are the best autonomous, non-meta servers? expm, inub, bnmx, shgu, spk2 and shub. However, expm and bnmx are not publicly available.
- Surprisingly, three MQAPs, VICTOR, MQAP-CONSENSUS and SOLVX, score at N-1 ranks 4, 6 and 11, respectively.
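The page does not state how the improvement percentages are computed. Assuming they are relative differences between summed MaxSubDom scores, the HM figures can be reproduced from the table values as in this short sketch (the function name is illustrative):

```python
def pct_improvement(best, reference):
    """Relative improvement of `best` over `reference`, in percent."""
    return 100.0 * (best - reference) / reference


# Summed MaxSubDom scores taken from the HM table above.
best_hm = 20.92       # 3djuryC1
consensus_hm = 20.35  # CAFASP-CONSENSUS
pdbblast_hm = 18.59   # pdbblast

over_consensus = pct_improvement(best_hm, consensus_hm)
over_pdbblast = pct_improvement(best_hm, pdbblast_hm)
print(f"{over_consensus:.1f}% over CAFASP-CONSENSUS, {over_pdbblast:.1f}% over pdbblast")
```

Under that assumption the values come out at about 2.8% and 12.5%, which agrees with the quoted 3% and 12% up to rounding.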
- FR Sensitivity Evaluation
Best Servers - on 36 Fold Recognition Targets. See Detailed Table
| N-1 Rank | Servers | Sum MaxSubDom Score | # Correct | Rank 33 no-canceled |
| 1 | 3djuryC0 | 7.83 | 22 | 1 |
| 2 | 3djurys: A1 C1 B0 | 7.47-7.56 | 20-21 | 1-3 |
| 3 | CAFASP-CONSENSUS | 7.49 | 21 | 4 |
| 4 | 3djuryB1 sfst | 7.01-7.35 | 20-21 | 2-7 |
| 5 | pmodel3 | 7.07 | 21 | 1 |
| 6 | expm | 6.92 | 19 | 1 |
| 7 | 3djuryA0 bnmx VERIFY3D | 6.67-6.95 | 19-20 | 1-11 |
| 9 | pmodel5 | 6.69 | 19 | 13 |
| 11 | ace POTR | 6.53-6.58 | 17-19 | 13-14 |
| 12 | pmodel4 | 6.46 | 19 | 10 |
| ... | | | | |
| 79 | pdbblast | 0.84 | 3 | 81 |
| ... | | | | |
Meta-servers are listed in italics;
pdbblast's rank is included for reference.
Notice that due to the N-1 rule, more than one server can occupy the same rank and
the Score column may show larger values for servers at lower ranks.
The Rank 33 no-canceled column lists the ranks obtained when considering only the
non-canceled targets. The rank differences indicate how sensitive the evaluation is
to the removal of only 3 targets. This is due to the relatively small test set used.
The results of the FR evaluation show that the test set is too small
for a reliable evaluation. At least 11 of the 36 targets
had no good predictions.
A single lucky or unlucky guess, or a different evaluation method, can change the rankings significantly. Thus, these results should be interpreted with care.
- Improvement of the best server over CAFASP-CONSENSUS: 5%
- By definition, the improvement of the best server over pdbblast is almost infinite.
- What are the best autonomous, non-meta servers? sfst, expm and bnmx, but
none of them is publicly available. The next available autonomous server is at a much lower rank.
- Surprisingly again, two MQAPs, VERIFY3D and POTR, score at N-1 ranks 7 and 11, respectively.
- Specificity Evaluation (accuracy of confidence estimates)
Servers usually report the list of top hits for each prediction, along
with an assigned score. We used this information to compute
the specificity or selectivity of the servers
as in LiveBench, using all 70 targets.
Most Specific Servers. See Detailed Table
| Servers | Avg Spec. Score | # Correct > 1st False Positive | > 3rd F+ |
| pmodel4 | 50.17 | 47 @ 1.66 | 51 @ 1.15 |
| 3djurys C1 A1 C0 B1 | 47.67-48.33 | 41-44 @ various | 47-49 @ various |
| 3ds3 | 47.5 | 45 @ 49.72 | 47 @ 28.47 |
| expm | 46.67 | 35 @ 25.0 | 47 @ 6.40 |
| shub | 46.50 | 44 @ 23.92 | 47 @ 17.54 |
| ... | | | |
| pdbblast | 34.33 | 33 @ 0.002 | 35 @ 0.03 |
| ... | | | |
The scores of the 1st and 3rd false positive of each server are indicated
after the @-sign.
The best specificity was obtained by the meta-predictors,
which were able to identify up to 67% of the targets (47 out of 70) with better scores
than their first false positive, and up to 73% of the targets
(51 out of 70) with better scores than their third false positive.
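The counts in the table can be read as follows: sort a server's predictions by its own confidence score and count the correct ones that come before the k-th false positive. The sketch below illustrates that reading; it assumes that a higher score means more confidence (for servers where lower is better, negate the score first), and the data are made up.

```python
def correct_before_kth_false_positive(predictions, k=1):
    """Count correct predictions scoring better than the k-th false positive.

    predictions: iterable of (confidence, is_correct), one per target.
    Returns (number of correct predictions, score of the k-th false positive),
    with None as the score if fewer than k false positives exist.
    """
    ordered = sorted(predictions, key=lambda p: p[0], reverse=True)
    n_correct = 0
    n_false = 0
    for confidence, is_correct in ordered:
        if is_correct:
            n_correct += 1
        else:
            n_false += 1
            if n_false == k:
                return n_correct, confidence
    return n_correct, None


# Hypothetical (confidence, is_correct) pairs for one server.
toy = [(9.0, True), (8.0, True), (7.5, False), (6.0, True),
       (5.0, False), (4.0, True), (3.0, False), (2.0, True)]
first_fp = correct_before_kth_false_positive(toy, k=1)
third_fp = correct_before_kth_false_positive(toy, k=3)
print(first_fp, third_fp)
```

In the table's notation the toy server would be reported as "2 @ 7.5" for the first false positive and "4 @ 3.0" for the third.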
The table in the link above contains the scores of the
first 5 false positives for all servers. These scores can help users
interpret the servers' results and judge how reliable each server is.
- Performance using the "best of top 5" model.
An alternative view of the performance of servers and MQAPs can be
obtained by using the best of the top 5 models reported, instead of only
the number 1 model.
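As a minimal sketch of the difference between the two views, assume each server returns, for every target, a list of model scores in its own rank order. The rank-1 view sums the first score per target, while the best-of-5 view sums the maximum over the first five. The function name and the scores are illustrative only.

```python
def summed_score(per_target_model_scores, top_n=1):
    """Sum, over targets, of the best score among the first `top_n` models.

    per_target_model_scores: one list per target, holding model scores in the
    server's own rank order. Targets with no models contribute nothing.
    """
    return sum(max(models[:top_n]) for models in per_target_model_scores if models)


# Hypothetical scores for three targets.
toy = [
    [0.25, 0.75, 0.5],           # best model was ranked second
    [0.5, 0.25],                 # best model was ranked first
    [0.0, 0.0, 0.0, 0.0, 1.0],   # best model was ranked fifth
]
rank1_total = summed_score(toy, top_n=1)
best5_total = summed_score(toy, top_n=5)
print(rank1_total, best5_total)
```

The best-of-5 total can never be lower than the rank-1 total, so the gap between the two measures how often a server's best model is not the one it ranks first.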
Best Servers, Best-of-5 evaluation on 34 Homology Modeling Targets. See Detailed Table
| N-1 Rank | Servers | Sum MaxSubDom Score |
| 1 | MQAP-CONSENSUS VERIFY3D | 21.61-22.02 |
| 2-4 | VICTOR PRQU MODCHECK SOLVX PROSA | 21.27-21.82 |
| 6 | 3djuryC0 3djuryC1 PHYRE POTR PROC | 21.29-21.42 |
| 7 | BALA SIFR | 21.11-21.30 |
| 9 | expm | 21.28 |
| 13 | robetta | 21.14 |
Best Servers, Best-of-5 evaluation on 36 Fold Recognition Targets. See Detailed Table
| N-1 Rank | Servers | Sum MaxSubDom Score | # Correct |
| 1 | MQAP-CONSENSUS SOLVX VERIFY3D | 9.45-9.80 | 27-28 |
| 2-3 | VICTOR PROSA | 9.26-9.19 | 26-28 |
| 4 | robetta | 9.02 | 27 |
| 6-7 | PHYRE POTR BALA MODC | 8.61-8.99 | 25-28 |
| 9 | pmodel4 3djuryC0 | 8.41-8.43 | 22-23 |
| 10 | MODCHECK pmodel3 PROC 3djuryA1 sfst | 8.29-8.49 | 23-26 |
| 11 | expm | 8.24 | 23 |
MQAPs are shown in UPPERCASE.
Surprisingly, the MQAPs are outstanding at selecting good models among the top 5 ranks,
outperforming all the servers and meta-servers. Note also that the number
of correct targets rose to as many as 28. In the HM targets,
the best MQAP performance is almost comparable to that of the best human predictors.
In the HM targets, the best performer in the "best of 5" evaluation (MQAP-CONSENSUS)
scores 5% higher than the best performer in the "model no. 1" evaluation (3djuryC1).
In the FR targets, the best performer in the "best of 5" evaluation (MQAP-CONSENSUS)
scores 25% higher than the best performer in the "model no. 1" evaluation (3djuryC0).
- Performance comparison of CAFASP servers vs CASP humans.
The CASP human predictions are not publicly available, so only a preliminary
comparison can be carried out at this point using a handful of human models that
were sent to us (as "late"). To carry out the comparison, we only consider the subset of targets
that were not canceled at CASP.
So far, only 2 or 3 human CASP participants perform better than
the best of the CAFASP servers (of which 2 correspond to the best CASP performers).
The tables below list the difference in percentages between the human CASP
participants' performance and that of CAFASP-CONSENSUS.
Top Human CASP participants' performance vs that of CAFASP-CONSENSUS (in percentage difference) among the HM targets. See the full table.
| Human | % difference |
| krgi | +12% |
| skyz | + 8% |
| mqap-con2 | + 1% |
| 3dbe | + 1% |
| marc | - 1% |
| sp3later | - 1% |
| samu | - 2% |
| ... | |
Top Human CASP participants' performance vs that of CAFASP-CONSENSUS (in percentage difference) among the FR targets. See the full table.
| Human | % difference |
| krgi | +71% |
| skyz | +21% |
| marc | +14% |
| sake | +13% |
| 3dbe | + 0% |
| mqap-con2 | + 0% |
| chke | - 3% |
| ... | |
- What is new in the CAFASP4 results?
- Meta-predictors continue to dominate at the top, but a number of autonomous,
non-metas are performing almost as well.
- New, excellent autonomous servers have been developed.
- Unfortunately, some of the best autonomous servers are not publicly available.
- The best servers significantly outperform pdbblast, even in the easiest targets.
- The MQAP meta-selectors perform competitively when considering the no. 1 model, but
excel when considering the "best-of-top-5" models (see table).
- So far we have only found a handful of humans who outperform the best of the servers. However, the best human outperforms the best server by a large fraction: ~10% in HM and ~70% in FR (see table).
- The test set in CAFASP is too small to be robust, especially among the FR targets.