CAFASP3 Evaluation Results
Critical Assessment of Fully Automated Structure Prediction



The results of the automatic Fold Recognition evaluation are presented here and are independent of any other human assessment that may be carried out. The evaluation was carried out following the pre-announced rules, using the native structures available on October 17, 2002. A few more structures have been released since then and may be considered in a later evaluation, although we do not expect them to change the current rankings significantly.

See the evaluation description for information on: the models considered for each target; the scoring functions used; multi-domain partitioning; the classification of targets into Homology Modeling and Fold Recognition categories; the "N-1" ranking procedure; and the sensitivity and specificity computations.

Main Results

  • Representative Models

    This table shows, in detail, a representative selection of the best models for some of the targets. For each target, all models (up to 10 per server, from over 50 servers) were evaluated, and one of the top-scoring models was selected from among all the ca. 300-400 models. The selection emphasizes that many servers produced some "best" model, and that the latter did not always correspond to the rank-1 model. Please note that these selections have no effect on the evaluations below.

    In the evaluations below, only the rank-1 model was used.

  • FR Sensitivity Evaluation

    Best Servers - on 30 Fold Recognition Targets. See Detailed Table
    N-1 Rank | Servers           | Sum MaxSub Score | # Correct
    1        | 3ds5 robetta      | 5.17-5.25        | 15-17
    3        | pmod 3ds3 pmode3  | 4.21-4.36        | 13-14
    4        | raptor            | 3.98             | 13
    5        | shgu              | 3.93             | 13
    6        | 3dsn orfeus       | 3.64-3.90        | 12-13
    7        | pcons3            | 3.75             | 12
    8        | fugu3 orf_c       | 3.39-3.67        | 11-12
    10       | fugsa orf_b       | 3.44-3.63        | 10-12
    48       | pdbblast          | 0.00             | 0
    Meta-servers are listed in italics, pdbblast's rank is included for reference. Notice that due to the N-1 rule, the Score column may show larger values for servers at lower ranks.

    It is clear that the best sensitivity is achieved by some of the meta-predictors. Among the individual servers, raptor and shgu follow closely at the "N-1" ranks 4-5, and orfeus at rank 6.
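    The tied ranks in the table above (e.g. two servers at rank 1 and none at rank 2) come from the "N-1" procedure. As a toy illustration only, the sketch below assumes one plausible reading of that rule: server A outranks server B only if A's summed MaxSub score stays higher no matter which single target is left out, and servers that cannot be separated this way share a rank. The server names and per-target scores are invented for illustration; the actual CAFASP procedure is defined in the evaluation description.

    ```python
    def n1_beats(a, b):
        """True if server a outranks server b under a leave-one-out rule:
        a's summed score must exceed b's no matter which single target is
        dropped. (A hypothetical reading of the CAFASP 'N-1' procedure.)"""
        ta, tb = sum(a), sum(b)
        return all((ta - a[i]) > (tb - b[i]) for i in range(len(a)))

    # Invented MaxSub scores for three toy servers on 4 targets.
    scores = {
        "srvA": [1.0, 0.9, 0.8, 1.1],
        "srvB": [0.5, 0.6, 0.4, 0.7],
        "srvC": [1.0, 0.9, 0.9, 1.0],
    }

    # Rank = 1 + number of servers that beat this one for every left-out
    # target; servers nobody separates share a rank, as in the tables above.
    ranks = {s: 1 + sum(n1_beats(scores[t], scores[s]) for t in scores if t != s)
             for s in scores}
    ```

    Here srvA and srvC cannot be separated (dropping the third target erases srvA's edge), so both get rank 1 and srvB drops to rank 3, mirroring the rank pattern seen in the tables.
    
    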

  • FR Specificity Evaluation (accuracy of confidence estimates)

    Fold recognition servers usually report a list of top hits for each prediction, along with an assigned score. We used this information to compute the specificity (selectivity) of the servers, as in LiveBench. Only the top-performing servers with reported and parseable scores are included here, which means that some good servers may have been excluded because we did not know how to interpret their output (please let us know if you find such cases). As in LiveBench, the specificity computation does not partition targets into domains; we considered all FR and HM-hard targets that have no domains in the HM category. For multi-domain targets (unless otherwise noted), the best domain result was considered. In total, 33 targets are considered here.

    Most Specific Servers. See Detailed Table
    Servers    | Avg Spec. Score | # Correct > 1st False Positive | # Correct > 3rd False Positive
    3ds5       | 24.8            | 20 @ 4.99                      | 25 @ 4.06
    pmodel     | 22.6            | 20 @ 1.34                      | 23 @ 1.25
    3ds3/3dsn  | 22.2            | 19 @ 30.32                     | 22 @ 25.97
    pmodel3    | 22.0            | 16 @ 2.10                      | 23 @ 1.66
    pcons3     | 21.6            | 17 @ 1.79                      | 23 @ 1.34
    shgu       | 21.4            | 19 @ 22.87                     | 21 @ 14.18
    inbgu      | 19.8            | 17 @ 14.10                     | 20 @ 1.90
    fugu3      | 19.0            | 12 @ 8.78                      | 21 @ 4.55
    pdbblast   | 13.0            | 11 @ 0.02                      | 13 @ 0.04

    The scores of the 1st and 3rd false positive of each server are indicated after the @-sign. The best specificity was obtained by the meta-predictors, which were able to identify up to 60% of the targets (20 out of 33) with better scores than their first false positive (0% error rate), and up to 75% of the targets (25 out of 33) with 3 false positives (9% error rate). The most specific individual servers were shgu, inbgu and fugu3 with many others following closely. The table in the link above contains the scores of the first 5 false positives for all servers.
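    The "# Correct > k-th False Positive" bookkeeping can be sketched as follows, assuming the LiveBench-style procedure described above: sort a server's predictions by its own reported confidence score and count the correct ones ranked above its k-th wrong prediction. This is an illustrative sketch, not the actual evaluation code, and the example scores are invented.

    ```python
    def correct_above_kth_fp(predictions, k):
        """predictions: list of (reported_score, is_correct), one per target.
        Returns (n_correct, threshold): how many correct predictions score
        above the server's k-th false positive, and that false positive's
        score (the values after the '@' sign in the table above).
        If the server makes fewer than k errors, the threshold is None."""
        ordered = sorted(predictions, key=lambda p: p[0], reverse=True)
        n_correct = n_false = 0
        for score, ok in ordered:
            if ok:
                n_correct += 1
            else:
                n_false += 1
                if n_false == k:
                    return n_correct, score
        return n_correct, None

    # Invented predictions for a toy server on five targets.
    preds = [(9.0, True), (8.0, True), (7.0, False), (6.0, True), (5.0, False)]
    ```

    For this toy server, `correct_above_kth_fp(preds, 1)` gives 2 correct hits above its first false positive (scored 7.0), the same shape of result as a "20 @ 4.99" entry in the table.
    
    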

  • What is new in the CAFASP3 results?

    It seems that the CAFASP3 results contain little information that was not already apparent from the LiveBench results. All the top-ranking CAFASP3 servers are also regular LiveBench participants. LiveBench, like CAFASP3, has demonstrated that current meta-predictors perform best, with a handful of individual servers following. The differences among these individual servers are very slight, and thus the ranking differences between them may not be very significant.

    Although no non-LiveBench servers ranked at the very top, their presence, in conjunction with the other LiveBench servers, has been very valuable: their predictions can be of great help in improving human and consensus predictions. This was evidenced by the excellent performance of 3D-JURY. The participation of dozens of servers allows a consensus prediction to be compiled from the results of all participants. In the spirit of the CAFASP-CONSENSUS of CAFASP2, the 3D-JURY predictions were compiled for each target and published in the CAFASP summary pages. 3D-JURY, developed by Leszek Rychlewski, was not a CAFASP participant. However, its advantage over the other meta-predictors is the availability of the results of many CAFASP servers, which are usually unavailable to meta-predictors outside CAFASP. In addition, the 3D-JURY results computed in the CAFASP summary pages also included the predictions of the meta-predictors and of the meta-meta-predictor robetta; thus, 3D-JURY can be considered a meta-meta-meta-predictor. The detailed results of 3D-JURY will be presented elsewhere. Applying the N-1 rule to 3D-JURY's predictions shows that it would have received rank 1 in both the FR and HM categories.

    The 3D-JURY predictions published at the CAFASP summary pages, together with those of the meta-predictors, provide a baseline that any human CASP predictor could have used (and hopefully improved upon).

  • HM Sensitivity Evaluation

    Best Servers - 32 Homology Modeling Targets. See Detailed Table
    N-1 Rank | Servers                  | Sum MaxSub Score | # Correct
    1        | 3ds3                     | 19.89            | 32
    2        | 3ds5                     | 19.79            | 32
    3        | orfb samt02 shgu         | 19.27-19.52      | 32
    4        | pmod                     | 19.37            | 32
    5        | pco3                     | 19.25            | 32
    6        | inbgu fugu3              | 19.06-19.07      | 32
    7        | 3dsn robetta 3dpsm pmod3 | 18.70-19.04      | 32
    15       | pdbblast                 | 18.56            | 32
    Meta-servers are listed in italics, pdbblast's rank is included for reference. Notice that due to the N-1 rule, the Score column may show larger values for servers at lower ranks.

    The results of the HM evaluation show that the differences among the servers on the HM targets are very small. This evaluation considers C-alpha atoms only, and thus it gives no indication of the abilities of the full-atom-producing servers to model the full proteins.

    • Sensitivity Subdivision of the HM targets

      Because some servers submitted predictions only for the easy HM targets, we subdivide the HM Targets into easy and hard:

      Best Servers - 20 Easy HM Targets. See Detailed Table
      N-1 Rank | Servers           | Score
      1        | samt02 orfb inbgu | 14.91-15.1
      2        | pco3 3dsn         | 14.82-14.88
      3        | esypred 3ds3      | 14.87-14.95
      4        | orfblast fugu3    | 14.67-14.77
      17       | pdbblast          | 14.28

      Although it is hard to assess the slight differences between the servers using C-alphas only, it seems that on the easy HM targets many meta-predictors did not rank at the top. The "N-1" rank 1 was achieved by the individual servers samt02, orfb and inbgu. A purely HM server, esypred, showed excellent performance at rank 3.

      Best Servers - 12 Hard HM Targets. See Detailed Table
      N-1 Rank | Servers                                     | Score
      1        | 3ds5                                        | 5.13
      2        | 3ds3 shgu                                   | 4.93-5.02
      4        | pmod pmod3                                  | 4.60-4.68
      6        | orfeus orfb 3dpsm raptor fugu3 pco3 robetta | 4.33-4.43
      8        | samt02                                      | 4.18
      11       | pdbblast                                    | 4.28

      The hard HM category is where the servers produce interesting models with alignments accurate enough for full homology modelling. After some meta-predictors, many individual servers ranked near the top, including shgu, orfeus, orfb, 3dpsm, raptor, fugu3 and samt02. The differences among these individual servers are also very slight, and their exact ranking is probably not very meaningful.

  • What happens after CAFASP3?

    Continuous benchmarks. The continuous evaluation of LiveBench, EVA and PDB-CAFASP will make it possible to test new servers and post-CASP improvements within a few months, and will likely provide valuable information for CASP6.

    Evaluation methods. The automatic evaluation methods continue to be controversial, and new methods are being developed. Testing the new methods against the old ones and against the human assessment of CASP will hopefully result in better automated methods for future LiveBench and CAFASP experiments.

    TMW - the next ten?

    Two new experiments:

    • PDB-CAFASP, an experiment similar to LiveBench, except that, like EVA, it uses as targets pre-release PDB sequences with no close homologues of known structure.
    • MR-CAFASP, not a competitive experiment but a community-wide effort. The aim is to assess the value of predicted models as aids in the structure-determination process. PDB-CAFASP targets with strong predictions will be selected as (blind) test cases.

    And finally, in two years: CAFASP-4.