
CAFASP3 ALIGNMENT ACCURACY EVALUATION
The results of the automatic Fold Recognition evaluation are presented here.
They are independent of any human
assessment that may be carried out.
The evaluation followed the
pre-announced rules and used the native structures available
on October 17, 2002. A few more structures have been released since then,
and these may be considered in a later evaluation, although we do not expect
any significant change to the current rankings.
See the evaluation description for information on:
Models considered for each target;
The scoring functions used;
Multi-domain partitioning;
Target classification into Homology Modeling Targets and Fold Recognition Targets;
The "N-1" ranking procedure;
The Sensitivity and Specificity computations.
Main Results
This table shows in detail a representative
selection of the best models
for some of the targets.
For each target, all models (up to 10
per server, over 50 servers)
were evaluated, and one of the top-scoring models was selected from among all
of the ca. 300-400 models. The selection
emphasizes that many servers produced some "best" model, and that
this best model was not always the server's rank-1 model.
Please note that these selections have no effect on the evaluations below,
which use only the rank-1 model.
It is clear that the best sensitivity is achieved by some of the
meta-predictors. Among the individual servers, raptor and shgu follow
closely at the "N-1" ranks 4-5, and orfeus at rank 6.
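The "N-1" ranking procedure itself is defined in the evaluation description and is not reproduced on this page. As a rough illustration only, the Python sketch below assumes one possible reading of such a rule: the ranking is recomputed N times, each time leaving out one of the N targets, and each server keeps the best rank it reaches in any of these rankings. The function name, the per-target score lists, and the example servers are all hypothetical and are not taken from the CAFASP3 data.

```python
from typing import Dict, List


def n_minus_1_best_ranks(scores: Dict[str, List[float]]) -> Dict[str, int]:
    """Best rank per server over all leave-one-target-out rankings.

    scores[server][t] is the (hypothetical) score of the server's
    rank-1 model on target t; higher is assumed to be better.
    """
    servers = list(scores)
    n_targets = len(scores[servers[0]])
    best = {s: len(servers) for s in servers}

    for left_out in range(n_targets):
        # Total score of every server on the N-1 remaining targets.
        totals = {
            s: sum(v for t, v in enumerate(scores[s]) if t != left_out)
            for s in servers
        }
        for s in servers:
            # Competition ranking: tied servers share the same rank.
            rank = 1 + sum(1 for o in servers if totals[o] > totals[s])
            best[s] = min(best[s], rank)
    return best


# Hypothetical scores for three servers on three targets.
example = {
    "server_a": [10.0, 0.0, 0.0],
    "server_b": [3.0, 3.0, 3.0],
    "server_c": [1.0, 1.0, 1.0],
}
ranks = n_minus_1_best_ranks(example)
```

Under this assumed reading, a server whose total depends on a single target (server_a here) and a consistently scoring server (server_b) can both reach rank 1, which is the kind of smoothing a leave-one-out rule is meant to provide when the number of targets is small.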
Fold recognition servers usually report a list of top hits for each prediction, along
with an assigned score. We used this information to compute
the specificity (or selectivity) of the servers,
as in LiveBench. Only the top-performing servers with reported and parseable
scores
are included here. This means that some good servers may have been excluded
because we did not know how to interpret their output (please let us know
if you find any such cases).
As in LB, for the specificity computation we did not
partition targets into domains, and we considered all FR and HM-hard targets that
have no domains in the HM category. For multi-domain targets
(unless otherwise noted), the best domain result
was considered. In total, 33 targets are considered here.
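The exact specificity formula is given in the evaluation description rather than on this page. The Python sketch below is a minimal illustration of a score-based specificity measure in this style, under stated assumptions: each server contributes one (score, is_correct) pair per target, a higher score means a more confident prediction, and specificity is taken as the mean number of correct predictions that score above the 1st, 2nd, ..., k-th false positive. The function names, the default of 5 false positives, and the example data are hypothetical.

```python
from typing import List, Tuple

Prediction = Tuple[float, bool]  # (reported score, prediction judged correct?)


def correct_before_kth_false(preds: List[Prediction], k: int) -> int:
    """Count correct predictions ranked above the k-th false positive."""
    ordered = sorted(preds, key=lambda p: p[0], reverse=True)
    correct = 0
    false_seen = 0
    for _score, is_correct in ordered:
        if is_correct:
            correct += 1
        else:
            false_seen += 1
            if false_seen == k:
                break
    # If fewer than k false positives exist, all correct ones are counted.
    return correct


def mean_specificity(preds: List[Prediction], max_false: int = 5) -> float:
    """Average the count over the 1st..max_false-th false positives."""
    counts = [correct_before_kth_false(preds, k) for k in range(1, max_false + 1)]
    return sum(counts) / max_false


# Hypothetical server output over six targets.
example = [(9.0, True), (8.0, True), (7.0, False),
           (6.0, True), (5.0, False), (4.0, True)]
value = mean_specificity(example, max_false=3)
```

A measure of this kind rewards servers whose scores separate correct from incorrect predictions, which is why only servers with parseable scores can be included.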
The CAFASP3 results appear to add little information
that was not already available from the LiveBench results.
All the CAFASP3 top-ranking servers are also regular LB participants.
LiveBench, like CAFASP3, has demonstrated that current meta-predictors have the
best performance, with a handful of individual servers following.
The differences among these individual servers are very slight, and thus
the ranking differences between them may not be very significant.
Although no non-LB servers ranked at the very top, their presence,
in conjunction with the other LB servers,
has been very valuable: their predictions can be of great
help to improve human and consensus predictions.
This was evidenced by the excellent performance of the 3D-JURY.
The participation of dozens of servers allows a consensus prediction
to be compiled from the results of all participating servers. In the
spirit of the CAFASP-CONSENSUS of CAFASP2, the 3D-JURY predictions were
compiled for each target and published in the CAFASP summary pages.
3D-JURY, developed by Leszek Rychlewski, was not a CAFASP participant.
However, the advantage of 3D-JURY over the other meta-predictors is
its access to
the results of many CAFASP servers, which are usually unavailable outside CAFASP to other
meta-predictors. In addition, the 3D-JURY results computed in the CAFASP
summary pages also included the predictions of the meta-predictors and
the meta-meta-predictor robetta. Thus,
3D-JURY can be considered a meta-meta-meta-predictor.
The detailed results of 3D-JURY will
be presented elsewhere. Applying the N-1 rule to 3D-JURY's predictions
shows that 3D-JURY would have received rank 1 (in both the FR and the HM categories).
The 3D-JURY predictions published in the CAFASP summary pages, together with
those of the meta-predictors, provide a
baseline that any human CASP predictor could have used
(and, hopefully, improved upon).
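The idea of compiling a consensus from many servers' models can be illustrated with a small Python sketch. This is not the 3D-JURY implementation, whose details are presented elsewhere; it is a simplified consensus scheme under stated assumptions: a pairwise structural similarity between models is already available, similarities below a threshold are ignored, and each model's score is the sum of its remaining similarities divided by the number of models. The function name, the threshold of 40, the normalization, and the similarity values are all hypothetical.

```python
from typing import Dict, Tuple


def consensus_scores(
    similarity: Dict[Tuple[str, str], float],
    threshold: float = 40.0,
) -> Dict[str, float]:
    """Score each model by how similar it is to all the other models.

    similarity maps an unordered pair of model names to a precomputed
    structural similarity (hypothetical values; higher means more alike).
    """
    models = sorted({name for pair in similarity for name in pair})
    scores: Dict[str, float] = {}
    for m in models:
        total = 0.0
        for other in models:
            if other == m:
                continue
            # Pairs are stored once, so look up both orderings.
            s = similarity.get((m, other), similarity.get((other, m), 0.0))
            if s >= threshold:  # weak similarities are ignored
                total += s
        scores[m] = total / len(models)
    return scores


# Three hypothetical models of one target from different servers.
sims = {
    ("model_a", "model_b"): 80.0,
    ("model_a", "model_c"): 60.0,
    ("model_b", "model_c"): 30.0,
}
scores = consensus_scores(sims)
selected = max(scores, key=scores.get)
```

The model that agrees most with the rest of the pool is selected, so a consensus of this kind benefits directly from having many independent servers contributing models.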
The results of the HM evaluation show that the
differences among the servers on the HM targets are very
small. This evaluation considers C-alpha atoms only, and thus
it gives no indication of the abilities of the full-atom-producing servers to model the full proteins.
Sensitivity Subdivision of the HM targets
Because some servers submitted predictions only for the easy HM targets, we subdivide the HM targets into easy and hard.
Best Servers - 20 Easy HM Targets. See Detailed Table
Although it is hard to assess the slight differences among the servers using only C-alpha atoms, it seems that in the easy HM targets, many meta-predictors did not rank at the top. The "N-1" rank-1 was achieved by the individual servers samt02, orfb and inbgu. A purely HM server, esypred, showed excellent performance at rank 3.
Best Servers - 12 Hard HM Targets. See Detailed Table
The hard HM category is where the servers produce
interesting models with sufficiently accurate alignments for full homology
modelling. After some meta-predictors, many individual servers ranked at
the top, including shgu, orfeus, orfb, 3dpsm, raptor, fugu3 and samt02.
The differences among these individual servers are also very slight,
and their exact ranking is probably not very meaningful.
Continuous benchmarks. The continuous evaluation of LiveBench, EVA and PDB-CAFASP will allow new servers and new post-CASP improvements to be tested within a few months. This will likely give valuable information for CASP6.
Evaluation methods. The automatic evaluation methods continue to be controversial, and new methods are being developed. Testing the new methods against the old ones and against the human assessment of CASP will hopefully result in better automated methods for future LiveBench and CAFASP experiments.
TMW - the next ten? Two new experiments:
And finally, in two years: CAFASP-4.