CAFASP-1: Critical Assessment of Fully Automated Structure Prediction Methods

Lawrence A. Kelley, Robert M MacCallum & Michael Sternberg
Biomolecular Modelling Laboratory
Imperial Cancer Research Fund
44 Lincoln's Inn Fields
London WC2A 3PX, England
e-mail: {kelley, r.maccallum, m.sternberg}@icrf.icnet.uk


Kevin Karplus
Computer Engineering
University of California, Santa Cruz 95064, USA
email: karplus@cse.ucsc.edu


Daniel Fischer *
Faculty of Natural Science,
Dept. of Math and Computer Science
Beer-Sheva 84015, Israel
e-mail: dfischer@cs.bgu.ac.il


Arne Elofsson
Dept. of Biochemistry,
Stockholm University, 106 91
Stockholm, Sweden
email: arne@biokemi.su.se


Adam Godzik (A), Leszek Rychlewski (B), Krzysztof Pawłowski(A)
(A) The Burnham Institute, La Jolla, CA 92037, USA
(B) San Diego Supercomputer Center La Jolla, CA 92093
e-mail: adam@burnham-inst.org, leszek@sdsc.edu, krzysiek@burnham-inst.org


David Jones and Kevin Bryson
Department of Biological Sciences
University of Warwick
Coventry, CV4 7AL, United Kingdom
e-mail: {jones,bryson}@globin.bio.warwick.ac.uk


Burkhard Rost
European Molecular Biology Laboratory
69 012 Heidelberg, Germany
e-mail: rost@embl-heidelberg.de


* Corresponding author, Tel: +972-7-647-2714, Fax: +972-7-647-2910
Keywords: Fold Recognition, Threading, Fully Automated Structure Prediction, Critical Assessment, CASP.
Short title: CAFASP-1.









Abstract



The results of the first Critical Assessment of Fully Automated Structure Prediction (CAFASP-1) are presented. The objective was to evaluate the success rates of fully automatic web servers for fold recognition which are available to the community. This study was based on the targets used in the third meeting on the Critical Assessment of Techniques for Protein Structure Prediction (CASP3). However, unlike CASP3, the study was not a blind trial, as it was held after the structures of most of the targets were known. The aim was to assess the performance of methods without the user intervention that several groups used in their CASP3 submissions. For each target, groups submitted the top-ranking folds generated from their servers. Only recognition of the correct fold was evaluated and not, as in CASP3, alignment accuracy. The results showed that current fully automated fold recognition servers can often identify remote similarities when pairwise sequence search methods fail. Nevertheless, in only a few cases outside the family-level targets was the score of the top-ranking fold significant enough to allow for a confident fully automated prediction. Although some performance differences appeared within each of the four target categories used here, overall, no single server proved markedly superior to the others.

CAFASP-1 has encouraged developers of automated methods to make their programs available to the community in the form of easy-to-use computer servers. Assessing the performance of automated programs can help a human predictor to establish a better strategy on how to use the automated results for his or her particular prediction. Finally, CAFASP-1 has been useful in identifying the requirements for a future blind trial of automated server-based fold prediction.


















Introduction



Fold recognition addresses a subset of the more general problem of prediction of the three-dimensional structure of a protein from its amino acid sequence. Fold-recognition methods search a library of known folds to find the most compatible protein for a given target sequence of unknown structure. The predictive power of such methods was clearly demonstrated in blind tests, such as CASP3, where prediction targets were not known at the time the predictions were made. However, to be accepted as a research tool, such methods must be used by a wider community in actual research. The recent growth of the Internet provided a perfect opportunity for groups developing structure prediction algorithms to make them available to their potential users: scientists interested in seeking structural insights for their particular research problems. Led by the astounding popularity of existing web servers, several groups have made their fold-recognition methods available to the community.

In this manuscript, we demonstrate the performance of fully automated fold-recognition methods on a set of sequences, selected from the prediction targets of the CASP3 experiment. Most of us took part in this meeting and submitted predictions, which in some cases are discussed in other manuscripts in this special issue of Proteins. Therefore, it is important to stress the differences between results presented in this manuscript and those available on the official CASP3 web site and discussed in other manuscripts in this issue.

Results presented here were generated automatically by web servers from submitted sequence data, with no human-expert intervention involved. In this respect, predictions presented here represent raw data that could be available to any user of a fold prediction server. An expert could easily improve such predictions, but his or her work often includes steps that are difficult to describe in a quantitative way and that are not easily reproducible. Such expertise was used extensively for predictions presented at the CASP3 meeting. In many cases extensive work by groups of several people was necessary to prepare a successful prediction. Someone lacking specialized expertise in fold prediction may not be able to achieve the same prediction accuracy as some of the predictions submitted at the CASP3 meeting. On the other hand, results from this manuscript are obtained with a few simple mouse clicks. Another difference between the results presented here and those from CASP3 is that for the latter, all the predictions were blind, whereas in CAFASP1, the experimental structures of all but 4 of the targets were already known. Thus, CAFASP1 is not a blind experiment. In addition, because the goals of CASP3 and CAFASP1 are different, the evaluation procedures used were also different. Consequently, the results shown here are not comparable with those reported in CASP3.

Our insistence on full automation of the fold prediction process is not meant to belittle or to cast any doubt on the importance of the specialized expertise in fold predictions. But for certain purposes, such as the evaluation and comparison of various prediction algorithms and strategies, it is useful to have fully automated and easily reproducible methods. This evaluation assesses the performance of the methods only and not the performance of humans using machines. This is what a non-expert user is most interested in: "Which program(s) should I choose to use in order to predict the structure of this new sequence?". And last, but not least, fully automated methods are necessary to apply fold prediction to large groups of protein sequences, such as those available from genome projects. CAFASP1 attempts to provide an assessment of the capabilities of current fold-recognition servers.

Methods


1. The automated methods (in alphabetical order).


Seven groups actively participated in CAFASP1.
Table I lists, for each group, the name of the server evaluated, its URL and the corresponding reference; Table II summarizes the main characteristics of the servers.

2. The targets.

For CAFASP1 we used the CASP3 targets as a benchmark for our methods. We classified the targets into 7 categories:

1. Targets with folds at SCOP's [1] family level.

In this category we chose the five targets with lowest sequence similarity to their corresponding folds from those classified in CASP3 as having homologs at the family level; the other family-level CASP3 targets were excluded from CAFASP1 because they do not pose any challenge to fold-recognition methods. The targets of this category included in CAFASP1 are
T0055, T0057, T0068, T0070, T0062 (for a full protein description see the assessors' papers in this Issue).

2. Targets with folds at SCOP's superfamily level:

T0074, T0081, T0083, T0063, T0053, T0044, T0054, T0085, T0080.

3. Targets with folds at SCOP's fold level:

T0046, T0071, T0043, T0067, T0059.

4. Multi-domain targets.

The domain boundaries for these targets were determined after the structures were known. Predictions for the single domains were included in CAFASP1 to evaluate how performance on single domains changes. The domains considered are: T0083.1, T0063.1, T0063.2, T0071.1, T0071.2, T0079.1, T0079.2.

5. Targets of unknown fold.

The structures of these targets have not yet been determined and thus they correspond to genuine blind predictions: T0045, T0051, T0072, T0078.

6. Targets with new folds.

T0052 and T0056. These targets were included in CAFASP1 for additional evaluation purposes (see below).

7. Targets that could not be evaluated in CAFASP1.

We included in this category those CASP3 targets for which we found that our evaluation procedure could not be applied (see below): T0061, T0075, T0077, T0079.


3. The evaluation method.

In CAFASP1, we used an evaluation scheme that tested only fold recognition, and not alignment. Each server produced a list of top-scoring folds, and if the first correct fold appeared at rank i, then 1/i points were awarded. (This scoring scheme is similar to that used for CASP-1.)

The rationale of this scoring system is as follows: Suppose a program always has the correct answer within the top i ranks; if only a single answer is desired, then, on average, the correct fold will be predicted with probability 1/i.
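
The inverse-rank rule can be illustrated with a minimal sketch. The function name and the fold labels below are hypothetical and are not taken from any CAFASP1 software; the sketch also assumes that a target with no correct fold in the submitted list receives zero points.

```python
from typing import Sequence, Set


def inverse_rank_score(ranked_folds: Sequence[str], correct_folds: Set[str]) -> float:
    """Return 1/i, where i is the 1-based rank of the first correct fold.

    Assumption: 0.0 is returned when no correct fold appears in the list.
    """
    for rank, fold in enumerate(ranked_folds, start=1):
        if fold in correct_folds:
            return 1.0 / rank
    return 0.0


# Hypothetical server output for one target (fold labels are made up).
hits = ["fold_A", "fold_B", "fold_C", "fold_D"]
print(inverse_rank_score(hits, {"fold_C"}))  # first correct hit is at rank 3
print(inverse_rank_score(hits, {"fold_Z"}))  # correct fold never listed
```

Only the first correct hit contributes, so listing the correct fold several times does not raise the score.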

For structure prediction, evaluating the quality of the sequence-structure alignments is critical, since fold-recognition methods can in some cases produce poor sequence-structure alignments. Unfortunately, for CAFASP1 evaluating alignments was not possible given the time constraints. Thus our evaluation procedure may award points to predictions that in CASP3 were considered incorrect. As progress in evaluation has been observed from CASP1 to CASP2 and CASP3, we hope that for CAFASP similar progress will be observed and alignment quality will be properly assessed in CAFASP2.

Another difficulty with this evaluation method is to identify what is the list of "correct" hits for each target. For the targets in category 1 (family-level) and 2 (superfamily-level) there was almost no difficulty. Any PDB chain with the same fold type in SCOP as the target was considered a correct hit; anything outside the fold type of the target was considered to be wrong. The only exception was T0085, which belongs to a "fold" in SCOP which, according to SCOP, is not a real fold, but only a collection of different folds. Thus, for T0085 we only accepted as correct hits those entries in T0085's superfamily.

Since we used SCOP to determine what folds were to be treated as correct, any reported hits that did not have a SCOP classification were excluded from the ranking before scoring.

In addition, we had to decide how to evaluate multi-domain targets (T0063, T0081, T0083 and T0079), each of which has two domains. For the full-sequence tests, hits belonging to the fold type of either domain were considered correct. The separate domains of these targets were evaluated in the domain-level category (see below).
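
As an illustration of the two rules above (removal of unclassified hits before ranking, and acceptance of either domain's fold for a full-length multi-domain target), the following sketch may be helpful. All identifiers and fold labels are hypothetical, and the mapping `fold_of` merely stands in for a lookup in SCOP.

```python
from typing import Dict, Optional, Sequence, Set


def score_full_sequence(
    ranked_hits: Sequence[str],
    fold_of: Dict[str, Optional[str]],
    domain_folds: Sequence[str],
) -> float:
    """Inverse-rank score for a full-length target.

    ranked_hits  -- template identifiers in the order reported by a server
    fold_of      -- maps a template identifier to its fold label; None or a
                    missing key means the template has no classification
    domain_folds -- fold label of each domain of the target
    """
    accepted: Set[str] = set(domain_folds)
    # Drop unclassified hits first, so that they do not consume a rank.
    classified = [h for h in ranked_hits if fold_of.get(h) is not None]
    for rank, hit in enumerate(classified, start=1):
        if fold_of[hit] in accepted:
            return 1.0 / rank
    return 0.0


# Hypothetical example: the top hit is unclassified and is ignored.
hits = ["tmpl_1", "tmpl_2", "tmpl_3"]
folds = {"tmpl_1": None, "tmpl_2": "fold_X", "tmpl_3": "fold_B"}
print(score_full_sequence(hits, folds, ["fold_A", "fold_B"]))  # 0.5
```

In this example the correct template is third in the raw list but second after filtering, so it earns 1/2 rather than 1/3.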

Unfortunately, we had to exclude T0079 from the full-sequence tests, and include it only in the domain tests. As a whole, it was not easy to select a single fold type as correct for T0079; several fold types can be considered as good hits, but only for the individual domains.

For the targets in category 3 (fold-level) we applied the same criteria (accept as correct hits those entries classified in SCOP as the same fold type). However, there were 3 targets (T0077, T0061 and T0075) for which we could not determine what single fold type should be considered as a correct hit. In addition, the similarities of these targets to known folds are weaker and do not cover the full sequences. Thus, we decided not to evaluate these targets and to place them together with T0079 into the category of non-evaluated targets (the results of the automated methods on these targets are included in the CAFASP1 web page at http://www.cs.bgu.ac.il/~dfischer/cafasp1/cafasp1.html). The targets in category 7 can be considered as not suitable for our simple evaluation method.

In category 4 we placed the domain predictions. Although determining the exact boundaries of a domain requires knowledge of the structure, we wanted to evaluate our methods also on single domains. In many cases, rough domain boundaries are also known from the sequence alone, as has been demonstrated in several cases in the CASP3 predictions. The evaluation criteria used here were the same as for the above categories.

No evaluation was possible for category 5, as the structures of these targets are still unknown. Thus our predictions for these targets are truly blind predictions. When the structures are released we will be able to evaluate them.

Finally, the two targets considered to be novel folds were placed in category 6. An ideal fold-recognition method should also be able to identify new folds, or at least give a very low score for the top-ranking fold. We filed predictions for targets in this category to observe how high our top-ranking fold scored for novel folds. This can be helpful in setting confidence thresholds for the methods when applied at larger scales.

The lists of folds considered to be correct hits for each target are included in the CAFASP1 web page.

Normalized scores.

To identify some of the remarkable predictions, that is, good predictions for targets that few predictors did well on, we applied the following normalization. Let Sij be the score received by program i on target j (computed as one divided by the rank of the first correct hit). Let Tj be the sum of scores for target j. We define the normalized score as Sij * Sij/Tj. The larger the normalized score, the more remarkable the prediction of program i for target j is.
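
The normalization can be sketched as follows. The data layout (a nested dictionary from program to target to score) and the program and target names are assumptions made for illustration; the case Tj = 0, which the definition above leaves open, is mapped to zero here.

```python
from typing import Dict

Scores = Dict[str, Dict[str, float]]  # program -> target -> inverse-rank score


def normalized_scores(scores: Scores) -> Scores:
    """Return Sij * Sij / Tj for every program i and target j."""
    # Tj: sum of the scores of all programs on target j.
    totals: Dict[str, float] = {}
    for per_target in scores.values():
        for target, s in per_target.items():
            totals[target] = totals.get(target, 0.0) + s

    result: Scores = {}
    for program, per_target in scores.items():
        result[program] = {
            # Assumption: when no program scored on a target, use 0.0.
            target: (s * s / totals[target]) if totals[target] > 0 else 0.0
            for target, s in per_target.items()
        }
    return result


# Hypothetical scores: only prog_a finds target_1; both find target_2.
example = {
    "prog_a": {"target_1": 1.0, "target_2": 1.0},
    "prog_b": {"target_1": 0.0, "target_2": 1.0},
}
print(normalized_scores(example))
```

Here prog_a keeps a normalized score of 1.0 on target_1, where it is the only program with a correct hit, whereas the rank-one hits on target_2 are worth only 0.5 each because both programs found the fold.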

The normalized scores do not change the overall rankings of the programs by much, and so are not shown, though the information is available on the main web page and the summary results web page.

The grading system.

To allow for a detailed comparison of performance we computed for each category and program a partial total of the inverse-rank scores. For each category we observed which programs obtained the highest total and subsequently added all the scores into an overall grade.
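
A minimal sketch of this bookkeeping is given below. The table layout and all names are hypothetical, and the sketch assumes that the overall grade is simply the sum of the per-category partial totals.

```python
from collections import defaultdict
from typing import Dict, Tuple

# (program, category, target) -> inverse-rank score
ScoreTable = Dict[Tuple[str, str, str], float]


def grade(table: ScoreTable) -> Tuple[Dict[str, Dict[str, float]], Dict[str, float]]:
    """Return (partial totals per program and category, overall grade per program)."""
    partial: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for (program, category, _target), score in table.items():
        partial[program][category] += score
    # Assumption: the overall grade is the sum of the partial totals.
    overall = {prog: sum(cats.values()) for prog, cats in partial.items()}
    return {prog: dict(cats) for prog, cats in partial.items()}, overall


# Hypothetical scores for two programs on three targets.
table = {
    ("prog_a", "superfamily", "t1"): 1.0,
    ("prog_a", "superfamily", "t2"): 0.5,
    ("prog_a", "fold", "t3"): 0.25,
    ("prog_b", "superfamily", "t1"): 0.5,
    ("prog_b", "fold", "t3"): 1.0,
}
partial, overall = grade(table)
best_superfamily = max(partial, key=lambda p: partial[p].get("superfamily", 0.0))
print(partial, overall, best_superfamily)
```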

4. Reproducibility and validity of the automated results.

We verified that the results reported here accurately correspond to those that are obtained by the automated programs. Most results were checked by at least one other person (besides the developer of the program). Thus, a reader can submit the sequence of any of the targets and expect to obtain essentially the same results (excluding the differences that will appear due to possibly updated databases). All the programs included in CAFASP1 are available freely through the internet.


Results

Table III is a summary of the results by program and by category. The individual inverse-rank scores by target and program are available in the corresponding tables from our main web page.

Category 1. Family-level targets

Almost all methods did well on the close homology targets, although three methods had some trouble with target T0068. To really distinguish between the performance of the methods in this category, evaluation of the alignments and scores is needed (see Discussion).

Category 2. Superfamily-level targets

This Category includes 9 targets and provides most of the variation in the final ranking of the methods. The two best-scoring programs (GenThreader and SDPMA2) received 5.00 and 4.43 points, respectively (column "SUPERFAMILY" in the table).

GenThreader's performance is remarkable in that its 5.00 points were obtained from correct predictions at rank one in five targets (and zeroes in the other four); frsvr_SDPMA2 had correct hits in the top 10 ranks for 8 of the 9 targets, but ranked the correct hit first for only three of the targets.

GenThreader [2] was originally designed for the purposes of structurally annotating genomic sequences. The method uses a combination of traditional profile-based sequence alignment, a set of threading potentials [3], and a neural network to evaluate the quality of the implied structural model. In the original version of GenThreader the profiles which make up the fold library were generated using a standard multiple sequence alignment method, but in the current version (used here) the profiles are generated in one step by using PSI-BLAST [4]. For each sequence-structure alignment, the threading potentials are summed for the implied model, and the energy sums and sequence alignment score are presented to a neural network. The neural network is used to detect favorable combinations of energy sums and alignment scores, and has been trained on known structural similarities found in the CATH structure classification database [5]. In the CASP3 predictions made by the Jones group, GenThreader was only used as a pre-filter to detect superfamily matches. Apart from targets T0074, T0083 and T0085, the GenThreader results were not considered significant, and so most of the results were arrived at using a full threading method (THREADER2).

frsvr_SDPMA2 is a variation of the SDP method previously described [6], which takes as input a multiple alignment of sequences homologous to the target and the predicted secondary structure given by PHD [15]. The multiple alignment or profile can be built using any method, but the current implementation in frsvr uses a very simple approach: it compiles significant hits from SWISSPROT [8] using a single BLAST [9] search and the multiple alignment is built using PROFILEMAKE [10]. The sequence-to-structure compatibility function combines (i) the sequence similarity between the multiple alignment and the sequence of a protein of known structure with (ii) the extent of agreement between the predicted and the observed secondary structures. SDP uses the global-local alignment algorithm [11] for ranking the folds in the library. The library includes full PDB chains and single domains from multi-domain proteins and currently contains over 2000 entries. The top ranks are sorted by their z-scores, which are obtained from the distribution of scores of the folds in the library. The folds of newer (or C-alpha only) PDB entries corresponding to the best matches for some of the CASP3 targets were not in the library; for CAFASP1 these have been included. For filing predictions for CASP3, the Fischer group used human intervention and in some cases, the fold obtained at rank one by frsvr was chosen. In other cases, because the z-score of the rank one result was below a confidence threshold, a different fold was chosen.
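
The z-score ranking mentioned above can be illustrated with a generic sketch. This is not the frsvr implementation: the text does not give the exact formula, so the sketch assumes the standard definition (raw score minus the library mean, divided by the population standard deviation of the library scores) and assumes that a higher raw score indicates a better match. All names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def rank_by_z_score(raw_scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Convert raw scores over a fold library into z-scores, best first."""
    values = list(raw_scores.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation of the library
    if sigma == 0.0:
        # Assumption: if all folds score the same, no fold stands out.
        return [(fold, 0.0) for fold in raw_scores]
    ranked = [(fold, (score - mu) / sigma) for fold, score in raw_scores.items()]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical raw compatibility scores of one target against a tiny library.
library_scores = {"fold_A": 12.0, "fold_B": 15.0, "fold_C": 40.0, "fold_D": 13.0}
for fold, z in rank_by_z_score(library_scores):
    print(fold, round(z, 2))
```

A z-score expresses how far the top hit stands out from the rest of the library, which is why it is more suitable than the raw score for setting a confidence threshold.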

The highest normalized scores in this category were also obtained by GenThreader and SDPMA2 for their correct prediction at rank one of target T0085. The second highest normalized score was obtained by 3 programs (BASIC, Karplus2 and Karplus3) in their correct identification at rank one for target T0044.

Category 3. Fold-level targets

The results for category 3 are shown in column "FOLD" of Table III. The best performing methods were Karplus1 and Karplus3 with 2.06 and 1.62 points, respectively, out of a possible maximum of 5. Clearly, the performance of our methods on the "fold-level" targets is not as good as that on the "superfamily" targets. The most outstanding result, judged by the normalized scores, was obtained by Karplus1 on target T0043; it was the only program that identified the correct fold at rank one.

The Karplus1 and Karplus3 methods are both SAM methods [12]. In SAM-T98 a hidden Markov model (HMM) is constructed from a single sequence and homologs that are found in a non-redundant protein database. The method alternates between searching the database for homologs using an HMM and realigning the homologs using Baum-Welch [12] training on the HMM. Only sequence information is used, not structure information. All scoring with HMMs was done with local scoring summing over all alignments. For Karplus1, an HMM was built for each fold in the fold library, and the target sequence was scored against all the HMMs. For Karplus2, an HMM was built for the target sequence, and the entire PDB database was scored. For Karplus3, the HMM scores for the template and target methods were added. For CASP3, hand-selection among the top few hits and hand-realignment were done, but subsequent analysis indicates that the fully automatic method does about as well overall as after modification by hand.
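
The combination used for Karplus3 can be sketched as follows. This is not SAM code: the function name, the fold labels and the numbers are hypothetical, the sketch assumes that only folds scored by both approaches are kept, and the direction of the scores (lower is better) is exposed as a parameter because it is an assumption rather than something stated above.

```python
from typing import Dict, List, Tuple


def combine_and_rank(
    template_scores: Dict[str, float],
    target_scores: Dict[str, float],
    lower_is_better: bool = True,
) -> List[Tuple[str, float]]:
    """Add the two HMM scores of each fold and rank the folds by the sum.

    template_scores -- target sequence scored against each template HMM
    target_scores   -- each template sequence scored against the target HMM
    """
    # Assumption: only folds scored by both approaches are kept.
    shared = template_scores.keys() & target_scores.keys()
    combined = {fold: template_scores[fold] + target_scores[fold] for fold in shared}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=not lower_is_better)


# Hypothetical scores for two folds (fold_C lacks a template-HMM score).
template = {"fold_A": -10.0, "fold_B": -2.0}
target = {"fold_A": -5.0, "fold_B": -1.0, "fold_C": -20.0}
print(combine_and_rank(template, target))
```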

It is interesting to notice that the best performers in this category were methods based on sequence-information alone. We have no explanation for this phenomenon, and previous tests of the SAM-T98 method indicated that it found the correct fold only when it found the correct superfamily. One possible partial explanation is that the SAM-T98 method relies on local alignment, so one does not need to match the entire fold to find a match. The sum-over-all-alignments scoring makes the method more tolerant of somewhat incorrect alignments than the structure-based methods, which usually require better alignments before they provide good scores. However, it is not possible to arrive at far-reaching conclusions from a sample of only 5 test cases.

When computing the sub-total from categories 1 through 3 the best performers are GenThreader, SDPMA2, and Karplus3 with 11.11, 10.22, and 10.14 points, respectively (column SUBTOTAL in Table III). These are the same top three as for the superfamily category alone, which contributes most of the variation between inverse-rank scores.

Category 4. Domain targets

This category does not strictly belong to a fully automated context because determination of the domain boundaries required previous knowledge. Nevertheless, because in an actual prediction experiment it is often suspected what the boundaries are, we also tested our programs using the exact domain definitions. In this category the best performers were SDPMA and SDPMA2 with 4.78 and 4.60 points, respectively, out of a maximum of 7.00 (data shown on our summary results web page). The SDPMA methods identified the correct fold in rank one for four targets and in rank two for one target. The next best performer in this category was (1D+3D)-PSSM with 3.81 points.

Fold recognition by 3D-PSSMs uses the SCOP database to identify remote homologues that are superposed in three dimensions to obtain sequence alignments that could not be obtained from sequence alone. The fold library consists of representative proteins with <40% sequence identity from SCOP (called SCOP40). A master protein in the SCOP40 library is selected (say A0) and, using PSI-BLAST, a 1D-profile is constructed incorporating sequences with >25% identity. This 1D-profile is directly converted into a 1D-PSSM using the PSI-BLAST approach. Then, from the structural alignment of, say, protein domains B0 and C0 to A0, multiple sequence alignments are piled up to generate a 3D-profile combining the 1D-profiles of A0, B0 and C0, which is then converted into a PSSM. The search algorithm is a global dynamic programming algorithm in which the sequence of the probe is matched against the template PSSM, together with scoring of the similarity between the predicted secondary structure of the probe and the experimental secondary structure of the template. In 1D-3D-PSSMs, searches are made with the 1D- and the 3D-PSSMs and the results are pooled and sorted by expectation value. The 3D and (3D-1D)-PSSM methods have been developed substantially since they were applied at CASP3.

The most remarkable normalized score achieved in this category was obtained by BASIC on target T0071.2. BASIC identified the correct fold at rank two, whereas the only two other methods that listed the correct fold placed it at rank eight or worse. The next most remarkable normalized result was obtained by SDPMA and SDPMA2 for identifying the correct fold of target T0063.1 at rank two, whereas only one other method had the correct fold, at rank nine.

Clearly, knowing the exact domain boundaries of a target sequence contributes significantly to the performance of most methods. For example, for T0083, four methods did better when given the correct domain; for T0063, 6 methods did better and one worse on the domains; and for T0071, 9 methods did better on the domains.

When summing up all scores from all four categories, the top programs are SDPMA2, SDPMA, GenThreader and Karplus3 with 14.82, 14.26, 13.90 and 13.44 points, respectively.

Categories 5-7.

There is no evaluation for targets in these categories, but the automated predictions submitted can be seen on our main web page. We will evaluate the predictions for targets of still unknown structure when the structures are determined, and update the appropriate web pages.

Selectivity of the methods

In addition to the sensitivity of the methods (i.e. the number of correct predictions), we have analyzed their selectivities. For a given threshold score s, we define selectivity here as the number of true positives at rank-1 with scores better than s. To this end, for each method we compiled its rank-1 predictions, including those for the targets with no known fold (T0052 and T0056 in Category 6). Then we set three threshold scores, Th1, Th2 and Th3 (different for each method), corresponding to the scores obtained by the first, second and third false positives, respectively. Finally, we counted the number of true positives with scores above Th1, Th2 and Th3, shown in Table IV. We excluded from this count the 5 family-level targets (Category 1; the number of rank-1 correct predictions above Th1 within Category 1 is shown separately in the last column of the table). The magnitudes of the scores differ between methods, depending on the scoring system used. For some methods a large score is a good result, and for others, a small number (or a large negative number) is a good result. Increasing or decreasing values of Th1, Th2 and Th3 give an indication as to what is considered a better score for each method.

Confidence thresholds help the user of an automated method to determine the reliability of a prediction. Table IV shows that for the CASP3 targets, the selectivities of the methods were not high. If conservative selectivity thresholds are set based on this small sample, it appears that no method is able to predict much beyond the family-level targets. The implications of this for automated fold recognition are further discussed below.
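
The counting procedure behind Th1, Th2 and Th3 can be sketched as follows. The function name and the example scores are hypothetical; the sketch assumes a strict comparison ("better than" the threshold), and assumes that when a method has fewer than three false positives, all of its true positives are counted for the missing thresholds. The caller is expected to have removed the family-level targets beforehand.

```python
from typing import List, Sequence, Tuple


def true_positives_above_thresholds(
    rank1: Sequence[Tuple[float, bool]],
    higher_is_better: bool = True,
    n_thresholds: int = 3,
) -> List[int]:
    """Count true positives scoring better than the 1st, 2nd, ... false positive.

    rank1 -- (score, is_correct) of the rank-1 prediction for each target
    """
    sign = 1.0 if higher_is_better else -1.0
    # Work with "larger is better" values whatever the method's convention.
    true_pos = [sign * s for s, ok in rank1 if ok]
    false_pos = sorted((sign * s for s, ok in rank1 if not ok), reverse=True)
    counts: List[int] = []
    for k in range(n_thresholds):
        if k < len(false_pos):
            counts.append(sum(1 for s in true_pos if s > false_pos[k]))
        else:
            # Assumption: no k-th false positive, so every true positive counts.
            counts.append(len(true_pos))
    return counts


# Hypothetical rank-1 results for one method (higher score is better).
results = [(9.0, True), (8.0, True), (7.5, False), (7.0, True),
           (6.0, False), (5.0, True), (4.0, False)]
print(true_positives_above_thresholds(results))  # [2, 3, 4]
```

The `higher_is_better` flag reflects the observation above that some methods treat large scores as good and others treat small or large negative scores as good.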





Discussion



At the very outset it is important to emphasize the difficulty in comparing the server results shown here in CAFASP1 with those from fold recognition in CASP3. First and foremost, CASP3 was a blind trial and any work carried out in retrospect is not. CAFASP1 must be considered as an exercise in benchmarking rather than verifiable blind prediction. Second, in the CASP3 experiment, although up to 5 predictions could be submitted, only the first model was actually assessed. Third, for two cases, targets T0063 and T0083, the sequences were divided into domains according to the boundaries apparent from the observed 3-D structure. Finally, in CASP3 the evaluation considered alignment accuracy in addition to fold assignment. Without consideration of alignment accuracy it is quite possible that some of the predictions assumed to be correct in CAFASP1 may produce very poor models, even to the extent that the correct domain match may be missed entirely. In these cases it must be presumed that the correct fold has been found mostly by chance.

Not every aspect of CASP3 is ideal, however. Submissions in CASP3 often included manual input based on structural or functional interpretation of the results of algorithms. For example, given that a particular target was known to bind DNA, groups were quite free to ignore any highly ranked fold which was not also known to bind DNA. In CASP2, the best fold recognition results were entered by a group that did not use a fold recognition algorithm, but instead relied entirely on evolutionary inferences based on the known function of the target proteins [13]. In CAFASP1, however, the fold assignments have been made exclusively by automatic servers, without any human interpretation of the results.

Accepting the fact that CAFASP1 was not a blind test, entrants in CAFASP1 were permitted to augment their template libraries with an entry from a recently deposited set of protein coordinates if that was required to ensure there was a correct hit in their library (although not all the participants took the opportunity to augment their libraries). The evaluation process excluded newer entries that are not included in the latest release of the SCOP database, which roughly corresponds to the structures available at the time CASP3 predictions were filed. The object of CAFASP1 was to evaluate the algorithms and not how recently each group had updated their fold libraries. Of course, in real applications, the ease of updating the library is an important aspect of the utility of each method. Clearly a server which is frequently updated is going to have a significant advantage over a server using an out-of-date template library. Furthermore, entrants have been able to further develop their methods over the six months since the last CASP3 prediction deadline.

One of the conclusions from this study is that no single approach is markedly superior to the others evaluated when considered across the entire range of targets. Some methods performed better at the superfamily level, others at the domain level. All methods performed poorly in the fold-level category; the differences between the methods in this category may not be statistically significant. The distinction between superfamily recognition and fold recognition is an important one. It might be expected that the methods which strongly weight sequence similarity would be the most effective methods at recognizing distant evolutionary relationships, whereas methods which emphasize structural information (e.g. predicted secondary structure and statistical potentials) would be most effective at recognizing similar folds in the absence of common ancestry. To some extent this was true in CAFASP1, although the relatively small number of targets does not allow a general conclusion to be drawn.

One limitation of the variety of methods tested in CAFASP1 is that no results have been included from pair potential-based threading methods (e.g. [14] [15]). Most such threading methods are not available as servers, but are distributed as standalone software packages which must be installed on the user's own machine. In the assessment of fold recognition results in CASP3, three of the six groups selected to present their results made use of this type of fold recognition (see their corresponding reports in this Issue). Unlike the classic potential-based threading methods, all of the methods in CAFASP1 explicitly make use of the sequence information in one form or another. Several of the methods in CAFASP1 also incorporate some structural information from the available coordinates. The observed secondary structure of an entry in the template library is matched with that predicted for the probe in Topits, frsvr and PSSM. In addition, PSSM uses structural similarity to obtain sequence profiles. Although GenThreader does not employ 3-D information in the alignment step, and hence is not a true potential-based threading method, it does make use of pair potentials in the evaluation of sequence-structure compatibility. However, it is not clear from the CAFASP1 results whether these algorithms have been able to extract so much more information from the available three-dimensional structures that they provide a markedly superior performance to methods such as SAM-T98 and BASIC that exploit the signals provided from multiple sequences alone. Nevertheless, the top performing methods in CAFASP1 do exploit structural information in addition to sequence information. To what extent their superior performance stems from their use of structural information or from other factors (such as better alignment algorithms or better statistical scoring measures) remains to be determined.

One area in which the CAFASP1 results are of particular interest is that of structural genomics. Automated approaches for fold recognition are essential if the wealth of data in genomes is to be exploited (e.g. [16], [17]). One important aspect of genomic fold assignment, however, is that folds must be assigned with a high degree of confidence. Even if a method frequently ranks correct folds in top place, if the scores for these matches are not significant then the results will be of little use for genome annotation. To assess this aspect of fold assignment it is necessary to evaluate how well a method discriminates correct match scores from incorrect match scores. Table IV shows that automatic fold-recognition methods are just beginning to discriminate correct from incorrect matches. Although a number of true positives were identified above the first false positive threshold (Th1), their scores do not necessarily lie above the methods' confidence thresholds (see last column of Table II). Improvements in this aspect will result in a wider applicability of automated fold-recognition methods at a genomic scale.
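The kind of discrimination measure discussed above can be illustrated with a minimal sketch. It assumes that Th1 is the score of the best-scoring incorrect match for a given server, that higher scores mean more confident matches (servers reporting E-values would need the comparison reversed), and that hits from all targets are pooled. The function name and the example scores are hypothetical and are not taken from the CAFASP1 data.

```python
from typing import List, Tuple


def true_positives_above_th1(hits: List[Tuple[float, bool]]) -> int:
    """Count correct matches scoring strictly above the best incorrect match.

    hits: (score, is_correct) pairs pooled over all targets for one server.
    Assumption: a higher score means a more confident match.
    """
    false_scores = [score for score, is_correct in hits if not is_correct]
    if not false_scores:
        # No false positives at all: every correct match counts.
        return sum(1 for _, is_correct in hits if is_correct)
    th1 = max(false_scores)  # score of the first (best-scoring) false positive
    return sum(1 for score, is_correct in hits if is_correct and score > th1)


# Hypothetical scores for illustration only.
example_hits = [(9.1, True), (7.4, True), (6.8, False), (6.5, True), (3.2, False)]
print(true_positives_above_th1(example_hits))  # prints 2
```

In this toy example two correct matches score above the first false positive, while a third correct match falls below it and could not be separated from incorrect matches on score alone. Whether the two separable matches would also exceed a server's own confidence threshold is a separate question.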

Beyond genome analysis, automated fold recognition servers give the wider community ready access to the software. It is therefore essential that the accuracy of automatic methods of fold recognition is evaluated, to allow non-expert users to decide which methods are most reliable. The CASP experiment has already highlighted the value of blind trials, and this must of course be extended to CAFASP. Although the results discussed here are not from a blind trial, we consider that one important aspect of this study is to explore what kind of strategy is required for comparative blind trials of automated fold recognition. We intend CAFASP2 to be just such a blind trial, which will thus provide an invaluable insight into the abilities and limitations of automated protein fold recognition.



Acknowledgments


We thank J. Moult for his interest and support and for inviting us to join this Special Issue of Proteins; the many users of our servers and the many CASP3 participants who encouraged us to bring about this event under very strict time constraints; the experimentalists who permitted their sequences to be used as benchmarks in CASP3 (and CAFASP1); and, last, the advances in Internet technology that allowed CAFASP1 to take place. L A Kelley is supported by GlaxoWellcome. LAK, RMM and MJES thank Dr Saqi (GlaxoWellcome) for helpful discussions. A.G. and K.P. are supported by NIH grant no. GM60049. KJK is supported by NSF Grants BIR-9408579 and DBI-9808077, and DOE grant DE-FG03-95ER62112.






References



[1] Murzin, A.G. and Brenner, S.E. and Hubbard, T. and Chothia, C., SCOP: a structural classification of proteins data base for the investigation of sequences and structures. J. Molecular Biology, 247, 536-540, 1995.

[2] Jones, D.T. GenThreader. J. Molecular Biology, in press, 1999.

[3] Jones, D.T., Taylor, W.R., Thornton, J.M. A new approach to protein fold recognition. Nature, 358:86-89, 1992.

[4] Altschul, S.F. et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402, 1997.

[5] Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B. & Thornton. J.M. (1997) CATH - a hierarchic classification of protein domain structures. Structure. 5, 1093-1108.

[6] Fischer, D. and Eisenberg, D. Fold Recognition Using Sequence-Derived Predictions. Protein Science, 5:947-955, 1996.

[7] Rost, B. and Sander, C., Prediction of protein secondary structure at better than 70% accuracy. J. Molecular Biology 232:584-599, 1993.

[8] Bairoch, A. and Boeckmann, B., The Swiss-PROT protein sequence data bank. Nucl. Acids Res. 20,2019-2022, 1992.

[9] Altschul, S.F. and Gish, W. and Miller, W. and Myers, E.W. and Lipman, D.J. Basic local alignment search tool. J. Molecular Biology 215, 403-410, 1990.

[10] Genetic Computer Group, 1991.

[11] Fischer, D. and Elofsson, A. and Rice, D.W. and Eisenberg, D. Assessing the performance of inverted protein folding methods by means of an extensive benchmark. Proc. 1st. Pacific Symposium on Biocomputing 300-318, Jan. 1996.

[12] K. Karplus, C. Barrett, and R. Hughey. Hidden Markov Models for detecting Remote Protein Homologies. Bioinformatics, to appear, 1999.

[13] Murzin, A.G. and Bateman, A. Distant Homology Recognition Using Structural Classification of Proteins. Proteins, 105-112, 1997.

[14] Bryant, S.H. and Lawrence, C.E., An empirical energy function for threading protein sequence through folding motif. Proteins, 16, 92-112, 1993.

[15] Hendlich, M. and Lackner, P. and Weitckus, S. and Floeckner, H. and Froschauer, R. and Gottsbacher, K. and Casari, G. and Sippl, M.J. Identification of native protein folds amongst a large number of incorrect models. The calculation of low energy conformations from potentials of mean force. J. Molecular Biology 216, 167-180, 1990.

[16] Fischer, D. and Eisenberg, D. Assigning folds to the proteins encoded by the genome of Mycoplasma genitalium. Proc. Nat. Acad. Sci. 94:11929-11934, 1997.

[17] Rychlewski, L., Zhang, B., Godzik, A., Functional insights from structural predictions: analysis of the Escherichia coli genome. Protein Science, in press, 1999.

[18] Arne Elofsson, Daniel Fischer, Danny W. Rice, Scott M. LeGrand & David Eisenberg, A study of combined structure-sequence profiles. Folding & Design 1, 451-461, 1996.

[19] Huynen et al., Homology-based fold predictions for Mycoplasma genitalium Proteins. J. Molecular Biology 280(3):323-6, 1998.

[20] Rost, B., TOPITS: Threading one-dimensional predictions into three-dimensional structures. Proc. Conf. Intelligent Systems in Molecular Biology, ISMB-95, 314-321, 1995.

[21] J. Park, K. Karplus, C. Barrett, R. Hughey, D. Haussler, T. Hubbard, and C. Chothia. Sequence Comparisons Using Multiple Sequences Detect Twice as Many Remote Homologues As Pairwise Methods. J. Molecular Biology 284(4):1201-1210, 1998.