Bhageerath: an energy based web enabled computer software suite for li
Nucleic Acids Research
Department of Chemistry and Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110 016, India
*To whom correspondence should be addressed. Tel: +91 11 2659 1505; Fax: +91 11 2658 2037; Email: bjayaram@chemistry.iitd.ac.in
ABSTRACT
We describe here an energy based computer software suite for narrowing down the search space of tertiary structures of small globular proteins. The protocol comprises eight different computational modules that form an automated pipeline. It combines physics based potentials with biophysical filters to arrive at 10 plausible candidate structures starting from sequence and secondary structure information. The methodology has been validated here on 50 small globular proteins consisting of 2–3 helices and strands with known tertiary structures. For each of these proteins, a structure within 3–6 Å RMSD (root mean square deviation) of the native has been obtained in the 10 lowest energy structures. The protocol has been web enabled and is accessible at http://www.scfbio-iitd.res.in/bhageerath.
INTRODUCTION
Predicting the tertiary structure of a protein from its amino acid sequence alone is one of the fundamental unsolved problems in computational biology and molecular biophysics (1). The spontaneous folding of protein molecules, which have a large number of degrees of freedom, into a unique three-dimensional (3-D) structure is of intrinsic scientific interest and is also important for structure based drug design. The cost and time involved in experimental techniques call for an early in silico solution to the protein folding problem (2). The ultimate goal is to use computer algorithms to identify amino acid sequences that not only adopt particular 3-D structures but also perform specific functions, i.e. to propose designer proteins (3).
Contemporary approaches to protein structure prediction can be broadly classified into two categories: (i) comparative modeling, which includes homology modeling and threading (4–7), and (ii) de novo folding (8–12). The first category of methods uses the structures of already solved proteins as templates (either locally or globally, at the sequence level or at the sub-structure level). With large amounts of genome and proteome data accumulating from sequencing projects, comparative modeling has become the method of choice for characterizing sequences where related representatives of a family exist in structural databases (13–18). There are several web servers based on comparative modeling approaches, such as Swiss Model (4), CPHmodels (19), FAMS (20) and ModWeb (21). The assessors for comparative modeling at CASP6 (Critical Assessment of protein Structure Prediction methods) noted only small improvements in model quality, despite an increase in the number of available structures, and only marginal improvement in alignment accuracy when compared to CASP5 (22). A natural limit for these approaches is the quantity of information available in the structural databases. This highlights the importance of de novo techniques for protein folding.
Significant progress has been made in recent years towards physics-based computation of protein structure from a knowledge of the amino acid sequence. This approach, commonly referred to as an ab initio method (23–25), is based on the thermodynamic hypothesis formulated by Anfinsen (1973), according to which the native structure of a protein corresponds to the global minimum of its free energy under given conditions (26). Protein structure prediction by an ab initio method is accomplished by searching for the conformation corresponding to the global minimum of an appropriate potential energy function, without the use of secondary structure prediction, homology modeling, threading etc. (27). In contrast, methods characterized as de novo use ab initio strategies in part, together with database information used directly or indirectly. Table 1 summarizes different known web servers/groups for protein structure prediction and the function(s) therein. Tertiary structure prediction starting from sequence has been successfully demonstrated on protein sequences <85 residues in length by Baker's group (28,29) using a fragment assembly methodology. The ProtInfo web server by Samudrala et al. (30) predicts protein tertiary structure for sequences <100 amino acids using a de novo methodology, whereby structures are generated in a simulated annealing search phase that minimizes a target scoring function. The Scratch web server by Baldi et al. (31) predicts the protein tertiary structure as well as structural features starting from the sequence information alone. Astro-fold (32), an ab initio structure prediction framework by Klepeis and Floudas, employs local interactions and hydrophobicity for the identification of helices and beta-sheets respectively, followed by global optimization, stochastic optimization and torsion angle dynamics.
De novo structure prediction using the SimFold energy function with multi-canonical ensemble fragment assembly has been developed by Fujitsuka et al. (33). The function has been tested on 38 proteins along with the fragment assembly simulations and predicts structures within 6.5 Å RMSD (root mean square deviation) of the native in 12 of the cases. Arriving at structures between 3 and 6 Å RMSD of the native expeditiously using ab initio or de novo methodologies remains a formidable challenge.
Table 1 Some de novo/ab initio servers for protein folding
We have developed a computationally viable de novo strategy for tertiary structure prediction, processing and evaluation. The web server, christened Bhageerath, takes as input the amino acid sequence and secondary structure information for a query protein and returns 10 candidate structures for the native. In this article, we report the validation and testing of the protein structure prediction web suite Bhageerath with application to 50 small globular proteins. The programs are written in standard C++, with a total of more than 8000 lines of code, and are easily portable to any POSIX (UNIX, LINUX, IRIX and AIX) compliant system.
MATERIALS AND METHODS
The Bhageerath (www.scfbio-iitd.res.in/bhageerath) software suite for protein tertiary structure prediction narrows down the search space to generate probable candidate structures for the native. The flow chart of Bhageerath is depicted in Figure 1.
Figure 1 The flow of information in Bhageerath web server, starting with the input from the user to the final 10 predictions made available to the user.
The first module builds a 3-D structure from the amino acid sequence with the secondary structural elements in place. The second module generates a large number of trial structures by systematically sampling the conformational space of the loop dihedrals. The number of trial structures generated is 128^(n–1), where n is the number of secondary structural elements. These structures are generated by choosing seven dihedrals from each loop (three at each end and one from the middle of the loop) and sampling two conformations for each dihedral, which gives 2^7 = 128 conformations per loop. The values assigned to the dihedrals φ, ψ of each amino acid during structure generation are given in the supplementary information (Supplementary Table S1). The trial structures generated via dihedral sampling are screened in the third module through persistence length and radius of gyration filters (34), developed to reduce the number of improbable candidates. The resultant structures are refined in the fourth module by Monte Carlo sampling in dihedral space to remove steric clashes and overlaps involving main chain and side chain atoms. In module five, the structures are energy minimized to further optimize the side chains. The energy minimization is carried out in vacuum with a distance dependent dielectric for 200 steps (75 steps of steepest descent + 125 steps of conjugate gradient). Module six ranks the structures using an all atom energy based empirical scoring function (35) and selects the 100 lowest energy structures. Module seven reduces the probable candidates based on the protein regularity index of the φ and ψ dihedral values, using threshold values of 1.5 for φ and 4.0 for ψ (Thukral et al., manuscript accepted in J. Biosci.). Module eight further reduces the structures selected in the previous module to 10 using a topological equivalence criterion and the accessible surface area (36). The above eight modules are configured to run as a pipeline.
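The trial structure count in module two follows from simple combinatorics: seven sampled dihedrals with two conformations each give 2^7 = 128 states per loop, and n secondary structural elements are joined by n–1 loops. The Python sketch below illustrates that counting only; it is not the Bhageerath C++ code. The function names and the rule used to pick the "middle" dihedral index are assumptions made for illustration.

```python
from itertools import product

DIHEDRALS_PER_LOOP = 7   # three at each end of the loop plus one in the middle
STATES_PER_DIHEDRAL = 2  # two conformations sampled per dihedral


def n_trial_structures(n_elements):
    """Number of trial structures for n secondary structural elements."""
    n_loops = n_elements - 1
    return (STATES_PER_DIHEDRAL ** DIHEDRALS_PER_LOOP) ** n_loops


def sampled_indices(n_loop_dihedrals):
    """Indices of the seven sampled dihedrals in a loop.

    Assumption: the loop dihedrals are held in one ordered list and the
    'middle' one is taken at position n // 2.
    """
    if n_loop_dihedrals < DIHEDRALS_PER_LOOP:
        raise ValueError("loop too short to pick seven distinct dihedrals")
    m = n_loop_dihedrals
    return [0, 1, 2, m // 2, m - 3, m - 2, m - 1]


def enumerate_trials(n_elements):
    """Lazily yield every combination of loop states.

    Each trial is a tuple with one entry per loop; each entry is a tuple of
    seven 0/1 flags selecting one of the two conformations per dihedral.
    """
    per_loop = list(product(range(STATES_PER_DIHEDRAL), repeat=DIHEDRALS_PER_LOOP))
    return product(per_loop, repeat=n_elements - 1)


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, "elements ->", n_trial_structures(n), "trial structures")
```

The count grows by a factor of 128 with every additional loop, which is why the filters in module three are applied before any all-atom energy evaluation.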
Overview of the organization of the suite
Bhageerath is a fully automated, web enabled protein structure prediction software suite, made available through a convenient user interface, which returns 10 predictions for a given protein query sequence. The Bhageerath server page opens a window in which the user can paste a query protein sequence in FASTA format. The current version supports continuous sequences of up to 100 amino acids. The user is prompted to supply the amino acid ranges of the secondary structural elements as input. Upon submission the user receives a unique job ID for the sequence. The user has the option of providing an email ID to receive an output link to the 10 lowest energy candidate structures.
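The input requirements described above (a FASTA sequence of at most 100 residues plus residue ranges for the secondary structural elements) can be checked before submission. The sketch below is a hypothetical client-side check, not part of the server: the tuple format for the ranges and the one-letter codes "H" (helix) and "E" (strand) are assumptions, since the exact input syntax of the web form is not given here.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
MAX_LENGTH = 100                               # limit stated for the current version


def parse_fasta(text):
    """Return the sequence from a single-record FASTA string."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    sequence = "".join(lines[1:]).upper()
    if not sequence:
        raise ValueError("no sequence found after the header")
    unknown = set(sequence) - VALID_RESIDUES
    if unknown:
        raise ValueError("non-standard residue codes: %s" % sorted(unknown))
    if len(sequence) > MAX_LENGTH:
        raise ValueError("sequence longer than %d residues" % MAX_LENGTH)
    return sequence


def count_loops(seq_len, elements):
    """Validate (kind, start, end) ranges and return the number of loops.

    Ranges are 1-based and inclusive (an assumption) and must be given in
    sequence order without overlapping.
    """
    previous_end = 0
    for kind, start, end in elements:
        if kind not in ("H", "E"):
            raise ValueError("unknown element type: %r" % kind)
        if not (1 <= start <= end <= seq_len):
            raise ValueError("range %d-%d lies outside the sequence" % (start, end))
        if start <= previous_end:
            raise ValueError("ranges overlap or are out of order")
        previous_end = end
    return max(len(elements) - 1, 0)
```

For example, `count_loops(len(seq), [("H", 2, 12), ("H", 18, 30)])` returns 1 for a two-helix query, i.e. one loop to be sampled.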
RESULTS
We present here a performance appraisal of the protein tertiary structure prediction software suite on 50 globular proteins with known structures. All the proteins have been extracted from the Protein Data Bank (PDB) (37) and are functionally diverse. We extracted 8000 unique proteins from the PDB at 50% sequence similarity or less. From these 8000 unique proteins, we obtained 329 proteins satisfying the criteria that the number of residues is <100 and the number of secondary structural elements is two or three. We selected our test set of 50 proteins randomly from these 329 proteins. The length of the polypeptide chain varies from 17 to 70 residues and the total number of helices and strands ranges between two and three.
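The selection of the test set described above amounts to a filter followed by a random draw. The following Python sketch shows that procedure on synthetic records; the record fields, the seed and the entries themselves are invented for illustration and do not correspond to real PDB entries or to the authors' actual selection script.

```python
import random


def select_test_set(records, sample_size, seed=0):
    """Filter by size and element count, then draw a random sample.

    Returns (eligible, chosen). The criteria mirror the text: fewer than
    100 residues and two or three secondary structural elements.
    """
    eligible = [
        r for r in records
        if r["n_residues"] < 100 and 2 <= r["n_elements"] <= 3
    ]
    if sample_size > len(eligible):
        raise ValueError("not enough eligible proteins to sample from")
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return eligible, rng.sample(eligible, sample_size)


# Synthetic stand-in for a non-redundant list of structures (not real data).
records = [
    {"id": "entry%03d" % i, "n_residues": 40 + i, "n_elements": 2 + i % 3}
    for i in range(100)
]

eligible, chosen = select_test_set(records, sample_size=5)
print(len(eligible), "eligible;", [r["id"] for r in chosen])
```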
The results obtained for the 50 globular proteins with the web server are shown in Table 2. The table gives the PDB ID, the number of amino acids in the sequence and the number and type of secondary structural elements present in each protein in columns (i)–(iii). The number of structures obtained after the persistence length and radius of gyration filters is given in column (iv) of Table 2. The lowest RMSD obtained in the 100 structures along with its energy rank are provided in the next two columns, (v) and (vi). This is followed by the number of structures selected by the ProRegIn filter in column (vii). The number in parentheses in column (vii) indicates the number of structures with RMSD <6 Å among the selected structures. The lowest RMSD and the corresponding energy rank after selection with the ProRegIn filter are reported in columns (viii) and (ix). The structures selected after the Topology filter are reported in column (x), and the number in parentheses indicates the number of structures with RMSD <6 Å in the final 10 structures. The last two columns of Table 2 show the lowest RMSD with respect to the native obtained from amongst the 10 predicted structures along with the energy rank of that structure. For all the 50 test proteins, irrespective of the nature of the secondary structural elements and the length of the intervening loops, a few topologically correct structures within an RMSD of 3–6 Å from the native structure are obtained in the final 10 predicted structures. Thus, the ‘needle in a haystack’ problem can be reduced to finding a solution in the best 10 structures, at least for small proteins.
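The two quantities reported per protein, the lowest RMSD among the retained structures and the energy rank of that structure, can be computed as in the following Python sketch. It assumes the coordinates have already been optimally superimposed on the native (no fitting is done here) and that each candidate is summarized as an (energy, RMSD) pair; the function names and the numbers in the example are illustrative, not results from Table 2.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two pre-superimposed coordinate sets.

    Each argument is a list of (x, y, z) tuples in the same atom order.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def best_in_top_n(structures, n=10):
    """Lowest RMSD among the n lowest-energy structures and its energy rank.

    `structures` is a list of (energy, rmsd_to_native) pairs; rank 1 is the
    lowest-energy structure.
    """
    ranked = sorted(structures, key=lambda s: s[0])[:n]
    rank, (_, lowest) = min(enumerate(ranked, start=1), key=lambda item: item[1][1])
    return lowest, rank


# Made-up (energy, RMSD) pairs purely to show the call.
candidates = [(-120.0, 7.2), (-150.0, 5.1), (-110.0, 3.4), (-90.0, 2.0)]
print(best_in_top_n(candidates, n=3))
```

With `n=3` the structure at 2.0 Å is excluded because its energy ranks fourth, which illustrates why the energy rank of the best structure matters as much as its RMSD.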
Table 2 A performance appraisal of Bhageerath web server for 50 small globular proteins
Figure 2 shows a superimposition of the lowest RMSD structure with the respective native structures for all the 50 globular test proteins.
Figure 2 The superimposed lowest RMSD structures for the 50 small globular test proteins used for the validation of Bhageerath web server. The PDB IDs are shown underneath each structure. The predicted structure is shown in red and the native in blue.
The structures obtained with the protein structure prediction web server presented here were compared with those from six freely available homology modeling servers: CPHmodels (19), Swiss Model (4), EsyPred3D (38), ModWeb (21), Geno3D (39) and 3Djigsaw (40). While SwissModel, EsyPred3D, Geno3D and 3Djigsaw provide an option for template selection, the other two servers are automatic. For the 50 test proteins, we first carried out sequence alignment using PSI BLAST (41), and templates were selected such that the sequence similarity of the template is >30% and the template is not from the same family. For most of the proteins there was very little sequence similarity with proteins of other families, and the templates were restricted to the same family. In such cases the quality of the model built is quite high and the RMSD with respect to the native is <1 Å in a few cases. The proteins for which templates were selected from different families give RMSDs comparable to those obtained with Bhageerath web server. Table 3 shows the RMSD of the structures obtained by homology modeling from the respective web servers for all the 50 globular proteins. The template ID, percentage sequence similarity and alignment of the target-template sequence for each method and each structure therein are provided in the supplementary information (Supplementary Tables S2–S7). Thus, for new sequences with no known sequence homologues, the Bhageerath web server has the potential to predict a structure to within 3–6 Å RMSD of the native structure, with accuracies comparable to the homology modeling servers.
Table 3 A comparison of protein tertiary structure prediction accuracies with different homology modeling servers available in public domain
The 10 structures obtained from Bhageerath were further compared with the five candidate structures obtained from the ProtInfo web server (30) and 10 structures obtained with the ROBETTA software (28) configured locally. The results shown in Table 4 indicate that the server described here is able to predict structures with RMSDs comparable to those obtained by the ProtInfo web server and the ROBETTA software. Supplementary Table S8 in the supplementary information compares the GDT_TS scores, obtained using the LGA server (42), for structures from the Bhageerath and ProtInfo web servers and the ROBETTA software. The GDT_TS scores are also found to be comparable for structures obtained from these three structure prediction methodologies.
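For readers unfamiliar with GDT_TS, the score averages the percentage of residues that lie within 1, 2, 4 and 8 Å of their native positions. The Python sketch below computes a simplified version under a single, fixed superposition, which is an assumption of this example: the LGA program searches over many superpositions for each cutoff, so the value obtained here can only be equal to or lower than the score LGA would report. The function name and the toy coordinates are illustrative.

```python
import math


def gdt_ts_fixed(model, native, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS (0-100) for one fixed superposition.

    `model` and `native` are equal-length lists of (x, y, z) positions,
    typically one C-alpha atom per residue, already superimposed.
    """
    if len(model) != len(native) or not model:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    distances = [math.dist(a, b) for a, b in zip(model, native)]
    fractions = [
        sum(1 for d in distances if d <= cutoff) / len(distances)
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)


# Toy example: four residues displaced by 0.5, 3.0, 9.0 and 1.5 Å.
native = [(0.0, 0.0, 0.0)] * 4
model = [(0.5, 0.0, 0.0), (3.0, 0.0, 0.0), (9.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
print(gdt_ts_fixed(model, native))
```

Unlike RMSD, a single badly placed residue (the 9 Å outlier above) lowers the score only through its own contribution, which is one reason both measures are reported.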
Table 4 A comparison of protein tertiary structure prediction accuracy with ProtInfo web server and ROBETTA software available in the public domain for 50 test proteins
DISCUSSION
We describe here Bhageerath, an energy based computational web server for automated prediction of candidate tertiary structures. The web server permits predictive folding with moderate computational resources. The validation of the computational protocol on 50 globular proteins has shown that the web server selects one or more candidate structures within an RMSD of 3–6 Å with respect to the native in the 10 lowest energy structures. The results presented are for proteins having 2–3 secondary structural elements with α, β and α/β structures and are obtained solely from the amino acid sequence and secondary structure information (without the aid of multiple sequence alignment or fold recognition). The results provide a benchmark for the level of model accuracy one can expect from this web server.
All of the eight modules are currently being executed on a cluster with 32 dedicated UltraSparc III 900 MHz processors. In contrast to typical short return times (ranging from 1 to 10 min) for receiving results from comparative modeling servers, the expected prediction time with Bhageerath web server for two helix systems is 4–5 min while for three helix systems it is 2–3 h. However, this depends on the length of the sequence, number of secondary structure elements and the number of structures accepted after the biophysical filters for processing the energetics of each trial structure at the atomic level. It is currently able to process 4–5 normally sized jobs per day on 32 processors.
The current version of the web server elicits secondary structure information from the user. For new sequences where secondary structure information is not available, web based secondary structure prediction tools can be employed. We have characterized the results obtained from five different freely available secondary structure prediction servers (43–47) for the 50 test proteins. The predictions are provided in the supplementary information (Supplementary Table S9). We envisage the introduction of a secondary structure predictor in module one shortly. For larger systems, i.e. those containing more than 100 amino acid residues and those with more than three secondary structural elements, we plan to introduce loop filters to control the combinatorial explosion in the number of trial structures. We presently use two biophysical filters in module three for trial structure selection and plan to use a few more, such as hydrophobicity and packing fraction, at later stages. One could also profitably employ constraints on strands for sheet formation, constraints on metal ions to cluster residues, and disulphide bridges as filters for reducing the number of trial structures. The all atom empirical energy function used in module six was tested previously and was seen to separate the native from the decoy structures in 67 of the 69 protein sequences, from among 61 640 decoys studied (35). The scoring function calculates the non-bonded energy of each trial structure as a sum of electrostatic, van der Waals and hydrophobicity terms. There is scope for improvement in the scoring function, particularly in describing the hydrophobicity component. Work along these lines, and on a flexible Monte Carlo simulation strategy to bring the RMSD to within 3 Å of the native, is in progress.
The individual modules of Bhageerath are web enabled for free access. These include the four biophysical filters (persistence length, radius of gyration, hydrophobicity ratio and packing fraction), a protein structure optimizer, an all-atom empirical energy based scoring function and the ProRegIn utility. These are listed in Table 5 along with their corresponding URLs.
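Of the filters listed above, the radius of gyration is the simplest to state: it measures how compact a structure is, so extended trial structures can be discarded cheaply. The Python sketch below is a generic, unweighted implementation for illustration; it is not the Bhageerath utility, and the cutoff `max_rg` is a hypothetical parameter, since the acceptance criterion actually used is described in reference 34 rather than here.

```python
import math


def radius_of_gyration(coords):
    """Unweighted radius of gyration of a list of (x, y, z) positions."""
    if not coords:
        raise ValueError("no coordinates given")
    n = len(coords)
    cx = sum(x for x, _, _ in coords) / n
    cy = sum(y for _, y, _ in coords) / n
    cz = sum(z for _, _, z in coords) / n
    mean_sq = sum(
        (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 for x, y, z in coords
    ) / n
    return math.sqrt(mean_sq)


def passes_rg_filter(coords, max_rg):
    """Keep a trial structure only if it is at least as compact as max_rg.

    `max_rg` (in the same length unit as the coordinates) is a hypothetical
    threshold chosen by the caller.
    """
    return radius_of_gyration(coords) <= max_rg
```

As a usage example, a fully extended chain of ten points spaced 3.8 units apart along one axis has a radius of gyration of about 10.9 and would be rejected by `passes_rg_filter(chain, max_rg=8.0)`.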
Table 5 A list of modules of Bhageerath converted to independent web utilities with their respective URLs
SUPPLEMENTARY DATA
Supplementary Data are available at NAR online.
ACKNOWLEDGEMENTS
Funding from the Department of Biotechnology is gratefully acknowledged. Ms Kumkum Bhushan is a recipient of the Senior Research Fellow award from the Council of Scientific & Industrial Research (CSIR), India. Help
REFERENCES
1. Liwo, A., Khalili, M., Scheraga, H.A. (2005) Ab initio simulation of protein-folding pathways by molecular dynamics with united residue model of polypeptide chains. Proc. Natl Acad. Sci. USA, 102, 2362–2367.
2. Baker, D. (2000) A surprising simplicity to protein folding. Nature, 405, 39–42.
3. Klepeis, J.L. and Floudas, C.A. (2004) In silico protein design: a combinatorial and global optimization approach. SIAM News, 37, 1.
4. Guex, N. and Peitsch, M.C. (1997) SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis, 18, 2714–2723.
5. Sánchez, R. and Šali, A. (1997) Evaluation of comparative protein structure modeling by MODELLER-3. Proteins, 29, 50–58.
6. Panchenko, A.R., Marchler-Bauer, A.E., Bryant, S.H. (2000) Combination of threading potentials and sequence profiles improves fold recognition. J. Mol. Biol., 296, 1319–1331.
7. Skolnick, J.E. and Kihara, D. (2001) Defrosting the frozen approximation: PROSPECTOR-a new approach to threading. Proteins, 42, 319–331.
8. Aszodi, A., Gradwell, M.J., Taylor, W.R. (1995) Global fold determination from a small number of distance restraints. J. Mol. Biol., 251, 308–326.
9. Kolinski, A., Jaroszewski, L., Rotkiewicz, P., Skolnick, J. (1998) An efficient Monte Carlo model of protein chains. Modeling the short-range correlations between side group centers of mass. J. Phys. Chem., 102, 4628–4637.
10. Ortiz, A.R., Kolinski, A., Skolnick, J. (1998) Fold assembly of small proteins using Monte Carlo simulations driven by restraints derived from multiple sequence alignments. J. Mol. Biol., 277, 419–448.
11. Huang, E.S., Samudrala, R., Ponder, J.W. (1999) Ab initio fold prediction of small helical proteins using distance geometry and knowledge-based scoring functions. J. Mol. Biol., 290, 267–281.
12. Simons, K.T., Strauss, C., Baker, D. (2001) Prospects for ab initio protein structural genomics. J. Mol. Biol., 306, 1191–1199.
13. Rost, B. and Sander, C. (1996) Bridging the protein sequence-structure gap by structure predictions. Annu. Rev. Biophys. Biomol. Struct., 25, 113–136.
14. Guex, N., Diemand, A., Peitsch, M.C. (1999) Protein modeling for all. Trends Biochem. Sci., 24, 364–367.
15. Moult, J. (1999) Predicting protein three-dimensional structure. Curr. Opin. Biotechnol., 10, 583–588.
16. Al-Lazikani, B., Jung, J., Xiang, Z., Honig, B. (2001) Protein structure prediction. Curr. Opin. Struct. Biol., 5, 51–56.
17. Venclovas, C. (2001) Comparative modeling of CASP4 target proteins: Combining results of sequence search with three-dimensional structure assessment. Proteins, 45, 47–54.
18. Tramontano, A. and Morea, V. (2003) Assessment of homology based predictions in CASP5. Proteins, 53, 352–368.
19. Lund, O., Nielsen, M., Lundegaard, C., Worning, P. (2002) X3M a computer program to extract 3D models. Abstract at the CASP5 conference, A102.
20. Ogata, K. and Umeyama, H. (2000) An automatic homology modeling method consisting of database searches and simulated annealing. J. Mol. Graph. Model., 18, 305–306.
21. Sali, A. and Blundell, T. (1993) Comparative protein modeling by satisfaction of spatial restraints. J. Mol. Biol., 234, 779–815.
22. Tress, M., Ezkurdia, I., Graña, O., Lopez, G., Valencia, A. (2005) Assessment of predictions submitted for the CASP6 comparative modeling category. Proteins, 61, 27–45.
23. Scheraga, H.A. (1992) Some approaches to the multiple-minima problem in the calculation of polypeptide and protein structures. Int. J. Quantum Chem., 42, 1529–1536.
24. Scheraga, H.A. (1996) Recent developments in the theory of protein folding: searching for the global energy minimum. Biophys. Chem., 59, 329–339.
25. Vasquez, M., Nemethy, G., Scheraga, H.A. (1994) Conformational energy calculations on polypeptides and proteins. Chem. Rev., 94, 2183.
26. Anfinsen, C.B. (1973) Principles that govern the folding of protein chains. Science, 181, 223.
27. Pillardy, J. (2001) Recent improvements in prediction of protein structure by global optimization of a potential energy function. Proc. Natl Acad. Sci. USA, 98, 2329–2333.
28. Kim, D.E., Chivian, D., Baker, D. (2004) Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res., 32, W526–W531.
29. Bradley, P., Misura, K.M.S., Baker, D. (2005) Towards high-resolution de novo structure prediction for small proteins. Science, 309, 1868–1871.
30. Hung, L.-H., Ngan, S.-C., Liu, T., Samudrala, R. (2005) PROTINFO: new algorithms for enhanced protein structure predictions. Nucleic Acids Res., 33, W77–W80.
31. Cheng, J., Randall, A.Z., Sweredoski, M.J., Baldi, P. (2005) SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res., 33, W72–W76.
32. Klepeis, J.L. and Floudas, C.A. (2003) ASTRO_FOLD: A combinatorial and global optimization framework for ab initio prediction of three-dimensional structures of proteins from the amino acid sequence. Biophys. J., 85, 2119–2146.
33. Fujitsuka, Y., Chikenji, G., Takada, S. (2005) SimFold energy function for de novo protein structure prediction: consensus with Rosetta. Proteins, 62, 381–398.
34. Narang, P., Bhushan, K., Bose, S., Jayaram, B. (2005) A computational pathway for bracketing native-like structures for small alpha helical globular proteins. Phys. Chem. Chem. Phys., 7, 2364–2375.
35. Narang, P., Bhushan, K., Bose, S., Jayaram, B. (2006) Protein structure evaluation using an all-atom energy based empirical scoring function. J. Biomol. Struct. Dyn., 23, 385–406.
36. Hubbard, S.J. and Thornton, J.M. (1993) ‘NACCESS’, Computer Program. Department of Biochemistry and Molecular Biology, University College London, UK.
37. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
38. Lambert, C., Leonard, N., De Bolle, X., Depiereux, E. (2002) EsyPred3D: Prediction of proteins 3D structures. Bioinformatics, 18, 1250–1256.
39. Combet, C., Jambon, M., Deleage, G., Geourjon, C. (2002) Geno3D: Automatic comparative molecular modeling of protein. Bioinformatics, 18, 213–214.
40. Bates, P.A., Kelley, L.A., MacCallum, R.M., Sternberg, M.J.E. (2001) Enhancement of protein modeling by human intervention in applying the automatic programs 3D-JIGSAW and 3D-PSSM. Proteins, 45, 39–46.
41. Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
42. Zemla, A. (2003) LGA - a method for finding 3D similarities in protein structures. Nucleic Acids Res., 31, 3370–3374.
43. Bryson, K., McGuffin, L.J., Marsden, R.L., Ward, J.J., Sodhi, J.S., Jones, D.T. (2005) Protein structure prediction servers at University College London. Nucleic Acids Res., 33, W36–W38.
44. Rost, B., Yachdav, G., Liu, J. (2004) The PredictProtein server. Nucleic Acids Res., 32, W321–W326.
45. Cuff, J.A., Clamp, M.E., Siddiqui, A.S., Finlay, M., Barton, G.J. (1998) Jpred: a consensus secondary structure prediction server. Bioinformatics, 14, 892–893.
46. Sen, T.Z., Jernigan, R.L., Garnier, J., Kloczkowski, A. (2005) GOR V server for protein secondary structure prediction. Bioinformatics, 21, 2787–2788.
47. Frishman, D. and Argos, P. (1996) Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence. Protein Eng., 9, 133–142.
(B. Jayaram*, Kumkum Bhushan, Sandhya R. )
*To whom correspondence should be addressed. Tel: +91 11 2659 1505; Fax: +91 11 2658 2037; Email: bjayaram@chemistry.iitd.ac.in
ABSTRACT
We describe here an energy based computer software suite for narrowing down the search space of tertiary structures of small globular proteins. The protocol comprises eight different computational modules that form an automated pipeline. It combines physics based potentials with biophysical filters to arrive at 10 plausible candidate structures starting from sequence and secondary structure information. The methodology has been validated here on 50 small globular proteins consisting of 2–3 helices and strands with known tertiary structures. For each of these proteins, a structure within 3–6 ? RMSD (root mean square deviation) of the native has been obtained in the 10 lowest energy structures. The protocol has been web enabled and is accessible at http://www.scfbio-iitd.res.in/bhageerath.
INTRODUCTION
The tertiary structure prediction of a protein using amino acid sequence information alone is one of the fundamental unsolved problems in computational biology/molecular biophysics (1). The folding of protein molecules with a large number of degrees of freedom spontaneously into a unique three-dimensional (3-D) structure is of scientific interest intrinsically and due to its application in structure based drug design endeavors. The cost and time factors involved in experimental techniques urge for an early in silico solution to protein folding problem (2). The ultimate goal is to use computer algorithms to identify amino acid sequences that not only adopt particular 3-D structures but also perform specific functions i.e. to propose designer proteins (3).
Contemporary approaches for protein structure prediction can be broadly classified under two categories viz. (i) comparative modeling, which includes homology modeling and threading (4–7) and (ii) de novo folding (8–12). The first category of methods utilizes the structures of already solved proteins as templates (either locally or globally, at the sequence level or at the sub-structure level). With large amounts of genome and proteome data accumulating via sequencing projects, comparative modeling has become the method of choice to characterize sequences where related representatives of a family exist in structural databases (13–18). There are several web servers based on comparative modeling approaches such as Swiss Model (4), CPHmodels (19), FAMS (20) and ModWeb (21). The assessors for comparative modeling at CASP6 (Critical Assessment of protein Structure Prediction methods) have noted small improvements in model quality despite increase in the available structures but marginal improvement in alignment accuracy when compared to CASP5 (22). A natural limit for these approaches is the quantity of information available in the structural databases. This highlights the importance of de novo techniques for protein folding.
Significant progress has been made in recent years towards physics-based computation of protein structure, from a knowledge of the amino acid sequence. This approach, commonly referred to as an ab initio method (23–25) is based on the thermodynamic hypothesis formulated by Anfinsen (1973), according to which the native structure of a protein corresponds to the global minimum of its free energy under given conditions (26). Protein structure prediction using ab initio method is accomplished by a search for a conformation corresponding to the global-minimum of an appropriate potential energy function without the use of secondary structure prediction, homology modeling, threading etc. (27). In contrast, methods characterized as de novo use the ab initio strategies partly as well as database information directly or indirectly. Table 1 summarizes different known web servers/groups for protein structure prediction and the function(s) therein. The tertiary structure prediction of protein starting from its sequence has been successfully demonstrated on protein sequences <85 residues in length by Baker's group (28,29) using a fragment assembly methodology. The ProtInfo web server by Samudrala et al. (30) predicts protein tertiary structure for sequences <100 amino acids using de novo methodology, where by structures are generated using simulated annealing search phase which minimizes a target scoring function. Scratch web server by Baldi et al. (31) predicts the protein tertiary structure as well as structural features starting from the sequence information alone. Astro-fold (32) an ab initio structure prediction framework by Klepeis and Floudas employs local interactions and hydrophobicity for the identification of helices and beta-sheets respectively followed by global optimization, stochastic optimization and torsion angle dynamics. 
De novo structure prediction by simfold energy function with the multi-canonical ensemble fragment assembly has been developed by Fujitsuka et al. (33). The function has been tested on 38 proteins along with the fragment assembly simulations and predicts structures within 6.5 ? RMSD (root mean square deviation) of the native in 12 of the cases. Arriving at structures between 3 and 6 ? RMSD of the native expeditiously using ab initio or de novo methodologies remains a formidable challenge.
Table 1 Some de novo/ab initio servers for protein folding
We have developed a computationally viable de novo strategy for tertiary structure prediction, processing and evaluation. The web server christened Bhageerath takes as input the amino acid sequence and secondary structure information for a query protein and returns 10 candidate structures for the native. In this article, we report the validation and testing of the protein structure prediction web suite Bhageerath with application to 50 small globular proteins. The programs are written in standard C++, with a total of more than 8000 lines of code and are easily portable on any POSIX (UNIX, LINUX, IRIX and AIX) compliant system.
MATERIALS AND METHODS
Bhageerath (www.scfbio-iitd.res.in/bhageerath) software suite for protein tertiary structure prediction narrows down the search space to generate probable candidate structures for the native. The flow chart diagram of Bhageerath is depicted in Figure 1.
Figure 1 The flow of information in Bhageerath web server, starting with the input from the user to the final 10 predictions made available to the user.
The first module involves the formation of a 3-D structure from the amino acid sequence with the secondary structural elements in place. The second module involves generation of a large number of trial structures with a systematic sampling of the conformational space of loop dihedrals. The number of trial structures generated is 128^(n–1), where n is the number of secondary structural elements. These structures are generated by choosing seven dihedrals from each of the loops (three at each end and one from the middle of the loop) and sampling two conformations for each dihedral, which gives 2^7 = 128 conformations per loop. The values assigned for the dihedrals φ, ψ to each amino acid during structure generation are given in supplementary information (Supplementary Table S1). The trial structures generated via dihedral sampling are screened in the third module through persistence length and radius of gyration filters (34), developed for the purpose of reducing the number of improbable candidates. The resultant structures are refined in the fourth module by Monte Carlo sampling in dihedral space to remove steric clashes and overlaps involving atoms of the main chain and side chains. In module five, the structures are energy minimized to further optimize the side chains. The energy minimization is carried out in vacuum with a distance dependent dielectric for 200 steps (75 steps of steepest descent + 125 steps of conjugate gradient). Module six involves ranking of structures using an all atom energy based empirical scoring function (35), followed by selection of the 100 lowest energy structures. Module seven reduces the probable candidates based on the protein regularity index of the φ and ψ dihedral values, using threshold values of 1.5 for φ and 4.0 for ψ (Thukral et al., manuscript accepted in J. Biosci.). Module eight further reduces the structures selected in the previous module to 10 using a topological equivalence criterion and the accessible surface area (36). The above eight modules are configured to work in a conduit.
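The size of the trial set in module two can be illustrated with a short sketch. It reads the count as 128 raised to the power n–1, which follows from sampling two conformations for each of seven dihedrals in each of the n–1 loops. The function names and the use of abstract state indices (0 or 1) in place of actual dihedral values are assumptions made for illustration; the suite itself is written in C++ and its internals are not shown here.

```python
from itertools import product

DIHEDRALS_PER_LOOP = 7    # three at each end of the loop plus one in the middle
STATES_PER_DIHEDRAL = 2   # two sampled conformations per dihedral


def trial_structure_count(n_elements: int) -> int:
    """Number of trial structures for n secondary structural elements."""
    n_loops = n_elements - 1
    return (STATES_PER_DIHEDRAL ** DIHEDRALS_PER_LOOP) ** n_loops


def enumerate_loop_states(n_elements: int):
    """Yield one tuple of state indices (0 or 1) per trial structure.

    Each tuple has 7 * (n_elements - 1) entries, one per sampled dihedral.
    Mapping an index to a (phi, psi) value is left out of this sketch.
    """
    n_dihedrals = DIHEDRALS_PER_LOOP * (n_elements - 1)
    return product(range(STATES_PER_DIHEDRAL), repeat=n_dihedrals)


for n in (2, 3):
    print(n, "elements:", trial_structure_count(n), "trial structures")
```

For two secondary structural elements this gives 128 trial structures and for three it gives 16 384, which shows how quickly the count grows with each additional loop.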
Overview of the organization of the suite
Bhageerath is a fully automated, web enabled protein structure prediction software suite made available through a convenient user interface that returns 10 predictions for a given protein query sequence. The Bhageerath server page opens a window in which the user can paste a query protein sequence in FASTA format. The current version supports continuous sequences of up to 100 amino acids. The user is prompted to supply the secondary structural input as amino acid ranges. Upon submission, the user receives a unique job ID for the sequence. The user has the option to provide an email ID to receive an output link that contains the 10 lowest energy candidate structures.
RESULTS
We present here a performance appraisal of the protein tertiary structure prediction software suite on 50 globular proteins with known structures. All the proteins have been extracted from the Protein Data Bank (PDB) (37) and are functionally diverse. We extracted 8000 unique proteins from the PDB at 50% sequence similarity or less. From these 8000 unique proteins, we obtained 329 proteins satisfying the criteria that the number of residues is <100 and the number of secondary structural elements is two or three. We selected our test set of 50 proteins randomly from these 329 proteins. The length of the polypeptide chain varies from 17 to 70 residues and the total number of helices and strands is two or three.
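The selection of the test set described above can be sketched as a filter followed by a random draw. The record type, field names, seed and the synthetic records below are hypothetical and stand in for the non-redundant PDB subset; they are not the actual data used in this work.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ChainRecord:
    pdb_id: str        # placeholder identifier, not a real PDB entry
    n_residues: int    # length of the polypeptide chain
    n_elements: int    # number of helices plus strands


def eligible(rec: ChainRecord) -> bool:
    """Fewer than 100 residues and two or three secondary structural elements."""
    return rec.n_residues < 100 and 2 <= rec.n_elements <= 3


def pick_test_set(records, size=50, seed=0):
    """Randomly draw up to `size` eligible records (seed is an assumption)."""
    pool = [r for r in records if eligible(r)]
    rng = random.Random(seed)
    return rng.sample(pool, min(size, len(pool)))


# Synthetic records for demonstration only.
records = [ChainRecord(f"X{i:03d}", 20 + i, 2 + i % 3) for i in range(120)]
test_set = pick_test_set(records)
print(len(test_set), "proteins selected")
```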
The results obtained for the 50 globular proteins with the web server are shown in Table 2. Columns (i)–(iii) give the PDB ID, the number of amino acids in the sequence, and the number and type of secondary structural elements present in each protein. The number of structures obtained after the persistence length and radius of gyration filters is given in column (iv). The lowest RMSD obtained in the 100 lowest energy structures and its energy rank are provided in columns (v) and (vi). Column (vii) gives the number of structures selected by the ProRegIn filter; the number in parentheses indicates how many of the selected structures have RMSD <6 Å. The lowest RMSD and the corresponding energy rank after selection with the ProRegIn filter are reported in columns (viii) and (ix). The structures selected after the topology filter are reported in column (x), where the number in parentheses indicates the number of structures with RMSD <6 Å in the final 10 structures. The last two columns of Table 2 show the lowest RMSD with respect to the native obtained from amongst the 10 predicted structures, along with the energy rank of that structure. For all 50 test proteins, irrespective of the nature of the secondary structural elements and the length of the intervening loops, a few topologically correct structures within an RMSD of 3–6 Å from the native structure are obtained in the final 10 predicted structures. Thus, the ‘needle in a haystack’ problem can be reduced to finding a solution in the best 10 structures, at least for small proteins.
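The two quantities reported per protein, the lowest RMSD and its energy rank, can be illustrated with a minimal sketch. It assumes the predicted and native coordinates are already optimally superimposed and matched atom by atom; the superposition step itself (for example by a least-squares fit) is omitted. Function names and the sample values are illustrative and are not taken from Table 2.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) points.

    The two sets are assumed to be superimposed already.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def best_in_top(ranked_rmsds, top=10):
    """Return (lowest RMSD, energy rank) among the first `top` structures.

    `ranked_rmsds` is ordered by increasing energy; ranks start at 1.
    """
    window = ranked_rmsds[:top]
    if not window:
        raise ValueError("no structures supplied")
    rank, value = min(enumerate(window, start=1), key=lambda pair: pair[1])
    return value, rank


# Illustrative RMSD values (in Å) for energy-ranked structures.
print(best_in_top([7.2, 5.1, 4.3, 6.0]))
```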
Table 2 A performance appraisal of Bhageerath web server for 50 small globular proteins
Figure 2 shows a superimposition of the lowest RMSD structure with the respective native structures for all the 50 globular test proteins.
Figure 2 The superimposed lowest RMSD structures for the 50 small globular test proteins used for the validation of the Bhageerath web server. The PDB IDs are shown underneath each structure. The predicted structure is shown in red and the native in blue.
The structures obtained with the protein structure prediction web server presented here were compared with those from six freely available homology modeling servers: CPHmodels (19), Swiss Model (4), EsyPred3D (38), ModWeb (21), Geno3D (39) and 3Djigsaw (40). While Swiss Model, EsyPred3D, Geno3D and 3Djigsaw provide an option for template selection, the other two servers are automatic. For the 50 test proteins, we first carried out sequence alignment using PSI-BLAST (41) and selected templates such that the sequence similarity of the template is >30% and the template is not from the same family. For most of the proteins there was very little sequence similarity with proteins of other families, and the templates were restricted to the same family. In such cases the quality of the model built is quite high, and the RMSD with respect to the native is <1 Å in a few cases. The proteins for which the templates are selected from different families yield RMSDs comparable to those obtained with the Bhageerath web server. Table 3 shows the RMSD of the structures obtained by homology modeling from the respective web servers for all the 50 globular proteins. The template ID, percentage sequence similarity and alignment of the target-template sequence for each method and each structure therein are provided in supplementary information (Supplementary Tables S2–S7). Thus, for new sequences with no known sequence homologues, the Bhageerath web server has the potential to predict a structure to within 3–6 Å RMSD of the native structure, with accuracies comparable to the homology modeling servers.
Table 3 A comparison of protein tertiary structure prediction accuracies with different homology modeling servers available in public domain
Further comparison of the 10 structures obtained from Bhageerath was carried out with the five candidate structures obtained from the ProtInfo web server (30) and 10 structures obtained with ROBETTA software (28) configured locally. The results shown in Table 4 indicate that the server described here is able to predict structures with RMSDs comparable to those obtained by ProtInfo web server and ROBETTA software. Supplementary Table S8 in the supplementary information provides the comparison of the GDT_TS scores obtained using LGA server (42) for structures obtained with Bhageerath and ProtInfo web servers and ROBETTA software. The GDT_TS scores are also found to be comparable for structures obtained from these three different structure prediction methodologies.
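For readers unfamiliar with the GDT_TS score, the sketch below shows the commonly used definition: the average, over distance cutoffs of 1, 2, 4 and 8 Å, of the percentage of residues whose predicted position lies within the cutoff of the native position. This is a simplification under a stated assumption: it takes per-residue distances from a single fixed superposition, whereas the LGA server searches over many superpositions, so the value computed this way would generally be a lower bound. The function name and the sample distances are illustrative.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS (0-100) from per-residue distances in Å.

    `distances` holds one predicted-to-native distance per residue,
    measured under one fixed superposition (an assumption of this sketch).
    """
    if not distances:
        raise ValueError("at least one distance is required")
    n = len(distances)
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Illustrative distances for a four-residue fragment.
print(gdt_ts([0.5, 1.5, 3.0, 9.0]))
```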
Table 4 A comparison of protein tertiary structure prediction accuracy with ProtInfo web server and ROBETTA software available in the public domain for 50 test proteins
DISCUSSION
We describe here Bhageerath, an energy based computational web server for automated prediction of candidate tertiary structures. The web server permits predictive folding with moderate computational resources. The validation of the computational protocol on 50 globular proteins has shown that the web server selects one or more candidate structures within an RMSD of 3–6 Å with respect to the native in the 10 lowest energy structures. The results presented are for proteins having 2–3 secondary structural elements with α, β and α/β structures and are obtained solely from the amino acid sequence and secondary structure information (without the aid of multiple sequence alignment or fold recognition). The results provide a benchmark as to the level of model accuracy one can expect from this web server.
All of the eight modules are currently being executed on a cluster with 32 dedicated UltraSparc III 900 MHz processors. In contrast to typical short return times (ranging from 1 to 10 min) for receiving results from comparative modeling servers, the expected prediction time with Bhageerath web server for two helix systems is 4–5 min while for three helix systems it is 2–3 h. However, this depends on the length of the sequence, number of secondary structure elements and the number of structures accepted after the biophysical filters for processing the energetics of each trial structure at the atomic level. It is currently able to process 4–5 normally sized jobs per day on 32 processors.
The current version of the web server elicits secondary structure information from the user. For new sequences where secondary structure information is not available, web based secondary structure prediction tools can be employed. We have characterized the results obtained from five different freely available secondary structure prediction servers (43–47) for the 50 test proteins. The predictions are provided in the supplementary information (Supplementary Table S9). We envisage the introduction of a secondary structure predictor in module one shortly. For larger systems, i.e. those containing more than 100 amino acid residues and those with more than three secondary structural elements, we conceive the introduction of loop filters to control the combinatorial explosion in the number of trial structures. We presently utilize two biophysical filters in module three for trial structure selection and plan to utilize a few more, such as hydrophobicity and packing fraction, at later stages. One could also profitably employ constraints on strands for sheet formation, constraints on metal ions to cluster residues, and disulphide bridges as filters for reducing the number of trial structures. The all atom empirical energy function utilized in module six was tested previously and was seen to separate the native from the decoy structures in 67 of the 69 protein sequences from among 61 640 decoys studied (35). The scoring function calculates the non-bonded energy of each trial structure as a sum of the electrostatics, van der Waals and hydrophobicity terms. There is scope for improvement in the scoring function, particularly in describing the hydrophobicity component. Work along the above mentioned lines, as well as on a flexible Monte Carlo simulation strategy to bring the RMSD down to <3 Å of the native, is in progress.
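The ranking step of module six can be sketched as follows: each trial structure receives a total non-bonded energy equal to the sum of its electrostatics, van der Waals and hydrophobicity terms, and the lowest energy structures are retained. How each term is actually computed is described in (35) and is not reproduced here; the type name, function names and the numerical values below are assumptions made purely for illustration.

```python
import heapq
from typing import NamedTuple


class EnergyTerms(NamedTuple):
    electrostatics: float
    van_der_waals: float
    hydrophobicity: float


def total_energy(terms: EnergyTerms) -> float:
    """Non-bonded energy as a plain sum of the three components."""
    return terms.electrostatics + terms.van_der_waals + terms.hydrophobicity


def lowest_energy(structures, keep=100):
    """Return the names of the `keep` lowest energy structures, best first.

    `structures` maps a structure name to its EnergyTerms.
    """
    return heapq.nsmallest(
        keep, structures, key=lambda name: total_energy(structures[name])
    )


# Illustrative component values in arbitrary units.
structures = {
    "trial_a": EnergyTerms(-10.0, -5.0, -1.0),
    "trial_b": EnergyTerms(-20.0, -3.0, -2.0),
    "trial_c": EnergyTerms(1.0, 1.0, 1.0),
}
print(lowest_energy(structures, keep=2))
```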
The individual modules of Bhageerath are web enabled for free access. These include the four biophysical filters (persistence length, radius of gyration, hydrophobicity ratio and packing fraction), a protein structure optimizer, an all-atom empirical energy based scoring function and the ProRegIn utility. These are listed in Table 5 along with their corresponding URLs.
Table 5 A list of modules of Bhageerath converted to independent web utilities with their respective URLs
SUPPLEMENTARY DATA
Supplementary Data are available at NAR online.
ACKNOWLEDGEMENTS
Funding from the Department of Biotechnology is gratefully acknowledged. Ms Kumkum Bhushan is a recipient of the Senior Research Fellow award from the Council of Scientific & Industrial Research (CSIR), India. Help
REFERENCES
Liwo, A., Khalili, M., Scheraga, H.A. (2005) Ab initio simulation of protein-folding pathways by molecular dynamics with united residue model of polypeptide chains. Proc. Natl Acad. Sci. USA, 102, 2362–2367.
Baker, D. (2000) A surprising simplicity to protein folding. Nature, 405, 39–42.
Klepeis, J.L. and Floudas, C.A. (2004) In silico protein design: a combinatorial and global optimization approach. SIAM News, 37, 1.
Guex, N. and Peitsch, M.C. (1997) SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis, 18, 2714–2723.
Sánchez, R. and Šali, A. (1997) Evaluation of comparative protein structure modeling by MODELLER-3. Proteins, 29, 50–58.
Panchenko, A.R., Marchler-Bauer, A.E., Bryant, S.H. (2000) Combination of threading potentials and sequence profiles improves fold recognition. J. Mol. Biol., 296, 1319–1331.
Skolnick, J.E. and Kihara, D. (2001) Defrosting the frozen approximation: PROSPECTOR-a new approach to threading. Proteins, 42, 319–331.
Aszodi, A., Gradwell, M.J., Taylor, W.R. (1995) Global fold determination from a small number of distance restraints. J. Mol. Biol., 251, 308–326.
Kolinski, A., Jaroszewski, L., Rotkiewicz, P., Skolnick, J. (1998) An efficient Monte Carlo model of protein chains. Modeling the short-range correlations between side group centers of mass. J. Phys. Chem., 102, 4628–4637.
Ortiz, A.R., Kolinski, A., Skolnick, J. (1998) Fold assembly of small proteins using Monte Carlo simulations driven by restraints derived from multiple sequence alignments. J. Mol. Biol., 277, 419–448.
Huang, E.S., Samudrala, R., Ponder, J.W. (1999) Ab initio fold prediction of small helical proteins using distance geometry and knowledge-based scoring functions. J. Mol. Biol., 290, 267–281.
Simons, K.T., Strauss, C., Baker, D. (2001) Prospects for ab initio protein structural genomics. J. Mol. Biol., 306, 1191–1199.
Rost, B. and Sander, C. (1996) Bridging the protein sequence-structure gap by structure predictions. Annu. Rev. Biophys. Biomol. Struct., 25, 113–136.
Guex, N., Diemand, A., Peitsch, M.C. (1999) Protein modeling for all. Trends Biochem. Sci., 24, 364–367.
Moult, J. (1999) Predicting protein three-dimensional structure. Curr. Opin. Biotechnol., 10, 583–588.
Al-Lazikani, B., Jung, J., Xiang, Z., Honig, B. (2001) Protein structure prediction. Curr. Opin. Struct. Biol., 5, 51–56.
Venclovas, C. (2001) Comparative modeling of CASP4 target proteins: Combining results of sequence search with three-dimensional structure assessment. Proteins, 45, 47–54.
Tramontano, A. and Morea, V. (2003) Assessment of homology based predictions in CASP5. Proteins, 53, 352–368.
Lund, O., Nielsen, M., Lundegaard, C., Worning, P. (2002) X3M a computer program to extract 3D models. Abstract at the CASP5 conference, A102.
Ogata, K. and Umeyama, H. (2000) An automatic homology modeling method consisting of database searches and simulated annealing. J. Mol. Graph. Model., 18, 305–306.
Sali, A. and Blundell, T. (1993) Comparative protein modeling by satisfaction of spatial restraints. J. Mol. Biol., 234, 779–815.
Tress, M., Ezkurdia, I., Graña, O., Lopez, G., Valencia, A. (2005) Assessment of predictions submitted for the CASP6 comparative modeling category. Proteins, 61, 27–45.
Scheraga, H.A. (1992) Some approaches to the multiple-minima problem in the calculation of polypeptide and protein structures. Int. J. Quantum Chem., 42, 1529–1536.
Scheraga, H.A. (1996) Recent developments in the theory of protein folding: searching for the global energy minimum. Biophys. Chem., 59, 329–339.
Vasquez, M., Nemethy, G., Scheraga, H.A. (1994) Conformational energy calculations on polypeptides and proteins. Chem. Rev., 94, 2183.
Anfinsen, C.B. (1973) Principles that govern the folding of protein chains. Science, 181, 223.
Pillardy, J. (2001) Recent improvements in prediction of protein structure by global optimization of a potential energy function. Proc. Natl Acad. Sci. USA, 98, 2329–2333.
Kim, D.E., Chivian, D., Baker, D. (2004) Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res., 32, W526–W531.
Bradley, P., Misura, K.M.S., Baker, D. (2005) Towards high-resolution de novo structure prediction for small proteins. Science, 309, 1868–1871.
Hung, L.-H., Ngan, S.-C., Liu, T., Samudrala, R. (2005) PROTINFO: new algorithms for enhanced protein structure predictions. Nucleic Acids Res., 33, W77–W80.
Cheng, J., Randall, A.Z., Sweredoski, M.J., Baldi, P. (2005) SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res., 33, W72–W76.
Klepeis, J.L. and Floudas, C.A. (2003) ASTRO_FOLD: A combinatorial and global optimization framework for ab initio prediction of three-dimensional structures of proteins from the amino acid sequence. Biophys. J., 85, 2119–2146.
Fujitsuka, Y., Chikenji, G., Takada, S. (2005) SimFold energy function for de novo protein structure prediction: consensus with Rosetta. Proteins, 62, 381–398.
Narang, P., Bhushan, K., Bose, S., Jayaram, B. (2005) A computational pathway for bracketing native-like structures for small alpha helical globular proteins. Phys. Chem. Chem. Phys., 7, 2364–2375.
Narang, P., Bhushan, K., Bose, S., Jayaram, B. (2006) Protein structure evaluation using an all-atom energy based empirical scoring function. J. Biomol. Struct. Dyn., 23, 385–406.
Hubbard, S.J. and Thornton, J.M. (1993) ‘NACCESS’, Computer Program. Department of Biochemistry and Molecular Biology, University College London, UK.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
Lambert, C., Leonard, N., De Bolle, X., Depiereux, E. (2002) EsyPred3D: Prediction of proteins 3D structures. Bioinformatics, 18, 1250–1256.
Combet, C., Jambon, M., Deleage, G., Geourjon, C. (2002) Geno3D: Automatic comparative molecular modeling of protein. Bioinformatics, 18, 213–214.
Bates, P.A., Kelley, L.A., MacCallum, R.M., Sternberg, M.J.E. (2001) Enhancement of protein modeling by human intervention in applying the automatic programs 3D-JIGSAW and 3D-PSSM. Proteins, 45, 39–46.
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Zemla, A. (2003) LGA - a method for finding 3D similarities in protein structures. Nucleic Acids Res., 31, 3370–3374.
Bryson, K., McGuffin, L.J., Marsden, R.L., Ward, J.J., Sodhi, J.S., Jones, D.T. (2005) Protein structure prediction servers at University College London. Nucleic Acids Res., 33, W36–W38.
Rost, B., Yachdav, G., Liu, J. (2003) The PredictProtein server. Nucleic Acids Res., 32, W321–W326.
Cuff, J.A., Clamp, M.E., Siddiqui, A.S., Finlay, M., Barton, G.J. (1998) Jpred: a consensus secondary structure prediction server. Bioinformatics, 14, 892–893.
Sen, T.Z., Jernigan, R.L., Garnier, J., Kloczkowski, A. (2005) GOR V server for protein secondary structure prediction. Bioinformatics, 21, 2787–2788.
Frishman, D. and Argos, P. (1996) Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence. Protein Eng., 9, 133–142.
(B. Jayaram*, Kumkum Bhushan, Sandhya R. )