PPD v1.0—an integrated, web-accessible database of experimentally dete
Nucleic Acids Research, 2006
     Edward Jenner Institute for Vaccine Research, Compton, Berkshire, RG20 7NN, UK

    *To whom correspondence should be addressed. Tel: +44 1635 577954; Fax: +44 1635 577901 577908; Email: darren.flower@jenner.ac.uk

    ABSTRACT

    The Protein pKa Database (PPD) v1.0 provides a compendium of protein residue-specific ionization equilibria (pKa values), collated from the primary literature, in the form of a web-accessible PostgreSQL relational database. Ionizable residues play key roles in the molecular mechanisms that underlie many biological phenomena, including protein folding and enzyme catalysis. The PPD serves as a general protein pKa archive and as a source of data for the development and improvement of pKa prediction systems. The database is accessed through an HTML interface, which offers two fast, efficient search methods: an amino acid-based query and a Basic Local Alignment Search Tool (BLAST) search. Entries also give details of experimental techniques and links to other key databases, such as those of the National Center for Biotechnology Information and the Protein Data Bank, providing the user with considerable background information. The database can be found at the following URL: http://www.jenner.ac.uk/PPD.

    INTRODUCTION

    A significant proportion of chemical reactions involving proteins are mediated through electrostatic interactions of their ionizable residues (1). Such residues greatly influence the conformation of a protein and therefore its function (2,3), as demonstrated by their roles in folding mechanisms (4–6), enzyme catalysis and protein–protein interactions (7). In enzyme catalysis, residues can act as proton donors and acceptors within the catalytic site and help stabilize transition states, with a concomitant influence on the rate of reaction (8,9).

    The dissociation constant (Ka) is a measure of the acidity of a compound, i.e. its ability to donate a proton. Ka values range widely, from 10^10 for the strongest acids, such as sulphuric acid, to 10^-50 for the weakest, such as methane. A negative logarithmic scale is therefore usually applied (pKa = -log10 Ka), whereby the Ka values for sulphuric acid and methane become pKa values of -10 and 50, respectively. Generally, lower (more negative) pKa values correspond to stronger acids. The pKa values of individual amino acid residues in proteins are determined by the ionization of their side-chain groups. For the 20 natural amino acids, model side-chain pKa values range from 4.0 for the side-chain carboxyl of aspartate to 12.0 for the side-chain guanidinium group of arginine. Main-chain groups are not ionizable, although two additional ionizable groups exist at the N- and C-termini. Residues within proteins have pKa values that are moderated by their micro-environments (the nature of their near neighbours, the extent of hydrogen bonding and so on) and can therefore take on a range of values different from that of a model residue.
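
    The logarithmic conversion described above can be illustrated with a minimal Python sketch. The function names are our own, and the second function assumes a single, non-interacting ionizable site that obeys the Henderson-Hasselbalch relation; residues in real proteins may deviate from this idealized behaviour.

```python
import math


def pka_from_ka(ka: float) -> float:
    """Return pKa = -log10(Ka) for a dissociation constant Ka > 0."""
    if ka <= 0:
        raise ValueError("Ka must be positive")
    return -math.log10(ka)


def fraction_deprotonated(ph: float, pka: float) -> float:
    """Fraction of an isolated site that has lost its proton at a given pH.

    Assumes ideal Henderson-Hasselbalch behaviour for one non-interacting site.
    """
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


if __name__ == "__main__":
    # The two extremes of the Ka scale quoted in the text.
    print(pka_from_ka(1e10))
    print(pka_from_ka(1e-50))
    # A model aspartate side chain (pKa 4.0) at neutral pH.
    print(fraction_deprotonated(7.0, 4.0))
```

    At a pH equal to the pKa the site is half deprotonated, which is why shifts in pKa caused by the protein micro-environment translate directly into changes in the charge state at physiological pH.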

    NMR spectroscopy is the most widely used method for determining the pKa values of individual residues, with an accuracy of 0.1 pH units. Although many NMR methods are available, most entries in the Protein pKa Database (PPD) are derived using 1H, 13C and 15N experiments. Inaccuracies in NMR experiments stem from the range of pH values tested, variations in ionic strength and the reversibility of the titration (10). In light of this, combined methods that couple NMR spectroscopy with site-directed mutagenesis are now being used, and these yield more accurate pKa values (10,11).
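
    For orientation, a commonly used model for extracting a pKa from an NMR titration is given below. It assumes a single titrating group in fast exchange, so that the observed chemical shift is the population-weighted average of the protonated and deprotonated shifts; it is a textbook form and not necessarily the model used in every study collated in PPD. The symbols \delta_{HA} and \delta_{A^-} (limiting shifts of the protonated and deprotonated forms) are our notation.

```latex
\delta_{\mathrm{obs}}(\mathrm{pH}) =
  \frac{\delta_{HA} + \delta_{A^-}\,10^{\,\mathrm{pH} - \mathrm{p}K_a}}
       {1 + 10^{\,\mathrm{pH} - \mathrm{p}K_a}}
```

    Fitting this curve requires data on both sides of the transition, which is one reason why the range of pH values tested limits the accuracy of the resulting pKa.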

    The functional importance of ionizable residues has led to numerous attempts to predict individual residue-specific pKa values (12–16). pKa values are usually calculated from 3D structures using the Poisson–Boltzmann equation. However, calculated and experimentally measured pKa values can differ (13). Molecular dynamics simulations have also been used for such predictions, although these give only a marginal increase in accuracy (17).

    As only a handful of reviews have attempted to compile residue-specific protein pKa values (10,18,19), we developed a database to serve as a standard compendium against which new experimental or theoretical results can be compared. The PPD v1.0 contains >1400 amino acid pKa values, sourced from experimental data. Cross-references to several external databases have also been incorporated: the Protein Data Bank (PDB) (20), the Enzyme Nomenclature and Classification database (21) and the National Center for Biotechnology Information (NCBI) Entrez-Protein.

    DATABASE DEVELOPMENT

    PPD v1.0 has been implemented as a PostgreSQL relational database, which provides an appropriate infrastructure for all foreseeable future developments of the archive. The data were initially compiled in a Microsoft Access database after exhaustive searching of the primary literature, which included keyword searches of the NCBI PubMed database (http://www.ncbi.nlm.nih.gov/pubmed). The PostgreSQL database is structured into seven normalized tables, populated from a flat-file export of the Access database using Perl scripts integrated with SQL. As data are continually accumulating, archiving is an on-going process: automatic, periodic updates will be made to the PostgreSQL database.
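
    The step of populating normalized tables from a flat-file export can be sketched as follows. This is an illustration only: the seven-table PPD schema is not described here, so the two tables, their column names and the flat-file header are hypothetical, Python stands in for the Perl scripts, and the standard-library sqlite3 module stands in for PostgreSQL. The example rows are placeholders, not PPD records.

```python
import csv
import io
import sqlite3

# Hypothetical two-table layout; the real PPD schema (seven tables) is not given in the text.
SCHEMA = """
CREATE TABLE protein (
    protein_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    species    TEXT NOT NULL,
    UNIQUE (name, species)
);
CREATE TABLE pka_value (
    pka_id     INTEGER PRIMARY KEY,
    protein_id INTEGER NOT NULL REFERENCES protein(protein_id),
    residue    TEXT NOT NULL,
    position   INTEGER,
    pka        REAL NOT NULL,
    method     TEXT
);
"""


def load_flat_file(conn, flat_file):
    """Split each flat-file row into one protein record and one pKa record."""
    for row in csv.DictReader(flat_file):
        # Insert the protein only once, however many residues it contributes.
        conn.execute(
            "INSERT OR IGNORE INTO protein (name, species) VALUES (?, ?)",
            (row["protein"], row["species"]),
        )
        (protein_id,) = conn.execute(
            "SELECT protein_id FROM protein WHERE name = ? AND species = ?",
            (row["protein"], row["species"]),
        ).fetchone()
        conn.execute(
            "INSERT INTO pka_value (protein_id, residue, position, pka, method) "
            "VALUES (?, ?, ?, ?, ?)",
            (protein_id, row["residue"], int(row["position"]),
             float(row["pka"]), row["method"]),
        )
    conn.commit()


# Placeholder rows for illustration only; these are not PPD records.
EXAMPLE_EXPORT = """protein,species,residue,position,pka,method
Example protein A,Example species,GLU,35,6.1,NMR
Example protein A,Example species,ASP,52,3.7,NMR
Example protein B,Example species,HIS,12,6.8,UV
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
load_flat_file(conn, io.StringIO(EXAMPLE_EXPORT))
```

    Keeping protein-level fields in their own table means that a correction to a protein name or species is made once rather than in every pKa row, which is the practical benefit of normalization for an archive that is updated periodically.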

    The PPD user interface is provided by a series of HTML pages. Two searchable forms are available within the PPD site. One offers either a broad or a focussed PPD search. The other searches PPD using the Basic Local Alignment Search Tool (BLAST). These forms target either a Perl/SQL script or a CGI script, which in turn queries the database. The bespoke search engine facilitates fast, efficient and flexible data retrieval (see Searching the Database). PPD is freely available on the World Wide Web (http://www.jenner.ac.uk/PPD).

    DATABASE CONTENT

    The data within PPD were sourced from the primary literature to give >1400 entries, containing pKa values for >160 proteins (Table 1). The database contains pKa values for amino acid side-chains, as well as the N- and C-termini. Data are archived for all amino acid residues, with the exception of methionine. However, most entries focus on glutamate, lysine, histidine and aspartate, which together account for >75% of the data. As these four are all key ionizable residues, the apparent bias is not driven by our selection, but by the available experimental data. Very few data are currently available for arginine: its high pKa value (about 12) essentially precludes measurement by titration, as proteins will denature at such a strongly basic pH.

    Table 1 Database summary

    Cross-references to key external databases are also included. These provide links to the protein sequence, using NCBI Entrez-Protein, and any relevant protein structure in the PDB (20). If applicable, the enzyme classification is also given, with links to the Enzyme Nomenclature and Classification Database, developed in line with the International Union of Biochemistry and Molecular Biology (21), providing details of the enzyme reactions. In addition, a link is given to the original literature reference via the NCBI PubMed journals database. These links provide key background knowledge associated with each archived protein. A full description of the database fields is given in Table 2.

    Table 2 Content of the database entries

    The ability to carry out accurate predictions of pKa values depends on having access to a high quality source of data; a principal aim of PPD is to provide such a source. Only experimentally determined pKa values are cited in PPD; predicted pKa values are not included. The quality of data contained in PPD v1.0 is largely dependent upon the accuracy of each experimental determination; it therefore contains only values obtained using selected techniques: NMR spectroscopy, Raman difference spectroscopy and UV spectroscopy.

    Protein pKa values are dependent on both intrinsic and extrinsic factors. Intrinsic factors include invariant properties of the protein investigated, such as sequence and structure. Extrinsic factors include the experimental conditions used, such as the temperature, the range of pH tested and the protein concentration, as well as the experimental method. We therefore attempt to record all relevant experimental conditions when available. As logistical considerations preclude us from undertaking independent verification of the data, we are obliged to trust the values reported in the literature. It should be noted that the phenomenon of cooperative deprotonation can create circumstances under which a pKa value cannot be used as a parameter that describes the ionization behaviour of the corresponding group (22–24).

    SEARCHING THE DATABASE

    Two methods of searching PPD are available: an amino acid query-based interface (Figure 1) and a BLAST (25) interface. The bespoke search system allows the user to perform extensive or focussed searches from a single user interface. The simplest search, using the amino acid query interface, specifies one amino acid residue only. A complex search can accommodate up to four amino acids and pKa ranges, along with experimental method, protein name and species. The search engine also offers a choice of how results are presented. The default option returns amino acids and their associated properties (Figure 1B), while the second option returns proteins that contain the specified amino acids (Figure 1C).
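
    A query of this kind, with up to four residues, optional pKa ranges and optional filters, can be sketched as a parameterised SQL builder. This is not the PPD search engine itself: the table name pka_entry, its columns and the function build_query are hypothetical, sqlite3 stands in for PostgreSQL, and the rows are placeholders rather than PPD data.

```python
import sqlite3

MAX_RESIDUES = 4  # the interface described above accepts up to four amino acids


def build_query(criteria, method=None, protein=None, species=None):
    """Build a parameterised SELECT from (residue, pka_min, pka_max) criteria.

    A bound of None leaves that side of the pKa range open.
    """
    if not 1 <= len(criteria) <= MAX_RESIDUES:
        raise ValueError("specify between one and four amino acids")
    residue_clauses, params = [], []
    for residue, pka_min, pka_max in criteria:
        parts = ["residue = ?"]
        params.append(residue)
        if pka_min is not None:
            parts.append("pka >= ?")
            params.append(pka_min)
        if pka_max is not None:
            parts.append("pka <= ?")
            params.append(pka_max)
        residue_clauses.append("(" + " AND ".join(parts) + ")")
    # Any of the residue criteria may match; every extra filter must match.
    where = ["(" + " OR ".join(residue_clauses) + ")"]
    for column, value in (("method", method), ("protein", protein), ("species", species)):
        if value is not None:
            where.append(column + " = ?")
            params.append(value)
    sql = ("SELECT protein, residue, pka, method FROM pka_entry WHERE "
           + " AND ".join(where) + " ORDER BY protein, residue, pka")
    return sql, params


# Placeholder table and rows for illustration only; these are not PPD records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pka_entry "
             "(protein TEXT, species TEXT, residue TEXT, pka REAL, method TEXT)")
conn.executemany("INSERT INTO pka_entry VALUES (?, ?, ?, ?, ?)", [
    ("Example protein A", "Example species", "GLU", 6.1, "NMR"),
    ("Example protein A", "Example species", "ASP", 3.7, "NMR"),
    ("Example protein B", "Example species", "HIS", 6.8, "UV"),
])

sql, params = build_query([("GLU", 5.0, 7.0), ("HIS", None, None)], method="NMR")
rows = conn.execute(sql, params).fetchall()
```

    User-supplied values are passed only as bound parameters, never interpolated into the SQL text, which matters for any form-driven web interface.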

    Figure 1 Overview of the amino acid query search. The amino acid nominations are entered in (A). (B) shows the default result presentation, from which the pKa data (D) for the specified residues can be accessed. (C) shows the alternative presentation, with the display of proteins containing the nominated amino acid(s).

    The alternative search interface is based on BLAST (25). A local database of the protein sequences found in PPD was compiled from SwissProt (26), and an additional PostgreSQL table was created to hold these data. The local database is searched using the NCBI BLASTP and BLASTX programs (25), allowing input of either protein or nucleotide sequences. The HTML front-end connects to a web server-based PL/CGI script, which interacts with the BLASTP or BLASTX programs. The output contains links to PPD entries, which are created using SwissProt (26) accession codes.
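
    The final step, turning BLAST hits into links to database entries via accession codes, can be sketched as below. The sketch assumes the standard 12-column tab-separated BLAST output (BLAST+ -outfmt 6, or -m 8 in the legacy blastall program) and subject identifiers of the form sp|ACCESSION|NAME; whether the PPD script uses this output format is not stated in the text. The accession codes, entry identifiers and function names are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    query: str
    accession: str
    percent_identity: float
    evalue: float
    bit_score: float


def accession_from_subject(subject_id: str) -> str:
    """Extract the accession from 'sp|ACC|NAME'; return other identifiers unchanged."""
    parts = subject_id.split("|")
    return parts[1] if len(parts) >= 3 else subject_id


def parse_tabular(text: str, max_evalue: float = 1e-5) -> List[Hit]:
    """Parse 12-column tabular BLAST output, keeping hits at or below max_evalue."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            raise ValueError(f"expected 12 tab-separated columns, got {len(fields)}")
        evalue = float(fields[10])
        if evalue <= max_evalue:
            hits.append(Hit(fields[0], accession_from_subject(fields[1]),
                            float(fields[2]), evalue, float(fields[11])))
    return hits


def link_hits(hits: List[Hit], entries_by_accession: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Map each hit accession to the (hypothetical) database entry identifiers it links to."""
    return {hit.accession: entries_by_accession.get(hit.accession, []) for hit in hits}


# Invented output lines and accessions, for illustration only.
EXAMPLE_OUTPUT = "\n".join([
    "\t".join(["query1", "sp|X00001|EXMPL_A", "98.5", "130", "2", "0",
               "1", "130", "1", "130", "1e-70", "250"]),
    "\t".join(["query1", "sp|X00002|EXMPL_B", "31.0", "100", "60", "3",
               "5", "104", "8", "107", "0.5", "30.1"]),
])

hits = parse_tabular(EXAMPLE_OUTPUT)
links = link_hits(hits, {"X00001": [101, 102]})
```

    Filtering on the E-value before linking keeps weak, probably spurious alignments from being presented as database matches; the threshold used here is an arbitrary example.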

    FUTURE WORK

    There is an obvious need to extend the number of entries through continuous addition of data from new, and newly identified, publications. The database also needs to be maintained, ensuring that links to external databases remain current. As with all databases, some errors will arise from human error during data acquisition, and others will already be present in the original experimental data. The database will be assessed for errors and inconsistencies, thus maintaining, as far as possible, the overall veracity of our data. As mentioned, we have tried to maintain a high degree of accuracy through rigorous data selection; however, user feedback will prompt further improvements. Moreover, feedback on the search interfaces and the general infrastructure will allow us to develop both the database and its interface in an efficient and ergonomic manner.

    DISCUSSION AND CONCLUSIONS

    The PPD is a unique compilation of protein pKa values sourced from experimental data only: no other database of its kind currently exists. Compared with other post-genomic databases, the size of PPD is limited, but this reflects its highly focussed nature: the burgeoning of such focussed databases is a continuing trend in modern bioinformatics (27,28). The relatively modest size of the database will increase as new data are published.

    Access to PPD data is given through an interface available via the world wide web and includes both a BLAST search and an amino acid query search system. The BLAST search, which is linked to pKa entries and external databases, allows PPD to be a cohesive and integrated source of protein information. PPD facilitates data-driven in silico prediction methods addressing the relationship between ionizable groups and protein function, be that protein–protein interaction, protein folding or enzyme catalysis.

    A brief summary of pKa data for each amino acid is shown in Table 3, which includes both the mean and SD of the corresponding measured pKa values. Figure 2 shows the distribution of PPD pKa values for the six most frequent residues: glutamic acid, lysine, tyrosine, aspartic acid, histidine and cysteine. Certain residues (aspartate, glutamate, lysine and histidine) have pKa values that show relatively narrow distributions, while other residues (cysteine and tyrosine) show a wider dispersion of values; however, this may only be a reflection of the amount of data available for these residues. While it is clear that mean values closely approximate model values, the corresponding SDs are high, reflecting the wide distribution of ionization states in actual proteins. Aspartate, for example, has a mean pKa of 3.6 versus a model value of 4.0, yet the SD is 1.4. As the amount of data for each residue increases, trends in residue-specific pKa data will become more evident and more certain.
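
    Per-residue summaries of this kind (count, mean, SD and a binned distribution) can be computed with the Python standard library, as in the sketch below. The function names are our own, the sample SD is assumed because the text does not state which SD definition was used, and the input values are placeholders rather than PPD data.

```python
import math
from collections import Counter
from statistics import mean, stdev


def summarise(values):
    """Return (n, mean, sample SD) for a list of pKa values (needs at least two values)."""
    return len(values), mean(values), stdev(values)


def histogram(values, bin_width=1.0):
    """Count values per bin; each key is the lower edge of its bin."""
    return Counter(math.floor(v / bin_width) * bin_width for v in values)


# Placeholder values for illustration only; these are not measurements from PPD.
example_pkas = [3.0, 4.0, 5.0]
n, avg, sd = summarise(example_pkas)
bins = histogram([3.2, 3.9, 4.5])
```

    Comparing the mean with the model value and the SD with the bin spread gives a quick indication of how strongly the protein environment perturbs a given residue type.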

    Table 3 pKa data associated with each amino acid

    Figure 2 Distribution pattern of pKa values. Each column represents a count of pKa values for the specified amino acid and pKa.

    In recent years, there has been an impetus to accumulate data on all scales from the atomic to the genomic; this has led to a rapid increase in the number of databases. Databases are increasingly forming the backbone of science in general and post-genomic biology in particular. PPD v1.0 was developed to provide an easily accessible compilation of protein pKa values. Despite the small size of PPD, the data it contains have utility across many different disciplines and, we hope, the database will grow over time into a comprehensive protein pKa resource.

    ACKNOWLEDGEMENTS

    We should like to thank Andrew Worth for his technical assistance and Martin Blythe for programming advice. The Edward Jenner Institute for Vaccine Research wishes to thank its sponsors: GlaxoSmithKline, the Medical Research Council, the Biotechnology and Biological Sciences Research Council, and the UK Department of Health. Funding to pay the Open Access publication charges for this article was provided by the sponsors of the EJIVR.

    REFERENCES

    1. Honig, B. and Nicholls, A. (1995) Classical electrostatics in biology and chemistry. Science, 268, 1144–1149.

    2. Warshel, A. (1978) Energetics of enzyme catalysis. Proc. Natl Acad. Sci. USA, 75, 5250–5254.

    3. Antosiewicz, J., McCammon, J.A., Gilson, M.K. (1994) Prediction of pH-dependent properties of proteins. J. Mol. Biol., 238, 415–436.

    4. Lambeir, A.M., Backmann, J., Ruiz-Sanz, J., Filimonov, V., Nielsen, J.E., Kursula, I., Norledge, B.V., Wierenga, R.K. (2000) The ionization of a buried glutamic acid is thermodynamically linked to the stability of Leishmania mexicana triose phosphate isomerase. Eur. J. Biochem., 267, 2516–2524.

    5. Horng, J.C., Cho, J.H., Raleigh, D.P. (2005) Analysis of the pH-dependent folding and stability of histidine point mutants allows characterization of the denatured state and transition state for protein folding. J. Mol. Biol., 345, 163–173.

    6. Jamin, M., Geierstanger, B., Baldwin, R.L. (2001) The pKa of His-24 in the folding transition state of apomyoglobin. Proc. Natl Acad. Sci. USA, 98, 6127–6131.

    7. Norel, R., Sheinerman, F., Petrey, D., Honig, B. (2001) Electrostatic contributions to protein–protein interactions: fast energetic filters for docking and their physical basis. Protein Sci., 10, 2147–2161.

    8. Nielsen, J.E. and McCammon, J.A. (2003) Calculating pKa values in enzyme active sites. Protein Sci., 12, 1894–1901.

    9. Gerratana, B., Cleland, W.W., Frey, P.A. (2001) Mechanistic roles of Thr134, Tyr160, and Lys 164 in the reaction catalyzed by dTDP-glucose 4,6-dehydratase. Biochemistry, 40, 9187–9195.

    10. Forsyth, W.R., Antosiewicz, J.M., Robertson, A.D. (2002) Empirical relationships between protein structure and carboxyl pKa values in proteins. Proteins, 48, 388–403.

    11. Forsyth, W.R. and Robertson, A.D. (2000) Insensitivity of perturbed carboxyl pK(a) values in the ovomucoid third domain to charge replacement at a neighboring residue. Biochemistry, 39, 8067–8072.

    12. Warshel, A. (1981) Calculations of enzymatic reactions: calculations of pKa, proton transfer reactions, and general acid catalysis reactions in enzymes. Biochemistry, 20, 3167–3177.

    13. Antosiewicz, J., McCammon, J.A., Gilson, M.K. (1996) The determinants of pKas in proteins. Biochemistry, 35, 7819–7833.

    14. Gogliettino, M.A., Tanfani, F., Scire, A., Ursby, T., Adinolfi, B.S., Cacciamani, T., De Vendittis, E. (2004) The role of Tyr41 and His155 in the functional properties of superoxide dismutase from the archaeon Sulfolobus solfataricus. Biochemistry, 43, 2199–2208.

    15. Georgescu, R.E., Alexov, E.G., Gunner, M.R. (2002) Combining conformational flexibility and continuum electrostatics for calculating pKas in proteins. Biophys. J., 83, 1731–1748.

    16. Warwicker, J. (2004) Improved pKa calculations through flexibility based sampling of a water-dominated interaction scheme. Protein Sci., 13, 2793–2805.

    17. Bashford, D. and Gerwert, K. (1992) Electrostatic calculations of the pKa values of ionizable groups in bacteriorhodopsin. J. Mol. Biol., 224, 473–486.

    18. Edgcomb, S.P. and Murphy, K.P. (2002) Variability in the pKa of Histidine side-chains correlates with burial within proteins. Proteins, 49, 1–6.

    19. Nielsen, J.E. and McCammon, J.A. (2003) On the evaluation and optimization of protein x-ray structures for pKa calculations. Protein Sci., 12, 313–326.

    20. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.

    21. IUBMB (1992) Enzyme Nomenclature 1992. Academic Press, San Diego.

    22. Spitzner, N., Lohr, F., Pfeiffer, S., Koumanov, A., Karshikoff, A., Ruterjans, H. (2001) Ionization properties of titratable groups in ribonuclease T1. I. pKa values in the native state determined by two-dimensional heteronuclear NMR spectroscopy. Eur. Biophys. J., 30, 186–197.

    23. Koumanov, A., Spitzner, N., Rüterjans, H., Karshikoff, A. (2001) Ionization properties of titratable groups in ribonuclease T1. II. Electrostatic analysis. Eur. Biophys. J., 30, 198–206.

    24. Koumanov, A., Rüterjans, H., Karshikoff, A. (2002) Continuum electrostatic analysis of irregular ionization and proton allocation in proteins. Proteins, 46, 85–96.

    25. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

    26. Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O'Donovan, C., Phan, I., et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.

    27. Chalk, A.M., Warfinge, R.E., Georgii-Hemming, P., Sonnhammer, E.L.L. (2005) siRNAdb: a database of siRNA sequences. Nucleic Acids Res., 33, D131–D134.

    28. Mika, S. and Rost, B. (2005) NMPdb: database of nuclear matrix proteins. Nucleic Acids Res., 33, D160–D163.

    (Christopher P. Toseland, Helen McSparron)