Codon Volatility As an Indicator of Positive Selection: Data from Eukaryotic Genome Comparisons
《分子生物学进展》, 2005, Issue 3
     Department of Biological Sciences, University of South Carolina, Columbia

    Correspondence: E-mail: austin@biol.sc.edu.

    Abstract

    It has been suggested that codon volatility (the proportion of the point-mutation neighbors of a codon that encode different amino acids) can be used as an index of past positive selection. We compared codon volatility with patterns of synonymous and nonsynonymous nucleotide substitution in genome-wide comparisons of orthologous genes between three pairs of related genomes: (1) the protists Plasmodium falciparum and P. yoelii, (2) the fungi Saccharomyces cerevisiae and S. paradoxus, and (3) the mammals mouse and rat. Codon volatility was not consistently associated with an elevated rate of nonsynonymous substitution, as would be expected under positive selection. Rather, the most consistent and powerful correlate of elevated codon volatility was nucleotide content at the second codon position, as expected, given the nature of the genetic code.

    Key Words: codon volatility • nucleotide content • positive selection • synonymous site • nonsynonymous site

    Introduction

    Plotkin, Dushoff, and Fraser (2004) introduced the concept of codon volatility and proposed that it provides a method of testing for evidence of past positive selection that can be applied to a single genome. This concept is based on the observation that codons differ with respect to the likelihood that a point mutation will cause an amino acid change (nonsynonymous mutation). For a given codon, its "point-mutation neighbors" constitute the set of codons reachable from that codon by a single point mutation. The volatility of the codon is defined as the proportion of its point-mutation neighbors that encode different amino acids (Plotkin, Dushoff, and Fraser 2004). Plotkin, Dushoff, and Fraser (2004) argue that a protein-coding region that has undergone an unusually high number of amino acid substitutions will have a high frequency of volatile codons in comparison with the rest of the genome. The statistical significance of observed codon volatilities for each gene is then calculated by comparing it to a bootstrap distribution of alternate synonymous sequences, generated from the codon usage observed in the genome (Plotkin, Dushoff, and Fraser 2004).
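
    The definition above can be made concrete with a minimal Python sketch. It assumes the standard genetic code and assumes that neighbors which are stop codons are left out of the proportion; the function names are illustrative and are not taken from the software of Plotkin, Dushoff, and Fraser (2004). The gene-level value shown is a plain mean over codons, and the bootstrap significance test is not reproduced here.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def neighbors(codon):
    """Return the nine codons reachable from `codon` by one point mutation."""
    return [codon[:i] + b + codon[i + 1:]
            for i in range(3) for b in BASES if b != codon[i]]


def volatility(codon):
    """Proportion of non-stop neighbors that encode a different amino acid.

    Assumption of this sketch: stop-codon neighbors are excluded.
    """
    sense = [n for n in neighbors(codon) if CODE[n] != "*"]
    changed = sum(CODE[n] != CODE[codon] for n in sense)
    return changed / len(sense)


def gene_volatility(cds):
    """Mean codon volatility over the sense codons of an in-frame sequence."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    values = [volatility(c) for c in codons if CODE[c] != "*"]
    return sum(values) / len(values)


if __name__ == "__main__":
    for codon in ("ATG", "TGG", "AAA", "GGG"):
        print(codon, CODE[codon], round(volatility(codon), 3))
```

    Under these assumptions ATG and TGG have volatility 1, AAA has 7/8, and GGG has 2/3, because all three third-position changes of GGG are synonymous.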

    Widely used methods of testing for evidence of positive selection involve comparison of the number of synonymous nucleotide substitutions per synonymous site (dS) and the number of nonsynonymous substitutions per nonsynonymous site (dN) (Hughes and Nei 1988; Goldman and Yang 1994; Hughes 1999). The neutral theory predicts that dS will exceed dN in most genes because most nonsynonymous mutations are harmful to protein structure and are, therefore, eliminated (Kimura 1977). The opposite pattern is evidence that natural selection has acted to favor changes at the amino acid level (Hughes and Nei 1988). Estimation of dS and dN involves comparison of at least two related sequences, and this method may be unable to detect positive selection if the two sequences are distantly related (Hughes 1999). Because a closely related sequence may be unavailable for comparison—particularly in the case of model organisms for which complete genome sequences are available—a method to detect a signature of positive selection in a single genome has considerable appeal.

    Furthermore, in most well-studied cases of positive selection, positive selection affects only a small proportion of codons in the gene (Hughes 1999). In some cases, biological knowledge makes it possible to predict which codons are likely to be subject to positive selection (Hughes and Nei 1988), but often, such information is lacking. In these cases, additional statistical methods to identify a signal of past selection on individual codons may be desirable.

    Codon volatility bears an obvious relationship to the estimation of dS and dN that was not noted by Plotkin, Dushoff, and Fraser (2004). Simple methods of estimating dS and dN, such as that of Nei and Gojobori (1986), involve counting the numbers of synonymous and nonsynonymous sites in coding sequences, and this process, like the computation of codon volatility, involves taking into account the proportion of point mutations that will give rise to an amino acid change. In Nei and Gojobori's (1986) method, for each of two sequences compared, one counts S (the number of synonymous sites) and N (the number of nonsynonymous sites). Next, one computes pS (the proportion of synonymous differences per synonymous site) and pN (the proportion of nonsynonymous differences per nonsynonymous site), and these quantities are corrected for multiple hits (Nei and Gojobori 1986). It is obvious that codon volatility bears a close relationship to N, the number of nonsynonymous sites; in fact, codon volatility should be essentially equivalent to N/(N+S). However, if positive selection is expected to increase dN, it is not intuitive at all that an increase in codon volatility—representing in essence the denominator of dN—should be a hallmark of positive selection.
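
    The relationship between codon volatility and N/(N+S) can be illustrated with a simplified site count in the spirit of Nei and Gojobori (1986). In this sketch, each codon position contributes the fraction of its possible changes that are synonymous to S, and N is 3 minus that sum. Changes to stop codons are simply skipped, which is an assumption of this sketch; implementations differ on that point. The function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def site_counts(codon):
    """Return (s, n): synonymous and nonsynonymous site counts for one codon."""
    s = 0.0
    for i in range(3):
        alts = [codon[:i] + b + codon[i + 1:] for b in BASES if b != codon[i]]
        sense = [a for a in alts if CODE[a] != "*"]  # assumption: skip stops
        if sense:
            s += sum(CODE[a] == CODE[codon] for a in sense) / len(sense)
    return s, 3.0 - s


def nonsyn_fraction(cds):
    """N / (N + S) summed over the sense codons of an in-frame sequence."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    codons = [c for c in codons if CODE[c] != "*"]
    total_n = sum(site_counts(c)[1] for c in codons)
    return total_n / (3.0 * len(codons))


if __name__ == "__main__":
    print(site_counts("GGG"))      # third position fully synonymous
    print(nonsyn_fraction("TTT"))  # 8 of 9 single-base changes alter Phe
```

    For codons with no stop-codon neighbors, such as GGG (2/3) and TTT (8/9), this fraction coincides with the proportion of point-mutation neighbors that change the amino acid, which is why the two quantities are expected to track each other closely.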

    Plotkin, Dushoff, and Fraser (2004) reported weak but significant negative correlations between the P value of the test for significance of codon volatility and dN in comparisons between different genomes of Mycobacterium tuberculosis and between M. tuberculosis and other Mycobacterium species. In addition, they noted that surface antigens such as the EMP1 family of Plasmodium falciparum and PE/PPE of M. tuberculosis showed significantly high volatility (Plotkin, Dushoff, and Fraser 2004). However, because this method was applied to only a few cases and because of the problematic relationship between codon volatility and N/(N+S), it is unclear whether codon volatility is a good indicator of positive selection in general. Therefore, we decided to test this method further by comparing codon volatility with estimates of dS and dN between orthologous loci from three pairs of closely related eukaryotic genomes: (1) the protists Plasmodium falciparum and P. yoelii, (2) the fungi Saccharomyces cerevisiae and S. paradoxus, and (3) the mammals mouse (Mus musculus) and rat (Rattus norvegicus).

    Methods

    We obtained the coding sequences for the following organisms from the databases indicated: Plasmodium falciparum (ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/p_falciparum/); Plasmodium yoelii (ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/p_yoelii/); Saccharomyces cerevisiae (www.broad.mit.edu/ftp/pub/annotation/fungi/comp_yeasts/S1c.Alignments/S1c_ORF_alignments.tar.gz); Saccharomyces paradoxus (www.broad.mit.edu/ftp/pub/annotation/fungi/comp_yeasts/S1c.Alignments/S1c_ORF_alignments.tar.gz); mouse Mus musculus (www.ensembl.org; v16.30); and rat Rattus norvegicus (www.ensembl.org; v17.2). Protein families shared by the two Plasmodium genomes and by the mouse and rat genomes were identified by homology and a single-linkage method implemented in the BlastClust software (Altschul et al. 1997). Sequence homology was established by identifying matches using a conservative E-value of 10⁻⁶ with a minimum of 40% sequence identity across at least 60% of the length of two sequences. Analyses of other data sets have shown that this is a conservative set of criteria for establishing gene family membership (Hughes and Friedman 2003). Orthologous gene pairs were then identified as families having exactly one member in each of the two genomes compared. In the case of the two yeast (Saccharomyces) species, we used orthologs from the genomic alignments provided by the database.
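
    The ortholog-selection rule described above (families with exactly one member in each genome) might be sketched as follows. The input format, a list of families where each member is a (genome, gene identifier) pair, and all identifiers in the example are hypothetical; the clustering step itself is assumed to have been done already.

```python
from collections import Counter


def one_to_one_orthologs(families, genome_a, genome_b):
    """Return (gene_a, gene_b) pairs from families with one member per genome.

    `families` is a list of families; each family is a list of
    (genome, gene_id) tuples. This layout is an assumption of the sketch.
    """
    pairs = []
    for family in families:
        counts = Counter(genome for genome, _ in family)
        if counts[genome_a] == 1 and counts[genome_b] == 1:
            gene_a = next(g for genome, g in family if genome == genome_a)
            gene_b = next(g for genome, g in family if genome == genome_b)
            pairs.append((gene_a, gene_b))
    return pairs


if __name__ == "__main__":
    # Hypothetical gene identifiers, for illustration only.
    families = [
        [("mouse", "m1"), ("rat", "r1")],                  # kept
        [("mouse", "m2"), ("mouse", "m3"), ("rat", "r2")],  # paralogs: dropped
        [("mouse", "m4")],                                 # no rat member
    ]
    print(one_to_one_orthologs(families, "mouse", "rat"))
```

    Families containing paralogs in either genome are discarded rather than resolved, which keeps the comparison restricted to unambiguous gene pairs.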

    Homologous sequences were aligned at the amino acid level using the ClustalW program (Thompson, Higgins, and Gibson 1994), and this alignment was imposed on the DNA sequences. The number of synonymous nucleotide substitutions per synonymous site (dS) and the number of nonsynonymous nucleotide substitutions per nonsynonymous site (dN) were estimated by a maximum-likelihood method (Yang and Nielsen 2000) using the PAML software package (Yang 1997). The number of synonymous sites (S) and the number of nonsynonymous sites (N) were counted by the Nei-Gojobori (1986) method. Codon volatility and the test for significance of codon volatility were computed according to the method in Plotkin, Dushoff, and Fraser (2004), using the software provided by those authors.

    Because the distribution of most of the variables analyzed deviated significantly from the normal distribution (Kolmogorov-Smirnov test), we used nonparametric methods for all statistical testing. G + C content at the three codon positions and codon volatility were highly correlated between the two species compared for all three sets of comparisons (Spearman's rank correlation coefficient rS > 0.90; P < 0.001 in every case). Therefore, in comparisons among genes, we used the means of these quantities for the two genomes compared.
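
    As an illustration of the composition variables used here, the following sketch computes G + C content at the first, second, and third codon positions (GC1, GC2, GC3) of an in-frame coding sequence and averages the values for an ortholog pair. The function names and the example sequences are hypothetical.

```python
def gc_by_position(cds):
    """Return (GC1, GC2, GC3) for an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    result = []
    for pos in range(3):
        bases = cds[pos:usable:3]     # every base at this codon position
        result.append(sum(b in "GC" for b in bases) / len(bases))
    return tuple(result)


def pair_mean(values_a, values_b):
    """Average each quantity across the two orthologs of a pair."""
    return tuple((a + b) / 2 for a, b in zip(values_a, values_b))


if __name__ == "__main__":
    gc_a = gc_by_position("ATGGCC")  # codons ATG, GCC
    gc_b = gc_by_position("ATGAAT")  # codons ATG, AAT
    print(gc_a, gc_b, pair_mean(gc_a, gc_b))
```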

    Results

    Plasmodium falciparum and P. yoelii were considerably more distant than the yeast or mammal comparisons, as reflected by much higher mean and median values of both dS and dN in Plasmodium (table 1). Saccharomyces cerevisiae and S. paradoxus were slightly more divergent at synonymous sites than were mouse and rat but slightly less divergent at nonsynonymous sites (table 1). Both mean and median of codon volatility were highest in Plasmodium and lowest in the mouse-rat comparison (table 1). Although these differences appeared small, there was a highly significant difference in median codon volatility among the three comparisons (Kruskal-Wallis test; P < 0.001). The three pairs of species compared differed also with respect to nucleotide content (table 2). G + C content at third codon positions (GC3) showed more substantial differences than that at first (GC1) or second (GC2) positions (table 2), as is expected because most third-position mutations are synonymous. Plasmodium showed a strong A + T bias, yeast was moderately A + T biased, and mouse and rat were G + C biased (table 2).

    Table 1 Mean (± S.E.), Median, and Range of Number of Synonymous Substitutions per Synonymous Site (dS), Nonsynonymous Substitutions per Nonsynonymous Site (dN), and Codon Volatility

    Table 2 Mean (± S.E.), Median, and Range of G + C Content at Each Codon Position in Compared Genes

    For each of the three pairs of species, table 3 compares genes in which the test of codon volatility showed a significant value (at the 5% level) in either of the two species compared with genes that showed significant codon volatility in neither of the species compared. In each comparison, median values of N/(N+S) were very similar to median values of codon volatility (table 3). In fact, codon volatility and N/(N+S) were highly correlated in all three data sets. Spearman's rank correlation coefficient between codon volatility and N/(N+S) was 0.954 (P < 0.001) in Plasmodium, 0.955 (P < 0.001) in yeast, 0.975 (P < 0.001) in mouse and rat, and 0.981 (P < 0.001) for all three species pairs combined.

    Table 3 Medians of Variables for Genes with Significantly Elevated Codon Volatility Compared with Those for Other Genes

    In Plasmodium, genes with significant codon volatility had significantly higher median values of both dS and dN than did other genes (table 3). However, the ratio dN/dS, which is usually taken as the best indicator of natural selection, did not differ between genes with significant codon volatility and other genes (table 3). In yeast, a contrary pattern was seen: median values of dS, dN, and dN/dS were significantly lower in genes with significant codon volatility than in other genes (table 3). In the mouse-rat comparison, by contrast, median dS did not differ between genes with significant codon volatility and other genes, but medians of both dN and dN/dS were significantly higher in genes with significant codon volatility than in other genes (table 3).

    The variables examined in table 3 were intercorrelated in complex ways (data not shown). We, therefore, used partial correlation analysis (applied to rank correlation coefficients) to reveal the effect of each of a set of independent variables on a dependent variable, while simultaneously controlling for the effect of the remaining independent variables. We examined two dependent variables in these analyses: (1) codon volatility and (2) the minimum significance level of the test (Plotkin, Dushoff, and Fraser 2004) of codon volatility. We computed fifth-order partial rank correlation coefficients between each of a set of six independent variables and codon volatility, simultaneously controlling for the other five independent variables (table 4). Likewise, we computed fifth-order partial rank correlation coefficients between the same variables and the minimum significance level; that is, the lower of the two P values obtained when the test of significance of codon volatility was applied to each of the two genomes (table 4).
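
    A partial rank correlation of this kind can be sketched in pure Python by computing Spearman coefficients and then applying the standard recursive formula for partial correlation, removing one control variable at a time. This is an illustrative sketch, not the software used for the analysis; the function names are hypothetical, and the recursion is only practical for a small number of control variables. The formula is undefined when a control variable is perfectly correlated with either variable of interest.

```python
from math import sqrt


def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def partial_spearman(x, y, controls):
    """Partial rank correlation of x and y, controlling for `controls`.

    With k control variables this is a k-th order partial coefficient.
    """
    if not controls:
        return spearman(x, y)
    z, rest = controls[-1], controls[:-1]
    r_xy = partial_spearman(x, y, rest)
    r_xz = partial_spearman(x, z, rest)
    r_yz = partial_spearman(y, z, rest)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


if __name__ == "__main__":
    x = [1, 2, 3, 5, 4]
    y = [2, 1, 3, 4, 5]
    z = [1, 2, 3, 4, 5]
    print(spearman(x, y), partial_spearman(x, y, [z]))
```

    In the toy example, x and y have a rank correlation of 0.8, but each has a correlation of 0.9 with z; once z is controlled for, the partial coefficient falls to about -0.05. This is the kind of effect partial correlation is meant to expose.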

    Table 4 Fifth-Order Partial Rank Correlation Coefficients Between a Set of Independent Variables Relating to Nucleotide Substitution and Composition and Dependent Variables Relating to Codon Volatility

    In all three pairs of species, the strongest relationship with codon volatility was a very strong negative partial correlation between GC2 and codon volatility (table 4). In each pair of species, this partial correlation was significantly different from the partial correlations of each other variable with codon volatility (two-tailed test; P < 0.0001 in every case). Moreover, when the data from the three species pairs were combined, there was a strong negative correlation (rS = –0.851; P < 0.001) between codon volatility and GC2 (fig. 1). GC1 was also significantly negatively correlated with codon volatility in all species pairs (table 4). GC3 was significantly positively correlated with codon volatility in Plasmodium, uncorrelated in yeast, and significantly negatively correlated in the mouse-rat comparison (table 4).

    FIG. 1.— Codon volatility as a function of G + C content at the second position in comparisons (N = 7,905) between orthologous loci of P. falciparum and P. yoelii, between S. cerevisiae and S. paradoxus, and between mouse and rat (rS = –0.851; P < 0.001).

    In Plasmodium, both dS and dN showed modest but significant positive partial correlations with codon volatility, but dN/dS was not correlated with codon volatility (table 4). In yeast, dS, dN, and dN/dS were not significantly correlated with codon volatility, whereas in the mouse-rat comparison, only dS showed a significant positive partial correlation with codon volatility (table 4). In all three species comparisons, dS, dN, and dN/dS were not significantly correlated with the minimum significance level, whereas GC2 was positively correlated with the minimum significance level in every case (table 4). In the yeast and mammal comparisons, GC1 and GC3 were also positively correlated with the minimum significance level; by contrast, in Plasmodium, GC3 was negatively correlated with the minimum significance level (table 4).

    Thus, the results indicated that nucleotide content was in general a more powerful predictor of codon volatility and of the significance level of the test of codon volatility than were dN or dN/dS. One surprising finding, however, was that in Plasmodium, unlike the other comparisons, the correlations with GC2 and GC3 were opposite in direction (table 4). One possible factor in this difference was that Plasmodium showed a unique pattern of correlation between G + C content values at the three codon positions. In the other species compared, GC1, GC2, and GC3 were all significantly positively correlated with each other; and the same was true when data from all three species comparisons were pooled (table 5). However, in Plasmodium, GC3 was not significantly correlated with GC1 or GC2 (table 5).

    Table 5 Rank Correlation Coefficients Among G + C Content Values at the First (GC1), Second (GC2), and Third (GC3) Codon Positions

    Discussion

    Plotkin, Dushoff, and Fraser (2004) proposed that codon volatility can be used as an indicator of past positive selection, but our analysis of data from comparison of orthologous genes from three pairs of related species cast doubt on the existence of any simple relationship between codon volatility and natural selection. There was no evidence that genes with atypically high codon volatility are likely to have undergone a greater rate of nonsynonymous substitution, as maintained by Plotkin, Dushoff, and Fraser (2004). In general, the relationships between codon volatility or significance level of the test of codon volatility and other measures of positive selection, such as dN or dN/dS, were inconsistent from one data set to another. It might be argued that codon volatility may capture a signal of positive selection that is not captured by estimations of dN or dN/dS, but the lack of consistency across data sets is problematic for the hypothesis that codon volatility provides a repeatable signal of past positive selection.

    Rather, codon volatility seemed to reflect largely nucleotide composition, particularly at the second codon position. Why this should be so is easily seen from a consideration of the genetic code. In the universal code, the two most volatile codons are ATG (Met) and TGG (Trp). The next most volatile codons are the group of two-fold degenerate codons, at which two of the possible third-position mutations are nonsynonymous. An examination of the 18 two-fold degenerate codons shows that 14 have A in the second position, and two have T in the second position. All sense codons with A at the second position are two-fold degenerate. Thus, rather than being a measure of positive selection, the codon volatility of a gene is a measure largely of the percentage of A at second positions in the gene.
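
    The codon counts in this argument can be checked directly against the standard genetic code. The sketch below assumes that "two-fold degenerate codons" means the codons of amino acids encoded by exactly two codons; under that reading it reproduces the figures given in the text. The variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Number of codons encoding each amino acid (stops excluded).
codons_per_aa = Counter(aa for aa in CODE.values() if aa != "*")

# Assumption: two-fold degenerate = amino acid encoded by exactly two codons.
twofold = [c for c, aa in CODE.items()
           if aa != "*" and codons_per_aa[aa] == 2]

# Tally the two-fold degenerate codons by their second-position base.
second_base = Counter(c[1] for c in twofold)

# All sense codons with A at the second position.
a_second_sense = [c for c, aa in CODE.items() if c[1] == "A" and aa != "*"]

if __name__ == "__main__":
    print(len(twofold), dict(second_base))
    print(all(c in twofold for c in a_second_sense))
```

    Of the 18 codons found, 14 have A, two have T, and two have G (the Cys codons) at the second position, and every sense codon with A at the second position is in the set.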

    Most two-fold degenerate codons with A in the second position encode highly hydrophilic amino acids, including the charged residues His, Lys, Asp, and Glu. Such amino acids are often found on the surface of globular proteins, where they are often subject to a relatively low level of functional constraint (Kimura and Ohta 1973). Thus, in globular proteins, high codon volatility may be associated with elevated dN, not as a result of positive selection but simply as a consequence of the relaxation of purifying selection. On the other hand, charged residues are often involved in recognition molecules, a category that includes many of the best-documented cases of positive selection at the molecular level (Hughes 1999).

    In the case of Plasmodium, a number of surface antigens are characterized by amino acid repeat arrays with a high frequency of hydrophilic residues (Verra and Hughes 1999; Hughes 2004). This may be an adaptation to attract an ineffective T cell–independent immune response on the part of the vertebrate host (Kemp, Coppel, and Anders 1987). In some cases, nonrepeat regions of the same proteins have been shown to be subject to positive selection, probably driven by host T cell recognition (Hughes 1991, 1992; Hughes and Hughes 1995). Similarly, high frequencies of hydrophilic residues are found in repeat arrays of the PE and PPE families in Mycobacterium tuberculosis, particularly in regions shown to be recognized by host immunoglobulins (Okkels et al. 2003). The fact that Plotkin, Dushoff, and Fraser (2004) reported high codon volatility at antigen loci of Plasmodium and Mycobacterium thus apparently reflects hydrophilic amino acid composition, which in these cases is fortuitously associated with positive selection.

    Acknowledgements

    This research was supported by grant GM43940 from the National Institutes of Health to A.L.H.

    References

    Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.

    Goldman, N., and Z. Yang. 1994. Codon-based models of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725–736.

    Hughes, A. L. 1991. Circumsporozoite protein genes of malaria parasites (Plasmodium spp.): evidence for positive selection on immunogenic regions. Genetics 127:345–353.

    ———. 1992. Positive selection and interallelic recombination at the merozoite surface antigen-1 (MSA-1) locus of Plasmodium falciparum. Mol. Biol. Evol. 9:381–393.

    ———. 1999. Adaptive evolution of genes and genomes. Oxford University Press, New York.

    ———. 2004. The evolution of amino acid repeat arrays in Plasmodium and other organisms. J. Mol. Evol. 59:528–535.

    Hughes, A. L., and R. Friedman. 2003. Genome-wide survey for genes horizontally transferred from cellular organisms to baculoviruses. Mol. Biol. Evol. 20:979–987.

    Hughes, A. L., and M. Nei. 1988. Pattern of nucleotide substitution at MHC class I loci reveals overdominant selection. Nature 335:167–170.

    Hughes, M. K., and A. L. Hughes. 1995. Natural selection on Plasmodium surface proteins. Mol. Biochem. Parasitol. 71:99–113.

    Kemp, D. J., R. L. Coppel, and R. F. Anders. 1987. Repetitive genes and proteins of malaria. Ann. Rev. Microbiol. 41:181–208.

    Kimura, M. 1977. Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution. Nature 267:275–276.

    Kimura, M., and T. Ohta. 1973. Mutation and evolution at the molecular level. Genetics 73(Suppl.):19–35.

    Nei, M., and T. Gojobori. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3:418–426.

    Okkels, L. M., I. Brock, F. Follmann, E. A. Agger, S. M. Arend, T. H. M. Ottenhoff, F. Oftung, I. Rosenkrands, and P. Andersen. 2003. PPE protein (Rv3873) from DNA segment of RD1 of Mycobacterium tuberculosis: strong recognition of both specific T-cell epitopes and epitopes conserved within the PPE family. Infect. Immun. 71:6116–6123.

    Plotkin, J. B., J. Dushoff, and H. B. Fraser. 2004. Detecting selection using a single genome sequence of M. tuberculosis and P. falciparum. Nature 428:942–945.

    Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.

    Verra, F., and A. L. Hughes. 1999. Biased amino acid composition in repeat regions of Plasmodium antigens. Mol. Biol. Evol. 16:627–633.

    Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555–556.

    Yang, Z., and R. Nielsen. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol. Biol. Evol. 17:32–43.

    (Robert Friedman and Austi)