Gene "Volatility" Is Most Unlikely to Reveal Adaptation
Molecular Biology and Evolution
Institute of Genetics, University of Nottingham, Nottingham, UK
Correspondence: E-mail: paul@evol.nott.ac.uk.
Abstract
It has recently been claimed that adaptive molecular evolution can be detected within single genome sequences by use of gene "volatility" scores. However, the approach used was entirely based on the assumption that synonymous codon usage is normally shaped by selection for low volatility; this is most unlikely to be true. Furthermore, even if that assumption could be justified, the method would clearly lack power, detecting only genes where a very large number of nonsynonymous substitutions had occurred. Volatility scores are susceptible to other influences. The unusually high volatilities of the Mycobacterium tuberculosis and Plasmodium falciparum genes that were identified as putatively having undergone adaptive changes were largely the result of internally repetitive structures, in which unusual codon usage was caused by the mechanisms that generated this repetition rather than by adaptive changes.
Key Words: volatility • adaptation • codon usage • Mycobacterium tuberculosis • Plasmodium falciparum
Adaptive molecular evolution is normally detected by comparative analyses of homologous sequences (Sharp 1997; Yang 2002), but recently, Plotkin, Dushoff, and Fraser (2004) claimed to have detected adaptation in Mycobacterium tuberculosis and Plasmodium falciparum by use of single genome sequences. However, their approach was based on an unjustified assumption. They looked for genes with unusually high "volatility." The volatility of a codon, v(c), was defined as the fraction of single nucleotide substitutions that would be nonsynonymous, and the volatility of a gene, v(G), was defined as the sum of its codon values. Plotkin, Dushoff, and Fraser (2004) tested whether v(G) is unusually high by comparing the value to those for genes encoding the same protein sequence but with codon usage drawn randomly from that of the genome as a whole. Thus, the method tested for unusual synonymous codon usage, biased in a particular way. It assumed that codon usage is normally subject to selection for low volatility and that nonsynonymous mutations yield codons of higher volatility. The suggestion that codon usage might be selected to reduce potential damage caused by mutation is not new (Modiano, Battistuzzi, and Motulski 1981; Golding and Strobeck 1982), but because the selective pressure is likely to be of the same order as the mutation rate, it is most unlikely to be effective (Golding and Strobeck 1982; Kimura 1983). Indeed, studies of synonymous codon usage in M. tuberculosis (Andersson and Sharp 1996) and P. falciparum (Piexoto, Fernandez, and Musto 2004) have revealed evidence of selection for translationally optimal codons in genes expressed at high levels but no obvious trend towards avoidance of high volatility codons.
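The definitions above can be sketched in a few lines of Python. This is an illustrative sketch, not the code used by Plotkin, Dushoff, and Fraser (2004). It assumes the standard genetic code and assumes that neighbouring stop codons are left out of the denominator of v(c), an assumption that reproduces the scores quoted in this article (0.889 for AGY, 0.571 to 0.667 for UCN). The function names, the `usage` argument, and the resampling details are hypothetical.

```python
from fractions import Fraction
import random

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ("*" = stop).
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def neighbours(codon):
    """Yield the nine codons that differ from `codon` at exactly one position."""
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                yield codon[:pos] + base + codon[pos + 1:]


def volatility(codon):
    """v(c): fraction of non-stop neighbours encoding a different amino acid.

    Excluding stop-codon neighbours is an assumption (see lead-in).
    """
    sense = [n for n in neighbours(codon) if CODE[n] != "*"]
    nonsyn = [n for n in sense if CODE[n] != CODE[codon]]
    return Fraction(len(nonsyn), len(sense))


def gene_volatility(codons):
    """v(G): sum of the codon volatilities, as defined in the text."""
    return sum(volatility(c) for c in codons)


def resampling_p_value(codons, usage=None, n_draws=1000, seed=0):
    """Share of random synonymous recodings with v(G) >= the observed value.

    `usage` is a hypothetical dict of genome-wide codon counts; if omitted,
    synonyms are drawn uniformly.
    """
    rng = random.Random(seed)
    synonyms = {}
    for codon, aa in CODE.items():
        if aa != "*":
            synonyms.setdefault(aa, []).append(codon)
    observed = gene_volatility(codons)
    hits = 0
    for _ in range(n_draws):
        draw = []
        for c in codons:
            options = synonyms[CODE[c]]
            weights = [usage[o] for o in options] if usage else None
            draw.append(rng.choices(options, weights=weights)[0])
        if gene_volatility(draw) >= observed:
            hits += 1
    return hits / n_draws


print(float(volatility("AGU")), float(volatility("UCA")))
```

Because every random recoding encodes the same protein, any difference between the observed and resampled v(G) comes from synonymous codon choice alone, which is why the test is really a test of codon usage.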
It is also apparent that the volatility test lacks power to detect nonsynonymous substitutions. First, the method only considers amino acids where synonyms vary in their volatility. Consideration of the genetic code (table 1) reveals that this applies to only four amino acids: serine, leucine, arginine, and glycine (SLRG). Any nonsynonymous mutations to the other 16 amino acids are undetected. Second, only a fraction of mutations to SLRG yield an increase in volatility. For example, 57% of the possible nonsynonymous mutations to serine codons yield UCN codons with volatility scores in the range 0.571 to 0.667, whereas serine codons in genes with high volatility values are dominated by AGY triplets, with volatility scores of 0.889 (table 1). Third, a large number of volatility-enhancing nonsynonymous mutations are needed. To raise the v(G) of an average M. tuberculosis gene to a significant value requires at least 20 of these specific nonsynonymous substitutions to SLRG codons. Also, although it was claimed that the method "controls for the gene's length and amino acid composition," it is clear that for two genes with the same average v(c), the longer gene (in fact the gene with more SLRG codons) has the greater chance of yielding a significant v(G) value. Taken together, these various factors indicate that very many nonsynonymous substitutions are required before a gene can be detected by this approach, and these must all have occurred within a short enough time period that codon usage has not been ameliorated by the (presumed) selection for lower volatility.
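The first two points can be checked directly from the genetic code. The sketch below makes the same assumptions as the scores quoted above: the standard code, with stop codons excluded both as neighbours and as mutational sources. It lists the amino acids whose synonymous codons differ in volatility, then counts the nonsynonymous single-nucleotide changes that create each serine codon. The helper names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ("*" = stop).
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def neighbours(codon):
    """Yield the nine codons that differ from `codon` at exactly one position."""
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                yield codon[:pos] + base + codon[pos + 1:]


def volatility(codon):
    """Fraction of non-stop neighbours that encode a different amino acid."""
    sense = [n for n in neighbours(codon) if CODE[n] != "*"]
    nonsyn = [n for n in sense if CODE[n] != CODE[codon]]
    return Fraction(len(nonsyn), len(sense))


# 1. Amino acids whose synonymous codons do not all share one volatility score.
scores_by_aa = defaultdict(set)
for codon, aa in CODE.items():
    if aa != "*":
        scores_by_aa[aa].add(volatility(codon))
variable = sorted(aa for aa, scores in scores_by_aa.items() if len(scores) > 1)
print(variable)  # ['G', 'L', 'R', 'S']

# 2. Nonsynonymous changes that create a serine codon, split by target codon.
#    The neighbour relation is symmetric, so counting the non-serine sense
#    neighbours of each serine codon counts the mutations that lead into it.
into_serine = {codon: sum(1 for n in neighbours(codon)
                          if CODE[n] not in ("S", "*"))
               for codon, aa in CODE.items() if aa == "S"}
to_ucn = sum(n for codon, n in into_serine.items() if codon.startswith("UC"))
total = sum(into_serine.values())
print(to_ucn, total, round(to_ucn / total, 2))  # 21 37 0.57
```

Under these assumptions, 21 of the 37 possible nonsynonymous changes into serine produce a low-volatility UCN codon, which is the 57% given in the text; only the remaining 16 produce an AGY codon and so raise v(G).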
Table 1 Codon Volatility Scores
Very few genes had significantly high volatility values. Plotkin, Dushoff, and Fraser (2004) estimated the probability of volatility values for 5,440 P. falciparum genes. Given this multiplicity of tests, a Bonferroni approach suggests that P < 0.00001 is required for significance at the 5% level (0.05/5,440 ≈ 0.000009); fewer than 1% of genes (53) had such low values. Among the 4,099 M. tuberculosis genes, only 12 had values significant by the same criterion. Clearly, it would be extremely surprising if so few genes had undergone adaptive evolution.
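The arithmetic behind these thresholds is easy to reproduce. The short sketch below uses only the gene counts given in the text; the variable names are illustrative.

```python
ALPHA = 0.05          # family-wise significance level used in the text
CRITERION = 0.00001   # rounded per-gene threshold applied to both genomes

# (species, genes tested, genes reported as significant)
genomes = [("P. falciparum", 5440, 53), ("M. tuberculosis", 4099, 12)]

summary = {}
for species, n_genes, n_significant in genomes:
    bonferroni = ALPHA / n_genes           # exact Bonferroni threshold
    share = n_significant / n_genes        # fraction of genes detected
    expected_null = CRITERION * n_genes    # expected hits if no gene is unusual
    summary[species] = (bonferroni, share, expected_null)
    print(f"{species}: threshold {bonferroni:.2e}, "
          f"detected {share:.2%}, expected by chance {expected_null:.3f}")
```

For P. falciparum the exact threshold is about 9.2e-06, which rounds to the 0.00001 used above, and the 53 significant genes amount to just under 1% of those tested.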
Given that some genes were detected as having significantly high volatility, what might be the explanation? In both of the genomes examined, most of the genes identified contain obvious repetitive regions. The v(G) values of these genes are inflated by the presence of high volatility codons, particularly the maximally volatile Ser codons AGU and AGC, within these repeats. Five of the simplest examples from P. falciparum are illustrated in table 2. For example, gene PF10_0356 contains 68 copies of a 17-codon repeat, and it appears that the initial element contained a single serine codon, which happened to be AGC. Thus, 67 copies contain an AGC codon at the same position, and there are only four other serine codons (all UCU) in this region, whereas in the remainder of the gene there are 32 serine codons, six AGY and 26 UCN. Consequently, within the repetitive region the average v(c) is 0.737, whereas in the surrounding unique sequences the value is 0.684, compared with a genome average of 0.702. All five of the genes in table 2 were among the 37 given the highest significance scores (probability values of zero) in the P. falciparum analysis (Plotkin, Dushoff, and Fraser 2004). Most of the other genes with significantly high volatility values also seem to have internal repetitions, although the structures are more complex than the examples given in table 2, for example involving several areas of repetition of different basic units.
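The serine counts for PF10_0356 show how strongly a single founder codon can skew a gene. The sketch below uses only the counts given in the text and the codon scores quoted earlier (0.889 for AGY, 0.571 to 0.667 for UCN). It assumes UCU scores 0.667, the top of that range. It covers serine codons only, so its means are not comparable with the region averages of 0.737 and 0.684, which include other codons.

```python
from fractions import Fraction

V_AGY = Fraction(8, 9)       # 0.889, score quoted for AGU and AGC
V_UCN_LOW = Fraction(4, 7)   # 0.571, lowest UCN score quoted
V_UCN_HIGH = Fraction(2, 3)  # 0.667, highest UCN score (assumed to apply to UCU)

# Serine codon counts for PF10_0356, taken from the text.
repeat_region = {"AGY": 67, "UCN": 4}    # the four UCN codons are all UCU
unique_region = {"AGY": 6, "UCN": 26}


def agy_share(counts):
    """Fraction of serine codons that are AGY."""
    return Fraction(counts["AGY"], counts["AGY"] + counts["UCN"])


def mean_serine_volatility(counts, v_ucn):
    """Mean v(c) over serine codons, given one score for all UCN codons."""
    total = counts["AGY"] + counts["UCN"]
    return (counts["AGY"] * V_AGY + counts["UCN"] * v_ucn) / total


repeat_mean = mean_serine_volatility(repeat_region, V_UCN_HIGH)
unique_low = mean_serine_volatility(unique_region, V_UCN_LOW)
unique_high = mean_serine_volatility(unique_region, V_UCN_HIGH)

print(f"AGY share: repeats {float(agy_share(repeat_region)):.1%}, "
      f"unique {float(agy_share(unique_region)):.1%}")
print(f"mean serine v(c): repeats {float(repeat_mean):.3f}, "
      f"unique {float(unique_low):.3f} to {float(unique_high):.3f}")
```

The unique-sequence figure is a range because the text does not say which UCN codons those 26 are.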
Table 2 Volatility Values in Repeat Regions of Plasmodium falciparum Genes
Among the 12 M. tuberculosis genes with significantly high volatility, 10 encode PE_PGRS or PPE family proteins, which are long and have unusual amino acid (and codon) composition because of the presence of many repeats (Cole 1999). It is also worth noting that the genome average v(c) in M. tuberculosis is only 0.650 (analyzing only the relevant amino acids, SLRG); among 80 genomes from diverse bacterial species that I examined, M. tuberculosis ranked fifth lowest in this regard. Variation of synonymous codon usage among bacterial species is primarily influenced by genomic G+C content, presumably reflecting fundamental mutation biases (Sharp et al. 1993). Among the 80 species, genome average volatility scores were strongly negatively correlated with genomic G+C content. As a consequence, in G+C-rich species such as M. tuberculosis, any genes horizontally transferred from more A+T-rich species are likely to stand out as having unusually volatile codon usage.
The high volatility values of the genes identified by Plotkin, Dushoff, and Fraser (2004) were primarily caused by unusual codon usage within repetitive regions. This can be easily explained by sequence duplication by recombination and/or slippage mechanisms, whereas coincidental selection of the same individual nonsynonymous mutations in numerous copies of the repeats seems extremely unlikely. The volatility test assumes that each codon can be treated independently, whereas among repeated regions, codons share common origins. Presumably, there are many other genes with repetitive regions in these genomes but where the codons in the starting unit happened not to have high volatility scores. In conclusion, the volatility test detects genes with highly unusual SLRG codon usage but is most unlikely to detect adaptive evolution.
Other authors have recently explored the volatility test from other angles, also concluding that this methodology does not detect adaptive evolution, and thus that the conclusions reached by Plotkin, Dushoff, and Fraser (2004) were erroneous (Dagan and Graur 2005; Friedman and Hughes 2005; Zhang 2005).
Acknowledgements
I am grateful to John Armour, John Brookfield, and Bryan Clarke for discussion of these issues.
References
Andersson, S. G. E., and P. M. Sharp. 1996. Codon usage in the Mycobacterium tuberculosis complex. Microbiology 142:915–925.
Cole, S. T. 1999. Learning from the genome sequence of Mycobacterium tuberculosis H37Rv. FEBS Lett. 452:7–10.
Dagan, T., and D. Graur. 2005. The comparative method rules! Codon volatility cannot detect positive Darwinian selection using a single genome sequence. Mol. Biol. Evol. (in press).
Friedman, R., and A. L. Hughes. 2005. Codon volatility as an indicator of positive selection: data from eukaryotic genome comparisons. Mol. Biol. Evol. (in press).
Golding, G. B., and C. Strobeck. 1982. Expected frequencies of codon use as a function of mutation rates and codon fitnesses. J. Mol. Evol. 18:379–386.
Kimura, M. 1983. The neutral theory of molecular evolution. Cambridge University Press, Cambridge.
Modiano, G., G. Battistuzzi, and A. G. Motulski. 1981. Nonrandom patterns of codon usage and of nucleotide substitutions in human alpha and beta globin genes: an evolutionary strategy reducing the rate of mutations with drastic effects? Proc. Natl. Acad. Sci. USA 78:1110–1114.
Piexoto, L., V. Fernandez, and H. Musto. 2004. The effect of expression levels on codon usage in Plasmodium falciparum. Parasitology 128:245–251.
Plotkin, J. B., J. Dushoff, and H. B. Fraser. 2004. Detecting selection using a single genome sequence of M. tuberculosis and P. falciparum. Nature 428:942–945.
Sharp, P. M. 1997. In search of molecular darwinism. Nature 385:111–112.
Sharp, P. M., M. Stenico, J. F. Peden, and A. T. Lloyd. 1993. Codon usage: mutational bias, translational selection, or both? Biochem. Soc. Trans. 21:835–841.
Yang, Z. 2002. Inference of selection from multiple sequence alignments. Curr. Opin. Genet. Dev. 12:688–694.
Zhang, J. 2005. On the evolution of codon volatility. Genetics (in press).
(Paul M. Sharp)