Effect of Strong Directional Selection on Weakly Selected Mutations at Linked Sites: Implication for Synonymous Codon Usage
     Department of Biological Statistics and Computational Biology, Cornell University

    E-mail: ykim@mail.rochester.edu.

    Abstract

    The fixation of weakly selected mutations can be greatly influenced by strong directional selection at linked loci. Here, I investigate a two-locus model in which weakly selected, reversible mutations occur at one locus and recurrent strong directional selection occurs at the other locus. This model is analogous to selection on codon usage at synonymous sites linked to nonsynonymous sites under strong directional selection. Two approximations obtained here describe the expected frequency of the weakly selected preferred alleles at equilibrium. These approximations, as well as simulation results, show that the level of codon bias declines with an increasing rate of substitution at the strongly selected locus, as expected from the well-understood theory that selection at one locus reduces the efficacy of selection at linked loci. These solutions are used to examine whether the negative correlation between codon bias and nonsynonymous substitution rates recently observed in Drosophila can be explained by this hitchhiking effect. It is shown that this observation can be reasonably well accounted for if a large fraction of the nonsynonymous substitutions on genes in the data set are driven by strong directional selection.

    Key Words: codon bias • linkage • interference • hitchhiking • Drosophila

    Introduction

    Recent studies suggested that the dynamics of a weakly selected variant can be greatly influenced by strong selection at closely linked loci (Barton 1995; Gillespie 2001). With little recombination between loci, the frequency change of a weakly selected allele is determined by its initial association with the strongly selected allele. For example, if a weakly selected mutation occurs on a chromosome carrying a strongly beneficial mutation, it will increase in frequency along with the beneficial mutation, regardless of the direction of the weak selection. It is analogous to the effect of directional selection on linked neutral variants, or the "hitchhiking" effect (Maynard Smith and Haigh 1974). However, whereas the hitchhiking effect does not change the average rate of substitutions at neutral loci (Birky and Walsh 1988), it does change the rate of substitution at weakly selected loci (Birky and Walsh 1988; Gillespie 2001). The average fixation probability of weakly beneficial (deleterious) alleles is decreased (increased) by the hitchhiking effect of strong selection at linked loci. These changes in fixation probabilities are generally interpreted as a reduction of the efficacy of selection at one locus because of selection at other linked sites, commonly referred to as Hill-Robertson effects (Hill and Robertson 1966). The fixation probability of a weakly beneficial mutation affected by linked selection was studied by Barton (1995) and by Gerrish and Lenski (1998), and that for a weakly deleterious mutation with complete linkage was studied by Gillespie (2001). The present study obtains slightly different solutions for the fixation probability of newly arising weakly selected alleles, either deleterious or beneficial, linked to a locus under recurrent directional selection. Analytic solutions given here are intended to describe the dynamics of molecular evolution at synonymous sites in protein-coding sequences.

    A number of studies indicate that synonymous sites in protein-coding sequences are under weak selection (Sharp and Li 1987; Shields et al. 1988). For each amino acid, a certain synonymous codon ("optimal" or "preferred" codon) is used more frequently than others. It is believed that preferred codons are selectively advantageous over others because of the differences in translational efficiency and accuracy among alternative codons. The strength of selection on alternative codons was found to be very low; that is, selection coefficients are estimated to be on the order of 1/2Ne (Akashi 1995; Akashi and Schaeffer 1997; McVean and Vieira 2001). Therefore, the dynamics of synonymous substitutions may be greatly influenced by moderate or strong directional selection at linked loci. The hitchhiking effect will increase the fixation probability of "deleterious" unpreferred codons and decrease the fixation probability of "beneficial" preferred codons. Therefore, the level of codon bias should decrease with increasing rate of positively selected substitutions at linked sites. This prediction is consistent with recent observations in Drosophila. Betancourt and Presgraves (2002) analyzed 255 genes from D. melanogaster and D. simulans (102 from GenBank and 153 from a D. simulans male-specific EST screen) and found a strong negative correlation between the frequency of optimal codon usage and the rate of nonsynonymous substitutions (dN). Assuming that dN is indicative of the rate of positive selection, they concluded that the Hill-Robertson interference between selection acting on codon usage bias and amino acid substitutions caused the decline of optimal codon usage with increasing dN. However, there are several alternative hypotheses (see Discussion) to explain the reduction of codon bias where amino acid sequence is not well conserved, which has been observed earlier (Ticher and Graur 1989; Akashi 1994). For example, a relaxation of functional constraints may both elevate dN and depress the frequency of optimal codons. Therefore, to accept the hypothesis of interference as a satisfying explanation for the observed correlation between codon bias and dN, one needs to demonstrate that the observed correlation can be obtained by reasonable parameter values of directional selection in Drosophila species.

    Model and Simulation Method

    A two-locus model in which weak selection occurs at one locus and strong selection at the other is assumed. These loci are referred to as the "weak" locus and the "strong" locus throughout this paper. At the weak locus, which models the synonymous site of a twofold degenerate codon, mutations occur from allele A to a with rate μ10 and from a to A with rate μ01 each generation. The relative fitness of A and a is given by 1 and 1 – sw, respectively. At the strong locus, the wild-type allele b mutates to the beneficial allele B with rate μs per generation. The relative fitness of b and B is given by 1 and 1 + ss, respectively. If B is fixed, this allele becomes the new wild-type allele. Therefore, immediately after fixation, all copies of B revert to b, and, subsequently, a new mutation from b to B can occur. To simplify the model, a haploid population of 2N chromosomes is assumed. Then the "weak" and "strong" selection refer to the condition that sw = O(1/N) and ss >> sw. The recombination rate between the two loci is given by r per generation.

    There are four possible haplotypes in this model: AB, Ab, aB, and ab. The dynamics of the system is therefore simply described by the changes of four haplotype frequencies, which are x1, x2, x3, and x4, respectively. The frequency change in each generation is generated by a set of equations that determines the effect of selection, recombination, and mutation in an infinite population. The derivation of these equations is straightforward (Ewens 1979). The solution of these equations represents the pool of gametes from which the next generation is produced according to the Wright-Fisher model. The multinomial sampling is simulated using the random binomial number generator of Press et al. (1992). This approach correctly simulated a population at mutation-selection-drift balance (Kim and Stephan 2000). Simulations start with x1 = x3 = 0 and x2 = x4 = 0.5. An initial phase of 4N generations allows the system to reach the mutation-selection-drift equilibrium. During the second phase, which is 5 × 10^9 generations long, the allele frequency of A (= x1 + x2) is monitored. The frequency of the optimal codon usage (Fop [see below]) is estimated by the average frequency of A observed over the entire period. The number of fixation events at the strong locus is also counted.
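
    The simulation scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's program: the multiplicative combination of fitnesses across the two loci, the order of the selection, recombination, and mutation steps, the use of NumPy's multinomial sampler in place of the Press et al. (1992) binomial generator, and the function names next_gamete_freqs and simulate_fop are all assumptions made for the example.

```python
import numpy as np

def next_gamete_freqs(x, sw, ss, r, mu10, mu01, mus):
    """Deterministic step for haplotype frequencies x = (AB, Ab, aB, ab)."""
    # Selection; multiplicative fitness across loci is an assumption of this sketch.
    w = np.array([1 + ss, 1.0, (1 - sw) * (1 + ss), 1 - sw])
    x = x * w / np.dot(x, w)
    # Recombination shifts haplotype frequencies by r times the disequilibrium D.
    D = x[0] * x[3] - x[1] * x[2]
    x = x + r * D * np.array([-1.0, 1.0, 1.0, -1.0])
    AB, Ab, aB, ab = x
    # Reversible mutation at the weak locus (A -> a at rate mu10, a -> A at rate mu01).
    AB, aB = AB * (1 - mu10) + aB * mu01, aB * (1 - mu01) + AB * mu10
    Ab, ab = Ab * (1 - mu10) + ab * mu01, ab * (1 - mu01) + Ab * mu10
    # One-way mutation at the strong locus (b -> B at rate mus).
    AB, Ab = AB + mus * Ab, Ab * (1 - mus)
    aB, ab = aB + mus * ab, ab * (1 - mus)
    return np.array([AB, Ab, aB, ab])

def simulate_fop(N, sw, ss, r, mu10, mu01, mus, burn_in, generations, seed=0):
    """Return (mean frequency of A, number of fixations of B) after the burn-in."""
    rng = np.random.default_rng(seed)
    x = np.array([0.0, 0.5, 0.0, 0.5])  # x1 = x3 = 0 and x2 = x4 = 0.5
    sum_freq_A, fixations = 0.0, 0
    for g in range(burn_in + generations):
        p = next_gamete_freqs(x, sw, ss, r, mu10, mu01, mus)
        counts = rng.multinomial(2 * N, p / p.sum())  # Wright-Fisher sampling
        if counts[0] + counts[2] == 2 * N:
            # B is fixed and becomes the new wild type: AB -> Ab and aB -> ab.
            counts = np.array([0, counts[0], 0, counts[2]])
            fixations += g >= burn_in
        x = counts / (2 * N)
        if g >= burn_in:
            sum_freq_A += x[0] + x[1]
    return sum_freq_A / generations, fixations
```

    A call such as simulate_fop(N=200, sw=0.0025, ss=0.1, r=0.0, mu10=1.5e-5, mu01=5e-6, mus=1e-5, burn_in=800, generations=5000) runs quickly, but because the weak locus is fixed for one allele most of the time, a run this short gives only a very noisy estimate of Fop; the runs reported in the text are many orders of magnitude longer.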

    Theory

    A formula describing the frequency of the optimal codon usage (Fop) is obtained considering a twofold degenerate amino acid site. I assume a very low mutation rate at the synonymous site (the weak locus in the model above) such that the site is fixed with either the preferred (A) or unpreferred (a) codon at any given time with high probability. Fop is defined as the proportion of sites fixed for A. Then, assuming that the flux of substitutions between A and a is at equilibrium, it has been shown that

    $$F_{op} = \frac{K_{01}}{K_{01} + K_{10}} \qquad (1)$$

    where K01 (K10) is the rate of substitution from allele a to A (A to a) (Bulmer 1991; McVean and Charlesworth 1999). The fixation probability of a weakly selected mutation experiencing genetic drift with effective population size Ne is given by

    $$u(s) = \frac{1 - e^{-4 N_e s p_0}}{1 - e^{-4 N_e s}} \qquad (2)$$

    (Ewens 1979), where s is the selection coefficient, and p0 = 1/(2N) is the initial frequency of the mutation. Therefore, in the absence of interference from the strong locus, the expectation of Fop is obtained by equation 1 using Ne = N, K01 = 2Nμ01u(sw), and K10 = 2Nμ10u(–sw). With recurrent directional selection at the strong locus and low recombination, fixation probabilities cannot be obtained by equation 2. I therefore attempt to derive substitution rates at the weak locus under this situation.
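
    As a numerical illustration of the no-interference baseline, the sketch below evaluates the standard diffusion fixation probability for a new mutant at frequency 1/(2N) and the resulting equilibrium Fop from the balance of substitution rates. The exact form of the fixation probability (exponent 4Ne·s·p0) is taken here as an assumption consistent with the drift and diffusion parameters stated in the text, and the function names are hypothetical.

```python
import math

def fixation_prob(s, N, Ne=None):
    """Diffusion fixation probability of a new mutant at initial frequency 1/(2N)."""
    Ne = N if Ne is None else Ne
    p0 = 1.0 / (2 * N)
    if s == 0:
        return p0  # neutral limit
    # expm1 keeps precision when the exponents are small.
    return math.expm1(-4 * Ne * s * p0) / math.expm1(-4 * Ne * s)

def fop_equilibrium(N, sw, mu01, mu10, Ne=None):
    """Equilibrium proportion of sites fixed for the preferred allele A."""
    k01 = 2 * N * mu01 * fixation_prob(sw, N, Ne)    # a -> A substitution rate
    k10 = 2 * N * mu10 * fixation_prob(-sw, N, Ne)   # A -> a substitution rate
    return k01 / (k01 + k10)

if __name__ == "__main__":
    N = 10**4
    # 2N*mu01 = 0.002 and 2N*mu10 = 0.006, as in the figure 1 parameter set.
    for two_N_sw in (0.5, 1, 2, 4):
        print(two_N_sw, fop_equilibrium(N, two_N_sw / (2 * N), 1e-7, 3e-7))
```

    With these parameter values and 2Nsw = 1, the function returns a value of about 0.71, and with sw = 0 it reduces to the mutational expectation μ01/(μ01 + μ10) = 0.25.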

    First, no recombination is assumed between the weak and the strong loci in the model described above. The fate of a new mutant allele at the weak locus (a or A) depends only on genetic drift and weak selection until a substitution at the strong locus occurs. Assume that the strongly selected allele, B, which eventually goes to fixation, appears t generations after the mutation occurred at the weak locus. This mutant at the weak locus may go to fixation before B arises or, if it is not yet fixed at time t, through genetic hitchhiking with B. In either case, B arises on a chromosome carrying this mutant with probability p, the frequency of the mutant at the time of mutation at the strong locus (Gillespie 2001). Therefore, the fixation probability of the weakly selected mutant conditional on the waiting time t until B arises is given by

    $$f(s \mid t) = \int_0^1 p\,\phi(p, t)\,dp = E_t[p]$$

    where φ(p, t) is the frequency density of the mutant allele after t generations of genetic drift and weak selection. Et[p], the mean allele frequency, should satisfy E0[p] = p0 = 0.5/N and E∞[p] = u(s) (Ne = N), where s is the selection coefficient of the mutant (sw for A and –sw for a). One may derive Et[p] using a well-known diffusion equation

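    $$\frac{d}{dt} E_t[g(p)] = E_t\left[\mu(p, t)\,g'(p) + \frac{\sigma^2(p, t)}{2}\,g''(p)\right]$$

    where g(p) is a function of p, and μ(p, t) and σ²(p, t) are the infinitesimal drift and diffusion parameters (Stephan, Wiehe, and Lenz 1992). Using g(p) = p, μ(p, t) = sp(1 – p), and σ²(p, t) = p(1 – p)/2N, we obtain

    $$\frac{d}{dt} E_t[p] = s\,E_t[p(1 - p)] \qquad (3)$$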

    Therefore, to solve for Et[p], one needs the solution of Et[p²]. Because this moment expansion does not close, I instead assume that the average decay of heterozygosity, Et[2p(1 – p)], is not much different from that under neutral evolution. This assumption is not unreasonable for small t, because the behavior of a weakly selected allele still at low frequency is similar to that of a neutral allele. Previous studies of the neutral model showed that Et[p(1 – p)] = p0(1 – p0)exp(–t/2N) (Ewens 1979). After using this approximation for equation (3), we obtain Et[p] ≈ p0 + s(1 – exp(–t/2N)). For large t, this solution is slightly smaller than u(s) with s ≈ 1/2N, but is significantly smaller than u(s) for large t and s ≈ –1/2N. Therefore, the final approximation, rescaled so that E∞[p] = u(s), is given by

    $$E_t[p] \approx p_0 + \left[u(s) - p_0\right]\left(1 - e^{-t/2N}\right) \qquad (4)$$

    Next, the recurrent fixation of B at the strong locus is assumed to occur with rate λ per generation. With small λ and strong selection (2Nss >> 1), substitutions at this locus may be approximated as a Poisson process. Then the waiting time until the hitchhiking event, t, is approximately exponentially distributed. Therefore, the fixation probability of a weakly selected allele is given by

    $$f(s, \lambda) = \int_0^\infty \lambda e^{-\lambda t}\,E_t[p]\,dt = \frac{u(s) + \lambda}{1 + 2N\lambda} \qquad (5)$$

    Then the expected frequency of allele A at the weak locus (Fop) is obtained by equation 1 using K01 = 2Nμ01f(sw, λ) and K10 = 2Nμ10f(–sw, λ). Therefore, under hitchhiking with zero recombination, the expected level of codon bias is given by

    $$F_{op} = \frac{\mu_{01} f(s_w, \lambda)}{\mu_{01} f(s_w, \lambda) + \mu_{10} f(-s_w, \lambda)} \qquad (6)$$

    K01 and K10 defined in this way are decreasing and increasing functions of the strong-locus substitution rate λ, respectively. Figure 1A shows that the predicted levels of Fop by equation 6 agree well with simulation results for values of 2Nsw ranging from 0.5 to 4.
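
    The dependence of Fop on the rate of substitution at the strong locus under zero recombination can be explored with the short sketch below. It rests on two assumptions that should be read as such: the mean frequency of the weak-locus mutant relaxes from p0 to u(s) as 1 – exp(–t/2N), and the waiting time to the next sweep is exponential with rate lam, which together give a fixation probability of (u(s) + lam)/(1 + 2N·lam). The function names are hypothetical.

```python
import math

def fixation_prob(s, N):
    """Diffusion fixation probability without interference (Ne = N)."""
    p0 = 1.0 / (2 * N)
    if s == 0:
        return p0
    return math.expm1(-4 * N * s * p0) / math.expm1(-4 * N * s)

def fixation_prob_hitchhiking(s, N, lam):
    """Fixation probability with sweeps at rate lam and no recombination.

    Assumed form: average of E_t[p] over an exponential waiting time.
    """
    return (fixation_prob(s, N) + lam) / (1 + 2 * N * lam)

def fop_hitchhiking(N, sw, mu01, mu10, lam):
    """Expected frequency of the preferred allele under recurrent sweeps."""
    k01 = 2 * N * mu01 * fixation_prob_hitchhiking(sw, N, lam)
    k10 = 2 * N * mu10 * fixation_prob_hitchhiking(-sw, N, lam)
    return k01 / (k01 + k10)

if __name__ == "__main__":
    N = 10**4
    for two_N_lam in (0.0, 0.1, 1.0, 10.0):
        lam = two_N_lam / (2 * N)
        print(two_N_lam, fop_hitchhiking(N, 1 / (2 * N), 1e-7, 3e-7, lam))
```

    Under these assumptions the fixation probabilities of both allele types approach the neutral value 1/(2N) as lam grows, so Fop declines toward the mutational expectation μ01/(μ01 + μ10).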

    FIG. 1. Decreasing Fop with increasing rate of substitutions at the strong locus. Pairs of simulation result (joined by dashed lines) and its theoretical prediction (continuous curve) are shown for four levels of the strength of selection at the weak locus (from top to bottom, 2Nsw = 4, 2, 1, and 0.5, respectively). N = 10^4, 2Nμ01 = 0.002, 2Nμ10 = 0.006, 2Nss = 1,000, and r = 0. (A) Theoretical predictions are given by equation 6. (B) Theoretical prediction is given by equation 8 in which h is given by equation 19 of Stephan, Wiehe, and Lenz (1992).

    The decline of Fop shown in figure 1 can be interpreted as a decrease in the efficacy of selection with increasing interference between selected alleles from two loci (Hill and Robertson 1966). This reduced efficacy of selection may be summarized by a reduced effective population size (Ne), which is inversely proportional to the strength of genetic drift experienced by segregating mutants, since the strength of selection diminishes with increasing degree of genetic drift. Therefore, one may expect that the fixation probabilities of weakly selected alleles would be approximated by replacing Ne in equation 2 with a proper value that takes interference from the strong locus into account. Under the model of recurrent hitchhiking, previous studies showed that

    $$N_e = \frac{N}{1 + 2N\lambda(1 - h)} \qquad (7)$$

    where λ is the rate of substitution at the strong locus per generation and h is the relative heterozygosity at a neutral locus immediately after the fixation of the beneficial mutation (Stephan, Wiehe, and Lenz 1992; Wiehe and Stephan 1993; Gillespie 2000a). Under the assumption of no recombination, h = 0. It is not clear whether it is correct to replace Ne in equation 2 with the above formula, since Ne defined in equation 2 is the determinant of the sampling variance of allele frequency change between generations in the Wright-Fisher model. On the other hand, equation 7 was derived based on the effect of hitchhiking on the coalescent. Gillespie (2000a) demonstrated the equivalence of these two definitions of Ne for neutral variants under recurrent hitchhiking. However, it is not known whether this equivalence can still be applied to the genetic drift of weakly selected alleles. I examined the validity of the approximation using equation 7 by comparing results from simulation and theory. The new approximation for the expected Fop is

    $$F_{op} = \frac{\mu_{01} u'(s_w)}{\mu_{01} u'(s_w) + \mu_{10} u'(-s_w)} \qquad (8)$$

    where u'(.) is given by equation 2 but using Ne defined by equation 7. Figure 1B shows that equation 8 and simulation results agree very well for small values of 2Nsw, but the approximation gets worse with increasing 2Nsw. A possible explanation for the discrepancy in the latter case is discussed below.
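
    The effective-population-size approach can be sketched in the same way. The code below assumes that recurrent hitchhiking reduces the effective size to N/(1 + 2N·lam·(1 – h)), where lam is the rate of sweeps and h the relative heterozygosity remaining after a sweep, and then evaluates the ordinary diffusion fixation probability with that reduced size while keeping the initial frequency at 1/(2N). Both the formula for the effective size and the function names are assumptions of this example.

```python
import math

def effective_size(N, lam, h=0.0):
    """Effective size under recurrent sweeps (assumed form)."""
    return N / (1 + 2 * N * lam * (1 - h))

def fixation_prob(s, N, Ne):
    """Diffusion fixation probability with census size N and effective size Ne."""
    p0 = 1.0 / (2 * N)
    if s == 0:
        return p0
    return math.expm1(-4 * Ne * s * p0) / math.expm1(-4 * Ne * s)

def fop_reduced_ne(N, sw, mu01, mu10, lam, h=0.0):
    """Expected Fop when interference is summarized by a reduced Ne."""
    Ne = effective_size(N, lam, h)
    u_pref = fixation_prob(sw, N, Ne)      # a -> A (preferred) fixation
    u_unpref = fixation_prob(-sw, N, Ne)   # A -> a (unpreferred) fixation
    return mu01 * u_pref / (mu01 * u_pref + mu10 * u_unpref)

if __name__ == "__main__":
    N = 10**4
    for two_N_lam in (0.0, 0.5, 2.0, 10.0):
        print(two_N_lam, fop_reduced_ne(N, 1 / (2 * N), 1e-7, 3e-7, two_N_lam / (2 * N)))
```

    Setting h = 1 (no loss of heterozygosity, as with free recombination) leaves Ne equal to N and recovers the no-interference value, whereas h = 0 gives the strongest reduction for a given lam.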

    With nonzero recombination between the weak and strong loci, the hitchhiking effect on codon bias may be predicted by equation 8 using an appropriate solution for h in equation 7, reasoning that this approximation should be valid for small 2Nsw as in the case of zero recombination. Figure 2 shows the fit of equation 8, where h is given by equation 19 of Stephan, Wiehe, and Lenz (1992), to simulation results with 2Nsw = 1 for various recombination rates (fig. 2A) and various strengths of selection at the strong locus (fig. 2B).

    FIG. 2. Decreasing Fop with increasing rate of substitutions at the strong locus. Pairs of simulation result (joined by dashed lines) and its theoretical prediction (continuous curve) are shown for three recombination rates between loci. Theoretical prediction is given by equation 8 in which h is given by equation 19 of Stephan, Wiehe, and Lenz (1992). (A) Three different recombination rates (from top to bottom, r/ss = 2, 0.2, and 0.02, respectively). N = 10^4, 2Nμ01 = 0.002, 2Nμ10 = 0.006, 2Nsw = 1, and 2Nss = 1,000. (B) Three different strengths of selection at the strong locus (from top to bottom, 2Nss = 100, 250, and 1,000, respectively). N = 10^4, 2Nμ01 = 0.002, 2Nμ10 = 0.006, 2Nsw = 1, and 4Nr = 20.

    Application to Data

    Next, I examined whether this theory based on the two-locus model can explain the negative correlation between codon bias (Fop) and dN observed among genes sampled from D. melanogaster and D. simulans (Betancourt and Presgraves 2002). It is beyond the scope of this paper to try to estimate the parameters of directional selection by a rigorous statistical analysis because there are a number of contributing factors that are not well known for Drosophila species. Instead, reasonable ranges of parameter values regarding positive selection in Drosophila species are explored to generate the relationship that is not qualitatively different from the data. Table 1 summarizes the notations used in this section. Definition of some parameters introduced earlier has been adjusted for data analysis.

    Table 1 Summary of Notation for Data Analysis.

    Before applying the theory to data, a complicating factor that was not modeled above but is obvious in the evolution of an actual coding sequence needs to be examined first. An optimal codon for one amino acid may change to a nonoptimal codon for a different amino acid, or vice versa, by a single nonsynonymous substitution. If a large proportion of nonsynonymous substitutions cause changes from preferred to unpreferred codons and they remain unpreferred for a long period, a negative correlation between dN and Fop can arise. I examined how much this process alone can explain the data using the following model. Let c10 be the probability that a preferred allele at a synonymous site changes its status to an unpreferred allele by a nonsynonymous substitution at the same codon. The probability for the other direction, c01, is similarly defined. Then, in the presence of weak selection for optimal codons, the expected Fop is obtained by modifying equation 1 into

    $$F_{op} = \frac{K_{01} + n_e k_N c_{01}}{K_{01} + K_{10} + n_e k_N (c_{01} + c_{10})} \qquad (9)$$

    where K01 = 2Nμ01u(sw) and K10 = 2Nμ10u(–sw) are the rates of synonymous substitution toward and away from the preferred codon, ne is the average effective number of nonsynonymous sites per codon, and kN is the rate of nonsynonymous substitution per site per generation. c01 and c10 may be obtained by examining the table of preferred codons in D. melanogaster (table 1 of Akashi [1994]), where there are 22 preferred and 37 unpreferred codons (Met, Trp, and termination codons are excluded). Of 137 possible one-step nonsynonymous mutations from preferred codons, 35 mutations cause switches into unpreferred codons. Therefore, c10 is estimated to be 0.255 (= 35/137) if the direction of amino acid substitution is assumed to be random. Similarly, c01 = 0.156 (35 switching mutations out of 223 nonsynonymous mutations from unpreferred codons). kN can be estimated directly from dN, assuming that dN obtained from the D. melanogaster–D. simulans pair is an indicator of the long-term rate of nonsynonymous substitution beyond the common ancestor of these species (see below); namely, kN = dN/(2T), where T is the divergence time in generations between D. melanogaster and D. simulans. T is assumed to be 3 × 10^7 using 3 million years of divergence and 10 generations per year (McVean and Vieira 2001). I use ne = 2.25. It is obvious from equation 9 that mutation rate at a synonymous site is a critical parameter because the preference status can be quickly reversed by synonymous site substitution. McVean and Vieira (2001) estimated that the 95% credibility interval for the mutation rate per site per generation in noncoding DNA in Drosophila is 10^–9 to 2.5 × 10^–9. The estimated rate per synonymous site was slightly higher (McVean and Vieira 2001). Therefore, similar values should be assumed for μ01 and μ10. N is chosen to be 10^6. However, up to a 5-fold increase in N does not change the result as long as 2Nsw remains constant (data not shown). Using these assumptions, the expected relationship between Fop and dN is plotted in figure 3, along with the actual data for various synonymous mutation rates and strengths of selection. Even with very small synonymous mutation rates (curve "d," μ01 = 2 × 10^–10 and μ10 = 6 × 10^–10), the expected Fop is larger than observed values for most genes of high dN. Therefore, the erosion of optimal codon usage at the site of nonsynonymous substitutions alone is unable to account for the observed decline of Fop with increasing dN. However, the contribution of this process to the decline of Fop may not be ignored. In the following analysis, this cause of codon bias change is included along with hitchhiking effects.
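
    The arithmetic behind these parameter values, and the way codon switching by nonsynonymous substitution enters the balance of fluxes, can be checked with the sketch below. The counts 35 and 137, the divergence time, and the values c01 = 0.156 and ne = 2.25 are taken from the text. The function fop_with_switching is one plausible flux-balance form (synonymous substitution plus switching at rate ne·kN·c in each direction) and should be read as an assumption of this example rather than as the exact expression used in the paper; all function names are hypothetical.

```python
def switch_probability(n_switch, n_total):
    """Fraction of one-step nonsynonymous mutations that flip codon preference."""
    return n_switch / n_total

def nonsyn_rate_per_generation(dN, T):
    """Per-site, per-generation nonsynonymous rate from pairwise divergence dN."""
    return dN / (2 * T)

def fop_with_switching(K01, K10, ne, kN, c01, c10):
    """Equilibrium Fop with synonymous substitution and codon switching (assumed form)."""
    gain = K01 + ne * kN * c01   # flux from unpreferred to preferred
    loss = K10 + ne * kN * c10   # flux from preferred to unpreferred
    return gain / (gain + loss)

c10 = switch_probability(35, 137)        # preferred -> unpreferred
T = 3_000_000 * 10                       # 3 million years at 10 generations per year
kN = nonsyn_rate_per_generation(0.06, T) # example gene with dN = 0.06

if __name__ == "__main__":
    print(c10, T, kN)
    # Limiting value when switching dominates: c01 / (c01 + c10).
    print(fop_with_switching(1e-9, 3e-9, 2.25, 1.0, 0.156, 0.255))
```

    In this form, Fop equals K01/(K01 + K10) when kN = 0 and tends to c01/(c01 + c10) when nonsynonymous switching dominates, which shows why the synonymous mutation rate controls how strongly this process can depress codon bias.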

    FIG. 3. Comparison of the data from Betancourt and Presgraves (2002) and predictions by equation 9. Fop and dN estimated for 253 D. simulans genes are shown as grey points. Five curves (labeled a to e) are drawn by equation 9 using the following values: a. 2Nsw = 1.2 and μ01 = 5 × 10^–10. b. 2Nsw = 0.8 and μ01 = 10^–9. c. 2Nsw = 0.8 and μ01 = 5 × 10^–10. d. 2Nsw = 0.8 and μ01 = 2 × 10^–10. e. 2Nsw = 0.5 and μ01 = 5 × 10^–10. Other parameters are N = 10^6, ne = 2.25, c01 = 0.156, c10 = 0.255, and μ10 = 3μ01.

    A few assumptions are needed for applying the hitchhiking model described above to the data. First, the theory is applicable when the substitutions between the preferred and unpreferred codons are in equilibrium such that the codon bias at a gene remains constant throughout time. It has been inferred that the frequency of unpreferred codons has been increasing, presumably because of recent relaxation of selective pressure on codon usage, since the split of D. melanogaster and D. simulans lineages (Akashi 1996; McVean and Vieira 2001). However, the departure of Fop from equilibrium during this period alone can be ignored because only 5% of synonymous sites have been subject to substitutions after the split (Betancourt and Presgraves 2002). There is evidence that selection on codon bias has been maintained for a much longer period, predating the speciation of D. melanogaster and D. simulans (Powell and Moriyama 1997; McVean and Vieira 2001). Therefore, it may not be unreasonable to assume that the current pattern of Fop in these species has been shaped mainly by a long-term mutation-selection-drift balance that once attained near equilibrium. Second, I assume that dN estimated between D. melanogaster and D. simulans genes is proportional to the rate of positively selected substitutions not only in these lineages but also in the period predating the common ancestor of these species; that is, the relative rate of adaptive evolution among genes is assumed to remain constant through time.

    As the cumulative effects of selected substitutions at many nonsynonymous sites determine the level of codon bias at a given synonymous site, the number and spatial structure of nonsynonymous sites for each gene are important factors. However, for many genes included in the data of Betancourt and Presgraves (2002), this information is not available. I therefore simplify the analysis by considering an "ideal" gene that consists of one coding region without introns. The expected Fop for a synonymous site located in the middle of this gene is calculated. Assume that there are L nonsynonymous sites on each side of this synonymous site and that the rate of recombination with the ith closest nonsynonymous site is given by ir per generation. For r = 0, the expected Fop can be obtained by modifying equation 6 with λ = 2Lk, where λ is the total rate of substitution at the linked strongly selected sites and k is the number of strongly selected substitutions per nonsynonymous site per generation. For r > 0, the effective population size for calculating fixation probabilities is now

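    $$N_e = \frac{N}{1 + 4Nk \sum_{i=1}^{L} \left[1 - h(ir)\right]} \qquad (10)$$

    (Kim and Stephan 2000). Then the expected Fop is given by modifying equation 8 into

    $$F_{op} = \frac{2N\mu_{01} u'(s_w) + n_e k_N c_{01}}{2N\mu_{01} u'(s_w) + 2N\mu_{10} u'(-s_w) + n_e k_N (c_{01} + c_{10})} \qquad (11)$$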

    where u'(.) is given by equation 2 using Ne by equation 10. For convenience, I use h(ir) = 1 – (4Nss)^(–ir/s), which well approximates the diffusion solution of Stephan, Wiehe, and Lenz (1992). k is obtained from dN, assuming that a fraction α of nonsynonymous substitutions is driven by positive selection. Namely, if the two species are separated for T generations since speciation, k is given by αdN/(2T). Smith and Eyre-Walker (2002) and Fay, Wyckoff, and Wu (2002) suggested that α for Drosophila genes is around 0.45 to 0.5. However, it is unrealistic to expect α to be constant over different genes. For genes whose dN is greater than the rate of synonymous substitutions, dS, most nonsynonymous substitutions are likely to have been fixed by positive selection. It is therefore expected that α itself is an increasing function of dN. I consider a simple relationship α = 1 – exp(–dN/S), where S (= 0.1) is the mean synonymous substitution rate observed in the data. The observed mean number of codons, averaged over 236 genes for which this number is available, is 571. Given that many of these are partial EST sequences, the true average number of codons should be higher than this value. Therefore, a value of L between 500 and 1,000 might be chosen to represent the data, assuming that about 75% of sites in an exon are effectively nonsynonymous. Mean recombination rate per nucleotide estimated from the data is 3 × 10^–8 (= 0.003 cM/kb) (Betancourt and Presgraves 2002). A higher value of r than this should be used, considering that the presence of introns is ignored in the model. T is assumed to be 3 × 10^7 generations (see above).
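
    The chain from an observed dN to a reduced effective population size can be illustrated as follows. The relationship for the adaptive fraction, the expression for k, and the approximation for h are taken from the text, with the selection coefficient in the exponent of h read as ss (an assumption). The formula for the effective size, N/(1 + 4Nk·Σ[1 – h(ir)]) with the sum running over the L sites on one side and the factor of two accounting for both sides, is likewise an assumed form chosen so that it reduces to N/(1 + 2N·2Lk) when r = 0. Function names and the example parameter values are hypothetical.

```python
import math

def adaptive_fraction(dN, S=0.1):
    """Fraction of nonsynonymous substitutions assumed to be adaptive."""
    return 1.0 - math.exp(-dN / S)

def sweep_rate_per_site(dN, T, S=0.1):
    """Strongly selected substitutions per nonsynonymous site per generation."""
    return adaptive_fraction(dN, S) * dN / (2 * T)

def relative_heterozygosity(i, r, N, ss):
    """Approximate heterozygosity retained at distance i*r from a sweep."""
    return 1.0 - (4 * N * ss) ** (-i * r / ss)

def effective_size(N, k, L, r, ss):
    """Effective size with L selected sites on each side (assumed form)."""
    lost = sum(1.0 - relative_heterozygosity(i, r, N, ss) for i in range(1, L + 1))
    return N / (1 + 4 * N * k * lost)

if __name__ == "__main__":
    N, T = 5 * 10**6, 3 * 10**7
    ss = 1000 / (2 * N)          # 2N*ss = 1,000
    for dN in (0.0, 0.02, 0.05, 0.1):
        k = sweep_rate_per_site(dN, T)
        print(dN, effective_size(N, k, L=500, r=1e-7, ss=ss) / N)
```

    Because the adaptive fraction itself rises with dN in this parameterization, k grows faster than linearly with dN, so the reduction in effective size is concentrated in the fastest-evolving genes.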

    Examination of the solutions derived above reveals that Fop at equilibrium mainly depends on the ratio of μ01 and μ10 but not their absolute values, if the hitchhiking effect is the major force reducing codon bias. I define β = μ10/μ01. Although analytic solutions were obtained for twofold degenerate sites, they are also applicable to fourfold degenerate sites by distinguishing preferred from all unpreferred alleles. Considering also the AT-biased mutation in Drosophila (Petrov and Hartl 1999), which causes more substitutions toward unpreferred alleles (Akashi 1996; Powell and Moriyama 1997), β ranging from 2 to 4 would be reasonable when applying the theory to the data.

    In figure 4, expectations of Fop as functions of dN for various sets of parameter values are plotted against the data for D. simulans. Population size N, which corresponds to the hypothetical effective population size for D. simulans after eliminating strong directional selection over the entire genome, is assumed to be 5 × 10^6. The observed mean of Fop for dN = 0 (around 0.65) and that for high dN (around 0.3) allow quite narrow ranges of β (3 to 4) and 2Nsw (0.5 to 1.5) to fit the data. With combinations of the parameter values considered here, a reasonable fit is obtained only when quite strong selection (2Nss = 200 to 4,000) for nonsynonymous sites is assumed. If a larger value of r is used, a proportionally larger value of 2Nss needs to be used.

    FIG. 4. Comparison of the data and predictions by equation 11. Five different curves (labeled a to e) are drawn by equation 11 using the following parameter sets: a. β = 3.5, 2Nsw = 0.8, 2Nss = 200, L = 500, and r = 10^–7. b. β = 3, 2Nsw = 1, 2Nss = 600, L = 1,000, and r = 5 × 10^–8. c. β = 3.5, 2Nsw = 0.8, 2Nss = 2,000, L = 500, and r = 10^–7. d. β = 4, 2Nsw = 1.4, 2Nss = 4,000, L = 1,000, and r = 10^–7. e. β = 4, 2Nsw = 0.6, 2Nss = 1,000, L = 1,000, and r = 3 × 10^–8. Other parameters are N = 5 × 10^6, μ01 = 10^–9, c01 = 0.156, c10 = 0.255, and ne = 2.25.

    Discussion

    Two forms of simple approximations were found for the effect of strong directional selection on substitutions of weakly selected mutants at linked sites. The first solution (equation 6) explicitly models the fixation process of a weakly selected allele linked to a beneficial allele. It uses an approach similar to that of Gillespie (2001), which, however, ignores genetic drift in finite populations. This method of derivation was possible only for zero recombination. The second solution (equation 8), which allows recombination and provides a good approximation for small 2Nsw, is theoretically less complete because it simply exploits the fact that the fixation probability depends on a certain form of effective population size (Ne). Agreement of the simulation results and equation 8 (figs. 1B and 2) indicates that the "coalescent effective" population size for neutral variants (Gillespie 2000b) is approximately equal to the "fixation effective" population size (Otto and Whitlock 1997) for weakly selected alleles (2Nsw ≤ 1) under the model considered. As the perturbation of allele frequency caused by hitchhiking is similar to that of a population bottleneck (Barton 1998), the nature of the stochastic force under recurrent hitchhiking should be similar to that under population size fluctuation. Then the failure of equation 8 for moderately selected alleles (2Nsw > 1) may be understood by the arguments of Otto and Whitlock (1997). They showed that a simple approximation for the fixation of a beneficial allele with a cyclically varying population size, using the harmonic mean of changing population size, is possible only when the population cycle is sufficiently faster than the time scale of the fixation process (on the order of 1/s). Therefore, harmonic mean approximation works best for a very weakly selected allele. If selection is strong and thus the time scale is short, the fixation probability will not depend on the long-term population size change but on the specific direction of change at the beginning of the substitution. Similarly, the probability of fixation at the weak locus in the hitchhiking model depends on its time scale of fixation and the rate of substitutions at the strong locus. If selection at the weak locus is not so weak, the time scale of fixation becomes short compared with that of a neutral allele, and, thus, one cannot expect that the approximation using equation 7, which is based on the stochastic behavior of a neutral allele, is still valid.

    Codon usage bias has largely been investigated under single-site models in which the frequency of the preferred allele is determined by mutation bias, weak selection (on translation efficiency), and genetic drift (or effective population size) (Bulmer 1991; McVean and Charlesworth 1999). This paradigm of codon bias prompted numerous investigations to find and evaluate major predictors of codon bias in Drosophila genome, such as local recombination rate, gene length, and gene expression level (Kliman and Hey 1993; Comeron et al. 1999; Marais, Mouchiroud, and Duret 2001). It is not straightforward to find a direct role of the nonsynonymous substitution rate (dN) in a simple model. However, it was previously reported that the codon usage is more biased for amino acids that are more conserved between species (Ticher and Grauer 1989; Akashi 1994). Using a larger data set in Drosophila, Betancourt and Presgraves (2002) showed that dN is not only an additional contributing factor to codon bias but also one whose correlation with codon bias is much stronger than those with recombination rate and gene length. This correlation with dN might be explained in many ways. Akashi (1994) argued that the selection for translational accuracy maintains a high frequency of preferred codons for highly conserved amino acids for which the cost of misincorporation is higher. Therefore, a lower codon bias is expected at amino acid sites under relaxed constraints. In a similar argument, constraints on amino acid sites and the strength of selection on the optimal codon usage may be correlated within a gene. These hypotheses were not well supported, because the exclusion of divergent amino acid sites does not change the degree of correlation between dN and codon bias, and because the rapid evolution of genes in the data set does not appear to be caused by relaxed purifying selection (Betancourt and Presgraves 2002). 
It is also possible that the correlation of dN and codon bias is a secondary product of correlation with other unobserved predictors, such as gene expression level. Although these explanations cannot be ruled out, this study focuses on the plausibility of the hitchhiking effect of nonsynonymous substitutions as an explanation for the correlation found by Betancourt and Presgraves (2002). Because the effective population size is a critical component in the single-site model of codon bias, and selection on neighboring sites changes the effective population size, it is not difficult to predict the effect of interference (Hill and Robertson 1966; Barton 1995; Gillespie 2001). Akashi (1996) also considered this effect as one of the possible explanations for fast protein evolution and reduced codon bias in the D. melanogaster lineage relative to the D. simulans lineage.
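    The dependence of the single-site model on effective population size can be illustrated with a minimal sketch. It assumes the standard Li-Bulmer equilibrium form, Fop = 1 / (1 + k·exp(-gamma)), where k is the ratio of mutation away from the preferred codon to mutation toward it and gamma is the scaled intensity of weak selection. The function name, the parameter values, and the idea of representing interference as a simple proportional reduction of gamma are illustrative assumptions, not the approximations derived in this study.

```python
import math


def fop_single_site(gamma, bias):
    """Equilibrium frequency of the preferred codon (Li-Bulmer form).

    gamma: scaled intensity of weak selection (assumed proportional to
           the effective population size times the selection coefficient).
    bias:  mutation rate away from the preferred codon divided by the
           rate toward it (hypothetical value supplied by the caller).
    """
    return 1.0 / (1.0 + bias * math.exp(-gamma))


if __name__ == "__main__":
    baseline_gamma = 2.0   # assumed scaled selection without interference
    mutation_bias = 2.0    # assumed mutation bias
    # Treat selection at linked sites as shrinking the effective
    # population size, and therefore gamma, by a constant factor.
    for reduction in (1.0, 0.5, 0.1):
        gamma = baseline_gamma * reduction
        fop = fop_single_site(gamma, mutation_bias)
        print(f"relative Ne = {reduction:.1f}  gamma = {gamma:.2f}  Fop = {fop:.3f}")
```

    With gamma = 0 the expression reduces to 1 / (1 + k), the frequency set by mutation bias alone, so any force that lowers the effective population size moves codon usage toward that mutational equilibrium.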

    Using an analytic approximation based on a two-locus interaction between weak and strong selection, this study demonstrated that the observed correlation of codon bias and dN can be well accounted for by the hitchhiking effect if a significant fraction of amino acid substitutions is driven by strong directional selection. Although no rigorous statistical parameter estimation was conducted, under simplifying assumptions about numerous genomic parameters the average strength of selection (2Ns) required to produce the observed pattern of codon bias was inferred to be at least on the order of 100 (fig. 4). Candidate genes for male accessory gland proteins, which comprise 24.4% of the genes in the data set of Betancourt and Presgraves (2002), are known to be under strong positive selection (Swanson et al. 2001). However, it is not known in general whether other new adaptive amino acid variants are under such strong positive selection in Drosophila. Because the strength of selection required to explain a given level of codon bias depends critically on the number of linked nonsynonymous sites and on recombination rates, a more accurate inference of this parameter will be possible if the structure and local recombination rate of each individual gene are taken into account in the analysis. Although it was assumed above that a given synonymous site is influenced only by strong selection at linked nonsynonymous sites in the same gene, positive selection acting on a flanking regulatory region, and even strong selection on a neighboring gene, may further reduce the level of codon bias by hitchhiking effects. If the rate of adaptive change in the promoter region, for example, increases along with the rate of positive selection in the coding region, this could lead to an overestimation of the strength of selection at nonsynonymous sites.

    Because a good approximation for codon bias under the two-locus model was obtained through equation 7, not only strong directional selection but also any other selective force that reduces the effective population size at a linked site is expected to lower codon bias. Recent studies have considered the effect of weak selection at linked sites. McVean and Charlesworth (2000) and Comeron and Kreitman (2002) showed that the interaction of many segregating alleles at tightly linked synonymous sites ("weak selection Hill-Robertson [wsHR] interference" or "interference selection") lowers the level of codon bias below the expectation of the single-site mutation-selection-drift model. However, the preferred allele frequency under wsHR interference deviates significantly from the standard model only when there are a large number of synonymous sites in tight linkage. At this point, the relative importance of wsHR interference and the hitchhiking effect of strongly selected nonsynonymous substitutions in determining the degree of codon bias cannot be evaluated. Most likely, these two forces act simultaneously in nature. However, unless there is a correlation between the strength of wsHR and dN, the effect of wsHR may simply be regarded as a factor lowering 2Nsw in the model above.

    Surveys of protein-coding sequences from numerous species revealed that the degree of codon bias is similar among species of very different census population sizes, such as E. coli, yeast, and Drosophila (Powell and Moriyama 1997). This is a puzzling observation in light of the single-site mutation-selection-drift model of codon bias, which allows only a very narrow range of 2Nsw. It was suggested that wsHR interference lowers codon bias from the level predicted by the single-site model and maintains intermediate levels of codon bias over several orders of magnitude in population size (McVean and Charlesworth 2000). Hitchhiking effects of strong positive selection on codon bias, as modeled in this study, may also reduce the dependency of codon bias on population size. Figure 5 compares the expected levels of optimal codon usage predicted by equation 6 (two-locus model) with and without the hitchhiking effect. Population size varies from 10⁴ to 10⁸ while selection coefficients and mutation rates for the weak and strong loci remain constant. The reduction of codon bias caused by hitchhiking shows a rather complex pattern with increasing population size. Once population size exceeds the point where preferred alleles are predicted to reach fixation in the single-site model, expected codon bias decreases with increasing population size. The increasing input of positively selected mutations with increasing population size accelerates hitchhiking effects, which overpower the increasing intensity of selection (2Nsw) at the weak locus. A similar curve was obtained by Gillespie (2001) when he examined the substitution rate of weakly selected alleles linked to a strongly selected locus with varying population size.
Therefore, the uniformity of codon bias over different species might be explained at least partially by theory that places the hitchhiking effect as a dominant stochastic force governing molecular evolution and thus suggests "population size may not be relevant to a species' evolution" (Gillespie 2001).
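    The two opposing quantities in this argument can be tabulated with a short sketch. It uses the parameter values given in the caption of figure 5 (sw = 10⁻⁵, ss = 10⁻³, μs = 10⁻⁸) and the expression 2Nμs·u(ss) for the rate of substitution at the strong locus. The form of u(s) is an assumption made here for illustration: Kimura's diffusion fixation probability for a new genic mutation among 2N gene copies, with effective size equal to N. The sketch does not evaluate equation 6, so it shows only how the input of strongly selected substitutions and the intensity of weak selection both scale with N, not the resulting level of codon bias.

```python
import math


def fixation_prob(s, n):
    """Assumed form of u(s): Kimura's diffusion result for a new mutation
    at initial frequency 1/(2n) under genic selection, with Ne = n."""
    return -math.expm1(-2.0 * s) / -math.expm1(-4.0 * n * s)


def strong_substitution_rate(n, mu_s, s_s):
    """Rate of substitution at the strong locus, 2 * N * mu_s * u(s_s)."""
    return 2.0 * n * mu_s * fixation_prob(s_s, n)


def weak_selection_intensity(n, s_w):
    """Scaled selection intensity at the weak locus, 2 * N * s_w."""
    return 2.0 * n * s_w


if __name__ == "__main__":
    s_w, s_s, mu_s = 1e-5, 1e-3, 1e-8  # values from the figure 5 caption
    for exponent in range(4, 9):
        n = 10 ** exponent
        rate = strong_substitution_rate(n, mu_s, s_s)
        intensity = weak_selection_intensity(n, s_w)
        print(f"N = 1e{exponent}  2N*sw = {intensity:g}  strong rate = {rate:.3g}")
```

    Under these assumptions both quantities grow in direct proportion to N, so the shape of the curve in figure 5 depends on how strongly each enters equation 6 rather than on either quantity alone.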

    FIG. 5. Frequency of optimal codon usage (Fop) as a function of population size N. Fop is given by equation 6 with μ10/μ01 = 2. Selection coefficients for both loci are constant: sw = 10⁻⁵ and ss = 10⁻³. The mutation rate at the strong locus is fixed at μs = 10⁻⁸. Therefore, the rate of substitution at the strong locus is 2Nμs·u(ss) (continuous curve). The dashed curve is drawn for a rate of 0 (single-site model).

    Acknowledgements

    I thank Andrea Betancourt and Daven Presgraves for their great support and help in the analysis of the Drosophila data. I also thank Adam Eyre-Walker, Rasmus Nielsen, Molly Przeworski, Wolfgang Stephan, and two anonymous reviewers for their insights and comments that greatly improved the manuscript. This research was supported by National Science Foundation grant DEB-0089487 to Rasmus Nielsen.

    Literature Cited

    Akashi, H. 1994. Synonymous codon usage in Drosophila melanogaster: natural selection and translational accuracy. Genetics 136:927-935.

    Akashi, H. 1995. Inferring weak selection from patterns of polymorphism and divergence at "silent" sites in Drosophila DNA. Genetics 139:1067-1076.

    Akashi, H. 1996. Molecular evolution between Drosophila melanogaster and D. simulans: reduced codon bias, faster rates of amino acid substitution, and larger proteins in D. melanogaster. Genetics 144:1297-1307.

    Akashi, H., and S. W. Schaeffer. 1997. Natural selection and the frequency distributions of "silent" DNA polymorphism in Drosophila. Genetics 146:295-307.

    Barton, N. H. 1995. Linkage and the limits to natural selection. Genetics 140:821-841.

    Barton, N. H. 1998. The effect of hitch-hiking on neutral genealogies. Genet. Res. 72:123-133.

    Betancourt, A. J., and D. C. Presgraves. 2002. Linkage limits the power of natural selection in Drosophila. Proc. Natl. Acad. Sci. USA 99:13616-13620.

    Birky, C. W., and J. B. Walsh. 1988. Effects of linkage on rates of molecular evolution. Proc. Natl. Acad. Sci. USA 85:6414-6418.

    Bulmer, M. 1991. The selection-mutation-drift theory of synonymous codon usage. Genetics 129:897-907.

    Comeron, J. M., and M. Kreitman. 2002. Population, evolutionary and genomic consequences of interference selection. Genetics 161:389-410.

    Comeron, J. M., M. Kreitman, and M. Aguadé. 1999. Natural selection on synonymous sites is correlated with gene length and recombination in Drosophila. Genetics 151:239-249.

    Ewens, W. J. 1979. Mathematical population genetics. Springer-Verlag, New York.

    Fay, J. C., G. J. Wyckoff, and C.-I. Wu. 2002. Testing the neutral theory of molecular evolution with genomic data from Drosophila. Nature 415:1024-1026.

    Gerrish, P. J., and R. E. Lenski. 1998. The fate of competing beneficial mutations in an asexual population. Genetica 102/103:127-144.

    Gillespie, J. H. 2000a. Genetic drift in an infinite population: the pseudohitchhiking model. Genetics 155:909-919.

    Gillespie, J. H. 2000b. The neutral theory in an infinite population. Gene 261:11-18.

    Gillespie, J. H. 2001. Is the population size of a species relevant to its evolution? Evolution 55:2161-2169.

    Hill, W. G., and A. Robertson. 1966. The effect of linkage on the limits to artificial selection. Genet. Res. 8:269-294.

    Kim, Y., and W. Stephan. 2000. Joint effects of genetic hitchhiking and background selection on neutral variation. Genetics 155:1415-1427.

    Kliman, R. M., and J. Hey. 1993. Reduced natural selection associated with low recombination in Drosophila melanogaster. Mol. Biol. Evol. 10:1239-1258.

    Marais, G., D. Mouchiroud, and L. Duret. 2001. Does recombination improve selection on codon usage? Lessons from nematode and fly complete genomes. Proc. Natl. Acad. Sci. USA 98:5688-5692.

    Maynard Smith, J., and J. Haigh. 1974. The hitch-hiking effect of a favourable gene. Genet. Res. 23:23-35.

    McVean, G. A. T., and B. Charlesworth. 1999. A population genetic model for the evolution of synonymous codon usage: patterns and predictions. Genet. Res. 74:145-158.

    McVean, G. A. T., and B. Charlesworth. 2000. The effects of Hill-Robertson interference between weakly selected mutations on patterns of molecular evolution and variation. Genetics 155:929-944.

    McVean, G. A. T., and J. Vieira. 2001. Inferring parameters of mutation, selection and demography from patterns of synonymous site evolution in Drosophila. Genetics 157:245-257.

    Otto, S. P., and M. C. Whitlock. 1997. The probability of fixation in populations of changing size. Genetics 146:723-733.

    Petrov, D. A., and D. L. Hartl. 1999. Patterns of nucleotide substitution in Drosophila and mammalian genomes. Proc. Natl. Acad. Sci. USA 96:1475-1479.

    Powell, J. R., and E. N. Moriyama. 1997. Evolution of codon usage bias in Drosophila. Proc. Natl. Acad. Sci. USA 94:7784-7790.

    Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. 1992. Numerical recipes in C. Cambridge University Press, Cambridge, U.K.

    Sharp, P. M., and W.-H. Li. 1987. The rate of synonymous substitution in enterobacterial genes is inversely related to codon usage bias. Mol. Biol. Evol. 4:222-230.

    Shields, D. C., P. M. Sharp, D. G. Higgins, and F. Wright. 1988. "Silent" sites in Drosophila genes are not neutral: evidence of selection among synonymous codons. Mol. Biol. Evol. 5:704-716.

    Smith, N. G. C., and A. Eyre-Walker. 2002. Adaptive protein evolution in Drosophila. Nature 415:1022-1024.

    Stephan, W., T. H. E. Wiehe, and M. W. Lenz. 1992. The effect of strongly selected substitutions on neutral polymorphism: analytical results based on diffusion theory. Theor. Popul. Biol. 41:237-254.

    Swanson, W. J., A. G. Clark, H. M. Waldrip-Dail, M. F. Wolfner, and C. F. Aquadro. 2001. Evolutionary EST analysis identifies rapidly evolving male reproductive proteins in Drosophila. Proc. Natl. Acad. Sci. USA 98:7375-7379.

    Ticher, A., and D. Grauer. 1989. Nucleic acid composition, codon usage, and the rate of synonymous substitution in protein-coding genes. J. Mol. Evol. 28:286-298.

    Wiehe, T. H. E., and W. Stephan. 1993. Analysis of a genetic hitchhiking model, and its application to DNA polymorphism data from Drosophila melanogaster. Mol. Biol. Evol. 10:842-854.

    (Yuseob Kim)