ABSTRACTo+!q&, 百拇医药
Estimates of the scaled selection coefficient,o+!q&, 百拇医药
{gamma} of Sawyer and Hartl, are shown to be remarkably robust to population subdivision. Estimates of mutation parameters and divergence times, in contrast, are very sensitive to subdivision. These results follow from an analysis of natural selection and genetic drift in the island model of subdivision in the limit of a very large number of subpopulations, or demes. In particular, a diffusion process is shown to hold for the average allele frequency among demes in which the level of subdivision sets the timescale of drift and selection and determines the dynamic equilibrium of allele frequencies among demes. This provides a framework for inference about mutation, selection, divergence, and migration when data are available from a number of unlinked nucleotide sites. The effects of subdivision on parameter estimates depend on the distribution of samples among demes. If samples are taken singly from different demes, the only effect of subdivision is in the rescaling of mutation and divergence-time parameters. If multiple samples are taken from one or more demes, high levels of within-deme relatedness lead to low levels of intraspecies polymorphism and increase the number of fixed differences between samples from two species. If subdivision is ignored, mutation parameters are underestimated and the species divergence time is overestimated, sometimes quite drastically. Estimates of the strength of selection are much less strongly affected and always in a conservative direction.
ONE of the primary goals of population genetics has been to measure and to understand the role of natural selection in shaping variation within and between species. Now that molecular technologies allow genetic variation to be assayed with relative ease, this goal seems within reach. A number of different approaches to studying selection have been proposed (HUDSON and KAPLAN 1988 ; NEUHAUSER and KRONE 1997 ; YANG 1998 ; DONNELLY et al. 2001 ; SLATKIN and BERTORELLE 2001 ), and a multitude of neutrality tests, reviewed by NIELSEN 2001 , can be applied if appropriate genetic data are gathered. This work considers SAWYER and HARTL's (1992) method, which belongs to a class of methods that use overall levels of polymorphism and divergence at two or more categories of sites in samples of DNA sequences from a pair of species. HUDSON et al. 1987 were the first to propose such a method, in which the categories were different loci, followed by MCDONALD and KREITMAN 1991 , who categorized sites within a locus as being either synonymous or nonsynonymous with respect to changes in the amino acid sequence of the protein product. Both methods assumed no intralocus recombination and allowed the hypothesis of strict selective neutrality to be tested. Shortly afterward, by assuming KIMURA's (1969) infinite-sites mutation model, i.e., with free recombination between sites, SAWYER and HARTL 1992 showed that McDonald-Krietman test data could be used not only to test neutrality but also to estimate selection, mutation, and divergence-time parameters.
NIELSEN 2001 pointed out that McDonald-Kreitman and related tests, in which sites can be classified a priori, provide a very powerful framework for inferences about natural selection, in contrast to tests like TAJIMA's (1989) and FU and LI's (1993), which measure deviations from the highly variable process of neutral coalescence. It is likely that McDonald-Kreitman and related methods will become the mainstay of genomic analyses of the role of selection. In two recent works, modified McDonald-Kreitman tests were applied to genomic data from Drosophila, suggesting that 45% of the amino acid differences between Drosophila simulans and D. yakuba resulted from positive selection (SMITH and EYRE-WALKER 2002 ) and that positive selection at a relatively small number of genes is responsible for the divergence of D. simulans and D. melanogaster (FAY et al. 2002 ). BUSTAMANTE et al. 2002 used a modified Sawyer-Hartl method to show that Arabidopsis species have experienced a higher proportion of deleterious amio acid substitutions than Drosophila species, in which positive selection is common, and attributed the difference to high levels of inbreeding in Arabidopsis.
An obvious shortcoming of these methods is that they assume the species under study are panmictic, i.e., not geographically or otherwise subdivided. It is well known that this assumption is incorrect for many species (SLATKIN 1985 ). When there is no intralocus recombination, MCDONALD and KREITMAN 1991 point out that shared genealogical history should control for the effects of demography when sites can be categorized a priori. It is less clear that this should be the case when collections of unlinked sites are used to estimate selection, mutation, and divergence-time parameters as in SAWYER and HARTL's (1992) method. It is possible that the effects of subdivision on the numbers of polymorphisms and fixed differences at synonymous and nonsynonymous sites could lead to errors in inferences. Therefore, the goal of this work is to extend the Poisson random field (PRF) theory of polymorphism and divergence developed by SAWYER and HARTL 1992 to include subdivided species. To do this, it is first shown that in the limit of a large number of subpopulations or demes allele-frequency dynamics at a single locus in a population with island-model migration (WRIGHT 1931 ; MORAN 1959 ; MARUYAMA 1970 ; LATTER 1973 ) are governed by a diffusion process that has the same form as the usual Wright-Fisher diffusion, e.g., see EWENS 1979 , but with a timescale different from that of the panmictic case. Then, the assumption of free recombination between sites allows the PRF model to be used to predict the patterns of variation in samples from a pair of island-model species.
The diffusion result is obtained using Theorem 3.3 in ETHIER and NAGYLAKI 1980 and relies upon the fact that the process of migration and drift within subpopulations occurs on a much faster timescale than changes in allele frequency by drift and selection in the total population. The result thus depends on a stochastic equilibrium of allele frequencies within demes with respect to migration and drift, which is also described. This follows some recent work (CHERRY and WAKELEY 2003) in which simulations supported the existence of such a diffusion under the additional assumption that demes are very large and migration rates correspondingly small. The present analysis shows that this additional assumption is unneccessary. The assumption of infinite deme sizes and infinitesimal migration rates was also made in the recent coalescent work on neutral large-number-of-demes models (WAKELEY 1998 , WAKELEY 2001 ), and it is made below in The expected number of neutral segregating sites, when the forward and backward results are compared. Otherwise, here it is assumed that the demes are finite in size and the migration rates are unconstrained.
This work makes a connection between the PRF theory and work on the robustness of the coalescent process to population structure (NORDBORG 1997 ; MOHLE 1998 ), in particular for the case of geographic structure (WAKELEY 1998 , WAKELEY 2001 ). The two are related by showing that the effective size of the ancestral, coalescent process is the same as that of the forward-time diffusion of allele frequencies and that the forward- and backward-derived predictions for the expected number of segregating sites in a sample are the same under neutrality. We expect such connections between forward and backward approaches to exist, a fact that is well established in the case of panmictic populations (EWENS 1990 ; MOHLE 2001 ). Like the forward (NAGYLAKI 1980 ) and backward (NOTOHARA 1993 ) strong-migration limits, these results and those of WAKELEY 1998 , WAKELEY 2001 for the coalescent process are based on a "separation of timescales." In this case, the fast processes are migration and drift within demes and the slow process is drift and possibly selection in the total population, which is mediated by migration. The effective size of the population is rescaled and patterns of genetic variation depend on how a sample is distributed among demes. In contrast, under the usual strong-migration limit, the only effect of structure is to rescale the effective size of the population (NAGYLAKI 1980 ; NOTOHARA 1993 ; NORDBORG 1997 ; CHARLESWORTH 2001 ).
The main result presented here, besides the existence of the diffusion (9) below, is that, if mutations are introduced at a constant rate per generation and sites segregate independently of one another, the PRF results of SAWYER and HARTL 1992 can be applied, but with a correction that depends on how samples are taken among demes. If each sample is taken from a different deme, then SAWYER and HARTL's (1992) results apply directly, but with slightly different mutation and divergence-time parameters. If some or all of the samples come from the same deme, the PRF results must be corrected for the effect of drift and migration within demes. Failure to recognize this can cause serious errors in the estimation of mutation rates and divergence times, but not, surprisingly, of selection coefficients.'ez, 百拇医药
THEORY'ez, 百拇医药
A population or species is assumed to be subdivided into D demes of equal size N. The organisms are assumed to be haploid, but the results will hold for diploid organisms if N is replaced with 2N, if selection is additive, and if migration is gametic. The island model of migration (WRIGHT 1931 ; MORAN 1959 ) is assumed: a fraction m of each deme is replaced by migrants every generation and all migrants are randomly sampled from a migrant pool to which all demes contribute equally. In each generation, migration occurs first, followed by selection, and then resampling (drift) within demes according to the Wright-Fisher model (FISHER 1930 ; WRIGHT 1931 ). In the next two sections, two alleles are assumed to be segregating at a single locus, and Many independently segregating loci considers their introduction by mutation. The wild-type or nonmutant allele has relative fitness equal to 1, and the mutant allele has fitness 1 + sD, where sD ">=" -1. The next section establishes the diffusion approximation for the frequency of the mutant allele as D "->"
{infty}1#, 百拇医药
, but DsD remains finite. The migration rate can vary between 0 and 1 (0 < m " 1) and N is assumed to be finite. This is in contrast to the usual assumption that Nm is finite as N goes to infinity.1#, 百拇医药
Considering the number of mutants within each deme, it is apparent that there are exactly N + 1 kinds of demes. Each deme that begins a generation with i copies of the mutant will have mutant frequency1#, 百拇医药
after migration and selection, where x is the frequency of the mutant in the total population. The next generation within the deme will be produced by randomly sampling N haploid individuals from this distribution. Thus, a deme that contains i copies of the mutant now has probability1#, 百拇医药
of having j copies at the start of the following generation. Because limD"->"1#, 百拇医药
{infty}sD = 0, it is often necessary to consider only one part of Pij:
The notation o(sD) used in and below means that limD"->"?!s!aq, 百拇医药
{infty}o(sD)/sD = 0. Thus Pij = P*ij + o(1). The process of drift, described by , happens independently within each deme?!s!aq, 百拇医药
Limiting allele frequency dynamics at a single locus:?!s!aq, 百拇医药
Let ZDi(t) record the fraction of demes that contain i copies of the mutant and zi(t) be a particular realization of this random variable. Thus, ZD(t) is a Markov chain whose state space consists of all possible configurations of the D demes among the N + 1 mutant-count classes. Appendix A proves a diffusion result for ZD(t) as D goes to infinity and DsD remains finite. Briefly, this is done by using of ETHIER and NAGYLAKI 1980 —see also of NAGYLAKI 1980. Define XD(t) = Ni=0iZDi(t)/N. The random variable XD(t) records the frequency of the mutant in the total population or the average frequency of the mutant among demes (x above). Next, let YD(t) = ZDi(t) - {nu}
i(t) be the deviation of ZDi(t) from the equilibrium prediction {nu}:^asj0, 百拇医药
i(t). For a given Pij(t), this equilibrium satisfies:^asj0, 百拇医药
It exists because P* =
*ij> is ergodic and has a finite number of states. We can set Ni=0{nu}:^asj0, 百拇医药
i(t) = 1, and {nu}:^asj0, 百拇医药
i(t) becomes the equilibrium prediction for ZDi(t).:^asj0, 百拇医药
The nature of the diffusion approximation (9) below is that the migration and drift within demes equilibrate quickly in comparison to the rate of drift and selection in the total population. The results show that, to a sufficient order of approximation, demes can be considered to always be at a stochastic equilibrium {nu}:^asj0, 百拇医药
j (0 " j " N) with respect to migration and drift for a given x. The fraction of demes that have j copies of the mutant converges on {nu}
j = Ni=1{nu}$, 百拇医药
iP*ij, where P*ij is given by with x constant. The distribution {nu}$, 百拇医药
is very well approximated by the hypergeometric distribution$, 百拇医药
which is a special case of the multivariate Poly(A) distribution; see Equation 40.13 in JOHNSON et al. 1997 . 5 is also identical to the two-allele case of the compound multinomial Dirichlet distribution, which RANNALA 1996 proved to hold for the frequencies of multiple alleles within a deme in the infinite-island or continent-island model, i.e., where allele frequencies among migrants are assumed to be constant. RANNALA 1996 did not assume Wright-Fisher reproduction, but rather that a birth-death-immigration process occurred within demes. Thus, RANNALA's (1996) model is similar to the Moran model, in which such distributions are known to arise: see pages 131–133 in MORAN 1962 . ROTHMAN et al. 1974 argued for the use of the compound multinomial Dirichlet distribution in the case of Wright-Fisher reproduction within demes.
The form of 5 was obtained by selecting parameters of a hypergeometric distribution that gave the same mean and variance of allele counts among demes as 4, namely@;, 百拇医药
which were obtained using (4) together with the moments of the binomial distribution (P*). 5 is the exact solution to (4) when N " 2. In addition, as required by (4): when m approaches 1, {nu}@;, 百拇医药
i becomes a binomial distribution with parameters N and x; and as m approaches 0, we have {nu}@;, 百拇医药
0 = 1 - x, {nu}@;, 百拇医药
N = x, and {nu}@;, 百拇医药
j = 0 for 1 " j " N - 1. Finally, if xi = j/N is the frequency in some deme i, then as N grows but 2Nm = M remains constant, (5) converges on the well-known ß-distribution result@;, 百拇医药
which WRIGHT 1931 obtained under the assumption that x was constant among migrants. To derive (8) from (5), it is necessary to use the limit result 6.1.46 in ABRAMOWITZ and STEGUN 1965 for ratios of gamma functions and to let dxi = 1/N. 1 plots the distribution (5) when N = 10 and x = 0.75 over the full range of migration rates. With these parameter values, the absolute error of using (5) to approximate the solution of (4) is never > ~
0.007 and the relative error is never > ~ji\t|w@, http://www.100md.com
5%.ji\t|w@, http://www.100md.com
fig.ommitteedji\t|w@, http://www.100md.com
Figure 1. The approximation (5) for the distribution of mutant allele counts among demes assuming that N = 10 and x = 0.75, shown as a function of the per-generation migration rate m.ji\t|w@, http://www.100md.com
Appendix A shows that, in the limit as D goes to infinity, the change in x by drift and selection is so much slower than that by migration and drift within demes that the collection of demes is always at the equilibrium {nu}ji\t|w@, http://www.100md.com
i, which depends on N and m, and of course x. By Theorem 3.3 of ETHIER and NAGYLAKI 1980 , as D goes to infinity the above system reduces to a diffusion x(·) with generatorji\t|w@, http://www.100md.com
in which {gamma} = N limD"->"ji\t|w@, http://www.100md.com
{infty}DsD. Time is measured in units of ND/(1 - F) generations, where F is the fixation coefficient, in this case given by A13 in Appendix A. Thus, the diffusion of x is identical to the usual Wright-Fisher diffusion with genic selection, with the exception that it occurs on a timescale longer than that of the panmictic case by the factor 1/(1 - F). Thus, all the well-known predictions of that model apply; e.g., see chapter 5 of EWENS 1979 .
CHERRY and WAKELEY 2003 assumed (8) to hold and showed that simulations agreed well with the predictions of the implied diffusion process, such as the time to fixation or loss of the mutant type. Without giving a proof, we can guess that this diffusion should be given by the results of the section above and Appendix A if ND "->"r|)c@, http://www.100md.com
{infty}r|)c@, http://www.100md.com
when D "->"r|)c@, http://www.100md.com
{infty}r|)c@, http://www.100md.com
and limD"->"r|)c@, http://www.100md.com
{infty}2NDmD = M, so that F = 1/(M + 1), but with limD"->"r|)c@, http://www.100md.com
{infty}ND/D = 0 (ETHIER and NAGYLAKI 1980 ). CHERRY and WAKELEY 2003 also showed that the distribution of allele frequencies among demes in simulations with N = 100 and m = 0.01 (and D = 1000 and sD = 0.001) conformed well to the predictions of 8 in a particular generation when x was equal to 0.611. Further support for the existence of this large-D, large-N diffusion is given in The expected number of neutral segregating sites by comparing its predictions under neutrality to those of the corresponding coalescent model (WAKELEY 1998 ). Otherwise, N is assumed here to be finite.
Many independently segregating loci:4p^6;2, 百拇医药
If we posit an infinite number of loci, i.e., nucleotide sites, which can sustain mutations and which each evolve according to the diffusion of the previous section independently, then the PRF results of SAWYER and HARTL 1992 hold for x. Because of the way time is measured in the diffusion, the appropriate mutation parameters are also scaled:4p^6;2, 百拇医药
The subscripts in 10 refer to "amino acid replacement" and "synonymous" following BUSTAMANTE et al. 2002 , and ua and us are the per-generation rates. Thus, one effect of restricted migration is to increase the apparent mutation rates over the panmictic case since 0 F 1. The other effect, of course, is to distribute variation among demes as described in the previous section. In addition, the parameter tdiv in SAWYER and HARTL 1992 must here be measured in units of ND/(1 - F) generations. With these modifications, 13 and Equati14on 14 in SAWYER and HARTL 1992 apply here to x.
Rewriting SAWYER and HARTL's (1992) 13 and 14 in terms of the present notation gives0#17h4^, 百拇医药
for the expected numbers of fixed and polymorphic, synonymous, and replacement differences in two species. When a sample is taken from the two species, as in SAWYER and HARTL 1992 , we need to consider the chance that a polymorphic site appears fixed in a sample from the species. Here, in contrast to the panmictic case, the distribution of the sample among demes becomes important.0#17h4^, 百拇医药
Assume that we have taken a random sample of n sequences from d different demes in one of the species, such that n1, n2, ... , nd are the sample sizes from each deme. We can write in general that the expected number of sites that show i1, i2, ... , id copies of the mutant base in the sample (0 ik nk) is given by0#17h4^, 百拇医药
where j = a, s. The probability h(ik|x, nk), that ik copies of the mutant base are in the sample of nk items from the kth sampled deme, is an average over the within-deme distribution of allele frequencies:
If N is large and m correspondingly small, we may wish to use the large-deme approximation:{nv$et, 百拇医药
That is, when N is large we can approximate the hypergeometric probability that the sample contain ik copies of the mutant allele (present in j copies in the deme) with a binomial distribution and the allele count distribution {nu}{nv$et, 百拇医药
j with WRIGHT's (1931) continous ß-distribution of allele frequences, g(xk|x).{nv$et, 百拇医药
Because we have assumed an infinite number of independently segregating sites with collective mutation rates given by (10), the PRF model (SAWYER and HARTL 1992 ) shows that Sj(i1, ... , id) is Poisson distributed with expected value equal to (15). The numbers of sites segregating at various frequencies within each deme contain information about migration rates, and the numbers of sites segregating at various frequencies in the total population contain information about the selection coefficient. Note that (15) can also be used to compute the expected number of apparent fixed differences, i.e., polymorphisms where the entire sample has the mutant base, as required in SAWYER and HARTL's (1992) analysis. This provides a framework for estimating selection coefficients (and migration rates) in the context of a subdivided population. As illustrated in RESULTS, we use 11121 in conjunction with 15 to obtain predictions about the numbers of fixed-synonymous, fixed-replacement, polymorphic-synonymous, and polymorphic-replacement sites in a sample from two species. Further, 15 gives the joint frequencies among demes of segregating polymorphisms. In the panmictic case, HARTL et al. 1994 , AKASHI 1999 , and BUSTAMANTE et al. 2001 showed that allele frequencies at polymorphic sites contain substantial information about selection.
RESULTS.(evm3, 百拇医药
The first result to note is that if each sample is taken from a different deme, the methods of SAWYER and HARTL 1992 can be applied wihout modification. It is necessary only to realize that the inferred mutation parameters and the divergence time are scaled in terms of ND/(1 - F) generations instead of the usual ND generations. This result follows from the fact that each sample drawn in this way has probability x of showing the mutant base. That is, h(1|x, 1) = x and h(0|x, 1) = 1 - x, and similarly for h*(ik|x, nk). Summing 15, for each species, over all i1, i2, ... , id such that 0 < dk=1ik < dk=1nk gives SAWYER and HARTL's (1992) 15 and 19 but with the scaled mutation rates that apply here: {theta} s and {theta} a. Similarly, SAWYER and HARTL's (1992) 17 and 18 are derived by considering the chance that ik = 1 for all k. In sum, inferences about selection coefficients, mutation rates, and divergence times are entirely robust to (island-model) population subdivision when each sample is taken from a different deme.
Inferences from single-deme samples:&+, http://www.100md.com
At the opposite extreme, consider the case in which all samples are drawn from the same deme within each species. Note that we assume, as in SAWYER and HARTL 1992 , that the two species are identical (here in terms of N, m, and {gamma} ). Let n1 and n2 denote the sample sizes from the two species. For this sample, the expected numbers of fixed-synonymous (Ks), fixed-replacement (Ka), polymorphic-synonymous (Ss), and polymorphic-replacement (Sa) sites are given by&+, http://www.100md.com
in which H(x, n) = 1 - h(n|x, n) - h(0|x, n). The results from Limiting allele frequency dynamics at a single locus are used to compute h(n|x, n) and h(0|x, n). Namely,&+, http://www.100md.com
This same equation can be used to compute h(0|x, n) = h(n|1 - x, n).&+, http://www.100md.com
2 plots the expected values of Ks, Ka, Ss, and Sa as functions of the migration rate when n1 = n2 = 10 and N = 100 and for three different values of {gamma} : -2, 0, and 2. The results are as expected for single-deme samples. When m = 1, they are the same as in a panmictic population. As m decreases, samples from single demes tend to be closely related, so the numbers of polymorphisms will decrease and the numbers of (apparent) fixation events will increase. This is true regardless of whether {gamma} is positive, zero, or negative, although the relative magnitudes of the four quantities depend strongly on {gamma} . The curves for E(Ks) and E(Ss) are, of course, identical for all values of {gamma} . The results that would be obtained by assuming limN"->"
{infty}2Nm = M and using 8 and 17 would be similar to what is shown in 2 if M were varied from 0.02 to 200.'pp{6, http://www.100md.com
fig.ommitteed'pp{6, http://www.100md.com
Figure 2. The dependence on migration rate (m) of the expected values of Ks, Ka, Ss, and Sa computed using 18192021, assuming n1 = n2 = 10 and N = 100. In addition, {theta} s = 10, {theta} a = 5, and tdiv = 7. (a) = 2; (b) = 0; (c) = -2.'pp{6, http://www.100md.com
To understand the effects of (island-model) population subdivision for the extreme case of single-deme samples, we can use the "data" of 2 to fit the parameters of SAWYER and HARTL's (1992) panmictic model. 3 shows that estimates of {gamma} are remarkably robust to subdivision, even in this case, where the effects of subdivision should be strongest. Again, if samples were taken singly from different demes, there would be no error in using the panmictic model. For single-deme samples there is some error when the migration rate is low, but even in the extreme case of m = 10-4 (2Nm = 0.02) the estimates are off only by ~
25%. However, the level of error will be greater for larger samples (see DISCUSSION) and when the absolute value of {gamma} is larger. An additional effect is that the error in estimating {gamma} is conservative in that the bias is toward neutrality regardless of whether is positive or negative. 4 shows the effect on the other parameters: tdiv, {theta} s, and {theta} a. As should be expected from 2, mutation rates are underestimated and the divergence time is overestimated when the migration rate is small. The error in estimating these other parameters is much more extreme than that for {gamma} . In addition, there is a small effect of {gamma} on estimates of a.:k, 百拇医药
fig.ommitteed:k, 百拇医药
Figure 3. The dependence on migration rate (m) of the estimated values of using the values of Ks, Ka, Ss, and Sa plotted in 2 and assuming SAWYER and HARTL's (1992) panmictic PRF model. At the right (m = 1) the population is in fact panmictic, and is estimated accurately in all three cases.
fig.ommitteed?-7|f, 百拇医药
Figure 4. The dependence on migration rate (m) of the estimated values of s, a, and tdiv using the values of Ks, Ka, Ss, and Sa plotted in 2 and assuming SAWYER and HARTL's (1992) panmictic PRF model. Estimates of s and tdiv depend only on neutral variation, but estimates of a show some effect of selection. The three curves are, from the top, = 2, = 0, and= -2.?-7|f, 百拇医药
The expected number of neutral segregating sites:?-7|f, 百拇医药
Under neutrality, the results presented here agree with those found using a coalescent approach in WAKELEY 1998 , and later in WAKELEY 1999 , WAKELEY 2001 , which were derived under the assumption that limN"->"?-7|f, 百拇医药
{infty}2Nm = M. We make the same assumption here and further assume that this occurs in such a way that the diffusion result still holds (see Limiting allele frequency dynamics at a single locus). Then we can use g(xk|x) and h*(ik|x, nk) in expression (15) to show that the expected number of synonymous segregating sites is equal to s n-1i=1 1/i when all n sampled are taken from separate demes. This was found in WAKELEY 1998 to hold for the samples from the neutral genetic locus in the large-D island model, under the assumption of no intralocus recombination. We expect this agreement under the infinite-sites model of mutation, because the marginal distribution of genealogies at a single site must be the same as that of an entire nonrecombining locus under neutrality. It is important to note that the variances and other moments of the numbers of segregating sites do depend on the recombination rate.
Consider the number of segregating sites in a sample of n sequences, all from the same deme. From the coalescent approach we haveao, 百拇医药
(WAKELEY 1998 ), in which S1(i, j) are Stirling numbers of the first kind (ABRAMOWITZ and STEGUN 1964) and M(n) = M(M + 1) ... (M + n - 1). Here, 15 becomesao, 百拇医药
and this is shown in Appendix B to be equivalent to (23).ao, 百拇医药
DISCUSSIONao, 百拇医药
The results presented above can be understood in terms of a sample-size effect of subdivision, one that depends on how the sample is distributed among demes. In the limit of a large number of demes, the history of a sample under neutrality has two distinct phases: the scattering phase and the collecting phase described in WAKELEY 1999 . Although in this analysis incorporating selection was not phrased in these terms, it is clear from 2 that the same effect is at work, namely, that a scattering phase, which is a stochastic sample size adjustment that begins with a sample of size n and ends with n' lineages each in a separate deme, where 1 n' n (WAKELEY 1999 ), induces a downward sample-size adjustment to single-deme samples. In the case of large N and correspondingly small m, the scattering phase for a sample from a single deme is given by P[n'|n] = |S1(n, n')|Mn'/M(n), which appears in 23. 5 shows how the expected values of Ks, Ka, Ss, and Sa depend on n1 = n2 under panmixia with = 2. Thus, the values on the right-hand side of 5 are identical to those on the right-hand side of 2A. Although scales of the horizontal axes are not the same, the effect of smaller migration rate is qualitatively similar to that of smaller sample size. The reason that the values on the left-hand sides of the two panels are different is that the average value of n' at the left in 2A is equal to 1.06, which is considerably smaller than the practical lower limit of 2 in 5. Instead, the values on the left-hand side of 5 can be compared to those in 2A for log10(m) = -2.67, or m = 0.00215, which (with N = 100) gives E[n'] {cong} 2.
fig.ommitteedam, 百拇医药
Figure 5. An illustration that the overestimation of fixation events and underestimation of polymorphism levels result from a sample-size effect. Except for n1 and n2, parameters are the same as in 2C, and the curves plot 18E20E as a function of sample size.am, 百拇医药
This work shows that inferences about natural selection made from DNA polymorphism and divergence data are robust to population subdivision (3) as long as the migration rate is not too low. This is remarkable in view of the strong effects subdivision has on numbers of polymorphisms, shown in 2, but is understandable in terms of the effect of subdivision on s, a, and tdiv. Except for the weak dependence of a estimates on (4), subdivision and migration act equally on selected and neutral variation. In both cases, fixation events are overestimated and polymorphisms underestimated when the migration rate is small. This causes mutation rates to be substantially underestimated and divergence times grossly overestimated if subdivision is ignored, but these effects compensate one another and allow relatively accurate estimates of selection even if subdivision is ignored. Often {gamma} will be the focus of study, but if, a, and tdiv are also of interest, it would be useful to have a framework for simultaneous inferences about migration rates, selection coefficients, and these other parameters. The theory presented above is a first step toward this goal.
It is important to note that inferences about natural selection made from allele frequencies at polymorphic sites will be robust to subdivision only in the case of samples taken singly from different demes. Otherwise, the distribution of samples among demes will cause some frequency classes to be overrepresented, resulting in biased inferences. Even when all the samples are taken from the same deme, restricted migration can mimic the effect of positive {gamma} on allele frequencies (WAKELEY and ALIACAR 2001 ). While allele frequencies at polymorphic sites provide an additional source of information about natural selection (HARTL et al. 1994 ), this illustrates that they are also greatly influenced by nonselective demographic factors; see also NIELSEN 2001 . In addition, allele frequency patterns are quite sensitive to levels of recombination (BUSTAMANTE et al. 2001 ). Thus, it is especially important to account for subdivision when making inferences from allele-frequency data.k{4, 百拇医药
ACKNOWLEDGMENTS
I thank Dan Hartl, Thomas Nagylaki, Stanley Sawyer, and Clifford Taubes for helpful discussions of the work. I am also grateful to Sabin Lessard for seeing that deme sizes need not be large for the large-number-of-demes coalescent to hold. This work was supported by grants DEB-9815367 and DEB-0133760 from the National Science Foundation.0.', 百拇医药
Manuscript received August 6, 2002; Accepted for publication October 14, 2002.0.', 百拇医药
APPENDIX A0.', 百拇医药
Again, ZDi(t) is the fraction of demes that contain i copies of the mutant. Let ZDij(t) record the fraction of demes that contain i copies of the mutant and are descended from a deme that contained j copies of the mutant in the previous generation. Of course, ZDi(t) = Nj=0ZDij(t), and we can study the behavior of ZDi(t) by considering the simpler behavior of ZDij(t). In particular, conditional on the state of the system z(t) at time t,
and ZDij(t + 1) and ZDkl(t + 1) are independent for all i and k, and j l. Thus, we have conditional moments/[h|+, http://www.100md.com
All the higher central moments of the ZDi(t + 1) are o(1/D)./[h|+, http://www.100md.com
Now let XD(t) = {sum} Ni=0 iZDi(t)/N, and YDi(t) = ZDi(t) - {nu}/[h|+, http://www.100md.com
i(t). The diffusion result follows from these results (derived below) for changes over one generation:/[h|+, http://www.100md.com
in which t has been suppressed, x = Ni=0 izi/N, and yi = zi - {nu}/[h|+, http://www.100md.com
i, and/[h|+, http://www.100md.com
(A12)/[h|+, http://www.100md.com
The fixation index is given by/[h|+, http://www.100md.com
It is clear from A12 that c(x, 0) = 0 for all x {isin} (0, 1). If, in addition, the zero solution of the difference equation/[h|+, http://www.100md.com
(A14)
is globally asymptotically stable, then the diffusion (9) holds (ETHIER and NAGYLAKI 1980 ). Note that y = 0 is equivalent to zi = {nu}+h\pn2/, http://www.100md.com
i and that A14 is equivalent to Y(k + 1, x, y) = Y(k, x, y)P*. Proof of A14 follows from the ergodicity of the stochastic matrix P*, i.e., that limk"->"+h\pn2/, http://www.100md.com
{infty}P*(k)ij = {nu}+h\pn2/, http://www.100md.com
j, along the same lines as the proof in NAGYLAKI 1980 (pp. 111–112).+h\pn2/, http://www.100md.com
The derivation of A9 follows from A4. For A5 we have+h\pn2/, http://www.100md.com
Putting in qj from 1 and simplifying give+h\pn2/, http://www.100md.com
(A18)+h\pn2/, http://www.100md.com
which gives (A5) if we put zi = yi + {nu}+h\pn2/, http://www.100md.com
i on the right and simplify using 7.+h\pn2/, http://www.100md.com
For A6 we have+h\pn2/, http://www.100md.com
Again, putting in qj and simplifying, this becomes A6.
For A7 we have6j[[*(, http://www.100md.com
As in (A20) above, the second sum on the right in (A26) is equal to E[XD(1) - x|z], which, from (A5), is o(1). Expanding and considering the third and fourth central moments of ZDi(1) gives the result (A7).6j[[*(, http://www.100md.com
In the derivations of (A8) and (A9) below I assume that the exact solution of (4) is sufficiently close to (5) that the latter can be used in place of the exact solution. More precisely, I assume that6j[[*(, http://www.100md.com
where the coefficients rk depend on N, i, m, and x. This is certainly true for 5, and because (5) and the exact solution of (4) are nearly identical in form (see 1 and associated text), it should also be true of the exact solution although the coefficients rk will be different.6j[[*(, http://www.100md.com
For A8 we have6j[[*(, http://www.100md.com
Using (A27), the second term on the right in A29 becomes6j[[*(, http://www.100md.com
Then by the same argument that gave (A7), using (A1), it can be shown that these higher moments are also o(1/D). Because of this, and putting in {nu}
i = Nj=0{nu}cza#], http://www.100md.com
jP*ji, A29 becomescza#], http://www.100md.com
which is equal to (A8).cza#], http://www.100md.com
For A9 we havecza#], http://www.100md.com
using (3.12) in ETHIER and NAGYLAKI 1980 . From (A3), we have Var[ZDi(1)|z] = o(1). From A27 we can see that, like (A30), the second term in (A34) ultimately depends on the moments of ZDi and so is also o(1). Therefore, Var[YDi(1)|z] = o(1) as required in A9.cza#], http://www.100md.com
This completes the derivation of (A5–A9), showing that Theorem 3.3 in ETHIER and NAGYLAKI 1980 can be applied and that the diffusion x(·) with generator (9) in the text holds as D goes to infinity.cza#], http://www.100md.com
APPENDIX Bcza#], http://www.100md.com
Beginning with 24, and then putting in g(x|x) and {phi}cza#], http://www.100md.com
s(x), we havecza#], http://www.100md.com
Using the identity
we obtain/a7xfry, 百拇医药
which is the same as 23 since the first term (n' = 1) is equal to zero./a7xfry, 百拇医药
LITERATURE CITED/a7xfry, 百拇医药
ABRAMOWITZ, M., and I. A. STEGUN, 1965 Handbook of Mathematical Functions. Dover, New York./a7xfry, 百拇医药
AKASHI, H., 1999 Inferring the fitness effects of DNA mutations from polymorphism and divergence data: statistical power to detect directional selection under stationarity and free recombination. Genetics 151:221-238./a7xfry, 百拇医药
BUSTAMANTE, C. D., J. WAKELEY, S. SAWYER, and D. L. HARTL, 2001 Directional selection and the site-frequency spectrum. Genetics 159:1779-1788./a7xfry, 百拇医药
BUSTAMANTE, C. D., R. NIELSEN, S. A. SAWYER, K. M. OLSEN, and M. D. PURUGGANAN et al., 2002 The cost of inbreeding in Arabidopsis.. Nature 416:531-534./a7xfry, 百拇医药
CHARLESWORTH, B., 2001 Effect of life history and mode of inheritance on neutral genetic variation. Genet. Res. 77:153-166./a7xfry, 百拇医药
CHERRY, J. L. and J. WAKELEY, 2003 A diffusion approximation for selection and drift in a subdivided population. Genetics 163:421-428.
DONNELLY, P., M. NORDBORG, and P. JOYCE, 2001 Likelihoods and simulation methods for a class of nonneutral population genetic models. Genetics 159:853-867.3, 百拇医药
ETHIER, S. N. and T. NAGYLAKI, 1980 Diffusion approximations of Markov chains with two timescales and applications to population genetics. Adv. Appl. Prob. 12:14-49.3, 百拇医药
EWENS, W. J., 1979 Mathematical Population Genetics. Springer-Verlag, Berlin.3, 百拇医药
EWENS, W. J., 1990 Population genetics theory—the past and the future, pp. 177–227 in Mathematical and Statistical Developments of Evolutionary Theory, edited by S. LESSARD. Kluwer Academic Publishers, Amsterdam.3, 百拇医药
FAY, J. C., G. J. WYCKOFF, and C.-I WU, 2002 Testing the neutral theory of molecular evolution with genomic data from Drosophila.. Nature 415:1024-1026.3, 百拇医药
FISHER, R. A., 1930 The Genetical Theory of Natural Selection. Clarendon Press, Oxford.3, 百拇医药
FU, X.-Y. and W.-H. LI, 1993 Statistical tests of neutrality of mutations. Genetics 133:693-709.
HARTL, D. L., E. N. MORIYAMA, and S. A. SAWYER, 1994 Selection intensity for codon bias. Genetics 138:227-234.8-k#g, 百拇医药
HUDSON, R. R. and N. L. KAPLAN, 1988 The coalescent process in models with selection and recombination. Genetics 120:831-840.8-k#g, 百拇医药
HUDSON, R. R., M. KREITMAN, and M. AGUADE, 1987 A test of neutral molecular evolution based on nucleotide data. Genetics 116:153-159.8-k#g, 百拇医药
JOHNSON, N. L., S. KOTZ and N. BALAKRISHNAN, 1997 Discrete Multivariate Distributions. Wiley, New York.8-k#g, 百拇医药
KIMURA, M., 1969 The number of heterozygous nucleotide sites maintained in a finite population due to the steady flux of mutations. Genetics 61:893-903.8-k#g, 百拇医药
LATTER, B. D. H., 1973 The island model of population differentiation: a general solution. Genetics 73:147-157.8-k#g, 百拇医药
MARUYAMA, T., 1970 Effective number of alleles in a subdivided population. Theor. Popul. Biol. 1:273-306.8-k#g, 百拇医药
MCDONALD, J. H. and M. KREITMAN, 1991 Adaptive protein evolution at the adh locus in Drosophila.. Nature 351:652-654.
MÖHLE, M., 1998 Robustness results for the coalescent. J. Appl. Prob. 35:438-447.&jefdi0, 百拇医药
MÖHLE, M., 2001 Forward and backward diffusion approximations for haploid exchangeable population models. Stoch. Proc. Appl. 95:133-149.&jefdi0, 百拇医药
MORAN, P. A. P., 1959 The theory of some genetical effects of population subdivison. Austr. J. Biol. Sci. 12:109-116.&jefdi0, 百拇医药
MORAN, P. A. P., 1962 Statistical Processes of Evolutionary Theory. Clarendon Press, Oxford.&jefdi0, 百拇医药
NAGYLAKI, T., 1980 The strong-migration limit in geographically structured populations. J. Math. Biol. 9:101-114.&jefdi0, 百拇医药
NEUHAUSER, C. and S. M. KRONE, 1997 The genealogy of samples in models with selection. Genetics 145:519-534.&jefdi0, 百拇医药
NIELSEN, R., 2001 Statistical tests of neutrality in the age of genomics. Heredity 86:641-647.&jefdi0, 百拇医药
NORDBORG, M., 1997 Structured coalescent processes on different time scales. Genetics 146:1501-1514.&jefdi0, 百拇医药
NOTOHARA, M., 1993 The strong migration limit for the genealogical process in geographically structured populations. J. Math. Biol. 31:115-122.
RANNALA, B., 1996 The sampling theory of neutral alleles in an island population of fluctuating size. Theor. Popul. Biol. 50:91-104.6\8dg, http://www.100md.com
ROTHMAN, E. D., C. F. SING, and A. R. TEMPLETON, 1974 A model for the analysis of population structure. Genetics 78:934-960.6\8dg, http://www.100md.com
SAWYER, S. A. and D. L. HARTL, 1992 Population genetics of polymorphism and divergence. Genetics 132:1161-1176.6\8dg, http://www.100md.com
SLATKIN, M., 1985 Gene flow in natural populations. Annu. Rev. Ecol. Syst. 16:393-430.6\8dg, http://www.100md.com
SLATKIN, M. and G. BERTORELLE, 2001 The use of intraallelic variability for testing neutrality and estimating population growth rate. Genetics 158:865-874.6\8dg, http://www.100md.com
SMITH, N. G. and A. EYRE-WALKER, 2002 Adaptive protein evolution in Drosophila.. Nature 415:1022-1024.6\8dg, http://www.100md.com
TAJIMA, F., 1989 Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585-595.6\8dg, http://www.100md.com
WAKELEY, J., 1998 Segregating sites in Wright's island model. Theor. Popul. Biol. 53:166-175.6\8dg, http://www.100md.com
WAKELEY, J., 1999 Non-equilibrium migration in human history. Genetics 153:1863-1871.6\8dg, http://www.100md.com
WAKELEY, J., 2001 The coalescent in an island model of population subdivision with variation among demes. Theor. Popul. Biol. 59:133-144.6\8dg, http://www.100md.com
WAKELEY, J. and N. ALIACAR, 2001 Gene genealogies in a metapopulation. Genetics 159:893-905. (corrigendum: 160: 1263–1264).6\8dg, http://www.100md.com
WRIGHT, S., 1931 Evolution in Mendelian populations. Genetics 16:97-159.6\8dg, http://www.100md.com
YANG, Z., 1998 Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol. Biol. Evol. 15:568-573.(John Wakeley)