Evolution of Prokaryotic DNA: Intragenic and Extragenic Divergences Observed with Orthologs from Three Related Species
Advances in Molecular Biology, 2004, Issue 6
Danish University of Pharmaceutical Sciences, Institute of Pharmacology, Universitetsparken, Copenhagen, Denmark
E-mail anfu@dfh.dk.
Abstract
This study compared orthologous gene pairs from Escherichia coli K12, E. coli O157:H7 EDL933, Salmonella typhimurium LT2, and Yersinia pestis CO92. Only homologs of equal length were used, and differing nucleotides were counted and mapped. The data show clearly how the rates of divergence change with intragenic and extragenic position. The rate of synonymous mutation is lowest near start codons and near stop codons, and, somewhat surprisingly, the opposite seems to be true for nonsynonymous substitutions. Analysis outside genes reveals that nucleotide divergences occur less frequently upstream of start codons than downstream of stop codons, and a marked drop in divergences is seen for two of the data sets around N = 9 nucleotides upstream of start codons; that is, the Shine-Dalgarno region changes at a lower rate. The explanation is likely to be the link between expressivity and sequence complementarity to the 3' end of 16S ribosomal RNA. The latter is highly conserved across many bacterial and archaebacterial species.
Key Words: Sequence alignment, Shine-Dalgarno regions, mutation, molecular evolution, codon usage bias, expressivity
Introduction
The factors that determine the composition of coding and noncoding regions of DNA are not well understood, but in the postgenomic era, some compositional rules are beginning to emerge. Composition is a product of mutation, selection, and drift. However, mutation may not strike equally all over a genome. It is well known that the leading strand of replication typically is richer in guanine than in cytosine, probably because of differential effects of the mutational pressure at the asymmetrical replication fork (Rocha and Danchin 2001). The phenomenon is rather striking (e.g., the sliding window plot of GC-skew [Rocha, Viari, and Danchin 1998]). Selection for specific compositional traits has been demonstrated in a number of species. Expression levels and tRNA availability often play decisive roles for the intragenic composition. Gouy and Gautier (1982) found that highly expressed genes tend to use a narrow set of codons, which were demonstrated to correspond to the more abundant tRNAs (Ikemura and Ozeki 1983; Ikemura 1985; Bulmer 1987; Kanaya et al. 2001). There is also some evidence that the codon choice can depend on the intragenic position. Bulmer (1988) found that the codon usage in the first codons of genes may be under special selectional forces relating to translational efficiency, which, in turn, is influenced by the composition of start codons, as AUG usually is more efficient than GUG or UUG (Stenstrom, Holmgren, and Isaksson 2001). The classical elucidation by Shine and Dalgarno (1974) of the 3'-terminal sequence of 16S ribosomal RNA showed that the extragenic sequence can be a major determinant of expression levels, and recent bioinformatic studies on a whole-genome scale have confirmed that there is a clear link between the sequence in the Shine-Dalgarno region and expressivity (Osada, Saito, and Tomita 1999; Sakai et al. 2001; Fuglsang 2003).
Given that mutations and selectional forces display regional preferences, is it then possible to identify regions of DNA that evolve faster than others? We may get an answer from pairwise alignment studies of homologous genes from closely related species (i.e., by aligning homologs and counting divergences). There is little information in, say, two lacZ genes representing any two species, but if we study many such orthologous pairs from the same two species, then we might be able to elucidate some patterns. One problem is that gene lengths also evolve, so codon-by-codon alignment must rely either on orthologs of equal length or on parsing algorithms that are able to handle sequence gaps. We have chosen the former approach because gapped alignments give rise to some uncertainty about intragenic positions and because gapped alignment is much more difficult to implement on a large scale. On a small subset of orthologs of bacterial genes of Escherichia coli and Salmonella typhimurium, Eyre-Walker and Bulmer (1993) found a decreased tendency for synonymous substitution near the gene starts. Other studies conducted on orthologs of eukaryotes have focused more on the specific nucleotides involved in the changes and on codon positions than on regional variations of these phenomena (Alvarez-Valin, Jabbari, and Bernardi 1998; Alvarez-Valin et al. 1999). There is a lack of studies encompassing whole genomes. This study, therefore, aims to expand the knowledge by using a whole-genome approach to quantify the divergences of orthologous pairs in relation to the positions inside and outside genes.
Materials and Methods
Construction of Data Sets
It was decided to use the best-known microorganism, Escherichia coli strain K12 (GenBank accession number NC000913 [Blattner et al. 1997]), as a reference organism and to compare this organism to a few other bacteria. The criterion for alignment in this study is that the pairs of genes be orthologous and of equal length. Within these boundaries, any combination of two bacteria could be used for a data set. However, the larger the evolutionary distance between two species (or strains), the higher the chance that the two genes of a homologous pair differ in length. This, in turn, means that the number of homologous genes of equal length could be too low to allow any conclusions to be drawn. The same problem would arise if one or both of the genomes used for a data set are very small. In addition, some genomes are too sparsely annotated to allow inclusion. For example, in the GenBank file of the Streptomyces coelicolor genome (accession number NC003888 [Bentley et al. 2002]), most genes are annotated as "hypothetical" or "putative" or as having unknown function, making inclusion in this study impossible.
Based on these considerations, E. coli strain O157:H7 EDL933 (GenBank accession number 002655 [Perna et al. 2001]) was chosen as the close relative of E. coli K12, and Yersinia pestis CO92 (GenBank accession number NC003143 [Parkhill et al. 2001]) and Salmonella typhimurium LT2 (GenBank accession number NC003197 [McClelland et al. 2001]) were selected as more distant relatives. Phylogenetically, the Escherichia genus is a member of the Enterobacteriaceae group in the gamma subdivision of the proteobacteria. Salmonella also belongs to the Enterobacteriaceae, whereas Yersinia is more distant to E. coli, belonging to the Pasteurellaceae group in the same subdivision. All three strains are roughly comparable in GC content. (Table 1 lists the number of homologs and their GC content.) There are, thus, three data sets in this study: the first contains homologous genes of equal length from the two E. coli strains; the second contains homologous genes of equal length from E. coli K12 and S. typhimurium LT2; and the third contains homologous genes of equal length from E. coli K12 and Y. pestis CO92.
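The equal-length criterion can be sketched in a few lines of Python. This is only an illustration: the text does not state how orthologs were identified, so matching genes by annotated name is an assumption made here, and the function name `equal_length_pairs` and the dictionary input format are hypothetical.

```python
def equal_length_pairs(genes_a, genes_b):
    """Pair genes that share a name in both genomes and have identical length.

    genes_a, genes_b: dicts mapping an annotated gene name to its coding
    sequence. Matching by name is an illustrative stand-in for ortholog
    identification, not the documented procedure of the study.
    """
    pairs = {}
    for name, seq_a in genes_a.items():
        seq_b = genes_b.get(name)
        # Keep the pair only if the gene exists in both genomes
        # and the two sequences are of equal length.
        if seq_b is not None and len(seq_a) == len(seq_b):
            pairs[name] = (seq_a, seq_b)
    return pairs
```

Pairs that differ in length are simply dropped, which is why distant or small genomes yield few usable pairs.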
Table 1 Bacteria in This Study.
Intragenic Evolution Rates
It was decided to count both synonymous and nonsynonymous divergences in the data sets. All pairs of homologs were parsed from N = 1 to N = 50 codons, where N = 1 denotes the start codon. For each value of N, the two codons were compared, and if they differed, the corresponding amino acids were compared, revealing whether the divergence was synonymous or nonsynonymous. Finally, frequencies of synonymous and nonsynonymous changes were plotted as a function of N. For completeness, a similar analysis was run for the region upstream of stop codons (where N = 1 denotes the stop codon). To avoid counting divergences twice, genes were only counted if they spanned at least 300 nucleotides (100 codons), including the stop codon, so that the 50-codon windows at the two ends cannot overlap.
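The codon-by-codon comparison described above might look roughly as follows in Python. This is a minimal sketch, not the original implementation: the function names are hypothetical, sequences are assumed to be uppercase DNA on the coding strand, and the standard genetic code is applied at every position, so an alternative start codon such as GTG is treated as valine rather than as an initiator methionine.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_codons(gene_a, gene_b, n_codons=50):
    """Label each of the first n_codons as 'same', 'syn', 'nonsyn' or 'unknown'."""
    if len(gene_a) != len(gene_b):
        raise ValueError("orthologs must be of equal length")
    labels = []
    for i in range(min(n_codons, len(gene_a) // 3)):
        codon_a = gene_a[3 * i:3 * i + 3]
        codon_b = gene_b[3 * i:3 * i + 3]
        aa_a, aa_b = CODON_TABLE.get(codon_a), CODON_TABLE.get(codon_b)
        if aa_a is None or aa_b is None:
            labels.append("unknown")   # ambiguous base such as N
        elif codon_a == codon_b:
            labels.append("same")
        elif aa_a == aa_b:
            labels.append("syn")       # different codon, same amino acid
        else:
            labels.append("nonsyn")    # different codon, different amino acid
    return labels


def divergence_frequencies(pairs, n_codons=50):
    """Per-position fraction of pairs showing synonymous / nonsynonymous change."""
    syn = [0] * n_codons
    nonsyn = [0] * n_codons
    seen = [0] * n_codons
    for gene_a, gene_b in pairs:
        for i, label in enumerate(classify_codons(gene_a, gene_b, n_codons)):
            seen[i] += 1
            if label == "syn":
                syn[i] += 1
            elif label == "nonsyn":
                nonsyn[i] += 1
    syn_freq = [s / n if n else 0.0 for s, n in zip(syn, seen)]
    nonsyn_freq = [s / n if n else 0.0 for s, n in zip(nonsyn, seen)]
    return syn_freq, nonsyn_freq
```

Index 0 of the returned lists corresponds to N = 1. For the analysis upstream of stop codons, the same functions could be applied after reversing the order of codons in each gene.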
Extragenic Evolution Rates
To study the extragenic evolution rates, the data sets were expanded with information about the 50 nucleotides upstream of start codons and downstream of stop codons. Nucleotide differences were then counted and plotted as a function of the nucleotide position upstream of the start codon or downstream of the stop codon.
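A possible way to count the extragenic differences is sketched below. The function names and the convention that index 0 is the nucleotide nearest the gene are assumptions made for illustration, and only forward-strand genes are handled.

```python
def upstream_flank(genome, start, length=50):
    """Nucleotides immediately upstream of a forward-strand gene.

    start is the 0-based index of the first base of the start codon.
    The flank is reversed so that index 0 is the base adjacent to the gene.
    """
    return genome[max(0, start - length):start][::-1]


def positional_divergence(flank_pairs, length=50):
    """Fraction of flank pairs that differ at each position from the gene."""
    diffs = [0] * length
    counts = [0] * length
    for flank_a, flank_b in flank_pairs:
        for i in range(min(length, len(flank_a), len(flank_b))):
            counts[i] += 1
            if flank_a[i] != flank_b[i]:
                diffs[i] += 1
    return [d / c if c else 0.0 for d, c in zip(diffs, counts)]
```

Downstream flanks need no reversal, since the base following the stop codon is already the first element of the slice.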
Conservation in the Shine-Dalgarno Region
It has been shown previously that a distinct pattern of nucleotide nonrandomness can be seen at around N = 9 nucleotides upstream of bacterial start codons, indicating usage of Shine-Dalgarno regions (Fuglsang and Engberg 2003). Furthermore, E. coli is reported to display more pronounced selection for translational efficiency than Bacillus subtilis (Shields and Sharp 1987; Sharp et al. 1988). This led to the investigation of nucleotide nonrandomness in data sets for high-expressivity genes versus low-expressivity genes, and it was shown that in E. coli, the nonrandomness is more pronounced in the Shine-Dalgarno region for the high-expressivity genes than for the low-expressivity genes, whereas no difference was observed in B. subtilis (Fuglsang 2003). On this basis, and because the data sets with data for S. typhimurium and Y. pestis displayed markedly lower divergence rates in this region (see Results and Discussion), it was decided to perform the same analysis on S. typhimurium and Y. pestis. For both species, two data sets were constructed, one consisting of the 500 genes having lowest expressivity and the other consisting of the 500 genes having highest expressivity. Note that expressivity is not the same as expression levels; expressivity denotes the codon adaptation index (CAI [Sharp and Li 1987]), which is only a surrogate value for expression levels. Nonrandomness analysis (Fuglsang and Engberg 2003; Fuglsang 2003) was performed on these two fractions individually.
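For readers unfamiliar with the codon adaptation index, the following Python sketch shows the usual form of the calculation: each codon receives a relative adaptiveness w (its count in a reference set of highly expressed genes divided by the count of the most frequent synonymous codon), and the CAI of a gene is the geometric mean of w over its codons. The function names, the choice of reference set, and the pseudocount of 0.5 for unobserved codons are assumptions of this sketch.

```python
import math
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group codons by amino acid; drop stops and single-codon amino acids (Met, Trp).
_families = defaultdict(list)
for _codon, _aa in CODON_TABLE.items():
    if _aa != "*":
        _families[_aa].append(_codon)
SYNONYMOUS_FAMILIES = [f for f in _families.values() if len(f) > 1]


def codons(gene):
    """Split a coding sequence into complete codons."""
    return [gene[i:i + 3] for i in range(0, len(gene) - 2, 3)]


def relative_adaptiveness(reference_genes):
    """w value per codon, estimated from a reference set of genes."""
    counts = Counter()
    for gene in reference_genes:
        counts.update(codons(gene))
    w = {}
    for family in SYNONYMOUS_FAMILIES:
        top = max(counts[c] for c in family)
        if top == 0:
            continue  # amino acid absent from the reference set
        for c in family:
            # Assumption: unobserved codons get a pseudocount of 0.5
            # so that the logarithm below stays defined.
            w[c] = max(counts[c], 0.5) / top
    return w


def cai(gene, w):
    """Geometric mean of w over the codons of gene that have a w value."""
    logs = [math.log(w[c]) for c in codons(gene) if c in w]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")
```

Ranking genes by `cai` and taking the top and bottom 500 would give the two expressivity fractions described above.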
Statistical Analysis
Spearman's rank correlation analysis was performed using GraphPad Prism version 3.0 (GraphPad Inc., Calif.) to test whether divergence rates show an increasing or decreasing trend near gene ends. When linear relationships were expected, linear regression was used. P values below 0.05 were considered significant.
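The study used GraphPad Prism for this step. As an illustration of what the statistic measures, Spearman's coefficient can be computed as the Pearson correlation of the ranks, with tied values sharing their average rank. The sketch below is a plain-Python stand-in with hypothetical function names and does not compute a P value.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Here `x` would be the codon position N and `y` the divergence frequency at that position; a coefficient near +1 indicates a steadily increasing trend away from the gene end.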
Results and Discussion
Intragenic Observations
Figure 1a–c shows the divergence frequencies downstream of start codons in the three data sets. Correlation analysis results are given in the figure legends. In all three data sets, the rates of synonymous and nonsynonymous divergence are approximately equal at the first few codons, after which the synonymous codon divergences increase very significantly in all data sets, whereas the nonsynonymous changes are without a significant trend (E. coli K12 versus E. coli O157:H7 EDL933) or significantly decreasing (E. coli K12 versus S. typhimurium, and E. coli K12 versus Y. pestis). Figure 2a–c shows similar data but for the codons upstream of stop codons. The same principles as observed in figure 1a–c are also visible here: synonymous changes become more frequent as the distance from the stop codon increases, whereas the nonsynonymous changes become less frequent. All in all, these figures reveal that synonymous intragenic evolution is more likely to occur distally to start and stop codons, whereas nonsynonymous evolution is more likely to occur proximally to start and stop codons, and that nonsynonymous changes generally are rarer. It should be emphasized that not all synonymous changes may be selectionally inert. Gouy and Gautier (1982) found that highly expressed genes tend to use a narrow set of codons, and Ikemura and Ozeki (1983) showed that the preferred codons in highly expressed genes corresponded very well to tRNAs that are more abundant than their synonymous counterparts. Since then, it has been shown in a variety of organisms, both eukaryotic and prokaryotic, that the major source of variation in codon usage bias is expression levels, even though there are exceptions to this principle (see examples by McInerney [1998] and Herbeck, Wall, and Wernegreen [2003]). Bulmer (1988) studied a small set of genes and found some indications that the codon usage bias of the first few codons just downstream of the start codon is subject to a special kind of selection.
Later, it was proposed as the "minor codon modulator hypothesis" that the expression level of a gene is in part controlled by the presence of nonoptimal codons, especially if they reside in the beginning of a gene (Chen and Inouye 1990, 1994). Thus, rare codons clustered downstream of the start codon may stall the translational machinery because the availability of charged tRNAs for these particular codons is low. In this context, figure 1a–c would suggest that the relatively high rate of nonsynonymous change and the relatively low rate of synonymous change near start codons arise because retaining the expression level becomes (relatively) more important than retaining the primary structure of the protein. However, the absolute magnitudes of divergence still indicate that the primary structure of the protein is quantitatively most important.
FIG. 1. (a) Synonymous and nonsynonymous divergence frequency versus the codon position downstream of start codons in the data set corresponding to orthologs from E. coli K12 and E. coli O157:H7 EDL933. There is a highly significant increasing tendency for the synonymous divergences: Spearman's ρ = 0.8745, P < 0.0001. (b) Similar to (a) but for the data set corresponding to orthologs from E. coli K12 and S. typhimurium. There is a highly significant increasing tendency for the synonymous divergences: Spearman's ρ = 0.9484, P < 0.0001. There is also a highly significant decreasing tendency for the nonsynonymous divergences: Spearman's ρ = –0.5280, P < 0.0001. (c) Similar to (a) but for the data set corresponding to orthologs from E. coli K12 and Y. pestis. There is a highly significant increasing tendency for the synonymous divergences: Spearman's ρ = 0.5540, P < 0.0001.
FIG. 2. (a) Synonymous and nonsynonymous divergence frequency versus the codon position upstream of stop codons in the data set corresponding to orthologs from E. coli K12 and E. coli O157:H7 EDL933. There is a very significant increasing tendency for the synonymous divergences: Spearman's ρ = 0.5727, P < 0.0001. The trend for the nonsynonymous changes is less significant: Spearman's ρ = –0.3261, P < 0.05. (b) Similar to (a) but for the data set corresponding to orthologs from E. coli K12 and S. typhimurium. There is a highly significant increasing tendency for the synonymous divergences: Spearman's ρ = 0.8439, P < 0.0001. There is also a highly significant decreasing tendency for the nonsynonymous divergences: Spearman's ρ = –0.6719, P < 0.0001. (c) Similar to (a) but for the data set corresponding to orthologs from E. coli K12 and Y. pestis. There is a highly significant increasing tendency for the synonymous divergences: Spearman's ρ = 0.7283, P < 0.0001. There is also a highly significant decreasing tendency for the nonsynonymous divergences: Spearman's ρ = –0.7453, P < 0.0001.
It is known that bacteria such as E. coli are able to process short-leadered and leaderless mRNA. When a ribosomal binding site is lacking, there might, therefore, be an alternative path of recognition events leading to initiation of the translation process. It has been proposed that the penultimate stem of 16S rRNA may be able to form base pairs with a stretch of nucleotides after the start codon (the "downstream box"), and, therefore, this region could be functioning in much the same fashion as the Shine-Dalgarno region (Sprengart, Fatscher, and Fuchs 1990). This hypothesis is subject to much controversy. Sprengart, Fatscher, and Fuchs (1990) and Sprengart, Fuchs, and Porter (1996) found that the degree of complementarity plays an important role for translational efficiency both in the presence and in the absence of a strong Shine-Dalgarno sequence. Nevertheless, Moll et al. (2001) found with topological models of the 16S rRNA structure that it is unlikely that such base pairing occurs at all. Figure 1a–c could, in principle, also reflect such phenomena. However, the length of the downstream box (whether it exists or not) is purportedly around 15 nucleotides, but in figure 1a–c, there is no sign of the observations being confined to such a well-defined short stretch of nucleotides. In that regard, the data presented here do not support a role for a downstream box in molecular evolution; we see no special boxlike pattern downstream of the start codon.
Extragenic Observations and Expressivity
In figure 3a–c, the extragenic divergences are shown for nucleotides upstream of start codons and downstream of stop codons. There is no clear pattern in the divergence frequencies downstream of stop codons. However, in figure 3b and c (and perhaps figure 3a as well), a remarkable drop in divergence is seen in the region around 10 nucleotides upstream of start codons. This is, in my interpretation, clear evidence of the importance of functional Shine-Dalgarno regions. Previously published results on this region showed that the degree of nonrandomness, centered on N = 9 nucleotides upstream of start codons in E. coli, is very pronounced (Fuglsang and Engberg 2003) and that the nucleotides, furthermore, are more nonrandom in the highly expressed genes than in the lowly expressed genes (Fuglsang 2003). As figure 4a and b shows, the same is true for S. typhimurium and for Y. pestis. Both clearly make extensive use of Shine-Dalgarno regions, and these seem to be positioned just as in E. coli. Table 2 lists the 3' ends of the 16S rRNA for the species and strains in this study plus some more distantly related examples. This region is extremely well conserved across a wide range of eubacteria and archaebacteria. The natural role of the Shine-Dalgarno region is to facilitate ribosomal binding and translational initiation. The higher the complementarity between a gene's Shine-Dalgarno region and the 3' end of the 16S rRNA, the higher the chance of translational initiation. Therefore, highly expressed genes tend to have a highly complementary Shine-Dalgarno region. Figure 5a shows a plot of synonymous mutations between E. coli K12 and S. typhimurium orthologs versus the E. coli K12 codon adaptation index. Note that the codon adaptation indexes of the two species are well correlated (inset in figure 5a). The figure shows that highly expressed genes tend to undergo synonymous evolution at a slower rate than lowly expressed genes. The same holds true for the nonsynonymous changes (fig. 5b). Data for the other data sets reveal the same tendency (not shown). Thus, the higher the expression of a gene, the slower it evolves, both synonymously and nonsynonymously. A study by Alff-Steinberger (2001) concluded that rare codons tend to be more prone to evolution. Because rare codons give lower expressivities, the findings of this study are in good agreement with Alff-Steinberger's (2001) results and also with Sharp and Li (1987), who concluded that codon bias and substitution rate vary inversely. All in all, these findings also accord well with those of Eyre-Walker and Bulmer (1993), who concluded that the synonymous substitution rate is lower near start codons. This study expands those findings considerably, and, thus, the following are concluded:
The synonymous substitution rate is lower near start and stop codons.
The nonsynonymous substitution rate is higher near stop codons and, in some cases, also near start codons.
The substitution rate in Shine-Dalgarno regions is reduced, relating to the fact that the 3' end of 16S rRNA is conserved.
Highly expressed genes have lower rates of synonymous as well as nonsynonymous substitution.
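The notion of complementarity between a Shine-Dalgarno region and the 3' end of 16S rRNA, used in the discussion above, can be made concrete with a small Python sketch. It is an illustration only: the 3' tail `ACCUCCUUA` is assumed here as the E. coli sequence (Table 2 is not reproduced in this text), the function names are hypothetical, only strict Watson-Crick pairs are scored (no G-U wobble), and no gaps are allowed.

```python
# Assumed 3' terminal nucleotides of E. coli 16S rRNA, written 5'->3'.
ANTI_SD_RNA = "ACCUCCUUA"


def reverse_complement_to_dna(rna):
    """DNA (coding-strand) sequence that would pair perfectly with rna."""
    pairs = {"A": "T", "C": "G", "G": "C", "U": "A"}
    return "".join(pairs[base] for base in reversed(rna))


# Fully complementary Shine-Dalgarno motif on the coding strand.
IDEAL_SD = reverse_complement_to_dna(ANTI_SD_RNA)


def best_sd_match(upstream, motif=IDEAL_SD):
    """Best ungapped match of motif within upstream.

    upstream: coding-strand DNA ending just before the start codon, 5'->3'.
    Returns (number of matching positions, 0-based offset of the best window);
    the first window wins in case of ties.
    """
    best_score, best_offset = 0, 0
    for offset in range(len(upstream) - len(motif) + 1):
        window = upstream[offset:offset + len(motif)]
        score = sum(a == b for a, b in zip(window, motif))
        if score > best_score:
            best_score, best_offset = score, offset
    return best_score, best_offset
```

A higher score means more potential base pairs with the rRNA tail, which is the sense in which highly expressed genes are said to have a more complementary Shine-Dalgarno region.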
FIG. 3. (a) Divergences upstream of start codons or downstream of stop codons for the data set corresponding to E. coli K12 and E. coli O157:H7 EDL933. The rate of mutation is generally lower upstream of start codons than downstream of stop codons. (b) Same as (a) but for E. coli K12 and S. typhimurium. Note the clear drop in divergences in the Shine-Dalgarno region (around 10 nucleotides upstream of the start codon). (c) Same as (a) but for E. coli K12 and Y. pestis.
FIG. 4. (a) Nonrandomness upstream of start codons for S. typhimurium in the 500 genes displaying lowest expressivity and the 500 genes displaying highest expressivity. The peak around N = 9 is more pronounced for the genes of high expressivity, establishing a clear link between expressivity and the Shine-Dalgarno region. (b) Similar to (a) but for Y. pestis.
Table 2 The 3' End of Aligned 16S rRNA of the Species in This Study.
FIG. 5. (a) A plot of the synonymous divergences between E. coli K12 and S. typhimurium versus the E. coli K12 codon adaptation index. There is a negative correlation: correlation coefficient = –0.4736, P < 0.0001. This shows that in highly expressed genes, there are fewer synonymous divergences. The inset is a plot showing the correlation of the two codon adaptation indexes. The correlation is very good: r² = 0.8620, P < 0.0001. (b) A plot of the nonsynonymous divergences between E. coli K12 and S. typhimurium versus the E. coli K12 codon adaptation index. There is a negative correlation: correlation coefficient = –0.5937, P < 0.0001. Note that the percentages generally are lower than for the synonymous divergences (a).
A Word on the Methodology
Imagine a pair of orthologs in which one of the genes has undergone an insertion (e.g., plus one codon) followed by a deletion (minus one codon). The result, then, would be a pair of orthologs of equal length, fully complying with the inclusion criteria for this study. The farther apart the insertion event and the deletion event took place, the larger the degree of divergence that would be expected, and this might be the reason why figure 5b reveals a few "outlier" examples having extremely high nonsynonymous divergences. Formally, this could easily be tested by aligning the amino acids. One alternative approach to overcome this problem could be to scan genomes for genes of equal length and make pairs on the basis of homology alone (e.g., 90% nucleotide identity or any other defined limit). I have tried this approach, but it seems to be unsuitable, at least for identifying pairs of orthologs. It yields pairs that are not orthologs but just homologs; for example, a response regulator involved in sugar metabolism may be paired with a response regulator involved in branched chain amino acid metabolism.
The data presented here only represent bacteria in the gamma subdivision of the proteobacteria. Therefore, one should be careful not to conclude that molecular evolution works the same way regardless of phylogenetic position. It would thus be interesting to repeat the analysis for orthologous gene pairs taken from bacteria located elsewhere on the phylogenetic tree. This has so far been very difficult; for example, generation of similar data sets with B. subtilis (GenBank accession number NC000964) and B. halodurans (GenBank accession number NC002570), both firmicutes, only gives 132 orthologous pairs and, thus, does not yield conclusive figures.
Literature Cited
Alff-Steinberger, C. 2001. A comparative study of mutations in Escherichia coli and Salmonella typhimurium shows that codon conservation is strongly correlated with codon usage. J. Theor. Biol. 206:307-311.
Alvarez-Valin, F., K. Jabbari, and G. Bernardi. 1998. Synonymous and nonsynonymous substitutions in mammalian genes: intragenic correlations. J. Mol. Evol. 46:37-44.
Alvarez-Valin, F., K. Jabbari, N. Carels, and G. Bernardi. 1999. Synonymous and nonsynonymous substitutions in genes from Gramineae: intragenic correlations. J. Mol. Evol. 49:330-342.
Bentley, S. D., K. F. Chater, and A. M. Cerdeno-Tarraga, et al. (40 co-authors). 2002. Complete genome sequence of the model actinomycete Streptomyces coelicolor A3(2). Nature 417:141-147.
Blattner, F. R., G. Plunkett, and C. A. Bloch, et al. (14 co-authors). 1997. The complete genome sequence of Escherichia coli K-12. Science 277:1453-1474.
Bulmer, M. 1987. Coevolution of codon usage and transfer RNA abundance. Nature 325:728-730.
Bulmer, M. 1988. Codon usage and intragenic position. J. Theor. Biol. 133:67-71.
Chen, G. F., and M. Inouye. 1990. Suppression of the negative effect of minor arginine codons on gene expression: preferential usage of minor codons within the first 25 codons of the Escherichia coli genes. Nucleic Acids Res. 18:1465-1473.
Chen, G. F., and M. Inouye. 1994. Role of the AGA/AGG codons, the rarest codons in global gene expression in Escherichia coli. Genes Dev. 8:2641-2652.
Eyre-Walker, A., and M. Bulmer. 1993. Reduced synonymous substitution rate at the start of enterobacterial genes. Nucleic Acids Res. 21:4599-4603.
Fuglsang, A. 2003. Association of the nucleotide with codon bias, amino acid usage and expressivity: differences between Bacillus subtilis and Escherichia coli. APMIS 111:926-930.
Fuglsang, A., and J. Engberg. 2003. Non-randomness in Shine-Dalgarno regions: links to gene characteristics. Biochem. Biophys. Res. Commun. 302:296-301.
Gouy, M., and C. Gautier. 1982. Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res. 10:7055-7074.
Herbeck, J. T., D. P. Wall, and J. J. Wernegreen. 2003. Gene expression level influences amino acid usage, but not codon usage, in the tsetse fly endosymbiont Wigglesworthia. Microbiology 149:2585-2596.
Ikemura, T. 1985. Codon usage and tRNA content in unicellular and multicellular organisms. Mol. Biol. Evol. 2:13-34.
Ikemura, T., and H. Ozeki. 1983. Codon usage and transfer RNA contents: organism-specific codon-choice patterns in reference to the isoacceptor contents. Cold Spring Harb. Symp. Quant. Biol. 47:1087-1097.
Kanaya, S., Y. Yamada, M. Kinouchi, Y. Kudo, and T. Ikemura. 2001. Codon usage and tRNA genes in eukaryotes: correlation of codon usage diversity with translation efficiency and with CG-dinucleotide usage as assessed by multivariate analysis. J. Mol. Evol. 53:290-298.
McClelland, M., K. E. Sanderson, and J. Spieth, et al. (23 co-authors). 2001. Complete genome sequence of Salmonella enterica serovar typhimurium LT2. Nature 413:852-856.
McInerney, J. O. 1998. Replicational and transcriptional selection on codon usage in Borrelia burgdorferi. Proc. Natl. Acad. Sci. USA 95:10698-10703.
Moll, I., M. Huber, S. Grill, P. Sairafi, F. Mueller, R. Brimacombe, P. Londei, and U. Blasi. 2001. Evidence against an interaction between the mRNA downstream box and 16S rRNA in translation initiation. J. Bacteriol. 183:3499-3505.
Osada, Y., R. Saito, and M. Tomita. 1999. Analysis of base-pairing potentials between 16S rRNA and 5' UTR for translation initiation in various prokaryotes. Bioinformatics 15:578-581.
Parkhill, J., B. W. Wren, and N. R. Thomson, et al. (32 co-authors). 2001. Genome sequence of Yersinia pestis, the causative agent of plague. Nature 413:523-527.
Perna, N. T., G. Plunkett, and V. Burl, et al. (25 co-authors). 2001. Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature 409:529-533.
Rocha, E. P., and A. Danchin. 2001. Ongoing evolution of strand composition in bacterial genomes. Mol. Biol. Evol. 18:1789-1799.
Rocha, E. P., A. Viari, and A. Danchin. 1998. Oligonucleotide bias in Bacillus subtilis: general trends and taxonomic comparisons. Nucleic Acids Res. 26:2971-2980.
Sakai, H., C. Imamura, Y. Osada, R. Saito, T. Washio, and M. Tomita. 2001. Correlation between Shine-Dalgarno sequence conservation and codon usage of bacterial genes. J. Mol. Evol. 52:164-170.
Sharp, P. M., E. Cowe, D. G. Higgins, D. C. Shields, K. H. Wolfe, and F. Wright. 1988. Codon usage patterns in Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster, and Homo sapiens; a review of the considerable within-species diversity. Nucleic Acids Res. 16:8207-8211.
Sharp, P.M., and W.-H. Li. 1987. The codon adaptation index—a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15:1281-1295.
Shields, D. C., and P. M. Sharp. 1987. Synonymous codon usage in Bacillus subtilis reflects both translational selection and mutational biases. Nucleic Acids Res. 15:8023-8040.
Shine, J., and L. Dalgarno. 1974. The 3'-terminal sequence of Escherichia coli 16S ribosomal RNA: complementarity to nonsense triplets and ribosome binding sites. Proc. Natl. Acad. Sci. USA 71:1342-1346.
Sprengart, M. L., H. P. Fatscher, and E. Fuchs. 1990. The initiation of translation in E. coli: apparent base pairing between the 16S rRNA and downstream sequences of the mRNA. Nucleic Acids Res. 18:1719-1723.
Sprengart, M. L., E. Fuchs, and A. G. Porter. 1996. The downstream box: an efficient and independent translation initiation signal in Escherichia coli. EMBO J. 15:665-674.
Stenstrom, C. M., E. Holmgren, and L. A. Isaksson. 2001. Cooperative effects by the initiation codon and its flanking regions on translation initiation. Gene 273:259-265.
(Anders Fuglsang)
E-mail anfu@dfh.dk.
Abstract
This study compared orthologous gene pairs from Escherichia coli K12, E. coli O157:H7 EDL933, Salmonella typhimurium LT2, and Yersinia pestis CO92 using only homologs of equal length, and differing nucleotides were counted and mapped. The data showed very clearly how the rates of divergence change with intragenic and extragenic position. The rate of synonymous mutation is lowest near start codons and near stop codons, and, a little surprisingly, the opposite seemed to be true for nonsynonymous substitutions. Analysis outside genes reveals that nucleotide divergences occur less frequently upstream of start codons than downstream of stop codons, and a remarkable drop in divergences is seen for two of the data sets around N = 9 nucleotides upstream of start codons; that is, the Shine-Dalgarno region changes at a lower rate. The explanation is likely to be the link between expressivity and sequence complementarity to the 3' end of 16S ribosomal rRNA. The latter is highly conserved across many bacterial and archaebacterial species.
Key Words: Sequence alignment ? Shine-Dalgarno regions ? mutation ? molecular evolution ? codon usage bias ? expressivity
Introduction
The factors that determine composition of coding and noncoding regions of DNA are not well understood, but in the postgenomic era, some compositional rules are beginning to unravel. Composition is a product of mutation, selection, and drift. However, the mutation may not strike equally all over a genome. It is well known that the leading strand of replication typically is richer in guanine than in cytosine, probably because of differential effects of the mutational pressure at the asymmetrical replication fork (Rocha and Danchin 2001). The phenomenon is rather striking (e.g., the sliding window plot of GC-skew [Rocha, Viari, and Danchin 1998]). Selection for specific compositional traits has been demonstrated in a number of species. Expression levels and tRNA availability often play decisive roles for the intragenic composition. Gouy and Gautier (1982) found that highly expressed genes tend to use a narrow set of codons, which were demonstrated to correspond to the more abundant tRNAs (Ikemura and Ozeki 1983; Ikemura 1985; Bulmer 1987; Kanaya 2001). There is also some evidence that the codon choice can depend on the intragenic position. Bulmer (1988) found that the codon usage in the first codons of genes may be under special selectional forces relating to translational efficiency, which, in turn, is influenced by the composition of start codons, as AUG usually is more efficient than GUG or UUG (Stenstrom, Holm-gren, and Isaksson 2001). The classical elucidation by Shine and Dalgarno (1974) of the sequence of the ribosomal 16S subunit proved that the extragenic sequence can be a major determinant of expression levels, and recent bioinformatic studies on a whole-genome scale have confirmed that there is a clear link between the sequence in the Shine-Dalgarno region and expressivity (Osada, Saito, and Tomita 1999; Sakai et al. 2001; Fuglsang 2003).
Given that mutations and selective forces display regional preferences, is it then possible to identify regions of DNA that evolve faster than others? We may get an answer from pairwise alignment studies of homologous genes from closely related species (i.e., by aligning homologs and counting divergences). There is little information in, say, two lacZ genes representing any two species, but if we study many such orthologous pairs from the same two species, then we might be able to elucidate some patterns. One problem is that gene lengths also evolve, so codon-by-codon alignment must rely either on orthologs of equal length or on alignment algorithms that are able to handle sequence gaps. We have chosen the former approach because gapped alignments give rise to some uncertainty about intragenic positions and because gapped alignment is much more difficult to implement on a large scale. On a small subset of orthologs of bacterial genes of Escherichia coli and Salmonella typhimurium, Eyre-Walker and Bulmer (1993) found a decreased tendency for synonymous substitution near the gene starts. Other studies conducted on orthologs of eukaryotes have focused more on the specific nucleotides involved in the changes and on codon positions than on regional variations of these phenomena (Alvarez-Valin, Jabbari, and Bernardi 1998; Alvarez-Valin et al. 1999). There is a lack of studies encompassing whole genomes. This study, therefore, aims to expand the knowledge by using a whole-genome approach to quantify the divergences of orthologous pairs in relation to positions inside and outside genes.
Materials and Methods
Construction of Data Sets
It was decided to use the best-known microorganism, Escherichia coli strain K12 (GenBank accession number NC000913 [Blattner et al. 1997]), as a reference organism and to compare this organism to a few other bacteria. The criterion for alignment in this study is that the pairs of genes be orthologous and of equal length. Within these boundaries, any choice or combination of two bacteria could be used for a data set in this study. However, the larger the evolutionary distance between two species (or strains), the higher the chance that the two genes of a homologous pair differ in length. This, in turn, means that the number of homologous genes of equal length could be too low to allow any conclusions to be drawn. The same problem would arise if one or both of the genomes used for a data set are very small. In addition, some genomes are too sparsely annotated to allow inclusion. For example, in the GenBank file of the Streptomyces coelicolor genome (accession number NC003888 [Bentley et al. 2002]), most genes are annotated as "hypothetical" or "putative" or as having unknown function, making inclusion in this study impossible.
Based on these considerations, E. coli strain O157:H7 EDL933 (GenBank accession number NC002655 [Perna et al. 2001]) was chosen as the close relative of E. coli K12, and Yersinia pestis CO92 (GenBank accession number NC003143 [Parkhill et al. 2001]) and Salmonella typhimurium LT2 (GenBank accession number NC003197 [McClelland et al. 2001]) were selected as more distant relatives. Phylogenetically, the Escherichia genus is a member of the enterobacteriaceae group in the γ subdivision of the proteobacteria. Salmonella also belongs to the enterobacteriaceae, whereas Yersinia is more distant to E. coli, belonging to the pasteurellaceae group in the γ subdivision. All three strains are roughly comparable in GC content. (Table 1 lists the number of homologs and their GC content.) There are, thus, three data sets in this study: the first, containing homologous genes of equal length from the two E. coli strains; the second, containing homologous genes of equal length from E. coli K12 and S. typhimurium LT2; and the third, containing homologous genes of equal length from E. coli K12 and Y. pestis CO92.
Table 1 Bacteria in This Study.
Intragenic Evolution Rates
It was decided to count both synonymous and nonsynonymous divergences in the data sets. All pairs of homologs were parsed from N = 1 to N = 50 codons, where N = 1 denotes the start codon. For each value of N, the two codons were compared, and if they differed, the corresponding amino acids were compared, revealing whether the divergence was synonymous or nonsynonymous. Finally, frequencies of synonymous and nonsynonymous changes were plotted as a function of N. For completeness, a similar analysis was run for the region upstream of stop codons (where N = 1 denotes the stop codon). To avoid counting divergences twice, genes were only counted if they spanned at least 300 nucleotides, including the stop codon.
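The counting procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used in the study: the function name `divergence_profile`, its parameters, and the use of the standard genetic code for every position (so that an ATG/GTG difference at N = 1 is scored as nonsynonymous) are choices made here for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def divergence_profile(pairs, n_codons=50, min_length=300, from_end=False):
    """Return per-position frequencies of synonymous and nonsynonymous
    codon differences for equal-length ortholog pairs (hypothetical helper).

    pairs    -- iterable of (seq_a, seq_b) coding sequences, uppercase DNA
    from_end -- if True, N = 1 is the stop codon and N counts upstream
    """
    syn = [0] * n_codons
    non = [0] * n_codons
    used = 0
    for a, b in pairs:
        # Inclusion criteria: equal length, whole codons, long enough.
        if len(a) != len(b) or len(a) % 3 or len(a) < max(min_length, 3 * n_codons):
            continue
        used += 1
        for i in range(n_codons):
            start = len(a) - 3 * (i + 1) if from_end else 3 * i
            ca, cb = a[start:start + 3], b[start:start + 3]
            if ca == cb or ca not in CODON_TABLE or cb not in CODON_TABLE:
                continue  # identical, or ambiguous bases: not counted
            if CODON_TABLE[ca] == CODON_TABLE[cb]:
                syn[i] += 1
            else:
                non[i] += 1
    if used == 0:
        return [0.0] * n_codons, [0.0] * n_codons
    return [s / used for s in syn], [n / used for n in non]
```

Calling the function once with `from_end=False` and once with `from_end=True` gives the two profiles (downstream of start codons and upstream of stop codons); the 300-nucleotide minimum keeps the two 50-codon windows from overlapping.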
Extragenic Evolution Rates
To study the extragenic evolution rates, the data sets were expanded with information about the 50 nucleotides upstream of start codons and downstream of stop codons. Nucleotide differences were then counted and plotted as a function of the nucleotide position upstream of the start codon or downstream of the stop codon.
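A minimal sketch of the extragenic count is given below. The function name `flank_divergence` and the convention that upstream flanks are supplied 5' to 3' and end immediately before the start codon are assumptions made for this example; the original implementation is not described in that detail.

```python
def flank_divergence(flank_pairs, length=50, upstream=True):
    """Fraction of ortholog pairs that differ at each flanking position.

    flank_pairs -- iterable of (flank_a, flank_b) DNA strings, 5' to 3'
    upstream    -- True: flanks end just before the start codon, and
                   position N = 1 is the nucleotide nearest the start codon;
                   False: flanks begin just after the stop codon
    """
    diffs = [0] * length
    used = 0
    for a, b in flank_pairs:
        if len(a) < length or len(b) < length:
            continue  # flank too short (e.g., gene near a sequence end)
        used += 1
        if upstream:
            # Reverse so that index 0 is adjacent to the start codon.
            a, b = a[-length:][::-1], b[-length:][::-1]
        else:
            a, b = a[:length], b[:length]
        for i, (x, y) in enumerate(zip(a, b)):
            if x != y:
                diffs[i] += 1
    if used == 0:
        return [0.0] * length
    return [d / used for d in diffs]
```

Plotting the returned list against N = 1 to 50 for both orientations gives curves of the kind described for the extragenic regions.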
Conservation in the Shine-Dalgarno Region
It has been shown previously that a distinct pattern of nucleotide nonrandomness can be seen at around N = 9 nucleotides upstream of bacterial start codons, indicating usage of Shine-Dalgarno regions (Fuglsang and Engberg 2003). Furthermore, E. coli is reported to display more pronounced selection for translational efficiency than Bacillus subtilis (Shields and Sharp 1987; Sharp et al. 1988). This led to the investigation of nucleotide nonrandomness in data sets for high-expressivity genes versus low-expressivity genes, and it was shown that in E. coli, the nonrandomness is more pronounced in the Shine-Dalgarno region for the high-expressivity genes than for the low-expressivity genes, whereas no difference was observed in B. subtilis (Fuglsang 2003). On this basis, and because the data sets with data for S. typhimurium and Y. pestis displayed markedly lower divergence rates in this region (see Results and Discussion), it was decided to perform the same analysis on S. typhimurium and Y. pestis. For both species, two data sets were constructed, one consisting of the 500 genes having lowest expressivity and the other consisting of the 500 genes having highest expressivity. Note that expressivity is not the same as expression levels; expressivity denotes the codon adaptation index (CAI [Sharp and Li 1987]), which is only a surrogate value for expression levels. Nonrandomness analysis (Fuglsang and Engberg 2003; Fuglsang 2003) was performed on these two fractions individually.
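Because expressivity is defined here as the codon adaptation index, a short sketch of the CAI calculation may be helpful. It follows the general definition of Sharp and Li (1987): each codon receives a relative adaptiveness w (its count in a reference set of highly expressed genes divided by the count of the most frequent synonymous codon), and the CAI of a gene is the geometric mean of w over its codons. The function names, the replacement of zero counts by 0.5, and the choice of reference genes are assumptions of this example, not details taken from the present study.

```python
import math
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq):
    """Split a coding sequence into complete codons."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_genes):
    """Compute w for every codon that has at least one synonym."""
    counts = Counter(c for g in reference_genes for c in codons(g) if c in CODON_TABLE)
    weights = {}
    for aa in set(AMINO) - {"*"}:
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        if len(family) == 1:
            continue  # Met and Trp carry no information about codon choice
        # Assumption: zero counts are replaced by 0.5 to avoid log(0).
        adjusted = {c: max(counts[c], 0.5) for c in family}
        best = max(adjusted.values())
        for c in family:
            weights[c] = adjusted[c] / best
    return weights


def cai(gene, weights):
    """Geometric mean of w over the informative codons of a gene."""
    logs = [math.log(weights[c]) for c in codons(gene) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")
```

With a CAI value for every gene, the low- and high-expressivity fractions can be taken as the first and last 500 entries of the gene list sorted by `cai`.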
Statistical Analysis
Spearman's rank correlation analysis was performed using GraphPad Prism version 3.0 (GraphPad Inc., Calif.) to test whether divergence rates show an increasing or decreasing trend near gene ends. When linear relationships were expected, linear regression was used. A P value below 0.05 was considered significant.
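For readers who wish to reproduce the trend test without GraphPad Prism, the rank correlation coefficient itself is easy to compute. The sketch below is a plain Python stand-in written for this illustration (the function names are hypothetical); it assigns average ranks to ties and then takes the Pearson correlation of the ranks. It does not compute the P value.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx)
    sy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(sx * sy)
```

Here `x` would be the codon position N and `y` the divergence frequency at that position; a coefficient near +1 indicates divergences that increase steadily with distance from the gene end.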
Results and Discussion
Intragenic Observations
Figure 1a–c shows the divergence frequencies downstream of start codons in the three data sets. Correlation analysis results are given in the figure legends. In all three data sets, the synonymous and nonsynonymous rates of divergence are approximately equal at the first few codons, after which the synonymous codon divergences increase very significantly in all data sets, whereas the nonsynonymous changes are without a significant trend (E. coli K12 versus E. coli O157:H7 EDL933) or significantly decreasing (E. coli K12 versus S. typhimurium, and E. coli K12 versus Y. pestis). Figure 2a–c shows similar data but for the codons upstream of stop codons. The same principles as observed in figure 1a–c are also visible here: synonymous changes become more frequent as the distance from the stop codon increases, whereas the nonsynonymous changes become less frequent. All in all, these figures reveal that synonymous intragenic evolution is more likely to occur distally to start and stop codons, whereas nonsynonymous evolution is more likely to occur proximally to start and stop codons, and that nonsynonymous changes generally are rarer. It should be emphasized that not all synonymous changes may be selectively inert. Gouy and Gautier (1982) found that highly expressed genes tend to use a narrow set of codons, and Ikemura and Ozeki (1983) showed that the preferred codons in highly expressed genes corresponded very well to tRNAs that are more abundant than their synonymous counterparts. Since then, it has been shown in a variety of organisms, both eukaryotic and prokaryotic, that the major source of variation in codon usage bias is expression levels, even though there are exceptions to this principle (see examples by McInerney [1998] and Herbeck, Wall, and Wernegreen [2003]). Bulmer (1988) studied a small set of genes and found some indications that the codon usage bias of the first few codons just downstream of the start codon is subject to a special kind of selection.
Later, it was proposed as the "minor codon modulator hypothesis" that the expression level of a gene is in part controlled by the presence of nonoptimal codons, especially if they reside in the beginning of a gene (Chen and Inouye 1990, 1994). Thus, rare codons clustered downstream of the start codon may stall the translational machinery because the availability of charged tRNA for these particular codons is low. In this context, figure 1a–c would suggest that the relatively high rate of nonsynonymous change and the relatively low rate of synonymous change near start codons arise because retaining the expression level becomes (relatively) more important than retaining the primary structure of the protein. However, the absolute magnitudes of divergence still indicate that the primary structure of the protein is quantitatively most important.
FIG. 1. (a) Synonymous and nonsynonymous divergence frequency versus the codon position downstream of start codons in the data set corresponding to orthologs from E. coli K12 and E. coli O157:H7 EDL933. There is a highly significant increasing tendency for the synonymous divergences: Spearman's rank correlation coefficient = 0.8745, P < 0.0001. (b) Similar to (a) but for the data set corresponding to orthologs from E. coli K12 and S. typhimurium. There is a highly significant increasing tendency for the synonymous divergences: coefficient = 0.9484, P < 0.0001. There is also a highly significant decreasing tendency for the nonsynonymous divergences: coefficient = –0.5280, P < 0.0001. (c) Similar to (a) but for the data set corresponding to orthologs from E. coli K12 and Y. pestis. There is a highly significant increasing tendency for the synonymous divergences: coefficient = 0.5540, P < 0.0001.
FIG. 2. (a) Synonymous and nonsynonymous divergence frequency versus the codon position upstream of stop codons in the data set corresponding to orthologs from E. coli K12 and E. coli O157:H7 EDL933. There is a very significant increasing tendency for the synonymous divergences: Spearman's rank correlation coefficient = 0.5727, P < 0.0001. The trend for the nonsynonymous changes is less significant: coefficient = –0.3261, P < 0.05. (b) Similar to (a) but for the data set corresponding to orthologs from E. coli K12 and S. typhimurium. There is a highly significant increasing tendency for the synonymous divergences: coefficient = 0.8439, P < 0.0001. There is also a highly significant decreasing tendency for the nonsynonymous divergences: coefficient = –0.6719, P < 0.0001. (c) Similar to (a) but for the data set corresponding to orthologs from E. coli K12 and Y. pestis. There is a highly significant increasing tendency for the synonymous divergences: coefficient = 0.7283, P < 0.0001. There is also a highly significant decreasing tendency for the nonsynonymous divergences: coefficient = –0.7453, P < 0.0001.
It is known that bacteria such as E. coli are able to process short-leadered and leaderless mRNA. When a ribosomal binding site is lacking, there might, therefore, be an alternative path of recognition events leading to initiation of the translation process. It has been proposed that the penultimate stem of 16S rRNA may be able to form base pairs with a stretch of nucleotides after the start codon (the "downstream box"), and, therefore, this region could be functioning in much the same fashion as the Shine-Dalgarno region (Sprengart, Fatscher, and Fuchs 1990). This hypothesis is subject to much controversy. Sprengart, Fatscher, and Fuchs (1990) and Sprengart, Fuchs, and Porter (1996) found that the degree of complementarity plays an important role for translational efficiency both in the presence and in the absence of a strong Shine-Dalgarno sequence. Nevertheless, Moll et al. (2001) found with topological models of the 16S rRNA structure that it is unlikely that such base pairing occurs at all. Figure 1a–c could, in principle, also reflect such phenomena. However, the length of the downstream box (whether it exists or not) is purportedly around 15 nucleotides, but in figure 1a–c, there is no sign of the observations being confined to such a well-defined short stretch of nucleotides. In that regard, the data presented here do not support a role for a downstream box in molecular evolution; we see no special boxlike pattern downstream of the start codon.
Extragenic Observations and Expressivity
In figure 3a–c, the extragenic divergences are shown for nucleotides upstream of start codons and downstream of stop codons. There is no clear pattern in the divergence frequencies downstream of stop codons. However, in figure 3b and c (and perhaps figure 3a as well), a remarkable drop in divergence is seen in the region around 10 nucleotides upstream of start codons. This is, in my interpretation, clear evidence of the importance of functional Shine-Dalgarno regions. Previously published results on this region showed that the degree of nonrandomness in E. coli, centered on N = 9 nucleotides upstream of start codons, is very pronounced (Fuglsang and Engberg 2003) and that the nucleotides, furthermore, are more nonrandom in the highly expressed genes than in the lowly expressed genes (Fuglsang 2003). As figure 4a and b shows, the same is true for S. typhimurium and for Y. pestis. Both clearly make extensive use of Shine-Dalgarno regions, and these seem to be positioned just as in E. coli. Table 2 lists the 3' ends of the 16S rRNA for the species and strains in this study plus some more distantly related examples. This region is extremely well conserved across a wide range of eubacteria and archaebacteria. The natural role of the Shine-Dalgarno region is to facilitate ribosomal binding and translational initiation. The higher the complementarity between a gene's Shine-Dalgarno region and the 3' end of the 16S rRNA, the higher the chance of translational initiation. Therefore, highly expressed genes tend to have a highly complementary Shine-Dalgarno region. Figure 5a shows a plot of synonymous divergences between E. coli K12 and S. typhimurium orthologs versus the E. coli K12 codon adaptation index. Note that the two codon adaptation indexes are well correlated (inset in figure 5a). The figure shows that highly expressed genes tend to undergo synonymous evolution at a slower rate than lowly expressed genes. The same holds true for the nonsynonymous changes (fig. 5b). Data for the other data sets reveal the same tendency (not shown). Thus, the higher the expression of a gene, the slower it evolves, both synonymously and nonsynonymously. A study by Alff-Steinberger (2001) concluded that rare codons tend to be more prone to evolution. Because rare codons give lower expressivities, the findings of this study are in good agreement with Alff-Steinberger's (2001) results and also with Sharp and Li (1987), who concluded that codon bias and substitution rate vary inversely. All in all, these findings also accord well with those of Eyre-Walker and Bulmer (1993), who concluded that the synonymous substitution rate is lower near start codons. This study expands those findings considerably, and, thus, the following are concluded:
The synonymous substitution rate is lower near start and stop codons.
The nonsynonymous substitution rate is higher near stop codons and, in some cases, also near start codons.
The substitution rate in Shine-Dalgarno regions is reduced, relating to the fact that the 3' end of 16S rRNA is conserved.
Highly expressed genes have lower rates of synonymous as well as nonsynonymous substitution.
FIG. 3. (a) Divergences upstream of start codons or downstream of stop codons for the data set corresponding to E. coli K12 and E. coli O157:H7 EDL933. The rate of mutation is generally lower upstream of start codons than downstream of stop codons. (b) Same as (a) but for E. coli K12 and S. typhimurium. Note the clear drop in divergences in the Shine-Dalgarno region (around 10 nucleotides upstream of the start codon). (c) Same as (a) but for E. coli K12 and Y. pestis.
FIG. 4. (a) Nonrandomness upstream of start codons for S. typhimurium in the 500 genes displaying lowest expressivity and the 500 genes displaying highest expressivity. The peak around N = 9 is more pronounced for the genes of high expressivity, establishing a clear link between expressivity and the Shine-Dalgarno region. (b) Similar to (a) but for Y. pestis.
Table 2 The 3' End of Aligned 16S rRNA of the Species in This Study.
FIG. 5. (a) A plot of the synonymous divergences between E. coli K12 and S. typhimurium versus the E. coli K12 codon adaptation index. There is a negative correlation: correlation coefficient = –0.4736, P < 0.0001. This shows that in highly expressed genes, there are fewer synonymous divergences. The inset is a plot showing the correlation of the two codon adaptation indexes. The correlation is very good: r² = 0.8620, P < 0.0001. (b) A plot of the nonsynonymous divergences between E. coli K12 and S. typhimurium versus the E. coli K12 codon adaptation index. There is a negative correlation: correlation coefficient = –0.5937, P < 0.0001. Note that generally the percentages are lower than for the synonymous divergences (a).
A Word on the Methodology
Imagine a pair of orthologs in which one of the genes has undergone an insertion (e.g., plus one codon) followed by a deletion (minus one codon). The result, then, would be a pair of orthologs of equal length, fully complying with the inclusion criteria for this study. The farther apart the insertion event and the deletion event took place, the larger the degree of divergence that would be expected, and this might be the reason why figure 5b reveals a few "outlier" examples having extremely high nonsynonymous divergences. Formally, this could easily be tested by aligning the amino acids. One alternative approach to overcome this problem could be to scan genomes for genes of equal length and make pairs on the basis of homology alone (e.g., 90% nucleotide identity or any other defined limit). I have tried this approach, but it seems to be unsuitable, at least for identifying pairs of orthologs. It yields pairs that are not orthologs but merely homologs; for example, a response regulator involved in sugar metabolism may be paired with a response regulator involved in branched-chain amino acid metabolism.
The data presented here only represent bacteria in the γ subdivision of the proteobacteria. Therefore, one should be careful not to conclude that molecular evolution works the same way regardless of phylogenetic position. It would thus be interesting to repeat the analysis for orthologous gene pairs taken from bacteria located elsewhere on the phylogenetic tree. This has so far been very difficult; for example, generation of a similar data set with the firmicutes B. subtilis (GenBank accession number NC000964) and B. halodurans (GenBank accession number NC002570) only gives 132 orthologous pairs and, thus, does not yield conclusive figures.
Literature Cited
Alff-Steinberger, C. 2001. A comparative study of mutations in Escherichia coli and Salmonella typhimurium shows that codon conservation is strongly correlated with codon usage. J. Theor. Biol. 206:307-311.
Alvarez-Valin, F., K. Jabbari, and G. Bernardi. 1998. Synonymous and nonsynonymous substitutions in mammalian genes: intragenic correlations. J. Mol. Evol. 46:37-44.
Alvarez-Valin, F., K. Jabbari, N. Carels, and G. Bernardi. 1999. Synonymous and nonsynonymous substitutions in genes from Gramineae: intragenic correlations. J. Mol. Evol. 49:330-342.
Bentley, S. D., K. F. Chater, and A. M. Cerdeno-Tarraga, et al. (40 co-authors). 2002. Complete genome sequence of the model actinomycete Streptomyces coelicolor A3(2). Nature 417:141-147.
Blattner, F. R., G. Plunkett, and C. A. Bloch, et al. (14 co-authors). 1997. The complete genome sequence of Escherichia coli K-12. Science 277:1453-1474.
Bulmer, M. 1987. Coevolution of codon usage and transfer RNA abundance. Nature 325:728-730.
Bulmer, M. 1988. Codon usage and intragenic position. J. Theor. Biol. 133:67-71.
Chen, G. F., and M. Inouye. 1990. Suppression of the negative effect of minor arginine codons on gene expression: preferential usage of minor codons within the first 25 codons of the Escherichia coli genes. Nucleic Acids Res. 18:1465-1473.
Chen, G. F., and M. Inouye. 1994. Role of the AGA/AGG codons, the rarest codons in global gene expression in Escherichia coli. Genes Dev. 8:2641-2652.
Eyre-Walker, A., and M. Bulmer. 1993. Reduced synonymous substitution rate at the start of enterobacterial genes. Nucleic Acids Res. 21:4599-4603.
Fuglsang, A. 2003. Association of the nucleotide with codon bias, amino acid usage and expressivity: differences between Bacillus subtilis and Escherichia coli. APMIS 111:926-930.
Fuglsang, A., and J. Engberg. 2003. Non-randomness in Shine-Dalgarno regions: links to gene characteristics. Biochem. Biophys. Res. Commun. 302:296-301.
Gouy, M., and C. Gautier. 1982. Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res. 10:7055-7074.
Herbeck, J. T., D. P. Wall, and J. J. Wernegreen. 2003. Gene expression level influences amino acid usage, but not codon usage, in the tsetse fly endosymbiont Wigglesworthia. Microbiology 149:2585-2596.
Ikemura, T. 1985. Codon usage and tRNA content in unicellular and multicellular organisms. Mol. Biol. Evol. 2:13-34.
Ikemura, T., and H. Ozeki. 1983. Codon usage and transfer RNA contents: organism-specific codon-choice patterns in reference to the isoacceptor contents. Cold Spring Harb. Symp. Quant. Biol. 47:1087-1097.
Kanaya, S., Y. Yamada, M. Kinouchi, Y. Kudo, and T. Ikemura. 2001. Codon usage and tRNA genes in eukaryotes: correlation of codon usage diversity with translation efficiency and with CG-dinucleotide usage as assessed by multivariate analysis. J. Mol. Evol. 53:290-298.
McClelland, M., K. E. Sanderson, and J. Spieth, et al. (23 co-authors). 2001. Complete genome sequence of Salmonella enterica serovar typhimurium LT2. Nature 413:852-856.
McInerney, J. O. 1998. Replicational and transcriptional selection on codon usage in Borrelia burgdorferi. Proc. Natl. Acad. Sci. USA 95:10698-10703.
Moll, I., M. Huber, S. Grill, P. Sairafi, F. Mueller, R. Brimacombe, P. Londei, and U. Blasi. 2001. Evidence against an interaction between the mRNA downstream box and 16S rRNA in translation initiation. J. Bacteriol. 183:3499-3505.
Osada, Y., R. Saito, and M. Tomita. 1999. Analysis of base-pairing potentials between 16S rRNA and 5' UTR for translation initiation in various prokaryotes. Bioinformatics 15:578-581.
Parkhill, J., B. W. Wren, and N. R. Thomson, et al. (32 co-authors). 2001. Genome sequence of Yersinia pestis, the causative agent of plague. Nature 413:523-527.
Perna, N. T., G. Plunkett, and V. Burland, et al. (25 co-authors). 2001. Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature 409:529-533.
Rocha, E. P., and A. Danchin. 2001. Ongoing evolution of strand composition in bacterial genomes. Mol. Biol. Evol. 18:1789-1799.
Rocha, E. P., A. Viari, and A. Danchin. 1998. Oligonucleotide bias in Bacillus subtilis: general trends and taxonomic comparisons. Nucleic Acids Res. 26:2971-2980.
Sakai, H., C. Imamura, Y. Osada, R. Saito, T. Washio, and M. Tomita. 2001. Correlation between Shine-Dalgarno sequence conservation and codon usage of bacterial genes. J. Mol. Evol. 52:164-170.
Sharp, P. M., E. Cowe, D. G. Higgins, D. C. Shields, K. H. Wolfe, and F. Wright. 1988. Codon usage patterns in Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster, and Homo sapiens; a review of the considerable within-species diversity. Nucleic Acids Res. 16:8207-8211.
Sharp, P.M., and W.-H. Li. 1987. The codon adaptation index—a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15:1281-1295.
Shields, D. C., and P. M. Sharp. 1987. Synonymous codon usage in Bacillus subtilis reflects both translational selection and mutational biases. Nucleic Acids Res. 15:8023-8040.
Shine, J., and L. Dalgarno. 1974. The 3'-terminal sequence of Escherichia coli 16S ribosomal RNA: complementarity to nonsense triplets and ribosome binding sites. Proc. Natl. Acad. Sci. USA 71:1342-1346.
Sprengart, M. L., H. P. Fatscher, and E. Fuchs. 1990. The initiation of translation in E. coli: apparent base pairing between the 16S rRNA and downstream sequences of the mRNA. Nucleic Acids Res. 18:1719-1723.
Sprengart, M. L., E. Fuchs, and A. G. Porter. 1996. The downstream box: an efficient and independent translation initiation signal in Escherichia coli. EMBO J. 15:665-674.
Stenstrom, C. M., E. Holmgren, and L. A. Isaksson. 2001. Cooperative effects by the initiation codon and its flanking regions on translation initiation. Gene 273:259-265.
(Anders Fuglsang)