Paleogenomics or the Search for Remnant Duplicated Copies of the Yeast DUP240 Gene Family in Intergenic Areas
     Laboratoire de Dynamique, Evolution et Expression de Génomes de Microorganismes, FRE 2326 ULP/CNRS, Institut de Botanique, Strasbourg, France

    E-mail: souciet@gem.u-strasbg.fr.

    Abstract

    Duplication, resulting in gene redundancy, is well known to be a driving force of evolutionary change. Gene families are therefore useful targets for approaching genome evolution. To address the gene death process, we examined the fate of the 10-member-large S288C DUP240 family in 15 Saccharomyces cerevisiae strains. Using an original three-step method of analysis reported here, both slightly and highly degenerate DUP240 copies, called pseudo–open reading frames (ORFs) and relics, respectively, were detected in strain S288C. It was concluded that two previously annotated ORFs correspond, in fact, to pseudo-ORFs and three additional relics were identified in intergenic areas. Comparative intraspecies analysis of these degenerate DUP240 loci revealed that the two pseudo-ORFs are present in a nondegenerate state in some other strains. This suggests that within a given gene family different loci are the target of the gene erasure process, which is therefore strain dependent. Besides, the variable positions observed indicate that the relic sequence may diverge faster than the flanking regions. All in all, this study shows that short conserved protein motifs provide a useful tool for detecting and accurately mapping degenerate gene remnants. The present results also highlight the strong contribution of comparative genomics for gene relic detection because the possibility of finding short conserved protein motifs in intergenic regions (IRs) largely depends on the choice of the most closely related paralog or ortholog. By mapping new genetic components in previously annotated IRs, our study constitutes a further refinement step in the crucial stage of genome annotation and provides a strategy for retracing ancient chromosomal reshaping events and, hence, for deciphering genome history.

    Key Words: DUP240 family; fate of duplicated genes; gene death; pseudogenes; relics; Saccharomyces cerevisiae

    Introduction

    The aim of genomics is to identify all the genetic components of a genome, including the functional as well as the nonfunctional ones. Degenerate nonfunctional gene copies are one possible consequence of gene redundancy (Lynch and Conery 2000; Massingham, Davies, and Lio 2001; Prince and Pickett 2002), but they also provide keys for deciphering genome evolution. Pseudogenes are nonfunctional copies of functional genes (Vanin 1985; Mighell et al. 2000) which have deviated from their original sequence by only a few simple disablements such as frameshift, missense, or nonsense mutations (Zhang, Harrison, and Gerstein 2002). Furthermore, they generally still have a detectable open reading frame (ORF), so that these copies can be easily recovered by homology matching (Harrison et al. 2002). During evolution, some of these pseudogenes, freed from selective constraints, can accumulate many further mutations leading to the loss of any significant ORF. These highly degenerate gene remnants, found in intergenic regions (IRs) and named gene relics (Fischer et al. 2001), are obviously more difficult to detect. Lafontaine et al. (2004) have established that the mean number of relics per gene family correlates with the size of that gene family.

    The Saccharomyces cerevisiae S288C DUP240 family is composed of 10 members (showing 50%–98% nucleotide identity), which form either solo ORFs or two tandem repeats (fig. 1; Feuermann et al. 1997). The Dup240 proteins, which are approximately 240 amino acids long, have three conserved and two putative transmembrane domains in common (see fig. 4A), but their function is not yet known (Poirey et al. 2002). Previous comparative analyses (Leh-Louis et al. 2004a, 2004b) have shown that the DUP240 tandem loci are sites of gene birth and death, at which new paralogs emerge and some others disappear. Given this evolutive feature, we first screened the S288C intergenic areas, using an original three-step method of analysis to detect the presence of any DUP240 sequences that might have degenerated due to the accumulation of deleterious mutations; the results showed that the S288C DUP240 family is actually composed of eight ORFs, two pseudo-ORFs, and three additional DUP240 relics. Comparative analysis of these degenerate DUP240 loci in the 15 S. cerevisiae strains then showed that the target of the decay process is strongly strain dependent.

    FIG. 1.— Genetic organization and chromosomal location of the S288C DUP240 members. Dark gray boxes show the 10 previously identified DUP240 ORFs. Hatched boxes show remnant Dup240p motifs. Arrows indicate the orientation of these elements. tRNA genes and solo LTR or complete Ty elements are indicated by light gray circles and boxes, respectively. Dotted boxes correspond to DNA sequences of mitochondrial origin. Boxed areas indicate regions of highly conserved nucleotide sequences between chromosomes I and VII. The coordinates of the DUP240 relics are (1) relic-DUP240-I 188,909–189,980; (2) relic-DUP240-VII 404,497–405,148; and (3) relic-DUP240-XIII 371,524–372,250.

    FIG. 4.— DUP240 ORF and pseudo-ORFs in strain S288C. Diagram of (A) a Dup240 protein sequence, (B) the pseudo-YAR023c peptide sequence, and (C) the pseudo-YAR029w peptide sequence. Black and gray boxes correspond to the conserved and hydrophobic domains characteristic of Dup240p sequences, respectively. White boxes show variable intermediate regions. Remnant Dup240p motifs are indicated by hatched boxes.

    Materials and Methods

    Strains and Media

    The 15 S. cerevisiae strains used in this study come from various environments and have been previously described in Leh-Louis et al. (2004a, 2004b). Only the two laboratory strains, S288C and 1278b, are heterothallic and haploid. The other 13 strains are natural homothallic and diploid isolates (CLIB95, CLIB219, CLIB382, CLIB388, CLIB410, CLIB413, K1, R12, R13, TL213 [CLIB556], TL229 [CLIB630], YIIc12, and YIIc17). Growth and sporulation conditions were the same as those previously described (Leh-Louis et al. 2004a, 2004b).

    Three-Step Method of Analysis for Relic Identification in the S288C Genome Sequence

    The WU-Blast Version 2.0 (December 7, 2002) software (W. Gish [1996–2004] http://blast.wustl.edu) was used to perform TBlastN searches on the S288C genome sequence (Goffeau et al. 1997) using the default parameters (comparison matrix = BLOSUM62, w = 3, gap penalty = 9, gap extension penalty = 2) and the complete peptide sequences of all the known DUP240 ORFs, i.e., the 10 S288C DUP240 members and the 10 new DUP240 paralogs identified by Leh-Louis et al. (2004b). All hits in IRs, whatever the E values obtained, were then considered for the second-step analysis. The complete IR sequence in which a match was found was compared pairwise with all the known DUP240 paralogs by dotplot using the DNA Strider™ 1.3 dot matrix (window = 23; stringency = 15, or 13 for relic-DUP240-XIII). This method allows substantial regions of similarity between the two sequences tested, as well as their approximate coordinates, to be detected through the visual perception of diagonal lines. When a significant diagonal, even if disrupted or shifted, stood out from the background noise, the IR nucleotide sequence of interest was translated into all three reading frames. The resulting sequences were manually compared with the multiple alignment of the 10 S288C Dup240p sequences (Poirey et al. 2002) in order to identify the Dup240p motifs.
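
    The second step can be illustrated with a minimal sketch of a windowed dot matrix. It is not the DNA Strider implementation: the function names are hypothetical, and it assumes that "stringency" means the minimum number of matching positions required within a window for a dot to be drawn.

```python
from collections import Counter


def dotplot_hits(seq_a, seq_b, window=23, stringency=15):
    """Return (i, j) window start pairs where at least `stringency`
    of the `window` aligned positions are identical.

    Assumption: this is how the window/stringency settings are interpreted;
    the original software may differ in detail.
    """
    seq_a, seq_b = seq_a.upper(), seq_b.upper()
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if matches >= stringency:
                hits.append((i, j))
    return hits


def diagonal_counts(hits):
    """Count dots per diagonal offset (j - i).

    A heavily populated offset corresponds to a visible diagonal line;
    an insertion in one sequence shows up as a shift to another offset.
    """
    return Counter(j - i for i, j in hits)
```

    In this sketch, a diagonal that is "disrupted or shifted" would appear as dots spread over two or more neighboring offsets rather than a single one. Lowering the stringency (for example from 15 to 13) admits more background dots but also recovers weaker diagonals.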

    Molecular Biology Methods

    Coordinates of the DUP240 elements are given in accordance with the S288C genome sequence available on the Saccharomyces Genome Database (SGD) Web site (http://www.yeastgenome.org/). Yeast genomic DNA preparation, polymerase chain reaction amplification, and sequencing have previously been described in Leh-Louis et al. (2004a, 2004b). Relic13 (5'-ACATCTTTGCCTCGGTAGT-3') and relic13R (5'-GATATCACATAGAACAGCGA-3') were used as primers to amplify the relic-DUP240-XIII locus in the 15 strains studied. These sequence data have been submitted to the DNA Data Bank of Japan/European Molecular Biology Laboratory/GenBank databases under accession numbers AJ849516-532. Relic-DUP240-I and -VII and pseudo-YAR023c have been mapped on the previously submitted sequences having the following accession numbers: AJ585103-108, AJ585190, AJ585524-525, AJ586490-508, AJ586612, and AJ585532-548.

    Variable Nucleotide Positions Calculation

    The relic-DUP240-XIII locus sequences of the 15 strains studied were aligned using the PILEUP program (gap creation penalty = 5, gap extension penalty = 1) available in UWGCG Package Version 10.2 (Madison, Wis.) to identify the variable nucleotide positions (VPs). The VP value corresponds to the sum of all the positions where a nucleotide substitution has occurred in at least 1 of the 15 strains, and this sum is normalized for a sequence length of 100 nt. Its calculation is restricted to the condition that there are no two different substitutions at the same variable position and does not take into account deletion or insertion events. The VP values for relic-DUP240-XIII and for its corresponding 5' and 3' IRs were then compared to those previously determined by Leh-Louis et al. (2004a) for a reference gene set present in the same 15 strains.
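
    The VP calculation described above can be sketched as follows. This is an illustrative reading of the text, not the authors' script: the function name is hypothetical, gap-containing columns are skipped entirely, columns with more than two different nucleotides are not counted as variable, and the normalization uses the number of gap-free columns as the sequence length. Each of these choices is an assumption.

```python
def vp_value(aligned_seqs):
    """Variable positions per 100 nt for equal-length aligned sequences.

    Assumptions (not specified in detail by the source):
      - columns containing a gap character are ignored (indels not counted);
      - only columns with exactly two different nucleotides count as variable
        (no two different substitutions at the same position);
      - the length used for normalization is the number of gap-free columns.
    """
    if not aligned_seqs:
        raise ValueError("at least one sequence is required")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")

    scored = 0    # gap-free alignment columns
    variable = 0  # columns showing a single kind of substitution
    for column in zip(*aligned_seqs):
        bases = {base.upper() for base in column}
        if "-" in bases or "." in bases:
            continue
        scored += 1
        if len(bases) == 2:
            variable += 1

    if scored == 0:
        raise ValueError("no gap-free columns to score")
    return 100.0 * variable / scored
```

    For example, three aligned 10-nt sequences that differ at a single column give a VP value of 10.0 under these assumptions.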

    Results

    Gene Relics of the S288C DUP240 Multigene Family

    To detect DUP240 relics in the yeast S288C IRs, we used a three-step method of analysis. Relics are highly degenerated traces of ancient duplicated copies of genes. Therefore, we chose to use TBlastN (which takes the protein sequence into account) rather than BlastN in order to identify very weak similarities encoded at the nucleic acid level. Furthermore, to increase the sensitivity of our method of analysis, the TBlastN searches were performed using the full-size sequences of all the known Dup240 proteins. All the matched IRs, whatever the E values obtained, were then compared at the DNA level with the DUP240 paralogs using a dotplot method to discriminate between significant hits and those generated by chance in the TBlastN (0.998 being the greatest E value obtained for a significant hit, i.e., relic-DUP240-XIII). The presence of a diagonal line on the dotplot, even if it is disrupted or shifted, indicates that there are substantial regions of similarity between the two sequences tested. These IR sequences were then translated into the three reading frames, which were considered simultaneously to manually map the conserved Dup240p motifs (C1, C2, and C3) and the hydrophobic domains (H1 and H2). These motifs were then extended, without consideration of the reading frames, using the multiple alignment of the Dup240p sequences (Poirey et al. 2002) for comparison. By this method of analysis, three DUP240 relics were identified: relic-DUP240-I and relic-DUP240-VII, which are located at the 3' end of tandem I and VII loci, respectively, and relic-DUP240-XIII, which is located within a long IR (1,927 bp) of chromosome XIII (fig. 1).
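
    The third step, translating an IR into the three reading frames of one strand, can be sketched as below. The helper names are hypothetical and the standard genetic code is assumed; motif mapping itself was done manually in the study and is not reproduced here.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(seq, frame):
    """Translate `seq` starting at offset `frame` (0, 1, or 2).

    Stop codons are written as '*'; codons with unknown bases as 'X'.
    """
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(frame, len(seq) - 2, 3)
    )


def three_frame_translation(seq):
    """Return the translations of frames 1, 2, and 3 of the given strand."""
    return {frame + 1: translate(seq, frame) for frame in range(3)}
```

    Looking at the three translations side by side is what allows a motif interrupted by a frameshift to be followed from one frame into another, since a +1 or -1 nt mutation moves the remainder of the ancestral coding sequence into a different frame.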

    Despite a truncated C1 domain, relic-DUP240-I still harbors traces of the three conserved and two hydrophobic Dup240p domains (fig. 2A and B) and is 1,072 bp long, as compared with 720 bp, the length of a standard DUP240 paralog. This difference in length is due to two DNA insertions (figs. 1, 2A, and 2B). First, there is a 181-bp-long inserted sequence consisting of two noncontiguous mitochondrial DNA fragments, which correspond to a part of COB intron 5 and to a segment of COX1 exon 8, respectively (fig. 2A and B). Coordinates of these two fragments were previously determined based on the best BlastN alignments (Blanchard and Schmidt 1996; Ricchetti, Fairhead, and Dujon 1999). Our study shows that the remnant C2 domain perfectly borders the inserted segments, which made it possible to determine their exact coordinates. Microhomology stretches are found at the junction of each of these sequences (data not shown), suggesting that a double-insertion event involving two mitochondrial DNA fragments occurred during the repair of a chromosomal double-strand break (DSB) by nonhomologous end-joining recombination (Ricchetti, Fairhead, and Dujon 1999; Yu and Gabriel 1999). Moreover, the alternative hypothesis of a single-insertion event of a large mitochondrial DNA sequence followed by a large deletion is unlikely in this particular case because the chromosomal COB and COX1 segments are found on the Watson and Crick strands of the mitochondrial genome, respectively. The second inserted sequence is the complete long terminal repeat (LTR) YARW7 (figs. 1, 2A, and 2B) flanked by a 5-bp repeat (GTAAC) that corresponds to the target site duplication typical of Ty integration (Farabaugh and Fink 1980; Gafner and Philippsen 1980).

    FIG. 2.— Identification of relic-DUP240-I in strain S288C. (A) DNA dot matrix between the S288C IR YAR033w-YAT1 and the DUP E ORF sequence of strain CLIB219 (accession number AJ586504). Shifts in diagonal line indicate the occurrence of two insertion events. (B) Relic-DUP240-I nucleotide sequence translated into the three reading frames. Characteristic Dup240p conserved and hydrophobic domains are indicated by black and gray lines, respectively. Black boxes correspond to conserved amino acids based on the Dup240p sequence multiple alignment (Poirey et al. 2002). Gray boxes highlight hydrophobic amino acids. Open boxes show the two DNA sequences that interrupt the C2 domain of relic-DUP240-I: the COB intron 5 and COX1 exon 8 segments of mitochondrial origin are inserted between coordinates 189,212 and 189,392, while the delta7 LTR is inserted between coordinates 189,420 and 189,754.

    Relic-DUP240-VII is the counterpart of relic-DUP240-I (fig. 1; Feuermann et al. 1997) but differs from the latter by the presence of two additional LTRs, resulting in the loss of the 3' end of relic-DUP240-VII (fig. 1). It is unlikely that DSB events occurred at the same position in these two relics and that, moreover, the same mitochondrial DNA stretches were used to repair these breaks. Furthermore, it is also unlikely that two distinct Ty elements have inserted independently at the same position into both of them (7 and 8, see fig. 1). Relic-DUP240-I and -VII must therefore stem from the same ancestor ORF, in which these two successive insertion events occurred.

    Relic-DUP240-XIII constitutes a new locus related to the DUP240 family (fig. 1). This relic is more highly degenerated than the previous ones because DNA dot matrix analysis had to be performed with a lower stringency to be able to distinguish a significant diagonal line (fig. 3). Comparisons carried out at the amino acid level showed that this relic covers the standard sequence of a Dup240 protein and that at least nine frameshift and eight nonsense mutations have affected the nucleotide sequence of this ancient DUP240 ORF. Therefore, unlike relic-DUP240-I and -VII, the degeneration process of relic-DUP240-XIII does not involve large deletion or insertion events but only the accumulation of numerous point mutations or disablements affecting only a few nucleotides.

    FIG. 3.— Relic-DUP240-XIII in strain S288C. DNA dot matrix between the S288C IR ERB1-tV(AAC)M1 and the YAR031w ORF sequence of strain CLIB219 (accession number AJ586504). Large black and gray arrows correspond to the conserved and hydrophobic Dup240p domains, respectively. Equivalent domains identified in relic-DUP240-XIII are given in hatched boxes. Black arrows above the diagram corresponding to relic-DUP240-XIII illustrate frameshift mutations.

    Pseudo-ORFs of the S288C DUP240 Multigene Family

    YAR023c and YAR029w, which are 540 and 225 bp in length, respectively, are the two shortest members of the DUP240 family (fig. 1; Feuermann et al. 1997; Poirey et al. 2002). This raises questions as to the status of these ORFs; indeed, they could stem either from the accumulation of mutations responsible for the shortening of their full-size ancestor ORF (pseudogenes) or from a partial gene duplication. If they correspond to pseudogenes, traces of the missing motifs should be recovered upstream and/or downstream from the remaining ORF, which would not be expected in the latter case.

    DNA dot matrix analysis and peptide comparisons performed on the YAR023c region showed that a single frameshift mutation is responsible for masking the complete C1 domain of this ORF, the next in-frame ATG being located in H1 (fig. 4B).

    Our computational analysis also confirmed that all the Dup240p motifs are still present in the YAR029w region. Comparisons with the DUP B paralog identified in CLIB382, another S. cerevisiae strain (Leh-Louis et al. 2004b), made it possible to accurately determine the nature and size of the events responsible for the shortening of YAR029w (fig. 4C).

    The YAR023c and YAR029w ORFs should therefore be denoted pseudo-ORFs (pseudo-YAR023c and pseudo-YAR029w), in keeping with the definition of pseudogenes, which are slightly degenerate copies of functionally characterized genes.

    Comparative Analysis of the DUP240 Relics and Pseudo-ORFs in S. cerevisiae Strains

    Previous comparative analyses of the DUP240 loci in the 15 S. cerevisiae strains of various origins showed that the three solo ORFs are subject to different evolutionary constraints (Leh-Louis et al. 2004a), while the tandem repeats are characterized by great polymorphism involving the birth and death of ORFs (Leh-Louis et al. 2004b). The same set of strains was therefore checked to establish whether or not a relic or pseudo-ORF might be present at identical chromosomal locations to those determined in S288C and whether they show a nucleotide polymorphism. It is indeed not obvious that all strains present the same degenerate state for one particular locus; some of them could contain a less degenerate relic or even a complete DUP240 ORF at these considered regions.

    Our analysis of the available tandem I and VII loci sequences showed that the presence of the relic is generally correlated with that of DUP240 ORFs in the corresponding tandem repeat (fig. 5). Indeed, no traces of relic-DUP240-VII were found in strains devoid of DUP240 ORFs in the tandem VII region (fig. 5), suggesting that the ORFs and relic have either always been absent from this location or have been lost. By contrast, although the tandem I locus of strain TL229 is devoid of DUP240 ORFs and contains only two Ty LTRs, it nevertheless harbors the 3' end of relic-DUP240-I (fig. 5). This tag is strong evidence that DUP240 ORFs were initially present in the TL229 tandem I region but that they were subsequently deleted as the result of an ectopic homologous recombination occurring between LTR sequences.

    FIG. 5.— Comprehensive description of the DUP240 relics and pseudo-ORFs in the 15 Saccharomyces cerevisiae strains. Black triangles show the position of the –1 or +1 nt mutation in pseudo-YAR023c. Plus and minus signs indicate the presence and the absence of the considered element, respectively. nd = not determined. a, haplotypes that stem from the heterozygous diploid strain after sporulation; b, presence or absence of DUP240 ORFs in the tandem I and VII loci (previously determined by Leh-Louis et al. (2004b)); and c, both copies are present in the same DUP240 tandem I array.

    Unlike relic-DUP240-I and -VII, relic-DUP240-XIII is present in all the 15 strains tested and harbors only point mutation variations (fig. 5). These features made it possible to determine the number of VPs for this relic locus (727 bp) and for its corresponding 5' and 3' IRs (976 and 193 bp, respectively). The VP value is calculated from a multiple sequence alignment and corresponds to all the positions where a nucleotide substitution has occurred in at least 1 of the 15 strains. Leh-Louis et al. (2004a) previously calculated the VP values of 10 arbitrarily chosen genes in the same 15 strains in order to estimate the gene nucleotide sequence divergence among the S. cerevisiae species. A VP value greater than these reference VP values indicates that the sequence tested is subject to a more relaxed selection pressure than the one applied on the reference sequences, and vice versa. The VP values of our IRs (3.7 and 4.7) are twice those previously determined, which indicates that, as expected, these IRs are subject to more relaxed selection pressures. The VP value of relic-DUP240-XIII (worked out at 6.2) suggests that this DUP240 relic sequence is diverging faster than the 10 control genes and even faster than its flanking IRs.
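
    As a rough consistency check, the normalization described in Materials and Methods can be written out explicitly. The symbols N_var (number of variable positions) and L (sequence length in nucleotides) are notation introduced here, not the authors', and the count below is back-calculated from the rounded published value, so it is only approximate.

```latex
\[
\mathrm{VP} = 100 \times \frac{N_{\mathrm{var}}}{L}
\qquad\Longrightarrow\qquad
N_{\mathrm{var}} \approx \frac{6.2 \times 727}{100} \approx 45
\]
```

    In other words, a VP value of 6.2 over the 727-bp relic locus corresponds to about 45 variable positions among the 15 strains.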

    Analysis of the pseudo-ORF loci in the 15 S. cerevisiae strains tested showed that 7 of them have undergone a frameshift mutation (–1 or +1 nt) in an A-tract located at the 5' end of pseudo-YAR023c, whereas the remaining strains harbor a complete YAR023c ORF (fig. 5). Similar results were obtained on the pseudo-YAR029w locus, where a complete DUP240 ORF (DUP B) was identified in strain CLIB382 (fig. 5; Leh-Louis et al. 2004b). Our intraspecies analysis of the degenerate DUP240 loci therefore showed that a nondegenerate DUP240 ORF was indeed detected in some strains at the DUP240 pseudo-ORF loci, whereas this never occurred in the case of the relic loci.

    Discussion

    Previous comparative analyses have shown that the DUP240 multigene family evolves according to the birth and death model. Indeed, the three DUP240 solo ORFs have been shown to be subject to different evolutive constraints (Leh-Louis et al. 2004a), while the DUP240 tandem loci are characterized by the emergence of new paralogs and the disappearance of some others (Leh-Louis et al. 2004b). This loss of some DUP240 paralogs through locus deletion constitutes only a part of the nonfunctionalization process, the other aspect being the loss through accumulation of deleterious mutations leading to pseudogenes and gene relics (Prince and Pickett 2002; Lafontaine et al. 2004). Many studies have already screened genomes for slightly degenerate copies of genes. Pseudogenes have been identified in numerous organisms ranging from bacteria to humans (Cole et al. 2001; Homma et al. 2002; Zhang and Gerstein 2004). More highly degenerate gene remnants, or gene relics, are obviously more difficult to detect and were first identified in some S. cerevisiae S288C IRs (Fischer et al. 2001). Subsequently, Zhang, Harrison, and Gerstein (2002) surveyed the occurrence of conserved protein motifs in the noncoding IRs of four eukaryotic genomes. They used 1,319 short well-characterized protein motifs from the PROSITE database (http://www.expasy.org/prosite/), such as leucine zipper or zinc finger motifs, 14 amino acids in size on average, to scan each of the six translated frames. Using this method, they found 67 ancient protein fragments in fly, 34 in worm, 21 in human, and 6 in the yeast S. cerevisiae. However, gene relics are characterized by the presence of numerous frameshift mutations that scatter the ancestor gene sequence among the three possible reading frames. In order to circumvent this problem, Lafontaine et al. (2004) performed sequence comparisons at the nucleotide level without consideration of the reading frames. Their systematic search for gene relics in the IRs of the S. cerevisiae strain S288C brought to light the existence of 120 relics, 4 of which involved the DUP240 family. Two of these relics, called #9 and #10, correspond, in fact, to the single remnant ORF relic-DUP240-I, while relics #8 and #61 correspond to the 5' lost sequence of pseudo-YAR029w and relic-DUP240-VII, respectively.

    Our systematic screening procedure is based on a TBlastN analysis using the entire peptide sequence of all the known Dup240 proteins in order to identify very weak similarities encoded at the nucleic acid level. This first step of screening is followed by the search for peptide motifs that are characteristic of the Dup240p, in the three possible reading frames simultaneously. We thereby identified a new locus related to the DUP240 family (relic-DUP240-XIII) and additional Dup240p motifs flanking YAR023c and YAR029w, indicating that in S288C these annotated ORFs are, in fact, pseudo-ORFs. This strategy made it possible to determine exactly the coordinates of the Dup240p motifs and those of possible inserted sequences. The DUP240 multigene family can now be said to consist of eight ORFs, two pseudo-ORFs, and three additional relics. By mapping new genetic components in previously annotated IRs, our study therefore constitutes a further refinement step in genome annotation, and the results obtained show that short conserved protein motifs provide a useful tool for detecting and accurately mapping both slightly and highly degenerate copies of genes. Zhang, Harrison, and Gerstein (2002) referred to this kind of analysis as "genomic paleontology" because the aim of such studies is to search for molecular fossils, just as paleontologists dig for animal fossils.

    Gene relics are a direct consequence of gene redundancy and provide useful markers for retracing chromosomal rearrangements and, hence, genome evolution. For example, the S288C tandem I and VII loci, which have five and two directly repeated DUP240 ORFs, respectively, constitute a highly conserved sequence unit between chromosomes I and VII (fig. 1; Feuermann et al. 1997). The tandem VII locus can be assumed to originate from a duplication of the tandem I locus onto chromosome VII, or vice versa, prior to reshaping events that may have been responsible for the current S288C chromosomal organization. The presence of similar DUP240 relics at the 3' end of both tandem I and VII loci, both interrupted by the same noncontiguous mitochondrial DNA stretches, strongly supports this hypothesis.

    The present data also show what a useful tool intra- and interspecies analyses can be for gene relic detection because the possibility of finding short conserved protein motifs depends largely on the choice of the most closely related sequence. For example, the DUP B paralog identified in strain CLIB382 during a previous comparative analysis on the 15 S. cerevisiae strains (Leh-Louis et al. 2004b) allowed us to accurately determine the impairments affecting pseudo-YAR029w, which was not possible with only the S288C DUP240 sequences. Likewise, despite the weak E values obtained with TBlastN, relic-DUP240-XIII was found using all the known DUP240 paralogs (the S288C ones plus those previously identified). Subsequent comparisons with the data available on other Saccharomyces sensu stricto species show that relic-DUP240-XIII is closely related to a Saccharomyces bayanus DUP240 ORF (WashU_Sbay_c588.12 [Cliften et al. 2003], MIT_Sbay_c579_18538 [Kellis et al. 2003]), which probably corresponds to its true ortholog because it is located with the same orientation in the same syntenic group ERB1-FAR3 (fig. 1). The peptide sequence of this S. bayanus ORF also provided a useful means of mapping the Dup240p motifs more exactly at the S288C relic-DUP240-XIII locus.

    Lafontaine et al. (2004) observed a continuum of sequence degeneration in the yeast S288C relic set. The DUP240 gene family perfectly illustrates this process because the following S288C members display more and more advanced stages in the decay process: (1) pseudo-YAR023c, (2) pseudo-YAR029w, (3) relic-DUP240-I and -VII, and (4) relic-DUP240-XIII. However, our intraspecies study shows a more variable situation. In addition to the continuum of sequence degeneration among the different relic loci in the same strain, we showed the existence of such a continuum for the same locus among several strains. Indeed, six other S. cerevisiae strains also present the YAR023c locus at the pseudo-ORF state, but the remaining ones harbor a full-size YAR023c ORF (fig. 5). The S288C pseudo-YAR029w has lost a part of its coding sequence at both 5' and 3' ends. We revealed two less degenerate pseudo-YAR029w, which have lost a part of their coding sequence at either the 5' (strain YIIc17) or the 3' end (strains CLIB219 and CLIB413 E1), and even a full-size ORF (DUP B) in strain CLIB382 (fig. 5). The S288C pseudo-YAR029w therefore seems to evolve towards a relic state and thus raises the question as to how to define the borderline between pseudogenes and gene relics.

    The gene loss, resulting from either locus deletion or gradual erasure, counteracts the ongoing gene duplication process. Studies on multigene families such as the DUP240 family can bring to light the various stages in the decay process. Pseudogenes freed from selection pressure can drift through a series of random mutations (Blake, Hess, and Nicholson-Tuell 1992; Ophir and Graur 1997; Ophir et al. 1999), resulting in gene relics that will continue to accumulate mutations until they are completely erased from the genome sequence. Relic-DUP240-XIII well illustrates this erasure process because its sequence diverges faster than its flanking regions, judging from the variable position calculation. Our comparative intraspecies analysis therefore shows that within a given gene family different loci can undergo gene erasure, a process that is not only species dependent but also strain dependent.

    Acknowledgements

    We are grateful to Ingrid Lafontaine for her advice on the use of DNA dot matrix and to Philippe Hammann and Malek Alioua for automated DNA sequencing in the Strasbourg Institut de Biologie Moléculaire des Plantes/CNRS facilities. This work was supported in part by a European Union grant, Comprehensive Yeast Genome Database (QLRI CT 1999 01333), and by the Génolevures-2 sequencing consortium (GDR CNRS 2354). B.W. is supported by a grant from the French Ministère de l'Education Nationale, de la Recherche et de la Technologie.

    References

    Blake, R. D., S. T. Hess, and J. Nicholson-Tuell. 1992. The influence of nearest neighbors on the rate and pattern of spontaneous point mutations. J. Mol. Evol. 34:189–200.

    Blanchard, J. L., and G. W. Schmidt. 1996. Mitochondrial DNA migration events in yeast and humans: integration by a common end-joining mechanism and alternative perspectives on nucleotide substitution patterns. Mol. Biol. Evol. 13:537–548.

    Cliften, P., P. Sudarsanam, A. Desikan, L. Fulton, B. Fulton, J. Majors, R. Waterston, B. A. Cohen, and M. Johnston. 2003. Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science 301:71–76.

    Cole, S. T., K. Eiglmeier, J. Parkhill et al. (44 co-authors). 2001. Massive gene decay in the leprosy bacillus. Nature 409:1007–1011.

    Farabaugh, P. J., and G. R. Fink. 1980. Insertion of the eukaryotic transposable element Ty1 creates a 5-base pair duplication. Nature 286:352–356.

    Feuermann, M., J. de Montigny, S. Potier, and J. L. Souciet. 1997. The characterization of two new clusters of duplicated genes suggests a ‘Lego’ organization of the yeast Saccharomyces cerevisiae chromosomes. Yeast 13:861–869.

    Fischer, G., C. Neuveglise, P. Durrens, C. Gaillardin, and B. Dujon. 2001. Evolution of gene order in the genomes of two related yeast species. Genome Res. 11:2009–2019.

    Gafner, J., and P. Philippsen. 1980. The yeast transposon Ty1 generates duplications of target DNA on insertion. Nature 286:414–418.

    Goffeau, A., R. Aert, M. L. Agostini-Carbone, A. Ahmed, M. Aigle, and L. Alberghina. 1997. The yeast genome directory. Nature 387(Suppl.):5–105.

    Harrison, P. M., H. Hegyi, S. Balasubramanian, N. M. Luscombe, P. Bertone, N. Echols, T. Johnson, and M. Gerstein. 2002. Molecular fossils in the human genome: identification and analysis of the pseudogenes in chromosomes 21 and 22. Genome Res. 12:272–280.

    Homma, K., S. Fukuchi, T. Kawabata, M. Ota, and K. Nishikawa. 2002. A systematic investigation identifies a significant number of probable pseudogenes in the Escherichia coli genome. Gene 294:25–33.

    Kellis, M., N. Patterson, M. Endrizzi, B. Birren, and E. S. Lander. 2003. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423:241–254.

    Lafontaine, I., G. Fischer, E. Talla, and B. Dujon. 2004. Gene relics in the genome of the yeast Saccharomyces cerevisiae. Gene 335:1–17.

    Leh-Louis, V., B. Wirth, L. Despons, S. Wain-Hobson, S. Potier, and J. L. Souciet. 2004a. Differential evolution of the Saccharomyces cerevisiae DUP240 paralogs and implication of recombination in phylogeny. Nucleic Acids Res. 32:2069–2078.

    Leh-Louis, V., B. Wirth, S. Potier, J. L. Souciet, and L. Despons. 2004b. Expansion and contraction of the DUP240 multigene family in Saccharomyces cerevisiae populations. Genetics 167:1611–1619.

    Lynch, M., and J. S. Conery. 2000. The evolutionary fate and consequences of duplicate genes. Science 290:1151–1155.

    Massingham, T., L. J. Davies, and P. Lio. 2001. Analysing gene function after duplication. Bioessays 23:873–876.

    Mighell, A. J., N. R. Smith, P. A. Robinson, and A. F. Markham. 2000. Vertebrate pseudogenes. FEBS Lett. 468:109–114.

    Ophir, R., and D. Graur. 1997. Patterns and rates of indel evolution in processed pseudogenes from humans and murids. Gene 205:191–202.

    Ophir, R., T. Itoh, D. Graur, and T. Gojobori. 1999. A simple method for estimating the intensity of purifying selection in protein-coding genes. Mol. Biol. Evol. 16:49–53.

    Poirey, R., L. Despons, V. Leh, M. J. Lafuente, S. Potier, J. L. Souciet, and J. C. Jauniaux. 2002. Functional analysis of the Saccharomyces cerevisiae DUP240 multigene family reveals membrane-associated proteins that are not essential for cell viability. Microbiology 148:2111–2123.

    Prince, V. E., and F. B. Pickett. 2002. Splitting pairs: the diverging fates of duplicated genes. Nat. Rev. Genet. 3:827–837.

    Ricchetti, M., C. Fairhead, and B. Dujon. 1999. Mitochondrial DNA repairs double-strand breaks in yeast chromosomes. Nature 402:96–100.

    Vanin, E. F. 1985. Processed pseudogenes: characteristics and evolution. Annu. Rev. Genet. 19:253–272.

    Yu, X., and A. Gabriel. 1999. Patching broken chromosomes with extranuclear cellular DNA. Mol. Cell 4:873–881.

    Zhang, Z., and M. Gerstein. 2004. Large-scale analysis of pseudogenes in the human genome. Curr. Opin. Genet. Dev. 14:328–335.

    Zhang, Z. L., P. M. Harrison, and M. Gerstein. 2002. Digging deep for ancient relics: a survey of protein motifs in the intergenic sequences of four eukaryotic genomes. J. Mol. Biol. 323:811–822.

    (Bénédicte Wirth, Véroniqu)