当前位置: 首页 > 期刊 > 《核酸研究》 > 2004年第11期 > 正文
编号:11371957
A probabilistic model of 3' end formation in Caenorhabditis elegans
http://www.100md.com 《核酸研究医学期刊》
     Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK

    * To whom correspondence should be addressed. Tel: +44 1223 834244; Fax: +44 1223 494919; Email: rd@sanger.ac.uk

    ABSTRACT

    The 3' ends of mRNAs terminate with a poly(A) tail. This post-transcriptional modification is directed by sequence features present in the 3'-untranslated region (3'-UTR). We have undertaken a computational analysis of 3' end formation in Caenorhabditis elegans. By aligning cDNAs that diverge from genomic sequence at the poly(A) tract, we accurately identified a large set of true cleavage sites. When there are many transcripts aligned to a particular locus, local variation of the cleavage site over a span of a few bases is frequently observed. We find that in addition to the well-known AAUAAA motif there are several regions with distinct nucleotide compositional biases. We propose a generalized hidden Markov model that describes sequence features in C.elegans 3'-UTRs. We find that a computer program employing this model accurately predicts experimentally observed 3' ends even when there are multiple AAUAAA motifs and multiple cleavage sites. We have made available a complete set of polyadenylation site predictions for the C.elegans genome, including a subset of 6570 supported by aligned transcripts.

    INTRODUCTION

    The 3' ends of most protein-coding transcripts terminate with a poly(A) tail that is important for nuclear export, stability and efficient translation. The tail is added via a multi-protein complex that recognizes sequence elements in the 3'-untranslated region (3'-UTR), cleaves the nascent transcript, and adds adenylate residues in a template-independent reaction. The biochemical details of the process have been studied most intensively in mammals and yeast . In mammals, the two sequence features that are most important are a highly conserved AAUAAA motif located 10–30 nt upstream of the cleavage site and a GU-rich element located 20–40 nt downstream of the cleavage site. Together, these two elements specify the location of the cleavage site, which is often at a CA dinucleotide. The Cleavage and Polyadenylation Specificity Factor (CPSF) has been shown to bind to the AAUAAA motif and Cleavage Stimulation Factor (CstF) to the GU-rich element.

    In Saccharomyces cerevisiae, the 3'-UTR features are slightly different. The AAUAAA motif is not as highly conserved and there is no downstream GU-rich element. Instead, there is a UA-rich sequence upstream of the AAUAAA motif. The protein that binds the AAUAAA motif is Rna15; the UA-rich sequence is bound by Hrp1 (5–7). The cleavage site is 10–30 nt downstream of the AAUAAA motif and has the sequence Y(A)n. In addition to these features, U-rich sequences immediately flanking the cleavage site also appear to be important (8).

    We are interested in understanding 3' end formation in Caenorhabditis elegans. Previous studies have found the presence of the AAUAAA motif 7–22 nt upstream of the cleavage site but none of the other common elements (9,10). Furthermore, only 50% are AAUAAA; many single base variants are seen, especially AAUGAA. One unusual feature of 3' end formation in C.elegans is that the process is associated with trans-splicing when genes are in operons. In these circumstances, 3' end formation of the upstream gene has been shown to be functionally upstream of SL2 trans-splicing of the downstream gene (11). As in mammals, CPSF binds the AAUAAA motif, but unlike in mammals (12), there is evidence that efficient 3' end formation can take place in the absence of a putative CstF binding site (10). CstF is present, but its role is apparently to increase the local concentration of SL2 at the trans-splice site and not to specify the position of the cleavage site (11).

    Computational detection of the 3' end has been previously attempted by several groups, though this work has mainly been carried out in Homo sapiens. These include a linear discriminant function (13), trained to recognize contextual characteristics of sequences near the AAUAAA motif, such as up- and downstream n-mer frequencies. Another group (14) used a quadratic discriminant function to learn weight matrices for the AAUAAA motif and the GU-rich element. Weight matrices were also assembled from alignments of a large number of sequences containing AAUAAA motifs discovered from the expressed sequence tag (EST) data (15). This group adjusted the width of putative weight matrices up- and downstream of the AAUAAA motif to optimize prediction accuracy though maximum discrimination was found using just the AAUAAA motif and the local downstream region. Instead of using weight matrices, an investigation into 3' end processing in S.cerevisiae (16) used a hidden Markov model (HMM) to describe nucleotide frequencies in well-characterized words in the vicinity of the cleavage site, linked by background frequencies elsewhere.

    The current model for sequence features involved in 3' end formation in C.elegans is focussed entirely on the AAUAAA motif (9). From a predictive standpoint, this means that one typically scans a weight matrix across the sequence and annotates those sites scoring over a particular threshold. This is not a reliable method of prediction. In order to more thoroughly examine sequence features involved in 3' end formation in C.elegans, we construct a dataset derived from experimentally verified cleavage sites. We find that in addition to the AAUAAA motif there also exist characteristic nucleotide distributions at the cleavage site and at various positions within the 3'-UTR. We build a probabilistic model of 3' end formation using a generalized hidden Markov model and demonstrate that this model not only predicts the 3' end more accurately than heuristics based on weight matrices, but also predicts the heterogeneity observed at the cleavage site itself.

    MATERIALS AND METHODS

    22 156 candidate 3'-UTRs up to 1000 bp long were extracted from WormBase release WS110 (http://ws110.wormbase.org). Sequences were truncated if they overlapped downstream genes on the same strand. 216 943 C.elegans transcripts (cDNAs and ESTs) were retrieved from EMBL/GenBank. The transcripts were processed with a Perl script that used the following rules to identify transcripts containing a poly(A) tail.

    Transcripts must be at least 200 nt long.

    Any sequence with 6 or more terminal A's is kept.

    Since the vector may be present at the end of the sequence, sequences with runs of mostly A's near the end were also kept. The Perl regular expression used to define the run of As with a potential sequencing error and up to 30 bp of vector was: /(A{3,1000}.?A{3,1000}) (.{0,30})$/.

    Of the transcripts, 5306 passed these tests and the final 200 nt were searched against the candidate 3'-UTRs with BLASTN version 2.0MP-WashU 23-May-2003 (W. Gish unpublished work; http://blast.wustl.edu) using parameters W = 30 M = 1 N =–3 Q = 3 R = 3. Of the 3'-UTRs, 1810 had matching transcripts. After removing duplicate sequences caused by alternate gene isoforms, we were left with 1468. We created multiple alignments of each unique candidate 3'-UTR and their matching transcripts using a Perl script that employed Bioperl libraries (17). Of these, 1156 had at least one matching transcript that diverged from the genomic sequence in what appeared to be a poly(A) tail. Of these, we were able to map a unique AATAAA motif and cleavage site for 961 sequences (see Results).

    Performance evaluation of weight matrices and HMMs were carried out using a 5-fold cross-validation strategy. four-fifth of the data was used for training and one-fifth for testing. We averaged the results for five non-overlapping test sets. The length parameters were fixed and not estimated for each training set. This is important for the spacer state where the length distribution was calculated from unambiguous sites that represented a minority of the data (see Results). Transition, emission and length parameters were estimated with a variety of Perl scripts and are available as supplementary files. HMM decoding algorithms were written in C and Java and are available at http://www.sanger.ac.uk/Projects/C_elegans/POLYA (the java program uses BioJava libraries found at http://www.biojava.org).

    RESULTS

    Preamble

    Since our study examines genomic DNA, we use T rather than U to describe sequence features. In addition, we distinguish AATAAA, the exact hexanucleotide, from ‘the AATAAA motif’, which is the signal recognized by CPSF and which may differ from AATAAA at one or more nucleotides.

    Detection of cleavage sites

    In order to look more closely at the signals involved in 3' end formation in C.elegans we began by assembling a dataset consisting of experimentally verified cleavage sites. The key to this procedure is mapping the point where a poly(A) tail present in a cDNA diverges from its parent genomic sequence. Approximately half of C.elegans genes have some cDNA evidence, normally in the form of expressed sequence tags. However, in many cases it is not possible to detect the cleavage site for the following reasons:

    Most C.elegans ESTs in GenBank have no initial poly(T) tract corresponding to the poly(A) tail because the initial part of the sequencing read was clipped off before submission.

    A gene may not have any cDNAs that align to its 3'-UTR.

    The point at which the cDNA and genome fail to align may not resemble a poly(A) tail due to cloning or sequencing artefacts.

    The 3'-UTR may be shorter or longer than our length thresholds.

    Cleavage sites are imprecise

    In some cases, alignment of a cDNA identifies an unambiguous cleavage site, as in Figure 1a. However, in many other cases it was difficult to define a unique cleavage site for the following reasons:

    The cleavage site may exist within a run of A's in the genomic sequence (Figure 1b). This occurs in the majority of cleavage sites (77%), and it is not possible to determine where the actual cleavage occurred. We call this class of cleavage site ambiguous.

    There may be multiple AATAAA motifs that signal multiple distinct cleavage sites. These may be located hundreds of nucleotides apart or they may be much closer, even overlapping (Figure 1c).

    Even when the position of the AATAAA is clearly defined and there are no A's in the genomic DNA, there may be multiple distinct cleavage sites clustered together (Figure 1d).

    Figure 1. Properties of cleavage sites and 3'-UTRs AATAAA motifs are boxed in yellow; the cleavage site is seen where the mRNA diverges from genomic sequence (nucleotides not present in the genome are in lowercase). Where the divergence occurs downstream of genomic A's, the cleavage site is ambiguous and is boxed in green. (a) AC3.5 has one aligned mRNA; cleavage is detected between two G residues. (b) In C07A12.4a, the precise cleavage position cannot be assigned accurately, but it must occur in the region shown in green. (c) F17C11.9a has two distinct cleavage sites, as shown by the two different mRNAs diverging at different points. (d) F26E4.6 has seven aligned mRNAs and four distinct cleavage sites over five bases. (e) Length distribution for the distance between the AATAAA motif and the cleavage site for 106 sequences containing a single, unambiguous cleavage site and a single unique AATAAA within 40 nt upstream as in 1a. (f) Bars: histogram of observed lengths of 3'-UTRs from WS110. Line: expectation from a geometric distribution with a mean of 200.

    Of the 1156 poly(A) tail alignments found in this way, 111 had a single, non-overlapping AATAAA motif and an unambiguous cleavage site (Figure 1a). 855 sequences were found to have ambiguous cleavage sites, as in Figure 1b. The remaining sequences were of the form shown in Figure 1c and d. Those sequences with well-defined AATAAA motifs (such as in Figure 1a and b) were retained for use in model building.

    Distance from AATAAA to cleavage site

    To simplify model building and evaluation, we desired to create a set of sequences where the cleavage site and AATAAA motif were clearly defined. The imprecise nature of the majority of cleavage sites and the AATAAA consensus makes this a difficult task. Our approach to this problem was to find the maximum likelihood position for both sites by using a length distribution from the AATAAA motif to the cleavage site and a weight matrix for the AATAAA motif itself. We identified 111 sequences with an unambiguous cleavage site and a unique, exact AATAAA within 40 nt upstream. Five sequences were removed because they were clear outliers containing an inexact match to AATAAA at a more likely position. The length distribution of the remaining 106 sequences is shown in Figure 1e. The distribution is very tight, having a range from 10 to 18 and a mode of 14. When visually inspecting sequences, people often use a heuristic such as 10–30 nt. The observed distribution has an entropy of 2.63 bits which is substantially more specific than a flat 10–30 nt distribution which has an entropy of 4.39 bits.

    Assigning the AATAAA motif and cleavage site

    To assign the most likely inexact AATAAA motif and the most likely ambiguous cleavage site to the rest of our data, we computed the probability of the AATAAA motif and its length from the cleavage site for all possible AATAAA motifs and cleavage sites. The AATAAA motif probability was found from a previously derived weight matrix (9). The length probability came from a smoothed version of Figure 1e that allows a range from 5 to 30 nt. From 1156 sequences we were able to assign a unique maximum likelihood AATAAA motif and cleavage site for 961 sequences, with the remainder being of the forms seen in Figure 1c and d. Of the 961 sequences, there were 940 containing an AATAAA motif seen at least twice (Table 1) and 21 with an AATAAA motif such as AATAAC that was seen only once. We can be less confident about the patterns seen only once, so they were removed from the model building stage.

    Table 1. AATAAA motif observations

    3'-UTR length

    A histogram of 3'-UTR lengths is shown in Figure 1f. The range is highly variable, with 97% falling between 1 and 1000 nt, but most are relatively short. Approximately 70% are <250 nt.

    Nucleotide composition in the vicinity of the cleavage site

    To examine sequence features characteristic of 3' end formation, we aligned genomic sequences anchored at the cleavage site and plotted the nucleotide frequencies 80 bp upstream and 40 bp downstream (Figure 2a). Nucleotide frequencies can be used to partition the area into distinct regions which differ from each other and from the genomic background distribution of 32% each for A and T, and 18% each for C and G. The AATAAA motif is clearly visible 15–20 bp upstream of the cleavage site despite the length heterogeneity between the cleavage site and the motif. The cleavage site itself appears in a T-rich region with a spike of As at the cleavage site. The region between the AATAAA motif and cleavage site is also T-rich. Downstream of the cleavage site the nucleotide composition is T-rich but it gradually changes to typical genome levels by 30–40 bp downstream. The 5'-most sequence upstream of the AATAAA motif is T-rich near the AATAAA motif but farther upstream it has a pyrimidine bias, with C also favoured over G as previously observed (S. Jones, unpublished work).

    Figure 2. Nucleotide composition near the cleavage site. (a) Nucleotide frequencies are shown from –80 to +40 with respect to the cleavage site at 0. Frequencies farther up- and downstream are given in the UTR and Gen columns outside the main graph. Gen corresponds to the average nucleotide composition in C.elegans genomic DNA. (b) State transition diagram for a generalized HMM that describes the sequence composition in the vicinity of the cleavage site. Circular states have geometric distributions, rectangular states have fixed lengths, and the diamond state has a distribution similar to Figure 1e but with a range from 5 to 30 nt.

    A probabilistic model of 3' end formation

    We built a model based on six distinct regions in the vicinity of the 3' end. We define these in the 5' to 3' direction as follows:

    UTR—highly variable in length and tends to be pyrimidine-rich

    AATAAA—6 nt corresponding to the AATAAA motif (Table 2)

    SP—T-rich spacer region of restricted length (Figure 1e)

    CS—4 nt corresponding to the cleavage site (Table 3)

    DS—short, variable length T-rich sequence downstream of the cleavage site

    G—genomic sequence

    Using these six regions we designed an HMM to describe 3' end formation (Figure 2b). HMMs are convenient statistical constructs for modelling biological sequences because they allow one to represent sequence features with characteristic compositions and lengths as a network of interconnected states with emission and transition probabilities (18). In a traditional HMM, the expected length within each state follows a geometric distribution (19). Since the length of the SP state has a highly restricted range and a clearly non-geometric distribution we employed a generalized HMM rather than a traditional HMM. Generalized HMMs are commonly used in gene prediction algorithms (20–24). In these HMMs, each state emits sequences whose lengths are determined by state-specific probability distributions. The UTR, DS and G states have geometric distributions with means 200, 15 and 680 respectively. The AATAAA and CS states have constant distributions of 6 and 4 nt. The SP state has a distribution similar to Figure 1e with minimum and maximum values of 5 and 30 nt. Decoding algorithms for generalized HMMs are slightly more complicated than the standard Viterbi and forward/backward algorithms, but ours still scales linearly with the length of the search sequence because the non-geometric states have restricted ranges.

    Table 2. AATAAA weight matrix

    Table 3. Cleavage site weight matrix

    Prediction of 3' ends

    We evaluated the performance of our HMM by comparing it to heuristic methods based on an AATAAA weight matrix (9). Since cleavage sites are imprecise, we calculated the accuracy based on identifying the correct AATAAA motif and not the cleavage site. Table 4 shows that a crude scan for all exact matches to AATAAA within 1000 nt of the stop codon correctly identifies 56% of signals, though 46% of the total predictions are spurious. If we propose that the 5'-most exact match to AATAAA is the signal, the proportion of signals detected correctly is reduced by 5% but there is an 8% increase in specificity. Using the first maximum score allows for those sequences that contain a mismatch variant of the AATAAA motif; instead of looking for exact matches to AATAAA, we scan with a weight matrix and call the highest scoring hexamer a hit. In the case of multiple identical hits, the 5'-most one is reported, as this would be the first one exposed on the nascent transcript. This has a sensitivity and specificity of 60%. A far greater sensitivity (94%) is achieved if we report all exact matches to AATAAA and all possible single base mismatches, though there is a large penalty to specificity.

    Table 4. Accuracy of various weight matrix and HMM regimes for detecting AATAAA motifs

    Two different strategies were used to evaluate the HMM: Viterbi and posterior decoding. The Viterbi algorithm finds a single maximum likelihood AATAAA motif in the sequence while posterior decoding determines the probability of the AATAAA motif at each point in the sequence. Posterior decoding therefore allows one to find the most likely motif and other, less likely ones. For the posterior we used a probability threshold of 0.1, which means that fewer than 10 AATAAA motifs can be found. The HMM strategies are far more accurate than the weight matrix methods. The Viterbi algorithm recorded 70% sensitivity and specificity. Posterior decoding maintained a similar 68% specificity but significantly increased the sensitivity to 82%. These results indicate that the context in which an AATAAA motif appears is an important factor for 3' end formation. Furthermore, it suggests that in cases where the maximum likelihood AATAAA motif is incorrect, the observed AATAAA motif can be found by looking at other high-scoring positions.

    The stochastic nature of 3' end site selection

    While collecting our dataset of unique AATAAA and cleavage sites we selected against those genes with high cDNA coverage, as genes containing a larger number of matching transcripts tended to have multiple distinct cleavage sites, such as in Figure 1d. Figure 3a shows the distribution of cleavage sites at each nucleotide for a 3'-UTR with 31 cDNA matches. According to our model, the posterior probability of the AATAAA motif indicates that there is only one such motif in the region. The posterior probability of the cleavage site shows a multi-modal distribution. The frequency of observed cleavage sites is very similar to the posterior probability. Figure 3b shows a case where there are multiple AATAAA motifs and cleavage sites. Here too, the posterior probability of the cleavage site is similar to the observed frequencies. The fact that our model fits the observed distribution so well suggests that it is capturing most, if not all, of the local information used to select the cleavage site.

    Figure 3. Posterior probabilities mirror observed cleavage sites. The posterior probability of the AATAAA motif and cleavage site are shown in red and blue lines, respectively. The observed frequency of cleavage sites is indicated by a green line. When the cleavage site is ambiguous, the frequency is averaged over the ambiguous positions, which gives the green line a flat peak. (a) 31 mRNAs aligned to gene ZK652.4 show that there are multiple, tightly clustered cleavage sites. (b) 38 mRNAs aligned to gene R09B3.3 show a broad cluster of cleavage sites which are the result of three predicted AATAAA motifs.

    Genome-wide scan

    We applied the HMM to predict cleavage sites for all the genes in the C.elegans genome. There are 22 168 different annotated genes in WormBase release WS110 (http://ws110.wormbase.org). For 9710 of these, a 3'-UTR is annotated in WormBase by extending from the stop codon to the 3' end of the 3'-most EST match assigned to the gene. For each gene, we used our HMM to search the 1000 bases 3' of each annotated stop codon, and annotated the most likely cleavage site as determined by the Viterbi algorithm (available in the supplementary data). We expect 70% of these to be correct, from our previous experiments (Table 4). Figure 4 shows the frequency distribution of the distance between WormBase 3'-UTRs and the Viterbi prediction for each of their 3'-UTR candidates. Peaks are visible around –65 and –10, presumably corresponding to different EST clipping regimes. Based on the graph, we suggest that those predictions that extend the WormBase 3'-UTR up to 80 nt are highly likely to be correct because the EST was clipped short. Those predictions that are too short by up to 10 nt are consistent with the local heterogeneity of the cleavage site and are also likely to be correct. The proportion of predictions falling within the range –80 to +10 is 70%, as expected. This results in a set of 6570 high confidence identifications of C.elegans cleavage sites, which could be used for further studies and will be available through WormBase.

    Figure 4. Difference between lengths of 3'-UTRs determined by EST alignment and our model Frequency distribution of the difference in length observed when subtracting predicted 3'-UTR length from WormBase 3'-UTR lengths (9356 sequences).

    Testing a scanning model for 3' end recognition

    Under a model in which the cleavage and polyadenylation machinery scans along RNA in a 5' to 3' direction, misidentification of the cleavage site may lead to truncated proteins if the cleavage occurs in the coding region. In the experiments above, we only looked for cleavage sites downstream of the stop codon. In order to test weight matrix approaches and our HMM under conditions of a scanning model, we evaluated the methods on virtual mature mRNAs containing complete coding sequences plus 1000 bp downstream. In these experiments, we modified the HMM by including 3 coding states which correspond to the nucleotide frequencies observed in first, second and third positions within codons. Table 4 shows that the weight matrix methods find a large number of false positives in the coding sequence. However, the specificity of the HMM degrades only slightly, and the performance difference of the posterior decoding is particularly small. If the biological machinery scans along the mRNA ‘looking’ for cleavage sites, it is clearly advantageous to ‘see’ more than just the AATAAA motif.

    DISCUSSION

    In this study, we have made a significant step to improving 3' end prediction in C.elegans by developing an HMM that captures global features present in the 3'-UTR. HMMs have become popular in the sequence analysis community because they offer a method to incorporate diverse sequence features under a rigorous probabilistic framework, and because they have established decoding algorithms. HMMs are stochastic models and this fits well with cleavage site selection, which appears to be a stochastic process. In cases where there are numerous transcripts aligned downstream of a stop codon, we found that the posterior probability of cleavage sites derived from the HMM mirrors the frequencies of experimentally observed cleavage sites. This suggests that the HMM faithfully represents the local requirements of 3' end formation.

    In order to determine why our HMM missed 20% of the real AATAAA motifs, we examined the 3'-UTRs of the incorrect predictions in WormBase using ACEDB (http://www.acedb.org). In 30% of cases, there were additional transcripts without poly(A) tails that supported the predicted 3' end. These 3' ends may therefore fall into the class depicted in Figure 1c with multiple AATAAA motifs. Unfortunately, we do not have access to the raw traces and cannot extend the sequence into the poly(A) tails to find the cleavage site. Thus, we believe it is likely that many of the false positive predictions are real sites. Another class where we suspect the predictions are real are instances where the predicted and observed AATAAA motifs were a few nucleotides apart. This occurs in 5% of the incorrect predictions. Our original maximum likelihood assignment of the AATAAA motif and the cleavage site was based on a weight matrix for the AATAAA motif and a probability distribution for the distance to the cleavage site. Our HMM is a more explicit model of the 3' end and in these cases the HMM may be more accurate than our initial maximum likelihood ‘observation’.

    Approximately 25% of the missed predictions (5% of the whole set) resulted from a variety of oversights in collecting the data. We assumed that unlabelled genomic sequence downstream of a terminal exon contains a 3'-UTR followed by genomic sequence. This is not always the case. Some 3'-UTRs contain introns, which means our HMM and the polyadenylation machinery see different sequences. Some 3' regions contain transcripts that do not appear to correspond to the 3'-UTR of the labelled gene and may instead belong to novel genes. There were also cases where the aligned transcript had a better match elsewhere in the genome. Finally, there were clerical errors that allowed some known, nearby genes into the region assayed for 3' end formation, and in these cases the HMM sometimes predicted the AATAAA motif in introns or UTRs of the downstream gene.

    In the largest fraction of missed AATAAA motifs, 40% of the time, we could not determine the cause of the error. It may be that with greater transcript coverage some of these 3' ends will turn out to have multiple AATAAA motifs. Alternatively, these 3' ends may form a different class, perhaps with specific factors that direct their positioning. We detected no unusual compositional biases though, so the reasons for these incorrect predictions remains a mystery. Taken together, based on the fact that we can explain a number of the incorrect predictions as potentially correct predictions, we believe that our HMM is more accurate than we can reliably report, and may approach 90% sensitivity.

    The HMM contains states for the AATAAA motif, the cleavage site and regions on either side of these features. It does not explicitly model other sequence elements, but it may be taking these into account. For example, the downstream state is T-rich and this roughly corresponds to a CstF binding site. Whether or not CstF binding downstream of the cleavage site is actually required for 3' end formation is not known and may be dependent on the nature of the AATAAA motif (25).

    Correct identification of full-length transcripts is important both for studying the process of 3' end formation and for integrating experimental results, such as northern blots, SAGE tags and microarrays. Another implication for this work is that it may improve the quality of gene prediction. One of the difficulties in gene prediction is identifying the terminal exon. Misidentification can cause single genes to be split or neighbouring genes to be fused. Employing a more descriptive model of 3' ends should help reduce this problem, and we are currently exploring this possibility.

    We have started to use the methodology described in this paper to analyse polyadenylation and cleavage sites in other organisms. Preliminary data shows that different phyla have different features in the vicinity of the cleavage site. For example, the SP composition appears A-rich in insects and contains separate T-rich and A-rich domains in vertebrates (data not shown). For this reason, it would be inappropriate to apply the HMM developed for C.elegans to distantly related genomes, but a similar framework of model, fit with relevant data, may be applicable.

    SUPPLEMENTARY MATERIAL

    ACKNOWLEDGEMENTS

    We thank Yuji Kohara and the NEXTDB project for submission of unpublished nematode EST sequences to the public nucleotide database. Thanks to Lachlan Coin and Kevin Howe for technical assistance and to Tom Blumenthal and Lawrence Hene for comments on the manuscript. All authors are supported by The Wellcome Trust. A.H. holds an MRC Research Studentship.

    REFERENCES

    Colgan,D.F. and Manley,J.L. ( (1997) ) Mechanism and regulation of mRNA polyadenylation. Genes Dev., , 11, , 2755–2766.

    Guo,Z. and Sherman,F. ( (1996) ) 3'-end-forming signals of yeast mRNA. Trends Biochem Sci., , 21, , 477–481.

    Zhao,J., Hyman,L. and Moore,C. ( (1999) ) Formation of mRNA 3' ends in eukaryotes: mechanism, regulation, and interrelationships with other steps in mRNA synthesis. Microbiol. Mol. Biol. Rev., , 63, , 405–445.

    Proudfoot,N.J. ( (2001) ) Genetic dangers in poly(A) signals. EMBO Rep., , 2, , 891–892.

    Gross,S. and Moore,C.L. ( (2001) ) Rna15 interaction with the A-rich yeast polyadenylation signal is an essential step in mRNA 3'-end formation. Mol. Cell Biol., , 21, , 8045–8055.

    Chen,S. and Hyman,L.E. ( (1998) ) A specific RNA–protein interaction at yeast polyadenylation efficiency elements. Nucleic Acids Res., , 26, , 4965–4974.

    Kessler,M.M., Henry,M.F., Shen,E., Zhao,J., Gross,S., Silver,P.A. and Moore,C.L. ( (1997) ) Hrp1, a sequence-specific RNA-binding protein that shuttles between the nucleus and the cytoplasm, is required for mRNA 3'-end formation in yeast. Genes Dev., , 11, , 2545–2556.

    Dichtl,B. and Keller,W. ( (2001) ) Recognition of polyadenylation sites in yeast pre-mRNAs by cleavage and polyadenylation factor. EMBO J., , 20, , 3197–3209.

    Blumenthal,T. and Steward,K. ( (1997) ) RNA processing and gene structure. In Riddle,D.L., Blumenthal,T., Meyer,B.J. and Priess,J.R. (eds), Caenorhabditis elegans II. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York, pp. 117–145.

    Huang,T., Kuersten,S., Deshpande,A.M., Spieth,J., MacMorris,M. and Blumenthal,T. ( (2001) ) Intercistronic region required for polycistronic pre-mRNA processing in Caenorhabditis elegans. Mol. Cell Biol., , 21, , 1111–1120.

    Evans,D., Perez,I., MacMorris,M., Leake,D., Wilusz,C.J. and Blumenthal,T. ( (2001) ) A complex containing CstF-64 and the SL2 snRNP connects mRNA 3' end formation and trans-splicing in C.elegans operons. Genes Dev., , 15, , 2562–2571

    Chen,F. and Wilusz,J. ( (1998) ) Auxiliary downstream elements are required for efficient polyadenylation of mammalian pre-mRNAs. Nucleic Acids Res., , 26, , 2891–2898.

    Salamov,A.A. and Solovyev,V.V. ( (1997) ) Recognition of 3'-processing sites of human mRNA precursors. Comput. Appl. Biosci., , 13, , 23–28.

    Tabaska,J.E. and Zhang,M.Q. ( (1999) ) Detection of polyadenylation signals in human DNA sequences. Gene, , 231, , 77–86.

    Legendre,M. and Gautheret,D. ( (2003) ) Sequence determinants in human polyadenylation site selection. BMC Genomics, , 4, , 7.

    Graber,J.H., McAllister,G.D. and Smith,T.F. ( (2002) ) Probabilistic prediction of Saccharomyces cerevisiae mRNA 3'-processing sites Nucleic Acids Res., , 30, , 1851–1858.

    Stajich,J.E., Block,D., Boulez,K., Brenner,S.E., Chervitz,S.A., Dagdigian,C., Fuellen,G., Gilbert,J.G., Korf,I., Lapp,H. et al. ( (2000) ) The Bioperl toolkit: Perl modules for the life sciences. Genome Res., , 12, , 1611–1618.

    Durbin,R., Eddy,S., Krogh,A. and Mitchison,G. ( (1998) ) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK.

    Rabiner,L.R. ( (1989) ) A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, , 77, , 257–286.

    Kulp,D., Haussler,D., Reese,M.G. and Eeckman,F.H. ( (1996) ) A generalized hidden Markov model for the recognition of human genes in DNA. Proc. Int. Conf. Intell. Syst. Mol. Biol., , 4, , 134–142.

    Burge,C. and Karlin,S. ( (1997) ) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., , 268, , 78–94.

    Stanke,M. and Waack,S. ( (2003) ) Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics, , 2, (suppl.), II215–II225.

    Majoros,W.H., Pertea,M., Antonescu,C. and Salzberg,S.L. ( (2003) ) GlimmerM, Exonomy and Unveil: three ab initio eukaryotic genefinders. Nucleic Acids Res., , 31, , 3601–3604.

    Korf,I. ( (2004) ) Gene finding in novel Genomes. BMC Bioinformatics, , 5, , 59.

    MacDonald,C.C. and Redondo,J.L. ( (2002) ) Reexamining the polyadenylation signal: were we wrong about AAUAAA? Mol. Cell. Endocrinol., , 190, , 1–8.(Ashwin Hajarnavis, Ian Korf and Richard )