Journal article: 分子生物学进展, 2005, No. 8
Interchromosomal Segmental Duplications Explain the Unusual Structure of PRSS3, the Gene for an Inhibitor-Resistant Trypsinogen
    Institute for Systems Biology, Seattle; Division of Human Biology, Fred Hutchinson Cancer Research Center, Seattle; Departments of Pediatrics/Genetics, University of Miami School of Medicine; Bioinformatics and Computational Biology Program, George Mason University; ViaLogy, Altadena; and PhenoGenomics Corporation, Bellevue

    E-mail: lrowen@systemsbiology.org; btrask@fhcrc.org.

    Abstract

    Homo sapiens possess several trypsinogen or trypsinogen-like genes of which three (PRSS1, PRSS2, and PRSS3) produce functional trypsins in the digestive tract. PRSS1 and PRSS2 are located on chromosome 7q35, while PRSS3 is found on chromosome 9p13. Here, we report a variation of the theme of new gene creation by duplication: the PRSS3 gene was formed by segmental duplications originating from chromosomes 7q35 and 11q24. As a result, PRSS3 transcripts display two variants of exon 1. The PRSS3 transcript whose gene organization most resembles PRSS1 and PRSS2 encodes a functional protein originally named mesotrypsinogen. The other variant is a fusion transcript, called trypsinogen IV. We show that the first exon of trypsinogen IV is derived from the noncoding first exon of LOC120224, a chromosome 11 gene. LOC120224 codes for a widely conserved transmembrane protein of unknown function. Comparative analyses suggest that these interchromosomal duplications occurred after the divergence of Old World monkeys and hominids. PRSS3 transcripts consist of a mixed population of mRNAs, some expressed in the pancreas and encoding an apparently functional trypsinogen and others of unknown function expressed in brain and a variety of other tissues. Analysis of the selection pressures acting on the trypsinogen gene family shows that, while the apparently functional genes are under mild to strong purifying selection overall, a few residues appear under positive selection. These residues could be involved in interactions with inhibitors.

    Key Words: segmental duplication, fusion gene, selection, trypsinogen

    Introduction

    It is now widely appreciated that the human genome has been shaped by mutation, gross rearrangement, and duplications of gene-containing segments during its evolution. Roughly 5% of the human genome appears to have arisen in the last 40 Myr through duplication of segments of 1 kb or longer (Bailey et al. 2001). Genes contained in these duplicate segments often evolve to take on distinct functions (Hurles 2004) either through alteration of their protein structure or their regulatory elements. Here, we report that duplicated segments from human chromosomes 7 and 11 have coalesced on chromosome 9, forming an unusual trypsinogen gene with two distinct promoters, derived from each of the originating chromosomes.

    Trypsinogens are the inactive precursors to trypsins, a class of serine proteases that digest proteins by cleaving at lysine or arginine residues. They are produced in the pancreas and secreted to the duodenum and intestines, where they are activated by enterokinase (enteropeptidase, PRSS7) to form trypsin, which, in turn, activates itself and other digestive enzymes (Kitamoto et al. 1994). Trypsinogen genes in mammals and birds constitute a multigene family whose members are found within the beta T cell receptor (TCRB) locus (Hood, Rowen, and Koop 1995; Wang et al. 1995; Rowen, Koop, and Hood 1996). In the fully characterized TCRB loci of human (Rowen, Koop, and Hood 1996) and mouse (AE000663, AE000664, AE000665), clusters of trypsinogen (T) or trypsinogen-like (TL) genes flank a region spanning hundreds of kilobases containing the TCRB variable gene segments.

    Analysis of the human TCRB locus on 7q35 revealed genes coding for the functional trypsinogen proteins PRSS1 and PRSS2, also known, respectively, as the cationic and anionic trypsinogens (Scheele, Bartelt, and Bieger 1981), but not PRSS3, also called mesotrypsinogen (Rinderknecht et al. 1984; Nyaruhucha, Kito, and Fukuoka 1997). The 3.6-kb five-exon genes coding for PRSS1 and PRSS2 are embedded within the first and last units, respectively, of a tandem array of five 10.6-kb–duplication units located near the 3' end of the TCRB locus (Rowen, Koop, and Hood 1996). The three internal units of the tandem array contain trypsinogen pseudogenes, none of which corresponds to the mesotrypsinogen cDNAs in GenBank. Complicating matters further, cDNAs for an alternative form of mesotrypsinogen called trypsinogen IV were reported (Wiegand et al. 1993). Although exons 2–5 were the same as those found in mesotrypsinogen, the sequence of exon 1 of trypsinogen IV was completely different from exon 1 of any other trypsinogen. As a result, the predicted protein is missing the leader signal required for the secretion of pancreatic enzymes.

    Because a cluster of nonfunctional TCRB V gene segments was previously localized to chromosome 9 (Charmley, Wei, and Concannon 1993; Robinson et al. 1993), we reasoned that the missing mesotrypsinogen-trypsinogen IV gene would be found in association with these V genes. This hypothesis was confirmed by our sequence of this portion of chromosome 9p13. Moreover, we report here that 175 kb of chromosome 11 sequence was also duplicated to chromosome 9 after the divergence of hominids and Old World monkeys, and that trypsinogen IV's first exon has been co-opted from another gene. We show that the two PRSS3 variants are expressed in different tissues and explore the selection pressures acting on the trypsinogen gene family.

    Materials and Methods

    Sequencing

    The sequence of the beta T cell/trypsinogen locus on chromosome 7 (NG_001333; U66059, U66060, U66061) was described earlier (Rowen, Koop, and Hood 1996). The sources of chromosome 9 sequence containing the orphon V gene segments and PRSS3 are annotated in AF029308. For the region on chromosome 11, a BAC clone RP11-61J24 was identified from a library screen using probes to trypsinogen IV exon 1 and sequenced (AC010583). The human sequences analyzed in this report can be found at chr7:141810473–141960472, chr9:33566742–33816741, and chr11:129032812–129239214 in the 05/04 genome assembly (http://genome.ucsc.edu). A rhesus macaque BAC clone, CHORI250-28G19, containing an orthologous trypsinogen gene cluster was sequenced and submitted to GenBank as AC149201. All sequencing was done using the high-redundancy shotgun method (Rowen, Lasky, and Hood 1999) and finished to >99.99% accuracy.

    Fluorescence In Situ Hybridization

    Clones from regions involved in the chromosome 7-9 duplication (group A) and in the chromosome 11-9 duplication (group B) were used to determine the origin and timing of duplicated sequence in the chromosome 9 locus. Group A comprises the following: human chromosome 7 cosmid B97 (subcloned from a YAC derived from the CGM1 cell line) containing V20-V25; human chromosome 9 cosmid X91 (from ATCC 1475; AF029308) containing orphon V20-V24; human chromosome 7 BAC CTD-2087C12 containing trypsinogen genes; and rhesus macaque BAC CHORI250-28G19 (AC149201) containing trypsinogen genes. Group B comprises the following: human chromosome 9 cosmid 3B9 (subcloned from BAC CTA-109D8; AF029308) containing trypsinogen IV exon 1 and human chromosome 11 BAC RP11-61J24 (AC010583) containing LOC120224. Note that not all four clones in group A were analyzed in all species.

    Cosmid or BAC DNA was isolated from bacterial cultures, biotinylated via nick translation, and hybridized in the presence of human Cot1 DNA to metaphase spreads prepared from phytohaemagglutinin-stimulated human lymphocytes and fibroblast or lymphoblast cell lines of various other primates using published procedures (Trask 1999). The cell lines used were CRL1847 or AG16618 for chimpanzee, CRL1854 or AG05251 for gorilla, CRL1850 or GM06213 for orangutan, GM03443 for rhesus macaque, CRL1495 for baboon, and H39 for gibbon. Cell repository lines (CRL) were obtained from ATCC (www.atcc.org), and AG- and GM-lines were obtained from the Coriell Cell Repositories (http://locus.umdnj.edu/ccr/). The sites of hybridization were detected with two layers of fluorescein-conjugated avidin connected with biotinylated goat antiavidin antibody. The chromosomes were counterstained with 4',6-diamidino-2-phenylindole, and images were collected for analysis as described elsewhere (Trask 1999). For each assay, the number and location of hybridization signals were analyzed in at least 10 metaphase spreads and numerous interphase nuclei.

    Sequence Analysis

    Expressed sequence tags (ESTs) containing exon 1 of mesotrypsinogen, trypsinogen IV, and LOC120224 were identified from the 05/04 genome assembly. Library information was derived from the EST accession numbers and the Image Consortium (http://image.llnl.gov/image/html/humlib_info.shtml). To identify duplicated regions, we used the TCRB/trypsinogen locus on chromosome 7 as a starting query sequence for similarity searches. We performed pairwise alignment of similar sequences on chromosomes 7, 9, and 11 using Blast2 (Tatusova and Madden 1999), without repeat masking and with parameters set so that alignments spanned interspersed-repeat integrations, small insertions, and deletions. RepeatMasker (Smit et al. 2004) was used to identify interspersed-repeat sequences either spanning or truncated at the breakpoints of similarity between sequence pairs, allowing the original and rearranged sequences to be identified in some cases. To calculate divergence rates between species, regions of similarity between human and chimpanzee (11/03 assembly) or rhesus macaque (BAC CHORI250-28G19) sequences were first identified using BLAT (Kent 2002) searches or Blast2. The percent identities/divergences of all regions of similarity were calculated excluding insertions-deletions (indels) and applying Jukes-Cantor correction for multiple substitutions (Jukes and Cantor 1969).
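
    The divergence calculation described above can be illustrated with a minimal sketch. It assumes a pairwise alignment supplied as two equal-length strings with "-" marking indels; the function names and the toy sequences are hypothetical and are not taken from the study. Columns containing a gap are skipped, and the observed proportion of differing sites p is converted to the Jukes-Cantor distance d = -(3/4) ln(1 - 4p/3).

```python
import math


def observed_difference(seq_a, seq_b, gap="-"):
    """Proportion of mismatched sites, ignoring any column that contains a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # indel column: excluded from the calculation
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return mismatches / compared


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance for an observed difference p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to exist")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Toy alignment (hypothetical): 5 ungapped columns, 1 mismatch.
p = observed_difference("ACGT-A", "ACGA-A")
d = jukes_cantor_distance(p)
```

    The corrected distance is always at least as large as the observed difference, and the two are nearly equal at the low divergences (a few percent) discussed in this report.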

    Detecting Natural Selection

    Selection pressures acting on 3 human, 5 rhesus macaque, and 11 mouse putatively functional trypsinogen genes were estimated (human: PRSS1, PRSS2, PRSS3; rhesus macaque: try9, try12, try13, try14, try16; mouse: try4, try5, try7, try8, try9, try10, try11, try12, try15, try16, try20). The mouse genes are in sequence with accessions AE000663, AE000664, and AE000665. Genes were classed as putatively functional if they did not contain frame-shifts, premature stop codons, or the R122H mutation known to cause hereditary pancreatitis (Whitcomb et al. 1996). In addition, one apparently functional gene from rhesus macaque (try4) was excluded from selection analyses as the GENECONV software (Sawyer 1989) predicts that this gene has been involved in a gene-conversion event (P < 0.01 for both global permutation and Bonferroni-corrected Karlin-Altschul calculated P values, all sites considered, mismatches allowed [g1]). A multiple sequence alignment of coding regions of the 19 genes was made using ClustalW (Thompson, Higgins, and Gibson 1994) and manually edited to keep codons in frame and remove sites with gaps in any sequence (supplementary data files 1 and 2, Supplementary Material online).

    We first compared the number of nonsynonymous substitutions per nonsynonymous site (dN) to the number of synonymous substitutions per synonymous site (dS) over the length of the genes using the program yn00 (Yang and Nielsen 2000). We then used the ADAPTSITE software version 1.3.2 (Suzuki and Gojobori 1999; Suzuki, Gojobori and Nei 2001) and the CODEML program in the PAML software version 3.14 (Yang 1997; Wong et al. 2004) to determine whether any amino acid sites were likely to be subject to positive selection. For ADAPTSITE, a phylogenetic tree was constructed using the p-distance tree option in the NJBOOT program in the LINTREE software (Takezaki, Rzhetsky, and Nei 1995). PAUP* (Swofford 2003) was used to produce a neighbor-joining tree using the uncorrected p-distance for input into CODEML. ADAPTSITE looks at each amino acid site and estimates the number of synonymous and nonsynonymous sites and changes throughout the tree using ancestral sequences constructed by maximum parsimony. The program then tests whether the proportion of nonsynonymous changes at each amino acid site is significantly different from the neutral expectation (dN/dS = 1). CODEML uses maximum likelihood methods to determine how well models that allow sites with different classes of dN/dS (ω) ratios fit the data. The fit of four models to the data was calculated. Model 1a permits two site classes with 0 < ω₀ < 1 and ω₁ = 1. Model 2a adds to Model 1a a third site class with ω₂ > 1, thus allowing some sites to be under positive selection. Model 7 assumes a beta distribution of ω, with all values of ω between 0 and 1. Model 8 adds an extra class with ω > 1 to Model 7. These models were chosen for analysis because the fit of the data to the more general models (M2a and M8) can be compared to the fit to the more specific models (M1a and M7) using chi-square tests as described in Yang et al. (2000). The program was run with three starting values and, in all cases, converged on the same likelihood values, suggesting that global rather than local maxima were achieved.
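
    The nested-model comparison can be sketched as a likelihood ratio test. The sketch below assumes 2 degrees of freedom, because M2a and M8 each add two free parameters to M1a and M7, respectively; the log-likelihood values are hypothetical placeholders, not results from this study. For 2 degrees of freedom the chi-square survival function has the closed form exp(-x/2), so no statistics library is needed.

```python
import math


def likelihood_ratio_test(lnl_null, lnl_alt):
    """Compare two nested models with a chi-square test on 2 degrees of freedom.

    lnl_null: log-likelihood of the simpler model (for example M1a or M7).
    lnl_alt:  log-likelihood of the more general model (for example M2a or M8).
    Returns the test statistic 2 * (lnl_alt - lnl_null) and its P value.
    """
    statistic = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # Survival function of the chi-square distribution with df = 2.
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value


# Hypothetical log-likelihoods, for illustration only.
stat, p_value = likelihood_ratio_test(lnl_null=-1000.0, lnl_alt=-990.0)
```

    A small P value means the model that allows a class of sites with ω > 1 fits significantly better than the model that does not.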

    Results and Discussion

    Genomic Organization of the Mesotrypsinogen/Trypsinogen IV Gene

    The interval between NOL6 and UBE2R2, which encompasses PRSS3, is vastly larger in the genomes of human and chimpanzee (>340 kb) than in the genomes of dog (9.8 kb) or mouse (11.6 kb). This difference suggests that the primate genome underwent rearrangement of this region of chromosome 9p13 subsequent to the mammalian radiation. Approximately 190 kb of the additional DNA in human and chimpanzee can be explained by duplicative transfers of sequence to chromosome 9 from the TCRB locus on chromosome 7q35 and from a region on 11q24. The derivative sequence at 9p13 has two regions of similarity with the TCRB locus on chromosome 7 separated by a 66-kb region of similarity with portions of 11q24 (fig. 1). In human, an additional 150 kb of sequence between the NOL6 and UBE2R2 genes is accounted for by multiple intrachromosomal duplications (see the "Segmental Duplication" track of the Human Genome Browser, http://genome.ucsc.edu/).

    FIG. 1.— Paralogous relationships among regions on human chromosomes 9p13, 7q35, and 11q24 resulting from interchromosomal duplications and local rearrangements. Creation of two transcriptional variants of PRSS3; trypsinogen IV is a hybrid of an exon copied from a chromosome 11 gene LOC120224 and mesotrypsinogen exons duplicatively transferred from chromosome 7. Shaded panels indicate the extent of sequence shared by two chromosomes. Segments that are in inverted orientation are shown with arrows below them. The positions of known genes (black blocks) and the trypsinogen-duplication units (white blocks) are shown. Chromosome-specific insertions or deletions 3 kb, for example, of interspersed repetitive elements, are not shown. Each rearrangement breakpoint that lies within an interspersed repetitive element is marked with either O (intact and thus original) or T (truncated and thus derived).

    The more telomeric segment of similarity between 9p13 and 7q35 spans 88 kb on chromosome 9 and contains V gene segments V20 through V26 (fig. 1). The centromere-proximal region of chromosome 7 similarity spans 28 kb and contains V gene segment V29 (fig. 1). A 20-kb region of chromosome 7 containing V gene segments V27 and V28 is not present on chromosome 9. In addition to V29, the more centromeric region of chromosome 7 similarity contains a trypsinogen gene (PRSS3, mesotrypsinogen) embedded in a 10.6-kb trypsinogen-duplication unit and the initial 1.2 kb of an adjacent trypsinogen-duplication unit. Chromosome 7 is clearly the source of this duplicated segment: this trypsinogen-duplication unit continues in the chromosome 7 sequence. The paralogous region on chromosome 7 contains a total of five trypsinogen-duplication units, which include PRSS1, three trypsinogen pseudogenes, and PRSS2, respectively (Rowen, Koop, and Hood 1996). Immediately upstream of the V20-V26–containing segment of chromosome 7 similarity are two additional short regions (1.9 kb and 0.6 kb) that both match in reverse orientation to portions of the trypsinogen-duplication units. These small segments show high similarity to the trypsinogen-duplication units located at the 3' end of the TCRB locus on chromosome 7, but might have been derived from the unit at the 3' end of the duplicated sequence on chromosome 9. Both segments have higher percent identity to the unit on chromosome 9 than to any of the five units on chromosome 7 (average 95% vs. 92%).

    The region of 11-9 paralogy (fig. 1) includes the first exon of a six-exon gene, LOC120224, found on chromosome 11 and represented by cDNA AK098106. LOC120224 is the original source of this first exon as it is contained in ESTs of LOC120224 from rhesus macaque, a species that does not have the 11-9 duplication (see below). No other known transcriptional units were transferred from chromosome 11 to chromosome 9. Curiously, in contrast to the portion transferred from chromosome 7, the portion of chromosome 11 represented on chromosome 9 underwent significant rearrangement, presumably subsequent to the duplicative transfer. A total of 109 kb of sequence found on chromosome 11 is absent from the chromosome 9 copy, with the result that nine discrete blocks with similarity to chromosome 11 are observed (fig. 1). Moreover, several blocks of sequence are rearranged in order and orientation on chromosome 9 relative to chromosome 11. Of the 18 rearrangement breakpoints, 9 lie in interspersed-repeat elements, and in all of these cases, the chromosome 11 copy contains an intact repetitive element spanning the breakpoint, while chromosome 9 contains the derived, truncated form. Thus, in each case where the original state can be deduced from the repeats present at the breakpoints, chromosome 11 represents the original, unaltered state, and the copy on chromosome 9 has been rearranged.

    Timing of the Duplications

    Excluding the trypsinogen-duplication units, the two main blocks of sequences on chromosome 9 derived from chromosome 7 average 94.5% and 94.9% identity to their counterparts on chromosome 7, suggesting that the two regions were copied to chromosome 9 at the same time. Interestingly, the percent identity between the chromosome 9 and 7 sequences drops at the start of the PRSS3 trypsinogen-duplication unit and is only 91.6% over the length of the unit. This point is discussed separately below. The chromosome 9-11 paralogy has an overall percent identity of 95.2%. Duplication of sequence from chromosome 7, followed by the chromosome 11-to-9 duplication, is the most parsimonious explanation for the interruption of the regions of similarity with chromosome 7 and the replacement, possibly through a double recombination, of the V27 and V28 gene segments on chromosome 9 by chromosome 11 sequence.

    The approximate average percent divergence of 5% for both the chromosome 7 versus 9 and 11 versus 9 comparisons suggests that both segmental duplications to chromosome 9 occurred sometime after the separation of hominids from Old World monkeys (7% divergence; Yi, Ellsworth, and Li 2002) but before the human-orangutan split (3% divergence; Chen and Li 2001), assuming no homogenizing sequence exchanges occurred between the paralogs. This conclusion is supported by our fluorescence in situ hybridization (FISH) analyses in a variety of primates using probes derived from regions involved in either the chromosome 7-9 or chromosome 11-9 transfers. Probes from both regions show hybridization signals on two chromosome pairs in orangutan, gorilla, chimp, and human, but on a single pair in baboon/macaque and gibbon (not shown). Thus, both duplicative transfers to chromosome 9 occurred after the divergence of hominids from Old World monkeys, but before the orangutan lineage split off from the gorilla/chimp/human branch.

    The 3% greater divergence of the trypsinogen-repeat block on chromosome 9 relative to the version on chromosome 7 compared to the rest of the duplication is intriguing. Its divergence of 8% would imply that the chromosome 7 and 9 copies of the trypsinogen-containing portion of the duplication began diverging 35 MYA (assuming a neutral mutation rate of 1.1 × 10⁻⁹ changes per nucleotide per year [Chen and Li 2001]), but FISH results establish that this region duplicated from chromosome 7 to 9 more recently (i.e., 15–20 MYA, after the gibbon lineage branched off from the common ancestor of human and orangutan). We excluded the formal possibility that the trypsinogen-repeat units are diverging more rapidly than neighboring sequence by comparing the divergence between human and chimpanzee within these units to that of neighboring regions. Human and chimpanzee sequences from the first introns in the trypsinogen genes have diverged at a rate (1.3%, averaged for four genes, 1-kb sequence compared) similar to the rate observed for 8.5 kb of nontandemly duplicated noncoding sequence 5' of the PRSS1-duplication unit (1.4%). These rates are similar to previous estimates of human-chimp divergence (e.g., 1.24 ± 0.07% in Chen and Li 2001).
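
    The dating argument can be made explicit with a short sketch. It assumes that both paralogs accumulate substitutions independently at the same neutral rate, so that the time since duplication is T = d / (2μ); this formula and the function name are illustrative assumptions rather than the authors' stated procedure, although they give a value close to the 35 MYA quoted above for 8% divergence.

```python
def divergence_time_years(divergence, rate_per_site_per_year=1.1e-9):
    """Time since two copies began diverging, assuming a constant neutral rate.

    divergence: substitutions per site between the two copies (0.08 for 8%).
    The factor of 2 reflects substitutions accumulating along both lineages.
    """
    if divergence < 0 or rate_per_site_per_year <= 0:
        raise ValueError("divergence must be >= 0 and rate must be > 0")
    return divergence / (2.0 * rate_per_site_per_year)


# Trypsinogen-repeat block (8% divergence) and the rest of the duplication (5%).
trypsinogen_block_my = divergence_time_years(0.08) / 1e6   # about 36 million years
flanking_sequence_my = divergence_time_years(0.05) / 1e6   # about 23 million years
```

    Under these assumptions the trypsinogen-repeat block appears more than 10 million years older than its flanking sequence, which is the anomaly addressed in the following paragraph.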

    The most plausible explanation for the anomalous divergence between the chromosome 7 and chromosome 9 paralogs of the trypsinogen units is that a single chromosome 7 segment duplicated to chromosome 9, 15–20 MYA, but the original source of the PRSS3-containing portion is no longer present on chromosome 7, or its relationship with the PRSS3 unit has been obscured by gene-conversion events. Tandemly duplicated blocks are often subject to deletion in unequal crossover events (Strachan and Read 1999). Indeed, loss of the trypsinogen-duplication unit containing the try6 pseudogene is a common polymorphism in humans (Seboun et al. 1989; Rowen, Koop, and Hood 1996), and gene conversion is speculated to occur among trypsinogens in human (Chen and Ferec 2000) and other species (Roach et al. 1997). At least two of the human-duplication units on chromosome 7 and two of the rhesus macaque-duplication units show evidence of gene conversion (P < 0.01) using GENECONV (Sawyer 1989).

    Human versus rhesus macaque comparisons provide additional insight into trypsinogen gene evolution. Human-rhesus divergence is greater for the trypsinogen first intron (10.4% average divergence of best match, 1-kb sequence compared) than for the single-copy sequence just 5' of the PRSS1-duplication unit (5.7%). The divergence of the single-copy sequence is similar to that found in 470 kb of sequence from the major histocompatibility complex class II region (Daza-Vamenta et al. 2004). Furthermore, five of the six human trypsinogen intronic sequences match best to the trypsinogen gene at the 5' end of the rhesus array. These findings suggest that human and rhesus macaque each retained a different subset of the ancestral set of duplication units, or that orthologous relationships are obscured by gene conversion between the units, with possible loss of the donor units.

    Gene Structures of PRSS3 and Chromosome 11-Specific LOC120224

    Analysis of the duplicative transfers from chromosomes 7 and 11 to chromosome 9 reveals the origin of the mesotrypsinogen and trypsinogen IV transcriptional variants of PRSS3. The intron/exon organization of mesotrypsinogen on chromosome 9 is the same as that of PRSS1 and PRSS2 on chromosome 7 (fig. 2). Each of the three genes spans 3.6 kb. The five exons of mesotrypsinogen code for a protein that is 247 amino acids long. Thirteen residues, including the leader peptide required for secretion, are in the first exon of mesotrypsinogen derived from chromosome 7 sequence (see cDNAs X15505 and BC069476 for examples).

    FIG. 2.— The splice variants of PRSS1, mesotrypsinogen, and trypsinogen IV. The PRSS1 gene is on chromosome 7, all other genes are on chromosome 9. The proposed locations of the initiating methionine (M) and the activation peptide sequence (+) are also shown.

    In contrast, the trypsinogen IV variant of PRSS3 spans 48.6 kb and derives its first exon from a sequence that was duplicated from chromosome 11 (fig. 1). Most ESTs and mRNAs for trypsinogen IV include the chromosome 11–derived exon 1 and chromosome 7–derived exons 2–5 of the PRSS3 gene. Allelic variants of this splice form are called a and b (Wiegand et al. 1993) (fig. 2). None contains mesotrypsinogen's first exon. An mRNA for another splice form (AY052783, named "isoform B") includes an additional exon after its chromosome 11–derived first exon. This additional exon is from the chromosome 7–derived sequence situated before the first exon of mesotrypsinogen. The translation start site for the "a/b" splice form is thought to be in the exon derived from chromosome 11 (Wiegand et al. 1993). Translation of the "B" isoform is predicted to start in its second exon (AY052783 annotation).

    The first exon of all the trypsinogen IV variants corresponds to exon 1 of a predicted gene on chromosome 11 currently named LOC120224 (AK098106; NM_138788). This first untranslated exon of LOC120224 is 95% identical over 80 nt (EST B1256410) to exon 1 of trypsinogen IV. The LOC120224 mRNA codes for a 275-amino acid protein that aligns well to predicted/hypothetical protein sequences in mouse, rat, chicken, frog, and fish, suggesting that although the gene's function is currently unknown, it is likely to be important. LOC120224 has six predicted transmembrane domains, five of which have been annotated as the "DUF 716" domain. A Psi-Blast search shows that this protein is related to proteins involved in antiviral responses.

    Expression of the Mesotrypsinogen and Trypsinogen IV Transcripts

    Mesotrypsinogen, like cationic and anionic trypsinogen mRNAs (PRSS1 and PRSS2), is transcribed predominantly in the pancreas (Wiegand et al. 1993). Pancreatic trypsinogen gene expression is thought to be regulated by the pancreatic transcription factor PTF1, which contains a subunit called p48, whose expression appears restricted to the pancreas (Krapp et al. 1996). Based on ESTs sequenced from an unnormalized HR85 normal pancreatic islet library (Kaestner et al. 2003), mesotrypsinogen is expressed at a markedly lower level than are the cationic and anionic trypsinogens (mesotrypsinogen: 8 ESTs, trypsinogen IV: 0, PRSS1: 1371, PRSS2: 1331, out of a total of 69,008 available ESTs). The low relative level of mesotrypsinogen mRNAs might not be due to transcriptional regulation because the three genes are identical in the region hypothesized to be a pancreatic-specific cis-regulatory element (caggtgtgtttgt) 70–80 bases 5' of the TATA box (Stevenson, Hagenbuchle, and Wellauer 1986; Cockell et al. 1989). However, other transcription factors and cis-control elements could contribute to the expression of pancreatic trypsinogens, and one or more of these elements could be missing in the promoters of various trypsinogen genes. An alternative explanation for differential expression levels is that exon 1 of mesotrypsinogen has a nonconsensus GC splice donor, which might render its splicing less efficient. Mesotrypsinogen and trypsinogen IV show different tissue-distribution patterns of expression (table 1), as would be expected from differences in the cis-regulatory elements 5' of the different first exons for the two transcript variants. Like PRSS1 and PRSS2, mesotrypsinogen expression appears to be restricted to the pancreas. In contrast, trypsinogen IV appears to be expressed at a low level in a variety of tissues and is not restricted to brain as earlier thought (Wiegand et al. 1993) (table 1). Trypsinogen IV's expression pattern may be due to the gene-regulation signals that affect the transcription of LOC120224 from chromosome 11. Electronic Northern data from Unigene indicate that LOC120224 transcripts with and without exon 1 are expressed in a variety of tissues, with an emphasis on colon and stomach, but not in normal pancreas (table 1). The tissue distributions of trypsinogen IV and LOC120224 ESTs containing the first exon show no overlap, but transcripts of both genes are rare.
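
    The EST counts quoted above can be turned into relative abundances with simple arithmetic. The sketch below uses only the counts given in the text for the pancreatic islet library; the normalization to ESTs per 10,000 and the variable names are illustrative choices, and EST counts from a single library are at best a rough proxy for transcript levels.

```python
# EST counts from the pancreatic islet library, as quoted in the text.
TOTAL_ESTS = 69008
EST_COUNTS = {
    "mesotrypsinogen": 8,
    "trypsinogen IV": 0,
    "PRSS1": 1371,
    "PRSS2": 1331,
}


def ests_per_10k(count, total=TOTAL_ESTS):
    """Number of ESTs for a transcript per 10,000 ESTs sequenced."""
    return 10000.0 * count / total


abundance = {name: ests_per_10k(n) for name, n in EST_COUNTS.items()}

# Fold difference between PRSS1 and mesotrypsinogen EST counts.
prss1_vs_meso = EST_COUNTS["PRSS1"] / EST_COUNTS["mesotrypsinogen"]
```

    On these counts, PRSS1 and PRSS2 each account for roughly 2% of the library, while mesotrypsinogen ESTs are more than 150-fold rarer.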

    Table 1. Expression Patterns of PRSS3 and LOC120224

    Functional Significance of PRSS3

    The proteolytic functions of the PRSS1 and PRSS2 trypsins in the digestive tract are well documented (Craik and Halfon 1998), but the physiological role of mesotrypsin is a matter of debate. Mesotrypsinogen comprises a minor portion (3%) of the total trypsinogen protein in the pancreas (Szmola, Kukor, and Sahin-Toth 2003). Mesotrypsin is significantly more resistant to trypsin inhibitors (Rinderknecht et al. 1984) than are the cationic and anionic trypsins, presumably due to disruption of the inhibitor-binding site caused by the presence of arginine rather than glycine at position 198 in the mesotrypsin amino acid sequence (Nyaruhucha, Kito, and Fukuoka 1997). One of the rhesus monkey trypsinogens also has this substitution, showing possible convergent evolution. Due to this amino acid change, mesotrypsin can degrade soybean trypsin inhibitor (SBTI) and human pancreatic secretory trypsin inhibitor (SPINK1) (Szmola, Kukor, and Sahin-Toth 2003). The normal physiological role for mesotrypsin is surmised to be the degradation of naturally occurring trypsin inhibitors found in food such as soybeans. However, should mesotrypsinogen become inappropriately activated in the pancreas, it could degrade the trypsin inhibitors that protect the pancreas from damage caused by residual levels of trypsin, thereby causing or contributing to pancreatitis (Szmola, Kukor, and Sahin-Toth 2003).

    The function of trypsinogen IV variants, if there is one, is not yet known, but the expression data suggest a possible significance in tissues other than pancreas. All trypsinogen IV variants contain the activation peptide sequence of mesotrypsinogen (fig. 2) and thus have the potential to encode functional trypsins identical to the mesotrypsin produced in the pancreas. The intracellular transport and activation of the trypsinogen IV variants might differ, however, from trypsinogens expressed in pancreatic cells. The leader peptide required for secretion is encoded by the first exon of other trypsinogens, and is therefore missing in trypsinogen IV. Assuming that translation starts in the chromosome 11–derived exon, trypsinogen IVa/b contains four RXXR furin cleavage–recognition sites (Molloy et al. 1992), supporting the possibility of this alternative pathway for secretion (Wiegand et al. 1993; Cottrell et al. 2004).

    Trypsin IV is among the various proteases found to activate protease-activated receptors (PARs) 2 and 4, and the protease inhibitor resistance of trypsin IV has been postulated to promote prolonged PAR-mediated signaling in nonpancreatic cells (Cottrell et al. 2004). Trypsin IV has also been implicated in the increased production of glial fibrillary acidic protein and β-amyloid in the brain of transgenic mice constructed to express trypsinogen IVa/b in neurons (Minn et al. 1998). Additionally, levels of PRSS1, PRSS2 (see references in Yamamoto et al. 2003), and PRSS3 (Diederichs et al. 2004) are elevated in various tumors, and trypsin might stimulate cancer cell proliferation via PAR activation (Miyata et al. 2000).

    Nothing is known about the expression and possible functions of trypsinogen IV isoform B. If the translation start of this variant is indeed within the second exon as predicted, the protein contains no furin cleavage–recognition sites and might not be extracellularly secreted.

    Accretion of a novel first exon and promoter region in the trypsinogen IV splice form appears to have expanded PRSS3's tissue expression beyond the pancreas, where the trypsin protein could have physiological roles distinct from those of the pancreatically expressed, inhibitor-resistant mesotrypsin. Competing selection pressures might arise between these divergent functions. If both functions are selectively advantageous, subfunctionalization could evolve to produce an inhibitor-resistant trypsin from another trypsinogen duplicate, allowing the trypsinogen IV form of PRSS3 to evolve independently.

    Purifying Selection Acting on Trypsinogen Genes

    We evaluated the selective pressures acting on 3 human, 5 rhesus monkey, and 11 mouse apparently functional trypsinogen genes, including mesotrypsinogen. We first looked at the dN/dS ratio calculated over the length of the coding regions of the genes. Strong selection pressures will result in differences in the fixation rates of nonsynonymous changes compared to synonymous changes (Yang and Nielsen 2000). The average dN/dS ratio was 0.26 ± 0.11. All dN/dS ratios were at most 0.67 except for the comparison of mouse try4 versus try5, for which the dN/dS ratio was 1.03. This overall signature of mild to strong purifying selection suggests three possibilities: (1) that the genes are under purifying selection because there is an advantage to maintaining more than one functional trypsinogen gene, as suggested by Roach et al. (1997); (2) that some paralogs are now under neutral selection but have not yet accumulated many changes since they lost their function; or (3) that subfunctionalization is underway, with relaxation of selection pressures on certain parts of some of the genes. We consider below the possibility that some amino acid sites in the duplicates are under positive selection while adapting to new function(s).
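
    To make the whole-gene dN/dS ratio concrete, the following Python sketch computes it for a pair of aligned coding sequences. It is an assumption-laden illustration, not the authors' procedure: it uses the simpler Nei-Gojobori counting approach with a Jukes-Cantor correction rather than the Yang and Nielsen (2000) method cited above, it counts changes to stop codons as nonsynonymous, and the function names and toy sequences are hypothetical.

```python
from itertools import permutations, product
from math import log

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order (third base varies fastest).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def syn_sites(codon):
    """Number of synonymous sites in one codon (between 0 and 3)."""
    total = 0.0
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                    total += 1 / 3
    return total

def count_diffs(c1, c2):
    """(synonymous, nonsynonymous) differences, averaged over mutation orders."""
    diff_pos = [i for i in range(3) if c1[i] != c2[i]]
    if not diff_pos:
        return 0.0, 0.0
    paths = []
    for order in permutations(diff_pos):
        current, syn, nonsyn = c1, 0, 0
        for pos in order:
            step = current[:pos] + c2[pos] + current[pos + 1:]
            if CODON_TABLE[step] == "*":
                break  # discard pathways that pass through a stop codon
            if CODON_TABLE[step] == CODON_TABLE[current]:
                syn += 1
            else:
                nonsyn += 1
            current = step
        else:
            paths.append((syn, nonsyn))
    if not paths:  # every pathway hit a stop: treat all changes as nonsynonymous
        return 0.0, float(len(diff_pos))
    return (sum(p[0] for p in paths) / len(paths),
            sum(p[1] for p in paths) / len(paths))

def jukes_cantor(p):
    """Correct a proportion of differences for multiple hits."""
    if p >= 0.75:
        raise ValueError("sequences too divergent for the Jukes-Cantor correction")
    return -0.75 * log(1 - 4 * p / 3)

def dn_ds(seq1, seq2):
    """Return (dN, dS, dN/dS) for two aligned, gap-free coding sequences."""
    if len(seq1) != len(seq2) or len(seq1) % 3:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    syn_total = syn_diff = nonsyn_diff = 0.0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        syn_total += (syn_sites(c1) + syn_sites(c2)) / 2
        s, n = count_diffs(c1, c2)
        syn_diff += s
        nonsyn_diff += n
    nonsyn_total = len(seq1) - syn_total
    d_s = jukes_cantor(syn_diff / syn_total)
    d_n = jukes_cantor(nonsyn_diff / nonsyn_total)
    return d_n, d_s, (d_n / d_s if d_s > 0 else float("nan"))

# Hypothetical 5-codon sequences (NOT trypsinogen data): one synonymous
# change (TTT to TTC) and one nonsynonymous change (GTT to GAT).
d_n, d_s, ratio = dn_ds("TTTGGGGCTGTTCCC", "TTCGGGGCTGATCCC")
print(d_n, d_s, ratio)
```

    A ratio well below 1, as in this toy pair, is the pattern interpreted in the text as purifying selection. The sketch does not handle gaps, ambiguous bases, or codons absent from the standard table.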

    Positive Selection Might be Acting on Some Amino Acid Residues

    Because positive selection on a few sites would not be reflected in the whole-gene dN/dS ratios, we used the maximum parsimony method in the ADAPTSITE software and the CODEML sites estimation method in the PAML software to test for positive selection at specific sites. ADAPTSITE finds no sites under positive selection, but identifies 34 sites likely to be under purifying selection (P < 0.05). CODEML models that allow some sites to be under positive selection (Models 2a and 8) fit the data significantly better than models that do not (Models 1a and 7) (table 2). CODEML identifies amino acid sites 99 and 100 in our gap-free alignment (sites 101 and 102 in conventional trypsinogen numbering) as likely to be under positive selection. Seven different amino acids are found at each of these two sites among the 19 genes studied, showing that the high dN/dS ratio of these sites is due to high rates of nonsynonymous substitutions rather than locally low rates of synonymous changes. The side chains of residues at sites 99 and 100 are solvent-exposed and lie in loops ringing the active site (see Protein Data Bank entries 1TRN and 1H4W). Therefore, such changes are unlikely to affect interactions within the catalytic site. However, these sites could be involved in interactions with the propeptide, thereby affecting autoactivation rates, or with trypsin inhibitors. Amino acid residue 99 was independently identified as a possible determinant of specific interactions with trypsin inhibitors (Gaboriaud et al. 1996; residue 96 in their chymotrypsinogen numbering system). We speculate that these variants could have differential affinity to trypsin inhibitors and, either through resistance to inhibitors or the ability to degrade them, allow a wider variety of trypsin inhibitor-containing foods to be digested.
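
    The comparison of nested CODEML site models (Model 1a versus 2a, Model 7 versus 8) is conventionally made with a likelihood-ratio test in which twice the log-likelihood difference is compared with a chi-square distribution with 2 degrees of freedom. The Python sketch below illustrates that calculation under this assumption. The function name is hypothetical and the log-likelihood values are invented for illustration; they are not the values reported in table 2.

```python
from math import exp

def lrt_pvalue_df2(lnl_null, lnl_alt):
    """Likelihood-ratio statistic and p-value for nested models with 2 extra parameters.

    For a chi-square distribution with 2 degrees of freedom the upper-tail
    probability has the closed form exp(-x / 2), so no statistics library is needed.
    """
    # Clamp at zero: the alternative model should never fit worse than the null.
    statistic = max(2.0 * (lnl_alt - lnl_null), 0.0)
    return statistic, exp(-statistic / 2.0)

# Hypothetical log-likelihoods (NOT taken from the paper).
stat, p_value = lrt_pvalue_df2(lnl_null=-5000.0, lnl_alt=-4990.0)
print(f"2*delta(lnL) = {stat:.1f}, p = {p_value:.2e}")
```

    A small p-value indicates only that a model allowing a class of sites with dN/dS greater than 1 fits better; identifying which sites belong to that class is a separate inference step.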

    Table 2 Tests for Selection Pressures on Sites in Apparently Functional Human, Rhesus Macaque, and Mouse Trypsinogen Genes

    The discrepancy between the results from the ADAPTSITE and CODEML analyses could be due either to a lack of power in ADAPTSITE to detect positively selected sites in a data set of this size or to false positives in the CODEML results. ADAPTSITE fails to identify sites under strong positive selection in simulation studies using small (30 sequence) data sets (Wong et al. 2004) like ours. However, CODEML has been shown to have high false-positive rates in some situations (Suzuki and Nei 2004; Wong et al. 2004). This high false-positive rate has been ameliorated in the version of CODEML used here by the use of the Bayes Empirical Bayes method for the inference of sites under positive selection (Yang, Wong, and Nielsen 2005). Nevertheless, the sites identified should be subject to further investigation before firm conclusions about their function in the trypsin protein are drawn.

    Conclusion

    Gene duplication provides a mechanism for expanding the functional repertoire of a family of proteins such as pancreatic trypsinogens. Upon duplication, genes have the possibility of acquiring new functions via changes in coding sequences or regulatory regions. Our analyses of the selection forces acting upon the trypsinogen genes in human, rhesus macaque, and mouse suggest that some members of this gene family might be under positive selection for amino acid substitution at some sites, and that these changes might affect their interactions with trypsin inhibitors. Duplicated genes may also become inactivated in order to maintain the proper gene dosage or because they confer no selective advantage. Indeed, some trypsinogen paralogs in various species have become pseudogenes. In this report, we document an unusual variation on the theme of creating new genes by duplication, namely a duplicative transfer of regions from chromosomes 7 and 11 to form a trypsinogen gene, PRSS3, on chromosome 9 with two different promoters and first exons. PRSS3 encodes two different inhibitor-resistant protein products, trypsinogen IV and mesotrypsinogen. By virtue of its location on a different chromosome than the other human trypsinogens, mesotrypsinogen might be less susceptible to gene-conversion events that could reverse the amino acid substitution that confers this resistance. It remains to be tested whether the co-opted first exon of trypsinogen IV adds novel function to the trypsinogen gene family, by allowing expression in different tissues such as brain and/or providing a different mechanism for protein secretion.

    Supplementary Material

    Supplementary data files 1 and 2 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

    Supplementary Data File 1.—Nucleotide alignment of the sequences of the 3 human, 5 rhesus macaque, and 11 mouse putatively functional trypsinogen genes used in the analyses of natural selection.

    Supplementary Data File 2.—Amino acid translation of the nucleotide sequence alignment of the human, rhesus macaque, and mouse trypsinogens used in the analyses of natural selection.

    Acknowledgements

    This work was supported by grants from the National Institutes of Health (GM057070 and DC04209 to B.T.; HG01791 to L.H.) and the Department of Energy. We thank Ken Kidd and Michael Seamon for generously providing gibbon cell line H39; Nikki Jerome, Dale Baskin, Carol Loretz, Stephen Lasky, Ann Ramsey, and Sung Mo for assistance with mapping and sequencing; and Janet Young, Jared Roach, and Pat Charmley for helpful discussions.

    References

    Bailey, J. A., A. M. Yavor, H. F. Massa, B. J. Trask, and E. E. Eichler. 2001. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 11:1005–1017.

    Charmley, P., S. Wei, and P. Concannon. 1993. Polymorphisms in the Tcrb-V2 gene segments localize the Tcrb orphon genes to human chromosome 9p21. Immunogenetics 38:283–286.

    Chen, F. C., and W. H. Li. 2001. Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. Am. J. Hum. Genet. 68:444–456.

    Chen, J. M., and C. Ferec. 2000. Gene conversion-like missense mutations in the human cationic trypsinogen gene and insights into the molecular evolution of the human trypsinogen family. Mol. Genet. Metab. 71:463–469.

    Cockell, M., B. J. Stevenson, M. Strubin, O. Hagenbuchle, and P. K. Wellauer. 1989. Identification of a cell-specific DNA-binding activity that interacts with a transcriptional activator of genes expressed in the acinar pancreas. Mol. Cell. Biol. 9:464–476.

    Cottrell, G. S., S. Amadesi, E. F. Grady, and N. W. Bunnett. 2004. Trypsin IV: a novel agonist of protease-activated receptors 2 and 4. J. Biol. Chem. 279:13532–13539.

    Craik, C. S., and S. Halfon. 1998. Trypsin. Pp. 12–21 in A. J. Barrett, N. D. Rawlings and J. F. Woessner, eds. Handbook of proteolytic enzymes. Academic Press, London.

    Daza-Vamenta, R., G. Glusman, L. Rowen, B. Guthrie, and D. Geraghty. 2004. Genetic divergence of the rhesus macaque major histocompatibility complex. Genome Res. 14:1501–1515.

    Diederichs, S., E. Bulk, B. Steffen et al. (14 co-authors). 2004. S100 family members and trypsinogens are predictors of distant metastasis and survival in early-stage non-small cell lung cancer. Cancer Res. 64:5564–5569.

    Gaboriaud, C., L. Serre, O. Guy-Crotte, E. Forest, and J.-C. Fontecilla-Camps. 1996. Crystal structure of human trypsin 1: unexpected phosphorylation of Tyr151. J. Mol. Biol. 259:995–1010.

    Hood, L., L. Rowen, and B. F. Koop. 1995. Human and mouse T-cell receptor loci: genomics, evolution, diversity, and serendipity. Ann. NY Acad. Sci. 758:390–412.

    Hurles, M. 2004. Gene duplication: the genomic trade in spare parts. PLoS Biol. 2:E206.

    Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–123 in H. N. Munro and J. B. Allison, eds. Mammalian protein metabolism. Academic Press, New York.

    Kaestner, K. H., C. S. Lee, L. M. Scearce et al. (20 co-authors). 2003. Transcriptional program of the endocrine pancreas in mice and humans. Diabetes 52:1604–1610.

    Kent, W. J. 2002. BLAT—the BLAST-like alignment tool. Genome Res. 12:656–664.

    Kitamoto, Y., X. Yuan, Q. Wu, D. W. McCourt, and J. E. Sadler. 1994. Enterokinase, the initiator of intestinal digestion, is a mosaic protease composed of a distinctive assortment of domains. Proc. Natl. Acad. Sci. USA 91:7588–7592.

    Krapp, A., M. Knofler, S. Frutiger, G. J. Hughes, O. Hagenbuchle, and P. K. Wellauer. 1996. The p48 DNA-binding subunit of transcription factor PTF1 is a new exocrine pancreas-specific basic helix-loop-helix protein. EMBO J. 15:4317–4329.

    Minn, A., M. Schubert, W. F. Neiss, and B. Muller-Hill. 1998. Enhanced GFAP expression in astrocytes of transgenic mice expressing the human brain-specific trypsinogen IV. Glia 22:338–347.

    Miyata, S., N. Koshikawa, H. Yasumitsu, and K. Miyazaki. 2000. Trypsin stimulates integrin alpha(5)beta(1)-dependent adhesion to fibronectin and proliferation of human gastric carcinoma cells through activation of proteinase-activated receptor-2. J. Biol. Chem. 275:4592–4598.

    Molloy, S. S., P. A. Bresnahan, S. H. Leppla, K. R. Klimpel, and G. Thomas. 1992. Human furin is a calcium-dependent serine endoprotease that recognizes the sequence Arg-X-X-Arg and efficiently cleaves anthrax toxin protective antigen. J. Biol. Chem. 267:16396–16402.

    Nyaruhucha, C. N. M., M. Kito, and S.-I. Fukuoka. 1997. Identification and expression of the cDNA-encoding human mesotrypsin(ogen), an isoform of trypsin with inhibitor resistance. J. Biol. Chem. 272:10573–10578.

    Rinderknecht, H., I. G. Renner, S. B. Abramson, and C. Carmack. 1984. Mesotrypsin: a new inhibitor-resistant protease from a zymogen in human pancreatic tissue and fluid. Gastroenterology 86:681–692.

    Roach, J. C., K. Wang, L. Gan, and L. Hood. 1997. The molecular evolution of the vertebrate trypsinogens. J. Mol. Evol. 45:640–652.

    Robinson, M. A., M. P. Mitchell, S. Wei, C. E. Day, T. M. Zhao, and P. Concannon. 1993. Organization of human T-cell receptor beta-chain genes: clusters of V beta genes are present on chromosomes 7 and 9. Proc. Natl. Acad. Sci. USA 90:2433–2437.

    Rowen, L., B. F. Koop, and L. Hood. 1996. The complete 685-kilobase DNA sequence of the human beta T cell receptor locus. Science 272:1755–1762.

    Rowen, L., S. Lasky, and L. Hood. 1999. Deciphering genomes through automated large-scale sequencing. Methods Microbiol. 28:155–192.

    Sawyer, S. 1989. Statistical tests for detecting gene conversion. Mol. Biol. Evol. 6:526–538.

    Scheele, G., D. Bartelt, and W. Bieger. 1981. Characterization of human exocrine pancreatic proteins by two-dimensional isoelectric focusing/sodium dodecyl sulfate gel electrophoresis. Gastroenterology 80:461–473.

    Seboun, E., M. A. Robinson, T. J. Kindt, and S. L. Hauser. 1989. Insertion/deletion-related polymorphisms in the human T cell receptor beta gene complex. J. Exp. Med. 170:263–270.

    Smit, A. F. A., R. Hubley, and P. Green. 1996–2004. RepeatMasker Open-3.0. http://www.repeatmasker.org.

    Stevenson, B. J., O. Hagenbuchle, and P. K. Wellauer. 1986. Sequence organisation and transcriptional regulation of the mouse elastase II and trypsin genes. Nucleic Acids Res. 14:8307–8330.

    Strachan, T., and A. P. Read. 1999. Human molecular genetics 2. 2nd edition. John Wiley & Sons. New York.

    Suzuki, Y., and T. Gojobori. 1999. A method for detecting positive selection at single amino acid sites. Mol. Biol. Evol. 16:1315–1328.

    Suzuki, Y., T. Gojobori, and M. Nei. 2001. ADAPTSITE: detecting natural selection at single amino acid sites. Bioinformatics 17:660–661.

    Suzuki, Y., and M. Nei. 2004. False-positive selection identified by ML-based methods: examples from the Sig1 gene of the diatom Thalassiosira weissflogii and the tax gene of a human T-cell lymphotropic virus. Mol. Biol. Evol. 21:914–921.

    Swofford, D. L. 2003. PAUP*: phylogenetic analysis using parsimony (*and other methods). Version 4. Sinauer Associates, Sunderland, Mass.

    Szmola, R., Z. Kukor, and M. Sahin-Toth. 2003. Human mesotrypsin is a unique digestive protease specialized for the degradation of trypsin inhibitors. J. Biol. Chem. 278:8580–8589.

    Takezaki, N., A. Rzhetsky, and M. Nei. 1995. Phylogenetic test of the molecular clock and linearized trees. Mol. Biol. Evol. 12:823–833.

    Tatusova, T. A., and T. L. Madden. 1999. BLAST 2 sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol. Lett. 174:247–250.

    Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.

    Trask, B. 1999. Fluorescence in situ hybridization. Pp. 303–413 in B. Birren, E. D. Green, P. Hieter, S. Klapholz, R. M. Myers, H. Riethman, and J. Roskams, eds. Genome analysis: a laboratory manual. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y.

    Wang, K., L. Gan, I. Lee, and L. Hood. 1995. Isolation and characterization of the chicken trypsinogen gene family. Biochem. J. 307:471–479.

    Whitcomb, D. C., M. C. Gorry, R. A. Preston et al. (15 co-authors). 1996. Hereditary pancreatitis is caused by a mutation in the cationic trypsinogen gene. Nat. Genet. 14:141–145.

    Wiegand, U., S. Corbach, A. Minn, J. Kang, and B. Muller-Hill. 1993. Cloning of the cDNA encoding human brain trypsinogen and characterization of its product. Gene 136:167–175.

    Wong, W. S. W., Z. Yang, N. Goldman, and R. Nielsen. 2004. Accuracy and power of statistical methods for detecting adaptive evolution in protein coding sequences and for identifying positively selected sites. Genetics 168:1041–1051.

    Yamamoto, H., S. Iku, Y. Adachi et al. (11 co-authors). 2003. Association of trypsin expression with tumour progression and matrilysin expression in human colorectal cancer. J. Pathol. 199:176–184.

    Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. CABIOS 13:555–556.

    Yang, Z., and R. Nielsen. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol. Biol. Evol. 17:32–43.

    Yang, Z., R. Nielsen, N. Goldman, and A. K. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431–449.

    Yang, Z., W. S. W. Wong, and R. Nielsen. 2005. Bayes Empirical Bayes inference of amino acid sites under positive selection. Mol. Biol. Evol. 22:1107–1118.

    Yi, S., D. L. Ellsworth, and W. H. Li. 2002. Slow molecular clocks in Old World monkeys, apes, and humans. Mol. Biol. Evol. 19:2191–2198.