DED: Database of Evolutionary Distances
Nucleic Acids Research, 2005, Database issue
     Institute of Molecular Evolutionary Genetics and Department of Biology, Pennsylvania State University, University Park, PA 16802, USA

    * To whom correspondence should be addressed at Department of Biology, Pennsylvania State University, 514 Mueller Lab, University Park, PA 16802, USA. Tel: +1 814 865 5025; Fax: +1 814 865 9366; Email: wojtek@psu.edu

    ABSTRACT

    A large database of homologous sequence alignments with good estimates of evolutionary distances can be a valuable resource for molecular evolutionary studies and phylogenetic research in particular. We recently created a database containing 159 921 transcripts from human, mouse, rat, zebrafish and fugu species. Approximately 16 000 homology groups were identified with the help of Ensembl homology evidence. At the macro-level, the database allows us to answer queries of the form:

    What is the average k-distance between 5' untranslated regions of human and mouse?

    List the 10 groups with the highest Ka/Ks ratio between mouse and rat.

    List all identical proteins between human and rat.

    Researchers interested in specific proteins can use a simple web interface to retrieve the homology groups of interest, examine all pairwise distances between members of the group and study the conservation of exon–intron gene structures using a graphical interface. The database is available at http://warta.bio.psu.edu/DED/.

    INTRODUCTION

    The previous decade in biology witnessed an unprecedented accumulation of molecular sequence data. However, as Sydney Brenner remarked, ‘The great challenge in biological research today is how to turn data into knowledge’ (1). Evolution, in spite of being recognized for decades as crucially important for understanding life, was until recently the most speculative area of biology. This situation has changed radically with the molecular approaches that are now possible thanks to the availability of large amounts of molecular sequence data. However, to be useful for evolutionary studies, sequences have to be carefully selected and grouped into homology clusters. This is the most important preparatory step, and the most tedious one, in any evolutionary analysis. For many analyses, homologous sequences have to be further classified as orthologous (i.e. sequences whose last common ancestor dates to a speciation event) or paralogous (i.e. sequences that were created by an ancestral gene duplication). This distinction is especially important for molecular phylogeny, because a species phylogeny can be inferred from a gene phylogeny only when the genes are orthologous. Interestingly, despite the vast amount of sequence data from different organisms, there have been surprisingly few large-scale gene comparison studies between different species or groups of organisms (2–6). Information on expected evolutionary distances or protein/gene identity between different organisms (e.g. human and zebrafish) or taxonomic groups (e.g. mammals and reptiles) is difficult to obtain. To fill this gap, we have created the Database of Evolutionary Distances (DED), which contains sequence information from several vertebrate species clustered into homology groups. It also includes multiple sequence alignments for both protein and nucleotide sequences, along with phylogenetic trees and a graphical representation of sequence relationships within each homology group. A large number of links to external databases makes further data exploration ‘as easy as a click of a mouse’.

    The DED should be useful for gene function assignment, molecular phylogenetic studies, searches for lateral gene transfer, reconstruction or identification of biochemical pathways in poorly characterized organisms and studies of sequence evolution patterns. Simple, yet powerful, web interfaces provide a convenient way to access the data. The results are displayed in easy-to-understand tabulated and/or graphical forms.

    SEQUENCE DATA

    The basic objects stored in our database are genes and their associated transcripts. For each gene, we maintain all its known transcript variants and for each transcript we store its sequence, coding region annotation and exon–intron structure. Currently, our database is based on Ensembl release 20 (7) of human, mouse, rat, zebrafish and fugu data (see Table 1). A total of 159 921 vertebrate transcripts stored in the database represent 126 842 unique genes clustered in homology groups (see later).

    Table 1. Number of genes and transcripts stored in the DED (August 2004)

    Based on the information retrieved from Ensembl, the gene and transcript objects in our database were cross-referenced with objects in external databases such as RefSeq, Pfam, GO, etc. As expected, the human genes and transcripts have the most external links associated with them (140 302), while those of zebrafish have the fewest (34 008). Surprisingly, rat records have relatively few external links (42 896), possibly reflecting the transient status of the rat genome annotation. Not surprisingly, Ensembl is the most frequently linked external database, followed by the EMBL database and LocusLink (for details see Table 2).

    Table 2. Number of external links present in the DED

    HOMOLOGY GROUPS

    Single-linkage clustering was used to create homology groups from pairwise homology information obtained through EnsMart (8). Overall, 16 127 groups were formed from 150 158 pairwise homology relations. Although not all species are present in each group, there are 8402 groups that contain transcripts from all five species. Several one-to-many homology relationships are annotated in Ensembl. In such cases, our use of single-linkage clustering results in homology groups that contain multiple genes from the same species. Figure 1 shows the distribution of group sizes. For each homology group, CLUSTAL W (9) is used to compute two multiple sequence alignments: one from the mRNA sequences and one from the amino acid sequences.

    Figure 1. The distribution of group sizes.
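
    The single-linkage step described above amounts to finding connected components of the pairwise homology graph. The following is a minimal sketch of that idea using a union-find structure; it is not the DED implementation, and the function name and gene identifiers are hypothetical.

```python
from collections import Counter, defaultdict


def single_linkage_groups(pairs):
    """Group gene ids into connected components of the homology graph."""
    parent = {}

    def find(x):
        # Locate the representative of x, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # a single homology link merges two groups

    groups = defaultdict(set)
    for gene in list(parent):
        groups[find(gene)].add(gene)
    return list(groups.values())


# Hypothetical pairwise homology relations (identifiers are made up).
pairs = [
    ("human_A", "mouse_A"),
    ("mouse_A", "rat_A1"),
    ("mouse_A", "rat_A2"),  # one-to-many: two rat genes end up in one group
    ("fugu_B", "zebrafish_B"),
]
groups = single_linkage_groups(pairs)
size_distribution = Counter(len(g) for g in groups)
```

    Because one link is enough to merge two groups, a single spurious homology relation can join otherwise distinct families, which is the behaviour illustrated by the group structure picture in the user interface.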

    The multiple sequence alignments are then stored in a compressed format within the MySQL database. Compression is achieved by noting that a gapped sequence belonging to an alignment can be reconstructed from the ungapped transcript (or protein) sequence already stored in the database if one knows the location of the gaps. Instead of storing a whole alignment, we store only the location and length of the gaps in each aligned sequence. This procedure results in a 100-fold reduction of the required storage.
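
    The gap-based compression can be illustrated with a short sketch. The representation below, a list of (start column, length) pairs per aligned row, is an assumption made for illustration; the actual storage layout used in the DED is not described in the text.

```python
def encode_gaps(gapped, gap_char="-"):
    """Return (start_column, length) for every run of gaps in an aligned row."""
    runs = []
    start = None
    for col, ch in enumerate(gapped):
        if ch == gap_char:
            if start is None:
                start = col
        elif start is not None:
            runs.append((start, col - start))
            start = None
    if start is not None:
        runs.append((start, len(gapped) - start))
    return runs


def decode_gaps(ungapped, runs, gap_char="-"):
    """Rebuild the aligned row from the stored sequence and its gap runs."""
    out = []
    pos = 0     # next residue of the ungapped sequence to copy
    column = 0  # current alignment column
    for start, length in runs:
        take = start - column
        out.append(ungapped[pos:pos + take])
        out.append(gap_char * length)
        pos += take
        column = start + length
    out.append(ungapped[pos:])
    return "".join(out)


row = "ATG--CC-A---"
runs = encode_gaps(row)
restored = decode_gaps(row.replace("-", ""), runs)
```

    Only `runs` needs to be stored alongside a reference to the sequence, so the saving grows with the length of the alignment and shrinks with the number of separate gap runs.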

    DISTANCE COMPUTATION

    In calculating distances, only the transcript with the longest coding region is taken into consideration. mRNA alignments are used for calculation of p and k distances of coding sequences and untranslated regions. We use Kimura's two-parameter model to compute k distances. If the coding regions do not align perfectly with each other, only the common part of each distinct mRNA region is considered for calculation.
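
    For reference, the p distance is the proportion of differing sites, and the Kimura two-parameter distance is K = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences. The sketch below computes both for one pair of aligned rows; skipping columns with gaps or ambiguous bases is an assumption of this example, and the function name is hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def p_and_k2p(seq1, seq2):
    """Return (p, k) for two aligned nucleotide rows of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("rows must come from the same alignment")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: ignore gaps and ambiguous bases
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    P = transitions / sites
    Q = transversions / sites
    arg1 = 1 - 2 * P - Q
    arg2 = 1 - 2 * Q
    if arg1 <= 0 or arg2 <= 0:
        return P + Q, float("nan")  # too divergent for the correction
    k = -0.5 * math.log(arg1 * math.sqrt(arg2))
    return P + Q, k


# Ten comparable sites: one transition (A/G) and one transversion (A/C).
p, k = p_and_k2p("ACGTACGTAC-", "GCGTCCGTACG")
```

    The corrected distance k is always at least as large as p, because it accounts for multiple substitutions at the same site.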

    Protein sequence alignments are used for protein identity calculations and serve as a template for the coding sequence alignment that is used in synonymous (Ks) and non-synonymous (Kn, written Ka in the example queries) distance calculations. Currently, Ks and Kn are obtained using the Nei–Gojobori method (10) as implemented in the PAML package (11). All other pairwise comparison analyses were carried out using Bioperl 1.4 modules (12).
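
    Using a protein alignment as a template for the coding sequence alignment means replacing each residue with its codon and each protein gap with three nucleotide gaps. The sketch below shows that step together with a simple identity calculation; the function names are hypothetical, and counting identity only over columns where both rows have a residue is an assumption, since the text does not state the denominator used.

```python
def thread_codons(aligned_protein, cds, gap_char="-"):
    """Project a gapped protein row onto its coding sequence, codon by codon."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    residues = sum(1 for ch in aligned_protein if ch != gap_char)
    if residues > len(codons):
        raise ValueError("coding sequence is too short for this protein")
    codon_iter = iter(codons)  # a trailing stop codon, if present, is unused
    out = []
    for ch in aligned_protein:
        out.append(gap_char * 3 if ch == gap_char else next(codon_iter))
    return "".join(out)


def protein_identity(row1, row2, gap_char="-"):
    """Fraction of identical residues over columns without a gap in either row."""
    compared = identical = 0
    for a, b in zip(row1, row2):
        if a == gap_char or b == gap_char:
            continue
        compared += 1
        identical += a == b
    return identical / compared if compared else 0.0


# Made-up example: second protein has an extra residue and one substitution.
prot1, prot2 = "MK-L", "MRAL"
cds1 = thread_codons(prot1, "ATGAAACTG")
cds2 = thread_codons(prot2, "ATGCGTGCTCTG")
identity = protein_identity(prot1, prot2)
```

    The resulting codon-aligned rows keep the reading frame intact, which is what a codon-based method such as Nei–Gojobori needs as input.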

    USER INTERFACES

    A simple search interface allows users to search the database by keyword or accession number from Ensembl (or other databases linked to Ensembl records such as Swiss-Prot, RefSeq, Gene Ontology, etc.). Genes matching the search criteria and the homology groups that they belong to are displayed. Clicking on the hyperlink for a homology group listed in the search results leads to a page with the full description of the group, consisting of seven sections (see Figure 2): (i) description of group members with links to external databases; (ii) pairwise comparison analysis results in a tabular format; (iii) pictorial representation of alignments mapped to exon–intron structures, which helps visualize conservation of gene structure; (iv) protein alignment; (v) mRNA alignment; (vi) phylogenetic tree; (vii) group structure picture, which shows the pairwise homology relationships that resulted in the construction of the group (Figure 3 shows a case where one possibly false homology relationship resulted in the merging of two distinct homology groups).

    Figure 2. Sample homology group details. The member section has been truncated. Note that while the proteins are 100% identical, the alignment picture shows that the gene structure is not—there appears to be an intron gain in the rat lineage.

    Figure 3. Group structure and phylogenetic tree for a homology group. Pairwise comparison analysis suggests that the homology relationship between fugu and zebrafish genes can be ignored and the group split into two smaller groups.

    By default, only the description of group members and pairwise comparison results are shown. User preferences stored in a cookie are used to determine the set of sections to be shown.

    A more elaborate accession search interface can be used for larger scale analyses. It enables calculation of some evolutionary parameters at a global scale (i.e. it summarizes results for a selected group of genes or, if no limit is specified, for all genes present in the database). Extensive filtering options allow a user to restrict analysis to alignments that satisfy certain length and similarity constraints. This helps avoid some statistical biases due to data sampling artefacts or erroneous comparison of paralogous genes. The summary of overall evolutionary statistics, shown in Figure 4, is in agreement with published literature (2,3,13–15).

    Figure 4. Overall evolutionary statistics with mean and standard deviation shown for all distances. Analysis was restricted to pairs with direct homology evidence. UTR comparisons were made only when UTR size was at least 30 bp. Alignments in which start (or stop) codons were separated by more than 20 columns were ignored.
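
    The filtering and summary described for Figure 4 can be sketched as follows. The record fields and all numeric values are invented for illustration and do not come from the DED; only the thresholds (30 bp for UTRs, 20 alignment columns for start and stop codons) are taken from the caption, and requiring both UTRs of a pair to meet the length threshold is an assumption.

```python
from statistics import mean, stdev

MIN_UTR_BP = 30        # threshold from the Figure 4 caption
MAX_CODON_OFFSET = 20  # alignment columns, from the Figure 4 caption


def keep_pair(rec):
    """Keep pairs with direct homology evidence and well-aligned CDS ends."""
    return (rec["direct_homology"]
            and rec["start_offset"] <= MAX_CODON_OFFSET
            and rec["stop_offset"] <= MAX_CODON_OFFSET)


def cds_summary(records):
    values = [r["k_cds"] for r in records if keep_pair(r)]
    return mean(values), stdev(values)


def utr5_summary(records):
    values = [r["k_utr5"] for r in records
              if keep_pair(r) and min(r["utr5_lengths"]) >= MIN_UTR_BP]
    return mean(values), stdev(values)


# Hypothetical pairwise records (values are made up).
records = [
    {"direct_homology": True, "start_offset": 0, "stop_offset": 2,
     "k_cds": 0.10, "k_utr5": 0.30, "utr5_lengths": (120, 95)},
    {"direct_homology": True, "start_offset": 5, "stop_offset": 0,
     "k_cds": 0.20, "k_utr5": 0.50, "utr5_lengths": (25, 80)},   # UTR too short
    {"direct_homology": True, "start_offset": 40, "stop_offset": 0,
     "k_cds": 0.90, "k_utr5": 0.80, "utr5_lengths": (90, 90)},   # start codons far apart
    {"direct_homology": False, "start_offset": 0, "stop_offset": 0,
     "k_cds": 0.70, "k_utr5": 0.60, "utr5_lengths": (90, 90)},   # indirect homology only
    {"direct_homology": True, "start_offset": 0, "stop_offset": 0,
     "k_cds": 0.30, "k_utr5": 0.40, "utr5_lengths": (60, 70)},
]
cds_mean, cds_sd = cds_summary(records)
utr_mean, utr_sd = utr5_summary(records)
```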

    This interface makes it convenient to verify published results regarding evolutionary rates of groups of proteins. For instance, it was shown in a recent study that sperm-specific proteins evolve at a faster rate than other proteins (16). The paper listed either the RefSeq id or the EMBL accession number for each of the analyzed proteins. By entering the RefSeq ids in one entry box and the EMBL accession numbers in another entry box, one can confirm these results in seconds in the accession search page.

    CONCLUSIONS AND PERSPECTIVES

    Evolutionary analysis is a key step in many biological investigations, from classical systematics to comparative genomics and bioinformatics. Very often, researchers are interested in knowing how the results of a comparison of a single gene or set of genes fit a ‘global picture’. However, such global information is hard to obtain or does not exist. To fill this gap, we have created the DED, which contains sequence information from several vertebrate species clustered into homology groups. This database should be useful in a wide range of biological investigations, including gene function assignment, molecular phylogenetic studies and studies of sequence evolution patterns.

    Our database depends on other primary databases for sequence, structure and homology information. However, because of the extensive post-processing involved, it is not possible to update our database and keep it synchronized with the primary (source) databases at all times. At present, we plan to update the DED at least twice a year and add new genomes at the time of scheduled updates. In addition, we also plan to add sequence information from organisms whose genomes are not yet fully sequenced.

    REFERENCES

    1. Brenner,S. (2002) Ontology recapitulates philology. Scientist, 16, 12.

    2. Makalowski,W., Zhang,J. and Boguski,M.S. (1996) Comparative analysis of 1196 orthologous mouse and human full-length mRNA and protein sequences. Genome Res., 6, 846–857.

    3. Makalowski,W. and Boguski,M.S. (1998) Evolutionary parameters of the transcribed mammalian genome: an analysis of 2,820 orthologous rodent and human sequences. Proc. Natl Acad. Sci. USA, 95, 9407–9412.

    4. Mushegian,A.R., Garey,J.R., Martin,J. and Liu,L.X. (1998) Large-scale taxonomic profiling of eukaryotic model organisms: a comparison of orthologous proteins encoded by the human, fly, nematode, and yeast genomes. Genome Res., 8, 590–598.

    5. Wheelan,S.J., Boguski,M.S., Duret,L. and Makalowski,W. (1999) Human and nematode orthologs—lessons from the analysis of 1800 human genes and the proteome of Caenorhabditis elegans. Gene, 238, 163–170.

    6. Glazko,G.V. and Nei,M. (2003) Estimation of divergence times for major lineages of primate species. Mol. Biol. Evol., 20, 424–434.

    7. Birney,E., Andrews,T.D., Bevan,P., Caccamo,M., Chen,Y., Clarke,L., Coates,G., Cuff,J., Curwen,V., Cutts,T. et al. (2004) An overview of Ensembl. Genome Res., 14, 925–928.

    8. Kasprzyk,A., Keefe,D., Smedley,D., London,D., Spooner,W., Melsopp,C., Hammond,M., Rocca-Serra,P., Cox,T. and Birney,E. (2004) EnsMart: a generic system for fast and flexible access to biological data. Genome Res., 14, 160–169.

    9. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.

    10. Nei,M. and Gojobori,T. (1986) Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol., 3, 418–426.

    11. Yang,Z. (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci., 13, 555–556.

    12. Stajich,J.E., Block,D., Boulez,K., Brenner,S.E., Chervitz,S.A., Dagdigian,C., Fuellen,G., Gilbert,J.G., Korf,I., Lapp,H. et al. (2002) The Bioperl toolkit: Perl modules for the life sciences. Genome Res., 12, 1611–1618.

    13. Waterston,R.H., Lindblad-Toh,K., Birney,E., Rogers,J., Abril,J.F., Agarwal,P., Agarwala,R., Ainscough,R., Alexandersson,M., An,P. et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520–562.

    14. Aparicio,S., Chapman,J., Stupka,E., Putnam,N., Chia,J.M., Dehal,P., Christoffels,A., Rash,S., Hoon,S., Smit,A. et al. (2002) Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science, 297, 1301–1310.

    15. Gibbs,R.A., Weinstock,G.M., Metzker,M.L., Muzny,D.M., Sodergren,E.J., Scherer,S., Scott,G., Steffen,D., Worley,K.C., Burch,P.E. et al. (2004) Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature, 428, 493–521.

    16. Torgerson,D.G., Kulathinal,R.J. and Singh,R.S. (2002) Mammalian sperm proteins are rapidly evolving: evidence of positive selection in functionally diverse genes. Mol. Biol. Evol., 19, 1973–1980.

    (Vamsi Veeramachaneni and Wojciech Makalowski)