Nucleic Acids Research, 2004, Issue 18
‘Conserved hypothetical’ proteins: prioritization of targets for exper
     National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

    * To whom correspondence should be addressed. Tel: +1 301 435 5913; Fax: +1 301 435 7794; Email: koonin@ncbi.nlm.nih.gov

    ABSTRACT

    Comparative genomics shows that a substantial fraction of the genes in sequenced genomes encodes ‘conserved hypothetical’ proteins, i.e. those that are found in organisms from several phylogenetic lineages but have not been functionally characterized. Here, we briefly discuss recent progress in functional characterization of prokaryotic ‘conserved hypothetical’ proteins and the possible criteria for prioritizing targets for experimental study. Based on these criteria, the chief one being wide phyletic spread, we offer two ‘top 10’ lists of highly attractive targets. The first list consists of proteins for which biochemical activity could be predicted with reasonable confidence but the biological function was predicted only in general terms, if at all (‘known unknowns’). The second list includes proteins for which there is no prediction of biochemical activity, even if, for some, general biological clues exist (‘unknown unknowns’). The experimental characterization of these and other ‘conserved hypothetical’ proteins is expected to reveal new, crucial aspects of microbial biology and could also lead to better functional prediction for medically relevant human homologs.

    INTRODUCTION

    Over the last decade, more than 150 complete genomes of diverse bacteria, archaea and eukaryotes have been sequenced, and many more are currently in the pipeline (1,2). The availability of complete genome sequences has led to radical changes in life sciences, from molecular biology to entomology and plant biology. A popular phrase now refers to the pre-genome era as the ‘dark ages’ (3). However, it is well known that, in any newly sequenced bacterial genome, as many as 30–40% of the genes do not have an assigned function (4). This figure is even higher for archaeal and eukaryotic genomes and for the relatively large genomes of bacteria with a complex life style, such as Anabaena, Streptomyces or Pirellula (5–8). Remarkably, species- or genus-specific genes comprise a relatively small fraction of the uncharacterized genes; the majority of such ‘hypothetical’ genes have a wider phyletic distribution and therefore are usually referred to as ‘conserved hypothetical’ (9,10).

    Although species-specific ‘ORFans’ should not be neglected (11–14), ‘conserved hypothetical’ proteins pose a challenge not just to functional genomics, but also to biology in general (9). As long as there are hundreds of conserved proteins of unknown function even in model organisms, such as Escherichia coli, Bacillus subtilis or the yeast Saccharomyces cerevisiae, any discussion of a ‘complete’ understanding of these organisms as biological systems will remain in the realm of wishful thinking. Indeed, elementary logic suggests that, before attempting to disentangle the riveting complexity of interactions between the parts of biological machines and to develop theoretical and experimental models of these machines, it is necessary to gain a basic understanding of the role of each part. Although it appears likely that the central pathways of information processing and metabolism are already known, crucial elements of these systems could still be lurking among the ‘conserved hypotheticals’, and important mechanisms of signaling and stress response, in all likelihood, remain to be uncovered. The recent discoveries of several nearly universal but previously uncharacterized tRNA modification enzymes (15–17), of the deoxyxylulose pathway (18,19), and of the central role of cyclic diguanylate in bacterial signaling (20,21) emphasize how much remains to be learnt (for additional examples see Table 1). In addition, it is important to note that, for numerous experimentally characterized enzymes, there is still no available sequence information (22); the ‘conserved hypothetical’ genes are the pool where biologists can fish for these ‘homeless’ activities. Furthermore, there is an important ‘side benefit’: it is often easier to conduct experiments with bacterial than with eukaryotic proteins, and functional characterization of microbial ‘conserved hypothetical’ proteins may facilitate prediction and subsequent experimental study of their human homologs.

    Table 1. Some recently characterized ‘conserved hypothetical’ proteins

    In the epilog to our 2002 book on comparative and evolutionary genomics, we noted that systematic identification of functions of genes that are conserved in many genomes but remain uncharacterized experimentally is one of the most tantalizing opportunities offered by the recent progress in genome sequencing and analysis (1) (http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=sef). More recently, Richard Roberts (10) issued a call for community action in identifying functions of ‘conserved hypothetical’ proteins. He stressed the need for ‘a consortium of bioinformaticians to produce a list of all of the conserved hypothetical proteins that are found in multiple genomes, to carry out the best possible bioinformatics analysis, and then to offer those proteins to the biochemical community as potential targets for research’ (10). Here, we briefly discuss some recent experimental studies on functional characterization of ‘conserved hypothetical’ proteins and outline possible criteria for selecting the priority targets. Based on these criteria, we offer a ‘top 10’ list of proteins for which a general biochemical but not biological function could be confidently predicted and another ‘top 10’ list of proteins that lack any functional assignment at this time. We hope that this effort will stimulate discussion on the prioritization criteria and help attract the attention of biologists worldwide to the challenge of ‘conserved hypothetical’ proteins.

    Recent progress

    The families of ‘conserved hypothetical’ proteins can be easily identified by similarity-based clustering or using more complex approaches (24,25). Representative collections of such protein families are available in several public databases. The PROSITE database (26) contains an Uncharacterized Protein Families (UPF) list, http://www.expasy.org/cgi-bin/lists?upflist.txt, which currently includes 295 families of proteins from the SWISS-PROT database (27); some come with a family-specific sequence pattern (family signature). The latest release of the Pfam database (28) includes 1418 Domains of Unknown Function (DUFs), http://www.sanger.ac.uk/Software/Pfam/browse/DUF.shtml; each entry comes with a multiple sequence alignment, HMM sequence logo, species distribution and a phylogenetic tree of its members. The latest release of the Clusters of Orthologous Groups of proteins (COG) database (29) includes 2009 uncharacterized COGs (see Supplementary Material). The abundance of UPFs makes their study a formidable task. There is a clear need for rational criteria that would allow sorting these protein families and selecting the most important ones, i.e. prioritizing the targets for experimental studies; two obvious criteria are the number of proteins in the family and its phyletic spread.

    Since the advent of comparative genomics, wide (better yet, universal) phylogenetic distribution and indispensability for cell growth have been taken into consideration by some researchers when choosing uncharacterized genes for experimental study (30).

    A significant positive correlation between the phyletic spread of a gene and the likelihood that it is essential for cell growth has been demonstrated (31). On a number of occasions, experiments with proteins that met one or both of these criteria led to major discoveries. For example, a structural genomics project yielded the three-dimensional (3D) structure and led to subsequent functional characterization of the Methanococcus jannaschii protein MJ0226 (32), a member of the widely distributed HAM1 protein family (Table 1), whose only known function until then had been modulation of sensitivity to 6-N-hydroxylaminopurine mutagenesis in yeast (33). Characterization of this protein as an XTP- and ITP-specific pyrophosphatase immediately explained its role in mutagenesis control. Moreover, this protein perfectly fits the description of the ITP pyrophosphatase (ITPase) from human erythrocytes (EC 3.6.1.19) that had been first reported in 1964 (34), purified and extensively characterized five years later (35), but had never been identified with a gene. Based on the sequence of the MJ0226 protein, its human homolog has been characterized and shown to account for the ITPase activity in humans (36). Furthermore, although mutations in the ITPase gene did not seem to have a clear disease phenotype, ITPase deficiency has been associated with adverse reactions to the purine analog azathioprine, which is used as an immunosuppressant in the treatment of cancer and inflammatory bowel disease (37). This example shows how a supposedly arcane study of an archaeal ‘conserved hypothetical’ protein can have immediate consequences for understanding human physiology and might be relevant for human health.

    The recent identification of NAD kinase (EC 2.7.1.23) follows a similar pattern. Again, the enzyme had been experimentally characterized many years earlier, both in avian tissues and in yeast (38,39), but the cognate gene remained unknown. Again, the characterization of the bacterial enzyme allowed assigning this function to a family of previously uncharacterized ‘conserved hypothetical’ proteins (40). Finally, studies on the bacterial enzyme paved the way for the identification of the orthologous enzyme in humans and in yeast (41,42).

    Of course, it would be wrong to assume that functional characterization of a ‘conserved hypothetical’ protein would always turn up a previously described enzymatic activity. In many cases, the underlying biology and/or biochemistry could be unknown or at least not properly appreciated. Thus, the now-famous case of the identification of the product of E.coli hemK gene as a glutamine N5-methyltransferase of peptide release factors (43,44) pointed out the importance of this post-translational modification that had been previously largely overlooked (10,45). The orthologs of HemK in humans and other eukaryotes are still annotated (without experimental support) as DNA methyltransferases; glutamine methylation in eukaryotic proteins remains to be investigated.

    Similarly, the recent recognition of the roles of the suf genes in the assembly of the iron–sulfur clusters in bacteria (46,47) has important implications for understanding the functioning of chloroplasts, where these processes seem to be basically the same (48). As in the case of HemK, the original annotation of SufC as ‘ABC-type transporter ATPase’ turned out to be less than precise: although SufC is certainly an ATPase of the ABC-type ATPase family, it does not seem to participate in transport (49).

    Several recent discoveries made in the course of characterization of conserved hypothetical genes are listed in Table 1.

    Criteria for target prioritization

    Phyletic distribution

    Different researchers will have widely different priorities when it comes to selecting targets for experimental analysis among the conserved hypothetical genes. In general, however, it is hard to deny that ubiquitous or nearly ubiquitous genes are particularly attractive. The number of genes that are conserved in all cellular life forms (as far as can be judged from the current collection of sequenced genomes) is very small, only 60 (50), and, to our knowledge, only two of these remain experimentally uncharacterized. It seems logical that these genes should be on top of the functional genomics' ‘deck of cards’ (Table 2). Almost all known ubiquitous genes encode proteins involved in translation and transcription (50). This and additional considerations suggest that the remaining unknowns in this elite group of genes have important roles in the same or closely associated processes (see below).

    Table 2. A ‘top 10’ list of ‘known unknowns’

    A considerably greater number of conserved hypothetical genes are either ‘nearly ubiquitous’, i.e. are missing only in some small genomes, primarily those of parasites, or are universal within one or more of the major divisions of life, e.g. in bacteria or archaea. These genes are also likely to perform crucial functions in the organisms that have them. Analysis of phyletic patterns of genes, however, provides research incentives beyond the straightforward ‘the more the merrier’. Genes with relatively patchy distribution that are, however, found in diverse bacteria, archaea and eukaryotes could be as interesting targets for experimental characterization as the (nearly) ubiquitous ones. Owing to the common phenomenon of non-orthologous gene displacement (51), when the same function in different organisms is performed by two or more unrelated proteins, a substantial fraction of such genes, while not ubiquitous themselves, still might be responsible for universal functions. Moreover, genes that show phyletic patterns (partially) complementary to those of known essential genes can be predicted to perform the same function (52). The recent discoveries of an alternative lysyl-tRNA synthetase, thymidylate synthase, shikimate kinase and several enzymes of thiamine biosynthesis are excellent proofs of this principle (53–56). Holes in known metabolic pathways, when a subset of sequenced genomes do not seem to encode familiar enzymes for some of the reactions comprising a pathway whereas the majority of the enzymes are present, remain numerous (1,57–59). Similar holes in signal transduction pathways and other functional systems are harder to identify but there is little doubt that many exist. The conserved hypothetical proteins are the pool from which these missing molecular parts of biological systems can and (we believe) should be ‘fished out’, partly with the aid of systematic computational prediction and partly by trial and error.
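    The notion of complementary phyletic patterns can be illustrated with a minimal sketch. It assumes that each gene family is summarized as the set of genomes in which it occurs; the genome labels, the family sets and the `complementarity` helper are hypothetical and are not taken from the COG database or from the studies cited above. A candidate that covers exactly the genomes lacking a known essential enzyme (high coverage, low overlap) is a plausible case of non-orthologous gene displacement.

```python
def complementarity(pattern_a, pattern_b, genomes):
    """Compare two phyletic patterns over a fixed set of genomes.

    Returns (coverage, overlap):
      coverage - fraction of genomes that encode at least one of the two families
      overlap  - fraction of genomes that encode both families
    """
    a, b, all_genomes = set(pattern_a), set(pattern_b), set(genomes)
    coverage = len((a | b) & all_genomes) / len(all_genomes)
    overlap = len(a & b & all_genomes) / len(all_genomes)
    return coverage, overlap


# Hypothetical toy data: five genomes and three gene families.
genomes = ["G1", "G2", "G3", "G4", "G5"]
known_enzyme = {"G1", "G2", "G3"}   # characterized essential enzyme
candidate = {"G4", "G5"}            # uncharacterized family filling the gap
unrelated = {"G1", "G2"}            # family that co-occurs with the enzyme

print(complementarity(known_enzyme, candidate, genomes))  # (1.0, 0.0)
print(complementarity(known_enzyme, unrelated, genomes))  # (0.6, 0.4)
```

    In practice, partial complementarity is the rule, so any threshold on coverage and overlap would be a judgment call rather than a fixed cutoff.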

    Conserved hypothetical genes that are represented more narrowly could be excellent targets for those researchers who are interested in the functional determinants of particular phenotypes or individual taxa. For example, the set of genes that appear to be specifically associated with the hyperthermophilic phenotype includes a sizable fraction of conserved hypotheticals (60). Similarly, conserved hypotheticals that are shared by cyanobacteria and plants, but rarely found in other bacteria, could play a role in photosynthesis and would be useful models for photosynthetic studies (61). Parochial considerations, which suggest a different view of phyletic patterns of genes, cannot be overlooked either. Microbial genes that have orthologs in eukaryotes, especially in humans, certainly have added attraction as targets for experimental study; it is often easier to characterize microbial genes both computationally (especially by genomic context, given the operon organization of prokaryotic genes) and experimentally.

    Essentiality

    Owing to the recent genome-wide analyses of knockout phenotypes in several organisms, direct data supporting biological importance (in many cases, even essentiality) are available for numerous widespread genes (62–68). Often, members of the same protein family are recognized as essential for survival in more than one organism (Tables 1 and 2), confirming that these proteins, including uncharacterized or poorly characterized ones, play crucial roles in the cell. However, essentiality for growth is a complex phenomenon (67,69). Studies of auxotrophic mutants show that many genes that are essential for survival in a minimal (nutrient-poor) medium become dispensable for growth in a rich medium. Similarly, mechanisms that are essential for maintaining cell integrity in a hypo-osmotic medium become dispensable for survival in an iso-osmotic medium (70). Furthermore, in the general case, it is a function that is essential for the survival of an organism rather than a particular gene. Partial redundancy is common even in organisms with relatively small genomes, such that two paralogous or analogous (unrelated) genes can each provide an essential function and, accordingly, are synthetically lethal, even as the knockout of each of them is not (71). Finally, examples of the ITPase and rhomboid-like serine protease (Table 1) clearly show that certain genes might have important functions even if their knockout is non-lethal. However, the absence of an easily observable phenotype is a major obstacle to deciphering the functions of many conserved hypothetical genes (Table 3).

    Table 3. A ‘top 10’ list of ‘unknown unknowns’

    Protein structure

    The availability of a 3D structure (X-ray or NMR) can be used as an additional criterion for prioritization. Structural genomics projects keep churning out new structures at an ever-increasing pace (72,73). As a result, many families of conserved hypothetical proteins already have one or more representatives of a known 3D structure (Tables 2 and 3). Although knowledge of the 3D structure rarely allows unequivocal functional prediction, it often provides valuable clues that substantially narrow down the range of possible functions (74). For example, the demonstration that members of the vast superfamily of proteins homologous to the universal stress protein A of E.coli have a distinct nucleotide-binding fold and, indeed, bind nucleotides (75,76) suggests a specific range of functions for the numerous uncharacterized members of this superfamily (77) and is likely to channel experimental studies.

    Expression and binding information

    Another criterion for prioritization is the availability of experimental data on the expression of a given gene or binding properties of its protein product. The proliferation of whole-genome expression and protein–protein interaction studies often provides evidence that a particular ‘conserved hypothetical’ protein is overexpressed under certain conditions (nutritional or oxidative stress, ultraviolet irradiation, and so on) (65,78–81). Although this rarely furnishes direct indications of the protein's function, its likely involvement in the given stress response could make it a priority target for researchers in the respective field. Similarly, involvement of an uncharacterized protein in protein–protein interaction with a particular signaling or regulatory protein or its ability to bind RNA or DNA would attract attention of researchers studying these regulators or responses.

    Other criteria

    Obviously, prediction of biochemical function is an important guide, e.g. for projects aimed at functional characterization of enzymes with a particular activity (such as protein kinases or phosphatases) encoded in a given genome. Last but not least, practical considerations, such as whether a protein is likely to be soluble, highly expressed and easily purified, can by no means be dismissed. Only experiment will be the final judge, but indications of a protein's behavior can often be gleaned from its size and properties predicted by sequence analysis.

    ‘Known unknowns’ and ‘unknown unknowns’

    As the list of the ‘conserved hypothetical’ proteins keeps growing, comparative genomics can help in identifying (based on the criteria outlined above and, undoubtedly, additional ones) the most interesting proteins in each genome and, in many cases, constructing reasonable and testable hypotheses on their functions. As noted previously (9), when an open reading frame is annotated as a ‘conserved hypothetical’ protein, this does not necessarily mean that the function of its product is completely unknown, let alone that its very existence is questionable. Indeed, certain exceptions notwithstanding, if a gene is conserved in several sufficiently different genomes, it is not really hypothetical anymore. A general prediction of its function often can be made based on a conserved protein sequence motif, subtle sequence similarity to previously characterized proteins or the presence of diagnostic structural features. Many ‘conserved hypothetical’ proteins can be confidently predicted to be ATPases, GTPases, methyltransferases, metalloproteases, DNA- or RNA-binding proteins or membrane transporters (1,29,82–84). Additional hints regarding protein functions come from genome context analysis (known and predicted operons and more complex gene neighborhoods), similar or complementary phyletic patterns, domain fusions and protein–protein interaction data (52,85–92). Many databases and tools for gene context analysis are available on the web (93–98). Homology-based methods typically result in the prediction of structural properties and biochemical activity, often only generally defined (e.g. a phosphatase or kinase of unknown specificity). In contrast, context-based methods usually suggest a general biological function (e.g. participation in cell division) but explain little or nothing about the mechanistic role of the given protein in this process. 

    To better define these two groups of proteins, which differ in terms of the nature of the attainable functional prediction, we refer to those that have at least a general prediction of a biochemical function as ‘known unknowns’. In contrast, those proteins that have no assignment of biochemical function are referred to as ‘unknown unknowns’, even when they have been associated with a certain biological process through genome context analysis or some indirect experimental evidence.
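
    The distinction drawn here can be written down as a simple decision rule. The sketch below is only an illustration: the `Family` record and the `classify` function are hypothetical, and the flags set for YchF and YebC paraphrase how this article describes those two families rather than any database field.

```python
from dataclasses import dataclass


@dataclass
class Family:
    name: str
    biochemical_activity_predicted: bool  # e.g. a GTPase of unknown specificity
    biological_process_predicted: bool    # e.g. a link inferred from gene context


def classify(family: Family) -> str:
    # Only the biochemical prediction decides the group; a biological clue
    # alone does not move a family out of the 'unknown unknowns'.
    if family.biochemical_activity_predicted:
        return "known unknown"
    return "unknown unknown"


# Flags follow the descriptions of these families given in this article.
ychf = Family("YchF", biochemical_activity_predicted=True,
              biological_process_predicted=True)
yebc = Family("YebC", biochemical_activity_predicted=False,
              biological_process_predicted=True)

print(classify(ychf))  # known unknown
print(classify(yebc))  # unknown unknown
```

    Note that the biological-process flag is deliberately ignored by the rule, which mirrors the definition above.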

    In the latest release of the COG database, among 4873 COGs, 668 should be considered ‘known unknowns’ (biochemical activity but not biological function predicted) and 1341 have the status of ‘unknown unknowns’ (see Supplementary Material). Some of the more attractive targets for detailed experimental studies in these two categories are listed in Tables 2 and 3. A degree of subjective judgment is inevitable here. We chose not to apply a single criterion but to hand-pick interesting uncharacterized proteins on the basis of a combination of the criteria discussed above; the lists of ‘known unknowns’ and ‘unknown unknowns’ from the COG database, ranked in the order of decreasing number of species in which they are represented, are available for formal target selection in the Supplementary Material. The criteria employed for the selection of targets listed in Tables 2 and 3 were as follows: (i) wide phyletic distribution, i.e. presence in several distinct phylogenetic lineages (with one exception, these protein families are encoded in representatives of all three domains of life, bacteria, archaea and eukaryotes); (ii) presence in model or otherwise interesting organisms (in particular, we chose to include in Tables 2 and 3 only those genes that have orthologs in humans); (iii) apparent essentiality in at least some organisms; involvement in a fundamental biological function suggested by homology-based and/or context analysis; and (iv) small number of paralogs (which translates into greater confidence in any functional prediction). To illustrate the complexities involved in predicting functions of these proteins, we discuss the top four proteins in each group in greater detail.
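
    As a rough illustration of formal target selection, the sketch below ranks families by decreasing number of species and applies some of the listed criteria as mechanical filters. It is not the procedure used to build Tables 2 and 3, which were hand-picked; the `Family` record, the toy families and the `paralog_limit` threshold are all hypothetical. The final lines only restate the counts given above (668 + 1341 = 2009 uncharacterized COGs out of 4873).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Family:
    name: str
    species: frozenset    # species in which the family is represented
    domains: frozenset    # subset of {"bacteria", "archaea", "eukaryotes"}
    human_ortholog: bool
    max_paralogs: int     # largest number of paralogs in any one genome


ALL_DOMAINS = frozenset({"bacteria", "archaea", "eukaryotes"})


def rank_by_spread(families):
    """Sort by decreasing number of species; ties are broken by name."""
    return sorted(families, key=lambda f: (-len(f.species), f.name))


def shortlist(families, paralog_limit=2):
    """Keep families found in all three domains, with a human ortholog and
    few paralogs (paralog_limit is an arbitrary illustrative threshold)."""
    kept = [f for f in families
            if f.domains == ALL_DOMAINS
            and f.human_ortholog
            and f.max_paralogs <= paralog_limit]
    return rank_by_spread(kept)


# Hypothetical toy families.
fam_a = Family("fam_a", frozenset({"s1", "s2", "s3", "s4", "s5"}), ALL_DOMAINS, True, 1)
fam_b = Family("fam_b", frozenset({"s1", "s2", "s3"}), ALL_DOMAINS, True, 1)
fam_c = Family("fam_c", frozenset({"s1", "s2", "s3", "s4", "s5", "s6"}),
               frozenset({"bacteria"}), False, 1)
families = [fam_a, fam_b, fam_c]

print([f.name for f in rank_by_spread(families)])  # ['fam_c', 'fam_a', 'fam_b']
print([f.name for f in shortlist(families)])       # ['fam_a', 'fam_b']

# Counts quoted in the text: share of uncharacterized COGs.
uncharacterized = 668 + 1341
print(uncharacterized, round(uncharacterized / 4873, 3))  # 2009 0.412
```

    Essentiality and context-based evidence are left out of the filter because they are hard to reduce to a single flag, which is one reason a degree of manual judgment remains.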

    ‘Known unknowns’

    The list of genes in Table 2, which is more informative thanks to the prediction of biochemical activities, includes proteins with predicted diverse fundamental functions. It is no accident that this list is dominated by proteins implicated in translation because this is the function of the majority of ubiquitous or nearly ubiquitous genes (50). Thus, members of the O-sialoglycoprotease family (Table 2) are homologous to a single experimentally characterized protein, a neutral metalloprotease from Pasteurella haemolytica that has been reported to cleave O-glycosylated but not N-glycosylated or non-glycosylated proteins (100). Sequence analysis of this protein family revealed a conserved fold and a characteristic ATP-binding site, which are typical of proteins of the actin/HSP70 superfamily, suggesting that these proteins are metal- and ATP-dependent proteases (101). Genome context data show an association (divergent operon) with the rpsU gene encoding ribosomal protein S21 in many enterobacteria and fusion to a Ser/Thr protein kinase domain in some archaea. All this evidence notwithstanding, it is hard to predict the exact function of this protein; gene neighborhood analysis and presence in all completely sequenced genomes suggest association with translation, e.g. co-translational degradation of misfolded proteins (102).

    YchF proteins, members of the second universally represented family of uncharacterized proteins (DUF933 in Pfam), contain a typical GTP-binding N-terminal domain and are confidently predicted to have GTPase activity (Table 2) (103). Structures of two members of this family, Haemophilus influenzae protein HI0393 and Schizosaccharomyces pombe protein Spac27e2.03c, have been solved (PDB entries 1jal and 1ni3, respectively) and revealed a three-domain organization, with a putative nucleic acid-binding domain and a flexible hinge, in addition to the GTPase domain (104). The combination of predicted GTPase activity, universal phylogenetic distribution, coexpression with peptidyl-tRNA hydrolase and the experimentally demonstrated ability to bind double-stranded nucleic acids (104) strongly suggests that YchF proteins function as GTP-dependent translation factors. Owing to the preference given to genes with a wide phyletic spread, it is not surprising that several proteins in our list of ‘known unknowns’ are implicated in translation and associated functions. Notably, the list includes two more predicted GTPases, which are expected to function as translation factors (Table 2) (103).

    The nearly ubiquitous YrdC protein is the ortholog of the yeast SUA5 gene product that has been identified as a suppressor of a translational defect in cytochrome c production (105). A subsequent genetic study showed that mutation of the E.coli yrdC gene caused a defect in 16S rRNA maturation (106). These experiments linked YrdC with translation and ribosomal biogenesis but would place this protein into the ‘unknown unknowns’ category. However, once the structure of the protein had been determined, revealing a novel fold, and its nucleic acid-binding properties, with a preference for double-stranded RNA, had been demonstrated, YrdC became a ‘known unknown’, an RNA-binding protein associated with ribosomal maturation and function (107).

    Two recently characterized families of widespread, essential proteins turned out to be enzymes involved in cofactor biosynthesis, namely, NAD kinase and dephospho-CoA kinase (Table 1). One could expect that at least some of the remaining uncharacterized proteins also will be metabolic enzymes. In particular, the widespread YbeM family belongs to the nitrilase superfamily (108), which includes experimentally characterized amidases, hydrolyzing the C–N bond in various substrates. Some of the members of the YbeM family are fused to the NadE domain, which accounts for the glutamine-dependent NAD synthetase activity in humans and yeast, as opposed to the NH3-dependent NAD synthetase activity of the NadE domain alone (109). Apparently, some YbeM-like nitrilases can provide the glutaminase component of NAD synthetase in trans (when they are not fused in a single polypeptide chain) by forming a complex with NadE subunits (K. Shatalin and A. Osterman, personal communication). This opens the intriguing possibility that all or at least the majority of the proteins currently annotated as NH3-dependent NAD synthetases function as glutamine-dependent enzymes in vivo and only become NH3-dependent when stripped of their YbeM-like subunits.

    ‘Unknown unknowns’

    Families of ‘conserved hypothetical’ proteins listed in Table 3 are remarkable in that none of them has a recognizable experimentally characterized homolog. Many of them do not seem to be essential for cell growth (Table 3). Hence, phyletic spread had to be the principal criterion for ranking these targets; additional indications of their importance come from the still sparse genome-context data.

    Although the function of YebC proteins (UPF0082 and DUF28), the most widespread of the ‘unknown unknowns’, remains unknown, this family has been extensively characterized from the structural perspective. The X-ray structures of YebC proteins from Aquifex aeolicus, E.coli and Helicobacter pylori have been solved (PDB entries 1lfp, 1kon and 1mw7, respectively). Structural analysis revealed a large cavity with a predominance of negatively charged residues on the surface of this protein (110). Given the strong contextual association of the yebC gene in various bacteria with ruvC and other ruv genes, which encode subunits of the Holliday junction resolvase, the YebC protein might be involved in DNA recombination and repair, perhaps as an auxiliary subunit of the resolvasome. However, eukaryotes, which lack the RuvABC resolvasome, have highly conserved orthologs of YebC which are predicted to localize to the mitochondria (M. Y. Galperin and E. V. Koonin; unpublished data). Therefore, the proteins of this COG might have different functions in bacteria and in eukaryotes.

    The second most common ‘unknown unknown’ protein, NIF3, was originally described in yeast as an NGG1-interacting protein (111) and was later identified in humans and mice as a protein involved in transcriptional regulation of neural differentiation (112,113). Although its exact function remains unknown, this gene has been indirectly implicated in such diseases as juvenile amyotrophic lateral sclerosis (ALS2) and Williams–Beuren syndrome (114,115). The function of this protein in bacteria and archaea remains enigmatic.

    The iojap (Iowa-japonica) mutation affecting a representative of the third most common ‘unknown unknown’ family of proteins (YbeB/DUF143) was described back in 1924 as causing a characteristic pattern of white (plastid-less) and green (normal) stripes in the leaves of maize (116). Studies of the past two decades showed that iojap mutants lack functional plastid ribosomes; neither plastid-encoded proteins nor nuclear-encoded proteins that are associated with thylakoid membranes are detectable in their chloroplasts (117–119). Nevertheless, the actual function of the Iojap protein remains unknown; the suggestion that it might be involved in RNA editing has been recently investigated and proven incorrect (120). In bacteria, the ortholog of iojap, the ybeB gene, forms a conserved operon with another uncharacterized gene, ybeA, which encodes a predicted methyltransferase, and is often found next to nadD, a gene for an enzyme of NAD biosynthesis (nicotinate-mononucleotide adenylyltransferase). In two bacteria, NadD and YbeB seem to form a two-domain fusion protein. The significance of this association remains unknown. A recent work has identified the corresponding E.coli gene as one of the two genes whose mutations confer lethality on minCDE mutants, suggesting involvement of the Iojap protein in bacterial cell division (121).

    Inspection of the available evidence for the YjeF protein (UPF0029) shows that the distinction between ‘known unknowns’ and ‘unknown unknowns’ sometimes becomes murky. Specific prediction of the biochemical activity of this protein does not seem feasible, which is why we kept it among ‘unknown unknowns’ (Table 3). However, a combination of detailed structural and sequence analyses with examination of the domain fusion context strongly suggests that this protein is a Rossmann-fold enzyme involved in RNA metabolism (122,123).

    Conclusions and perspective

    Sequencing of multiple genomes from all walks of life and the concomitant development of computational approaches of comparative genomics create an opportunity for biology that was hardly imaginable even 10 years ago: a directed, systematic effort aimed at producing a complete catalog of biochemical activities, biological functions and the responsible genes, at least for simpler, prokaryotic life forms. A coordinated program to elucidate the functions of conserved hypothetical proteins has the potential to take us (as a community) a long way on the road to this lofty goal. It is worth emphasizing that the number of conserved hypotheticals that are widely represented among diverse life forms is not huge, a few thousand at most. However incomplete the current collection of genomes turns out to be, genes from new genomes increasingly fall within already established orthologous gene sets (124). Thus, although a truly comprehensive gene catalog might belong to the distant future, a concise dictionary of the ‘main’ functions and the corresponding genes is likely to be well within reach of the current generation of researchers, provided a reasonable degree of coordination of research projects is achieved.

    Certainly, science does not live by plans conceived by any group of researchers. Nevertheless, community research programs can be viable, as illustrated, most relevantly for our subject here, by the brief (so far) but notable history of the structural genomics initiative. Although the development of structural genomics might not have been as rapid as initially hoped, the overall success is undeniable (73,125–127). The number of protein structures solved within this initiative, including those of many conserved hypothetical proteins that might never have been tackled were it not for the structural genomics paradigm, is already in the hundreds, and the movement is just picking up speed. There seems to be no reason why systematic, genome-wide identification of gene functions could not proceed along the same lines, perhaps with the benefit of even better coordination between researchers.

    SUPPLEMENTARY MATERIAL

    Supplementary Material is available at NAR Online.

    NOTE ADDED IN PROOF

    While this article was in production, an in-depth discussion of the uncharacterized genes of the yeast Saccharomyces cerevisiae and strategies for completing the ‘encyclopedia of the yeast cell’ was published.

    ACKNOWLEDGEMENTS

    We thank Rich Roberts, Eugene Kolker, Andrei Osterman and other participants of the American Academy of Microbiology colloquium ‘An Experimental Approach to Genome Annotation’ (July 19–20, Washington, DC) for inspiring discussions and Daniel Rigden for helpful comments.

    REFERENCES

    1. Koonin,E.V. and Galperin,M.Y. (2002) Sequence-Evolution-Function. Computational Approaches in Comparative Genomics. Kluwer Academic Publishers, Boston, MA.

    2. Bernal,A., Ear,U. and Kyrpides,N. (2001) Genomes OnLine Database (GOLD): a monitor of genome projects world-wide. Nucleic Acids Res., 29, 126–127.

    3. Dunham,I. (2000) Genomics—the new rock and roll? Trends Genet., 16, 456–461.

    4. Bork,P. (2000) Powers and pitfalls in sequence analysis: the 70% hurdle. Genome Res., 10, 398–400.

    5. Kaneko,T., Nakamura,Y., Wolk,C.P., Kuritz,T., Sasamoto,S., Watanabe,A., Iriguchi,M., Ishikawa,A., Kawashima,K., Kimura,T. et al. (2001) Complete genomic sequence of the filamentous nitrogen-fixing cyanobacterium Anabaena sp. strain PCC 7120. DNA Res., 8, 205–213.

    6. Omura,S., Ikeda,H., Ishikawa,J., Hanamoto,A., Takahashi,C., Shinose,M., Takahashi,Y., Horikawa,H., Nakazawa,H., Osonoe,T. et al. (2001) Genome sequence of an industrial microorganism Streptomyces avermitilis: deducing the ability of producing secondary metabolites. Proc. Natl Acad. Sci. USA, 98, 12215–12220.

    7. Bentley,S.D., Chater,K.F., Cerdeno-Tarraga,A.M., Challis,G.L., Thomson,N.R., James,K.D., Harris,D.E., Quail,M.A., Kieser,H., Harper,D. et al. (2002) Complete genome sequence of the model actinomycete Streptomyces coelicolor A3(2). Nature, 417, 141–147.

    8. Glöckner,F.-O., Kube,M., Bauer,M., Teeling,H., Lombardot,T., Ludwig,W., Gade,D., Beck,A., Borzym,K., Heitmann,K. et al. (2003) Complete genome sequence of the marine planctomycete Pirellula sp. strain 1. Proc. Natl Acad. Sci. USA, 100, 8298–8303.

    9. Galperin,M.Y. (2001) Conserved ‘hypothetical’ proteins: new hints and new puzzles. Comp. Funct. Genomics, 2, 14–18.

    10. Roberts,R.J. (2004) Identifying protein function—a call for community action. PLoS Biol., 2, E42.

    11. Fischer,D. (1999) Rational structural genomics: affirmative action for ORFans and the growth in our structural knowledge. Protein Eng., 12, 1029–1030.

    12. Siew,N. and Fischer,D. (2003) Analysis of singleton ORFans in fully sequenced microbial genomes. Proteins, 53, 241–251.

    13. Siew,N., Azaria,Y. and Fischer,D. (2004) The ORFanage: an ORFan database. Nucleic Acids Res., 32, D281–D283.

    14. Daubin,V. and Ochman,H. (2004) Bacterial genomes as new gene homes: the genealogy of ORFans in E.coli. Genome Res., 14, 1036–1042.

    15. Alexandrov,A., Martzen,M.R. and Phizicky,E.M. (2002) Two proteins that form a complex are required for 7-methylguanosine modification of yeast tRNA. RNA, 8, 1253–1266.

    16. Jackman,J.E., Montange,R.K., Malik,H.S. and Phizicky,E.M. (2003) Identification of the yeast gene encoding the tRNA m1G methyltransferase responsible for modification at position 9. RNA, 9, 574–585.

    17. Soma,A., Ikeuchi,Y., Kanemasa,S., Kobayashi,K., Ogasawara,N., Ote,T., Kato,J., Watanabe,K., Sekine,Y. and Suzuki,T. (2003) An RNA-modifying enzyme that governs both the codon and amino acid specificities of isoleucine tRNA. Mol. Cell, 12, 689–698.

    18. Eisenreich,W., Rohdich,F. and Bacher,A. (2001) Deoxyxylulose phosphate pathway to terpenoids. Trends Plant Sci., 6, 78–84.

    19. Eisenreich,W., Bacher,A., Arigoni,D. and Rohdich,F. (2004) Biosynthesis of isoprenoids via the non-mevalonate pathway. Cell. Mol. Life Sci., 61, 1401–1426.

    20. Jenal,U. (2004) Cyclic di-guanosine-monophosphate comes of age: a novel secondary messenger involved in modulating cell surface structures in bacteria? Curr. Opin. Microbiol., 7, 185–191.

    21. Galperin,M.Y. (2004) Bacterial signal transduction network in a genomic perspective. Environ. Microbiol., 6, 552–567.

    22. Karp,P.D. (2004) Call for an enzyme genomics initiative. Genome Biol., 5, 401.

    23. Daugherty,M., Polanuyer,B., Farrell,M., Scholle,M., Lykidis,A., de Crecy-Lagard,V. and Osterman,A. (2002) Complete reconstitution of the human coenzyme A biosynthetic pathway via comparative genomics. J. Biol. Chem., 277, 21431–21439.

    24. Tatusov,R.L., Koonin,E.V. and Lipman,D.J. (1997) A genomic perspective on protein families. Science, 278, 631–637.

    25. Coin,L., Bateman,A. and Durbin,R. (2004) Enhanced protein domain discovery using taxonomy. BMC Bioinformatics, 5, 56.

    26. Hulo,N., Sigrist,C.J., Le Saux,V., Langendijk-Genevaux,P.S., Bordoli,L., Gattiker,A., De Castro,E., Bucher,P. and Bairoch,A. (2004) Recent improvements to the PROSITE database. Nucleic Acids Res., 32, D134–D137.

    27. Boeckmann,B., Bairoch,A., Apweiler,R., Blatter,M.C., Estreicher,A., Gasteiger,E., Martin,M.J., Michoud,K., O'Donovan,C., Phan,I. et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.

    28. Bateman,A., Coin,L., Durbin,R., Finn,R.D., Hollich,V., Griffiths-Jones,S., Khanna,A., Marshall,M., Moxon,S., Sonnhammer,E.L. et al. (2004) The Pfam protein families database. Nucleic Acids Res., 32, D138–D141.

    29. Tatusov,R.L., Fedorova,N.D., Jackson,J.D., Jacobs,A.R., Kiryutin,B., Koonin,E.V., Krylov,D.M., Mazumder,R., Mekhedov,S.L., Nikolskaya,A.N. et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4, 41.

    30. Arigoni,F., Talabot,F., Peitsch,M., Edgerton,M.D., Meldrum,E., Allet,E., Fish,R., Jamotte,T., Curchod,M.L. and Loferer,H. (1998) A genome-based approach for the identification of essential bacterial genes. Nat. Biotechnol., 16, 851–856.

    31. Jordan,I.K., Rogozin,I.B., Wolf,Y.I. and Koonin,E.V. (2002) Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res., 12, 962–968.

    32. Hwang,K.Y., Chung,J.H., Kim,S.H., Han,Y.S. and Cho,Y. (1999) Structure-based identification of a novel NTPase from Methanococcus jannaschii. Nature Struct. Biol., 6, 691–696.

    33. Noskov,V.N., Staak,K., Shcherbakova,P.V., Kozmin,S.G., Negishi,K., Ono,B.C., Hayatsu,H. and Pavlov,Y.I. (1996) HAM1, the gene controlling 6-N-hydroxylaminopurine sensitivity and mutagenesis in the yeast Saccharomyces cerevisiae. Yeast, 12, 17–29.

    34. Liakopoulou,A. and Alivisatos,S.G.A. (1964) Distribution of nucleoside triphosphatases in human erythrocytes. Biochim. Biophys. Acta, 89, 158–161.

    35. Chern,C.J., MacDonald,A.B. and Morris,A.J. (1969) Purification and properties of nucleoside triphosphate pyrophosphohydrolase from red cells of the rabbit. J. Biol. Chem., 244, 5489–5495.

    36. Lin,S., McLennan,A.G., Ying,K., Wang,Z., Gu,S., Jin,H., Wu,C., Liu,W., Yuan,Y., Tang,R. et al. (2001) Cloning, expression, and characterization of a human inosine triphosphate pyrophosphatase encoded by the ITPA gene. J. Biol. Chem., 276, 18695–18701.

    37. Marinaki,A.M., Ansari,A., Duley,J.A., Arenas,M., Sumi,S., Lewis,C.M., el Shobowale-Bakre,M., Escuredo,E., Fairbanks,L.D. and Sanderson,J.D. (2004) Adverse drug reactions to azathioprine therapy are associated with polymorphism in the gene encoding inosine triphosphate pyrophosphatase (ITPase). Pharmacogenetics, 14, 181–187.

    38. Apps,D.K. (1975) Pigeon-liver NAD kinase. The structural and kinetic basis of regulation of NADPH. Eur. J. Biochem., 55, 475–483.

    39. Tseng,Y.M., Harris,B.G. and Jacobson,M.K. (1979) Isolation and characterization of yeast nicotinamide adenine dinucleotide kinase. Biochim. Biophys. Acta, 568, 205–214.

    40. Kawai,S., Mori,S., Mukai,T., Hashimoto,W. and Murata,K. (2001) Molecular characterization of Escherichia coli NAD kinase. Eur. J. Biochem., 268, 4359–4365.

    41. Kawai,S., Suzuki,S., Mori,S. and Murata,K. (2001) Molecular cloning and identification of UTR1 of a yeast Saccharomyces cerevisiae as a gene encoding an NAD kinase. FEMS Microbiol. Lett., 200, 181–184.

    42. Lerner,F., Niere,M., Ludwig,A. and Ziegler,M. (2001) Structural and functional characterization of human NAD kinase. Biochem. Biophys. Res. Commun., 288, 69–74.

    43. Nakahigashi,K., Kubo,N., Narita,S., Shimaoka,T., Goto,S., Oshima,T., Mori,H., Maeda,M., Wada,C. and Inokuchi,H. (2002) HemK, a class of protein methyl transferase with similarity to DNA methyl transferases, methylates polypeptide chain release factors, and hemK knockout induces defects in translational termination. Proc. Natl Acad. Sci. USA, 99, 1473–1478.

    44. Heurgue-Hamard,V., Champ,S., Engstrom,A., Ehrenberg,M. and Buckingham,R.H. (2002) The hemK gene in Escherichia coli encodes the N5-glutamine methyltransferase that modifies peptide release factors. EMBO J., 21, 769–778.

    45. Clarke,S. (2002) The methylator meets the terminator. Proc. Natl Acad. Sci. USA, 99, 1104–1106.

    46. Loiseau,L., Ollagnier de Choudens,S., Nachin,L., Fontecave,M. and Barras,F. (2003) Biogenesis of Fe-S cluster by the bacterial Suf system: SufS and SufE form a new type of cysteine desulfurase. J. Biol. Chem., 278, 38352–38359.

    47. Outten,F.W., Wood,M.J., Munoz,F.M. and Storz,G. (2003) The SufE protein and the SufBCD complex enhance SufS cysteine desulfurase activity as part of a sulfur transfer pathway for Fe-S cluster assembly in Escherichia coli. J. Biol. Chem., 278, 45713–45719.

    48. Muhlenhoff,U. and Lill,R. (2000) Biogenesis of iron–sulfur proteins in eukaryotes: a novel task of mitochondria that is inherited from bacteria. Biochim. Biophys. Acta, 1459, 370–382.

    49. Rangachari,K., Davis,C.T., Eccleston,J.F., Hirst,E.M., Saldanha,J.W., Strath,M. and Wilson,R.J. (2002) SufC hydrolyzes ATP and interacts with SufB from Thermotoga maritima. FEBS Lett., 514, 225–228.

    50. Koonin,E.V. (2003) Comparative genomics, minimal gene-sets and the last universal common ancestor. Nature Rev. Microbiol., 1, 127–136.

    51. Koonin,E.V., Mushegian,A.R. and Bork,P. (1996) Non-orthologous gene displacement. Trends Genet., 12, 334–336.

    52. Galperin,M.Y. and Koonin,E.V. (2000) Who's your neighbor? New computational approaches for functional genomics. Nat. Biotechnol., 18, 609–613.

    53. Ibba,M., Bono,J.L., Rosa,P.A. and Soll,D. (1997) Archaeal-type lysyl-tRNA synthetase in the Lyme disease spirochete Borrelia burgdorferi. Proc. Natl Acad. Sci. USA, 94, 14383–14388.

    54. Myllykallio,H., Lipowski,G., Leduc,D., Filee,J., Forterre,P. and Liebl,U. (2002) An alternative flavin-dependent mechanism for thymidylate synthesis. Science, 297, 105–107.

    55. Daugherty,M., Vonstein,V., Overbeek,R. and Osterman,A. (2001) Archaeal shikimate kinase, a new member of the GHMP-kinase family. J. Bacteriol., 183, 292–300.

    56. Morett,E., Korbel,J.O., Rajan,E., Saab-Rincon,G., Olvera,L., Olvera,M., Schmidt,S., Snel,B. and Bork,P. (2003) Systematic discovery of analogous enzymes in thiamin biosynthesis. Nat. Biotechnol., 21, 790–795.

    57. Osterman,A. and Overbeek,R. (2003) Missing genes in metabolic pathways: a comparative genomics approach. Curr. Opin. Chem. Biol., 7, 238–251.

    58. Green,M.L. and Karp,P.D. (2004) A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics, 5, 76.

    59. Krieger,C.J., Zhang,P., Mueller,L.A., Wang,A., Paley,S., Arnaud,M., Pick,J., Rhee,S.Y. and Karp,P.D. (2004) MetaCyc: a multiorganism database of metabolic pathways and enzymes. Nucleic Acids Res., 32, D438–D442.

    60. Makarova,K.S., Wolf,Y.I. and Koonin,E.V. (2003) Potential genomic determinants of hyperthermophily. Trends Genet., 19, 172–176.

    61. Raymond,J., Zhaxybayeva,O., Gogarten,J.P., Gerdes,S.Y. and Blankenship,R.E. (2002) Whole-genome analysis of photosynthetic prokaryotes. Science, 298, 1616–1620.

    62. Akerley,B.J., Rubin,E.J., Novick,V.L., Amaya,K., Judson,N. and Mekalanos,J.J. (2002) A genome-scale analysis for identification of genes required for growth or survival of Haemophilus influenzae. Proc. Natl Acad. Sci. USA, 99, 966–971.

    63. Thanassi,J.A., Hartman-Neumann,S.L., Dougherty,T.J., Dougherty,B.A. and Pucci,M.J. (2002) Identification of 113 conserved essential genes using a high-throughput gene disruption system in Streptococcus pneumoniae. Nucleic Acids Res., 30, 3152–3162.

    64. Forsyth,R.A., Haselbeck,R.J., Ohlsen,K.L., Yamamoto,R.T., Xu,H., Trawick,J.D., Wall,D., Wang,L., Brown-Driver,V., Froelich,J.M. et al. (2002) A genome-wide strategy for the identification of essential genes in Staphylococcus aureus. Mol. Microbiol., 43, 1387–1400.

    65. Giaever,G., Chu,A.M., Ni,L., Connelly,C., Riles,L., Veronneau,S., Dow,S., Lucau-Danila,A., Anderson,K., Andre,B. et al. (2002) Functional profiling of the Saccharomyces cerevisiae genome. Nature, 418, 387–391.

    66. Kobayashi,K., Ehrlich,S.D., Albertini,A., Amati,G., Andersen,K.K., Arnaud,M., Asai,K., Ashikaga,S., Aymerich,S., Bessieres,P. et al. (2003) Essential Bacillus subtilis genes. Proc. Natl Acad. Sci. USA, 100, 4678–4683.

    67. Gerdes,S.Y., Scholle,M.D., Campbell,J.W., Balazsi,G., Ravasz,E., Daugherty,M.D., Somera,A.L., Kyrpides,N.C., Anderson,I., Gelfand,M.S. et al. (2003) Experimental determination and system-level analysis of essential genes in Escherichia coli MG1655. J. Bacteriol., 185, 5673–5684.

    68. Zhang,R., Ou,H.Y. and Zhang,C.T. (2004) DEG: a database of essential genes. Nucleic Acids Res., 32, D271–D272.

    69. Koonin,E.V. (2000) How many genes can make a cell: the minimal-gene-set concept. Annu. Rev. Genomics Hum. Genet., 1, 99–116.

    70. Harold,F.M. and Van Brunt,J. (1977) Circulation of H+ and K+ across the plasma membrane is not obligatory for bacterial growth. Science, 197, 372–373.

    71. Tong,A.H., Lesage,G., Bader,G.D., Ding,H., Xu,H., Xin,X., Young,J., Berriz,G.F., Brost,R.L., Chang,M. et al. (2004) Global mapping of the yeast genetic interaction network. Science, 303, 808–813.

    72. Vitkup,D., Melamud,E., Moult,J. and Sander,C. (2001) Completeness in structural genomics. Nature Struct. Biol., 8, 559–566.

    73. Frishman,D. (2003) What we have learned about prokaryotes from structural genomics. OMICS, 7, 211–224.

    74. Kolker,E., Makarova,K.S., Shabalina,S., Picone,A.F., Purvine,S., Holzman,T., Cherny,T., Armbruster,D., Munson,R.S.,Jr, Kolesov,G. et al. (2004) Identification and functional analysis of ‘hypothetical’ genes expressed in Haemophilus influenzae. Nucleic Acids Res., 32, 2353–2361.

    75. Zarembinski,T.I., Hung,L.W., Mueller-Dieckmann,H.J., Kim,K.K., Yokota,H., Kim,R. and Kim,S.H. (1998) Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics. Proc. Natl Acad. Sci. USA, 95, 15189–15193.

    76. Sousa,M.C. and McKay,D.B. (2001) Structure of the universal stress protein of Haemophilus influenzae. Structure, 9, 1135–1141.

    77. Makarova,K.S., Aravind,L., Galperin,M.Y., Grishin,N.V., Tatusov,R.L., Wolf,Y.I. and Koonin,E.V. (1999) Comparative genomics of the Archaea (Euryarchaeota): evolution of conserved protein families, the stable core, and the variable shell. Genome Res., 9, 608–628.

    78. Tao,H., Bausch,C., Richmond,C., Blattner,F.R. and Conway,T. (1999) Functional genomics: expression analysis of Escherichia coli growing on minimal and rich media. J. Bacteriol., 181, 6425–6440.

    79. Price,C.W., Fawcett,P., Ceremonie,H., Su,N., Murphy,C.K. and Youngman,P. (2001) Genome-wide analysis of the general stress response in Bacillus subtilis. Mol. Microbiol., 41, 757–774.

    80. Liu,Y., Zhou,J., Omelchenko,M.V., Beliaev,A.S., Venkateswaran,A., Stair,J., Wu,L., Thompson,D.K., Xu,D., Rogozin,I.B. et al. (2003) Transcriptome dynamics of Deinococcus radiodurans recovering from ionizing radiation. Proc. Natl Acad. Sci. USA, 100, 4191–4196.

    81. Kolker,E., Purvine,S., Galperin,M.Y., Stolyar,S., Goodlett,D.R., Nesvizhskii,A.I., Keller,A., Xie,T., Eng,J.K., Yi,E. et al. (2003) Initial proteome analysis of model microorganism Haemophilus influenzae strain Rd KW20. J. Bacteriol., 185, 4593–4602.

    82. Bork,P., Dandekar,T., Diaz-Lazcoz,Y., Eisenhaber,F., Huynen,M. and Yuan,Y. (1998) Predicting function: from genes to genomes and back. J. Mol. Biol., 283, 707–725.

    83. Li,W. and Godzik,A. (2002) Discovering new genes with advanced homology detection. Trends Biotechnol., 20, 315–316.

    84. Ouzounis,C.A. and Karp,P.D. (2002) The past, present and future of genome-wide re-annotation. Genome Biol., 3, COMMENT2001.

    85. Overbeek,R., Fonstein,M., D'Souza,M., Pusch,G.D. and Maltsev,N. (1999) The use of gene clusters to infer functional coupling. Proc. Natl Acad. Sci. USA, 96, 2896–2901.

    86. Dandekar,T., Snel,B., Huynen,M. and Bork,P. (1998) Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci., 23, 324–328.

    87. Pellegrini,M., Marcotte,E.M., Thompson,M.J., Eisenberg,D. and Yeates,T.O. (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl Acad. Sci. USA, 96, 4285–4288.

    88. Marcotte,E.M., Pellegrini,M., Thompson,M.J., Yeates,T.O. and Eisenberg,D. (1999) A combined algorithm for genome-wide prediction of protein function. Nature, 402, 83–86.

    89. Marcotte,E.M., Pellegrini,M., Ng,H.L., Rice,D.W., Yeates,T.O. and Eisenberg,D. (1999) Detecting protein function and protein–protein interactions from genome sequences. Science, 285, 751–753.

    90. Huynen,M., Snel,B., Lathe,W.,III and Bork,P. (2000) Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res., 10, 1204–1210.

    91. van Noort,V., Snel,B. and Huynen,M.A. (2003) Predicting gene function by conserved co-expression. Trends Genet., 19, 238–242.

    92. Rogozin,I.B., Makarova,K.S., Wolf,Y.I. and Koonin,E.V. (2004) Computational approaches for the analysis of gene neighbourhoods in prokaryotic genomes. Brief. Bioinformatics, 5, 131–149.

    93. von Mering,C., Huynen,M., Jaeggi,D., Schmidt,S., Bork,P. and Snel,B. (2003) STRING: a database of predicted functional associations between proteins. Nucleic Acids Res., 31, 258–261.

    94. Overbeek,R., Larsen,N., Walunas,T., D'Souza,M., Pusch,G., Selkov,E.,Jr, Liolios,K., Joukov,V., Kaznadzey,D., Anderson,I. et al. (2003) The ERGO genome analysis and discovery system. Nucleic Acids Res., 31, 164–171.

    95. Kolesov,G., Mewes,H.W. and Frishman,D. (2002) SNAPper: gene order predicts gene function. Bioinformatics, 18, 1017–1019.

    96. Salwinski,L., Miller,C.S., Smith,A.J., Pettit,F.K., Bowie,J.U. and Eisenberg,D. (2004) The Database of Interacting Proteins: 2004 update. Nucleic Acids Res., 32, D449–D451.

    97. Suhre,K. and Claverie,J.M. (2004) FusionDB: a database for in-depth analysis of prokaryotic gene fusion events. Nucleic Acids Res., 32, D273–D276.

    98. Enault,F., Suhre,K., Poirot,O., Abergel,C. and Claverie,J.M. (2004) Phydbac2: improved inference of gene function using interactive phylogenomic profiling and chromosomal location analysis. Nucleic Acids Res., 32, W336–W339.

    99. Rawlings,N.D., Tolle,D.P. and Barrett,A.J. (2004) MEROPS: the peptidase database. Nucleic Acids Res., 32, D160–D164.

    100. Abdullah,K.M., Udoh,E.A., Shewen,P.E. and Mellors,A. (1992) A neutral glycoprotease of Pasteurella haemolytica A1 specifically cleaves O-sialoglycoproteins. Infect. Immun., 60, 56–62.

    101. Aravind,L. and Koonin,E.V. (1999) Gleaning non-trivial structural, functional and evolutionary information about proteins by iterative database searches. J. Mol. Biol., 287, 1023–1040.

    102. Wolf,Y.I., Rogozin,I.B., Kondrashov,A.S. and Koonin,E.V. (2001) Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context. Genome Res., 11, 356–372.

    103. Leipe,D.D., Wolf,Y.I., Koonin,E.V. and Aravind,L. (2002) Classification and evolution of P-loop GTPases and related ATPases. J. Mol. Biol., 317, 41–72.

    104. Teplyakov,A., Obmolova,G., Chu,S.Y., Toedt,J., Eisenstein,E., Howard,A.J. and Gilliland,G.L. (2003) Crystal structure of the YchF protein reveals binding sites for GTP and nucleic acid. J. Bacteriol., 185, 4031–4037.

    105. Na,J.G., Pinto,I. and Hampsey,M. (1992) Isolation and characterization of SUA5, a novel gene required for normal growth in Saccharomyces cerevisiae. Genetics, 131, 791–801.

    106. Kaczanowska,M. and Ryden-Aulin,M. (2004) Temperature sensitivity caused by mutant release factor 1 is suppressed by mutations that affect 16S rRNA maturation. J. Bacteriol., 186, 3046–3055.

    107. Teplova,M., Tereshko,V., Sanishvili,R., Joachimiak,A., Bushueva,T., Anderson,W.F. and Egli,M. (2000) The structure of the yrdC gene product from Escherichia coli reveals a new fold and suggests a role in RNA binding. Protein Sci., 9, 2557–2566.

    108. Pace,H.C. and Brenner,C. (2001) The nitrilase superfamily: classification, structure and function. Genome Biol., 2, REVIEWS0001.

    109. Bieganowski,P., Pace,H.C. and Brenner,C. (2003) Eukaryotic NAD+ synthetase Qns1 contains an essential, obligate intramolecular thiol glutamine amidotransferase domain related to nitrilase. J. Biol. Chem., 278, 33049–33055.

    110. Shin,D.H., Yokota,H., Kim,R. and Kim,S.H. (2002) Crystal structure of conserved hypothetical protein Aq1575 from Aquifex aeolicus. Proc. Natl Acad. Sci. USA, 99, 7980–7985.

    111. Martens,J.A., Genereaux,J., Saleh,A. and Brandl,C.J. (1996) Transcriptional activation by yeast PDR1p is inhibited by its association with NGG1p/ADA3p. J. Biol. Chem., 271, 15884–15890.

    112. Tascou,S., Uedelhoven,J., Dixkens,C., Nayernia,K., Engel,W. and Burfeind,P. (2000) Isolation and characterization of a novel human gene, NIF3L1, and its mouse ortholog, Nif3l1, highly conserved from bacteria to mammals. Cytogenet. Cell Genet., 90, 330–336.

    113. Akiyama,H., Fujisawa,N., Tashiro,Y., Takanabe,N., Sugiyama,A. and Tashiro,F. (2003) The role of transcriptional corepressor Nif3l1 in early stage of neural differentiation via cooperation with Trip15/CSN2. J. Biol. Chem., 278, 10752–10762.

    114. Hadano,S., Yanagisawa,Y., Skaug,J., Fichter,K., Nasir,J., Martindale,D., Koop,B.F., Scherer,S.W., Nicholson,D.W., Rouleau,G.A. et al. (2001) Cloning and characterization of three novel genes, ALS2CR1, ALS2CR2, and ALS2CR3, in the juvenile amyotrophic lateral sclerosis (ALS2) critical region at chromosome 2q33-q34: candidate genes for ALS2. Genomics, 71, 200–213.

    115. Merla,G., Howald,C., Antonarakis,S.E. and Reymond,A. (2004) The subcellular localization of the ChoRE-binding protein, encoded by the Williams–Beuren syndrome critical region gene 14, is regulated by 14-3-3. Hum. Mol. Genet., 13, 1505–1514.

    116. Jenkins,M.T. (1924) Heritable characters of maize. XX. Iojap-striping, a chlorophyll defect. J. Hered., 15, 467–472.

    117. Walbot,V. and Coe,E.H.,Jr (1979) Nuclear gene iojap conditions a programmed change to ribosome-less plastids in Zea mays. Proc. Natl Acad. Sci. USA, 76, 2760–2764.

    118. Han,C.D., Coe,E.H.,Jr and Martienssen,R.A. (1992) Molecular cloning and characterization of iojap (ij), a pattern striping gene of maize. EMBO J., 11, 4037–4046.

    119. Han,C.D., Patrie,W., Polacco,M. and Coe,E.H.,Jr (1993) Aberrations in plastid transcripts and deficiency of plastid DNA in striped and albino mutants in maize. Planta, 191, 552–563.

    120. Halter,C.P., Peeters,N.M. and Hanson,M.R. (2004) RNA editing in ribosome-less plastids of iojap maize. Curr. Genet., 45, 331–337.

    121. Bernhardt,T.G. and de Boer,P.A. (2004) Screening for synthetic lethal mutants in Escherichia coli and identification of EnvC (YibP) as a periplasmic septal ring factor with murein hydrolase activity. Mol. Microbiol., 52, 1255–1269.

    122. Anantharaman,V. and Aravind,L. (2004) Novel conserved domains in proteins with predicted roles in eukaryotic cell-cycle regulation, decapping and RNA stability. BMC Genomics, 5, 45.

    123. Albrecht,M. and Lengauer,T. (2004) Novel Sm-like proteins with long C-terminal tails and associated methyltransferases. FEBS Lett., 569, 18–26.

    124. Tatusov,R.L., Natale,D.A., Garkavtsev,I.V., Tatusova,T.A., Shankavaram,U.T., Rao,B.S., Kiryutin,B., Galperin,M.Y., Fedorova,N.D. and Koonin,E.V. (2001) The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res., 29, 22–28.

    125. Kim,S.H., Shin,D.H., Choi,I.G., Schulze-Gahmen,U., Chen,S. and Kim,R. (2003) Structure-based functional inference in structural genomics. J. Struct. Funct. Genomics, 4, 129–135.

    126. Harrison,S.C. (2004) Whither structural biology? Nature Struct. Mol. Biol., 11, 12–15.

    127. Stevens,R.C. (2004) Long live structural biology. Nature Struct. Mol. Biol., 11, 293–295.

    128. Mishra,P., Park,P.K. and Drueckhammer,D.G. (2001) Identification of yacE (coaE) as the structural gene for dephosphocoenzyme A kinase in Escherichia coli K-12. J. Bacteriol., 183, 2774–2778.

    129. De Bie,L.G., Roovers,M., Oudjama,Y., Wattiez,R., Tricot,C., Stalon,V., Droogmans,L. and Bujnicki,J.M. (2003) The yggH gene of Escherichia coli encodes a tRNA (m7G46) methyltransferase. J. Bacteriol., 185, 3238–3243.

    130. Takahashi,Y. and Nakamura,M. (1999) Functional assignment of the ORF2-iscS-iscU-iscA-hscB-hscA-fdx-ORF3 gene cluster involved in the assembly of Fe-S clusters in Escherichia coli. J. Biochem., 126, 917–926.

    131. Jensen,L.T. and Culotta,V.C. (2000) Role of Saccharomyces cerevisiae ISA1 and ISA2 in iron homeostasis. Mol. Cell Biol., 20, 3918–3927.

    132. Kaya,Y. and Ofengand,J. (2003) A novel unanticipated type of pseudouridine synthase with homologs in bacteria, archaea, and eukarya. RNA, 9, 711–721.

    133. Zalacain,M., Biswas,S., Ingraham,K.A., Ambrad,J., Bryant,A., Chalker,A.F., Iordanescu,S., Fan,J., Fan,F., Lunsford,R.D. et al. (2003) A global approach to identify novel broad-spectrum antibacterial targets among proteins of unknown function. J. Mol. Microbiol. Biotechnol., 6, 109–126.

    134. Hutchison,C.A., Peterson,S.N., Gill,S.R., Cline,R.T., White,O., Fraser,C.M., Smith,H.O. and Venter,J.C. (1999) Global transposon mutagenesis and a minimal Mycoplasma genome. Science, 286, 2165–2169.

    135. Mittenhuber,G. (2001) Comparative genomics of prokaryotic GTP-binding proteins (the Era, Obg, EngA, ThdF (TrmE), YchF and YihA families) and their relationship to eukaryotic GTP-binding proteins (the DRG, ARF, RAB, RAN, RAS and RHO families). J. Mol. Microbiol. Biotechnol., 3, 21–35.

    136. Pekarsky,Y., Campiglio,M., Siprashvili,Z., Druck,T., Sedkov,Y., Tillib,S., Draganescu,A., Wermuth,P., Rothman,J.H., Huebner,K. et al. (1998) Nitrilase and Fhit homologs are encoded as fusion proteins in Drosophila melanogaster and Caenorhabditis elegans. Proc. Natl Acad. Sci. USA, 95, 8744–8749.

    137. Dassain,M., Leroy,A., Colosetti,L., Carole,S. and Bouche,J.P. (1999) A new essential gene of the ‘minimal genome’ affecting cell division. Biochimie, 81, 889–895.

    138. Lehoux,I.E., Mazzulla,M.J., Baker,A. and Petit,C.M. (2003) Purification and characterization of YihA, an essential GTP-binding protein from Escherichia coli. Protein Expr. Purif., 30, 203–209.

    139. Aravind,L., Galperin,M.Y. and Koonin,E.V. (1998) The catalytic domain of the P-type ATPase has the haloacid dehalogenase fold. Trends Biochem. Sci., 23, 127–129.

    140. Kim,Y., Yakunin,A.F., Kuznetsova,E., Xu,X., Pennycooke,M., Gu,J., Cheung,F., Proudfoot,M., Arrowsmith,C.H., Joachimiak,A. et al. (2004) Structure- and function-based characterization of a new phosphoglycolate phosphatase from Thermoplasma acidophilum. J. Biol. Chem., 279, 517–526.

    141. Haft,C.R., de la Luz Sierra,M., Bafford,R., Lesniak,M.A., Barr,V.A. and Taylor,S.I. (2000) Human orthologs of yeast vacuolar protein sorting proteins Vps26, 29, and 35: assembly into multimeric complexes. Mol. Biol. Cell, 11, 4105–4116.

    142. Verges,M., Luton,F., Gruber,C., Tiemann,F., Reinders,L.G., Huang,L., Burlingame,A.L., Haft,C.R. and Mostov,K.E. (2004) The mammalian retromer regulates transcytosis of the polymeric immunoglobulin receptor. Nature Cell Biol., 6, 763–769.

    143. Chen,S., Yakunin,A.F., Kuznetsova,E., Busso,D., Pufan,R., Proudfoot,M., Kim,R. and Kim,S.H. (2004) Structural and functional characterization of a novel phosphodiesterase from Methanococcus jannaschii. J. Biol. Chem., 279, 31854–31862.

    144. Bassler,J., Grandi,P., Gadal,O., Lessmann,T., Petfalski,E., Tollervey,D., Lechner,J. and Hurt,E. (2001) Identification of a 60S preribosomal particle that is closely linked to nuclear export. Mol. Cell, 8, 517–529.

    145. Liu,S.J., Cai,Z.W., Liu,Y.J., Dong,M.Y., Sun,L.Q., Hu,G.F., Wei,Y.Y. and Lao,W.D. (2004) Role of nucleostemin in growth regulation of gastric cancer, liver cancer and other malignancies. World J. Gastroenterol., 10, 1246–1249.

    146. Abrahams,B.S., Mak,G.M., Berry,M.L., Palmquist,D.L., Saionz,J.R., Tay,A., Tan,Y.H., Brenner,S., Simpson,E.M. and Venkatesh,B. (2002) Novel vertebrate genes and putative regulatory elements identified at kidney disease and NR2E1/fierce loci. Genomics, 80, 45–53.

    147. Ritter,M., Buechler,C., Boettcher,A., Barlage,S., Schmitz-Madry,A., Orso,E., Bared,S.M., Schmiedeknecht,G., Baehr,C.H., Fricker,G. et al. (2002) Cloning and characterization of a novel apolipoprotein A-I binding protein, AI-BP, secreted by cells of the kidney proximal tubules in response to HDL or ApoA-I. Genomics, 79, 693–702.

    148. Okamura,K., Hagiwara-Takeuchi,Y., Li,T., Vu,T.H., Hirai,M., Hattori,M., Sakaki,Y., Hoffman,A.R. and Ito,T. (2000) Comparative genome analysis of the mouse imprinted gene impact and its nonimprinted human homolog IMPACT: toward the structural basis for species-specific imprinting. Genome Res., 10, 1878–1889.

    149. Genschik,P., Drabikowski,K. and Filipowicz,W. (1998) Characterization of the Escherichia coli RNA 3'-terminal phosphate cyclase and its sigma54-regulated operon. J. Biol. Chem., 273, 25516–25526.

    150. Henne,A., Daniel,R., Schmitz,R.A. and Gottschalk,G. (1999) Construction of environmental DNA libraries in Escherichia coli and screening for the presence of genes conferring utilization of 4-hydroxybutyrate. Appl. Environ. Microbiol., 65, 3901–3907.

    151. Enyenihi,A.H. and Saunders,W.S. (2003) Large-scale functional genomic analysis of sporulation and meiosis in Saccharomyces cerevisiae. Genetics, 163, 47–54.

    152. Vitelli,F., Piccini,M., Caroli,F., Franco,B., Malandrini,A., Pober,B., Jonsson,J., Sorrentino,V. and Renieri,A. (1999) Identification and characterization of a highly conserved protein absent in the Alport syndrome (A), mental retardation (M), midface hypoplasia (M), and elliptocytosis (E) contiguous gene deletion syndrome (AMME). Genomics, 55, 335–340.

    153. Kryukov,G.V. and Gladyshev,V.N. (2004) The prokaryotic selenoproteome. EMBO Rep., 5, 538–543.

    (Michael Y. Galperin and Eugene V. Koonin)