Nucleic Acids Research, 2004, Issue 10
PeerGAD: a peer-review-based and community-centric web application for
     1 Boyce Thompson Institute for Plant Research, Ithaca, NY 14853, USA and 2 Department of Plant Pathology, Cornell University, Ithaca, NY 14853, USA

    *To whom correspondence should be addressed. Tel: +1 607 254 8795; Fax: +1 607 255 6695; Email: mdd9@cornell.edu

    ABSTRACT

    PeerGAD is a web-based database-driven application that allows community-wide peer-reviewed annotation of prokaryotic genome sequences. The application was developed to support the annotation of the Pseudomonas syringae pv. tomato strain DC3000 genome sequence and is easily portable to other genome sequence annotation projects. PeerGAD incorporates several innovative design and operation features and accepts annotations pertaining to gene naming, role classification, gene translation and annotation derivation. The annotator tool in PeerGAD is built around a genome browser that offers users the ability to search and navigate the genome sequence. Because the application encourages annotation of the genome sequence directly by researchers and relies on peer review, it circumvents the need for an annotation curator while providing added value to the annotation data. Support for the Gene Ontology™ vocabulary, a structured and controlled vocabulary used in classification of gene roles, is emphasized throughout the system. Here we present the underlying concepts integral to the functionality of PeerGAD.

    INTRODUCTION

    Once a genome has been sequenced, the process of defining and annotating open reading frames (ORFs) and ascribing possible roles to the encoded proteins begins. The early stages of this effort are typically facilitated by the use of computational methods that draw inferences from previously sequenced and annotated genomes (1–3). Such methods rapidly identify stretches of sequence likely to code for individual genes and assign their putative biological roles (1,3). Human input is required to validate computational observations and to add knowledge where gaps in current biological understanding exist (4).

    A variety of methods are used to gather the observations of research scientists as part of the annotation process (2,5). These methods vary in scope and scale, but the objective of applying current knowledge to un-annotated or partially annotated genomes is common to each. Historically, these methods have often incorporated the efforts of individuals working in close proximity as a group of annotators (6,7), but this paradigm is slowly changing.

    The concept of open annotation spawned from the Human Genome Project is an effort to decentralize the annotation process (2). Given the large size of the human genome sequence, it was necessary to divide efforts among multiple groups, much like a multiprocessor might share computational tasks in an effort to complete a computation in less time with the potential of increased efficiency. Decentralizing the annotation process might result in increased efficiency; however, it might also lead to inconsistencies in annotation because of the involvement of a large number of scientists with diverse backgrounds and various levels of genome annotation experience, thus decreasing efficiency and increasing the time required to perform genome annotation. Therefore important goals of decentralizing the annotation process are to ensure that annotations are properly reviewed, consistency is enforced and communication among annotators is facilitated.

    For decades, publishers of scientific literature have looked to the peer review process as a means of maintaining the quality of their journals. It has been suggested that the incorporation of similar elements into the annotation process is appropriate (2). Doing so distributes annotation responsibility over multiple individuals, pairs expertise with corresponding review responsibilities and creates conditions where an annotation may undergo multiple submit-reject iterations until accepted. Such forces applied to the annotation process could result in a more efficiently derived and accurate annotation dataset. However, despite these likely advantages, no established model of a community-based peer review annotation system currently exists.

    With the proliferation of genomics-based research, it is becoming increasingly evident that organisms often share subsets of genes coding for essential biological functions. Developing an annotation approach that recognizes this observation is an objective of the Gene Ontology Consortium (8). Working toward this objective, efforts to develop Gene Ontology™ (GO) annotation methodologies started in 1998 and continue to evolve as additional verified biological data are incorporated into the vocabulary. In addition to providing an established vocabulary by which to assign an annotation, GO allows the association of confidence through pre-established GO evidence codes and data source statements and also facilitates interoperability between annotation databases (8,9).

    Here, we describe a web-based annotation application, PeerGAD, which incorporates peer review, promotes consistent syntax usage within an annotation through GO terminology and facilitates communication among annotators. Few currently available public annotation databases have attempted to address these issues in such an integrated manner (10). In addition, PeerGAD provides user-friendly interface designs to further encourage community participation.

    MATERIALS AND METHODS

    The development of PeerGAD addresses needs presented by the Pseudomonas syringae community; however, the design and concepts implemented are transferable to other genomes. PeerGAD is available upon request and includes sequence and annotation data used during development along with additional installation instructions. The following offers a brief overview of the requirements and considerations for implementing PeerGAD successfully.

    PeerGAD runs on the Microsoft ASP.NET platform and utilizes Microsoft SQL Server 2000 for database storage. The development and example application is run on a dual-processor 2.4 GHz P4 Dell server with 2 GB RAM per processor. The application is written primarily in C# (developed using Visual Studio .NET) and ASP.NET. SQL queries are written as stored procedures using Transact-SQL in conjunction with Microsoft SQL Server 2000. Deployment requires that a current version of ASP.NET (Framework v1.0 or greater), the Internet Information Services (IIS) web server and Microsoft SQL Server be installed on the Windows server. The Microsoft libraries MSXML.dll and DTS.dll are required for XML support and SQL Server backup scripts, respectively. The third-party libraries obout_ASPTreeView_Pro_Net.dll (http://www.obout.com) and DundasWebChart.dll (http://www.dundas.com) are required for ontology tree and charting functionality, respectively. The blastall.exe executable is required for BLAST search support (ftp://ftp.ncbi.nih.gov/blast).

    Slight cosmetic modifications will need to be made to customize the interface for use with other genome projects. Simple cosmetic modifications would be best performed using Web Matrix, a free Microsoft ASP.NET editor that is downloadable from the Official Microsoft ASP.NET Site (http://www.asp.net). Changes to the code and database structure are not necessary to achieve full functionality of the system; however, it may be necessary to create or drop sequence tables to allow the schema to correspond appropriately to the number of sequence molecules (chromosome, plasmids, etc) associated with a particular organism.

    The schema for PeerGAD (Fig. 1) maintains integrity of annotation and associated data stored within the database. The schema is developed with extensibility in mind and allows additional annotation types (Table 1) to be easily added. Database connectivity is achieved through an implementation of the ADO.NET model. All SQL queries used by PeerGAD are written in transact-SQL as stored procedures and are executed from the C# programming environment through a dynamic library interface layer. The library is dynamically built by an add-in developed for the purpose of this project. This add-in, SQLSP.dll, autogenerates client side objects with SQL stored procedure awareness and compiles them into the library interface ISQLSP.dll. Modification to this library is not necessary unless changes are made to INPUT/OUTPUT variables associated with the stored procedures.

    Figure 1. Generalized PeerGAD Schema. Represented in the schema are major tables and their associations (not all tables utilized by PeerGAD are shown). Functionality is grouped into four categories: Community, Genome, Annotation and Gene Ontology. Community tables store annotator-associated data such as username, address, preferences and research interests. Genome tables store sequence data; for example, a single table is allocated for each of the three P.syringae pv. tomato DC3000 sequences (two plasmids, one chromosome). Annotation tables store and maintain relationships among the annotation data, the user responsible for submitting the annotation data and annotation state. AnnotationRecords maintains a record for each annotation submitted to the system and maintains relationships among annotation type tables FeatureIdentity, Notes, Coordinates and Gene Ontology. Gene Ontology tables maintain relationships between GO terms necessary for generating GO tree structures and node definitions.

    Table 1. Annotation types

    The Gene Ontology vocabulary, developed by the GO Consortium (8), forms the basis for the assignment of an Ontology annotation. The database schema contains three tables (graph_path, term2term and term) that maintain GO term relationships. A backend component performs automated data retrieval, file decompression and table updates to maintain concurrency between these three tables and monthly GO releases. This component is configurable via the Update GO Terms page linked from the Annotator Resource page and is accessible only by users with administrator privileges.
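
    As a rough illustration of how the three GO tables relate, the sketch below derives graph_path-style ancestor rows from term2term-style parent/child edges. The toy term identifiers, the tuple layout and the function name build_graph_path are assumptions made for this example; they do not reproduce the PeerGAD or GO Consortium schema.

```python
from collections import defaultdict

# Hypothetical toy terms; real GO accessions and names are deliberately not used.
term = {"T1": "root", "T2": "binding-like", "T3": "catalytic-like", "T4": "kinase-like"}

# term2term-style edges as (parent, child). T4 has two parents,
# so the structure is a DAG rather than a simple tree.
term2term = [("T1", "T2"), ("T1", "T3"), ("T2", "T4"), ("T3", "T4")]


def build_graph_path(edges):
    """Return {(ancestor, descendant): shortest distance} for every reachable pair."""
    parents = defaultdict(set)
    nodes = set()
    for parent, child in edges:
        parents[child].add(parent)
        nodes.update((parent, child))

    paths = {}
    for node in nodes:
        paths[(node, node)] = 0  # every term reaches itself at distance 0
        frontier, distance = {node}, 0
        while frontier:
            # Walk one level up the DAG per iteration (breadth-first).
            distance += 1
            frontier = {p for n in frontier for p in parents[n]}
            for ancestor in frontier:
                # setdefault keeps the first (shortest) distance found.
                paths.setdefault((ancestor, node), distance)
    return paths


graph_path = build_graph_path(term2term)
ancestors_of_t4 = sorted(a for (a, d) in graph_path if d == "T4" and a != "T4")
```

    Precomputing the closure in this way lets a tree component fetch all ancestors or descendants of a term with a single lookup instead of walking the edges on every request.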

    The schema is designed such that a single sequence table exists for each sequence molecule. For Pseudomonas syringae DC3000, where three sequence molecules exist, a genomic sequence (Entrez Locus: AE016853) and two circular plasmids (Entrez Loci: AE016854, AE016855), the corresponding schema contains three sequence tables. Sequence tables maintain an association between the nucleic acid bases within a sequence molecule and corresponding coordinate values. A single coordinate value is assigned for each base in a sequence molecule. A coordinate is defined as an integer i, where 0 < i ≤ l and l is the length of the sequence molecule. A coordinate pair or range (i1, i2) defines where a feature begins and ends within the context of the specified sequence molecule, where i1 is termed the start coordinate and i2 is termed the stop coordinate. The coordinate system is unidirectional such that i1 < i2 and requires that the forward or reverse strand be specified by supplying the strand value s, where s is a character and s = ‘+’ or s = ‘–’. Under this coordinate system, feature length is defined by (i2 – i1) + 1 and is independent of s. Range and s values are stored within the coordinates table. For previously un-annotated genomes, a program such as Glimmer (11) is used to identify ORFs and output coordinate pairs that must be formatted to the above coordinate pair specifications. For previously annotated genomes present in GenBank, the Perl script GBXML_Parser.pl may be used to parse GenBank XML files; output derived from this script is used as a basis for populating the coordinates table.
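
    The coordinate rules above can be sketched in a few lines of Python. The class name Feature, its field names and the use of an ASCII '-' for the reverse strand are assumptions made for this example and are not taken from the PeerGAD code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """A feature located by a coordinate pair (i1, i2) and a strand value s."""
    start: int   # i1, 1-based and inclusive
    stop: int    # i2, 1-based and inclusive
    strand: str  # '+' (forward) or '-' (reverse)

    def validate(self, molecule_length: int) -> None:
        """Raise ValueError if the feature violates the coordinate rules."""
        if self.strand not in ("+", "-"):
            raise ValueError("strand must be '+' or '-'")
        # Unidirectional system: 0 < i1 < i2 <= l, whatever the strand.
        if not (0 < self.start < self.stop <= molecule_length):
            raise ValueError("coordinates must satisfy 0 < i1 < i2 <= l")

    @property
    def length(self) -> int:
        """Feature length (i2 - i1) + 1, independent of the strand."""
        return (self.stop - self.start) + 1
```

    For instance, a reverse-strand feature stored as (10, 18, '-') has length 9, the same as a forward-strand feature with the same pair.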

    It is necessary to import a formatted sequence file into a corresponding sequence table. Only the nucleic acid sequence is stored within the sequence tables; the amino acid sequence is translated in real time by the system to conserve storage resources and to allow real time translation during the assignment of a Coordinate annotation. Thus it is only necessary to populate the database with nucleic acid sequence. A FASTA sequence file is formatted to a flat file using the Perl script SeqToFlat.pl and Data Transformation Services (DTS) are used to import each flat file into respective sequence tables. The stored procedure sp_Admin_PopulateGenomeData.sql populates initial annotation data into the appropriate tables. Coordinates must be supplied for all known ORFs to allow their annotation. Other annotation data associated with Ontology, Identity, Note and Gene Name annotations are not required.
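
    Translating on demand from stored nucleotides might look roughly like the following sketch. The function names are hypothetical, the standard codon assignments are used, and alternative bacterial start codons (e.g. GTG or TTG read as methionine at initiation) are not handled; PeerGAD itself performs this step in C#.

```python
from itertools import product

# Standard codon assignments, listed in TCAG order with the third base varying fastest.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def feature_dna(sequence, start, stop, strand):
    """Return the coding-strand DNA for 1-based inclusive coordinates (start, stop)."""
    fragment = sequence[start - 1:stop].upper()
    if strand == "-":
        # Reverse-strand features are read as the reverse complement.
        fragment = fragment.translate(COMPLEMENT)[::-1]
    return fragment


def translate(dna):
    """Translate DNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' marks ambiguous codons
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

    Because the protein is derived from the coordinates each time, changing a Coordinate annotation immediately changes the translation shown to the annotator, with no second copy of the data to keep in sync.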

    Two stored procedures are supplied to facilitate modifications of the base genome sequence: sp_Sequence_InsertFragment and sp_Sequence_DeleteFragment. Although the exact parameters required vary between the two procedures, parameters such as sequenceID (string), baseNum (int) and fragment (string) may be specified to insert or remove a sequence fragment from a given sequence table. These procedures automatically update feature coordinates downstream of the insertion or deletion. These procedures will ultimately be associated with the peer review process and be assigned web-based user interfaces, but at the moment they are accessible only by a system administrator.
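
    The downstream coordinate update can be illustrated with a small in-memory sketch. The semantics chosen here (insertion immediately after baseNum, a spanning feature keeping its start and extending its stop, and a deletion that overlaps a feature being refused) are assumptions for illustration; the behaviour of the actual stored procedures is not documented in this paper.

```python
def insert_fragment(sequence, features, base_num, fragment):
    """Insert `fragment` after 1-based position `base_num` (0 prepends).

    `features` maps a feature ID to (start, stop); strand is unaffected.
    Returns the new sequence and a new dict with updated coordinates.
    """
    if not 0 <= base_num <= len(sequence):
        raise ValueError("base_num outside the sequence")
    shift = len(fragment)
    updated = {}
    for feature_id, (start, stop) in features.items():
        if start > base_num:      # entirely downstream: shift both ends
            start, stop = start + shift, stop + shift
        elif stop > base_num:     # spans the insertion point: only the stop moves
            stop += shift
        updated[feature_id] = (start, stop)
    return sequence[:base_num] + fragment + sequence[base_num:], updated


def delete_fragment(sequence, features, base_num, length):
    """Delete `length` bases starting at 1-based position `base_num`."""
    last = base_num + length - 1
    if base_num < 1 or length < 1 or last > len(sequence):
        raise ValueError("deleted range outside the sequence")
    updated = {}
    for feature_id, (start, stop) in features.items():
        if start > last:          # entirely downstream: shift both ends
            start, stop = start - length, stop - length
        elif stop >= base_num:    # overlaps the deleted range: leave to a human
            raise ValueError(f"{feature_id} overlaps the deleted range")
        updated[feature_id] = (start, stop)
    return sequence[:base_num - 1] + sequence[last:], updated
```

    Deleting the fragment that was just inserted restores both the sequence and the original coordinates, which is a convenient sanity check for any implementation of the two procedures.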

    RESULTS

    PeerGAD is a database-driven application run over the internet and is available to users via a web browser. The application front-end consists of numerous web pages, each serving a variety of tasks and giving the application robust functionality (see Fig. 2 for an overview of the resources offered by PeerGAD). PeerGAD was developed as a part of a project to sequence and annotate the genome of the plant pathogen P.syringae pv. tomato DC3000; an example of the application of PeerGAD to this project is accessible at the project annotation site (http://genome.pseudomonas-syringae.org). Further details related to the sequencing and initial annotation efforts associated with this project are published elsewhere (12).

    Figure 2. PeerGAD site overview. Page content and functionality currently offered by the PeerGAD web application are shown. Each block represents a single page. Stacked blocks (e.g. ‘Annotation Entry’ and ‘Annotation Review’) represent functionality that is presented to the user in a series of pages rather than a single page. Lines connecting blocks represent physical links between pages and arrows clarify the application flow. The color of each block indicates the level of privileges needed to access each page. A user with annotator privileges has all the rights of a guest and an administrator has all the rights granted to a user with annotator privileges.

    User privileges

    PeerGAD defines three user groups: annotator, guest and administrator. Users who wish to participate in the annotation process must register with the system and are given annotator access. Users who wish to gain access to the site but do not wish to register are given guest access. This level of access is assigned transparently when a user visits the site and does not log in to the system. Guests may browse the site and utilize various site resources, but may not participate in the annotation process. The majority of the site is publicly accessible and does not require registration. Lastly, administrator privileges are granted to the users who maintain the site. Administrators maintain all annotator rights, but may also access various resources that facilitate site maintenance. Figure 2 offers an overview of these privileges and how each one relates to the various pages that make up the PeerGAD web application.

    Application hub

    The main entry page, or Welcome page, is a hub to the entire application. It provides the site description, project objectives, links to other pages, annotation statistics and a log on prompt for registered users. The site description is intended for new users unfamiliar with the site. An overview of the annotation process is supplied along with descriptions of the registration process and the types of contributions possible by annotators upon registration. Primary help files and links to a library of help files and ‘HOW TO’ guides are prominently displayed to make the site user-friendly and to entice new users to learn more about the system. Statistics displaying top annotators and most recently annotated genes are prominently displayed.

    Genome sequence navigation

    Linked from the main entry page is the Genome Browser page. Contained within this page is the Genome Browser (Fig. 3). The browser allows users to navigate the genome graphically and is the main graphical user interface for visually accessing genome data. While the main entry page is considered the primary hub to the application, the Genome Browser can be considered the primary hub for accessing genome data, viewing current annotations and contributing to the annotation process.

    Figure 3. Genome Browser. The browser allows users to navigate the genome visually, view ORF context and access genome sequence. Navigation buttons allow users to jump, scroll, zoom and move to specific coordinates in the genome. The ORFs, represented by the colored rectangles flanking the genome coordinates, are clickable features. ORF colors are determined by their pre-assigned biological roles. For genomes containing multiple plasmids or chromosomes, a sequence selector facilitates toggling between sequences. Sequence data can be viewed visually or flagged for download. Flagged features may be downloaded as FASTA formatted files containing amino acid and nucleic acid sequences. The browser also features text-based search functionality. The gene identities shown in the scrollable list above the Text Search field represent results of such a text-based search for the term ‘kinase’. Highlighting a gene identity from the list and clicking the ‘Select’ button makes the gene the currently selected feature.

    The browser and database structure support the display and annotation of a variety of feature types (ORF, tRNA etc.) but are, for the purpose of the P.syringae project, configured to display only ORFs. Each feature has a unique PeerGAD identifier (ID) associated with it. Passing the mouse over a feature reveals this ID (e.g. ORF00231) and the feature’s Identity (e.g. ‘heavy metal sensor histidine kinase’) via a popup text display. Clicking a feature makes it the currently selected feature and updates annotation displays correspondingly. Annotation displays are present for each annotation type (Table 1). Thus the user may scroll through the page to reveal currently assigned Coordinates, Identity, Ontology, Notes and Gene Name annotations (Fig. 4A, B and C). Figure 4B represents the ontology annotation display visible within the Genome Browser page. The displays for both the Ontology and Note (Fig. 4C) annotation, which can contain large amounts of data, minimize the vertical space needed to present their content via an internal scroll bar. The Genome Browser page displays the DNA sequence and translated protein sequence associated with the currently selected feature. The user can also display the promoter region of the selected feature and specify the number of upstream bases to display via a pulldown menu. Useful properties such as DNA length, protein length, GC% and coding direction are also displayed. A unique locus and NCBI identifier (GI) are also displayed and the GI is outlinked to the NCBI Entrez database (http://www.ncbi.nlm.nih.gov/entrez/).
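
    Two of the displayed properties, GC% and the upstream (promoter) region, are simple functions of the stored sequence and coordinates. The sketch below shows one way they could be computed; the function names are hypothetical, the molecule is treated as linear (wrap-around on circular molecules is ignored) and nothing here is taken from the PeerGAD source.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_percent(dna):
    """Percentage of G and C bases in `dna` (0.0 for an empty string)."""
    dna = dna.upper()
    if not dna:
        return 0.0
    return 100.0 * (dna.count("G") + dna.count("C")) / len(dna)


def upstream_region(sequence, start, stop, strand, n_bases):
    """Return up to `n_bases` immediately upstream of a feature.

    Coordinates are 1-based and inclusive with start < stop. The result is
    read 5'->3' on the feature's own strand.
    """
    if strand == "+":
        return sequence[max(0, start - 1 - n_bases):start - 1]
    # For a reverse-strand feature, upstream lies after `stop` on the forward strand.
    flank = sequence[stop:stop + n_bases]
    return flank.translate(COMPLEMENT)[::-1]
```

    Near the ends of a linear sequence the function simply returns fewer bases than requested rather than raising an error.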

    Figure 4. (A) Feature Profile. Located on the Genome Browser page, the Feature Profile displays static and dynamic data relevant to the currently selected feature. Values for GC%, protein length and DNA length are dynamically calculated from sequence and/or coordinate values. ‘Coordinate’ and ‘Gene Name’ values represent Current annotation data. The Annotation Profile (Fig. 6) is accessible via the ‘’ button. A multi-state annotation button (e.g. ‘’ or ‘’), visible to annotators but not to guests, is a portal to the system’s annotation functionality. The ‘Putative ID’ or Identity is displayed just below the Feature Profile. (B) GO Annotation Display. Visible from the Genome Browser page, the GO Annotation Display is the primary means by which GO data is presented to users. The display presents GO terms in a hierarchical manner as part of a DAG. The display interface features collapsible nodes and a scroll bar, features useful for complex trees. Each node contains a GO ID and a GO Term. ORF03844 represents the currently selected ORF. The ‘’ link directs users to a more detailed GO annotation interface containing detailed GO evidence data and a history of all previously submitted Ontology annotations. (C) Notes Annotation. Just below the GO Annotation Display is the Notes Annotation Browser. This browser allows users to quickly page through Note annotations submitted by annotators. The browser is collapsible and may be partially hidden from view via the ‘’ button.

    Search functionality is available from multiple pages within the site and can be broadly classified as text-based or sequence-based. The Genome Browser supports text-based searches allowing users to locate ORFs based on a query term that may match against Gene Name or Identity values. Sequence-based searches are performed from a ‘Sequence Search’ page separate from the Genome Browser. Sequence searches, which are transparent to users, are performed using the BLAST algorithm (13). Such searches allow users to quickly locate and navigate to matching regions of the genome. Results generated by blastall.exe are streamed to an XML file and parsed into a table that displays coordinates, sequence name and BLAST scores to the user. Results are clickable and allow users to link back to the matching sequence in the context of the Genome Browser.

    Annotation process

    The annotation process is built around concepts derived from a peer review model (Fig. 5). The model integrates three central principles: (i) no single individual should be allowed to reject a Pending annotation, (ii) one individual is sufficient to accept a Pending annotation and (iii) an annotator should be given the option to submit an annotation that does not undergo the peer review process, but should also be discouraged from doing so. These concepts form the foundation for the underlying application architecture and shape over a dozen pages that guide annotators through annotation entry in a stepwise process. Each page flows from one to the next by means of a ‘Next’ button and accepts and confirms data entry throughout the process. Upon commencement of a new annotation entry, annotators are presented with an initial page that allows the annotator to specify which of the five currently available annotation types (Table 1) are to be entered. Multiple types can be specified as part of a single entry. Subsequent pages present to the user specialized interfaces responsible for acquiring annotation data specific to each type. The annotator is presented only with interfaces relevant to the annotation types selected at the onset of the entry process. Each interface controls and monitors the input of annotation information to decrease erroneous entries. Annotator’s comments are accepted with each annotation.

    Figure 5. Peer Review Model. The model represents annotation paths leading to various annotation states. Circle nodes represent an annotation event; square nodes represent resting states. An annotation occupies one of four states: Pending, Current, Rejected and Obsolete. Once an annotation is submitted, it remains Pending until either accepted or rejected. Two rejection requests, submitted by two separate annotators, are required for a Pending annotation to be rejected. An annotation will remain Current until a new annotation is proposed and accepted. Once an annotation is submitted, further annotation cannot be performed until a resting state is reached. The process iterates until annotators submit no further annotations.
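
    The state transitions of the peer review model can be summarized as a small state machine. The sketch below is one possible reading of the rules described in this section; the class and method names are hypothetical, and two details are assumptions rather than statements from the paper: the rule that blocks a new submission is applied per feature and annotation type, and annotators may not review their own submissions.

```python
from dataclasses import dataclass, field

PENDING, CURRENT, REJECTED, OBSOLETE = "Pending", "Current", "Rejected", "Obsolete"
REJECTIONS_REQUIRED = 2  # two separate annotators must request rejection


@dataclass
class Annotation:
    annotation_id: int
    annotation_type: str            # e.g. "Gene Name", "Ontology", "Note"
    primary_annotator: str
    state: str = PENDING
    rejected_by: set = field(default_factory=set)


class FeatureAnnotations:
    """Annotation history of a single feature under the peer review rules."""

    def __init__(self):
        self.history = []

    def submit(self, annotation, bypass_review=False):
        # No further annotation of this type until a resting state is reached.
        if any(a.state == PENDING and a.annotation_type == annotation.annotation_type
               for a in self.history):
            raise ValueError("a Pending annotation of this type already awaits review")
        self.history.append(annotation)
        if bypass_review:           # permitted by principle (iii), but discouraged
            self._make_current(annotation)

    def accept(self, annotation, reviewer):
        self._check_review(annotation, reviewer)
        self._make_current(annotation)   # a single acceptance is sufficient

    def reject(self, annotation, reviewer):
        self._check_review(annotation, reviewer)
        annotation.rejected_by.add(reviewer)   # a set ignores repeat votes
        if len(annotation.rejected_by) >= REJECTIONS_REQUIRED:
            annotation.state = REJECTED

    def _check_review(self, annotation, reviewer):
        if annotation.state != PENDING:
            raise ValueError("only Pending annotations can be reviewed")
        if reviewer == annotation.primary_annotator:
            raise ValueError("annotators cannot review their own submission")

    def _make_current(self, annotation):
        # Notes accumulate; other types replace the previous Current annotation.
        if annotation.annotation_type != "Note":
            for other in self.history:
                if (other is not annotation and other.state == CURRENT
                        and other.annotation_type == annotation.annotation_type):
                    other.state = OBSOLETE
        annotation.state = CURRENT
```

    Storing rejection requests as a set of reviewer names means that a second request from the same annotator cannot by itself reject an annotation, which mirrors the first principle of the model.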

    A record of all annotations submitted to the system is maintained. The Annotation Profile page (Fig. 6) displays annotation details pertaining to an individual feature. The profile is linked from several locations throughout the site and allows any user (guest or annotator) to access a feature’s annotation history.

    Figure 6. Annotation Profile. The Annotation Profile displays a complete history of annotations submitted by registered annotators alongside pertinent information indicating an annotation ID, type of annotation submitted, annotation status, corresponding Primary Annotator (PA) and annotation submission date. The button displays additional information surrounding the annotation submission. More specifically, the column in which it is located determines the type of information it displays. If in the ‘Annotation’ column, a user click displays an annotation . If in the ‘Annotators’ column, an overview of contributing annotators is displayed. The Ontology and Gene Name annotations submitted by user ‘TIGR’ (ID = 63602) represent annotations that have become Obsolete due to the submission of newer annotations of the same type (IDs 63747 and 63760, respectively). Note annotations do not become Obsolete upon submission of newer Note annotations. The Gene Name annotation submitted by user ‘sbrownieuk’ (ID 63762) represents a Pending annotation awaiting review by the community. Annotations of different types with the same ID number represent annotations submitted simultaneously by the Primary Annotator.

    Annotation data are only accepted from registered users and user registration is controlled via a Registration page. An annotator id is issued via email to project leaders upon their registration. Project members wishing to register are required to enter the annotator id issued to their project leader in order to complete the registration process. It is expected that project leaders will maintain responsibility for distributing their annotator id only to project members qualified to perform annotation assessments.

    Users must specify a username and password as a part of the registration process. Upon logging onto the system, annotators are directed to the Annotator Resource page. This page is specific to the user and contains annotation statistics and links pertaining to the annotation process. Links lead annotators to pages that maintain a personal annotation history as well as annotations submitted by other annotators. The status of annotations may be reviewed and annotators may cancel their own pending annotations.

    Annotators may set up a research profile describing their research. This profile is used by the system to recommend pending annotations awaiting review to users who may have an interest in reviewing them. Recommendations are made via email alerts and by flagging annotations displayed by the system.

    Background information for each annotator, such as research institute, location and email address, is shared with other users throughout the site. All annotations have associated with them the annotators’ background information and annotation comments submitted during the annotation process. An annotator search is available to all users of PeerGAD and allows viewing of all currently registered users or a subset based on a query. The search returns the relevant user names and links to a list of annotations contributed by each user.

    Gene ontologies

    Multiple pages integrate GO functionality. The Genome Browser page conveys current GO assignments to users via a GO tree component (Fig. 4B). This component graphically presents term relationships in tree form and maintains characteristics associated with a directed acyclic graph (DAG). Linked from the Genome Browser page is a more detailed interface that is available to users. This interface contains the GO tree component along with evidence data and a GO annotation browser that allows rapid comparison of all GO data submitted for the currently selected feature whether Current or not. Available to annotators only, the GO Annotation Tool (Fig. 7) allows users to assign individual GO terms as a part of the annotation process. Assignments can be made from any of the currently established GO terms classified under the parent GO categories: molecular function, biological process and cellular component.

    Figure 7. GO Annotation Tool. Available to registered annotators, the tool allows GO terms to be interactively assigned to an Ontology annotation. The tool features two windows, (1) Gene Ontology Terms (upper window) and (2) Gene Ontology Annotation (lower window), and an Annotation Browser. The upper window makes accessible the entire GO vocabulary via a dynamic GO tree that actively retrieves child nodes from the database as parent nodes are expanded. Obscure terms may be located by means of the keyword filter (A) and by linking to the external GO database ‘AmiGO’ (B). Terms selected and added via the ‘’ button (C) to a newly formed Ontology annotation appear in the lower window as indicated by the arrow (D). Terms selected in the lower-window may be removed from the Ontology annotation via the ‘’ button (E). Upon selecting terms in the upper window, the corresponding GO definition is displayed to assist the annotator in choosing an appropriate term. Further information relating to the selected term is available to the annotator via the ‘’ button (F). Existing GO annotations may be updated or created via the ‘’ and ‘’ buttons, respectively (G). The Annotation Browser, located at the bottom of the tool (H), allows all previously submitted Ontology annotations to be viewed.

    A key feature of the GO vocabulary is that it is dynamic. As the GO Consortium adds new terms to the GO vocabulary, monthly updates are made available to the research community via the consortium’s web page (http://www.geneontology.org). PeerGAD maintains concurrency with these updates through an automated update system. A system administrator maintains these updates via the Update GO Terms page.

    Performance

    PeerGAD is a multi-user application that allows multiple members of the community to access it simultaneously. When providing a resource to a large number of simultaneous users, performance considerations and issues relating to resource allocation must be considered. Degradation of a user’s experience due to inadequate system performance such as sluggish page loads and socket errors would likely lead to decreased use of the system by its intended community. To ensure that PeerGAD’s performance is adequate during heavy usage, it is necessary to simulate user load conditions experimentally. Termed stress testing, such conditions may be simulated using Microsoft Application Center Test (ACT). The Genome Browser page is the most resource-intensive portion of the entire application; it must retrieve annotation data from the database and dynamically build content such as the genome map and translated protein sequence. To assess how PeerGAD would perform under heavy user load, the Genome Browser page and a less resource-intensive page, the Welcome page, were targeted during the stress test. As expected, the Welcome page outperforms the Genome Browser page (Fig. 8). The performance of the Genome Browser page does not begin to show signs of decline until 100 simultaneous users, which should be adequate for even very large genome annotation communities. In fact, even given a community size of 3000 users and an assumed rate of 1% simultaneous system usage, users would not notice any significant decrease in system performance.

    Figure 8. Performance results. These plots represent the results derived from stress tests performed to evaluate how PeerGAD performs under various user loads. Two areas of the PeerGAD application were targeted during stress tests: the Welcome page and the Genome Browser page. Line color in each plot indicates to which of these two pages the results refer: green, Welcome page; orange, Genome Browser page. (A) Latency, or the amount of time required to load and deliver a page request. (B) Number of page requests served per second. (C) Total number of requests delivered by the web server during the 5 min stress test. In all plots, the number of simultaneous connections is indicated on the x-axis.

    To evaluate PeerGAD’s potential to accommodate large genomes (e.g. >100 Mb), potential scaling bottlenecks were considered. A critical potential bottleneck is the access time required to retrieve sequence data from sequence tables. The P.syringae DC3000 chromosome sequence is 6.4 Mb in length. The table containing this sequence was doubled in size sequentially four times to ultimately accommodate a sequence of 102 Mb in length (6.4 Mb × 2^4 ≈ 102 Mb).

    Measuring the access time required to retrieve data using a SELECT statement on this table after each doubling indicates how the table would scale to accommodate larger genomes. Results indicate that there is no noticeable difference in access time for a SELECT statement issued on a sequence table containing 6.4 Mb versus 102 Mb of sequence data. Access times after each doubling were <18 ms regardless of the genome size tested. These results indicate that PeerGAD’s sequence tables can easily accommodate larger genome sequences without experiencing a decrease in overall system performance. PeerGAD is intended for use with prokaryotic genomes, making such scaling capabilities unnecessary; however, these results indicate that the sequence table design can also accommodate large eukaryotic genomes.
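
    The doubling experiment can be sketched as follows. This is an illustration under stated assumptions, not PeerGAD's schema or database engine: it uses an in-memory SQLite database, a hypothetical `sequence` table that stores the genome in fixed-size chunks keyed by start position, and a scaled-down starting length.

```python
import sqlite3
import statistics
import time

CHUNK = 1000  # bases per row; a hypothetical chunk size, not PeerGAD's design


def build_table(conn, length):
    """Create a chunked sequence table holding `length` bases (a multiple of CHUNK)."""
    if length % CHUNK:
        raise ValueError("length must be a multiple of CHUNK")
    conn.execute(
        "CREATE TABLE sequence (chunk_start INTEGER PRIMARY KEY, bases TEXT NOT NULL)"
    )
    pattern = ("ACGT" * (CHUNK // 4 + 1))[:CHUNK]  # placeholder sequence data
    conn.executemany(
        "INSERT INTO sequence VALUES (?, ?)",
        ((start, pattern) for start in range(0, length, CHUNK)),
    )


def table_length(conn):
    (rows,) = conn.execute("SELECT COUNT(*) FROM sequence").fetchone()
    return rows * CHUNK


def double_table(conn):
    """Append a copy of every row, shifted past the current end of the sequence."""
    offset = table_length(conn)
    conn.execute(
        "INSERT INTO sequence SELECT chunk_start + ?, bases FROM sequence", (offset,)
    )


def fetch_region(conn, start, end):
    """Return bases in the half-open interval [start, end) with one SELECT."""
    first = (start // CHUNK) * CHUNK
    rows = conn.execute(
        "SELECT bases FROM sequence WHERE chunk_start >= ? AND chunk_start < ? "
        "ORDER BY chunk_start",
        (first, end),
    ).fetchall()
    joined = "".join(bases for (bases,) in rows)
    return joined[start - first:end - first]


def median_select_ms(conn, start, end, repeats=25):
    timings = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fetch_region(conn, start, end)
        timings.append((time.perf_counter() - t0) * 1000)
    return statistics.median(timings)


def run_experiment(base_length, doublings=4, region=(1500, 4500)):
    """Time the same region query before and after each doubling of the table."""
    conn = sqlite3.connect(":memory:")
    build_table(conn, base_length)
    results = [(table_length(conn), median_select_ms(conn, *region))]
    for _ in range(doublings):
        double_table(conn)
        results.append((table_length(conn), median_select_ms(conn, *region)))
    conn.close()
    return results


if __name__ == "__main__":
    for length, ms in run_experiment(base_length=64_000):
        print(f"{length:>9} bases: median SELECT {ms:.3f} ms")
```

    In this sketch the query locates rows through the primary key, so its cost depends on the size of the requested region rather than on the size of the table. Whether PeerGAD's sequence tables are indexed in the same way is not stated in the text.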

    DISCUSSION

    Efficient and accurate annotation of genome sequences is an essential first step toward using these sequences for comparative and functional genomics. PeerGAD presents a convenient way for genome projects to offer a centralized resource in which to collect annotation data from the scientific community. It allows for the efficient exchange of scientific ideas while providing a mechanism to ensure the integrity of the scientific data it maintains. A key premise underlying PeerGAD is that a community of researchers will both contribute to and directly benefit from a peer-reviewed annotation database and that this self-interest will lead to continued improvements in the annotation as well as foster better communication among scientists.

    The extent to which a large group of researchers will participate in a community-based annotation effort is unknown. Academic scientists are motivated to publish in peer-reviewed journals for reasons related to both dissemination of their ideas and, importantly, career advancement. With respect to the latter, greater recognition by fellow scientists and tenure review committees of meaningful contributions to genome annotation would be one way to encourage community involvement. This is probably a long-term prospect, however, and is not directly addressable by developers of annotation software.

    Several research groups have recognized the potential of harnessing the web to facilitate the annotation of prokaryotic genome sequences. Examples include ASAP (14), CYORF (15), Manatee (http://manatee.sourceforge.net/) and PseudoCAP (http://www.pseudomonas.com/). Table 2 compares the major functionality of each of these web-based annotation packages. Although these packages share areas of similar functionality, several key areas set them apart. PeerGAD’s greatest distinction among these packages is its method of regulating the review of annotation submissions.

    Table 2. Summary of selected web-based annotation packages.

    CYORF’s method of annotation review is most closely related to that of PeerGAD in that it too does not require curatorial review of submissions before public release. However, as with Manatee, annotations submitted to CYORF are automatically released to the public upon submission. PeerGAD regulates release more tightly: annotation data are withheld from the public until one or several members of its online annotation community have reviewed them. In each package, annotations may be resubmitted multiple times until annotators derive an accurate annotation. Further, both CYORF and PeerGAD attempt to facilitate communication between registered annotators by means of email notifications.
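
    The difference between release on submission and release after peer review can be expressed as a small rule. The sketch below is illustrative only: the `Annotation` class, the `required_reviews` threshold and the ban on self-review are assumptions made for the example, not a description of PeerGAD's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    orf_id: str
    author: str
    text: str
    required_reviews: int = 1  # hypothetical threshold ("one or several" reviewers)
    approvals: set = field(default_factory=set)

    def approve(self, reviewer):
        """Record an approval from a community member other than the author."""
        if reviewer == self.author:
            raise ValueError("authors cannot review their own annotation")
        self.approvals.add(reviewer)  # a set, so repeat approvals count once

    @property
    def is_public(self):
        """Peer-review rule: hidden until enough distinct reviewers approve."""
        return len(self.approvals) >= self.required_reviews


if __name__ == "__main__":
    ann = Annotation("ORF0001", "alice@example.org", "putative effector",
                     required_reviews=2)
    print(ann.is_public)  # False: no reviews yet
    ann.approve("bob@example.org")
    ann.approve("carol@example.org")
    print(ann.is_public)  # True: two distinct reviewers have approved
```

    Under a release-on-submission policy, as described for CYORF and Manatee, the equivalent check would simply return true as soon as the annotation exists.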

    Notable strengths of the ASAP package are that it encompasses a broad range of feature types to which annotation can be assigned, allows high-throughput experimental data (e.g. microarray data) to be linked to annotations and accepts annotation that establishes relationships (e.g. protein–protein interactions, multi-genome orthologies) between multiple features. Similarly, CYORF facilitates intergenomic comparisons by allowing features from multiple related genomes to be linked. A notable strength of the Manatee package is its GO functionality and use of prediction models (e.g. hidden Markov models) to suggest GO terms to annotators.

    It should be emphasized that not all of these packages have been developed with other genome communities in mind, but they are highlighted because, at the very least, they incorporate functionality that addresses universal needs. Functionality and intended purpose vary among all of these systems, and each one likely has functionality to which each of the others could aspire.

    One way to encourage community participation in genome annotation that is under the control of software developers is to make the process user-friendly and efficient, and to provide immediately useful tools to the researcher via the annotation website. This is the strategy we have employed in the design of PeerGAD. By incorporating functionality that goes beyond the simple acceptance of annotation data, we anticipate increased usage of the system as a whole, thus increasing the likelihood of annotation contributions by the community. The Genome Browser (Fig. 3), for example, includes many rich features for viewing and accessing the genome sequence contextually. Functionality that allows researchers to download sequence data, search the genome or rapidly change between molecules is part of a simple but valuable toolbox that will likely bring users back to the system.

    Further building on the assumption that a loyal user base will increase the likelihood of annotation contributions, PeerGAD attempts to draw users back to the system through the use of email alerts. Subsequent to the registration process, annotators are encouraged to set up user profiles describing their specific research interests. While this information is not directly shared with other annotators, it is used to alert annotators with related expertise that a new annotation has been submitted to the system.
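
    The alerting step can be sketched as a match between profile interests and the terms attached to a new annotation. The matching rule shown here (case-insensitive overlap of keywords) and the function name `annotators_to_alert` are assumptions for illustration; the text does not specify how PeerGAD determines related expertise.

```python
def annotators_to_alert(profiles, annotation_terms, submitter):
    """Return email addresses of annotators whose interests overlap the new annotation.

    profiles: dict mapping an annotator's email to a list of interest keywords.
    annotation_terms: keywords describing the newly submitted annotation.
    submitter: email of the submitting annotator, who is not alerted.
    """
    terms = {t.lower() for t in annotation_terms}
    return sorted(
        email
        for email, interests in profiles.items()
        if email != submitter and terms & {i.lower() for i in interests}
    )


if __name__ == "__main__":
    profiles = {
        "alice@example.org": ["type III secretion", "effectors"],
        "bob@example.org": ["siderophores"],
        "carol@example.org": ["Effectors", "regulation"],
    }
    print(annotators_to_alert(profiles, ["effectors"], submitter="alice@example.org"))
```

    Only the resulting addresses are used to send alerts; consistent with the text, the profile contents themselves are never exposed to other annotators.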

    Annotation statistics have also been used as a way to encourage further annotation of the genome. During mock-up and feedback sessions performed during the early stages of PeerGAD development, it became apparent that scientists like to see their names in print. By displaying a ‘Top 5 Annotators’ table on the Welcome and Annotator Resource pages, scientists’ names are displayed to other community members visiting the site while also setting up an element of friendly competition. Further, PeerGAD incorporates summaries of pending annotations as well as a summary of ORFs with unassigned Ontology annotations to encourage further annotation as well as to avoid the possible pitfall that researchers will not contribute because they are unsure where to contribute.
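
    A ranking such as the ‘Top 5 Annotators’ table amounts to counting contributions per annotator. The sketch below assumes a hypothetical list of (annotator, ORF) pairs and breaks ties alphabetically; how PeerGAD counts contributions or orders ties is not described in the text.

```python
from collections import Counter


def top_annotators(annotations, n=5):
    """Rank annotators by number of annotations; ties are ordered by name."""
    counts = Counter(author for author, _orf in annotations)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


if __name__ == "__main__":
    annotations = [
        ("Lee", "ORF0001"), ("Kim", "ORF0002"), ("Lee", "ORF0003"),
        ("Ito", "ORF0004"), ("Kim", "ORF0005"), ("Lee", "ORF0006"),
    ]
    for name, count in top_annotators(annotations):
        print(f"{name}: {count}")
```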

    Drawing users to the system is one task; maintaining similar annotation standards among annotators is another task entirely. PeerGAD attempts to promote the use of similar annotation standards among annotators using several approaches: (i) offering extensive annotation guidelines throughout the annotation process; (ii) using text boxes, pulldown menus and controlled vocabularies; (iii) displaying email addresses of annotators throughout the annotation pages; and (iv) applying the peer review process. We anticipate that the implementation of these approaches will together reduce confusion and ambiguities among annotators. By offering guidelines to annotators and constraining data entry through pulldown menus, for example, we hope to prevent obvious annotation entry inconsistencies. Displaying annotators’ email addresses gives them the option of off-line discussion of annotation standards. Additionally, the peer review process itself helps to facilitate communication among annotators and also offers a venue by which annotators may enforce standards upon other members of the annotation community.

    Microsoft technology has been used throughout development, data storage and delivery of the PeerGAD application to the web. Although Microsoft technology is not historically associated with bioinformatics, it is widely and successfully used in web-based data-centric multi-user environments throughout the world (http://www.asp.net/Default.aspx?tabindex=8&tabid=40). In particular, there are several examples where Microsoft ASP.NET technology has been integrated into bioinformatics and LIMS environments. For example, Cornell University, where PeerGAD has been developed, has an extensive commitment to Microsoft technology and has successfully adopted Windows clusters for parallel processing in BLAST and protein modeling (http://cbsu.tc.cornell.edu/main.htm, http://cbsu.tc.cornell.edu/projects.htm, http://www.tc.cornell.edu/). Non-academic units such as Perlegen Sciences (http://www.perlegen.com/) use Microsoft technology for core LIMS development. Microsoft has released a case study of how its technology is being used at Perlegen, and this is available at http://www.microsoft.com/resources/casestudies. These examples indicate that, although much of Microsoft’s commitment has been to the industrial and financial worlds, the same principles used in those data-centric environments can be applied to bioinformatics models.

    C# is an object-oriented language that parallels the structure of Java (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dncenet/html/c_n_java.asp). Microsoft provides the Java Language Conversion Assistant (JLCA), an application that assists developers in converting Java language files to Visual C#. As a consequence, porting BioJava into C# could be an avenue to explore if development in Microsoft environments becomes of greater interest to the mainstream bioinformatics community.

    We recognize that our use of ASP.NET for PeerGAD might hinder its portability to UNIX/Linux-based systems, a cornerstone of many successful bioinformatics efforts. However, the Mono project (http://www.go-mono.com), spearheaded by Ximian (http://www.ximian.com/), is currently working toward an ASP.NET standard for UNIX/Linux platforms. Ximian has recently been acquired by Novell, which has pledged continued support to the Mono initiative. While porting PeerGAD to work within the Mono framework has not been within the scope of our current efforts, Mono makes the porting of ASP.NET-based projects such as PeerGAD possible in the near future.

    Use of a peer review model for genome annotation projects has not yet been broadly tested. Thus PeerGAD offers a focused means of testing this model and can evolve to fit its users’ needs. With a community of several hundred active Pseudomonas researchers, we expect ample feedback on this system to be generated in the coming year. Our future efforts will be aimed at assessing its strengths as well as those areas that should evolve further to fit community needs. While this report attempts to highlight the functionality and design considerations incorporated into PeerGAD, we intend to report the successes and shortcomings of our approaches in future publications.

    ACKNOWLEDGEMENTS

    The Cornell Bioinformatics Service Unit at Cornell University (CBSU) hosts the P.syringae DC3000 annotation project site. Qi Sun and Lucy Walle are CBSU system administrators responsible for maintaining server resources. Thanks are given to Baohua Wang of the Cornell Theory Center for offering database design insights. We thank Robin Buell and Michelle Giglio of The Institute for Genomic Research for their assistance in providing sequence and annotation data. Jeffrey Gordon is thanked for his efforts in developing a Visual Basic XML parser responsible for obtaining the NCBI accession numbers used in this project. Both Magdalen Lindeberg and Candace Collmer are thanked for their editorial input of system help files and text presented throughout PeerGAD. Jonathan Cohn, Jeffrey Gordon, Magdalen Lindeberg, Jeff Anderson, Michelle Gwinn, Adriana Ferreira and Phil Bronstein are thanked for their comments relating to site flow and GUI development. Nicole Perna (Project Leader, ASAP), Michelle Gwinn (TIGR Annotator, Manatee), Fiona Brinkman (Project Leader, PseudoCAP) and Jeff Elhai (Developer, CYORF) are acknowledged and thanked for their contributions to Table 2. The development of PeerGAD was supported by National Science Foundation grant no. DBI-0077622 (to A.C. and G.B.M.).

    REFERENCES

    1. Andrade,M.A., Brown,N.P., Leroy,C., Hoersch,S., de Daruvar,A., Reich,C., Franchini,A., Tamames,J., Valencia,A., Ouzounis,C. et al. (1999) Automated genome sequence analysis and annotation. Bioinformatics, 15, 391–412.

    2. Stein,L. (2001) Genome annotation: from sequence to biology. Nat. Rev. Genet., 2, 493–503.

    3. Gattiker,A., Michoud,K., Rivoire,C., Auchincloss,A.H., Coudert,E., Lima,T., Kersey,P., Pagni,M., Sigrist,C.J., Lachaize,C. et al. (2003) Automated annotation of microbial proteomes in SWISS-PROT. Comput. Biol. Chem., 27, 49–58.

    4. Gerlt,J.A. and Babbitt,P.C. (2000) Can sequence determine function? Genome Biol., 1, reviews0005.1–0005.10.

    5. Rust,A.G., Mongin,E. and Birney,E. (2002) Genome annotation techniques: new approaches and challenges. Drug Discov. Today, 7, S70–S76.

    6. Pennisi,E. (2000) Are sequencers ready to ‘annotate’ the human genome? Science, 287, 2183.

    7. Gerstein,M. (2000) Annotation of the human genome. Science, 288, 1590.

    8. Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H., Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S., Eppig,J.T. et al. (2000) Gene ontology: tool for the unification of biology. Nature Genet., 25, 25–29.

    9. Ashburner,M., Ball,C.A., Blake,J.A., Botstein,D., Butler,H., Cherry,J.M., Davis,A.P., Dolinski,K., Dwight,S.S., Eppig,J.T. et al. (2001) Creating the gene ontology resource: design and implementation. Genome Res., 11, 1425–1433.

    10. Gribskov,M. (2003) Challenges in data management for functional genomics. Omics, 7, 3–5.

    11. Delcher,A.L., Harmon,D., Kasif,S., White,O. and Salzberg,S.L. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res., 27, 4636–4641.

    12. Buell,C.R., Joardar,V., Lindeberg,M., Selengut,J., Paulsen,I.T., Gwinn,M.L., Dodson,R.J., Deboy,R.T., Durkin,A.S., Kolonay,J.F. et al. (2003) The complete genome sequence of the Arabidopsis and tomato pathogen Pseudomonas syringae pv. tomato DC3000. Proc. Natl Acad. Sci. USA, 100, 10181–10186.

    13. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

    14. Glasner,J.D., Liss,P., Plunkett,G.,3rd, Darling,A., Prasad,T., Rusch,M., Byrnes,A., Gilson,M., Biehl,B., Blattner,F.R. et al. (2003) ASAP, a systematic annotation package for community analysis of genomes. Nucleic Acids Res., 31, 147–151.

    15. Furumichi,M., Sato,Y., Omata,T., Ikeuchi,M. and Kanehisa,T. (2002) CYORF: community annotation of Cyanobacteria genes. Genome Informatics, 13, 402–403.

    (Mark D. D’Ascenzo*,1, Alan Collmer2 and )