Database resources of the National Center for Biotechnology Information
Nucleic Acids Research, 2006
    National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA

    *To whom correspondence should be addressed. Tel: +1 301 496 2475; Fax: +1 301 480 9241; Email: wheeler@ncbi.nlm.nih.gov

    ABSTRACT

    In addition to maintaining the GenBank(R) nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through NCBI's Web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central, Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Electronic PCR, OrfFinder, Spidey, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, Cancer Chromosomes, Entrez Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, Clusters of Orthologous Groups, Retroviral Genotyping Tools, HIV-1, Human Protein Interaction Database, SAGEmap, Gene Expression Omnibus, Entrez Probe, GENSAT, Online Mendelian Inheritance in Man, Online Mendelian Inheritance in Animals, the Molecular Modeling Database, the Conserved Domain Database, the Conserved Domain Architecture Retrieval Tool and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized datasets. All of the resources can be accessed through the NCBI home page at: http://www.ncbi.nlm.nih.gov.

    INTRODUCTION

    The National Center for Biotechnology Information (NCBI) at the National Institutes of Health was created in 1988 to develop information systems for molecular biology. In addition to maintaining the GenBank(R) (1) nucleic acid sequence database, to which data are submitted by the scientific community, NCBI provides data retrieval systems and computational resources for the analysis of GenBank data and a variety of other biological data. For the purposes of this article, the NCBI suite of database resources is grouped into six broad categories. All resources discussed are available from the NCBI home page at http://www.ncbi.nlm.nih.gov. In most cases, the data underlying these resources are available for bulk download from ftp.ncbi.nih.gov, which is linked from the NCBI home page.

    DATABASE RETRIEVAL TOOLS

    Entrez

    Entrez (2) is an integrated database retrieval system that enables text searching, using simple Boolean queries, of a diverse set of 30 databases, six of which were added during the past year. Global Query, the default search on the NCBI home page, searches across all the Entrez databases and rapidly returns the counts of matching records in each database. A user may then display results or further refine searches in any individual database. The Entrez databases include almost 70 million DNA and protein sequences derived from several sources (1,3–6), the NCBI taxonomy, genomes, population sets, gene expression data, 800 000 gene-oriented sequence clusters in UniGene, almost half a million sequence-tagged sites in UniSTS, 20 million genetic variations in dbSNP, 30 000 protein structures from the Molecular Modeling Database (MMDB) (6), 138 000 3D and 11 000 alignment-based protein domains and the biomedical literature via PubMed, PubMed Central, Online Mendelian Inheritance in Man (OMIM) and online Books. PubMed includes 15 million citations from 4700 life science journals for biomedical articles back to the 1950s, most with abstracts and many with links to the full-text article. The Books database contains more than 49 online scientific textbooks including the NCBI Handbook, a comprehensive guide to NCBI resources. To enable researchers to quickly reach the appropriate NCBI resource, the content of the NCBI web pages and FTP directories has been incorporated into an Entrez database of its own. The NCBI web site can therefore be searched using the same powerful queries available for the biological databases.

    Entrez provides extensive links within and between database records. In their simplest form, these links may be cross-references between a sequence and the abstract of the paper in which it is reported, or between a protein sequence and its coding DNA sequence or, perhaps, its 3D structure. Other examples are links between a genomic assembly and its components or between a genomic sequence and those sequences derived from its annotation. Computationally derived links between ‘neighboring records’, such as those based on computed similarities among sequences or among PubMed abstracts, allow rapid access to groups of related records. A service called LinkOut expands the range of links to include external services, such as organism-specific genome databases. To accommodate the growing number of links, Entrez provides a Links pull-down menu that appears in the top right-hand corner of record displays.

    The records retrieved in Entrez can be displayed in many formats and downloaded singly or in batches. A redirection control allows results to be saved in a local file, shown in the browser as plain text, or sent to the Entrez clipboard. PubMed results may be emailed directly from Entrez. Formats available for GenBank records include the GenBank Flatfile, FASTA, XML, ASN.1 and others. Graphical display formats are offered for some types of records, including genomic records. For sequence records, a formatting control allows the display or download of a particular range of residues.

    Entrez has introduced ‘MyNCBI’, which allows users to store personal configuration options, such as search filters, LinkOut preferences and document delivery providers. MyNCBI also saves searches and can automatically e-mail updated search results. Also new in Entrez is a set of up to five default filter tabs used to display subsets of database results. The tabs vary according to Entrez database; examples of defaults include ‘mRNA’ and ‘RefSeq’ subsets for Nucleotide, a ‘Review’ subset for PubMed, and ‘NMR’ and ‘X-ray’ subsets for Structure. Default filter tabs can be changed using MyNCBI. Additional MyNCBI features include switching the display of Entrez links between standard HTML links and pull-down menus, and highlighting PubMed search terms.

    Access to Entrez via user scripting is facilitated by the Entrez Programming Utilities (E-Utilities), a suite of eight server-side programs supporting a uniform set of parameters used to search, link between and download from the Entrez databases. A search history, available via interactive Entrez as well as via the E-Utilities, allows users to recall the results of previous searches during an Entrez session and combine them using Boolean logic. The ‘einfo’ utility can be used to retrieve detailed information about the Entrez databases, such as lists of supported search fields or the date of the last database update, while ‘egquery’ returns the number of matches to a single query in every Entrez database. An automated system may use E-Utilities, such as ‘efetch’ or ‘esummary’, to retrieve the data. Recently, support for new download formats was added to the E-Utilities for several of the Entrez databases, such as Gene, SNP and Taxonomy. A new E-Utility, ‘espell’, was also implemented to check spelling within Entrez queries and to offer suggestions in cases where a misspelling might cause key records to be missed. A Simple Object Access Protocol (SOAP) interface to the E-Utilities supports structured, XML-based access. Instructions for using the E-Utilities are found under the ‘Entrez Tools’ link on the NCBI home page.
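    As a rough illustration of scripted access, the following Python sketch builds the URLs for a search that is stored in the search history and for a later download of the stored results. The base address, the example query and the placeholder WebEnv value are assumptions made for this sketch and do not come from the article; the parameter names should be checked against the current E-Utilities documentation. No network request is made.

```python
from urllib.parse import urlencode

# Assumption: base address of the E-Utilities; verify against NCBI's documentation.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def eutils_url(utility, **params):
    """Build the URL for one E-Utility call, e.g. 'esearch' or 'efetch'."""
    return f"{EUTILS_BASE}/{utility}.fcgi?{urlencode(params)}"


# Step 1: search Nucleotide with a Boolean query and keep the result set in
# the search history (usehistory=y) so that it can be recalled later.
search_url = eutils_url(
    "esearch",
    db="nucleotide",
    term="human[Organism] AND BRCA1[Gene Name]",
    usehistory="y",
)

# Step 2: download the first records of the stored result set as FASTA.
# WebEnv and query_key are returned by esearch; the values below are placeholders.
fetch_url = eutils_url(
    "efetch",
    db="nucleotide",
    WebEnv="WEBENV_FROM_ESEARCH",
    query_key="1",
    rettype="fasta",
    retmode="text",
    retmax="5",
)

print(search_url)
print(fetch_url)
```

    Building the URL in one helper keeps the uniform parameter set in a single place; a real client would also send the request, parse the XML reply and respect NCBI's usage limits.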

    PubMed Central

    PubMed Central (PMC) (7) is a digital archive of peer-reviewed journals in the life sciences providing access to 400 000 full-text articles, an increase of 100 000 over the past year. More than 200 journals, including Nucleic Acids Research, deposit the full text of their articles in PMC. Participation in PMC requires a commitment to free access to full text, either immediately after publication or within a 12-month period. All PMC free articles are identified in PubMed search results and PMC itself can be searched using Entrez.

    Taxonomy

    The NCBI taxonomy database, growing at the rate of 3000 new taxa a month, indexes 205 000 named organisms that are represented in the databases with at least one nucleotide or protein sequence. The Taxonomy Browser can be used to view the taxonomic position or retrieve data from any of the principal Entrez databases for a particular organism or group. The Taxonomy Browser also displays links to the Map Viewer, Genomic BLAST services, the Trace Archive, and to external model organism and taxonomic databases via LinkOut. Searches of the NCBI taxonomy may be made on the basis of whole, partial or phonetically spelled organism names. Entrez Taxonomy displays include custom taxonomic trees representing user-defined subsets of the full NCBI taxonomy.

    THE BLAST FAMILY OF SEQUENCE-SIMILARITY SEARCH PROGRAMS

    The Basic Local Alignment Search Tool (BLAST) programs (8–10) perform sequence-similarity searches against a variety of sequence databases, returning a set of gapped alignments with links to full database records, to UniGene, Gene, the MMDB or GEO. One variant, BLAST2Sequences (11), compares two DNA or protein sequences and produces a dot-plot representation of the alignments.

    Each alignment returned by BLAST is scored and assigned a measure of statistical significance, called the Expectation Value (E-value). BLAST takes into account the amino acid composition of the query sequence in its estimation of statistical significance. This composition-based statistical treatment, used in conventional protein BLAST searches as well as PSI-BLAST searches, tends to reduce the number of false-positive database hits (12). The alignments returned can be limited by an E-value threshold or range.
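    The relation between an alignment score and its E-value can be sketched with the standard Karlin-Altschul formula, E = K m n exp(-lambda S), where m and n are the query and database lengths. The Python sketch below is illustrative only: the values of lambda and K are assumed for the example, and the composition-based adjustment and edge-effect corrections that BLAST applies are not modeled.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score S to a bit score S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments with score >= S: K*m*n*exp(-lam*S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Illustrative statistical parameters (assumed, not taken from the article).
LAMBDA, K = 0.267, 0.041
m, n = 250, 50_000_000  # query length and total database length in residues

for s in (60, 80, 100):
    bits = bit_score(s, LAMBDA, K)
    print(s, round(bits, 1), f"{e_value(s, m, n, LAMBDA, K):.2e}")
```

    Under this formula the E-value falls exponentially as the score rises and grows linearly with the size of the database searched, which is why the same alignment is less significant in a larger database.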

    Standard output formats include the default pairwise alignment, several query-anchored multiple sequence alignment formats, an easily parsable Hit Table and a taxonomically organized output. Database sequences appearing in BLAST results may be marked for batch retrieval using check boxes. A new, enhanced formatter displays alignments against database sequences that are longer than 200 000 bp with links to nearby features, such as genes. A new ‘Pairwise with identities’ mode better highlights differences between the query and a target sequence. An option to display masked characters in lower case or in distinct colors is now available.

    The web BLAST interface allows both the initial search and the results displayed to be restricted to a database subset using an Entrez query as a filter. Web BLAST uses a standard URL-API that allows complete search specifications, including BLAST parameters, such as Entrez restrictions and the search query, to be contained in a URL posted to the web page.
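    A search specification of this kind might be assembled as in the Python sketch below. The address and the parameter names (CMD, PROGRAM, DATABASE, QUERY, ENTREZ_QUERY, EXPECT) reflect our understanding of the BLAST URL-API and are assumptions of this sketch, as are the example query sequence and Entrez restriction; they should be checked against the current NCBI documentation. The sketch only builds the URL and sends nothing.

```python
from urllib.parse import urlencode

# Assumption: address of the web BLAST service.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def blast_put_url(query, program, database, entrez_query=None, expect=None):
    """Build a URL that submits a search, optionally limited by an Entrez query."""
    params = {
        "CMD": "Put",          # submit a new search
        "PROGRAM": program,    # e.g. blastn, blastp
        "DATABASE": database,  # e.g. nr
        "QUERY": query,        # sequence or identifier to search with
    }
    if entrez_query is not None:
        params["ENTREZ_QUERY"] = entrez_query  # restrict the database subset
    if expect is not None:
        params["EXPECT"] = str(expect)         # E-value threshold
    return BLAST_URL + "?" + urlencode(params)


url = blast_put_url(
    "MKTAYIAKQR",  # made-up peptide used only as a placeholder query
    program="blastp",
    database="nr",
    entrez_query="mammals[Organism]",
    expect=0.001,
)
print(url)
```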

    MegaBLAST (13), designed to find nearly exact matches, offers a Web interface that handles batch nucleotide queries and operates up to 10 times faster than standard nucleotide BLAST. MegaBLAST is the default search program for NCBI's Genomic BLAST pages. MegaBLAST is also used to search the rapidly growing Trace Archive and is available for the standard BLAST databases as well. For rapid cross-species nucleotide queries, NCBI offers Discontiguous MegaBLAST, which uses a non-contiguous word match (14) as the nucleus for its alignments. Discontiguous MegaBLAST is far more rapid than a translated search such as blastx, yet maintains a competitive degree of sensitivity when comparing coding regions.
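    The idea of a non-contiguous word match can be sketched in Python as follows. The template used here, which ignores every third base, is a hypothetical choice made for illustration and is not the template used by Discontiguous MegaBLAST; the two short sequences are also invented.

```python
from collections import defaultdict

# Hypothetical template: 1 = base must match, 0 = base is ignored.
# Ignoring every third base tolerates third-codon-position changes.
TEMPLATE = "110110110110110110"


def spaced_word(seq, start, template):
    """Return the bases of seq that lie under the '1' positions of the template."""
    return "".join(seq[start + i] for i, bit in enumerate(template) if bit == "1")


def seed_hits(query, subject, template=TEMPLATE):
    """Return (query_pos, subject_pos) pairs whose template words are identical."""
    span = len(template)
    index = defaultdict(list)
    for j in range(len(subject) - span + 1):
        index[spaced_word(subject, j, template)].append(j)
    hits = []
    for i in range(len(query) - span + 1):
        for j in index.get(spaced_word(query, i, template), []):
            hits.append((i, j))
    return hits


# Two coding sequences that differ at five of six third-codon positions.
query = "ATGGCTAAAGGTCTGGAA"
subject = "ATGGCCAAGGGACTCGAG"

print(seed_hits(query, subject))            # discontiguous template
print(seed_hits(query, subject, "1" * 11))  # contiguous 11-base word
```

    With these inputs the discontiguous template finds a seed at the start of both sequences, whereas a contiguous 11-base word finds none, which illustrates why such seeds suit cross-species comparisons of coding regions.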

    A new strategy has been designed to improve the speed of BLAST searches. The system, called ‘SplitD’, splits the databases into a number of segments to spread the calculations across multiple back-end machines. SplitD keeps track of the database segments that have been used most recently and are, therefore, most likely to remain in memory. These segments can be reused in the next search to avoid a slow read from disk storage.
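    The bookkeeping described above resembles a least-recently-used cache. The Python sketch below illustrates that general idea for a single back-end machine; it is not NCBI's implementation, and the class name, capacity and loader function are hypothetical.

```python
from collections import OrderedDict


class SegmentCache:
    """Keeps the most recently used database segments of one machine in memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._segments = OrderedDict()  # segment id -> loaded data, oldest first

    def get(self, segment_id, load_from_disk):
        """Return (data, was_cached); load and cache the segment on a miss."""
        if segment_id in self._segments:
            self._segments.move_to_end(segment_id)  # mark as most recently used
            return self._segments[segment_id], True
        data = load_from_disk(segment_id)  # slow path
        self._segments[segment_id] = data
        if len(self._segments) > self.capacity:
            self._segments.popitem(last=False)  # evict the least recently used
        return data, False


disk_reads = []


def load_segment(segment_id):
    """Stand-in for a slow read of one database segment from disk."""
    disk_reads.append(segment_id)
    return f"data for segment {segment_id}"


cache = SegmentCache(capacity=2)
for segment in ["a", "b", "a", "c", "b"]:
    cache.get(segment, load_segment)

print(disk_reads)
```

    In this run the second request for segment "a" is served from memory, while "b" has to be read again because it was evicted to make room for "c".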

    BLink

    BLAST Link (BLink) displays pre-computed BLAST alignments to similar sequences for each protein sequence in the Entrez databases. BLink can display alignment subsets limited by taxonomic criteria, by database of origin, relation to a complete genome, membership in a COG (15) or by relation to a 3D structure or conserved protein domain. BLink links are displayed for protein records in Entrez as well as within Entrez Gene reports.

    RESOURCES FOR GENE-LEVEL SEQUENCES

    Entrez Gene

    Entrez Gene (5), the successor to LocusLink, provides an interface to curated sequences and descriptive information about genes with links to NCBI's Map Viewer, Evidence Viewer, Model Maker, BLink, protein domains from NCBI's Conserved Domain Database and other gene-related resources. Data are accumulated and maintained through several international collaborations in addition to curation by in-house staff. Links within Gene to the newest citations in PubMed are maintained by curators and provided as Gene References into Function (GeneRIF). The GeneRIF link within Gene reports leads to a form allowing researchers to add GeneRIFs to a Gene report. Entrez Gene displays have recently been enhanced with a collapsible navigation panel containing a table of contents for the record, the set of links to other resources, and links to related NCBI tools. The complete Entrez Gene dataset, as well as organism-specific subsets, is available in the compact NCBI ASN.1 format on the NCBI FTP site. A new tool that converts the native Gene ASN.1 format into XML, called ‘gene2xml’, is available for several popular computer platforms at ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/gene2xml/. The tool supports filtering by organism so that organism-specific XML files can easily be generated from the comprehensive ASN.1 FTP file.

    UniGene

    UniGene (16) is a system for partitioning GenBank sequences, including expressed sequence tags (ESTs), into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene and is linked to the tissue types in which the gene is expressed, model organism protein similarities and Entrez Gene. UniGene clusters are created for all organisms for which there are 70 000 or more ESTs in GenBank and include ESTs for more than 30 animals and over 26 plants and fungi. For the human UniGene July 2005 release (build 186), 5.1 million human ESTs in GenBank were reduced almost 100-fold in number to 53 000 sequence clusters. When sufficient genomic sequence is available, UniGene clusters are now built using a genome-based clustering system to identify sets of transcript sequences that correspond to distinct transcription loci or to annotated genes. The procedure used for genome-based clustering of transcript sequences is described at http://www.ncbi.nlm.nih.gov/UniGene/g_build.html. The UniGene collection has been used as a source of unique sequences for the fabrication of microarrays for the large-scale study of gene expression (17). UniGene databases are updated weekly with new EST sequences, and bimonthly with newly characterized sequences.

    ProtEST

    ProtEST, tightly coupled to UniGene, presents pre-computed BLAST alignments between protein sequences from model organisms and the 6-frame translations of nucleotide sequences in UniGene. ProtEST links are displayed in UniGene reports with model organism protein similarities.

    The Trace and Assembly Archives

    The Trace Archive is a rapidly growing database of 800 million sequencing traces. More than 750 organisms are represented, an increase of more than 350 over the past year. The Assembly Archive links the raw sequence information found in the Trace Archive with assembly information found in GenBank. An Assembly Viewer displays multiple sequence alignments as well as the sequence chromatograms for traces that are part of assemblies. The Trace and Assembly Archives are linked from the NCBI home page.

    HomoloGene

    HomoloGene is a system for automated detection of homologs among the annotated genes of 18 completely sequenced eukaryotic genomes, including those of Homo sapiens, Pan troglodytes, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Anopheles gambiae, Caenorhabditis elegans, Schizosaccharomyces pombe, Saccharomyces cerevisiae, Eremothecium gossypii, Neurospora crassa, Magnaporthe grisea, Arabidopsis thaliana and Oryza sativa. The HomoloGene build procedure is guided by the taxonomic tree and relies on conserved gene order and measures of DNA similarity among closely related species, while making use of protein similarity for more distantly related organisms. HomoloGene reports include homology and phenotype information drawn from Online Mendelian Inheritance in Man (OMIM), Mouse Genome Informatics (MGI) (18), Zebrafish Information Network (ZFIN) (19), Saccharomyces Genome Database (SGD) (20), Clusters of Orthologous Groups (COG) (15) and FlyBase. A Pairwise Scores display gives a table of statistics for protein and nucleotide sequences among members of a HomoloGene group. HomoloGene entries now include paralogs in addition to orthologs. HomoloGene can be queried using Entrez at www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=homologene.

    The Database for the major histocompatibility complex (dbMHC)

    dbMHC contains variations found in alleles of the MHC, a highly variable array of genes playing a vital role in the success of organ transplants and susceptibility to infectious diseases. dbMHC contains hundreds of sequences for MHC alleles and data on typing kits used by academic, clinical and industrial laboratories. The database includes data arising from a survey of human leukocyte antigen (HLA) allele frequency distributions as well as a project to collect HLA genotype and clinical outcome information on hematopoietic cell transplants performed worldwide. Access to the data, lists of contributors, as well as a number of online tools for data analysis are provided at http://www.ncbi.nlm.nih.gov/mhc/MHC.cgi?cmd=init.

    A database of single nucleotide polymorphisms (dbSNP)

    The Database of Single Nucleotide Polymorphisms (dbSNP) (21), a repository for single base nucleotide substitutions and short deletion and insertion polymorphisms, contains 10 million human SNPs and another 10 million from a variety of other organisms, with 5 million of these added over the past year. SNP reports link to 3D structures from the MMDB via NCBI's interactive macromolecular structure viewer, Cn3D (22), to highlight implied amino acid changes in coding regions. dbSNP provides additional information about the validation status, population-specific allele frequencies and individual genotypes for dbSNP submissions. These data are available on the dbSNP FTP site in XML-structured genotype reports that include information about cell lines, pedigree IDs and error flags for genotype inconsistencies and incompatibilities. Haplotype and linkage disequilibrium data are being incorporated in dbSNP as data are released from the International HapMap project. Functional variants are identified when dbSNP submissions can be matched to OMIM records and mutation reports in the biomedical literature. Entrez SNP supports searches for SNPs lying between two markers and batch downloads.

    Reference sequences (RefSeq)

    The RefSeq database (23) provides curated references for transcripts, proteins and genomic regions, plus computationally derived nucleotide sequences and proteins. The complete RefSeq database is provided in the RefSeq directory on the NCBI FTP site. The number of sequences in RefSeq has doubled over the past year. As of Release 12, RefSeq contained 2.8 million sequences, including more than 1.6 million protein sequences, representing almost 3000 organisms. To register for the ‘refseq-announce’ mailing list and be informed of new releases or to read more about the RefSeq project, visit the RefSeq home page.

    Specialized tools

    Open reading frame (ORF) Finder

    ORF Finder performs a six-frame translation of a nucleotide sequence and returns the location of each ORF within a specified size range. Translations of the ORFs detected can be analyzed via BLAST against the standard BLAST or COGs databases.
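    The six-frame scan can be sketched in Python as follows. This is a simplified illustration and not the ORF Finder code: it assumes that an ORF runs from an ATG to the next in-frame stop codon of the standard genetic code, and the function names, coordinate convention and test sequence are choices made for this sketch.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Yield (start, end) 0-based half-open spans of ATG...stop ORFs in one frame."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3
            start = None


def find_orfs(seq, min_len=30, max_len=None):
    """Return (strand, frame, start, end, length) for ORFs within the size range.

    Coordinates are 0-based, half-open and always refer to the input strand.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame):
                length = end - start
                if length < min_len or (max_len is not None and length > max_len):
                    continue
                if strand == "-":  # map back to input-strand coordinates
                    start, end = len(seq) - end, len(seq) - start
                results.append((strand, frame + 1, start, end, length))
    return results


print(find_orfs("CCATGAAATTTGGGTAACC", min_len=9))
```

    Each reported span could then be translated and submitted to a similarity search, as the text describes.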

    Spidey

    Spidey aligns a set of eukaryotic mRNA sequences against a single genomic sequence, taking into account predicted splice sites and using one of four splice-site models (Vertebrate, Drosophila, C.elegans, Plant). Spidey returns exon alignments, protein translations, and a summary showing the alignment quality and goodness of match to splice junction patterns for each putative exon.

    Splign

    Splign (24) is a utility for computing cDNA-to-genomic, or spliced, sequence alignments that is accurate in determining splice sites, is tolerant of sequencing errors and supports cross-species alignments. Splign uses a version of the Needleman–Wunsch algorithm that accounts for splice signals, in combination with BLAST, to identify possible locations of genes and their copies as well as to speed up the core dynamic programming. The web version of Splign is able to compute and display the spliced alignment of a transcript sequence to a genomic sequence of up to 50 Mb in seconds. A standalone version that operates on longer genomic sequences is found at http://www.ncbi.nlm.nih.gov/sutils/splign/splign.cgi?textpage=downloads.
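    For orientation, the plain Needleman–Wunsch global alignment on which such methods build can be sketched in Python as below. This is the textbook algorithm with a linear gap penalty, not Splign's splice-aware variant, and the scoring values are arbitrary choices for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score and one optimal alignment."""

    def pair_score(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # prefix of a aligned to gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # prefix of b aligned to gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,  # gap in b
                score[i][j - 1] + gap,  # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == (
            score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1])
        ):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ACGT", "AGT"))
```

    A splice-aware variant would additionally allow long, cheaply scored gaps in the transcript where the genomic sequence shows donor and acceptor signals; that extension is not shown here.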

    Electronic PCR (e-PCR)

    Two types of e-PCR can be performed from the e-PCR home page at www.ncbi.nlm.nih.gov/sutils/e-pcr. Forward e-PCR searches for matches to STS primer pairs in the UniSTS database of almost 470 000 markers. Reverse e-PCR is used to estimate the genomic binding site, amplicon size and specificity for sets of primer pairs by searching against the genomic and transcript databases of A.gambiae, A.thaliana, C.elegans, D.rerio, D.melanogaster, H.sapiens, M.musculus and R.norvegicus. To increase sensitivity, Forward e-PCR allows the size of the primer segment to be matched, the number of mismatches, the number of gaps and the size of the STS to be adjusted. Binaries for several computer platforms, along with the source code, are available via FTP at ftp.ncbi.nlm.nih.gov/pub/schuler/e-PCR.
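    The core of a primer-pair search can be sketched in Python as follows. This is a naive illustration rather than the e-PCR algorithm: it scans one strand only, allows mismatches but not gaps, and all names and sequences are invented for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(seq, primer, max_mismatches):
    """Return start positions where primer matches seq with few enough mismatches."""
    n = len(primer)
    return [
        i
        for i in range(len(seq) - n + 1)
        if sum(x != y for x, y in zip(seq[i:i + n], primer)) <= max_mismatches
    ]


def e_pcr(template, forward, reverse, min_size, max_size, max_mismatches=0):
    """Return (start, end, size) of predicted amplicons on the given strand."""
    template = template.upper()
    # The reverse primer anneals to the opposite strand, so its binding site
    # appears on this strand as its reverse complement.
    reverse_site = reverse_complement(reverse.upper())
    hits = []
    for f in find_sites(template, forward.upper(), max_mismatches):
        for r in find_sites(template, reverse_site, max_mismatches):
            size = r + len(reverse_site) - f
            if min_size <= size <= max_size:
                hits.append((f, r + len(reverse_site), size))
    return hits


template = "TTTT" + "ACGTACGGTC" + "A" * 20 + "GGCATTCAGG" + "CCCC"
print(e_pcr(template, "ACGTACGGTC", "CCTGAATGCC", min_size=30, max_size=50))
```

    Restricting the accepted amplicon size is what keeps chance matches of a single primer from being reported as markers.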

    Web interfaces to OrfFinder, Spidey, Splign and e-PCR are available via the ‘Tools’ link on the NCBI home page.

    RESOURCES FOR GENOME-SCALE ANALYSIS

    Entrez Genome

    Entrez Genome (25) provides access to over 250 complete microbial genomic sequences (70 added over the past year), more than 2100 viral genomic sequences (500 added) and over 800 reference sequences for eukaryotic organelles (250 added). Over 20 higher eukaryotic genomes are also included, among them a recent arrival, the sea urchin Strongylocentrotus purpuratus. The Plant Genomes Central web page serves as a portal to completed plant genomes, to information on plant genome sequencing projects or to plant-related resources at NCBI, such as plant Genomic BLAST pages or Map Viewer. Specialized viewers and BLAST pages are also available for eukaryotic organelles and viruses. Genomes are chosen from an alphabetical listing or a phylogenetic tree and can be examined at increasing levels of detail ranging from a graphical overview of an entire genome to the level of a single gene. At the level of a genome or a chromosome, a Coding Regions display gives the locations of coding regions, and the lengths, names and GenBank identification numbers of the protein products. An RNA Genes view lists the locations and names of ribosomal and transfer RNA genes. A summary of COG functional groups is also presented. At the level of a single gene, links are provided to sequence neighbors for the implied protein with links to the Clusters of Orthologous Groups (COGs) database.

    For complete microbial genomes, pre-computed BLAST neighbors for protein sequences, including their taxonomic distribution and links to 3D structures, are given in TaxTables and PDBTables, respectively. Pairwise sequence alignments are linked to the Cn3D macromolecular structure viewer (22) to generate displays of 3D structures coupled to sequence alignments. A TaxPlot tool plots similarities in the proteomes of two organisms to that of a reference organism for more than 320 prokaryotic and almost 50 eukaryotic genomes. A related tool, GenePlot, generates plots of protein similarity for a pair of complete microbial genomes for the visualization of deleted, transposed or inverted genomic segments.

    A new ‘gMap’ tool combines the results of pre-computed whole microbial genome comparisons with on-the-fly BLAST comparisons clustering genomes with similar nucleotide sequences and then represents the pre-computed segments of similarity graphically. A novel sequence may be introduced into the display of pre-computed alignments using an on-the-fly BLAST comparison. Using gMap, one can quickly navigate from a high-level overview of the similarity between a set of genomes, to a nucleotide-level view of individual segments of alignment.

    Genome project

    The new Entrez Genome Project database supplements Entrez Genome by providing an overview of the status of complete and in-progress large-scale sequencing, assembly, annotation and mapping projects. Genome Project links to project data in the other Entrez databases, such as Entrez Nucleotide and Genome, and to a variety of other NCBI and external resources. For prokaryotic organisms, Genome Project indexes a number of characteristics of interest to biologists such as organism morphology and motility; environmental requirements, such as salinity, temperature and pH range; oxygen requirements and pathogenicity. The database allows genome sequencing centers to register their project early in the sequencing process so that project data can be linked to other NCBI-hosted data at the earliest opportunity.

    COGs

    The rapid progress in sequencing has produced sequences for over 250 prokaryotic genomes comprising more than 150 species within 95 different taxonomic genera. The COGs database (15) presents a compilation of orthologous groups of proteins from 66 completely sequenced organisms. A eukaryotic version, KOGs, is available for seven organisms including H.sapiens, C.elegans, D.melanogaster and A.thaliana. Alignments of sequences from COGs have been incorporated into the Conserved Domain Database described below.

    Retroviral genotyping tools

    NCBI offers a Web-based genotyping tool that employs a blastn comparison between a retroviral sequence to be subtyped and either a default panel of reference sequences or a panel provided by the user. An HIV-1-specific subtyping tool uses a set of reference sequences taken from the principal HIV-1 variants.

    Eukaryotic genomic resources

    Map Viewer

    The NCBI Map Viewer displays genome assemblies, genetic and physical markers, and the results of annotation and other analyses using sets of aligned maps. The Map Viewer home page (http://www.ncbi.nlm.nih.gov/mapview/) provides links to both Map Viewer and Genomic BLAST pages from a taxonomically organized list of 36 organisms including H.sapiens, M.musculus and R.norvegicus. Maps available for display in the Map Viewer vary by organism but may include cytogenetic maps, physical maps, maps showing predicted gene models, EST alignments with links to UniGene clusters and mRNA alignments used to construct gene models. Maps from multiple organisms or multiple assemblies for the same organism can be displayed in a single view. The Map Viewer supports queries using various identifiers such as gene names or symbols, marker names, SNP identifiers or accession numbers. Plant genomes in the Map Viewer can be queried in tandem using a cross-species query page to generate a display of the chromosome maps from multiple species. The Map Viewer can generate a tabular display for convenient export to other programs, and segments of a genomic assembly may be downloaded using a Download/View Sequence link. Map Viewer displays link to Entrez Gene, and to tools such as the Evidence Viewer and Model Maker. Map Viewer links in the Entrez Links menu for nucleotide or protein sequences shown in the Map Viewer provide a convenient route to a Map Viewer display for a region of interest.

    Model Maker

    Model Maker (MM) is used to construct transcript models using combinations of putative exons derived from ab initio predictions or from the alignment of GenBank transcripts, including ESTs and RefSeqs, to the NCBI human genome assembly. Previously observed exon splice patterns are indicated as guides to model building. Completed models may be saved locally or analyzed with OrfFinder.

    Evidence Viewer (EV)

    The EV displays the alignments to genomic contigs of RefSeq and GenBank transcripts, and ESTs supporting gene models. Mismatches between transcript and genomic sequences are highlighted. Exon-by-exon transcript alignments, including flanking genomic sequence for each exon, are given along with protein translations. Proteins annotated on the transcript sequences are shown and mismatches between proteins annotated on the aligned transcripts are highlighted.

    Cancer chromosomes

    Three databases, the NCI/NCBI SKY (Spectral Karyotyping)/M-FISH (Multiplex-FISH) and CGH (Comparative Genomic Hybridization) Database, the NCI Mitelman Database of Chromosome Aberrations in Cancer (26) and the NCI Recurrent Chromosome Aberrations in Cancer database, comprise the new Cancer Chromosomes Entrez database found at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=CancerChromosomes.

    Three search formats are available: a conventional Entrez query, a Quick/Simple Search and an Advanced Search. The Simple Search offers a set of menus used to select a disease site or diagnosis that can be combined with specifications for a particular chromosomal location and anomaly. The Advanced Search offers a combination of forms for more complex queries. Search results may list all cases matching the query terms (a case-based report) or list each clone or cell separately (a clone/cell report). Similarity reports show terms common to a group of records within several term categories, such as diagnosis or disease site and cytogenetic abnormalities, among the selected cases or clones/cells.

    RESOURCES FOR THE ANALYSIS OF PATTERNS OF GENE EXPRESSION AND PHENOTYPES

    SAGEmap

    NCBI's SAGEmap (27) provides a two-way mapping between regular (10 base) and LongSAGE (14, 17 and 22 base) SAGE tags and UniGene clusters. The SAGEmap repository presently contains 623 SAGE experiments from 18 organisms. A new tool maps the user's SAGE library to gene identifiers on-the-fly based on the pre-computed UniGene mappings. SAGEmap can also construct a user-configurable table of data comparing one group of SAGE libraries with another. SAGEmap is updated biweekly, immediately following the update of UniGene and the data appear in the Map Viewer as the Expression track.

    Gene Expression Omnibus (GEO)

    GEO (28) is a data repository and retrieval system for any high-throughput gene expression or molecular abundance data. GEO contains data from microarray-based experiments measuring the abundance of mRNA, genomic DNA and protein molecules, as well as non-array-based technologies, such as SAGE and mass spectrometry peptide profiling. The GEO repository accepts data via the web or in batch. The repository can be browsed from the GEO home page and may be queried from both experiment-centric (Entrez GEO DataSets) and gene-centric (Entrez GEO Profiles) perspectives. At the time of writing, the repository contains high-throughput gene expression data from 50 000 hybridization experiments, about 1600 array definitions and approximately a billion individual spot measurements, derived from 100 organisms.

    GENSAT

    GENSAT is a gene expression atlas of the mouse central nervous system produced with data supplied by the National Institute of Neurological Disorders and Stroke. GENSAT catalogs images of histological sections of the mouse brain in which tags, such as Enhanced Green Fluorescent Protein, have been used to visualize the relative degree of localized expression for a wide array of genes. Images are available for the mouse brain at various developmental stages. GENSAT records link to Entrez Gene, UniGene, PubMed and PubMed Central.

    Probe

    Nucleic acid probes are molecules that complement a specific gene transcript or DNA sequence and are useful in gene silencing, genome mapping and genome variation analysis. The new Entrez Probe database serves as an archive of probe sequences along with data on their experimental utility. Probe entries indicate the intended experimental application and include the experimental results generated using the probe. Entrez Probe is linked to the scientific literature in PubMed as well as to Entrez Nucleotide with pre-computed alignments to RefSeqs providing a bridge to the genomic information in Entrez Gene.

    OMIM

    NCBI provides the online version of the OMIM catalog of human genes and genetic disorders authored and edited by Victor A. McKusick at The Johns Hopkins University (29). The database contains information on disease phenotypes and genes, including extensive descriptions, gene names, inheritance patterns, map locations, gene polymorphisms and detailed bibliographies. The OMIM Entrez database contains 16 000 entries, including data on over 10 000 established gene loci and phenotypic descriptions. These records link to many important resources, such as locus-specific databases and GeneTests.

    Online Mendelian Inheritance in Animals (OMIA)

    OMIA is a database of genes, inherited disorders and traits in animal species other than human and mouse, authored by Professor Frank Nicholas of the University of Sydney, Australia, and colleagues. The database contains textual information and references, as well as links to relevant records from OMIM, PubMed and Entrez Gene.

    THE MOLECULAR MODELING DATABASE, THE CONSERVED DOMAIN DATABASE SEARCH, CDART, PROTEIN INTERACTIONS, PUBCHEM, OMSSA

    The NCBI Molecular Modeling Database (MMDB), built by processing entries from the PDB (4), is described in (6). The structures in the MMDB are linked to sequences in Entrez and to the Conserved Domain Database (CDD). The CDD contains 11 000 PSI-BLAST-derived Position Specific Score Matrices representing domains taken from the Simple Modular Architecture Research Tool (Smart) (30), Pfam (31), and from domain alignments derived from COGs. NCBI's Conserved Domain Search (CD-Search) service can be used to search a protein sequence for conserved domains in the CDD. Wherever possible CDD hits are linked to structures which, coupled with a multiple sequence alignment of representatives of the domain hit, can be viewed with NCBI's 3D molecular structure viewer, Cn3D (22), equipped with advanced alignment-building tools that use the PSI-BLAST and threading algorithms. The Conserved Domain Architecture Retrieval Tool (CDART) allows searches of protein databases on the basis of a conserved domain and returns the domain architectures of database proteins containing the query domain. Alignment-based protein domain information from the CDD and 3D domains from the MMDB can be searched via the Entrez interface.
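
    To make the notion of a Position Specific Score Matrix concrete, the Python sketch below slides a small matrix along a protein sequence and reports the best-scoring ungapped window. It is a deliberate simplification and not the CD-Search implementation: the matrix values, the default penalty for unlisted residues and the function name are all hypothetical, and a real search also handles gaps and score statistics.

```python
def best_pssm_hit(sequence, pssm, default=-4):
    """Slide the matrix along the sequence without gaps.

    pssm is a list with one dict per column, mapping residue -> score;
    residues missing from a column receive the (assumed) default penalty.
    Returns (best_score, offset) for the first highest-scoring window.
    """
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the matrix")
    best = None
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        score = sum(col.get(res, default) for col, res in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, offset)
    return best


# Toy three-column matrix with made-up scores, for illustration only.
TOY_PSSM = [
    {"G": 5, "A": 1},
    {"K": 4, "R": 2},
    {"S": 3, "T": 3},
]
```

    Because each column carries its own scores, the same residue can be rewarded at one position of the domain and penalized at another, which is what distinguishes a position-specific matrix from a single substitution table.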

    HIV-1, human protein interaction database

    The Division of Acquired Immunodeficiency Syndrome of the National Institute of Allergy and Infectious Diseases, in collaboration with the Southern Research Institute and NCBI, maintains a comprehensive HIV Protein-Interaction Database of documented interactions between HIV-1 proteins and host cell proteins, other HIV-1 proteins, or proteins from disease organisms associated with HIV or AIDS. Summaries, including protein RefSeq accession numbers, Entrez Gene IDs, lists of interacting amino acids, brief descriptions of interactions, keywords and PubMed IDs for supporting journal articles, are presented at www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions/index.html. Interaction summaries are selected using pull-down phrase lists to apply filters, and batches of summaries may be downloaded. All protein–protein interactions documented in the HIV Protein-Interaction Database are listed in Entrez Gene reports in the HIV-1 protein interactions section.

    PubChem

    PubChem is the informatics backbone for the NIH Roadmap Initiative on molecular libraries. PubChem focuses on the chemical, structural and biological properties of small molecules, particularly their application as diagnostic and therapeutic agents. A suite of three Entrez databases, PCSubstance, PCCompound and PCBioAssay, debuted during the past year to contain the substance information, compound structures and bioactivity data of the PubChem project. The databases comprise records for 4 million compounds with 3 million unique structures. The PubChem databases link to other Entrez databases, such as PubMed and PubMed Central, and also to Entrez Structure and Protein to provide a bridge between the macromolecules of genomics and the small organic molecules of cellular metabolism.

    The open mass spectrometry search algorithm (OMSSA)

    OMSSA is an efficient search engine for identifying MS/MS peptide spectra by searching libraries of known protein sequences. OMSSA assigns significant hits an Expect-value computed in the same way as the E-value of BLAST. The web interface to OMSSA, reached via a link from the ‘Tools’ link on the NCBI home page, allows up to 2000 spectra to be analyzed in a single session using either the BLAST ‘nr’ or ‘refseq’ sequence libraries for comparison. Standalone versions of OMSSA for several popular computer platforms, which accept larger batches of spectra and allow searches of custom sequence libraries, can be downloaded at http://pubchem.ncbi.nlm.nih.gov/omssa/download.htm.

    FOR FURTHER INFORMATION

    The resources described here include documentation, other explanatory material and references to collaborators and data sources on the respective web sites. The NCBI Handbook, available in the Books database, describes the principal NCBI resources in detail. Several tutorials are also offered under the Education link from NCBI's home page. A Site Map provides a comprehensive table of NCBI resources, and the About NCBI feature provides bioinformatics primers and other supplementary information. A user support staff is available to answer questions at info@ncbi.nlm.nih.gov.

    ACKNOWLEDGEMENTS

    Funding to pay the Open Access publication charges for this article was provided by the National Institutes of Health.

    REFERENCES

    Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Wheeler, D.L. (2005) GenBank. Update. Nucleic Acids Res., 33, D34–D38.

    Schuler, G.D., Epstein, J.A., Ohkawa, H., Kans, J.A. (1996) Entrez: molecular biology database and retrieval system. Methods Enzymol., 266, 141–162.

    Wu, C.H., Yeh, L.S.L., Huang, H., Arminski, L., Castro-Alvear, J., Chen, Y., Hu, Z., Kourtesis, P., Ledley, R.S., Suzek, B.E., et al. (2003) The protein information resource. Nucleic Acids Res., 31, 345–347.

    Bourne, P.E., Addess, K.J., Bluhm, W.F., Chen, L., Deshpande, N., Feng, Z., Fleri, W., Green, R., Merino-Ott, J.C., Townsend-Merino, W., et al. (2004) The distribution and query systems of the RCSB Protein Data Bank. Nucleic Acids Res., 32, D223–D225.

    Pruitt, K., Tatusova, T., Maglott, D. (2005) Entrez Gene. Nucleic Acids Res., 33, D54–D58.

    Marchler-Bauer, A., Anderson, J., Fedorova, N., DeWeese-Scott, C., Geer, L.Y., Hurwitz, D., Jackson, J.J., Jacobs, A., Lanczycki, C., Liebert, C., et al. (2005) MMDB: Entrez's 3D-structure database. Nucleic Acids Res., 33, D192–D196.

    Sequeira, E. (2003) PubMed Central: Three Years Old and Growing Stronger. ARL, 228, 5–9.

    Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

    Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Miller, W., Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

    McGinnis, S. and Madden, T. (2004) BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res., 32, W20–W25.

    Tatusova, T.A. and Madden, T.L. (1999) BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol. Lett., 174, 247–250.

    Schaffer, A.A., Aravind, L., Madden, T.L., Shavirin, S., Spouge, J.L., Wolf, Y.I., Koonin, E.V., Altschul, S.F. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res., 29, 2994–3005.

    Zhang, Z., Schwartz, S., Wagner, L., Miller, W. (2000) A greedy algorithm for aligning DNA sequences. J. Comput. Biol., 7, 203–214.

    Ma, B., Tromp, J., Li, M. (2002) PatternHunter: faster and more sensitive homology search. Bioinformatics, 18, 440–445.

    Tatusov, R.L., Fedorova, N.D., Jackson, J.D., Jacobs, A.R., Kiryutin, B., Koonin, E.V., Krylov, D.M., Mazumder, R., Mekhedov, S.L., Nikolskaya, A.N., et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4, 41.

    Schuler, G.D. (1997) Pieces of the puzzle: expressed sequence tags and the catalog of human genes. J. Mol. Med., 75, 694–698.

    Ermolaeva, O., Rastogi, M., Pruitt, K.D., Schuler, G.D., Bittner, M.L., Chen, Y., Simon, R., Meltzer, P., Trent, J.M., Boguski, M.S. (1998) Data management and analysis for gene expression arrays. Nature Genet., 20, 19–23.

    Blake, J.A., Richardson, J.E., Bult, C.J., Kadin, J.A., Eppig, J.T., the members of the Mouse Genome Database Group (2003) MGD: The Mouse Genome Database. Nucleic Acids Res., 31, 193–195.

    Sprague, J., Doerry, E., Douglas, S., Westerfield, M. (2001) The Zebrafish Information Network (ZFIN): a resource for genetic, genomic and developmental research. Nucleic Acids Res., 29, 87–90.

    Balakrishnan, R., Christie, K.R., Costanzo, M.C., Dolinski, K., Dwight, S.S., Engel, S.R., Fisk, D.G., Hirschman, J.E., Hong, E.L., Nash, R., Oughtred, R., Skrzypek, M., Theesfeld, C.L., Binkley, G., Dong, Q., Lane, C., Sethuraman, A., Weng, S., Botstein, D., Cherry, J.M. (2005) Fungal BLAST and Model Organism BLASTP Best Hits: new comparison resources at the Saccharomyces Genome Database (SGD). Nucleic Acids Res., 33, D374–D377.

    Sherry, S.T., Ward, M.H., Kholodov, M., Baker, J., Pham, L., Smigielski, E., Sirotkin, K. (2001) dbSNP: The NCBI database of genetic variation. Nucleic Acids Res., 29, 308–311.

    Wang, Y., Geer, L.Y., Chappey, C., Kans, J.A., Bryant, S.H. (2000) Cn3D: sequence and structure views for Entrez. Trends Biochem. Sci., 25, 300–302.

    Pruitt, K., Tatusova, T., Maglott, D. (2005) NCBI Reference Sequence Project. Nucleic Acids Res., 33, D501–D504.

    Kapustin, Y., Souvorov, A., Tatusova, T. (2004) Splign: a Hybrid Approach To Spliced Alignments. Proceedings of RECOMB 2004: Research in Computational Molecular Biology, pp. 741.

    Tatusova, T., Karsch-Mizrachi, I., Ostell, J. (1999) Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics, 15, 536–543.

    Mitelman, F., Mertens, F., Johansson, B. (1997) A breakpoint map of recurrent chromosomal rearrangements in human neoplasia. Nature Genet., 15, 417–474.

    Lash, A.E., Tolstoshev, C.M., Wagner, L., Schuler, G.D., Strausberg, R.L., Riggins, G.J., Altschul, S.F. (2000) SAGEmap: a public gene expression resource. Genome Res., 10, 1051–1060.

    Barrett, T., Suzek, T., Troup, D., Wilhite, S., Ngau, W., Ledoux, P., Rudnev, D., Lash, A., Fujibuchi, W., Edgar, R. (2005) NCBI GEO: Mining millions of expression profiles - database and tools. Nucleic Acids Res., 33, D562–D566.

    McKusick, V.A. (1998) Mendelian Inheritance in Man. Catalogs of Human Genes and Genetic Disorders, 12th edn. The Johns Hopkins University Press, Baltimore, MD.

    Letunic, I., Copley, R.R., Schmidt, S., Ciccarelli, F.D., Doerks, T., Schultz, J., Ponting, C.P., Bork, P. (2004) SMART 4.0: towards genomic data integration. Nucleic Acids Res., 32, D142–D144.

    Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L., et al. (2004) The Pfam protein families database. Nucleic Acids Res., 32, D138–D141.

    (David L. Wheeler*, Tanya Barrett, Dennis)