Each database is different; however, a nucleotide sequence entry is expected to contain at least the following:
- id and/or accession number
- taxonomic data
- references
- annotation/curation
- keywords
- cross references
- sequences
- documentation
Annotation refers to adding extra information regarding a certain record in a database.
Curation refers to evaluating what goes into the database and what is not fit to be included.
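The fields listed above, together with the two definitions, can be sketched as a small data structure. This is only an illustrative sketch: the class name `SequenceRecord`, its field names, and the helpers `annotate` and `passes_curation` are hypothetical, and the curation rule (a non-empty sequence using only A, C, G, T, or N) is an assumed toy criterion, not the policy of any real database.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SequenceRecord:
    """Hypothetical minimal nucleotide record; field names are illustrative."""
    accession: str                     # id and/or accession number
    organism: str                      # taxonomic data
    sequence: str                      # the nucleotide sequence itself
    references: List[str] = field(default_factory=list)
    annotations: Dict[str, str] = field(default_factory=dict)
    keywords: List[str] = field(default_factory=list)
    cross_references: Dict[str, str] = field(default_factory=dict)
    documentation: str = ""


def annotate(record: SequenceRecord, key: str, value: str) -> None:
    """Annotation: attach extra information to an existing record."""
    record.annotations[key] = value


def passes_curation(record: SequenceRecord) -> bool:
    """Curation: decide whether a record is fit to enter the database.

    Assumed toy rule: the sequence is non-empty and uses only A/C/G/T/N.
    """
    seq = record.sequence.upper()
    return len(seq) > 0 and set(seq) <= set("ACGTN")
```

Annotation changes a record that is already accepted, whereas curation is a gate applied before (or while) a record is in the database, which is why the two are kept as separate functions here.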
First Generation Nucleotide Sequence Databases
The first generation nucleotide sequence databases are essentially sequence archives. The data is present in the database as it was determined and interpreted by its publisher. The original author retains full control of the information they submitted. As one can imagine, this results in a multitude of problems, such as:
- data of varying quality and lengths
- highly redundant data
- errors in sequence, annotations, etc.
- lack of consistency
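To make the redundancy problem concrete, the sketch below groups archive entries whose sequences are identical once case and whitespace are ignored. The function name `find_redundant` and the `(accession, sequence)` input format are assumptions made for illustration. It catches only exact duplicates; overlapping or near-identical submissions, which also count as redundancy, would need sequence alignment instead.

```python
from collections import defaultdict


def find_redundant(entries):
    """Group accessions whose sequences are identical after normalization.

    `entries` is an iterable of (accession, sequence) pairs.
    Returns only the groups that contain more than one accession.
    """
    groups = defaultdict(list)
    for accession, sequence in entries:
        # Normalize: drop all whitespace and ignore case.
        normalized = "".join(sequence.split()).upper()
        groups[normalized].append(accession)
    return [accessions for accessions in groups.values() if len(accessions) > 1]
```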
Second Generation Nucleotide Sequence Databases
The second generation nucleotide sequence databases were built with an eye on lessons learned from the first generation nucleotide sequence databases. The goal is to have one sequence entry for every naturally occurring molecule. In RefSeq, a second generation database, chromosome, gene, mRNA, and protein data are curated. Other data, such as contigs, model mRNA, and model protein, are calculated. A gene can result in multiple products. In such a case, separate RefSeq ids are used for each product and all are linked by a Locus Id. Second generation nucleotide sequence databases are essentially gene-centric databases.
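The split between curated and calculated data is visible in RefSeq accession prefixes. The sketch below maps a few prefixes to a molecule type and to the curated/calculated labels used in the paragraph above. The prefix table is not part of the original text; it follows the published RefSeq accession conventions and is deliberately incomplete. The names `REFSEQ_PREFIXES` and `classify_refseq` are hypothetical.

```python
# Partial prefix table (assumption: taken from RefSeq accession conventions,
# not from the text above; many prefixes are omitted).
REFSEQ_PREFIXES = {
    "NC_": ("chromosome", "curated"),
    "NG_": ("gene region", "curated"),
    "NM_": ("mRNA", "curated"),
    "NP_": ("protein", "curated"),
    "NT_": ("contig", "calculated"),
    "NW_": ("contig", "calculated"),
    "XM_": ("model mRNA", "calculated"),
    "XP_": ("model protein", "calculated"),
}


def classify_refseq(accession):
    """Return (molecule_type, origin) for a RefSeq-style accession.

    Returns None when the prefix is not in the partial table above.
    """
    return REFSEQ_PREFIXES.get(accession[:3].upper())
```

Because only the three-character prefix is inspected, the function says nothing about whether the rest of the accession actually exists in RefSeq.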
Gene-Centric Databases
In a gene-centric database, all information relevant to a given gene is made accessible at once. Entrez Gene and RefSeq are the most commonly used. Entrez Gene is tightly linked to RefSeq. RefSeq, the Reference Sequence collection, aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript RNA, and protein products.
Gene-centric databases contain gene-specific information and focus on genomes that have been completely sequenced, that have an active research community to contribute gene-specific information, or that are scheduled for intense analysis. The content of Entrez Gene represents the result of curation and automated integration of data from NCBI's RefSeq and other collaborating databases.
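The gene-centric idea, one gene identifier giving access to every related sequence at once, can be sketched as an index keyed by gene. This is a minimal illustration, not how Entrez Gene is implemented; the function name `build_gene_index` and the `(gene_id, accession, molecule_type)` input tuples are assumptions.

```python
from collections import defaultdict


def build_gene_index(records):
    """Build {gene_id: {molecule_type: [accession, ...]}}.

    `records` is an iterable of (gene_id, accession, molecule_type) tuples.
    """
    index = defaultdict(lambda: defaultdict(list))
    for gene_id, accession, molecule_type in records:
        index[gene_id][molecule_type].append(accession)
    # Convert to plain dicts so missing keys raise instead of being created.
    return {gene: dict(by_type) for gene, by_type in index.items()}
```

A gene with several products simply ends up with several accessions under the same key, which mirrors how separate product ids are tied together by one gene-level identifier.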
Genome-Centric Databases
Genome-centric databases contain information about gene sequences, their relative positions, strand orientation, biochemical functions, etc. Ensembl and TIGR are information management systems that can connect specialized sequence collection and browsing tools.
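In a genome-centric view, a gene is described by where it sits on the genome rather than by a standalone sequence. The sketch below stores a position and a strand and recovers the gene sequence from the genome, taking the reverse complement for genes on the reverse strand. The class `GeneFeature`, the function `extract`, and the choice of 1-based inclusive coordinates are assumptions for illustration and do not reflect the Ensembl or TIGR data models.

```python
from dataclasses import dataclass

# Translation table used to complement a nucleotide string.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


@dataclass(frozen=True)
class GeneFeature:
    """Hypothetical gene annotation placed on a genome sequence."""
    name: str
    start: int          # 1-based, inclusive
    end: int            # 1-based, inclusive
    strand: int         # +1 for forward strand, -1 for reverse strand
    function: str = ""  # free-text biochemical function


def extract(genome, feature):
    """Return the feature's sequence, read in its own 5'-to-3' direction."""
    segment = genome[feature.start - 1:feature.end]
    if feature.strand == -1:
        # Reverse strand: complement each base, then reverse the order.
        segment = segment.translate(_COMPLEMENT)[::-1]
    return segment
```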