GenBank is a comprehensive public database of nucleotide sequences built and distributed by the NCBI. It is built primarily from sequence data submitted by individual authors and from bulk submissions of EST, GSS and other high-throughput data from sequencing centers.
EST: Expressed Sequence Tags are produced by single-pass sequencing of a cloned cDNA.
GSS: Genome Survey Sequences are similar to ESTs, except that most of the sequences are genomic in origin.
GenBank doubles in size approximately every 18 months. WGS and environmental sequences now make up a significant share of the database.
WGS: Whole Genome Shotgun sequences are the contigs of a sequencing project. WGS data can contain annotation and should be updated as sequencing progresses.
Contig: A contig is a contiguous DNA sequence assembled from overlapping DNA fragments of 100-300 base pairs.
Environmental Sequences: These are all the DNA sequences present in an environmental sample. The sample often contains many different organisms, which are very often unknown or unidentified.
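To make the 18-month doubling time concrete, the following minimal sketch projects relative growth under the assumption of steady exponential growth. The function name `projected_size` and the starting size of 1.0 are illustrative choices, not real GenBank figures.

```python
def projected_size(current_size: float, months: float,
                   doubling_months: float = 18.0) -> float:
    """Return the size after `months`, assuming the size doubles
    every `doubling_months` (steady exponential growth)."""
    return current_size * 2 ** (months / doubling_months)


# Relative growth factors from a starting size of 1.0 (not real data).
for years in (1.5, 3, 6):
    factor = projected_size(1.0, years * 12)
    print(f"after {years} years: x{factor:.1f}")
```

Under this assumption the database grows fourfold in three years and sixteenfold in six.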
Each GenBank entry includes:
- a concise description of the sequence
- the scientific name and taxonomy of the source organism
- bibliographic references
- a listing of areas of biological significance, such as coding regions and their protein translations, transcription units, repeat regions, and sites of mutations or modifications
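The parts of an entry listed above map onto sections of the GenBank flat-file format (ACCESSION, SOURCE/ORGANISM, REFERENCE, FEATURES, ORIGIN). The sketch below pulls those parts out of a single record. The record itself, including the accession ZZ999999, is made up for illustration, and the parser is deliberately simplified: it ignores multi-line values and most keywords, so a dedicated library is the better choice for real data.

```python
# A made-up record in GenBank flat-file layout (not a real entry).
SAMPLE = """\
LOCUS       EXAMPLE01                 60 bp    DNA     linear   BCT 01-JAN-2000
DEFINITION  Hypothetical example record for illustration.
ACCESSION   ZZ999999
SOURCE      Escherichia coli
  ORGANISM  Escherichia coli
            Bacteria; Proteobacteria; Gammaproteobacteria.
REFERENCE   1  (bases 1 to 60)
  AUTHORS   Doe,J.
  TITLE     A made-up reference
FEATURES             Location/Qualifiers
     source          1..60
                     /organism="Escherichia coli"
     CDS             1..60
                     /product="hypothetical protein"
ORIGIN
        1 atggctagct agctagctag ctagctagct agctagctag ctagctagct agctagctaa
//
"""


def parse_record(text: str) -> dict:
    """Extract accession, organism, reference count, feature keys and
    sequence from one flat-file record (simplified sketch)."""
    record = {"accession": None, "organism": None, "references": 0,
              "features": [], "sequence": ""}
    section = None
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if line and not line[0].isspace():  # top-level keyword at column 1
            section = line.split()[0]
            if section == "ACCESSION":
                record["accession"] = line.split()[1]
            elif section == "REFERENCE":
                record["references"] += 1
            continue
        if section == "SOURCE" and line.strip().startswith("ORGANISM"):
            record["organism"] = line.strip()[len("ORGANISM"):].strip()
        elif section == "FEATURES" and len(line) > 5 and line[5] != " ":
            # Feature keys start at column 6; qualifiers are indented further.
            key, location = line.split()[:2]
            record["features"].append((key, location))
        elif section == "ORIGIN":
            # Drop the leading position number and the spacing between blocks.
            record["sequence"] += "".join(line.split()[1:])
    return record


record = parse_record(SAMPLE)
print(record["accession"], record["organism"], len(record["sequence"]))
print(record["features"])
```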
GenBank partitions sequences into divisions that roughly correspond to:
- taxonomic groups such as bacteria (BCT), viruses (VRL), and rodents (ROD)
- sequencing strategies such as EST, GSS, HTG, HTC and environmental sample (ENV) sequences
HTC: High-throughput cDNA sequences.
HTG: High-throughput genomic sequences, that is, single-pass, unfinished genomic sequences.
EST and HTC are RNA or cDNA. GSS, HTG, WGS, and ENV are DNA.
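The division codes and molecule types described above can be collected in a small lookup table. This is a sketch covering only the codes named in the text, not the full list of GenBank divisions; the names `TAXONOMIC`, `STRATEGY`, `MOLECULE` and `describe` are illustrative.

```python
# Division codes named in the text, grouped the way the text groups them.
TAXONOMIC = {"BCT": "bacteria", "VRL": "viruses", "ROD": "rodents"}
STRATEGY = {
    "EST": "expressed sequence tags",
    "GSS": "genome survey sequences",
    "HTG": "high-throughput genomic sequences",
    "HTC": "high-throughput cDNA",
    "ENV": "environmental samples",
}
# Molecule type per data class, as stated in the text (WGS included).
MOLECULE = {
    "EST": "RNA or cDNA", "HTC": "RNA or cDNA",
    "GSS": "DNA", "HTG": "DNA", "WGS": "DNA", "ENV": "DNA",
}


def describe(code: str) -> str:
    """Return a one-line description of a division code."""
    code = code.upper()
    if code in TAXONOMIC:
        return f"{code}: taxonomic division ({TAXONOMIC[code]})"
    if code in STRATEGY:
        return (f"{code}: sequencing-strategy division "
                f"({STRATEGY[code]}), {MOLECULE[code]}")
    raise KeyError(f"division code not covered by this sketch: {code}")


for code in ("BCT", "EST", "HTG"):
    print(describe(code))
```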
The data in GenBank, and the collaborating databases EMBL and DDBJ, are submitted primarily by individual authors to one of the three databases, or by sequencing centers as batches of EST, STS, GSS, HTC, WGS or HTG sequences. Data are exchanged daily with DDBJ and EMBL so that the daily updates from NCBI servers incorporate the most recently available sequence data from all sources. Virtually all records enter GenBank as direct electronic submissions.
EMBL, GenBank, DDBJ and Swiss-Prot all use both identifiers and accession numbers to identify each entry. To complicate matters, identifiers and accession numbers mean different things in different databases. In Swiss-Prot, identifiers are alphanumeric terms that are meaningful to a human reader. For example, HBA_HUMAN refers to the human haemoglobin alpha chain. Identifiers can change, but they rarely do. The accession number of HBA_HUMAN is P69905. Accession numbers are primary keys, so they never change. If two entries are merged, the new entry carries both accession numbers: one becomes the primary key and the other the secondary key. When an entry is split, a new accession number is assigned to each resulting entry and the old accession number is recorded as a secondary key.
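The merge and split rules for accession numbers can be modelled with a small data structure. This is an illustrative sketch of the bookkeeping described above, not the way any of these databases is actually implemented. Apart from HBA_HUMAN and P69905, every identifier and accession in it is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Entry:
    identifier: str                 # human-readable, may change
    primary: str                    # primary accession number
    secondary: List[str] = field(default_factory=list)

    def accessions(self) -> List[str]:
        """All accession numbers under which this entry can be found."""
        return [self.primary] + self.secondary


def merge(kept: Entry, absorbed: Entry) -> Entry:
    """Merged entry keeps one primary key; the other becomes secondary."""
    secondary = kept.secondary + [absorbed.primary] + absorbed.secondary
    return Entry(kept.identifier, kept.primary, secondary)


def split(entry: Entry, parts: List[Tuple[str, str]]) -> List[Entry]:
    """Each part gets a new primary accession; old ones become secondary."""
    old = entry.accessions()
    return [Entry(identifier, accession, list(old))
            for identifier, accession in parts]


hba = Entry("HBA_HUMAN", "P69905")
other = Entry("HYPO_A", "X00001")            # hypothetical entry
merged = merge(hba, other)
print(merged.accessions())

pieces = split(Entry("HYPO_B", "X00002"),    # hypothetical entry
               [("HYPO_B1", "X00003"), ("HYPO_B2", "X00004")])
for piece in pieces:
    print(piece.identifier, piece.accessions())
```

Because old accession numbers survive as secondary keys, a lookup by any accession ever issued still reaches a current entry.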
GenBank data can be retrieved through Entrez. Entrez covers over 30 biological databases containing DNA and protein sequence data, genome mapping data, population sets, phylogenetic sets, environmental sample sets, gene expression data, the NCBI taxonomy, protein domain information, protein structures from the Molecular Modeling Database (MMDB), and MEDLINE references via PubMed. Entrez is often the better starting point because it links each sequence to much more information than GenBank alone holds.
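Entrez can also be queried from a program. The sketch below assumes NCBI's E-utilities `efetch` endpoint, which is not described in this summary, and builds a request for one nucleotide record in GenBank flat-file format. The helper names `efetch_url` and `fetch_genbank` are illustrative, the accession is a placeholder, and NCBI's current usage guidelines should be checked before sending requests.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: NCBI E-utilities efetch.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str, db: str = "nuccore",
               rettype: str = "gb", retmode: str = "text") -> str:
    """Build an efetch URL for one record in GenBank flat-file format."""
    query = urlencode({"db": db, "id": accession,
                       "rettype": rettype, "retmode": retmode})
    return f"{EFETCH}?{query}"


def fetch_genbank(accession: str) -> str:
    """Download the record as text (performs a network request)."""
    with urlopen(efetch_url(accession), timeout=30) as response:
        return response.read().decode("utf-8")


# Only the URL is printed here; no request is sent.
print(efetch_url("ZZ999999"))  # placeholder accession, not a real record
```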
Biological databases often come with useful tools. BLAST is a very powerful tool that allows sequence-similarity comparisons.
The GenBank database can be downloaded by FTP from ftp.ncbi.nih.gov.
This page is a brief summary of descriptions of Swiss-Prot, GenBank, and EMBL available on their websites.