In simple terms, a database is an electronic filing system. It allows a user to quickly store, search, retrieve, exchange and remove data. An application that manages a database (DB) is called a DBMS (Database Management System). The big biological databases can be queried through the Internet.
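To make the filing-system idea concrete, here is a minimal sketch of the store, search, retrieve and remove operations using SQLite, a small DBMS bundled with Python. The table name, column names and sample records are hypothetical placeholders, not data from any real biological database.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory database with one hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequences ("
        "accession TEXT PRIMARY KEY, organism TEXT, residues TEXT)"
    )
    return conn


def store(conn, accession, organism, residues):
    """Store one record."""
    conn.execute(
        "INSERT INTO sequences VALUES (?, ?, ?)",
        (accession, organism, residues),
    )


def search(conn, organism):
    """Search: return the accessions of all records for one organism."""
    rows = conn.execute(
        "SELECT accession FROM sequences WHERE organism = ? ORDER BY accession",
        (organism,),
    )
    return [row[0] for row in rows]


def retrieve(conn, accession):
    """Retrieve the residues of one record, or None if it is absent."""
    row = conn.execute(
        "SELECT residues FROM sequences WHERE accession = ?",
        (accession,),
    ).fetchone()
    return row[0] if row else None


def remove(conn, accession):
    """Remove one record and return how many rows were deleted."""
    cursor = conn.execute(
        "DELETE FROM sequences WHERE accession = ?", (accession,)
    )
    return cursor.rowcount
```

The large biological databases expose the same kinds of operations through their websites rather than through direct SQL access.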
Biological data is very diverse and is growing at an exponential rate. Therefore, no single database can handle all the data and serve the diverse needs of the scientific community. As a result, many different databases exist, each with different capabilities and often redundant data. Right now, there is a large effort underway by different groups around the world to link and interface all the important databases and the data contained within them.
We do not run or maintain any bioinformatics database. We simply lack the expertise and the funds. Here you will find links to, and brief descriptions of, the various important databases. Our list is not exhaustive, nor is it meant to be. Our goal is to list the best and most respected databases while offering links to pages or websites that provide a comprehensive list.
All biological databases listed on this website come with a set of tools to help their users retrieve, submit, and analyze the data contained within. Tools evolve over time: new tools are introduced and obsolete ones are removed. These tools often have to be learned, and the database websites usually offer help or tutorials to assist their users.
A meta-database is a database that is either linked to, or collects information from, various other databases. A meta-database allows users to access information related to a specific topic from several databases on one page.
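As a rough illustration of collecting information on one topic from several databases, the sketch below merges records from three stand-in sources into a single report. The source names, gene names and record contents are all hypothetical; a real meta-database would query live services instead of in-memory dictionaries.

```python
# Hypothetical stand-ins for independent source databases. Each maps a
# topic (here a made-up gene name) to whatever that source knows about it.
SEQUENCE_DB = {"GENE1": {"accession": "DEMO_SEQ_1"}}
STRUCTURE_DB = {"GENE1": {"structure_id": "DEMO_STRUCT_1"}}
LITERATURE_DB = {"GENE1": {"articles": 3}, "GENE2": {"articles": 5}}

SOURCES = {
    "sequence": SEQUENCE_DB,
    "structure": STRUCTURE_DB,
    "literature": LITERATURE_DB,
}


def meta_lookup(topic, sources=SOURCES):
    """Gather every source's record for one topic into a single report.

    Sources that hold nothing on the topic are left out of the report.
    """
    report = {}
    for source_name, database in sources.items():
        if topic in database:
            report[source_name] = database[topic]
    return report
```

Calling `meta_lookup("GENE1")` returns one dictionary with a section per source, which is the "several databases on one page" view in miniature.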
The MetaDB meta-database is a sorted, searchable collection of biological databases, with links to over 1200 of them. Most entries include a relevant peer-reviewed abstract or excerpt along with a link to the abstract or full-text article. Database descriptions surrounded by quotation marks were borrowed from the database websites.
Entrez is the integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others.
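Entrez can also be queried programmatically through NCBI's E-utilities, which are not described on this page. The sketch below only builds an ESearch request URL and does not contact NCBI; the helper name `esearch_url` and the example search term are our own, and the function assumes the `db`, `term` and `retmax` parameters are all you need.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for a text query against one Entrez database.

    db     -- Entrez database name, e.g. "pubmed" or "protein"
    term   -- free-text search term
    retmax -- maximum number of record identifiers to return
    """
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


if __name__ == "__main__":
    # Print the URL; fetching it is left to the reader.
    print(esearch_url("pubmed", "protein folding", retmax=5))
```

Before sending real requests, check NCBI's usage guidelines, which cover request rates and API keys.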
euGenes provides a common summary of gene and genomic information from eukaryotic organism databases.
The GeneCards project aims to integrate the fragments of information scattered across a variety of specialized databases into a coherent picture.
SOURCE is a unification tool which dynamically collects and compiles data from many scientific databases, and thereby attempts to encapsulate the genetics and molecular biology of genes from the genomes of Homo sapiens, Mus musculus, and Rattus norvegicus into easy-to-navigate GeneReports. The mission of SOURCE is to provide a unique scientific resource that pools publicly available data commonly sought after for any clone, GenBank accession number, or gene. SOURCE is specifically designed to facilitate the analysis of large sets of data that biologists can now produce using genome-scale experimental approaches.
A picture is worth a thousand words, and the following screenshot of the website is self-explanatory.
EMBL Nucleotide Sequence Database
The EMBL Nucleotide Sequence Database constitutes Europe's primary nucleotide sequence resource. Main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications.
The database is produced in an international collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ). Each of the three groups collects a portion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis.
NCBI - National Center for Biotechnology Information
Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information - all for the better understanding of molecular processes affecting human health and disease.
DDBJ - DNA Data Bank of Japan
DDBJ (DNA Data Bank of Japan) began DNA data bank activities in earnest in 1986 at the National Institute of Genetics (NIG). DDBJ has been functioning as an international nucleotide sequence database in collaboration with EBI/EMBL and NCBI/GenBank. DNA sequences record organismal evolution more directly than other biological materials and are thus invaluable not only for research in the life sciences, but also for human welfare in general. The databases are, so to speak, a common treasure of humankind.
Each UniGene entry is a set of transcript sequences that appear to come from the same transcription locus (gene or expressed pseudogene), together with information on protein similarities, gene expression, cDNA clone reagents, and genomic location.
Ensembl is a joint project between EMBL-EBI and the Sanger Institute to develop a software system that produces and maintains automatic annotation of selected eukaryotic genomes.
The Institute for Genomic Research (TIGR) is a not-for-profit center dedicated to deciphering and analyzing genomes – the complex molecular chains that constitute each organism’s unique genetic heritage.
PMD - Protein Mutant Database
Compilations of protein mutant data are valuable as a basis for protein engineering. They provide information on what kinds of functional and/or structural influences are brought about by an amino acid mutation at a specific position of a protein. The Protein Mutant Database (PMD) that we are constructing covers natural as well as artificial mutants, including random and site-directed ones, for all proteins except members of the globin and immunoglobulin families. The PMD is based on literature, not on proteins. That is, each entry in the database corresponds to one article, which may describe one, several, or many protein mutants.
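The literature-based layout can be pictured with a small data model: one entry per article, each holding any number of mutants. The sketch below is only an illustration of that idea. The class names, field names and sample values are hypothetical and do not reflect the actual PMD record format.

```python
from dataclasses import dataclass, field


@dataclass
class Mutant:
    protein: str       # name of the mutated protein
    position: int      # residue position of the substitution
    wild_type: str     # original amino acid (one-letter code)
    substituted: str   # replacement amino acid (one-letter code)
    kind: str          # "natural", "random" or "site-directed"


@dataclass
class ArticleEntry:
    citation: str                                 # the article this entry is built on
    mutants: list = field(default_factory=list)   # all mutants it describes


def mutants_by_protein(entries):
    """Index article-based entries by protein name.

    Because entries are keyed by article, answering "what is known about
    protein X?" means walking every entry and regrouping its mutants.
    """
    index = {}
    for entry in entries:
        for mutant in entry.mutants:
            index.setdefault(mutant.protein, []).append((entry.citation, mutant))
    return index
```

The regrouping step shows the trade-off of a literature-centred design: provenance is kept with every mutant, while protein-centred questions need an extra index.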
Structural and Functional Annotation of Protein Families
The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence. Proteins are classified by expert biologists into families and subfamilies of shared function, which are then categorized by molecular function and biological process ontology terms. For an increasing number of proteins, detailed biochemical interactions in canonical pathways are captured and can be viewed interactively.
DIP - Database of Interacting Proteins
The DIP database catalogs experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent set of protein-protein interactions. The data stored within the DIP database were curated both manually, by expert curators, and automatically, using computational approaches that utilize the knowledge about protein-protein interaction networks extracted from the most reliable, core subset of the DIP data. Please check the reference page to find articles describing the DIP database in greater detail.
HPRD - Human Protein Reference Database
The Human Protein Reference Database represents a centralized platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks and disease association for each protein in the human proteome. All the information in HPRD has been manually extracted from the literature by expert biologists who read, interpret and analyze the published data. HPRD has been created using an object-oriented database in Zope, an open-source web application server that provides versatility in query functions and allows data to be displayed dynamically.
For a more comprehensive list, please refer to ExPASy.
UniProt (Universal Protein Resource) is the world's most comprehensive catalog of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.
UniProt comprises three components, each optimized for different uses. The UniProt Knowledgebase (UniProtKB) is the central access point for extensive curated protein information, including function, classification, and cross-references. The UniProt Reference Clusters (UniRef) databases combine closely related sequences into a single record to speed up searches. The UniProt Archive (UniParc) is a comprehensive repository, reflecting the history of all protein sequences.
Swiss-Prot and TrEMBL
UniProtKB/Swiss-Prot: a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases.
UniProtKB/TrEMBL: a computer-annotated supplement of Swiss-Prot that contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot.
Protein Data Bank
The most authoritative resource for experimentally determined protein structure information.
BMRB - Biological Magnetic Resonance Data Bank
A repository for NMR spectroscopy data on proteins, peptides, and nucleic acids.
The SWISS-MODEL Repository is a database of annotated three-dimensional comparative protein structure models generated by the fully automated homology-modelling pipeline SWISS-MODEL. The repository is developed at the Biozentrum Basel within the Swiss Institute of Bioinformatics.
CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels: Class (C), Architecture (A), Topology (T) and Homologous superfamily (H).
Class, derived from secondary structure content, is assigned for more than 90% of protein structures automatically. Architecture, which describes the gross orientation of secondary structures, independent of connectivities, is currently assigned manually. The topology level clusters structures into fold groups according to their topological connections and numbers of secondary structures. The homologous superfamilies cluster proteins with highly similar structures and functions. The assignments of structures to fold groups and homologous superfamilies are made by sequence and structure comparisons.
The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures. These include computational techniques, empirical and statistical evidence, literature review and expert analysis.
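The four-level hierarchy lends itself to a short sketch. The code below assumes a domain's classification is written as four dot-separated numbers, one per level in C.A.T.H order, and reports the deepest level at which two domains still agree. The function names are our own, and the example codes are used purely as input strings.

```python
from typing import NamedTuple, Optional


class CathCode(NamedTuple):
    class_: int                  # C: secondary structure content
    architecture: int            # A: gross orientation of secondary structures
    topology: int                # T: fold group
    homologous_superfamily: int  # H: homologous superfamily


LEVEL_NAMES = ("Class", "Architecture", "Topology", "Homologous superfamily")


def parse_cath_code(code: str) -> CathCode:
    """Split a dotted code such as '3.40.50.300' into its four levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(part) for part in parts))


def deepest_shared_level(a: CathCode, b: CathCode) -> Optional[str]:
    """Return the name of the deepest level two codes share, or None.

    Levels are compared from Class downward and the comparison stops at
    the first mismatch, because a lower level only has meaning inside
    its parent level.
    """
    deepest = None
    for name, left, right in zip(LEVEL_NAMES, a, b):
        if left != right:
            break
        deepest = name
    return deepest
```

For example, `deepest_shared_level(parse_cath_code("3.40.50.300"), parse_cath_code("3.40.50.720"))` gives `"Topology"`: the two domains fall in the same fold group but in different homologous superfamilies.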
Nearly all proteins have structural similarities with other proteins and, in some of these cases, share a common evolutionary origin. The SCOP database, created by manual inspection and abetted by a battery of automated methods, aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known. As such, it provides a broad survey of all known protein folds, detailed information about the close relatives of any particular protein, and a framework for future research and classification.
For a more comprehensive list, please refer to ExPASy.