Biology is now a data-intensive science and, fortunately, most of the data is freely available over the Internet. Before beginning, one needs to know what kind of data is available, where, in what format, and how it can be accessed. Most databases provide useful and powerful tools to help their users access, manipulate, and analyze the data. Knowing and using these tools helps the user avoid a lot of unnecessary work.
Several research centers are dedicated to bioinformatics research. The following are the most significant.
The invention of various techniques and instruments for analyzing living beings at the molecular level has led to an explosion of data generated by the scientific community. This data cannot be stored on paper. It must be stored, organized, and indexed in an electronic database. In addition, we need tools to view, verify, and analyze this data and to interface it with other databases.
An electronic biological database is a large, organized body of persistent data that can be queried to add, update, extract, and remove data. Biological databases have to respond to the needs of their various users. The same piece of biological data often means very different things to different researchers. For example, a physicist, a biochemist, and a biologist sitting in the same room would be interested in different aspects of the same protein. They might even use different taxonomies to refer to the same protein. Even two biologists would be interested in looking at the protein from different perspectives.
Biological data is often highly interconnected, and these connections are essential for comprehension and discovery. A nucleotide sequence is linked to the protein it codes for. Nucleotide sequences are grouped into genes. A gene may code for one protein, several proteins, or none at all. A protein might have different names in different species. A protein belongs to a protein family, and it must be linked to its evolutionary progeny. We would also like to have links to scientific publications related to our protein, and to find out the methods and instruments used for its discovery, and even the parameters of the instrument used. Researchers frequently repeat experiments conducted by others to verify and improve their processes.
Back in the 70s, researchers referred to the "Atlas of Protein Sequences and Structures" by Margaret Dayhoff to find information on their protein of interest. Since then, biological data has exploded to the point that we can no longer imagine publishing all of it on paper. One of the earliest electronic databases was PIR (http://pir.georgetown.edu), which was essentially run by a group of researchers. This was a significant improvement since it made adding, updating, deleting and, most importantly, searching the data much more efficient. Today PIR is no longer updated. It is still online, but it only serves as an archive. It could not cope with the growing demands, while databases such as SwissProt were built to cope with them.
Today, biology is a data-rich science where each experiment generates enormous amounts of data. We can no longer analyze all this data by eye. We need powerful data analysis tools to help us interpret and understand the significance of this data. Biological databases offer data storage facilities and various tools that help users understand and analyze the data.
Each database is different; however, a nucleotide sequence entry is expected to contain at least the following:
Annotation refers to adding extra information regarding a certain record in a database.
Curation refers to evaluating what goes in the database and what is not fit to go into the database.
The first generation nucleotide sequence databases are essentially sequence archives. The data is present in the database as it was determined and interpreted by its publisher. The original author retains full control of the information they submitted. As one can imagine, this results in a multitude of problems such as:
The second generation nucleotide sequence databases were built with an eye on lessons learned from the first generation. The goal is to have one sequence entry for every naturally occurring molecule. In RefSeq, a second generation database, chromosome, gene, mRNA, and protein data are curated. Other data such as contigs, model mRNA, and model protein are calculated. A gene can give rise to multiple products. In such a case, separate RefSeq ids are used for each product and all are linked by a Locus Id. Second generation nucleotide sequence databases are essentially gene-centric databases.
In a gene-centric database, all information relevant to a given gene is made accessible at once. Entrez Gene and RefSeq are the most commonly used. Entrez Gene is tightly linked to RefSeq. RefSeq, the Reference Sequence collection, aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript RNA, and protein products.
Gene-centric databases contain gene-specific information. They focus on genomes that have been completely sequenced, that have an active research community to contribute gene-specific information, or that are scheduled for intense analysis. The content of Entrez Gene represents the result of curation and automated integration of data from NCBI's RefSeq and other collaborating databases.
Genome-centric databases contain information about the gene sequence, relative position, strand orientation, biochemical functions, etc. Ensembl and TIGR are information management systems that are able to connect specialized sequence collection and browsing tools.
GenBank is a comprehensive public database of nucleotide sequences built and distributed by the NCBI. GenBank is primarily built from the sequence data submissions from authors and from the bulk submission of ESTs, GSS and other high-throughput data from sequencing centers.
EST: Expressed Sequence Tags produced by one-shot sequencing of a cloned cDNA.
GSS: Genome Survey Sequences are similar to ESTs, with the exception that most of the sequences are genomic in origin.
GenBank doubles in size every 18 months. WGS and environmental sequences now occupy a significant share of the database.
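To make the doubling claim concrete, the following is a minimal sketch of the arithmetic it implies. It assumes a constant 18-month doubling time, which is a simplification; the function name `growth_factor` is introduced here for illustration only.

```python
def growth_factor(months: float, doubling_months: float = 18.0) -> float:
    """Multiplicative growth after `months`, assuming a fixed doubling time."""
    return 2.0 ** (months / doubling_months)


# Three years is two doubling periods, so the database would be four times larger.
print(growth_factor(36))
# Growth over ten years under the same assumption.
print(growth_factor(120))
```

Under this assumption the database grows by a factor of four every three years, which is why paper-based or manually maintained archives could not keep up.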
WGS: Whole Genome Shotgun sequences are contigs from a sequencing project. WGS data can contain annotation and should be updated as sequencing progresses.
Contig: A contig is a DNA sequence assembled from DNA fragments of 100-300 base pairs.
Environmental Sequences: These are all DNA sequences present in a sample. The sample often contains many different organisms and these organisms are very often unknown and unidentified.
Each GenBank entry includes a concise description of:
GenBank partitions sequence into divisions that roughly correspond to:
HTC: High throughput cDNA
HTG: High throughput genomic sequences, single-pass, unfinished genomic sequences
EST and HTC are RNA or cDNA. GSS, HTG, WGS, and ENV are DNA.
The data in GenBank, and the collaborating databases EMBL and DDBJ, are submitted primarily by individual authors to one of the three databases, or by sequencing centers as batches of EST, STS, GSS, HTC, WGS or HTG sequences. Data are exchanged daily with DDBJ and EMBL so that the daily updates from NCBI servers incorporate the most recently available sequence data from all sources. Virtually all records enter GenBank as direct electronic submissions.
EMBL, GenBank, DDBJ and Swiss-Prot use both identifiers and accession numbers to identify each entry. To make things more complicated, identifiers and accession numbers mean different things in different databases. In Swiss-Prot, identifiers are alphanumeric terms that are meaningful to a human being. For example, HBA_HUMAN refers to the human haemoglobin alpha chain. Identifiers can change, but they rarely do. The accession number of HBA_HUMAN is P69905. Accession numbers are primary keys, so they never change. If two entries are merged, the new entry will have both accession numbers: one becomes the primary key and the other the secondary key. When an entry is split, new accession numbers are assigned to each resulting entry and the old accession number is recorded as a secondary key.
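The merge and split rules for accession numbers can be sketched as a small data structure. This is only an illustration of the bookkeeping described above, not the actual Swiss-Prot implementation; the class `Entry`, the functions `merge` and `split`, and every accession or identifier other than HBA_HUMAN / P69905 are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Entry:
    identifier: str            # human-readable name, may occasionally change
    primary_accession: str     # stable primary key
    secondary_accessions: List[str] = field(default_factory=list)

    def all_accessions(self) -> List[str]:
        """Primary accession first, followed by the secondary ones."""
        return [self.primary_accession] + self.secondary_accessions


def merge(kept: Entry, absorbed: Entry) -> Entry:
    """Merge two entries: the kept accession stays primary and every
    accession of the absorbed entry becomes secondary."""
    secondary = kept.secondary_accessions + absorbed.all_accessions()
    return Entry(kept.identifier, kept.primary_accession, secondary)


def split(old: Entry, parts: List[Tuple[str, str]]) -> List[Entry]:
    """Split an entry: each part gets a new primary accession and
    carries the old accessions as secondary keys."""
    return [Entry(identifier, accession, old.all_accessions())
            for identifier, accession in parts]


hba = Entry("HBA_HUMAN", "P69905")
other = Entry("OLD_ENTRY", "X00001")      # hypothetical entry
merged = merge(hba, other)
print(merged.all_accessions())

# Hypothetical split into two new entries with new accession numbers.
for part in split(merged, [("PART_A", "X00002"), ("PART_B", "X00003")]):
    print(part.identifier, part.primary_accession, part.secondary_accessions)
```

Because old accession numbers are never discarded, a lookup by any historical accession can still be resolved to the current entry or entries.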
GenBank data can be retrieved by Entrez. Entrez covers over 30 biological databases containing DNA and protein sequence data, genome mapping data, population sets, phylogenetic sets, environmental sample sets, gene expression data, the NCBI taxonomy, protein domain information, protein structures from the Molecular Modeling Database, MMDB, and MEDLINE references via PubMed. Entrez is a very good system to use since it returns much more information than is available on GenBank.
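Entrez can also be queried programmatically through NCBI's E-utilities, which are not described in this text. The following is a minimal sketch that only builds a request URL for the `efetch` endpoint and does not send it; the helper name `efetch_url` is hypothetical, and the accession U49845 is used purely as an illustrative value.

```python
from urllib.parse import urlencode

# Public E-utilities endpoint for fetching full records.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str, db: str = "nucleotide", rettype: str = "gb") -> str:
    """Build (but do not send) an efetch URL for one record.

    db      : Entrez database to query
    rettype : record format, here the GenBank flat file
    """
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EUTILS_EFETCH + "?" + urlencode(params)


url = efetch_url("U49845")  # illustrative accession
print(url)
```

The resulting URL can be opened in a browser or passed to any HTTP client. NCBI publishes usage guidelines for automated access, which should be consulted before sending many requests.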
Biological databases often come with useful tools. BLAST is a very powerful tool that allows sequence-similarity comparisons.
The GenBank database can be downloaded by FTP from ftp.ncbi.nih.gov.
This page is a brief summary of descriptions of Swiss-Prot, GenBank, and EMBL available on their websites.
There are two major protein sequence resources:
In addition, there are several different specialized protein databases.
UniProt is a central resource for protein sequence and function. The UniProt consortium (since 2003) consists of EMBL, SIB, and PIR. PIR is no longer being updated. It now only functions as an archive. UniProt itself is divided into several components.
UniProtKB/TrEMBL contains computer annotated protein sequences. TrEMBL entries are produced by translating nucleic acid sequences (CDS) in EMBL using computer tools. In addition, it includes data from PIR. TrEMBL suffers from poor submission of annotated CDS.
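The core step of producing a protein sequence from a coding sequence (CDS) can be sketched as below. This is an illustration of conceptual translation with the standard genetic code, not the pipeline actually used for TrEMBL; the function name `translate_cds` and the example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_cds(cds: str) -> str:
    """Translate a CDS codon by codon, stopping at the first stop codon.

    Unknown codons (for example those containing N) become 'X'.
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_cds("ATGGCCTAA"))  # prints "MA"
```

The sketch assumes the CDS boundaries are already annotated correctly, which is exactly where poorly annotated submissions cause trouble.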
TrEMBL is a platform for the improvement of automated annotation tools. A TrEMBL entry is created after applying many annotation tools such as SignalP, TMHMM, REP, etc. Then evidence tags are added to any part of a TrEMBL entry not derived from the original EMBL entry.
UniProtKB/Swiss-Prot contains manually annotated protein sequences. Swiss-Prot entries are produced by manually annotating TrEMBL entries. Before creating a Swiss-Prot entry, the sequence is checked and analyzed. The data is cross-checked with literature and external scientific expertise. Once an entry is moved to Swiss-Prot, it is deleted from TrEMBL. Data in Swiss-Prot does not migrate to TrEMBL. Together, Swiss-Prot and TrEMBL provide all known protein sequences in the public domain.
The goals of Swiss-Prot are:
A Swiss-Prot Entry contains:
One UniRef100 entry contains all identical sequences including fragments.
One UniRef90 entry contains sequences that have at least 90% identity.
One UniRef50 entry contains sequences that have at least 50% identity.
UniParc is an archive of raw protein sequences.
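The UniRef identity thresholds can be illustrated with a simple percent-identity calculation. This is only a sketch under strong assumptions: the two sequences are taken to be already aligned and of equal length, and gaps or fragments are not handled. The real UniRef clustering procedure is more involved, and the function names and peptide sequences here are hypothetical.

```python
from typing import List


def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions between two pre-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal (aligned) length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def uniref_levels(a: str, b: str) -> List[str]:
    """List the UniRef levels whose identity threshold the pair satisfies."""
    identity = percent_identity(a, b)
    levels = []
    if identity == 100.0:
        levels.append("UniRef100")
    if identity >= 90.0:
        levels.append("UniRef90")
    if identity >= 50.0:
        levels.append("UniRef50")
    return levels


# Two made-up peptides differing at one of ten positions (90% identity).
print(uniref_levels("MKTAYIAKQR", "MKTAYIAKQK"))
```

A pair that satisfies a stricter threshold also satisfies every looser one, so the three collections describe progressively coarser groupings of the same sequences.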
Sequences and information in UniProt are accessible via text search, BLAST similarity search, and FTP.
A remarkable variability exists in genome size among eukaryotes that has little correlation with organismal complexity, size or number of coding genes. Even a unicellular organism can have a larger genome than a mammal! This striking disparity is due to non-coding DNA.
Non-coding DNA is DNA that does not contain instructions for making cell products. It constitutes a large portion of the genome of eukaryotes. Some of this non-coding DNA is involved in regulating the coding regions of DNA. The functions of the remaining non-coding DNA are still unknown.
The genome contains several types of non-coding regions (regions not coding for proteins). Non-coding regions can be found in three areas:
• Genic DNA,
• genic DNA coding for ncRNA, and
• intergenic DNA
Genic DNA is involved directly in gene expression. UTR regions (untranslated regions of mRNA), and introns are genic DNA.
The intergenic region contains mostly repetitive regions. Functional regions, which constitute about 15% of intergenic regions, contain SARs (scaffold attachment regions), telomeres, and centromeres. The functions of the remaining 85% are unknown.
A SAR (scaffold attachment region) is an AT-rich segment of a eukaryotic genome that acts as an attachment point to the nuclear matrix. The nuclear matrix is a proteinaceous scaffold-like network that permeates the nucleus.
A telomere is a region of highly repetitive DNA at the end of a chromosome that functions as a disposable buffer. Every time linear eukaryotic chromosomes are replicated, the DNA polymerase complex is incapable of replicating all the way to the end of the chromosome; if it were not for telomeres, this would quickly result in the loss of useful genetic information.
The centromere is the site where spindle fibers of the mitotic spindle attach to the chromosome during mitosis. In most eukaryotes, the centromere has no defined DNA sequence. It typically consists of large arrays of repetitive DNA where the sequence within individual repeat elements is similar but not identical.
Much of this variation in genome size is due to non-coding, tandemly repeated DNA. A substantial fraction of the eukaryote genomes is often composed of repetitive DNA.
Simple repeats are duplications of simple sets of DNA bases, typically 1–5 bp. CpG dinucleotides are among the most important simple repeats. A CpG island is a short stretch of DNA in which the frequency of the dinucleotide sequence CG is higher than in other regions. The p simply indicates that C and G are connected by a phosphodiester bond. To be classified as a CpG island, a sequence must be at least 200 bases long.
DNA methylation occurs at CG-rich sites. Over evolutionary time, methylated cytosines may be converted to thymine by deamination (CpG -> TpG). Methylated (inactive) regions are thus poor in CpG. CpG islands are unmethylated regions of the genome that are associated with the 5’ ends of genes which are frequently switched on. CpG islands often overlap the promoter and extend about 1000 base pairs downstream into the transcription unit.
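CpG depletion and CpG islands can be quantified by comparing the observed number of CG dinucleotides with the number expected from the C and G content. The sketch below uses the 200-base minimum length stated above; the GC-content threshold (above 50%) and the observed/expected threshold (above 0.6) are commonly used values that are assumptions here, not taken from this text, and the function names are hypothetical.

```python
from typing import Tuple


def cpg_stats(seq: str) -> Tuple[float, float]:
    """Return (GC content, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c, g = seq.count("C"), seq.count("G")
    observed = seq.count("CG")          # CG cannot overlap itself
    expected = c * g / n                # expected count if C and G were independent
    ratio = observed / expected if expected else 0.0
    return (c + g) / n, ratio


def looks_like_cpg_island(seq: str, min_length: int = 200,
                          min_gc: float = 0.5, min_ratio: float = 0.6) -> bool:
    """Apply length, GC-content and observed/expected thresholds."""
    gc_content, ratio = cpg_stats(seq)
    return len(seq) >= min_length and gc_content > min_gc and ratio > min_ratio


print(looks_like_cpg_island("CG" * 100))   # 200 bases, all CpG
print(looks_like_cpg_island("CATG" * 50))  # 200 bases, no CpG at all
```

In bulk genomic DNA the observed/expected ratio is well below 1 because of the deamination process described above, so regions where it stays high stand out as candidate islands. In practice the test is applied in a sliding window along the sequence rather than to a whole sequence at once.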
Tandem repeats are typically found at the centromeres and telomeres of chromosomes. These are duplications of more complex 100-200 base sequences. DNA satellites can further be divided into satellites, minisatellites, and microsatellites, based on the number of nucleotides involved.
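Short tandem repeats of the microsatellite kind can be located with a regular expression that looks for a unit immediately followed by copies of itself. This is a minimal sketch for exact repeats with short units only; the function name and the default limits (`max_unit=5`, `min_copies=3`) are assumptions chosen for illustration, and longer or imperfect satellite repeats need dedicated tools.

```python
import re
from typing import List, Tuple


def find_tandem_repeats(seq: str, max_unit: int = 5,
                        min_copies: int = 3) -> List[Tuple[int, str, int]]:
    """Find exact tandem repeats.

    Returns (start position, repeat unit, number of copies) for every run of
    at least `min_copies` consecutive copies of a unit of 1..max_unit bases.
    """
    # Lazy unit match so the shortest repeating unit is preferred.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


print(find_tandem_repeats("TTACACACACGG"))  # prints [(2, 'AC', 4)]
```

The unit length returned by such a search is what distinguishes microsatellites from minisatellites and larger satellites.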
Segmental Duplications are large blocks of 10-300kbp which have been copied to another region of the genome.
Interspersed repeats are repeated DNA sequences located at dispersed regions in a genome. They are also known as mobile elements or transposable elements. LINEs are long interspersed elements. SINEs are short interspersed elements.
Pseudogenes are defined as nonfunctional sequences of DNA originally derived from functional genes (evolutionary relics). There are 2 major classes:
• unprocessed pseudogenes derived from gene duplication and
• processed pseudogenes derived from retrotransposition of mRNA
Pseudogenes may be transcribed but not translated. Their chromosomal distributions appear random and dispersed. Pseudogenes can be considered as ‘potogenes’, i.e. DNA sequences with a probability of becoming new genes.
Processed pseudogenes are very similar to their closest corresponding human gene, being 94% complete in coding regions, with sequence similarity of 75% for amino acids and 86% for nucleotides.
Pseudogene.org is an organization which concentrates on pseudogenes.
In prokaryotes, one gene codes for one protein. Eukaryotes use a much more elaborate mechanism to increase sequence diversity and to enable themselves to produce newer proteins.
Several exons are involved in coding for a single protein. Any one of several exons can be used to initiate expression. The choice of the initiating exon can generate a different isoform of the same protein. In other words, alternative usage of promoters results in different isoforms of a protein.
RNA splicing is a precisely regulated co- and post- transcriptional process (occurring prior to mRNA translation) that removes introns and joins exons in a primary transcript.
During RNA splicing, exons can either be retained in the mature message or targeted for removal in different combinations to create a diverse array of mRNAs from a single pre-mRNA, a process referred to as alternative RNA splicing (tissue and cell specific).
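The combinatorial effect of retaining or removing exons can be sketched by enumerating every possible mRNA for a transcript in which some exons are optional. This is an idealised illustration: it assumes each optional exon is included or skipped independently, whereas real splicing is regulated and tissue specific. The function name and exon labels are hypothetical.

```python
from itertools import product
from typing import List, Set


def enumerate_isoforms(exons: List[str], optional: Set[str]) -> List[List[str]]:
    """List all exon combinations obtained by keeping or skipping optional exons.

    Exon order is preserved; exons not listed as optional are always kept.
    """
    optional_in_order = [exon for exon in exons if exon in optional]
    isoforms = []
    for choice in product([True, False], repeat=len(optional_in_order)):
        keep = dict(zip(optional_in_order, choice))
        isoforms.append([exon for exon in exons if keep.get(exon, True)])
    return isoforms


for isoform in enumerate_isoforms(["E1", "E2", "E3", "E4"], {"E2", "E3"}):
    print("-".join(isoform))
```

With n independently optional exons there are 2^n possible mRNAs, which shows how a single pre-mRNA can give rise to a diverse array of messages.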
There are four known modes of alternative splicing:
1. Alternative selection of promoters:
This is the only method of splicing which can produce an alternative N-terminus domain in proteins. In this case, different sets of promoters can be spliced with certain sets of other exons.
2. Alternative selection of cleavage/polyadenylation sites:
This is the only method of splicing which can produce an alternative C-terminus domain in proteins. In this case, different sets of polyadenylation sites can be spliced with the other exons.
3. Intron retaining mode
In this case, instead of being spliced out, an intron is retained in the mRNA transcript. However, the retained intron must encode amino acids in frame with the neighboring exons; otherwise a stop codon or a shift in the reading frame will cause the protein to be non-functional.
4. Exon cassette mode:
In this case, certain exons are spliced out to alter the sequence of amino acids in the expressed protein.
mRNA editing
…~15 % of disease-causing mutations involve misregulation of alternative splicing (missplicing)…
Exon order is not conserved. It can be scrambled, a technique used in alternative promoter usage.
Splicing prepares pre-mRNA in eukaryotes to produce mature mRNA. This mature messenger RNA is then ready to undergo translation as part of protein synthesis. When the exons are in the same RNA transcript, it is called cis-splicing.
Trans-splicing is a form of splicing that joins two exons that are not within the same RNA transcript.
Exonic splicing enhancers (ESEs) are discrete sequences within exons that promote both constitutive and regulated splicing. The precise mechanism by which ESEs facilitate the assembly of splicing complexes has been controversial. However, recent studies have provided insights into this question and have led to a new model for ESE function. Other recent work has suggested that ESEs are composed of diverse sequences and occur frequently within exons. Ominously, these latter studies predict that many human genetic diseases linked to mutations within exons might be caused by the inactivation of ESEs.
Exonic splicing enhancer prediction - http://rulai.cshl.edu/tools/ESE/
Alternative splicing database project - http://www.ebi.ac.uk/asd/index.html
Non-coding RNAs represent ~10% of the genes but ~98% of all human transcripts. snRNA participates in post-transcriptional chemical modification or processing of different RNAs.
Micro RNAs (miRNAs) are a class of non-coding RNA genes. They play an important role in the regulation of translation and degradation of mRNAs through base pairing to partially complementary sites in the untranslated regions (UTRs) of the messenger.
Antisense transcription is transcription from the strand opposite to a protein-coding, or sense, strand. Computational analysis suggests that between 15 and 25% of mammalian genes overlap, giving rise to pairs of sense and antisense RNA. Such pairs are almost universally associated with candidate imprinted loci and also occur on the autosomes. They play roles in gene regulation involving degradation of the corresponding sense transcripts (RNA interference) as well as gene silencing at the chromatin level. The challenge is to determine the correct orientation of an expressed sequence, especially an expressed sequence tag (EST).
Antisense mRNA is an mRNA transcript that is complementary to endogenous mRNA. It is the noncoding strand complementary to the coding sequence of mRNA. Introducing a transgene coding for antisense mRNA is a strategy used to block expression of a gene of interest. A strand of antisense mRNA can also be introduced into the cytosol by microinjection. Radioactively-labelled antisense mRNA can be used to hybridise to endogenous sense mRNA, which can show the level of transcription of genes in various cell types.
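Complementarity between sense and antisense RNA can be shown with a reverse-complement calculation. This is a minimal sketch for plain A, C, G, U sequences; the function name `antisense_rna` and the example sequence are hypothetical.

```python
# Watson-Crick pairing for RNA: A-U and G-C.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def antisense_rna(sense: str) -> str:
    """Return the antisense strand (reverse complement), written 5' to 3'."""
    return sense.upper().translate(COMPLEMENT)[::-1]


sense = "AUGGCC"
print(antisense_rna(sense))  # prints "GGCCAU"
```

Taking the reverse complement twice returns the original sequence, which is a quick sanity check when deciding the orientation of an expressed sequence.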
ncRNA genes are found in genomic sequences by their sequence or structural homology.
tRNAs have conserved sequence elements. Programs use a combination of pattern searches, probabilistic methods, and (for eukaryotes) searches for Pol III promoters. tRNAscan is a very good program for finding tRNAs.