A remarkable variability exists in genome size among eukaryotes that has little correlation with organismal complexity, size or number of coding genes. Even a unicellular organism can have a larger genome than a mammal! This striking disparity is due to non-coding DNA.
Non-coding DNA is DNA that does not contain instructions for making proteins. It constitutes a large portion of the genome of eukaryotes. Some of this non-coding DNA is involved in regulating the coding regions of DNA. The functions of the remaining non-coding DNA are still unknown.
The genome contains several types of non-coding regions (regions not coding for proteins). Non-coding regions can be found in three areas:
• genic DNA,
• genic DNA coding for ncRNA, and
• intergenic DNA
Genic DNA is involved directly in gene expression. Untranslated regions (UTRs) of mRNA and introns are genic DNA.
The intergenic region contains mostly repetitive sequences. Functional regions, which constitute about 15% of intergenic DNA, include scaffold attachment regions (SARs), telomeres, and centromeres. The functions of the remaining 85% are unknown.
A scaffold attachment region (SAR) is an AT-rich segment of a eukaryotic genome that acts as an attachment point to the nuclear matrix. The nuclear matrix is a proteinaceous, scaffold-like network that permeates the nucleus.
A telomere is a region of highly repetitive DNA at the end of a chromosome that functions as a disposable buffer. Every time linear eukaryotic chromosomes are replicated, the DNA polymerase complex is incapable of replicating all the way to the end of the chromosome; if it were not for telomeres, this would quickly result in the loss of useful genetic information.
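The buffering role of the telomere can be illustrated with a toy calculation. The sketch below assumes a chromosome end that loses a fixed number of bases at every replication; the function name, the starting telomere length, and the loss per division are hypothetical values chosen only to make the arithmetic easy to follow, not measurements.

```python
def divisions_absorbed_by_telomere(telomere_bp, loss_per_division_bp):
    """Count the replications a telomere can absorb before shortening
    would start to remove DNA beyond it.

    Both arguments are illustrative assumptions, not measured values.
    """
    if loss_per_division_bp <= 0:
        raise ValueError("loss_per_division_bp must be positive")
    divisions = 0
    remaining = telomere_bp
    # Each replication removes a fixed stretch from the chromosome end.
    while remaining >= loss_per_division_bp:
        remaining -= loss_per_division_bp
        divisions += 1
    return divisions


# Hypothetical numbers: a 10,000 bp buffer losing 100 bp per replication.
print(divisions_absorbed_by_telomere(10_000, 100))  # 100
# Without any buffer, the first replication already cuts into other DNA.
print(divisions_absorbed_by_telomere(0, 100))       # 0
```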
The centromere is the site where the fibers of the mitotic spindle attach to the chromosome during mitosis. In most eukaryotes, the centromere has no defined DNA sequence. It typically consists of large arrays of repetitive DNA in which the sequences of individual repeat elements are similar but not identical.
Repetitive DNA sequence classes
Much of this variation in genome size is due to non-coding, tandemly repeated DNA. A substantial fraction of a eukaryotic genome is often composed of repetitive DNA.
1. Simple Repeats
Simple repeats are duplications of simple sets of DNA bases, typically 1-5 bp. CpG dinucleotides are among the most important simple repeats. A CpG island is a short stretch of DNA in which the frequency of the dinucleotide sequence CG is higher than in other regions. The p simply indicates that C and G are connected by a phosphodiester bond. To be classified as a CpG island, a sequence must be at least 200 bases long.
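A rough way to test a sequence against this definition is sketched below. Only the 200-base minimum comes from the text above. The GC-content threshold (above 50%) and the observed/expected CpG ratio threshold (above 0.6) are commonly cited criteria that are treated here as adjustable assumptions, and the function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (length, GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0, 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    # "CG" cannot overlap with itself, so str.count gives the exact number.
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n
    # Expected CpG count if C and G were placed independently: (C * G) / length.
    expected = (c * g) / n
    ratio = cpg / expected if expected > 0 else 0.0
    return n, gc_fraction, ratio


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the length rule from the text plus two assumed thresholds."""
    n, gc_fraction, ratio = cpg_stats(seq)
    return n >= min_len and gc_fraction > min_gc and ratio > min_ratio


print(looks_like_cpg_island("CG" * 100))  # True: 200 bases, all CpG
print(looks_like_cpg_island("AT" * 100))  # False: no C or G at all
print(looks_like_cpg_island("CG" * 50))   # False: only 100 bases long
```

Real island finders scan a genome with a sliding window and merge neighbouring hits; this sketch only scores a single candidate sequence.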
DNA methylation occurs at CpG sites. Over evolutionary time, methylated cytosines may be converted to thymine by deamination (CpG -> TpG). Methylated (inactive) regions are thus poor in CpG. CpG islands are unmethylated regions of the genome that are associated with the 5’ ends of genes which are frequently switched on. CpG islands often overlap the promoter and extend about 1000 base pairs downstream into the transcription unit.
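The CpG -> TpG conversion, and the resulting loss of CpG from methylated regions, can be mimicked on a short string. This is a toy illustration only: the function name, the example sequence, and the choice of which positions count as methylated are all hypothetical.

```python
def deaminate_methylated_cpg(seq, methylated_positions):
    """Replace the C of each listed CpG with T, mimicking CpG -> TpG."""
    bases = list(seq.upper())
    for pos in methylated_positions:
        # Only act where the position really is the C of a CG dinucleotide.
        if 0 <= pos < len(bases) - 1 and bases[pos] == "C" and bases[pos + 1] == "G":
            bases[pos] = "T"
    return "".join(bases)


before = "ACGTTCGACGA"  # CpG sites start at positions 1, 5 and 8
after = deaminate_methylated_cpg(before, [1, 5])  # assume two are methylated

print(after)                                   # ATGTTTGACGA
print(before.count("CG"), after.count("CG"))   # 3 1
```

Only the unmethylated site at position 8 survives as a CpG, which is the pattern expected when unmethylated islands stay CpG-rich while methylated DNA is depleted.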
2. Tandem Repeats - DNA satellites
Tandem repeats are typically found at the centromeres and telomeres of chromosomes. They are duplications of more complex sequences of 100-200 bases. DNA satellites can be further divided into satellites, minisatellites, and microsatellites, based on the number of nucleotides involved.
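Short tandem repeats can be located with a regular expression that looks for a unit followed directly by copies of itself. The sketch below is a minimal illustration, not a replacement for dedicated repeat-finding tools: it only finds exact copies, and the function name and the default unit lengths and copy number are assumptions chosen for the example.

```python
import re


def find_tandem_repeats(seq, min_unit=2, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for exact back-to-back repeats of a short unit."""
    seq = seq.upper()
    # The lazy quantifier prefers the shortest repeating unit; the
    # backreference \1 then demands at least (min_copies - 1) further
    # copies immediately after the first one.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


print(find_tandem_repeats("AACACACACATT"))  # [(1, 'AC', 4)]
```

Longer units, such as the 100-200 base repeats mentioned above, and imperfect copies need alignment-based methods rather than an exact-match pattern like this one.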
3. Segmental Duplications
Segmental duplications are large blocks of 10-300 kbp that have been copied to another region of the genome.
4. Interspersed Repeats (Transposons)
Interspersed repeats are repeated DNA sequences located at dispersed regions in a genome. They are also known as mobile elements or transposable elements. LINEs are long interspersed elements; SINEs are short interspersed elements.
5. Pseudogenes
Pseudogenes are defined as nonfunctional sequences of DNA originally derived from functional genes (evolutionary relics). There are two major classes:
• unprocessed pseudogenes derived from gene duplication and
• processed pseudogenes derived from retrotransposition of mRNA
Pseudogenes may be transcribed but not translated. Their chromosomal distributions appear random and dispersed. Pseudogenes can be considered as ‘potogenes’, i.e. DNA sequences with a probability of becoming new genes.
Processed pseudogenes are very similar to their closest corresponding human gene, being 94% complete in coding regions, with sequence similarity of 75% for amino acids and 86% for nucleotides.
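Similarity percentages of this kind are usually computed from an alignment of the pseudogene against its parent gene. As a minimal sketch, the function below computes percent identity between two sequences that are assumed to be already aligned, with "-" marking gaps. Identity over gap-free columns is only one of several conventions and is not necessarily the measure behind the figures quoted above; the function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical positions among gap-free columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


print(percent_identity("ACGT", "ACGA"))    # 75.0
print(percent_identity("AC-GT", "ACTGT"))  # 100.0 (the gap column is ignored)
```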
Pseudogene.org is an organization that concentrates on pseudogenes.