Brought to you by molecularsciences.org.
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 License.
This publication may not be redistributed without this notice.

Gene Prediction

Gene prediction refers to algorithmically identifying stretches of DNA sequence that are biologically functional. Historically, gene prediction was a painstaking and difficult process. Today, thanks to comprehensive genome sequencing and powerful computational resources, gene prediction is largely a computational problem.

Gene prediction is used to find functional sequences, that is, regions of DNA that code for a protein or an RNA product. Regulatory regions, which are regions of DNA that regulate gene expression, are also considered functional. Gene prediction by itself does not tell us which genes code for which proteins.

There are two primary approaches for predicting genes:

• Intrinsic approaches – ab initio
• Extrinsic approaches – homology-based

Prerequisite Knowledge

A gene is the fundamental physical and functional unit of heredity. It is an ordered sequence of nucleotides, located at a particular position on a particular chromosome, that encodes a specific functional product (RNA or protein).

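An Open Reading Frame (ORF) is a series of consecutive DNA codons, read in a single frame, that does not contain any stop codons. In practice, an ORF is usually taken to run from a start codon to the first in-frame stop codon.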

A Coding Sequence (CDS) is a region of DNA or RNA whose sequence determines the sequence of amino acids in a protein.

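Reading frames are always read from 5’ to 3’. Each strand has three possible reading frames, so a double-stranded DNA sequence has six.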
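
To make the idea of reading frames concrete, the following is a minimal Python sketch that splits a sequence into codons for all six frames. The function names `reverse_complement` and `six_frames` are illustrative, and the sketch assumes an unambiguous, uppercase-convertible sequence containing only A, C, G, and T.

```python
# Illustrative sketch: enumerate the codons of all six reading frames.
# Assumes the input contains only A, C, G, T (no ambiguity codes).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement, so it also reads 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Map a frame label ('+1'..'+3', '-1'..'-3') to its list of codons."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Step through the strand three bases at a time, dropping
            # any incomplete codon at the end.
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames


if __name__ == "__main__":
    for label, codons in six_frames("ATGAAATAG").items():
        print(label, codons)
```

The three "-" frames are read from the reverse complement, which is how the second strand is read 5’ to 3’.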

Prokaryotic gene model

Prokaryotes have small genomes with high gene density. Many of their genes are organized into operons, in which a single transcript carries several genes. Since there are no introns, one gene produces one protein, and there is one ORF per gene. An ORF begins with a start codon and ends with a stop codon. There are conserved promoter regions around the start sites of transcription and translation. Genes often overlap in prokaryotes.

The principal difficulties with prokaryote gene prediction are overlapping ORFs, short genes, and finding promoters. In spite of these difficulties, gene prediction in prokaryotes can reach about 99% accuracy.

Eukaryotic gene structure

Ab Initio Gene Prediction

Ab initio gene prediction is an intrinsic method based on gene content and signal detection. The genomic DNA sequence is systematically searched for signs of protein-coding genes. A signal indicates the presence of a coding region in its vicinity. Ab initio methods make predictions from the sequence information only. They identify only the coding exons of protein-coding genes; the transcription start site and the 5’ and 3’ UTRs are ignored. These methods can detect new genes with no similarity to known sequences or domains.

Ab initio methods are rule-based and rely on coding statistics and signal detection, taking the statistical properties of coding regions into account. Training sets of known gene structures are used to generate statistical tests for the likelihood that a prediction is real. Since these statistical properties are specific to each species, a model trained on one species usually does not transfer well to another.

Gene Content
Ab initio methods use information in the gene content, such as GC content, codon bias, and hexamer frequency, to discriminate coding regions from non-coding regions. Codon bias refers to unusually high usage of certain codons over their synonymous alternatives. For example, leucine (L) can be coded by six different codons, but human genes prefer CTG over the others.
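
As a rough illustration of two of these content measures, the sketch below computes GC content and the relative usage of the six leucine codons in a coding sequence. The function names are illustrative, and the sketch assumes the sequence is given in frame, starting at the first base of a codon.

```python
# Illustrative sketch: GC content and synonymous codon usage for leucine.
# Assumes the coding sequence is supplied in frame (position 0 starts a codon).
from collections import Counter

LEUCINE_CODONS = ("TTA", "TTG", "CTT", "CTC", "CTA", "CTG")


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def synonymous_usage(cds, codons=LEUCINE_CODONS):
    """Relative usage of each codon within one synonymous family."""
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {c: 0.0 for c in codons}
    return {c: counts[c] / total for c in codons}


if __name__ == "__main__":
    toy_cds = "ATGCTGCTGTTATAA"  # made-up sequence, not real data
    print(gc_content(toy_cds))
    print(synonymous_usage(toy_cds))
```

On a real genome, these usage fractions would be estimated from a large set of known genes rather than from a single short sequence.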

Coding statistics
A coding statistic is a function that computes, for a given DNA sequence, the likelihood that the sequence codes for a protein. Intergenic regions, introns, and exons have different nucleotide content, and this information helps the function discriminate between these regions. For example, the probability of finding a stop codon in a random sequence differs from the probability of finding one in a coding sequence.
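
The stop codon example can be made quantitative under a simplifying assumption: if all four bases are equally likely and independent, 3 of the 64 possible codons are stop codons, so a run of n random codons is free of stops with probability (61/64)^n. The short sketch below, with an illustrative function name, evaluates this for a few lengths.

```python
# Illustrative sketch: chance that random sequence stays stop-free.
# Assumes uniform, independent bases, which real genomes do not follow exactly.
def prob_stop_free(n_codons, p_stop=3 / 64):
    """Probability that n_codons random codons contain no stop codon."""
    return (1 - p_stop) ** n_codons


if __name__ == "__main__":
    for n in (10, 50, 100, 300):
        print(n, prob_stop_free(n))
```

Under this assumption, a stop-free run of 100 codons arises by chance less than 1% of the time, which is why long ORFs are treated as evidence of coding sequence.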

Intergenic regions are DNA sequences located between genes. They make up a large percentage of the human genome, and much of this sequence has no known function.

Unequal usage of codons in coding regions (codon bias) is a universal feature of genomes. Uneven usage of amino acids, uneven usage of synonymous codons (codon usage, which correlates with the abundance of the corresponding tRNAs), and hexamer usage also help discriminate coding regions from non-coding regions.
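
One common way to turn codon usage into a coding statistic is a log-odds score: codon frequencies are estimated from known coding sequences and from background sequence, and a candidate region is scored by summing the log ratio for each of its codons. The sketch below is a minimal version of that idea. The function names, the pseudocount, and the toy training sequences are all assumptions made for illustration.

```python
# Illustrative sketch: log-odds coding score from codon frequencies.
# Training data here is made up; real models are trained on known genes.
import math
from collections import Counter

BASES = "ACGT"
ALL_CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]


def codon_frequencies(seqs, pseudocount=1.0):
    """Estimate codon frequencies from in-frame sequences, with smoothing."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        counts.update(s[i:i + 3] for i in range(0, len(s) - 2, 3))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def coding_log_odds(seq, coding_freq, background_freq):
    """Sum of log(coding / background) over the codons of seq.

    Positive scores favor the coding model. Codons containing characters
    other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    score = 0.0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in coding_freq:
            score += math.log(coding_freq[codon] / background_freq[codon])
    return score


if __name__ == "__main__":
    coding = codon_frequencies(["CTGCTGCTGGAG"])
    background = codon_frequencies(["ATATATATATAT"])
    print(coding_log_odds("CTGCTG", coding, background))
    print(coding_log_odds("ATATAT", coding, background))
```

The same scheme extends to hexamers by counting six-base words instead of codons.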

Gene identification in prokaryotes

Gene prediction is easier and more accurate in prokaryotes than eukaryotes since prokaryote gene structure is much simpler. In prokaryotes, ab initio methods look for:

• The presence of an ORF (start + stop) with a statistically significant size to code for a protein
• Codon usage bias
• RBS (ribosome binding site) and terminator identification.

Locating ORFs is much simpler in prokaryotes. DNA sequences encoding proteins are generally transcribed into mRNA, which is translated into protein with very little modification. An ORF that runs from a start codon to a stop codon may therefore indicate a protein-coding region. Longer ORFs are more likely to be protein-coding regions than shorter ORFs.
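
The ORF search described above can be sketched in a few lines of Python. This is a simplified illustration, not the method of any particular tool: it only accepts ATG as a start codon, uses the three standard stop codons, and the `min_codons` threshold of 100 is an arbitrary assumed default.

```python
# Illustrative sketch: scan all six frames for ATG...stop ORFs.
# Assumes ATG as the only start codon and the standard stop codons.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_orfs(seq, min_codons=100):
    """Return ORFs as (strand, frame, start, end, sequence), longest first.

    Coordinates are 0-based, end-exclusive, and refer to the strand being
    read; for the '-' strand that is the reverse complement of seq.
    The codon count includes the start and stop codons.
    """
    seq = seq.upper()
    orfs = []
    strands = (("+", seq), ("-", seq.translate(COMPLEMENT)[::-1]))
    for strand, s in strands:
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START:
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOPS:
                    n_codons = (i + 3 - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    orfs.sort(key=lambda orf: len(orf[4]), reverse=True)
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAACCCTAGGG", min_codons=3):
        print(orf)
```

Sorting by length reflects the point that longer ORFs are better candidates. A real prokaryotic gene finder would also consider alternative start codons and combine ORF length with codon usage and RBS signals.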

Ab initio gene prediction has certain advantages in prokaryotes, largely due to the simplicity of their genomes. The genomes are small, with high gene density and simple structure (no introns).

The principal difficulties are:

• detection of initiation site (AUG)
• alternative start codons
• gene overlap
• undetected small proteins

In spite of these difficulties, prokaryote gene prediction can reach 99% accuracy.

Gene prediction in Eukaryotes

Gene identification in eukaryotes is much more complicated and difficult, and a lot less accurate. In eukaryotes, we look for the following patterns:

• upstream promoter sequences,
• Kozak sequence, and
• exon-intron boundaries

We use this information to predict the poly-A signal and the start and stop of the gene. In eukaryotes, the signals are not as clearly defined as in prokaryotes, so simple pattern-matching techniques cannot be used. The problems with eukaryote gene prediction are numerous, and prediction accuracy is about 50% at best. Modern gene prediction tools use advanced techniques such as hidden Markov models (HMMs). GENSCAN is a notable program in this domain.

Locating ORFs is less effective for eukaryotic genomes. There are large non-coding regions between genes, and introns within genes. mRNA undergoes processing before translation (splicing and alternative splicing). A protein-encoding gene may contain stop codons within its intronic regions. PTMs make gene prediction even more difficult. Several tools, such as SpliceView and ORF Finder, attempt to locate ORFs or help to do so.

Gene Prediction Methods
Various pattern recognition methods are used to identify signals:

• Weight matrices
• Decision trees
• Hidden Markov models (HMMs)
• Artificial neural networks
• Linear discriminant analysis
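
As an example of the first method in this list, the sketch below builds a position weight matrix from a handful of aligned signal sites and slides it along a sequence to find the best-scoring window. The aligned sites are made-up, Shine-Dalgarno-like strings used purely for illustration, and the uniform background of 0.25 per base and the pseudocount of 1 are assumptions.

```python
# Illustrative sketch: position weight matrix (PWM) for signal detection.
# The training sites below are invented examples, not curated data.
import math


def build_pwm(sites, pseudocount=1.0):
    """Build a log2-odds PWM from equal-length aligned sites.

    Each column maps a base to log2(observed frequency / 0.25),
    assuming a uniform background.
    """
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(length):
        column = [site[pos].upper() for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / 0.25)
            for base in "ACGT"
        })
    return pwm


def score_window(pwm, window):
    """Sum the column scores; unknown characters contribute 0."""
    return sum(col.get(base, 0.0) for col, base in zip(pwm, window.upper()))


def best_hit(pwm, seq):
    """Return (score, position) of the best window; seq must be at least PWM-wide."""
    width = len(pwm)
    return max(
        (score_window(pwm, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )


if __name__ == "__main__":
    example_sites = ["AGGAGG", "AGGAGG", "AGGAGA", "AGGTGG"]
    pwm = build_pwm(example_sites)
    print(best_hit(pwm, "TTTTAGGAGGTTTTATG"))
```

In practice a score threshold would be chosen from known sites, and only windows above that threshold would be reported as candidate signals.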

An algorithm can be:

• Rule-based
• Neural network based
• HMM based

GENSCAN is a general-purpose gene identification program that analyzes genomic DNA sequences from a variety of organisms, including human, other vertebrates, invertebrates, and plants. GENSCAN:

• Identifies complete exon/intron structure of genes in genomic DNA
• Predicts multiple genes, partial and complete genes
• Uses an HMM to model gene structure

GENSCAN takes the following into account when making a prediction:

• Transcription signals
• Translation signals
• Splicing signals (donors, acceptors, and branch points)
• Exon length distributions
• Compositional features such as G+C content and hexamer frequency
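
To give a feel for how an HMM labels a sequence, the following is a toy two-state model decoded with the Viterbi algorithm. It is not GENSCAN's model, which has many more states and the signal and length components listed above. The two states, the GC-rich emission probabilities for the coding state, and the transition probabilities are all invented for illustration.

```python
# Illustrative sketch: toy two-state HMM decoded with Viterbi.
# All probabilities are invented; this is not GENSCAN's model.
import math

STATES = ("noncoding", "coding")
START_P = {"noncoding": 0.5, "coding": 0.5}
TRANS_P = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT_P = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable state path for seq (A, C, G, T only)."""
    seq = seq.upper()
    if not seq:
        return []
    # Work in log space to avoid underflow on long sequences.
    scores = [{s: math.log(START_P[s]) + math.log(EMIT_P[s][seq[0]])
               for s in STATES}]
    back = []
    for base in seq[1:]:
        prev = scores[-1]
        row, ptr = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: prev[p] + math.log(TRANS_P[p][s]))
            row[s] = (prev[best_prev] + math.log(TRANS_P[best_prev][s])
                      + math.log(EMIT_P[s][base]))
            ptr[s] = best_prev
        scores.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    toy = "ATATATATAT" + "GCGCGCGCGC" + "ATATATATAT"
    print(viterbi(toy))
```

The high self-transition probabilities make the model prefer long runs in one state, so isolated G or C bases do not flip the labeling.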

Weaknesses of ab initio prediction
Ab initio prediction is not reliable enough on its own, especially in eukaryotes. It is not specific enough (it produces too many false positives), although exon sensitivity can be good. It is generally used to point sequence similarity searches in the right direction.
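
The trade-off between sensitivity and false positives can be measured at the nucleotide level by comparing predicted coding positions with annotated ones. The sketch below assumes positions are given as collections of integer coordinates, and it reports the fraction of predicted positions that are correct, TP / (TP + FP), as "precision". Gene-finding evaluations often call this same quantity specificity; that naming convention is an assumption about usage, not something stated in the text above.

```python
# Illustrative sketch: nucleotide-level accuracy of a gene prediction.
# Positions are integer coordinates of bases labeled as coding.
def nucleotide_accuracy(predicted, annotated):
    """Return (sensitivity, precision) for predicted vs annotated positions.

    sensitivity = TP / (TP + FN): share of true coding bases that were found.
    precision   = TP / (TP + FP): share of predicted bases that are correct.
    """
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)
    fp = len(predicted - annotated)
    fn = len(annotated - predicted)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision


if __name__ == "__main__":
    annotated = range(100, 200)   # made-up annotated exon
    predicted = range(100, 300)   # over-long prediction
    print(nucleotide_accuracy(predicted, annotated))
```

In this made-up case every annotated base is recovered, but half of the predicted bases are false positives, which is the pattern described above.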

Similarity-based Methods

Comparative Genomics