Brought to you by molecularsciences.org.
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 License.
This publication may not be redistributed without this notice.

Ab Initio Gene Prediction

Ab initio gene prediction is an intrinsic method based on gene content and signal detection. The genomic DNA sequence is systematically searched for signs of protein-coding genes. A signal is a short sequence feature, such as a start codon or a splice site, that indicates the presence of a coding region in the vicinity. Ab initio methods make predictions from the sequence information alone. They identify only the coding exons of protein-coding genes; the transcription start site and the 5’ and 3’ UTRs are ignored. Because they do not depend on sequence similarity, these methods can detect new genes with no similarity to known sequences or domains.

Ab initio methods are rule-based and combine signal detection with coding statistics, which capture the statistical properties of coding regions. Training sets of known gene structures are used to derive statistical tests for the likelihood that a prediction is real. Since these statistical properties are specific to each species, a model trained on one species is usually not transferable to another.

Gene Content
Ab initio methods use properties of gene content, such as GC content, codon bias, and hexamer frequency, to discriminate coding regions from non-coding regions. Codon bias refers to the unusually high usage of certain codons over their synonymous alternatives. For example, leucine (L) can be encoded by six different codons, but human genes prefer CTG over the other five.
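As a rough illustration of two of these content measures, the sketch below computes GC content and the relative usage of the six leucine codons in a coding sequence. The function names and the toy sequence are made up for this example, and the code assumes the sequence is already in frame and uses only the letters A, C, G and T.

```python
from collections import Counter

# The six leucine codons of the standard genetic code.
LEUCINE_CODONS = ("TTA", "TTG", "CTT", "CTC", "CTA", "CTG")


def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_counts(cds: str) -> Counter:
    """Count codons read in frame from position 0; a trailing partial codon is dropped."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def leucine_usage(cds: str) -> dict:
    """Relative usage of each leucine codon among all leucine codons in cds."""
    counts = codon_counts(cds)
    total = sum(counts[codon] for codon in LEUCINE_CODONS)
    if total == 0:
        return {}
    return {codon: counts[codon] / total for codon in LEUCINE_CODONS}


# Made-up toy sequence: ATG CTG CTG TTA CTG CTC TAA
toy_cds = "ATGCTGCTGTTACTGCTCTAA"
print(gc_content(toy_cds))
print(leucine_usage(toy_cds))
```

On real data, usage tables like this would be estimated from a large training set of known genes rather than from a single short sequence.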

Coding statistics
A coding statistic is a function that computes, for a given DNA sequence, the likelihood that the sequence codes for a protein. Intergenic regions, introns and exons have different nucleotide compositions, and this information helps the function discriminate between them. For example, the probability of finding a stop codon in a random sequence differs from the probability of finding one in frame in a coding sequence.
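The stop codon example can be made concrete with a small calculation. The sketch below assumes a deliberately simple background model in which bases are independent, and computes how likely a random triplet is to be a stop codon and how quickly the chance of a stop-free stretch shrinks with length. The function names are illustrative only.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_codon_probability(base_freqs: dict) -> float:
    """Probability that a random triplet is a stop codon, assuming independent bases."""
    return sum(
        base_freqs[a] * base_freqs[b] * base_freqs[c]
        for a, b, c in product("ACGT", repeat=3)
        if a + b + c in STOP_CODONS
    )


def prob_no_stop(n_codons: int, base_freqs: dict) -> float:
    """Probability that n_codons consecutive random triplets contain no stop codon."""
    return (1.0 - stop_codon_probability(base_freqs)) ** n_codons


# Assumption: all four bases equally frequent.
uniform = {base: 0.25 for base in "ACGT"}
print(stop_codon_probability(uniform))  # 3 of the 64 triplets are stop codons
print(prob_no_stop(100, uniform))
```

Under this model a stop-free run of 100 codons is already rare in random sequence, which is why a long open stretch is treated as evidence of coding potential.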

Intergenic regions are DNA sequences located between genes. They make up a large percentage of the human genome, and much of this sequence has no known function.

Unequal usage of codons in coding regions (codon bias) is a universal feature of genomes. Uneven usage of amino acids, uneven usage of synonymous codons (codon usage, which correlates with the abundance of the corresponding tRNAs), and hexamer usage also help discriminate coding regions from non-coding regions.
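One common way to turn such usage differences into a coding statistic is a log-odds score. The sketch below is a simplified illustration, not the scoring scheme of any particular tool: it counts overlapping k-mers (hexamers when k is 6) in a coding and a non-coding training set and sums the log ratio of their smoothed frequencies over a query. The function names, the add-one pseudocount and the tiny training sets are assumptions made for this example.

```python
import math
from collections import Counter


def kmer_counts(seqs, k=6):
    """Count all overlapping k-mers in a collection of sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def log_odds_score(query, coding_counts, noncoding_counts, k=6, pseudocount=1.0):
    """Sum of log(P_coding / P_noncoding) over the k-mers of query.

    Positive scores mean the query looks more like the coding training set.
    Pseudocounts keep unseen k-mers from producing a zero probability.
    """
    n_kmers = 4 ** k
    coding_total = sum(coding_counts.values()) + pseudocount * n_kmers
    noncoding_total = sum(noncoding_counts.values()) + pseudocount * n_kmers
    query = query.upper()
    score = 0.0
    for i in range(len(query) - k + 1):
        kmer = query[i:i + k]
        p_coding = (coding_counts[kmer] + pseudocount) / coding_total
        p_noncoding = (noncoding_counts[kmer] + pseudocount) / noncoding_total
        score += math.log(p_coding / p_noncoding)
    return score


# Made-up training data; k=3 keeps the toy example small.
coding = kmer_counts(["GCGGCGGCGGCG"], k=3)
noncoding = kmer_counts(["ATATATATATAT"], k=3)
print(log_odds_score("GCGGCG", coding, noncoding, k=3))
print(log_odds_score("ATATAT", coding, noncoding, k=3))
```

A more faithful version would count hexamers separately for each reading frame, since coding sequence has a three-base periodicity that this sketch ignores.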

Gene identification in prokaryotes

Gene prediction is easier and more accurate in prokaryotes than in eukaryotes, since prokaryotic gene structure is much simpler. In prokaryotes, ab initio methods look for:

• The presence of an ORF (open reading frame: a start codon followed in frame by a stop codon) long enough to be statistically likely to code for a protein
• Codon usage bias
• RBS (ribosome binding site) and terminator identification

Locating ORFs is much simpler in prokaryotes. DNA sequences encoding proteins are generally transcribed into mRNA, which is translated into protein with very little modification. An ORF running from a start codon to a stop codon may therefore indicate a protein-coding region. Longer ORFs are more likely to be protein-coding than shorter ones, because long stretches without a stop codon rarely arise by chance.
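A minimal ORF scan along these lines might look like the sketch below. It is an illustration under simplifying assumptions: only ATG is treated as a start codon, only one strand is scanned per call (the reverse_complement helper can be used to scan the other), and the min_codons threshold is an arbitrary parameter rather than a statistically derived cutoff.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 30):
    """Find ORFs in the three forward reading frames of seq.

    Returns a list of (frame, start, end, orf_sequence) tuples, where start
    and end are 0-based, end is exclusive, and the stop codon is included.
    Each ORF begins at the first ATG after the previous in-frame stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                n_codons = (i + 3 - start) // 3  # includes start and stop
                if n_codons >= min_codons:
                    orfs.append((frame, start, i + 3, seq[start:i + 3]))
                start = None
    return orfs


# Made-up toy sequence containing ATG AAA TTT GGG TAA in frame 2.
toy = "CCATGAAATTTGGGTAACC"
print(find_orfs(toy, min_codons=3))
print(find_orfs(reverse_complement(toy), min_codons=3))
```

Raising min_codons trades sensitivity for specificity: fewer random ORFs pass the filter, but genuine small proteins are also missed.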

Ab initio gene prediction has certain advantages in prokaryotes, largely due to the simplicity of their genomes. The genomes are small, with high gene density and simple gene structure (no introns).

The principal difficulties are:

• detection of initiation site (AUG)
• alternative start codons
• gene overlap
• undetected small proteins

In spite of these difficulties, prokaryotic gene prediction can reach 99% accuracy.

Gene prediction in eukaryotes

Gene identification in eukaryotes is much more complicated and far less accurate. In eukaryotes, we look for the following patterns:

• upstream promoter sequences,
• Kozak sequence, and
• exon-intron boundaries

This information is used to predict the poly-A signal and the start and stop of a gene. In eukaryotes, the signals are not as clearly defined as in prokaryotes, so simple pattern matching techniques cannot be used. The problems with eukaryotic gene prediction are numerous, and prediction accuracy is about 50% at best. Modern gene prediction tools use advanced techniques such as hidden Markov models (HMMs). GENSCAN is a notable program in this domain.
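To give a feel for how a hidden Markov model labels a sequence, the sketch below runs the Viterbi algorithm on a two-state toy model that separates GC-rich "coding" stretches from AT-rich "noncoding" ones. This is far simpler than the gene-structure models used by real tools such as GENSCAN, and every probability in it is a hypothetical value chosen only for illustration.

```python
import math

STATES = ("coding", "noncoding")

# Hypothetical parameters, not estimated from any real genome.
START_P = {"coding": 0.5, "noncoding": 0.5}
TRANS_P = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT_P = {
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}


def viterbi(seq: str):
    """Most probable state path for seq under the toy model, computed in log space."""
    seq = seq.upper()
    if not seq:
        return []
    # scores[s] is the best log-probability of any path that ends in state s.
    scores = {s: math.log(START_P[s]) + math.log(EMIT_P[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state to arrive from when entering state s.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS_P[p][s]))
            pointers[s] = prev
            new_scores[s] = (
                scores[prev] + math.log(TRANS_P[prev][s]) + math.log(EMIT_P[s][base])
            )
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Made-up sequence: an AT-rich stretch followed by a GC-rich stretch.
print(viterbi("ATATATATGCGCGCGCGC"))
```

The high self-transition probabilities discourage frequent switching, so the model prefers long homogeneous segments over labelling each base independently.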

Locating ORFs is less effective for eukaryotic genomes. There are large non-coding regions between genes, and introns within genes. mRNA undergoes processing before translation (splicing and alternative splicing), and a protein-coding gene may contain stop codons within its introns. Post-translational modifications (PTMs) make gene prediction even more difficult. Several tools, such as SpliceView and ORF finder, attempt to locate ORFs or help to do so.

Gene Prediction Methods
Various pattern recognition methods are used to identify signals:

• Weight matrices
• Decision trees
• Hidden Markov models (HMMs)
• Artificial neural networks
• Linear discriminant analysis
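The first of these, the weight matrix, is simple enough to sketch in a few lines. The example below builds a log-odds matrix from a handful of aligned example sites and slides it along a sequence to find the best-scoring window. The aligned sites are made-up variants of a ribosome-binding-like motif, the background is assumed to be uniform, and the function names are illustrative.

```python
import math


def build_pwm(sites, pseudocount=1.0):
    """Log-odds weight matrix from equal-length aligned sites, against a uniform background."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(length):
        column = [site[pos].upper() for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / 0.25)
            for base in "ACGT"
        })
    return pwm


def score_window(pwm, window):
    """Sum of per-position log-odds scores for a window as wide as the matrix."""
    return sum(column[base] for column, base in zip(pwm, window.upper()))


def best_hit(pwm, seq):
    """Return (position, score) of the highest-scoring window in seq.

    Assumes seq is at least as long as the matrix and contains only A, C, G, T.
    """
    width = len(pwm)
    hits = [
        (i, score_window(pwm, seq[i:i + width]))
        for i in range(len(seq) - width + 1)
    ]
    return max(hits, key=lambda hit: hit[1])


# Made-up aligned training sites.
sites = ["AGGAGG", "AGGAGA", "AGGTGG", "AAGAGG"]
pwm = build_pwm(sites)
print(best_hit(pwm, "TTTTAGGAGGTTTT"))
```

In practice a score threshold is applied to every window instead of keeping only the single best hit, and the background frequencies are taken from the genome being analyzed.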

An algorithm can be:

• Rule-based
• Neural network based
• HMM based

GENSCAN is a general-purpose gene identification program that analyzes genomic DNA sequences from a variety of organisms, including human, other vertebrates, invertebrates and plants. GENSCAN:

• Identifies complete exon/intron structure of genes in genomic DNA
• Predicts multiple genes, partial and complete genes
• Uses an HMM to model gene structure

GENSCAN takes the following into account when making a prediction:

• Transcription signals
• Translation signals
• Splicing signals (donors, acceptors, and branch points)
• Exon length distributions
• Compositional features such as G+C content and hexamer frequency

Weaknesses of ab initio prediction
Ab initio methods are not reliable enough on their own, especially in eukaryotes. Exon sensitivity can be good, but specificity is poor (too many false positives). Ab initio prediction is therefore generally used to point sequence similarity searches in the right direction.
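The difference between sensitivity and specificity can be illustrated with a small exon-level calculation. The sketch below assumes the convention common in gene prediction evaluation, where specificity is the fraction of predicted exons that are correct (what other fields call precision), and it counts a predicted exon as correct only if both coordinates match exactly. The coordinates are invented for the example.

```python
def exon_accuracy(predicted, annotated):
    """Exon-level (sensitivity, specificity) for exact (start, end) matches.

    sensitivity = correct predictions / annotated exons
    specificity = correct predictions / predicted exons
    """
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Made-up coordinates: every real exon is found, but half the predictions are false.
annotated = [(100, 200), (300, 400), (500, 650)]
predicted = [(100, 200), (300, 400), (500, 650), (700, 780), (900, 960), (1200, 1300)]
print(exon_accuracy(predicted, annotated))
```

In this toy case all annotated exons are recovered, yet only half of the predictions are real, which is the pattern of good sensitivity and poor specificity described above.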