Much of bioinformatics involves sequences. Sequences are represented as strings of letters from an alphabet. DNA has an alphabet of 4 letters, while proteins have an alphabet of 20 letters.
The most basic sequence analysis task is to ask whether two sequences are related. This involves aligning the two sequences and then deciding whether they are truly related or whether the similarity arose just by chance. The key issues to consider are:
1. what sorts of alignments should be considered
2. the scoring system used to rank alignments
3. the algorithm used to find optimal (or good) scoring alignments
4. the statistical methods used to evaluate the significance
Finding similarity between sequences is important for many biological questions. Some examples:
- Finding similar proteins allows us to predict their function and structure.
- Locating similar subsequences in DNA allows us to identify regions of interest, such as regulatory elements.
- Locating overlapping DNA sequences helps us in sequence assembly.
Two similar sequences are probably biologically similar. Very often, similar sequences have similar 3D structures, which matters because the 3D structure of a protein largely determines its function. Residues occupying similar positions may therefore have similar functional roles. In addition, similar sequences can come from two species that share a common ancestor, which indicates an evolutionary relationship. Evolution tends to conserve the more efficient functional units, so sequences that code for important proteins are conserved among organisms.
When the underlying biological mechanisms are not understood, it is indispensable to compare a new, unknown sequence to sequences that we know better. Efficient and reliable algorithms are therefore becoming more and more important as the number of available sequences increases exponentially.
Similar, Identical, Homologous
Understanding the difference between similar and identical is crucial for sequence alignment. An identical pair is a pair of the same amino acid. A similar pair is a pair of amino acids that can be considered chemically similar at that position. Two amino acids are considered similar if substituting one for the other has a positive log-odds score in a scoring matrix.
VKASQRTTV
VK ++RTTV
VKPNKRTTV
In this example, V, K, R, and T form identical pairs, while S,N and Q,K are similar pairs. The remaining pair, A,P, is neither, so its position in the middle line is left blank.
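The middle line of such a display can be derived mechanically from the two sequences and a scoring matrix. The sketch below assumes a toy table, `TOY_SCORES`, holding illustrative log-odds values for only the three non-identical pairs in the example; a real analysis would take the scores from a full substitution matrix. The names `TOY_SCORES`, `pair_score`, and `midline` are made up for this illustration.

```python
# Illustrative log-odds scores for the non-identical pairs in the example.
# These values are assumptions for the sketch, not a complete matrix.
TOY_SCORES = {
    frozenset("SN"): 1,
    frozenset("QK"): 1,
    frozenset("AP"): -1,
}


def pair_score(a, b):
    """Score an unordered pair; pairs missing from the toy table count as dissimilar."""
    return TOY_SCORES.get(frozenset((a, b)), -1)


def midline(top, bottom):
    """Build the annotation line: letter = identical, '+' = similar, ' ' = neither."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    marks = []
    for a, b in zip(top, bottom):
        if a == b:
            marks.append(a)
        elif pair_score(a, b) > 0:
            marks.append("+")
        else:
            marks.append(" ")
    return "".join(marks)


print(midline("VKASQRTTV", "VKPNKRTTV"))  # prints "VK ++RTTV"
```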
Similarity can often be misleading. It can reveal evolutionarily related sequences or it can align two sequences with completely different function and structure. The challenge is to differentiate between the former and the latter.
Sequence alignment
A sequence alignment takes two sequences over the same alphabet as input and outputs an alignment of the two. Alignment simply refers to placing one symbol against another. It consists of writing the two sequences one above the other and inserting gap symbols so that both rows have the same length. Any placement is permitted as long as the order of the symbols in each sequence is not modified. The alignment step itself involves no judgment of quality.
Let's look at the following two sequences:
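The two constraints on an alignment (equal row lengths, symbol order preserved) can be checked in a few lines. This is a minimal sketch, assuming `-` is used as the gap symbol; the function name `is_valid_alignment` and the short test sequences are hypothetical.

```python
GAP = "-"  # assumed gap symbol


def is_valid_alignment(row_a, row_b, seq_a, seq_b):
    """Check that two gapped rows form an alignment of seq_a and seq_b."""
    same_length = len(row_a) == len(row_b)
    # Removing the gaps must give back each original sequence, in order.
    keeps_a = row_a.replace(GAP, "") == seq_a
    keeps_b = row_b.replace(GAP, "") == seq_b
    return same_length and keeps_a and keeps_b


print(is_valid_alignment("ACGT", "A-GT", "ACGT", "AGT"))  # True
print(is_valid_alignment("ACGT", "AG-T", "ACGT", "AGT"))  # True
print(is_valid_alignment("ACGT", "GA-T", "ACGT", "AGT"))  # False
```

The first two calls both succeed even though one alignment looks better than the other: validity says nothing about quality. The third fails because the order of the symbols in the second sequence was changed.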
GCGCATGGATTGAGCGA
TGCGCCATTGATGACCA
A possible alignment could be:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
The string GCGC is a perfect match. The G in the ninth column of the alignment is a mismatch, since it is paired with T. The - symbols are indels (insertions or deletions); they shift the sequences so that more symbols can match. Many different alignments are possible. The trick is to choose the most likely one. This is accomplished by scoring alignments and is covered in the next section.
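To make the three kinds of columns concrete, the following sketch labels every column of the alignment above as a match, a mismatch, or an indel and then tallies them. The helper name `classify_columns` is hypothetical, and `-` is assumed to be the gap symbol.

```python
from collections import Counter


def classify_columns(row_a, row_b, gap="-"):
    """Label each alignment column as 'match', 'mismatch', or 'indel'."""
    labels = []
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            labels.append("indel")
        elif x == y:
            labels.append("match")
        else:
            labels.append("mismatch")
    return labels


top = "-GCGC-ATGGATTGAGCGA"
bottom = "TGCGCCATTGAT-GACC-A"

labels = classify_columns(top, bottom)
counts = Counter(labels)
print(counts["match"], counts["mismatch"], counts["indel"])  # 13 2 4

# 1-based column numbers of the mismatches
print([i + 1 for i, lab in enumerate(labels) if lab == "mismatch"])  # [9, 16]
```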
Sequence identity refers to the occurrence of exactly the same nucleotide or amino acid at the same position in two aligned sequences. Sequence similarity is meaningful only when possible substitutions are scored according to the probability with which they occur. Sequence homology indicates evolutionary relatedness: two sequences are said to be homologous if they are both derived from a common ancestral sequence. In short, similarity refers to the presence of identical and similar sites in two sequences, while homology is the stronger claim that the two sequences share a common ancestor.
Similarity is not defined in a unique and exact manner. It draws on a mix of biological knowledge and mathematical and heuristic concepts. Measuring sequence similarity is not like comparing two texts to state whether they are the same or different: it must tolerate gaps and substitutions. This is an optimization problem that can be formulated as a dynamic programming problem. The idea is to give a score to each pair of residues using a substitution matrix, and then to search for the insertions and deletions that maximize the global score. In addition, the degree of similarity must be validated biologically and statistically, so that accidental similarity can be distinguished from similarity based on biological factors.
Note: Parts of this post are a summary of Durbin.
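As a rough illustration of the dynamic programming formulation, the sketch below computes the best global alignment score in the style of the Needleman-Wunsch algorithm. It is a simplification under stated assumptions: a flat score of +1 for a match and -1 for a mismatch stands in for a substitution matrix, and each gap position costs -1. The function name and parameter defaults are hypothetical.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score by dynamic programming (Needleman-Wunsch style).

    Assumes a flat match/mismatch score and a linear gap penalty.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against nothing but gaps
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # a[i-1] against a gap
                curr[j - 1] + gap,   # b[j-1] against a gap
            ))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))  # 2: three matches and one gap
print(global_alignment_score("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACCA"))
```

Under this scheme the alignment shown earlier, with 13 matches, 2 mismatches, and 4 gap positions, scores 13 - 2 - 4 = 7, so the second call returns at least 7. Only two rows of the table are kept in memory because this version reports the score alone; recovering the alignment itself would require keeping the full table for a traceback.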