Brought to you by molecularsciences.org.
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 License.
This publication may not be redistributed without this notice.

Pairwise Alignment

Much of bioinformatics involves sequences. Sequences are represented with strings of letters in an alphabet. DNA has an alphabet of 4 letters while proteins have an alphabet of 20 letters.

The most basic sequence analysis is to ask if two sequences are related. This involves aligning two sequences and then deciding whether the sequences are related or is the similarity just by chance. The key issues to ponder over are:

1. what sorts of alignments should be considered
2. the scoring system used to rank alignments
3. the algorithm used to find optimal (or good) scoring alignments
4. the statistical methods used to evaluate the significance

Finding similarity between sequences is important for many biological questions. Some examples:

Two similar sequences are probably biologically similar. Very often similar sequences have similar 3D structures. This is important since the 3D structure of a protein defines its functions. In addition, similar sequences can come from two species which share a common ancestor, thereby indicating their evolutionary relationship. In other words, the residues occupying similar positions could have similar functional roles. Evolution tends to conserve the more efficient functional units. Therefore, important sequences which code for the important proteins are conserved among organisms in nature.

In the absence of comprehension of the biological mechanisms, it is indispensable to compare a new unknown sequence to known sequences that we know better. Therefore, discovery of efficient and reliable algorithms are becoming more and more important as the number of sequences increase exponentially.

Similar, Identical, Homologous

Understanding the difference between similar and identical is crucial for sequence alignment. An identical pair is a pair of two same amino acids. A similar pair is a pair of amino acids which could be considered chemically similar in that certain position. Two amino acids are considered similar if one can be substituted for another with a positive log odds score from a scoring matrix.

VKASQRTTV
VK ++RTTV
VKPNKRTTV

In this example, T, V, R, and K are identical pairs while S,N and Q,K are similar pairs.

Similarity can often be misleading. It can reveal evolutionarily related sequences or it can align two sequences with completely different function and structure. The challenge is to differentiate between the former and the latter.

Sequence alignment

A sequence alignment takes two sequences of the same alphabet as input and outputs an alignment of the two sequences. Alignment simply refers to placing one symbol against another. It does not involve judging the quality of the alignment. An alignment consists of writing two sequences one on each axis and inserting letters and symbols such that the two sequences have the same length. All methods are permitted as long as the order of the symbols in the sequences is not modified. There is no quality evaluation in the alignment step.

Lets look at the following two sequences:

GCGCATGGATTGAGCGA
TGCGCCATTGATGACCA

A possible alignment could be:

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

The string GCGC is a perfect match. The eight character G is a mismatch since it matches with T. The - symbols are indel (insertions or deletions) as they allow for an more optimal match to occur. Many different alignments are possible. The trick is to choose the most likely alignment. This is accomplished by scoring alignments and is covered in the next section.

Sequence identity refers to the occurrence of exactly the same nucleic acid or amino acid in the same position in two aligned sequences. Sequence similarity is meaningful only when possible substitutions are scored according to the probability with which they occur. Sequence homology indicates evolutionary relatedness among sequences. Two sequences are said to be homologous if they are both derived from a common ancestral sequence. Similarity refers to the presence of identical and similar sites in two sequences, while homology reflects a stronger claim that the two sequences share a common ancestor.

Similarity is not definite in a unique and exact manner. It is a mix of biological knowledge and mathematical and heuristic concepts. Sequence similarity is not about comparing two texts to state whether they are similar or different. A sequence similarity must be capable of tolerating gaps and substitutions. This is an optimization problem which could be formulated in a dynamic programming problem. The idea is to give a score to each pair of residues. Then search for insertions and deletions which can maximize the global score using a substitution matrix. In addition, the degree of similarity must be validated biologically and statistically. It is also important to be able to distinguish between accidental similarity and similarity based on biological factors.

Note: Parts of this post are summary of Durbin.

Scoring

Scoring Model

When we compare sequences, we are looking for evidence that they have diverged from a common ancestor by a process of mutation and selection. Basic mutational processes are:

  • Substitutions: Residue changes in the sequence
  • Insertions: Addition of a residue
  • Deletions: Removal of a residue

Insertions and deletions, together, are called gaps.

The total score we assign to an alignment is the sum of terms for each aligned pair of residues, plus terms for each gap. In probabilistic interpretation, this would correspond to the log of the relative likelihood that the sequences are related, compared to being unrelated. In other words it is the log of the probability of being related to another sequence compared to the log of the probability of being unrelated.

The easiest scoring method is to assume that the each element of the sequence evolved independently and that the probability of a mutation is 1/20. However, this is an erroneous assumption since some changes are more plausible than others. The plausibility depends on properties of the amino acids. Amino acids which are likely to preserve the structure and function of the protein are more likely to be preserved over evolution than ones which modify. It is , therefore, expected that the identities and conservative substitutions are more likely to occur than randomly conserved regions. Thus true positives are more likely to have a positive score while random substitutions are expected to contribute towards a negative score.

Using an additive scoring corresponds to an assumption that we can consider mutations at different sites of the sequence to have occurred independently. In other words, each gap is a mutation. This seems to be a reasonable assumption for DNA and protein sequences. However, this assumption is seriously inaccurate for RNA, since RNA is transcribed from DNA.

Additive scoring function is defined as follows:

  • (x,y) is the score of replacing x by y
  • (x,-) is the score of deleting x
  • (-,x) is the score of inserting x

The score of an alignment is the sum of position scores

The optimal or maximal score between two sequences is the maximal score of all alignments of these sequences, namely:

d(s1,s2) = max(alignment score)

The additive form of the score allows us to perform dynamic programming to compute optimal score efficiently.

Substitution Matrices / Scoring Matrices

What you really want to learn when evaluating a sequence alignment is whether a given alignment is random or meaningful. To access the meaningfulness of an alignment we construct a scoring matrix.

A scoring matrix is a table of values that describe the probability of a residue pair occurring in an alignment. The values in a scoring matrix are logarithms of ratios of two probabilities. The first is the probability of random occurrence of an amino acid in a sequence alignment. The second is the probability of meaningful occurrence of a pair of residues in a sequence alignment.

In order to score an alignment, the alignment program needs to know whether it is more likely or less likely that a given amino acid pair occurred randomly. Negative log odds ratio is random while positive indicates an evolutionary relationship. It is important to note that the scores are logarithms so a match of 2 residues is far from a coincidence.

Notation
sequences: x, y
xi is the ith symbol in x, yj is the jth symbol in y
A is the alphabet e.g. A = {A, C, G, T} for DNA
symbols from the alphabet are a, b, ...

The unrelated or random model R is the simplest. It assumes that a occurs independently with some frequency qa, the probability of two sequences is just the product of the probabilities of each amino acid:

P(x,y|R) = qxi qyi
i j

The product of the frequencies of each element of sequence x multiplied by the product of the frequencies of each element of sequence y.

In the alternative match model M, aligned pairs of residues occur with a joint probability pab. pab can be thought of as the probability that the residues a and b have each independently been derived from some unknown original residue c which was present in their common ancestor. This gives:

P(x,y|M) = pxiyi
i

Joint probability is the probability of two or more things happening at once. In our case, this is the probability of finding the same nucleotide or amino acid on both sequences. In this model, we take the product of the probabilities of getting the same residues on both sequences.

The ratio between these the values computed by these two formulas is called the odds ratio. When we take the log of this ratio, we arrive at the log-odds ratio. To log likelihood ratio of a residue pair computed with:

s(a,b) = log( pab )
qaqb

This is basically the log of the joint probability of a pair divided by the product of the frequencies of each member of the pair. The sum of this value for each pair in both sequences gives us log-odds ratio.

S = s(xi,yi)
i

The log-odds ratio can also be looked at as the sum of P(alternative) / P(random).

There are several ways to derive substitution scores, however, substitution scoring based on probabilistic models seems to be the most accurate.

In order to arrive at an additive scoring system, we take the log of this ratio. The log likelihood ratios can be arranged in a matrix. DNA has a 4 x 4 matrix while proteins have a 20 x 20 matrix. This matrix is called the score matrix or substitution matrix. Blosum50 and PAM are the most commonly used matrices.

Substitution matrices essentially make a statement about the probability of observing ab pairs in real alignments.

Gap Penalties


DNA sequences change not only by point mutation, but by insertion and deletion of residues as well. Consequently, it is often necessary to introduce gaps into one or both of the sequences being aligned to produce a meaningful alignment between them.

Gaps have to be penalized. The standard cost associated with a gap of length g is given either by a linear score or an affine score.

γ(g) = -gd
γ(g) = -d-(g-1)e

where d is called the gap-open penalty and e is called the gap extension penalty.

Most sequence alignment models use affine gap penalties where the cost of opening a gap in a sequence is different from the cost of extending a gap that has already been started. The extension penalty is usually set to a number less than the gap-open penalty d. This allows insertions and deletions to be penalized less than they would in linear gap cost. This is desirable when gaps of a few residues are expected almost as often as gaps of a single residue. [1]

Gap penalties also correspond to a probabilistic model of alignment, although this is less widely recognized than the probabilistic basis of substitution matrices. We assume that the probability of a gap occurring at a particular site in a given sequence is the product of a function f(g) of the length of the gap, and the combined probability of the set of inserted residues. In other words, the length of a gap is not correlated to the residues it contains. Here the gap penalties correspond to the log probability of a gap of that length. [1]

On the other hand, if there is evidence for a different distribution of residues in gap regions then there should be residue-specific scores for the unaligned residues in gap regions. These scores should be equal to the logs of the ratio of their frequencies in gapped versus aligned regions. For example, a sequence is more likely to be in a hydrophobic region of the protein. [1]

Gap penalties are intimately tied to the scoring matrix that aligns the sequences. The best pair of gap opening and extension penalties for one scoring matrix doesn’t necessarily work with another.

Linear Gap Penalty
Linear gap penalties are the simplest type of gap penalty. The only parameter, d, is a penalty per gap. This is almost always negative, so that the alignment with fewer gaps is favored over the alignment with more gaps. Under a linear gap penalty, the overall penalty for one large gap is the same for many small gaps.

Affine Gap Penalty
Affine gap penalties attempt to overcome this problem. In biological sequences, for example, it is much more likely that one big gap of length 10 occurs in one sequence, due to a single insertion or deletion event, than it is that 10 small gaps of length 1 are made. Therefore, affine gap penalties have a gap opening penalty, c, and a gap extension penalty, e. A gap of length l is then given a penalty c + (l-1)e. So the gaps are discouraged, c and e are almost always negative. Since a few large gaps is better than many small gaps, e is almost always smaller than c.

Source

[1] Durbin

Significance of Scores

Once we have an optimal alignment, we need to access the significance of its score. This would permit us to decide if it is a biologically meaningful alignment or not.

We look at the distribution of the maximum N match scores to independent random sequences. If the probability of this maximum being greater than the observed best is small, then the observation is considered significant.

Alignment Algorithms

Given a scoring system, we need to have an algorithm for finding an optimal alignment for a pair of sequences. When both sequences have the same length, there is only one possible global alignment of the complete sequences, but things get complicated once gaps are allowed or when we look for local alignment between subsequences of two sequences. It is not computationally feasible to enumerate all possible matches. For two sequences of length n, there are:

possible global alignments between the two. Clearly, this is an NP hard problem.

Dynamic Programming

A dynamic programming algorithm is an algorithm for finding optimal alignments which use additive alignment scores. Dynamic programming is crucial for computational sequence analysis. Unlike heuristic methods, dynamic programming algorithms are guaranteed to find the optimal scoring alignment or set of alignments. Dynamic programming involves dividing the problem into smaller problems and storing the results in a table. It is like a recursion with memory. In the previous section, we defined additive scoring function as:

  • (x,y) is the score of replacing x by y
  • (x,-) is the score of deleting x
  • (-,x) is the score of inserting x

The score of an alignment is the sum of position scores. The optimal score between two sequences is the alignment which gives the maximal score. As we have just seen, enumerating all possible alignments is not feasible. In a log-odds ratio scoring scheme, better alignment would produce higher scores. To find the optimal alignment, we would like to maximize the score. In terms of a Blosum50 matrix, we want to maximize the positive values and minimize the smaller values. Since, dynamic programming is recursion with a memory, lets look at how the recursion argument would be constructed.

Suppose we have two sequences:

s[1..n+1] and t[1..m+1]

The best alignment must be one of three cases:

1. Last match is (s[n+1],t[m +1] )
2. Last match is (s[n +1],-)
3. Last match is (-, t[m +1] )

Thus:

1. d(s[1..n + 1], t[1..m + 1]) = d(s[1..n], t[1..m]) + σ(s[n+1], t[m+1])
2. d(s[1..n + 1], t[1..m + 1]) = d(s[1..n], t[1..m + 1]) + σ(s[n+1], -)
3. d(s[1..n + 1], t[1..m + 1]) = d(s[1..n + 1], t[1..m]) + σ(-, t[m+1])

where σ(s,t) is the gap cost.

Global Alignment: Needleman & Wunsch Algorithm

We now construct a matrix F indexed by i and j, one index for each sequence, where F(i,j) is the score of the best alignment between the initial segments of each sequence. F(i,j) is built recursively.

F(i,j) = d(s[i..i],t[1..j])

Using our recursive argument, we get the following reference:

Graphically, this translates to the following:

Certain texts write this algorithm from the perspective of F(i-1,j-1), but I find this method more intuitive. This, of course, makes no difference in the algorithm.

We need to first handle the base cases in recursion.

F(0,0) = 0
F(i+1,0) = F(i,0) + σ(s[i+1],-)
F(0,j+1) = F(0,j) + σ(-,t[j+1])

This allows us to fill the first column and the first row. Since we are using using linear gaps, we need to assign a gap cost. Here, I have assigned a gap cost of 2. So the values for the first row and column would be 0, -2, -4, etc. Graphically, it looks like the following:

Now we need to find out F(1,1). We know that A and A are a perfect match. Therefore, we add 1 to the first equation since it represents a perfect match. The other two represent A,- and -,A matches. To fill in a value for F(1,1), and to fill the rest of the table, we need to find the maximum of the three.

F(1,1) = max(0+1, -2, -2) = 1
F(1,2) = max(-2+1, -4, 1-2) = -1
...

Remember that A,- and -,A are penalized by gap costs.

Thus the conclusion is that d(AAAC, AGC) = -1. To find the best alignment, we would need to traceback to F(0,0). In this step, we start from the last cell and simply point our arrows back to the cells we used to derive our cells.

The traceback gives us the best alignment. In this case, the alignment is:

AAAC
AA-G

We chose an arbitrary gap cost for our example. If we had chosen a different value such as 8, we would still have gotten the same traceback.

This algorithm has both space and time complexity of O(mn), since filling the table requires O(mn) and the traceback requires O(m+n).

In programming terms, N&W involves an iterative matrix method of calculation. All possible pairs of residues (bases or amino acids) - one from each sequence - are represented in a 2-dimensional array. All possible alignments (comparisons) are represented by pathways through this array.

The following four steps are necessary to align sequence1 of N positions with sequence2 of M positions:

1. Build a matrix of size N * M;
2. Assign similarity values;
3. For each cell, look at all possible pathways back to the beginning of the sequence and give that cell the value of the maximum scoring pathway;
4. Construct an alignment (pathway) back from the highest scoring cell to give the highest scoring alignment.

Try out graphical alignment at http://www.itu.dk/people/sestoft/bsa/graphalign.html

Local Alignment: Smith-Waterman Algorithm

Global alignment is useful when we want to align two sequences completely. Very often, however, two sequences do not align completely. In fact we are usually more interested in finding best alignment of subsequences. For example, we would like to find out if human and mouse haemoglobins are homologous. The highest scoring alignment of subsequences of sequence s and sequence t is called the best local alignment.

The algorithm for finding local alignments is similar to the global alignment algorithm with two notable differences.

  1. F(i,j) can take a 0 value if all other values are less than 0. 0 value corresponds to starting a new local alignment.
  2. The traceback can start from anywhere in the matrix. It starts at the maximum value and ends at 0.

The algorithm is as follows:

The base cases are:

F(0,0) = 0
F(i+1, 0) = max(0, F(i,0) + σ(s[i+1],-))
F(0, j+1) = max(0, F(0,j) + σ(-,t[j+1]))

If we have two sequences, s=TAATA and t=TACTAA, we would get the following alignments:

TAATA_
TACTAA

___TAATA
TACTAA

For local alignment to work, the expected score for a random match must be negative. If that is not true, then long matches between entirely unrelated sequences will have high scores, just based on their length. As a consequence, although the algorithm is local, the maximal scoring alignments would be global or nearly global. A true subsequence alignment would be likely to be masked by a longer but incorrect alignment, just because of its length. Similarly, there must be some σ(s,t) greater than 0, otherwise the algorithm won't find any alignment at all.[1]

The random match is required to have a negative value. In an ungapped case, only the expected value of a fixed length alignment can be considered and it must be noted that in a random model, all residues are independent. The gives the following formula:

where qa is the probability that s would occur in any given position in a sequence. When σ(s,t) is derived as a log likelihood ratio, using the same qa as for random probabilities, the equation above is satisfied. No equivalent analysis for optimal ungapped alignments exist. There is no analytical method for predicting which gap scores will result in local vs. global alignment behavior.

Repeated Matches

If one or both sequences are long enough, we would most probably find several different local alignments with a significant score. For example, we might find several copies of a repeated domain or motif in a protein. Here we are interesting in an asymmetric method which finds one or more non-overlapping copies of sections of one sequence (e.g. domain or motif) in the other. [1]

In our algorithm, we are interested in sequence matches with score higher than a certain threshold T. The reason behind defining T is that we would always find small subsequences with small positive scores which would quite likely match unrelated sequences.

Following notation is used for this algorithm:

  • y is a sequence containing some domain or sequence
  • x is the sequence in which we are looking for multiple matches
  • T is some threshold score value

Once again, we use F(i,j) matrix but the recurrence in now different, as is the meaning of F(i,j). In the final alignment, x will be partitioned into regions that match parts of y in gapped alignments, and regions that are unmatched. The score of a completed match region is the standard gapped alignment score minus threshold T. [1]

The algorithm obtains all local alignments in one pass. Changing the value of T changed what the algorithm finds.

Overlap Matches



Source:

[1] Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by Durbin.
[2] Dr. Nir Friedman's lectures: www.cs.huji.ac.il/~nir
[3] http://www.itu.dk/people/sestoft/bsa/graphalign.html
[4] http://thor.info.uaic.ro/~ciortuz/SLIDES/pairAlign.pdf

Dynamic Programming with more complex models

Linear gap model scoring scheme is not ideal for biological sequences since gaps are often longer than one residue. If we are given a general function γ(g) then we can still use all the dynamic programming algorithms described in the previous section with adjustments to the recurrence relation. This however require O(n3) operations, thus not feasible.

Alignment with affine gap scores

The alternative is to assume an affine gap cost structure. This brings us back to O(n2) implementation of dynamic programming. This, however, requires us to keep track of multiple values for each pair of residue coefficients (i,j) in place of F(i,j). This corresponds to FSA. An alignment corresponds to a path through the states, with symbols from the underlying pair of sequences being transferred to the alignment according to the Δ(i,j) values in the states.

Heuristic Algorithms

Dynamic programming guarantees to find the best solution but has a complexity of O(mn). Heuristic algorithms do not guarantee the best solution but are very fast in comparison with deterministic algorithms such as dynamic programming.

BLAST

BLAST finds high scoring local alignments between a query sequence and a target database, both of which can be either DNA or protein. The idea is that true match alignments are very likely to contain short stretches of high scoring identities. We use these as seed and expand the alignments. BLAST makes a list of all neighborhood words of a fixed length that would match the query sequence somewhere with score higher than some threshold.

FASTA

FASTA uses a multistep approach to finding local high scoring alignments:

1. Lookup table to locate all identically matching words of length ktup between 2 sequences.
2. Lookup diagonals with many mutually supporting word matches
3. Pursue the best diagonal, extending the exact word matches to find maximal scoring ungapped regions. This is analogous to hit extension in blast.
4. Check to see if any of these ungapped regions can be joined by a gapped region, allowing for gap costs.
5. The highest scoring candidate matches in a search database are realigned using the full dynamic programming algorithm, but restricted to a subregion of the dynamic programming matrix forming a band around the candidate heuristic search.