Alignment Algorithms

Given a scoring system, we need to have an algorithm for finding an optimal alignment for a pair of sequences. When both sequences have the same length, there is only one possible global alignment of the complete sequences, but things get complicated once gaps are allowed or when we look for local alignment between subsequences of two sequences. It is not computationally feasible to enumerate all possible matches. For two sequences of length n, there are:

possible global alignments between the two. Clearly, this is an NP hard problem.

Dynamic Programming

A dynamic programming algorithm is an algorithm for finding optimal alignments which use additive alignment scores. Dynamic programming is crucial for computational sequence analysis. Unlike heuristic methods, dynamic programming algorithms are guaranteed to find the optimal scoring alignment or set of alignments. Dynamic programming involves dividing the problem into smaller problems and storing the results in a table. It is like a recursion with memory. In the previous section, we defined additive scoring function as:

  • (x,y) is the score of replacing x by y
  • (x,-) is the score of deleting x
  • (-,x) is the score of inserting x

The score of an alignment is the sum of position scores. The optimal score between two sequences is the alignment which gives the maximal score. As we have just seen, enumerating all possible alignments is not feasible. In a log-odds ratio scoring scheme, better alignment would produce higher scores. To find the optimal alignment, we would like to maximize the score. In terms of a Blosum50 matrix, we want to maximize the positive values and minimize the smaller values. Since, dynamic programming is recursion with a memory, lets look at how the recursion argument would be constructed.

Suppose we have two sequences:

s[1..n+1] and t[1..m+1]

The best alignment must be one of three cases:

1. Last match is (s[n+1],t[m +1] )
2. Last match is (s[n +1],-)
3. Last match is (-, t[m +1] )

Thus:

1. d(s[1..n + 1], t[1..m + 1]) = d(s[1..n], t[1..m]) + σ(s[n+1], t[m+1])
2. d(s[1..n + 1], t[1..m + 1]) = d(s[1..n], t[1..m + 1]) + σ(s[n+1], -)
3. d(s[1..n + 1], t[1..m + 1]) = d(s[1..n + 1], t[1..m]) + σ(-, t[m+1])

where σ(s,t) is the gap cost.

Global Alignment: Needleman & Wunsch Algorithm

We now construct a matrix F indexed by i and j, one index for each sequence, where F(i,j) is the score of the best alignment between the initial segments of each sequence. F(i,j) is built recursively.

F(i,j) = d(s[i..i],t[1..j])

Using our recursive argument, we get the following reference:

Graphically, this translates to the following:

Certain texts write this algorithm from the perspective of F(i-1,j-1), but I find this method more intuitive. This, of course, makes no difference in the algorithm.

We need to first handle the base cases in recursion.

F(0,0) = 0
F(i+1,0) = F(i,0) + σ(s[i+1],-)
F(0,j+1) = F(0,j) + σ(-,t[j+1])

This allows us to fill the first column and the first row. Since we are using using linear gaps, we need to assign a gap cost. Here, I have assigned a gap cost of 2. So the values for the first row and column would be 0, -2, -4, etc. Graphically, it looks like the following:

Now we need to find out F(1,1). We know that A and A are a perfect match. Therefore, we add 1 to the first equation since it represents a perfect match. The other two represent A,- and -,A matches. To fill in a value for F(1,1), and to fill the rest of the table, we need to find the maximum of the three.

F(1,1) = max(0+1, -2, -2) = 1
F(1,2) = max(-2+1, -4, 1-2) = -1
...

Remember that A,- and -,A are penalized by gap costs.

Thus the conclusion is that d(AAAC, AGC) = -1. To find the best alignment, we would need to traceback to F(0,0). In this step, we start from the last cell and simply point our arrows back to the cells we used to derive our cells.

The traceback gives us the best alignment. In this case, the alignment is:

AAAC
AA-G

We chose an arbitrary gap cost for our example. If we had chosen a different value such as 8, we would still have gotten the same traceback.

This algorithm has both space and time complexity of O(mn), since filling the table requires O(mn) and the traceback requires O(m+n).

In programming terms, N&W involves an iterative matrix method of calculation. All possible pairs of residues (bases or amino acids) - one from each sequence - are represented in a 2-dimensional array. All possible alignments (comparisons) are represented by pathways through this array.

The following four steps are necessary to align sequence1 of N positions with sequence2 of M positions:

1. Build a matrix of size N * M;
2. Assign similarity values;
3. For each cell, look at all possible pathways back to the beginning of the sequence and give that cell the value of the maximum scoring pathway;
4. Construct an alignment (pathway) back from the highest scoring cell to give the highest scoring alignment.

Try out graphical alignment at http://www.itu.dk/people/sestoft/bsa/graphalign.html

Local Alignment: Smith-Waterman Algorithm

Global alignment is useful when we want to align two sequences completely. Very often, however, two sequences do not align completely. In fact we are usually more interested in finding best alignment of subsequences. For example, we would like to find out if human and mouse haemoglobins are homologous. The highest scoring alignment of subsequences of sequence s and sequence t is called the best local alignment.

The algorithm for finding local alignments is similar to the global alignment algorithm with two notable differences.

  1. F(i,j) can take a 0 value if all other values are less than 0. 0 value corresponds to starting a new local alignment.
  2. The traceback can start from anywhere in the matrix. It starts at the maximum value and ends at 0.

The algorithm is as follows:

The base cases are:

F(0,0) = 0
F(i+1, 0) = max(0, F(i,0) + σ(s[i+1],-))
F(0, j+1) = max(0, F(0,j) + σ(-,t[j+1]))

If we have two sequences, s=TAATA and t=TACTAA, we would get the following alignments:

TAATA_
TACTAA

___TAATA
TACTAA

For local alignment to work, the expected score for a random match must be negative. If that is not true, then long matches between entirely unrelated sequences will have high scores, just based on their length. As a consequence, although the algorithm is local, the maximal scoring alignments would be global or nearly global. A true subsequence alignment would be likely to be masked by a longer but incorrect alignment, just because of its length. Similarly, there must be some σ(s,t) greater than 0, otherwise the algorithm won't find any alignment at all.[1]

The random match is required to have a negative value. In an ungapped case, only the expected value of a fixed length alignment can be considered and it must be noted that in a random model, all residues are independent. The gives the following formula:

where qa is the probability that s would occur in any given position in a sequence. When σ(s,t) is derived as a log likelihood ratio, using the same qa as for random probabilities, the equation above is satisfied. No equivalent analysis for optimal ungapped alignments exist. There is no analytical method for predicting which gap scores will result in local vs. global alignment behavior.

Repeated Matches

If one or both sequences are long enough, we would most probably find several different local alignments with a significant score. For example, we might find several copies of a repeated domain or motif in a protein. Here we are interesting in an asymmetric method which finds one or more non-overlapping copies of sections of one sequence (e.g. domain or motif) in the other. [1]

In our algorithm, we are interested in sequence matches with score higher than a certain threshold T. The reason behind defining T is that we would always find small subsequences with small positive scores which would quite likely match unrelated sequences.

Following notation is used for this algorithm:

  • y is a sequence containing some domain or sequence
  • x is the sequence in which we are looking for multiple matches
  • T is some threshold score value

Once again, we use F(i,j) matrix but the recurrence in now different, as is the meaning of F(i,j). In the final alignment, x will be partitioned into regions that match parts of y in gapped alignments, and regions that are unmatched. The score of a completed match region is the standard gapped alignment score minus threshold T. [1]

The algorithm obtains all local alignments in one pass. Changing the value of T changed what the algorithm finds.

Overlap Matches



Source:

[1] Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by Durbin.
[2] Dr. Nir Friedman's lectures: www.cs.huji.ac.il/~nir
[3] http://www.itu.dk/people/sestoft/bsa/graphalign.html
[4] http://thor.info.uaic.ro/~ciortuz/SLIDES/pairAlign.pdf