Multiple sequence alignment techniques are most commonly applied to protein sequences; ideally they are a statement of both evolutionary and structural similarity among the proteins encoded by each sequence in the alignment.
Multiple alignments must usually be inferred from primary sequences alone. Biologists produce high quality multiple sequence alignments by hand using expert knowledge of protein sequence evolution. This knowledge comes from experience. Important factors include:
The phylogenetic relationships between sequences dictate constraints on the changes that occur in columns and in the patterns of gaps.
Manual alignment is tedious. To automate the process, it is hard to define exactly what an optimal multiple sequence alignment is, and it is impossible to set a standard for a single correct multiple alignment. In theory, there is one underlying evolutionary process and one evolutionarily correct alignment generated from any group of sequences. However, the differences between sequences can be so great in parts of an alignment that there isn’t an apparent, unique solution to be found by an alignment algorithm. Those same divergent regions are often structurally unalignable as well. Most of the insight that we derive from multiple alignments comes from analyzing the regions of similarity, not from attempting to align highly diverged regions.
In general, an automatic method must have a way to assign a score so that better multiple alignments get better scores. We should carefully distinguish the problem of scoring a multiple alignment from the problem of searching over possible multiple alignments to find the best one.
To automate multiple alignment, we need to do the following:
In a multiple sequence alignment, homologous residues among a set of sequences are aligned together in columns. ‘Homologous’ is meant for both structural and evolutionary sense. Ideally, a column of aligned residues occupy similar 3D structural positions and all diverge from a common ancestral residue.
Except for trivial cases of highly identical sequences, it is not possible to unambiguously identify structurally or evolutionarily homologous positions and create a single ‘correct’ multiple alignment. Since protein structures also evolve, we do not expect 2 protein structures with different sequences to be entirely superposable. Even the definition of ‘structurally superposable’ is subjective and can be expected to vary among experts.
In principle, there is always an unambiguously correct alignment even if the structures diverge. In practice, however, an evolutionarily correct alignment can be even more difficult to infer than a structural alignment. Structural alignment has an independent point of reference, superposition of x-ray crystallography or NMR structures. The evolutionary history of the residues of a sequence family cannot be independently known from any source. It must be inferred from sequence alignment.
The program should not be asked to produce exactly the same alignment. Instead, it should be focused on the subset of columns corresponding to key residues and core structural elements that can be aligned with confidence.
Note: This post is a summary of chapter 6.1 of Durbin.
The scoring system should take 2 important features into account:
1. some positions are more conserved than others
2. sequences are not independent, but instead are related by a phylogenetic tree
An idealized way to score a multiple alignment would be to specify a complete probabilistic model of molecular sequence evolution. Given the correct phylogenetic tree for the sequences, the probability of a multiple alignment is the product of all the evolutionary events necessary to produce that alignment via ancestral intermediate sequences times the prior probability of the root ancestral sequence. This evolutionary model would be very complex. The probabilities of evolutionary change would depend on the evolutionary times along each branch of the tree, as well as position specific structural and functional constraints imposed by natural selection. This way key residues and structural elements would be conserved. High probability alignments would then be good structural and evolutionary alignments under this model. Unfortunately, we don’t have enough data to parameterize this model. Therefore assumptions must be made.
Almost all alignment methods assume that the individual columns of an alignment are statistically independent. Such a scoring function is written as follows:

Where mi is the column i of the multiple alignment m, S(mi) is the score for the column i, and G is a function for scoring the gaps that occur in the alignment. Most multiple alignment methods use affine scoring functions that pay a higher cost for opening the gap than extending it, so successive gaps are not treated independently. In this function, we are basically alignment score for each column. The sum is then added to the function G for scoring gaps.
Now let's focus on definitions of S(mi) for scoring a column of aligned residues with no gaps.
Lets look at this equation:

where δ(...) is one if the condition inside the function is true, 0 otherwise.
If the phylogenetic tree for sequences has many intermediate ancestors, then the statistical dependence between sequences is complex. The scoring problem is greatly simplified if we assume that sequences have all been generated independently. If we assume that residues within the column are independent, as well as being independent between columns, then the probability of a column mi is:

P(mi) is obtained by multiplying the probabilities of obtaining a residue a in column i of a sequence j. We can define a column score as the negative log of this probability:
This entropy measure is a convenient measure of the variability observed in an aligned column of residues. The more variable the column is, the higher the entropy. A completely conserved measure would score 0. A good alignment is one which minimizes the total entropy of the alignment.
Thus, in return for giving up evolutionary tree and assuming independence between sequences, we gain the ability to straightforwardly estimate a position-specific model of both residue probabilities in columns and insertions and deletions. This assumption, however, can only be reasonable if representative sequences of a sequence family are chosen carefully. In practice, sample sequences are often biased with under or over representations of sub families. Several tree-based weighting schemes have been devised to deal with this.
Sum-of-Pairs scores sum all possible pairwise match scores between amino acids in an aligned column; entropic scores use Shannon's information theoretical entropy to measure the diversity of symbols (amino acids) in a column; matrix scores employ a substitution matrix to evaluate stereochemical diversity in a column; sequence weighted scores normalize against redundancy of sequences in the alignment.
The standard method of scoring multiple alignments is not the HMM formulation, but is similar in that it does NOT use a phylogenetic tree and it assumes statistical independence for the columns. Columns are scored by an SP function using substitution scoring matrix. The SP score for a column is defined as:

where scores s(a,b) come from a substitution scoring matrix such as PAM or BLOSUM matrix. Fro simple linear gap costs, gaps are handled by defining s(a,-) and s(-,a) to be the gap cost, and s(-,-) to be zero. Otherwise gap costs are scored separately (e.g. affine gap cost).
Since substitution scores are derived as log-odds scores for pairwise comparisons, the extension to MSA would be for instance:

The relative difference between correct and incorrect alignment decreases with the number of sequences in the alignment.
Note: This post is a summary of chapter 6.2 of Durbin.
The dynamic programming algorithms used for pairwise sequences alignment can theoretically be extended to any number of sequences. However, the time and memory requirements of this algorithm increase exponentially with the number of sequences.
The only assumption necessary to make multidimensional dynamic programming to work is that column scores are independent.
A common approach to multiple sequence alignment is to progressively align pairs of sequences. The general strategy is:
This is a greedy heuristic algorithm. A greedy algorithm decomposes a problem into pieces, and then chooses the best solution to each piece without paying attention to the problem as a whole. Since it is a heuristic algorithm, progressive alignment is not guaranteed to find the best solution. In practice, however, progressive alignment methods are efficient and produce biologically meaningful results.
MSA uses a clever heuristic multidimensional dynamic programming algorithm. It assumes an SP scoring system for both residues and gaps. We assume that the score of a multiple alignment is the sum of the scores of all pairwise alignments defined by the multiple alignment. The score of the complete alignment is given by:

Let âkl be the optimal pairwise alignment of k,l, which we can calculate in O(L2) time by standard dynamic programing. Obviously, S(akl ≤ S(âkl).
Combining this simple observation and the definition of the SP scoring system, we obtain a lower bound on the score of any pairwise alignment that can occur in the optimal multiple alignment. Now we only need to consider pairwise alignment better than the lower bound. This significantly reduces the complexity.
Note: This post is a summary of chapter 6.3 of Durbin
Progressive alignment works by constructing a succession of pairwise alignments.
Initially, you align two sequences and then align the third sequence to the alignment, and so on. There are several progressive alignment strategies. They differ in the following ways:
Progressive alignment is heuristic:
However, in many cases the resulting multiple alignment is quite reasonable.
The most important heuristic of progressive alignment algorithms is to align the most similar pairs of sequences first. These are the most reliable alignments. Most algorithms build a binary guide tree whose leaves represents sequences and whose interior nodes represent alignments. The root node represents a complete multiple alignment and the nodes furthest from the root represent the most similar pairs.
MA
/ \
AL1 AL2
/ \ / \
AL3 S1 S2 S3
/ \
S4 S5
To convert alignment scores to distance values, we use the following formula:

where
Sobs: observed pairwise alignment score
Smax: maximum score, the average of the score of aligning either sequence by itself
Srand: score of a random alignment
Seff: can be seen as a normalized percentage similarity decreasing to 0 with increasing evolutionary distance.
The method for converting alignment scores to distances does not need to be very accurate as the goal is to create an approximate guide tree. Fitch & Margolish is a fast clustering algorithm that builds evolutionary trees from distance matrices. Before adding a sequence to an existing group, any alignment to one of the group members is tried. The highest score for such an alignment determines how the new sequence will be aligned to the group. Group-to-group alignments are done by comparing all possible pairwise alignment of the members of one group with the members of the other group. The best of these alignments determines the alignment of the two groups. Generally, PAM substitution with an affine gap penalty is used.
The symbol X is used to denote gaps after alignments. Rule: once a gap, always a gap. This rule is put in place to ensure consistency. It encourages gaps to occur in the same columns in subsequent pairwise alignments: s(X,anything) = 0.
A problem with Feng-Doolittle is that all alignments are determined by pairwise sequence alignments. Once an aligned group has been built up, it is advantageous to use position specific information from the group's multiple alignment to align a new sequence to it. The degree of sequence conservation at each position should be taken into account and mismatches at highly conserved positions penalized more stringently than mismatches at variable positions. Gap penalties in positions might be reduced where lots of gaps occur in the cluster alignment, and increased where no gaps occur.
A profile for a given group a sequences contains all features which are somehow typical for this group. We can think of conserved residues as a possible feature example. The idea behind profile alignment is and penalize mismatches more strongly in highly conserved regions than in variable positions.
Many progressive methods use pairwise alignment of sequences to profiles or of profiles to profiles as a subroutine which is used many times in the process. The exact definition of the scoring function used in profile-sequence or profile-profile alignment varies. Aligned residues are usually scored by some form of SP score, but the handling of gaps varies substantially between different methods.
For linear gap scoring, profile alignment is simple because the gap scores can be included in the SP score by setting s(-,a) = s(a,-) = -g and s(-,-) = 0. If we have two multiple alignments or profiles, an alignment of these two means that that gaps are inserted in whole columns, so that the alignment within one of the profiles is not changed. We can then split the sum into two sums only concerning the two profiles and one sum containing all cross terms. The first two sums are unaffected by the global alignment because adding columns of gap characters to a profile adds 0 to the score s(-,-) = 0. Therefore, the optimal alignment of th two profiles can be obtained by only optimizing the last sum with the cross terms. This can be done exactly like a pairwise alignment where columns are scored against columns by adding the pair scores. Obviously one of the profiles can consist of a single sequence only, which corresponds to aligning a single sequence to a profile.
One widely used implementation of profile-based progressive alignment is the CLUSTALW program. CLUSTALW works in much the same way as Feng-Doolittle method except for its carefully tuned use of profile alignment methods.
Algorithm
ClustalW is unabashedly ad hoc (designed for this, not generalizable) in its alignment construction and scoring stage. In addition to the usual methods of profile construction and alignment, various heuristics of ClustalW contribute to its accuracy:
One problem with progressive alignment algorithms is that the subalignments are 'frozen'. Once a group of sequences has been aligned, their alignment to each other cannot be changed at a later stage as more data arrives. Iterative refinement algorithms attempt to circumvent this problem.
In iterative refinement, an initial alignment is generated, then one sequence or a set of sequences are taken out and realigned to a profile of the remaining aligned sequences. If a meaningful score is being optimized, this earlier increases the overall score or results in the same score. Another sequence is chosen and realigned, and so on, until we arrive at a point where the alignment does not change. This procedure is guaranteed to converge to a local maximum of the score provided that all the sequences are tried and a maximum score exists, simply because the sequence space is finite. Barton-Sternberg is a good example.
Barton-Sternberg Algorithm
The ideas of profile alignment and iterative refinement come quite close to the formulation of probabilistic HMM approaches for multiple alignment.
Note: This post is a summary of chapter 6.4 of Durbin.
[1] Durbin
[2] http://www.cs.helsinki.fi/u/ajrantan/talks/slides_andre.pdf
Profile HMMs could be used in place of standard profiles in progressive or iterative alignment methods. The use of profile HMM formalisms may have certain advantages such as replacing SP scoring scheme by profile HMM assumption that sequences are generated independently from a single 'root' probability distribution.
Profile HMMs can also be trained from initially unaligned sequences using Baum-Welch expectation maximization algorithm.
Before tackling the problem of estimating a model and a multiple alignment simultaneously from initially unaligned training sequences, we consider the simpler problem of obtaining a multiple alignment from a known model. To align a sequence to a profile HMM, we find the most probable path through the model which is found by the Viterbi algorithm. Constructing a multiple alignment just requires calculating a Viterbi alignment for each individual sequence. Residues aligned to the same profile HMM match state are aligned in columns. Use fig. 456.
Suppose we align 5 sequences. Then we derive Viterbi optimal path and realign the sequences. A profile HMM inserts insert states [a-z] for unmatched residues and [A-Z] for matched residues. A profile HMM does not modify the alignment. Insert state residues represent parts of the sequences which are atypical, unconserved, and not meaningfully alignable.
Now we try to estimate a model and multiple alignment from initially unaligned sequences.
Initialization: Choose the length of the profile HMM and initialize parameters.
Training: Estimate the model using Baum-Welch or Viterbi algorithm. It is necessary to use a heuristic method for avoiding local optima.
Multiple Alignment: Align all sequences to the final model using the Viterbi algorithm and build a multiple alignment.
A profile HMM is a repeating linear structure of three states (match, delete, and insert). The only decision that must be made in choosing an initial architecture for Baum-Welch estimation is the length of the model M. M is the number of match states in the profile HMM rather than the total number of states, which is usually set to the average length or training sets or based on prior knowledge.
Since Baum-Welch finds local optima, it is important to choose initial models carefully. The model should be encouraged to use 'sensible' transitions; or instance, transitions into match states should be large compared to other transition probabilities. At the same time, we want to start Baum-Welch from multiple different points to see if all converge to approximately the same optimum, so we want some randomness in the choice of initial model parameters.
Note: This post is a summary of chapter 6.5 of Durbin.