Multiple sequence alignment techniques are most commonly applied to protein sequences. Ideally, a multiple alignment is a statement of both evolutionary and structural similarity among the proteins encoded by the sequences in it.
Multiple alignments must usually be inferred from primary sequences alone. Biologists produce high-quality multiple sequence alignments by hand, using expert knowledge of protein sequence evolution that comes from experience. Important factors include:
- specific sorts of columns in alignments, such as highly conserved residues or buried hydrophobic residues
- the influence of secondary or tertiary structure, such as the alternation of hydrophobic and hydrophilic columns in exposed beta sheet
- expected patterns of insertions and deletions, which tend to alternate with blocks of conserved sequence
The phylogenetic relationships between sequences dictate constraints on the changes that occur in columns and in the patterns of gaps.
Manual alignment is tedious. Automating it is hard, because it is difficult to define exactly what an optimal multiple sequence alignment is, and impossible to set a standard for a single correct multiple alignment. In theory, there is one underlying evolutionary process and one evolutionarily correct alignment for any group of sequences. However, the differences between sequences can be so great in parts of an alignment that there is no apparent, unique solution for an alignment algorithm to find. Those same divergent regions are often structurally unalignable as well. Most of the insight that we derive from multiple alignments comes from analyzing the regions of similarity, not from attempting to align highly diverged regions.
In general, an automatic method must have a way to assign a score so that better multiple alignments get better scores. We should carefully distinguish the problem of scoring a multiple alignment from the problem of searching over possible multiple alignments to find the best one.
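To make the split between scoring and searching concrete, here is a minimal Python sketch. It is an illustration only: the sum-of-pairs scheme and the match, mismatch, and gap values are assumptions chosen for the example, not something this section prescribes, and the "search" is reduced to picking the best of a few hand-written candidates.

```python
from itertools import combinations

# Toy pairwise scores. These values are illustrative assumptions,
# not taken from any substitution matrix.
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(a: str, b: str) -> int:
    """Score one pair of symbols taken from the same column."""
    if a == "-" and b == "-":
        return 0  # two gaps facing each other carry no information
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def sum_of_pairs(alignment: list[str]) -> int:
    """Scoring problem: add pair scores over every column and row pair."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    return sum(
        pair_score(a, b)
        for column in zip(*alignment)
        for a, b in combinations(column, 2)
    )


def best_alignment(candidates: list[list[str]]) -> list[str]:
    """Search problem, reduced here to choosing the top-scoring candidate."""
    return max(candidates, key=sum_of_pairs)


good = ["ACG-T", "ACGAT", "ACG-T"]
bad = ["ACGT-", "ACGAT", "-ACGT"]
print(sum_of_pairs(good), sum_of_pairs(bad))
print(best_alignment([bad, good]))
```

The scoring function knows nothing about how candidates are generated, so the same score could sit behind a very different search strategy. A real program cannot enumerate candidates like this, because the number of possible alignments grows far too quickly.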
To automate multiple alignment, we need to:
- look at what a multiple alignment should capture, structurally and evolutionarily
- consider how to turn these biological criteria into a numerical scoring scheme, so that a program can recognize a good multiple alignment
- examine the approaches taken by different multiple alignment programs
- describe a full probabilistic multiple alignment approach based on profile HMMs
What does a multiple alignment mean?
In a multiple sequence alignment, homologous residues among a set of sequences are aligned together in columns. ‘Homologous’ is meant in both the structural and the evolutionary sense. Ideally, the residues in a column occupy similar 3D structural positions and all diverge from a common ancestral residue.
Except for trivial cases of nearly identical sequences, it is not possible to unambiguously identify structurally or evolutionarily homologous positions and create a single ‘correct’ multiple alignment. Since protein structures also evolve, we do not expect two protein structures with different sequences to be entirely superposable. Even the definition of ‘structurally superposable’ is subjective and can be expected to vary among experts.
In principle, there is always an unambiguously correct evolutionary alignment, even if the structures diverge. In practice, however, an evolutionarily correct alignment can be even more difficult to infer than a structural alignment. Structural alignment has an independent point of reference: the superposition of X-ray crystallography or NMR structures. The evolutionary history of the residues of a sequence family cannot be known independently from any source. It must be inferred from the sequence alignment itself.
For these reasons, a program should not be expected to reproduce a hand-built reference alignment exactly. Instead, it should focus on the subset of columns corresponding to key residues and core structural elements that can be aligned with confidence.
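As a rough illustration of restricting attention to confidently aligned columns, the Python sketch below keeps only columns with few gaps and one dominant residue. The function names, the two thresholds, and the use of plain residue identity as a stand-in for "confidence" are all assumptions made for this example. Real conserved columns often contain similar rather than identical residues, so treat this as a crude proxy.

```python
from collections import Counter


def column_stats(column: tuple[str, ...]) -> tuple[float, float]:
    """Return (gap fraction, frequency of the most common residue)."""
    residues = [c for c in column if c != "-"]
    gap_fraction = 1 - len(residues) / len(column)
    if not residues:
        return gap_fraction, 0.0
    top_count = Counter(residues).most_common(1)[0][1]
    return gap_fraction, top_count / len(residues)


def confident_columns(
    alignment: list[str],
    max_gap_fraction: float = 0.2,  # assumed cutoff, chosen arbitrarily
    min_identity: float = 0.8,      # assumed cutoff, chosen arbitrarily
) -> list[int]:
    """Indices of columns that are mostly ungapped and mostly one residue."""
    keep = []
    for index, column in enumerate(zip(*alignment)):
        gap_fraction, identity = column_stats(column)
        if gap_fraction <= max_gap_fraction and identity >= min_identity:
            keep.append(index)
    return keep


# Hypothetical toy alignment, made up for the example.
toy = ["MKV-LA", "MKIGLA", "MRV-LA", "MKV-IA"]
print(confident_columns(toy))
```

Comparing two alignments only on the columns returned here is one way to avoid penalizing a program for choices made in regions where no unique answer exists.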
Summary
- A multiple alignment is an alignment of more than two sequences
- It usually gives more information about conserved regions than a pairwise alignment
- It gives a better estimate of significance when working with a sequence of unknown function
- Multiple alignments are required when establishing phylogenetic relationships
Note: This post is a summary of chapter 6.1 of Durbin.