Definitions
Phylogeny refers to the evolutionary relationships among organisms. It is the pattern of lineage branching produced by the true evolutionary history of the organisms being considered.
Phylogenetics is the field of biology that deals with the relationships between organisms. It includes the discovery of these relationships, and the study of the causes behind their pattern.
In molecular biology terms, phylogenetics is useful for
- Inferring function by similarity
- Choosing a template for homology modeling
- Discovering and analyzing gene families
- Comparing whole genomes
A taxon is a unit of classification. Often it refers to the members of the groups of organisms being analyzed. This may be a single species or a group of species. It is the “label” at the leaf of the tree.
Homology is similarity due to a common ancestor. It is in fact the hypothesis we make when we align sequences. Homology is not similarity. Similarity is a measurable quantity. Homology is a hypothesis that can be either true or false.
Homoplasy: the occurrence of similar states of a character not due to common lineage. This may be due to environmental constraints or simply a random occurrence.
Convergence: bats and birds both have wings, but wings evolved independently in the two lineages rather than being inherited from a winged common ancestor.
Reversion: whales resemble fish, but whales’ ancestors lived on land.
Orthologs and Paralogs
Two genes are orthologous if they diverged after a speciation event. Two genes are paralogous if they diverged after a gene duplication event.

Haemoglobin α and β are paralogs whether we compare within or across species.
Human α-Haemoglobin and pig α-Haemoglobin are orthologs.
Human β-Haemoglobin and pig β-Haemoglobin are orthologs.
There is only one speciation event. It is present twice in the tree because each paralog diverged after it occurred.
Comparing human α-Haemoglobin and pig β-Haemoglobin for the purpose of inferring function would give aberrant results.
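The haemoglobin example can be sketched in a few lines of Python. This is only an illustration: the nested-tuple tree encoding, the gene names and the function names are assumptions made for this sketch, not part of any library. Two genes are classified by the event recorded at their last common ancestor in the gene tree.

```python
# Hypothetical gene tree for the haemoglobin example. Each internal node is
# (event, left_subtree, right_subtree); each leaf is a gene name.
GENE_TREE = (
    "duplication",
    ("speciation", "human_alpha", "pig_alpha"),
    ("speciation", "human_beta", "pig_beta"),
)


def leaves(tree):
    """Return the set of gene names found below a node."""
    if isinstance(tree, str):
        return {tree}
    _, left, right = tree
    return leaves(left) | leaves(right)


def relationship(tree, gene_a, gene_b):
    """Classify two genes by the event at their last common ancestor."""
    if isinstance(tree, str):
        raise ValueError("expected two different genes from the tree")
    event, left, right = tree
    for child in (left, right):
        below = leaves(child)
        if gene_a in below and gene_b in below:
            # Both genes sit in the same subtree, so their last common
            # ancestor lies deeper in the tree.
            return relationship(child, gene_a, gene_b)
    if not {gene_a, gene_b} <= leaves(tree):
        raise ValueError("both genes must be leaves of the tree")
    return "orthologs" if event == "speciation" else "paralogs"


print(relationship(GENE_TREE, "human_alpha", "pig_alpha"))
print(relationship(GENE_TREE, "human_alpha", "pig_beta"))
```

The second call compares an α gene with a β gene across species; their last common ancestor is the duplication node, which is why that pair should not be used to infer function.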
Introduction
Alignment of sequences should take account of their evolutionary relationship. For example, an alignment that implies many substitutions between closely related sequences is less plausible than one that makes most of its changes over large evolutionary distances.
Similarity of the molecular mechanisms of different organisms strongly suggests that they might have originated from a common ancestor. Such a relationship between species is called a phylogeny, and it can be represented as a phylogenetic tree. Phylogenetics is the science of inferring a phylogenetic tree from experiments and observations.
Sequences diversify through either gene duplication or speciation events. In a gene duplication event, a gene is duplicated within a genome and the two copies diverge over time. In a speciation event, a lineage splits into two species, and the copies of a gene carried by each species then diverge independently. Because of gene duplication, the phylogenetic tree of a group of sequences does not necessarily reflect the phylogenetic tree of the host species. If we are interested in inferring the phylogenetic tree of the species carrying the genes, we must use orthologous genes (those separated by speciation events).
Phylogenetic Trees
Phylogenetic trees are usually binary trees: each edge branches into two daughter edges. Each edge of the tree has a certain amount of evolutionary divergence associated with it. This divergence is quantified by a measure such as the distance between sequences, or by a substitution model of residues over the course of evolution. Different proteins evolve at different rates. Even the same sequences in different organisms change at different rates. However, averaging over larger sets of proteins, we observe a correspondence between edge lengths and evolutionary time periods.
By definition, a phylogeny has a root, which is the ancestor of all sequences. However, it is not always possible to infer a root reliably. Several algorithms provide information about the location of the root, while others, such as parsimony and the probabilistic models, are completely uninformative. For such algorithms, other criteria need to be used for rooting the tree.
A rooted tree indicates the direction of evolutionary time. The direction of time is undetermined in an unrooted tree.
Counting and labelling trees
A rooted binary tree with n leaves contains n-1 non-leaf nodes, 2n-1 nodes in total, and 2n-2 edges. An unrooted binary tree with n leaves has 2n-2 nodes and 2n-3 edges.
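These counts can be checked on a small example. The sketch below assumes a rooted binary tree written as nested Python tuples with strings as leaves; this encoding and the function name are illustrative choices only.

```python
def count_rooted(tree):
    """Return (leaves, internal_nodes, edges) of a nested-tuple rooted binary tree."""
    if isinstance(tree, str):
        return 1, 0, 0
    left, right = tree
    l_leaves, l_internal, l_edges = count_rooted(left)
    r_leaves, r_internal, r_edges = count_rooted(right)
    # One new internal node, plus one edge down to each of the two children.
    return (l_leaves + r_leaves,
            l_internal + r_internal + 1,
            l_edges + r_edges + 2)


tree = ((("A", "B"), "C"), ("D", "E"))
n, internal, edges = count_rooted(tree)
print(n, internal, n + internal, edges)  # 5 4 9 8, i.e. n, n-1, 2n-1, 2n-2
```

Removing the root and merging its two edges into one gives the unrooted tree, which therefore has one node and one edge fewer: 2n-2 nodes and 2n-3 edges.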
Phylogenetic Algorithms
There are three classes of phylogenetic algorithms:
- Numerical taxonomy (phenetics): distance-based methods
- Cladistic Methods
- Probabilistic Methods
Cladistic Methods
Cladistic methods make inferences about the character states at internal nodes. All cladistic methods attempt to find the following:

The vast majority of cladistic methods are optimization algorithms. These algorithms search for an optimum in a search space. The search space is the set of possible trees. This includes all topologies and all ancestral states for each topology.
A search method can be brute force, branch and bound, or heuristic.
Brute Force cladistic search methods
The search space can be represented as a decision tree in which a selection is made at each node. In brute force, a complete search of all phylogenetic trees is made by walking the decision tree and calculating the score at each of its leaves.
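One way to picture this decision tree is stepwise addition: the choice made at level k is the edge to which the k-th taxon is attached. The sketch below assumes rooted trees written as nested tuples; the names insertions, all_trees and brute_force, and the score argument, are hypothetical and stand for whatever optimality criterion is being used.

```python
def insertions(tree, leaf):
    """Yield each tree made by attaching `leaf` to one edge of `tree`."""
    yield (tree, leaf)  # attach above the current root
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insertions(left, leaf):
            yield (new_left, right)
        for new_right in insertions(right, leaf):
            yield (left, new_right)


def all_trees(taxa):
    """Walk the decision tree: level k decides where taxon k is attached."""
    def extend(tree, remaining):
        if not remaining:
            yield tree  # a leaf of the decision tree: one complete phylogeny
            return
        for bigger in insertions(tree, remaining[0]):
            yield from extend(bigger, remaining[1:])

    yield from extend(taxa[0], list(taxa[1:]))


def brute_force(taxa, score):
    """Score every complete tree and return one with the lowest score."""
    return min(all_trees(taxa), key=score)


print(len(list(all_trees(["A", "B", "C", "D"]))))     # 15
print(len(list(all_trees(["A", "B", "C", "D", "E"]))))  # 105
```

The number of complete trees grows very quickly with the number of taxa (15 for four taxa, 105 for five), which is why an exhaustive walk is only practical for small data sets.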
Branch and Bound cladistic search methods
Branch and bound algorithms also use search trees. The score of the partially constructed tree is calculated at each internal node. If the score is worse than the best score obtained so far, we do not continue with that branch.
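A minimal sketch of this pruning rule follows, under two assumptions that are not stated above: the score is a parsimony score (the minimum number of changes, computed here with Fitch's rule), and partial trees are grown by adding one taxon at a time. With that score, adding a taxon can never lower the number of changes, so a partial tree that is already worse than the best complete tree can be abandoned. The toy sequences and all function names are hypothetical.

```python
def insertions(tree, leaf):
    """Yield each tree made by attaching `leaf` to one edge of `tree`."""
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insertions(left, leaf):
            yield (new_left, right)
        for new_right in insertions(right, leaf):
            yield (left, new_right)


def fitch(tree, states):
    """Return (possible root states, number of changes) for one site."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony(tree, seqs):
    """Sum the Fitch change counts over all alignment columns."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch(tree, {taxon: seq[i] for taxon, seq in seqs.items()})[1]
        for i in range(n_sites)
    )


def branch_and_bound(seqs):
    """Return (tree, score) for a most parsimonious tree."""
    taxa = sorted(seqs)
    best = {"score": float("inf"), "tree": None}

    def extend(tree, remaining):
        score = parsimony(tree, seqs)
        if score > best["score"]:
            return  # bound: this partial tree is already worse than the best
        if not remaining:
            if score < best["score"]:
                best["score"], best["tree"] = score, tree
            return
        for bigger in insertions(tree, remaining[0]):
            extend(bigger, remaining[1:])

    extend(taxa[0], taxa[1:])
    return best["tree"], best["score"]


# Toy alignment (invented for illustration): two sites group A with B,
# one site groups A with C.
SEQS = {"A": "AAT", "B": "AAC", "C": "CCT", "D": "CCC"}
best_tree, best_score = branch_and_bound(SEQS)
print(best_tree, best_score)
```

The result is still guaranteed to be optimal, because only branches that cannot lead to a better tree are skipped; how much time is saved depends on how early a good complete tree is found.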
Heuristic algorithm based cladistic search methods
Both brute force and branch and bound always find the best solution, but they are too slow for all but small data sets. Heuristic solutions are much faster but do not guarantee the optimal solution: a heuristic search may settle in a local optimum rather than the global optimum.
Advantages of cladistic methods
• Take variable rates of evolution and homoplasy into account.
• Give a tree with putative ancestral states.
Disadvantages of cladistic methods
• Slow
• Often only a local optimum is found
• Care must be taken when interpreting evolutionary distances
• Many equally optimal solutions may be generated
Probabilistic Methods
Probabilistic methods start with a model of evolution. This model is described in the form of mutation probabilities. The most probable tree given the data and the model can then be calculated. The probabilities of multiple mutations in a branch are also taken into account. The most commonly used probabilistic algorithms are maximum likelihood and Bayesian methods.
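The simplest case is a "tree" with two sequences joined by a single branch. The sketch below assumes the Jukes-Cantor model (all substitutions equally likely, equal base frequencies), which is a choice made for this illustration and not something prescribed above. It computes the likelihood of an ungapped pairwise alignment as a function of the branch length d, in expected substitutions per site, and picks the most likely d from a grid. The function names and the example sequences are hypothetical.

```python
import math


def jc69_prob(same, d):
    """Probability that a site ends in the same / one specific other base."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay


def log_likelihood(seq1, seq2, d):
    """Log-likelihood of two aligned sequences separated by branch length d."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # 0.25 is the equilibrium frequency of the base observed in seq1.
    return sum(
        math.log(0.25 * jc69_prob(a == b, d)) for a, b in zip(seq1, seq2)
    )


def ml_distance(seq1, seq2, step=0.001, max_d=3.0):
    """Grid search for the branch length with the highest likelihood."""
    grid = [step * k for k in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: log_likelihood(seq1, seq2, d))


seq1 = "ACGTACGTAC"
seq2 = "ACGTACGTTT"  # differs from seq1 at 2 of 10 sites
print(ml_distance(seq1, seq2))
```

Because the model allows several mutations at the same site, the estimated branch length is larger than the observed fraction of differing sites (0.2 here). Real maximum likelihood and Bayesian programs apply the same idea to every branch of every candidate tree, which is where the computational cost comes from.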
Advantages
• Based on a model of evolution
• Take variable rates of evolution, homoplasy and even multiple mutations in a branch into account
• Statistical confidence for the result is inherent in the method
Disadvantages
• Slow
• Often only a local optimum is found
Molecular Clock
At the molecular level, mutations occur with a certain probability, but a date cannot be read directly from molecular data. The mutation rate is higher in some organisms than in others and varies with geography and over time, so mutations are not fixed at a constant rate. Consequently, dating methods based purely on molecular data give aberrant results.