Additivity and Neighbor Joining

Following is an example where simple clustering methods would produce erroneous results.

A and B are clustered together first since they are closest. Additivity can be used to solve this problem.

Additivity: Given a tree, its edge lengths are said to be additive if the distance between any pair of leaves is the sum of lengths of the edges on the path connecting them. [1] An additive distance is fully characterized by the four point condition which states that any 4 points can be renamed such that:

d(a,c) + d(b,d) < d(a,b) + d(c,d) = d(a,d) + d(b,c)

Following is graphical representation of this equation

When we apply this equation to our graph, we get

< =

This equation states that a graph can be decomposed in such a way that two are equal while the third is less than the two.

Suppose we have the following matrix representing distance matrix. It must constitute an additive metric for it to produce a valid tree.

We choose a pair of leaf nodes for which Dij is minimal.

Dab = dab - (ra + rb)

where ra is the average distance between a and all other leaves and ra is the average distance between a and all other leaves. Using our matrix, we obtain the following values:

A = 2 + 7 + 4 + 7 = 20
B = 2 + 7 + 4 + 7 = 20
C = 27
D = 22
E = 27

Since A and B have the smallest values, we would start by joining A and B to some other node

And we continue using the additivity equations.

If a new path branches off an existing branch in a tree, replace one of the original leaves by another leaf along the branching path.

Each run of this algorithm has a complexity of O(n). Therefore, complexity of the reconstructing a tree from an additive distance is O(n2).

In practice, the distance between molecular sequences to be non-additive. In such cases, we look for a tree T whose matrix is close to the given one.

Analysis

It is possible for the molecular clock property to fail but for additivity to hold. UPGMA fails when molecular clock property fails. We can use neighbor joining algorithms when molecular clock fails but additivity holds true.

Neighbor-joining is a bottom-up clustering method used for the creation of phylogenetic trees. The algorithm requires knowledge of the distance between each pair of taxons in the tree. Neighbor-joining is based on the minimum-evolution criterion for phylogenetic trees, i.e. the topology that gives the least total branch length is preferred at each step of the algorithm. [2]

Being a greedy algorithm, neighbor-joining may not find the true tree topology with least total branch length. However, it usually finds a tree quite close to the optimal tree. Neighbor-joining is an efficient algorithm with polynomial-time complexity which makes it suitable for large datasets where other algorithms such as minimum evolution are not suitable computationally.

Unlike UPGMA, neighbor-joining does not assume a molecular clock i.e. that all species evolve at the same rate. Thus, neighbor-joining produced unrooted trees. There are several techniques to root a tree. For example, a root can be added by using an outgroup and the root can then effectively be placed on the point in the tree where the edge from the outgroup connects.

Neighbor-joining is statistically consistent under many evolutionary models. Hence, given data of sufficient length, neighbor-joining will reconstruct the true tree with high probability. [2]

Additivity is a property that depends on the distance measure used. A tree may be additive with respect to one distance measure and not with respect to another. Additivity means that the sums of 2 lengths must be larger than a third and equal in size.

Just as the ultrametric condition provided a test for the molecular clock property, the four point condition provides a test for additivity. For every set of 4 points, two will be equal and both will be larger than the third.

Neighbor-joining involves stripping all nodes with the exception of pre-existing additive trees. The tree is then reassembled.

Source

[1] Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by Durbin.
[2] Wikipedia
[3] http://www.icp.ucl.ac.be/~opperd/private/neighbor.html
[4] Class notes of B. Sonderegger