UPGMA

UPGMA, which stands for unweighted pair group using arithmetic averages, is a clustering method used to build phylogenetic trees. It works by averaging the distances but these distance averaging is based on the number of taxons in different clusters.

If A and B are two nodes, the distance between two nodes is computed by:


Latex: \huge$d_{(AB)C} = \frac{d_{AB} + d_{AC}}{2}$

which is simply the average of distance between AB and AC.

Suppose, we have two clusters i and j such that |Ci| and |Cj| denote the number of sequences in the clusters respectively. Suppose that Ck which is formed by the union of Ci and Cj. If we wish to wish to compute the distance between Ck and some other cluster Cl:


Latex: \huge$d_{kl} = \frac{d_{il}|C_{i}| + d_{jl}|C_{i}|}{|C_{i}| + |C_{j}|}$

Following is more intuitive example of UPGMA.

  • We assign each sequence to its own cluster. Initially, each cluster would be a cluster in itself.
  • For each cluster, we find a cluster which is closest to it and compute dij and define a new cluster formed by the union of both. If we find several equidistant clusters, we choose one at random.
  • We repeat this process until, all clusters are connected.

Now lets look a this example using a distance matrix. We have the following matrix:

A B C D
B 2
C 4 4
D 6 6 6
E 6 6 6 2
  1. d(AB)C = (dAC + dBC) / 2
  2. d(ABC)DE = (dAD + dAE + dBD + dBE + dCD + dCE) / 6

When the data is ultrametric, both UPGMA and WPGMA have the same results. When the data is not ultrametric, the inferences vary.

Molecular clocks and the ultrametric property of distances

UPGMA produces a rooted tree of a special kind. The edge lengths in the resulting tree can be viewed as times measured by a molecular clock with a constant rate. The divergence of sequences is assumed to occur at the same constant rate at all points in the tree, i.e. the sum of time down a path to the leaves from any node is the same. If our distance data is derived by adding edge lengths in a tree T with a molecular clock, then UPGMA will reconstruct T correctly. If the original tree has different length routes to its leaves, then it may be reconstructed incorrectly by UPGMA. The problem is that the closest leaves may not be neighboring leaves as they may not have an ancestor in common. A test of whether reconstruction is likely to be correct is teh ultrametric condition. The distances dij are said to be ultrametric if, for any triplet of sequences, xi, xj, xk, the distances dij, djk, dik are either all equal, or two are equal and the remaining one is smaller. This condition holds fro distances derived from a tree with a molecular clock.