UVP Doc-It Life Science User Manual

Performing 1D Analysis

Creating clusters

Initially, each lane forms its own cluster. A linkage rule (see below) is then applied repeatedly to
merge smaller clusters into larger ones, until all the clusters have been combined into a single cluster.
The result is a hierarchy of clusters. Moving up the hierarchy, clusters contain more lanes, but those
lanes are less similar to each other. Lanes that are very similar to each other appear together in
clusters near the bottom of the hierarchy.
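
The merging procedure described above can be sketched in a few lines of Python. This is only an illustration, not the software's implementation: the function name agglomerate, the distance table, and the default choice of single linkage (min) are all assumptions made to keep the example short.

```python
from itertools import combinations

def agglomerate(distances, n_lanes, linkage=min):
    """Merge lanes bottom-up and return the list of merges.

    distances: dict mapping frozenset({i, j}) -> distance between lanes i and j
    linkage:   function reducing a list of item-to-item distances to one
               cluster-to-cluster distance (min = single linkage, chosen
               here only as an assumption for the sketch)
    """
    clusters = [frozenset([i]) for i in range(n_lanes)]  # one cluster per lane
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for a, b in combinations(clusters, 2):
            d = linkage([distances[frozenset((i, j))] for i in a for j in b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two clusters with their union and record the merge.
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((sorted(a), sorted(b), d))
    return merges

# Hypothetical distances between four lanes (0 and 1 similar, 2 and 3 similar).
example = {
    frozenset((0, 1)): 1.0, frozenset((2, 3)): 1.5,
    frozenset((0, 2)): 6.0, frozenset((0, 3)): 7.0,
    frozenset((1, 2)): 5.0, frozenset((1, 3)): 6.5,
}
merges = agglomerate(example, 4)
```

Each entry in merges records which two clusters were joined and at what distance. The most similar lanes are joined first, at the smallest distances, which corresponds to the bottom of the hierarchy.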

The dendrogram shows the links that have been made between the clusters to form larger clusters;
the shorter the distance between items in the dendrogram, the more similar they are.

Linkage rules

A linkage rule defines how the distance between two clusters is calculated.

Single Linkage (nearest neighbor): The distance between two clusters is given by the distance
between the two closest items (lanes) in the different clusters.

This method often causes the chaining phenomenon: clusters are forced together because single
entities are close to each other, regardless of the positions of the other entities in those
clusters.
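
As a rough illustration of single linkage and of why chaining occurs, the sketch below uses hypothetical one-dimensional item positions. The names single_linkage and abs_diff are assumptions for the example, not part of the software.

```python
def single_linkage(cluster_a, cluster_b, dist):
    """Distance between the two closest items, one taken from each cluster."""
    return min(dist(a, b) for a in cluster_a for b in cluster_b)

def abs_diff(a, b):
    return abs(a - b)

# Two hypothetical clusters whose nearest edges almost touch.
left = [0.0, 1.0, 2.0, 3.0]
right = [4.0, 5.0, 6.0, 7.0]

d_single = single_linkage(left, right, abs_diff)  # 1.0, from items 3.0 and 4.0
```

The two clusters are reported as only 1.0 apart even though their outermost items are 7.0 apart, so they would be merged early on the strength of a single close pair.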

Complete Linkage (furthest neighbor): The distance between two clusters is given by the
greatest distance between two items in the different clusters.

This method should not be used if a lot of noise is expected in the dataset, because outliers are
given more weight in the cluster decision. It also produces very compact clusters. This method is
useful if entities of the same cluster are expected to be far apart in multi-dimensional space
(provided there is no noise).
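
The sensitivity to outliers can be seen in a small sketch. The item positions and the names complete_linkage and abs_diff are hypothetical, chosen only for illustration.

```python
def complete_linkage(cluster_a, cluster_b, dist):
    """Greatest distance between two items, one taken from each cluster."""
    return max(dist(a, b) for a in cluster_a for b in cluster_b)

def abs_diff(a, b):
    return abs(a - b)

tight = [0.0, 1.0, 2.0]
clean = [5.0, 6.0]
noisy = [5.0, 6.0, 20.0]   # same cluster plus one hypothetical outlier

d_clean = complete_linkage(tight, clean, abs_diff)   # 6.0
d_noisy = complete_linkage(tight, noisy, abs_diff)   # 20.0
```

A single noisy item more than triples the reported distance between the clusters, which is why this rule is a poor fit for noisy data.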

Unweighted pair-group method average (UPGMA): The distance between two clusters is
calculated as the arithmetic mean of the distances between all possible pairs of entities of the two
clusters in question.

This method is a compromise between single and complete linkage. The chaining problem is not
observed for this method, and outliers are given no special weight in the cluster decision,
which makes this method the most popular.
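
A minimal sketch of the average-linkage calculation, using hypothetical one-dimensional item positions, shows how the result falls between the single and complete linkage values. The function and variable names are assumptions for the example.

```python
def average_linkage(cluster_a, cluster_b, dist):
    """Arithmetic mean of the distances over all pairs (one item per cluster)."""
    pairs = [dist(a, b) for a in cluster_a for b in cluster_b]
    return sum(pairs) / len(pairs)

def abs_diff(a, b):
    return abs(a - b)

a = [0.0, 2.0]
b = [5.0, 7.0]

# The four pairwise distances are 5, 7, 3 and 5.
d_average = average_linkage(a, b, abs_diff)               # 5.0
d_nearest = min(abs_diff(x, y) for x in a for y in b)     # 3.0 (single linkage)
d_furthest = max(abs_diff(x, y) for x in a for y in b)    # 7.0 (complete linkage)
```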

Weighted pair-group method average (WPGMA): This is identical to UPGMA except in how the
number of items in a cluster is taken into account; this may be useful when there is a large
variation in the number of items in the clusters.
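
Under the usual textbook definitions, the two average-based rules differ in how the distance from a newly merged cluster to a third cluster is updated. UPGMA weights each of the two merged clusters by its item count, so every item counts equally. WPGMA takes the plain mean of the two cluster distances, so items in the smaller cluster carry more weight. The sketch below assumes these textbook formulas; whether the software follows them exactly is not stated in this manual, and the function names are hypothetical.

```python
def upgma_update(d_ac, d_bc, n_a, n_b):
    """Distance from merged cluster (A+B) to C, weighted by cluster sizes."""
    return (n_a * d_ac + n_b * d_bc) / (n_a + n_b)

def wpgma_update(d_ac, d_bc):
    """Distance from merged cluster (A+B) to C, ignoring cluster sizes."""
    return (d_ac + d_bc) / 2

# Hypothetical case: A holds 3 items, B holds 1 item.
d_upgma = upgma_update(d_ac=2.0, d_bc=6.0, n_a=3, n_b=1)   # 3.0
d_wpgma = wpgma_update(d_ac=2.0, d_bc=6.0)                 # 4.0
```

The two results only diverge when the merged clusters differ in size, which is the situation the text above refers to.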

Unweighted pair-group method centroid (UPGMC): The distance between two clusters is the
distance between the centroids of each cluster (the centroid of a cluster is the average point in
the multidimensional space of the cluster). The resulting trees are not right-aligned and branches
can have negative values.
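
The following sketch illustrates one way a centroid-based tree can end up with a merge at a smaller distance than the one before it, which would show up as a negative branch length. The three two-dimensional points are hypothetical, and the code assumes plain Euclidean distance.

```python
import math

def centroid(points):
    """Average point of a cluster in multidimensional space."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

# Three hypothetical items forming a nearly equilateral triangle.
a, b, c = (0.0, 0.0), (2.0, 0.0), (1.0, 1.8)

# a and b are the closest pair (2.0 versus about 2.06), so they merge first.
first_merge = math.dist(a, b)
assert first_merge < math.dist(a, c) and first_merge < math.dist(b, c)

# The centroid of {a, b} is (1.0, 0.0), which lies closer to c than a and b
# were to each other, so the second merge happens at a smaller distance.
second_merge = math.dist(centroid([a, b]), c)
inversion = second_merge < first_merge   # True
```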

Weighted pair-group method centroid (WPGMC): This is identical to UPGMC except in how the
number of items in a cluster is taken into account; this may be useful when there is a large
variation in the number of items in the clusters.
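
In the usual textbook formulation, the two centroid-based rules differ in where the centre of a newly merged cluster is placed. UPGMC uses the size-weighted centroid, while WPGMC uses the midpoint of the two previous centres regardless of cluster size. The sketch below assumes those definitions; the function names and values are hypothetical, and the software's exact behavior is not confirmed by this manual.

```python
def upgmc_centre(c_a, n_a, c_b, n_b):
    """Centre of the merged cluster, weighted by the number of items."""
    return tuple((n_a * x + n_b * y) / (n_a + n_b) for x, y in zip(c_a, c_b))

def wpgmc_centre(c_a, c_b):
    """Centre of the merged cluster as the midpoint of the two old centres."""
    return tuple((x + y) / 2 for x, y in zip(c_a, c_b))

# Hypothetical case: cluster A (3 items) centred at (0, 0),
# cluster B (1 item) centred at (4, 0).
centre_upgmc = upgmc_centre((0.0, 0.0), 3, (4.0, 0.0), 1)   # (1.0, 0.0)
centre_wpgmc = wpgmc_centre((0.0, 0.0), (4.0, 0.0))         # (2.0, 0.0)
```

With unequal cluster sizes, the size-weighted centre stays near the larger cluster, whereas the midpoint gives the single item in B as much influence as the three items in A.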

Ward's method: This method differs from the others in that it uses an analysis of the variance to
calculate distances between clusters. An item is joined to a cluster if the joining results in a
minimum degree of variation within the cluster. This means that items will not get grouped into a
cluster simply because they do not belong anywhere else. As a consequence, Ward's method
can lead to a large number of small clusters. Details of the method can be found at
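
As an informal illustration of the variance criterion, the sketch below computes how much the within-cluster sum of squared deviations grows when two clusters are merged, both directly and with the standard closed-form expression. The one-dimensional values and function names are hypothetical, and the software may report a differently scaled quantity.

```python
def sum_of_squares(values):
    """Sum of squared deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def ward_increase(a, b):
    """Growth in within-cluster sum of squares caused by merging a and b."""
    return sum_of_squares(a + b) - sum_of_squares(a) - sum_of_squares(b)

def ward_increase_closed_form(a, b):
    """Same quantity computed from the cluster sizes and means only."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = sum(a) / n_a, sum(b) / n_b
    return (n_a * n_b) / (n_a + n_b) * (mean_a - mean_b) ** 2

a = [1.0, 3.0]
b = [7.0, 9.0]

cost_direct = ward_increase(a, b)               # 40 - 2 - 2 = 36.0
cost_formula = ward_increase_closed_form(a, b)  # (2*2/4) * (2 - 8)**2 = 36.0
```

At each step, the pair of clusters with the smallest such increase would be merged.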
