Class CentroidLinkage

  • All Implemented Interfaces:
    AgglomerationMethod

    public final class CentroidLinkage
    extends Object
    implements AgglomerationMethod
    The "centroid" method, also known as the "Unweighted Pair-Group Method using Centroids" (UPGMC), is a geometric approach that links the centroids of clusters. Each cluster is represented by its centroid, and the distance between two clusters is calculated as the distance between their centroids. This method does not distort the cluster space. [The Data Analysis Handbook, by Ildiko E. Frank and Roberto Todeschini]

    This method can produce a dendrogram that is not monotonic: it can contain so-called inversions, which are hard to interpret. An inversion occurs when the distance from the union of two clusters, r and s, to a third cluster is less than the distance between r and s.

    Use this method only with Euclidean distances.

    The general form of the Lance-Williams matrix-update formula:

        d[(i,j),k] = ai*d[i,k] + aj*d[j,k] + b*d[i,j] + g*|d[i,k]-d[j,k]|

    For the "centroid" method, where ci and cj are the cardinalities of clusters i and j:

        ai = ci/(ci+cj)
        aj = cj/(ci+cj)
        b  = -ci*cj/((ci+cj)*(ci+cj))
        g  = 0

    Thus:

        d[(i,j),k] = ci/(ci+cj)*d[i,k] + cj/(ci+cj)*d[j,k] - ci*cj/((ci+cj)*(ci+cj))*d[i,j]
                   = ( ci*d[i,k] + cj*d[j,k] - ci*cj/(ci+cj)*d[i,j] ) / (ci+cj)
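
    As a worked illustration of the inversion described above, the following sketch applies the centroid update to three single-point clusters i, j and k whose pairwise dissimilarities are all 1.0. The class name InversionDemo and the helper centroidUpdate are hypothetical and not part of this library; the helper only evaluates the last form of the formula above.

```java
// Hypothetical demo class; not part of the library.
class InversionDemo {

    // Lance-Williams update for the centroid method:
    // ( ci*d[i,k] + cj*d[j,k] - ci*cj/(ci+cj)*d[i,j] ) / (ci+cj)
    static double centroidUpdate(double dik, double djk, double dij, int ci, int cj) {
        double n = (double) ci + cj;
        return (ci * dik + cj * djk - ((double) ci * cj / n) * dij) / n;
    }

    public static void main(String[] args) {
        // Clusters i and j are merged at height d[i,j] = 1.0.
        double dij = 1.0;
        double merged = centroidUpdate(1.0, 1.0, dij, 1, 1);
        // merged = (1 + 1 - 0.5) / 2 = 0.75, which is below the merge height.
        System.out.println("d[(i,j),k] = " + merged);
        System.out.println("inversion: " + (merged < dij));
    }
}
```

    Here the merged cluster (i,j) ends up closer to k (0.75) than i and j were to each other (1.0), so the next merge would be drawn below the previous one in the dendrogram.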
    • Constructor Detail

      • CentroidLinkage

        public CentroidLinkage()
    • Method Detail

      • computeDissimilarity

        public double computeDissimilarity(double dik,
                                           double djk,
                                           double dij,
                                           int ci,
                                           int cj,
                                           int ck)
        Description copied from interface: AgglomerationMethod
        Compute the dissimilarity between the newly formed cluster (i,j) and the existing cluster k.
        Specified by:
        computeDissimilarity in interface AgglomerationMethod
        Parameters:
        dik - dissimilarity between clusters i and k
        djk - dissimilarity between clusters j and k
        dij - dissimilarity between clusters i and j
        ci - cardinality of cluster i
        cj - cardinality of cluster j
        ck - cardinality of cluster k
        Returns:
        dissimilarity between cluster (i,j) and cluster k.
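
        The parameters map directly onto the centroid update formula in the class description. The sketch below shows one way a method with this signature could evaluate it; it is not the library's actual source, and the class name CentroidLinkageSketch is hypothetical. It assumes that ck is not needed, since the centroid coefficients depend only on ci and cj.

```java
// Hypothetical stand-in that mirrors the documented method signature;
// not the library's own implementation.
class CentroidLinkageSketch {

    // Evaluates ( ci*dik + cj*djk - ci*cj/(ci+cj)*dij ) / (ci+cj).
    // ck is accepted to match the documented signature but is unused here.
    public double computeDissimilarity(double dik, double djk, double dij,
                                       int ci, int cj, int ck) {
        double n = (double) ci + cj;
        return (ci * dik + cj * djk - ((double) ci * cj / n) * dij) / n;
    }

    public static void main(String[] args) {
        CentroidLinkageSketch method = new CentroidLinkageSketch();
        // Cluster i has 2 members, cluster j has 1; the dissimilarities are illustrative values.
        double d = method.computeDissimilarity(4.0, 1.0, 9.0, 2, 1, 5);
        // (2*4 + 1*1 - (2*1/3)*9) / 3 = (8 + 1 - 6) / 3 = 1.0
        System.out.println(d);
    }
}
```

        Casting to double before multiplying ci and cj is a defensive choice in this sketch: it avoids int overflow when both clusters are very large.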