In the study of hierarchical clustering, Dasgupta's objective is a measure of the quality of a clustering, defined from a similarity measure on the elements to be clustered. It is named after Sanjoy Dasgupta, who formulated it in 2016.[1] Its key property is that, when the similarity comes from an ultrametric space, the optimal clustering for this quality measure follows the underlying structure of the ultrametric space. In this sense, clustering methods that produce good clusterings for this objective can be expected to approximate the ground truth underlying the given similarity measure.[2]

In Dasgupta's formulation, the input to a clustering problem consists of similarity scores between certain pairs of elements, represented as an undirected graph , with the elements as its vertices and with non-negative real weights on its edges. Large weights indicate elements that should be considered more similar to each other, while small weights or missing edges indicate pairs of elements that are not similar. A hierarchical clustering can be described as a tree (not necessarily a binary tree) whose leaves are the elements to be clustered; the clusters are then the subsets of elements descending from each tree node, and the size of any cluster is its number of elements. For each edge of the input graph, let denote the weight of edge and let denote the smallest cluster of a given clustering that contains both and . Then Dasgupta defines the cost of a clustering to be[1]

The optimal clustering for this objective is NP-hard to find. However, it is possible to find a clustering that approximates the minimum value of the objective in polynomial time by a divisive (top-down) clustering algorithm that repeatedly subdivides the elements using an approximation algorithm for the sparsest cut problem, the problem of finding a partition that minimizes the ratio of the total weight of cut edges to the total number of cut pairs.[1] Equivalently, for purposes of approximation, one may minimize the ratio of the total weight of cut edges to the number of elements on the smaller side of the cut. Using the best known approximation for the sparsest cut problem, the approximation ratio of this approach is .[3]

References

edit
  1. ^ a b c Dasgupta, Sanjoy (2016), "A cost function for similarity-based hierarchical clustering", Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2016), New York, New York: ACM, pp. 118–127, arXiv:1510.05043, doi:10.1145/2897518.2897527, MR 3536559, S2CID 2262168
  2. ^ Cohen-Addad, Vincent; Kanade, Varun; Mallmann-Trenn, Frederik; Mathieu, Claire (2018), "Hierarchical clustering: objective functions and algorithms", Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2018), Philadelphia, Pennsylvania: Society for Industrial and Applied Mathematics, pp. 378–397, arXiv:1704.02147, doi:10.1137/1.9781611975031.26, MR 3775814
  3. ^ Arora, Sanjeev; Rao, Satish; Vazirani, Umesh (2009), "Expander flows, geometric embeddings and graph partitioning", Journal of the ACM, 56 (2): A5:1–A5:37, doi:10.1145/1502793.1502794, MR 2535878, S2CID 263871111