The Dunn index (DI), introduced by J. C. Dunn in 1974, is a metric for evaluating clustering algorithms.[1][2] It belongs to the same group of validity indices as the Davies–Bouldin index and the silhouette index, in that it is an internal evaluation scheme: the result is based only on the clustered data itself. Like all such indices, its aim is to identify sets of clusters that are compact, with small variance between members of a cluster, and well separated, with the means of different clusters sufficiently far apart relative to the within-cluster variance. For a given assignment of clusters, a higher Dunn index indicates better clustering. A drawback of the index is its computational cost, which grows with the number of clusters and the dimensionality of the data.

Preliminaries

There are many ways to define the size or diameter of a cluster: it could be the distance between the two farthest points inside the cluster, the mean of all pairwise distances between data points inside the cluster, or the mean distance of each data point from the cluster centroid. Each of these formulations is shown mathematically below (a code sketch follows the list).

Let $C_i$ be a cluster of vectors, and let $x$ and $y$ be any two $n$-dimensional feature vectors assigned to the same cluster $C_i$.

  $\Delta_i = \max_{x, y \in C_i} d(x, y)$, which calculates the maximum distance (the version proposed by Dunn).
  $\Delta_i = \frac{1}{|C_i| \, (|C_i| - 1)} \sum_{x, y \in C_i,\, x \neq y} d(x, y)$, which calculates the mean distance between all pairs.
  $\Delta_i = \frac{\sum_{x \in C_i} d(x, \mu)}{|C_i|}$, with centroid $\mu = \frac{1}{|C_i|} \sum_{x \in C_i} x$, which calculates the mean distance of all points from the cluster centroid.
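
Below is a minimal NumPy sketch of these three diameter definitions. The array layout (each cluster stored as an (n_points, n_dims) array), the Euclidean metric and the function names are assumptions made for illustration, not part of the definition.

```python
import numpy as np

def diameter_max(cluster: np.ndarray) -> float:
    """Maximum pairwise distance within the cluster (the version proposed by Dunn)."""
    diffs = cluster[:, None, :] - cluster[None, :, :]
    return float(np.sqrt((diffs ** 2).sum(-1)).max())

def diameter_mean_pairwise(cluster: np.ndarray) -> float:
    """Mean distance over all pairs of distinct points in the cluster."""
    n = len(cluster)
    dists = np.sqrt(((cluster[:, None, :] - cluster[None, :, :]) ** 2).sum(-1))
    return float(dists.sum() / (n * (n - 1)))  # diagonal entries are zero

def diameter_centroid(cluster: np.ndarray) -> float:
    """Mean distance of the cluster's points from its centroid."""
    centroid = cluster.mean(axis=0)
    return float(np.sqrt(((cluster - centroid) ** 2).sum(-1)).mean())
```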

The same can be said about the intercluster distance, where similar formulations can be made, using for example the closest two data points, one in each cluster (used by Dunn), the farthest two, or the distance between the centroids. The definition of the index admits any such formulation, and the family of indices so formed are called Dunn-like indices. Let $\delta(C_i, C_j)$ be this intercluster distance metric between clusters $C_i$ and $C_j$ (a sketch of two common choices follows).
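
As a companion to the diameter sketch above, here is a hedged sketch of two of these separation choices, the closest pair of points (Dunn's choice) and the centroid distance, under the same assumed array layout and Euclidean metric; the function names are again illustrative.

```python
import numpy as np

def separation_single_linkage(ci: np.ndarray, cj: np.ndarray) -> float:
    """Distance between the closest pair of points, one from each cluster (used by Dunn)."""
    diffs = ci[:, None, :] - cj[None, :, :]
    return float(np.sqrt((diffs ** 2).sum(-1)).min())

def separation_centroids(ci: np.ndarray, cj: np.ndarray) -> float:
    """Distance between the two cluster centroids."""
    return float(np.linalg.norm(ci.mean(axis=0) - cj.mean(axis=0)))
```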

Definition

With the above notation, if there are $m$ clusters, then the Dunn index for the set is defined as:

  $DI_m = \dfrac{\min_{1 \le i < j \le m} \delta(C_i, C_j)}{\max_{1 \le k \le m} \Delta_k}$

where $\delta(C_i, C_j)$ is the inter-cluster distance between the clusters $C_i$ and $C_j$, while $\Delta_k$ is the within-cluster distance of cluster $C_k$, e.g. the maximum distance within the cluster when following Dunn's original definition.
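
A self-contained sketch of this ratio under Dunn's original choices (closest-pair separation in the numerator, maximum diameter in the denominator) could look as follows. The list-of-arrays input format and the Euclidean distance are assumptions for illustration.

```python
import numpy as np

def pairwise_dists(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Euclidean distances between every row of a and every row of b."""
    return np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))

def dunn_index(clusters: list[np.ndarray]) -> float:
    """Dunn index with closest-pair separation and maximum-diameter compactness."""
    # Denominator: the largest within-cluster diameter over all m clusters.
    max_diam = max(pairwise_dists(c, c).max() for c in clusters)
    # Numerator: the smallest separation between any two distinct clusters.
    min_sep = min(
        pairwise_dists(ci, cj).min()
        for i, ci in enumerate(clusters)
        for cj in clusters[i + 1:]
    )
    return float(min_sep / max_diam)
```

Swapping in any of the other diameter or separation formulations from the Preliminaries section yields one of the Dunn-like indices.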

Explanation

Being defined in this way, the DI depends on $m$, the number of clusters in the set. If the number of clusters is not known a priori, the $m$ for which the DI is highest can be chosen as the number of clusters. There is also some flexibility in the definition of $d(x, y)$: any of the well-known metrics can be used, such as the Manhattan or Euclidean distance, depending on the geometry of the clustering problem. The formulation has a peculiar problem: because the denominator contains a 'max' term rather than an average, a single badly behaved cluster makes the Dunn index for the whole set of clusters uncharacteristically low, even if the other clusters are tightly packed. The index is thus a worst-case indicator, and this has to be kept in mind. Ready-made implementations of the Dunn index exist for vector-based environments such as MATLAB, R and Apache Mahout.[3][4][5]
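
As an illustration of choosing the number of clusters this way, the following sketch fits k-means for several candidate values of m and keeps the one with the highest Dunn index. The use of scikit-learn's KMeans, the candidate range and the helper names are assumptions made for this example and are unrelated to the cited MATLAB, R and Mahout implementations.

```python
import numpy as np
from sklearn.cluster import KMeans

def _dunn(clusters):
    """Closest-pair separation over maximum diameter (Dunn's original form)."""
    d = lambda a, b: np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))
    max_diam = max(d(c, c).max() for c in clusters)
    min_sep = min(d(ci, cj).min()
                  for i, ci in enumerate(clusters) for cj in clusters[i + 1:])
    return min_sep / max_diam

def pick_m_by_dunn(data: np.ndarray, candidate_ms=range(2, 11)) -> int:
    """Return the candidate cluster count whose k-means fit maximises the Dunn index."""
    best_m, best_di = None, -np.inf
    for m in candidate_ms:
        labels = KMeans(n_clusters=m, n_init=10).fit_predict(data)
        clusters = [data[labels == c] for c in range(m)]
        di = _dunn(clusters)
        if di > best_di:
            best_m, best_di = m, di
    return best_m
```

Because all pairwise distances are recomputed for every candidate m, this brute-force search also illustrates the computational cost mentioned in the introduction.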

Notes and references

  1. ^ Dunn, J. C. (1973-09-17). "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters". Journal of Cybernetics. 3 (3): 32–57. doi:10.1080/01969727308546046. S2CID 120919314.
  2. ^ Dunn, J. C. (1973-09-01). "Well-Separated Clusters and Optimal Fuzzy Partitions". Journal of Cybernetics. 4 (1) (published 1974): 95–104. doi:10.1080/01969727408546059. ISSN 0022-0280.
  3. ^ "MATLAB implementation of the Dunn Index". Retrieved 5 December 2011.
  4. ^ Lukasz, Nieweglowski. "Package 'clv'" (PDF). R project. CRAN. Retrieved 2 April 2013.
  5. ^ "Apache Mahout". Apache Software Foundation. Retrieved 9 May 2013.