4.2 Setting the Number of Clusters
It has been assumed so far that the number of clusters k is known, which is not always the case. The most common way of establishing the proper value of k is to use some partition quality measure m(k) and to choose the value of the parameter k which optimises this measure. Nonetheless, it must be emphasized that in the majority of cases the values of quality indicators depend on the concrete data.
The fact that for particular data a certain indicator allows for establishing the proper number of clusters does not mean that it will also indicate the proper value of k for different data. For this reason, various methods are applied, which can be divided into the following groups [515]:
(a) Data visualisation. In this approach a projection of the multidimensional data onto a two- or three-dimensional space is applied. Typical representatives of this direction are principal component analysis (PCA) and multidimensional scaling.4
(b) Optimisation of some criterion function characterising the properties of mixtures of probability distributions. For instance, the EM algorithm, discussed in Sect. 3.2, optimises the parameters θ of the mixture for a given value of k. The value of k for which the quality index attains its optimum is taken to be the probable number of clusters. The typical indicators in this domain are listed below; a short computational sketch illustrating two of them follows the list.
– Akaike information criterion

  AIC(k) = \frac{1}{m}\left[-2\left(m - 1 - n_k - \frac{k}{2}\right) l(\theta) + 3 n_p\right]

  where n_k denotes the number of cluster parameters, n_p the total number of parameters being optimised, and l(θ) the logarithm of the likelihood function.
– Bayesian criterion

  BIC(k) = l(\theta) - \frac{n_p}{2}\,\ln(n)
– Variance based criteria, for example maximising a Fisher-like quotient of the between-cluster variance and the within-cluster variance
– Comparison of within-cluster cohesion and between-cluster separation, for example the point-biserial correlation (Pearson's correlation between the vector of distances between pairs of elements and the vector of indicators assigning to each pair the value 1 if both elements belong to the same cluster and 0 otherwise), or the silhouette width s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}, where a(i) is the average distance between i and all other entities of its cluster, and b(i) is the minimum of the average distances between i and the entities of each other cluster. A silhouette width close to −1 means that the entity is misclassified; if all silhouette width values are close to 1, the set is well clustered.

4 Cf. e.g. I. Borg and P.J.F. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer, 2005.
– Consensus based methods. They assume that under the “right” number of clusters the partitions generated by randomised clustering algorithms should vary less than under a wrong number of clusters.
– Resampling methods are in spirit similar to consensus based ones, except that the variation in the partitions results not from the non-determinism of the algorithm but from the random process of sample formation (subsampling, bootstrapping, random division into training and testing subsets).
– Hierarchical methods. Hierarchical clustering (divisive or agglomerative algorithms) is applied until a stopping criterion (too small an improvement, or a worsening) occurs. The resulting number of clusters is assumed to be the right one and is used as a parameter for partitional methods. An interesting example here is the “intelligent k-means” (ik-means) algorithm,5 designed as a special initialisation for k-means clustering: it takes the sample centre and the most remote point as starting cluster centres for a 2-means run in which only the second centre is updated until stabilisation. Thereafter the second cluster is removed from the sample and the process is repeated without changing the position of the very first cluster centre. The resulting clusters are then post-processed by rejecting the smallest ones (of cardinality 1); a sketch of this initialisation is given below, after the remark on spectral methods. The number of clusters is known to be overestimated in this process. Still another approach (purely divisive) is used in the x-means algorithm.6
(c) Heuristic methods.
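To make the indicators of group (b) more concrete, the sketch below evaluates two of them over a range of candidate values of k: the BIC of a fitted Gaussian mixture and the average silhouette width of a k-means partition. This is only an illustration under stated assumptions (it relies on scikit-learn and NumPy, and the toy dataset and names such as candidate_ks are ours); note also that scikit-learn's bic() follows the convention in which smaller values are better, the mirror image of the maximisation form given above.

```python
# Sketch: comparing candidate numbers of clusters with two indicators from
# group (b): the BIC of a Gaussian mixture and the average silhouette width.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # toy data
candidate_ks = range(2, 11)

bic_values, silhouette_values = [], []
for k in candidate_ks:
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic_values.append(gmm.bic(X))  # smaller is better in this convention
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    silhouette_values.append(silhouette_score(X, labels))  # closer to 1 is better

print("k suggested by BIC:", list(candidate_ks)[int(np.argmin(bic_values))])
print("k suggested by silhouette:", list(candidate_ks)[int(np.argmax(silhouette_values))])
```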
The spectral methods, which are discussed in the next chapter, also provide tools for deciding on the number of clusters.
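The ik-means initialisation described in the hierarchical-methods item above can be sketched as follows. This is an illustrative reading of that description (repeatedly extracting an “anomalous” cluster around the point most remote from the fixed grand centre and finally discarding singleton clusters), not a reference implementation; the function name ik_means_init and the min_size parameter are ours.

```python
# Sketch of the ik-means initialisation: repeatedly run a 2-means in which one
# centre is fixed at the grand mean and only the second ("anomalous") centre is
# updated; remove the extracted cluster and repeat; finally drop tiny clusters.
# The surviving centres give the suggested k and the seeds for a k-means run.
import numpy as np

def ik_means_init(X, min_size=2, max_iter=100):
    grand_centre = X.mean(axis=0)             # fixed reference centre
    remaining = X.copy()
    centres, sizes = [], []
    while len(remaining) > 0:
        # seed: the point most remote from the grand centre
        centre = remaining[np.argmax(np.linalg.norm(remaining - grand_centre, axis=1))]
        for _ in range(max_iter):
            to_centre = np.linalg.norm(remaining - centre, axis=1)
            to_grand = np.linalg.norm(remaining - grand_centre, axis=1)
            mask = to_centre < to_grand        # points captured by the anomalous centre
            if not mask.any():
                mask = to_centre == to_centre.min()  # keep at least the seed point
            new_centre = remaining[mask].mean(axis=0)
            if np.allclose(new_centre, centre):
                break
            centre = new_centre
        centres.append(centre)
        sizes.append(int(mask.sum()))
        remaining = remaining[~mask]           # remove the extracted cluster
    kept = [c for c, s in zip(centres, sizes) if s >= min_size]  # drop singletons
    return np.array(kept)                      # rows are seeds; len(kept) suggests k

# seeds = ik_means_init(X); k = len(seeds)     # then run k-means from these seeds
```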
Generally, it must be said that, according to many researchers, determining the proper number of clusters is the basic issue for the credibility of cluster analysis results. An excessive number of clusters leads to results which are non-intuitive and difficult to interpret, while too small a value of k results in information loss and wrong decisions.
5 M. Ming-Tso Chiang and B. Mirkin: Intelligent Choice of the Number of Clusters in K-Means Clustering: An Experimental Study with Different Cluster Spreads. Journal of Classification 27 (2009).
6 D. Pelleg and A. Moore: X-means: Extending k-means with efficient estimation of the number of clusters. Proc. 17th International Conf. on Machine Learning, 2000.
Fig. 4.1 Dependence of the average cluster radius (left figure) and of the total distance of objects from the prototypes (right figure) on the number of clusters
4.2.1 Simple Heuristics
One of the simplest rules says that7

k \approx \sqrt{m/2} \qquad (4.2)
Another, the so-called Elbow Method,8 consists in examining the fraction of explained variance as a function of the number of clusters. According to this heuristic we take for k such a number of clusters that adding yet another cluster increases this fraction only slightly. The decision is based on a diagram: on the abscissa one puts consecutive values of k, and on the ordinate the corresponding percentage of explained variance. The point where the curve bends (the “elbow”) marks the candidate value of k. The percentage of explained variance is understood as the ratio of the between-group variance to the total variance of the whole data set.
However, one should remember that locating the inflection point is not always unambiguous. Another variant of this method consists in examining the variability of the average cluster radius depending on the number of clusters k. The radius r_j of the cluster C_j is defined as the maximum distance of a point from C_j to the prototype µ_j. The mean value of the radii of all the clusters is the average radius r. A typical diagram of the dependence of r on the number of clusters k is presented in Fig. 4.1. Usually, this kind of diagram is plotted for multiples of the value of k, e.g., as in Fig. 4.1, for k = 2, 4, 8, .... If the examined value changes only slightly over the interval {k, 2k}, then as the number of clusters we select a value k* from this interval [402, Sect. 7.3.3].
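As an illustration of the elbow heuristic, the sketch below computes the fraction of explained variance, taken as 1 − W(k)/T with W(k) the within-cluster sum of squares and T the total sum of squares, for consecutive values of k, and stops when the gain from adding a cluster becomes small. The helper name elbow_k and the cut-off gain_threshold are arbitrary choices of this example; scikit-learn and NumPy are assumed.

```python
# Sketch of the elbow heuristic: explained-variance fraction as a function of k,
# with an (arbitrary) cut-off on the gain obtained by adding one more cluster.
import numpy as np
from sklearn.cluster import KMeans

def elbow_k(X, k_max=15, gain_threshold=0.05):
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()        # total sum of squares T
    explained = []
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        explained.append(1.0 - km.inertia_ / total_ss)  # 1 - W(k)/T
    for k in range(1, k_max):
        # going from k to k+1 clusters improves the fraction only slightly -> take k
        if explained[k] - explained[k - 1] < gain_threshold:
            return k, explained
    return k_max, explained
```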
7 Cf. K. Mardia et al. Multivariate Analysis. Academic Press, 1979, p. 365.
8 The basic work of reference, usually cited when this method is discussed, is R.L. Thorndike, “Who Belongs in the Family?”, Psychometrika 18 (4), 1953. Some modification of this method was applied in the paper C. Goutte, P. Toft, E. Rostrup, F.A. Nielsen, L.K. Hansen, “On clustering fMRI time series”, NeuroImage 9 (3): 298–310 (March 1999).
A more formal approach to this problem, together with a survey of other methods, can be found in [459]. The authors propose there the so-called gap statistic, measuring the change of the within-group variance W(k, m), determined after partitioning the m-element set into k clusters (using a chosen clustering algorithm), in relation to the expected variance obtained from an m-element sample coming from a reference distribution. It is assumed that the value of k for which the gap is maximal is the probable estimate of the number of clusters.
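A minimal sketch of the gap statistic follows. It compares log W(k) computed on the data with its average over reference samples drawn uniformly from the bounding box of the data; for brevity it simply returns the k with the largest gap, whereas [459] uses a selection rule involving the standard error of the reference values. The function name and the number of reference samples are choices of this example.

```python
# Sketch of the gap statistic: gap(k) = mean_ref[log W_ref(k)] - log W(k),
# with reference samples drawn uniformly within the bounding box of the data.
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic_k(X, k_max=10, n_refs=10, random_state=0):
    rng = np.random.default_rng(random_state)
    lo, hi = X.min(axis=0), X.max(axis=0)

    def log_w(data, k):
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(data)
        return np.log(km.inertia_)            # log of within-group sum of squares

    gaps = []
    for k in range(1, k_max + 1):
        ref = [log_w(rng.uniform(lo, hi, size=X.shape), k) for _ in range(n_refs)]
        gaps.append(np.mean(ref) - log_w(X, k))
    return int(np.argmax(gaps)) + 1, gaps     # k with the maximal gap
```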
4.2.2 Methods Consisting in the Use of Information Criteria
For determining the value of k, one applies the Bayesian information criterion (BIC), already mentioned, the Akaike information criterion (AIC), or, finally, the so-called deviance information criterion (DIC), which is a generalisation of both previous criteria.
Reference [449] proposes yet another method. For a given partition of the set X into k clusters (knowing their dimensions) one determines an averaged Mahalanobis distance (cf. Sect. 2.2)

d(k) = \frac{1}{n}\,\min_{j=1,\dots,k} d(X, \mu_j) \qquad (4.3)

where d(X, µ_j) denotes the Mahalanobis distance between the elements of the set X and the j-th centre, obtained e.g. by using the k-means algorithm.
Subsequently, we determine the jump

J_k = d^{-\alpha}(k) - d^{-\alpha}(k-1) \qquad (4.4)

where d^{-\alpha}(0) = 0 and \alpha = n/2. The value

k^* = \arg\max_{j} J_j \qquad (4.5)
is assumed to be the proper number of clusters.
In particular, when X is a set without any internal structure, then k* = 1. In practice, the value of d(k) is approximated by the minimum sum of squared errors (3.1), computed by running the k-means algorithm.
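The procedure can be sketched as follows, approximating d(k) by the k-means sum of squared errors averaged per element and per dimension (i.e. taking the Euclidean distance in place of the Mahalanobis one, which amounts to assuming an identity covariance), with α = n/2 for n-dimensional data. This is an illustrative reading of the method of [449], not the authors' code; scikit-learn and NumPy are assumed.

```python
# Sketch of the jump method: transform the distortion d(k) with the power
# -alpha (alpha = n/2) and pick the k with the largest jump J_k.
import numpy as np
from sklearn.cluster import KMeans

def jump_method_k(X, k_max=10, random_state=0):
    m, n = X.shape
    alpha = n / 2.0
    transformed = np.zeros(k_max + 1)           # transformed[0] = d^{-alpha}(0) = 0
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        d_k = km.inertia_ / (m * n)             # averaged squared distance to the closest centre
        transformed[k] = d_k ** (-alpha)
    jumps = transformed[1:] - transformed[:-1]  # J_k = d^{-alpha}(k) - d^{-alpha}(k-1)
    return int(np.argmax(jumps)) + 1            # k* = arg max_k J_k
```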
4.2.3 Clustergrams
By clustergrams we understand diagrams in a parallel coordinate system in which a vector is assigned to each observation [420]. The components of this vector correspond to the cluster membership of the observation for consecutive numbers of clusters. The diagrams thus obtained allow one to observe how the assignment of objects to clusters changes as the number of clusters increases.
The author of the idea claims that clustergrams are useful not only in the partitioning methods of cluster analysis but also in hierarchical grouping, when the number of observations increases. The code (in the R language) of a programme generating clustergrams is available at http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/.
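A rough Python sketch of the construction is given below (the code referenced above is in R). Plotting the first principal component of the cluster centres on the vertical axis is only one possible way of arranging the clusters; that choice, like the function name clustergram, is ours.

```python
# Sketch of a clustergram: for k = 1..k_max cluster the data and, for every
# observation, plot a 1-D summary of its cluster centre against k, joining the
# points belonging to the same observation with a line.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def clustergram(X, k_max=8, random_state=0):
    m = X.shape[0]
    pca = PCA(n_components=1).fit(X)           # 1-D summary used for the vertical axis
    y = np.zeros((m, k_max))                   # y[i, k-1]: projected centre of i's cluster at level k
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        centre_proj = pca.transform(km.cluster_centers_).ravel()
        y[:, k - 1] = centre_proj[km.labels_]
    for i in range(m):
        plt.plot(range(1, k_max + 1), y[i], color="grey", alpha=0.2)
    plt.xlabel("number of clusters k")
    plt.ylabel("projection of the cluster centre")
    plt.show()
```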
Internal indexes such as the Bayesian information criterion (BIC) and the sum-of-squares index encounter difficulties in finding the inflection point of the respective curves, and so methods of detecting this point through BIC in partition-based clustering are proposed in the mentioned study. A new sum-of-squares based index is also proposed there, whose minimal value is considered to correspond to the optimal number of clusters. External indexes, on the other hand, need a reference clustering or ground-truth information on the data and therefore cannot be used for cluster validation when such information is unavailable. Consequently, an external index was extended into an internal index, so as to determine the number of clusters by introducing a re-sampling method.
4.2.4 Minimal Spanning Trees
Several methods of determining the number of clusters refer to spanning a minimum weight tree. One of them, proposed by Galluccio et al. [196], is based on the assumption that the clusters one is looking for are dense regions in the sample space, surrounded by low frequency areas. Consequently, when constructing the minimum weight spanning tree (MST) using Prim's algorithm, one will observe the “low frequency” areas in the so-called Prim trajectory. Assume we proceed as follows when constructing the MST: pick a node as the initialisation of the partial tree and consider the remaining nodes as non-tree nodes. In each step t = 1, ..., m − 1, where m is the total number of nodes, look for the pair of nodes, one from the tree and one from outside it, connected by the edge of minimum weight. Define g(t) as this minimum weight found in step t. Remove the respective node from the non-tree set and add it to the tree. If one plots g(t), one will observe “valleys” of low g values, separated by peaks of distances. After some filtering, the number of such valleys is the number of clusters. The centre of the set of elements constituting a valley is a candidate cluster mean (a seed for subsequent k-means iterations). The filtering consists in statistically testing whether a sequence of low distance values can be considered a high density area, that is, whether the sequence, together with the surrounding peaks of g, is unlikely to come from a uniform distribution.
The method works fine if the assumptions are met, but it may fail wherever single linkage clustering methods have difficulties. It differs from single linkage by statistically testing whether the tree may be “broken”, that is, whether it fails to conform to the hypothesis that the points were generated by a Poisson process with uniform intensity.
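The Prim trajectory itself is easy to compute. The sketch below builds it and counts the valleys with a crude fixed threshold; this threshold stands in for the statistical test described above and, like the function names, is an arbitrary choice of the example.

```python
# Sketch of the Prim trajectory g(t): at each step of Prim's MST construction
# record the weight of the edge that is added. Valleys of low g values separated
# by high peaks hint at clusters; a simple threshold replaces the statistical
# filtering described in the text.
import numpy as np

def prim_trajectory(X):
    m = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    in_tree = np.zeros(m, dtype=bool)
    in_tree[0] = True
    best_dist = dist[0].copy()      # distance from each non-tree node to the tree
    g = []
    for _ in range(m - 1):
        best_dist[in_tree] = np.inf
        j = int(np.argmin(best_dist))          # nearest non-tree node
        g.append(best_dist[j])                 # g(t): weight of the edge added at step t
        in_tree[j] = True
        best_dist = np.minimum(best_dist, dist[j])
    return np.array(g)

def count_valleys(g, threshold=None):
    if threshold is None:
        threshold = g.mean()                   # arbitrary cut-off for this sketch
    low = g < threshold
    return int(np.sum(low[1:] & ~low[:-1]) + low[0])  # number of maximal low runs

# g = prim_trajectory(X); k_estimate = count_valleys(g)
```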