Algorithms of Combinatorial Cluster Analysis
3.1.5 Variants of the k-means Algorithm
3.1.5.5 Kernel Based k-means Algorithm
The idea of the algorithm is to switch to a multidimensional feature space $\mathcal{F}$ and to search therein for prototypes $\mu_j$ minimizing the error
$$J_2(U) = \sum_{i=1}^{m} \min_{1\le j\le k} \|\Phi(x_i) - \mu_j\|^2 \qquad (3.25)$$
where $\Phi:\mathbb{R}^n \to \mathcal{F}$ is a non-linear mapping of the space $X$ into the feature space.
This is the second variant of the application of kernel functions in cluster analysis, according to the classification from Sect. 2.5.7. In many cases, switching to the space $\mathcal{F}$ allows one to reveal clusters that are not linearly separable in the original space.²⁰
²⁰ Dhillon et al. [149] suggest, for example, to use a weighted kernel k-means to discover clusters in graphs analogous to the ones obtained from a graph by normalised cuts. They optimise a weighted version of the cost function, that is
$$J_2(U) = \sum_{i=1}^{m} \min_{1\le j\le k} w(x_i)\,\|\Phi(x_i) - \mu_j\|^2$$
where
$$\mu_j = \frac{1}{\sum_{x_i\in C_j} w(x_i)} \sum_{x_i\in C_j} w(x_i)\,\Phi(x_i)$$
In analogy to the classical k-means algorithm, the prototype vectors are updated according to the equation
$$\mu_j = \frac{1}{n_j}\sum_{x_i\in C_j}\Phi(x_i) = \frac{1}{n_j}\sum_{i=1}^{m} u_{ij}\,\Phi(x_i) \qquad (3.26)$$
where $n_j = \sum_{i=1}^{m} u_{ij}$ is the cardinality of the $j$-th cluster, and $u_{ij}$ is the function allocating objects to clusters, i.e. $u_{ij}=1$ when $x_i$ is an element of the $j$-th cluster and $u_{ij}=0$ otherwise. A direct application of this equation is not possible, because the function $\Phi$ is not known. In spite of this, it is possible to compute the distances between the object images and the prototypes in the feature space, making use of Eq. (2.50). The reasoning runs as follows:
$$\begin{aligned}
\|\Phi(x_i)-\mu_j\|^2 &= \bigl(\Phi(x_i)-\mu_j\bigr)^T\bigl(\Phi(x_i)-\mu_j\bigr)\\
&= \Phi(x_i)^T\Phi(x_i) - 2\,\Phi(x_i)^T\mu_j + \mu_j^T\mu_j\\
&= \Phi(x_i)^T\Phi(x_i) - \frac{2}{n_j}\sum_{h=1}^{m} u_{hj}\,\Phi(x_i)^T\Phi(x_h) + \frac{1}{n_j^2}\sum_{r=1}^{m}\sum_{s=1}^{m} u_{rj}u_{sj}\,\Phi(x_r)^T\Phi(x_s) \qquad (3.27)\\
&= k_{ii} - \frac{2}{n_j}\sum_{h=1}^{m} u_{hj}k_{hi} + \frac{1}{n_j^2}\sum_{r=1}^{m}\sum_{s=1}^{m} u_{rj}u_{sj}k_{rs}
\end{aligned}$$
where, as in Sect. 2.5.7, $k_{ij} = \Phi(x_i)^T\Phi(x_j) = K(x_i, x_j)$. If we denote by $u_j$ the $j$-th column of the matrix $U$ and substitute $\tilde{u}_j = u_j/\|u_j\|_1$, then the above equation can be rewritten in a closed form:
$$\|\Phi(x_i)-\mu_j\|^2 = k_{ii} - 2\,(\tilde{u}_j^T K)_i + \tilde{u}_j^T K\,\tilde{u}_j \qquad (3.28)$$
where $K$ is the Gram matrix with elements $k_{ij}$, and the scalar $(\tilde{u}_j^T K)_i$ denotes the $i$-th component of the vector $\tilde{u}_j^T K$.
In this way, one can update the elements of the matrix $U$ without determining the prototypes explicitly. This is possible because $x_i$ is assigned to the cluster minimising the above-mentioned distance, i.e.
$$u_{ij} = \begin{cases} 1 & \text{if } \|\Phi(x_i)-\mu_j\|^2 = \min_{1\le t\le k}\|\Phi(x_i)-\mu_t\|^2\\ 0 & \text{otherwise}\end{cases} \qquad (3.29)$$
(Footnote 20 continued)
The application of kernel k-means liberates one from the necessity of computing the eigenvalues and eigenvectors needed in spectral clustering of graphs.
Though it is not possible to use Eq. (3.26) directly, one can determine (in the original feature space $X$) the approximate cluster prototypes by assuming $\mu_j$ to be the object $x_{i_j}$ matching the condition [533]
$$x_{i_j} = \arg\min_{x_i\in C_j} \|\Phi(x_i)-\mu_j\|^2 \qquad (3.30)$$
Prototypes defined in this way are in fact medoids (see the next section). Another method of determining the prototypes is presented in Sect. "KFCM-F Algorithm" on page 121.
The algorithm is summarized in the form of the pseudo-code 3.6. It needs to be stressed that, like the k-means algorithm, its kernel-based variant is also sensitive to initialisation, i.e. to the 2nd step of the algorithm. The simplest way to initialise it is to assign k−1 (randomly selected) objects to k−1 different clusters and to allocate the remaining m−k+1 objects to the k-th cluster. Another method is to assume the partition returned by the classical k-means algorithm. This guarantees that at least a part of the objects will belong to the proper clusters, and Algorithm 3.6 will only modify erroneous assignments.
Algorithm 3.6 Kernel-based k-means algorithm
Input: Data set $X$, number of clusters $k$.
Output: A partition defined by the set of gravity centres $\{\mu_1, \ldots, \mu_k\}$.
1: Choose a kernel function $K$ and compute the elements of the Gram matrix $k_{ij} = K(x_i, x_j)$ for the set of objects $\{x_1, \ldots, x_m\}$.
2: Compute the initial allocation of elements among clusters.
3: For each object $x_i \in X$ compute its distance from each cluster by applying Eq. (3.27), and assign $x_i$ to the closest cluster.
4: Repeat step (3) until none of the $u_{ij}$ values changes.
5: Determine the approximate prototypes according to Eq. (3.30).
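Equations (3.28)–(3.30) translate almost literally into code operating on the Gram matrix alone. The sketch below is a minimal Python/NumPy illustration of Algorithm 3.6; the function names and the choice of a Gaussian kernel with σ = 1 (as in Fig. 3.5) are our own illustrative assumptions, not part of the algorithm itself.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix of the Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kernel_kmeans(K, k, max_iter=100, rng=None):
    """Kernel-based k-means on a precomputed Gram matrix K (m x m).

    Distances to the implicit prototypes are evaluated via Eq. (3.28),
    assignments via Eq. (3.29); Eq. (3.30) yields approximate prototypes.
    """
    rng = np.random.default_rng(rng)
    m = K.shape[0]
    labels = rng.integers(0, k, size=m)              # step 2: initial allocation
    dist = np.zeros((m, k))
    for _ in range(max_iter):                        # steps 3-4
        for j in range(k):
            u = (labels == j).astype(float)
            u /= max(u.sum(), 1.0)                   # u_j / ||u_j||_1
            # ||Phi(x_i) - mu_j||^2 = k_ii - 2 (u^T K)_i + u^T K u
            dist[:, j] = np.diag(K) - 2.0 * (K @ u) + u @ K @ u
        new_labels = dist.argmin(axis=1)             # Eq. (3.29)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    # Eq. (3.30): approximate the prototype of cluster j by the member of C_j
    # closest (in feature space) to the implicit centre mu_j
    prototypes = [int(np.argmin(np.where(labels == j, dist[:, j], np.inf)))
                  for j in range(k)]
    return labels, prototypes
```

For a data set like that of Fig. 3.5 one would call, for instance, `kernel_kmeans(gaussian_gram(X, sigma=1.0), k=2)`; note that the object coordinates are used only to build the Gram matrix.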
The presented algorithm correctly clusters data similar to that shown in Fig. 3.5, and, as reported by the authors of [288], even in the case of such sets as iris one gets better clustering performance (compared to the k-means algorithm). An on-line version of this algorithm was presented by Schölkopf, Smola and Müller in [419], and its modifications are presented in [194].

Fig. 3.5 Results of application of the kernel-based variant of the k-means algorithm. Object-to-cluster allocation is indicated by different colours; σ = 1 was assumed
3.1.5.6 k-medoids Algorithm
The k-means algorithm is formally applicable only if the dissimilarity between pairs of objects is equal to the square of their Euclidean distance. This means that the features describing the object properties must be measured on a quantitative scale, e.g. on the ratio scale. In such a case, the objects can be represented as $n$-dimensional vectors with real-valued components. Besides this, the partition cost, given by Eq. (3.1), depends on the maximal distances between the prototype $\mu_i$ and the objects of the cluster represented by this prototype. Hence, the k-means algorithm is not resistant to the presence of outliers. Furthermore, the centres $\mu_i$ are abstract objects not belonging to the set $X$. To avoid these disadvantages, one may measure the partition cost as follows:
$$J_{med}(p_1, \ldots, p_k) = \sum_{i=1}^{m} \min_{1\le j\le k} d(x_i, p_j) \qquad (3.31)$$
where $p_1, \ldots, p_k \in X$ are cluster centres, also called prototypes, examples, exemplars, medoids or just centres, and $d: X\times X\to\mathbb{R}$ is a dissimilarity measure. This measure need not be symmetric, nor even a metric (as required for the Euclidean distance or its generalisations).
The k-medoids (or k-centres) algorithm aims at such a choice of centres $\{p_1, \ldots, p_k\}\subset X$ for which the index $J_{med}$ reaches its minimum. By introducing the dissimilarity measure $d_{ij} = d(x_i, p_j)$ we broaden the applicability of the algorithm, requiring only that for each pair of objects from the set $X$ it is possible to compute their dissimilarity. Working with a dissimilarity matrix has an additional advantage: its dimension depends on the number of objects only.
As the object representation stops playing any role here, practical implementations of the algorithm make use only of the vector $c$ with elements $c_i$, indicating to which example (and in this way to which cluster) the $i$-th object belongs. More precisely:
$$c_i = \begin{cases} j & \text{if } x_i\in C_j\\ i & \text{if } x_i \text{ is an exemplar}\end{cases} \qquad (3.32)$$
An elementary implementation of the k-medoids algorithm is demonstrated by the pseudo-code 3.7, see e.g. [237, sec. 14.3.10].
Algorithm 3.7 k-medoids algorithm
Input: Dissimilarity matrix $D = [d_{ij}]_{m\times m}$, number of clusters $k$.
Output: Partition $C = \{C_1, \ldots, C_k\}$.
1: Select the subset $K\subset\{1, \ldots, m\}$. Its elements are pointers to examples (prototypes).
2: while (not termination condition) do
3: Assign objects to clusters using the rule
$$c_i = \begin{cases} \arg\min_{j\in K} d_{ij} & \text{if } i\notin K\\ i & \text{otherwise}\end{cases}, \quad i = 1,\ldots,m \qquad (3.33)$$
4: Update the examples, that is
$$j_r^{*} = \arg\min_{t':\,c_{t'}=r}\ \sum_{t:\,c_t=r} d_{tt'}, \quad r = 1,\ldots,k \qquad (3.34)$$
5: end while
6: If the index value remained unchanged after testing $m$ objects—stop. Otherwise return to step 2.
In the simplest case the examples are picked at random, but very good results can be achieved by adapting the methods from Sect. 3.1.3. In particular, one can start by selecting the k most dissimilar objects.
Equation (3.33) tells us that object $i$ is assigned to the least dissimilar example from the set $K$. On the other hand, Eq. (3.34) states that, for a set of objects sharing a common example, we select as the new example the object for which the sum of dissimilarities to the other objects of the cluster is lowest. Like the k-means algorithm, the k-medoids algorithm stops when the new examples are identical with those of the previous iteration.
In spite of its superficial simplicity, the algorithm is much more time-consuming than the k-means algorithm: determining a new example requires $O(|C_r|^2)$ operations, and the cluster allocation update requires $O(mk)$ comparisons.
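A compact Python/NumPy rendering of Algorithm 3.7 is sketched below. The function name and the purely random initialisation are our own illustrative choices; like the pseudo-code, the sketch operates only on the dissimilarity matrix $D$, and, following Eq. (3.33), every non-exemplar object stores the index of its exemplar.

```python
import numpy as np

def k_medoids(D, k, max_iter=100, rng=None):
    """Elementary k-medoids (Algorithm 3.7) on a dissimilarity matrix D (m x m)."""
    rng = np.random.default_rng(rng)
    m = D.shape[0]
    medoids = rng.choice(m, size=k, replace=False)       # step 1: initial exemplars
    for _ in range(max_iter):                            # steps 2-5
        # Eq. (3.33): each object points at its least dissimilar exemplar
        labels = medoids[np.argmin(D[:, medoids], axis=1)]
        labels[medoids] = medoids                        # exemplars point at themselves
        # Eq. (3.34): the new exemplar of a cluster minimises the sum of
        # dissimilarities to the remaining members of that cluster
        new_medoids = np.array([
            members[np.argmin(D[np.ix_(members, members)].sum(axis=0))]
            for members in (np.flatnonzero(labels == p) for p in medoids)
        ])
        if set(new_medoids) == set(medoids):             # exemplars unchanged: stop
            break
        medoids = new_medoids
    return medoids, labels
```

As in the pseudo-code, the object coordinates never enter the computation; only the dissimilarity matrix is needed.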
Kaufman and Rousseeuw initiated research on the elaboration of efficient methods of determining examples. In [284] they proposed two algorithms: PAM (Partitioning Around Medoids) and CLARA (Clustering LARge Applications). PAM follows the principle just described, that is, it seeks in X such medoid objects which minimise the index (3.1). This approach is quite expensive. The big data analysis algorithm CLARA uses several (usually five) samples consisting of 40+2k objects and applies the PAM algorithm to each of the samples to obtain a set of proposed sets of medoids. Each proposal is evaluated using the index (3.1); the set yielding the lowest value of the criterion function is chosen.
The next improvement was CLARANS (Clustering Large Applications based upon RANdomized Search) [375], an algorithm with quadratic complexity in the number of objects m. The algorithm constructs a graph of sets of k medoids; two graph nodes are connected by an edge if the corresponding sets differ by exactly one element. The neighbours are generated using random search techniques. Another variant of the k-centres algorithm was proposed in [385].
Fig. 3.6 Clustering of separable data sets with the k-medoids algorithm: (a) the set data3_2, (b) the set data6_2. Cluster examples are marked with red dots
In Fig. 3.6 the allocations of objects to examples for the data sets data3_2 and data6_2 are presented.
Remark 3.1.1 The k-medoids algorithm should not be confused with the k-medians algorithm, proposed to make the k-means algorithm resistant against outliers. In the k-medians algorithm the Euclidean distance is replaced with the Manhattan distance, assuming p = 1 in Eq. (3.2), [84, 273]. In such a case the cluster centres are determined from the equation
$$\sum_{x_i\in C_j}\frac{\partial}{\partial \mu_{jl}}\,|x_{il}-\mu_{jl}| = 0 \;\Rightarrow\; \sum_{x_i\in C_j}\operatorname{sgn}(x_{il}-\mu_{jl}) = 0$$
i.e. $\mu_{jl}$ are the medians of the respective components of the vectors $x_i$ assigned to the cluster $C_j$.
This algorithm plays an important role in operations research, in particular in the choice of location of service centres.²¹
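Both steps of a k-medians iteration thus reduce to elementary operations; the following minimal sketch (our own naming, assuming the data sit in a NumPy array and no cluster is empty) illustrates the L1 assignment and the component-wise median update of Remark 3.1.1.

```python
import numpy as np

def k_medians_assign(X, centres):
    """Assignment step: nearest centre in the Manhattan (L1) distance."""
    d = np.abs(X[:, None, :] - centres[None, :, :]).sum(axis=2)
    return d.argmin(axis=1)

def k_medians_update(X, labels, k):
    """Centre update: component-wise medians of each cluster's members
    (cf. Remark 3.1.1); assumes no cluster is empty."""
    return np.array([np.median(X[labels == j], axis=0) for j in range(k)])
```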
3.1.5.7 k-modes Algorithm
Huang presented in [263] an adaptation of the k-means algorithm for cases where the features describing the objects are measured on a nominal scale. Examples of such features are: sex, eye colour, hair colour, nationality etc. Each feature usually takes a value from a small value set; for example the value of the feature "eye colour" can be: "blue", "brown", "black" etc. By replacing these values with consecutive integers we can assign to each real object a concise representation in terms of a vector $x_i$, the elements $x_{il}$ of which are integers pointing at the proper value of the $l$-th feature.
²¹ See e.g. N. Mladenović, J. Brimberg, P. Hansen, J.A. Moreno-Pérez. The p-median problem: A survey of meta-heuristic approaches. European J. of Op. Res., 179(3), 2007, 927–939.
Now the dissimilarity of the pair of objects $i$ and $j$ is defined as (see Sect. 2.2.2):
$$d_{ij} = d(x_i, x_j) = \sum_{l=1}^{n}\delta(x_{il}, x_{jl}) \qquad (3.35)$$
where
$$\delta(x_{il}, x_{jl}) = \begin{cases} 1 & \text{if } x_{il}\ne x_{jl}\\ 0 & \text{otherwise}\end{cases}$$
hence $d(x_i, x_j)$ counts the number of differences between the compared objects.
The mode of the set $C$ is defined as an object $m$ (not necessarily from the set $X$) minimizing the rating
$$d(C, m) = \sum_{x\in C} d(x, m) \qquad (3.36)$$
Huang noted in [263] the following property, which allows one to identify the mode of the set $C$ quickly:
Lemma 3.1.1 Let $Dom(A_l)$ denote the set of values of the feature $A_l$ and let $c_{ls}$ denote the number of objects in the set $C$ in which the feature $l$ takes on the value $s$. The vector $m$ with components matching the condition
$$m_l = \arg\max_{s\in Dom(A_l)} c_{ls}, \quad l = 1,\ldots,n \qquad (3.37)$$
is the mode of the set $C$.
Let $m_j$ denote the mode of the $j$-th cluster. The k-modes problem consists in finding such modes $m_1, \ldots, m_k$ that the index
$$J_{mod}(m_1,\ldots,m_k) = \sum_{j=1}^{k} d(C_j, m_j) = \sum_{j=1}^{k}\sum_{x_i\in C_j} d(x_i, m_j) \qquad (3.38)$$
is minimised. A method of solving this problem is presented in the form of the pseudo-code 3.8.
The subsequent theorem points at the difference between the k-medoids and the k-modes algorithms.
Theorem 3.1.2 Let $p_1, \ldots, p_k$ be a set of medoids and let $m_1, \ldots, m_k$ be the set of modes. Let both be computed using the respective algorithms. Then
$$J_{med}(p_1,\ldots,p_k) \le 2\,J_{mod}(m_1,\ldots,m_k) \qquad (3.39)$$
Algorithm 3.8 k-modes algorithm, [263]
Input: Matrix $X$ representing the set of objects, number of clusters $k$.
Output: Partition $C = \{C_1, \ldots, C_k\}$.
1: Determine the dissimilarities $d_{ij}$ for each pair of objects.
2: Determine in any way $k$ modes.
3: while (not termination condition) do
4: Using the dissimilarity measure (3.35), assign objects to the closest clusters.
5: Update the mode of each cluster applying Lemma 3.1.1.
6: end while
7: Return a partition into $k$ clusters.
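A possible Python rendering of Algorithm 3.8 is sketched below; the function name is our own, and the features are assumed to be coded as non-negative integers, as described above, so that the mode of Lemma 3.1.1 is simply the most frequent code of every feature within a cluster.

```python
import numpy as np

def k_modes(X, k, max_iter=100, rng=None):
    """k-modes clustering (Algorithm 3.8) of nominal data coded as integers."""
    rng = np.random.default_rng(rng)
    m, n = X.shape
    modes = X[rng.choice(m, size=k, replace=False)].copy()  # step 2: initial modes
    labels = np.full(m, -1)
    for _ in range(max_iter):                                # steps 3-6
        # Eq. (3.35): dissimilarity = number of features on which object and mode differ
        dist = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
        new_labels = dist.argmin(axis=1)                     # step 4
        if np.array_equal(new_labels, labels):
            break                                            # no reassignment: stop
        labels = new_labels
        for j in range(k):                                   # step 5, Lemma 3.1.1
            members = X[labels == j]
            if len(members):
                modes[j] = [np.bincount(col).argmax() for col in members.T]
    return modes, labels
```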
Let us note, however, that the execution of the k-medoids algorithm requires only the dissimilarity matrix $D = [d_{ij}]$, while the k-modes algorithm insists on access to the matrix $X$, whose rows contain the characteristics of individual objects.
Huang considers in [263] the general case when the features describing the objects are measured both on the ratio scale and on the nominal scale. Let us impose such an order on the features that the first $n_1$ features are measured on the ratio scale while the remaining ones on the nominal scale. In such a case the cost of the partition represented by the matrix $U = [u_{ij}]$ can be determined as follows:
$$J_p(U, M) = \sum_{j=1}^{k}\left[\sum_{i=1}^{m} u_{ij}\sum_{l=1}^{n_1}(x_{il}-m_{jl})^2 + \gamma\sum_{i=1}^{m} u_{ij}\sum_{l=n_1+1}^{n}\delta(x_{il}, m_{jl})\right] \qquad (3.40)$$
where $\gamma>0$ is a coefficient balancing both types of dissimilarities.
By substituting
$$P_j^{r} = \sum_{i=1}^{m} u_{ij}\sum_{l=1}^{n_1}(x_{il}-m_{jl})^2, \qquad P_j^{n} = \gamma\sum_{i=1}^{m} u_{ij}\sum_{l=n_1+1}^{n}\delta(x_{il}, m_{jl})$$
we rewrite Eq. (3.40) in the form
$$J_p(U, M) = \sum_{j=1}^{k}\bigl(P_j^{r} + P_j^{n}\bigr) \qquad (3.41)$$
Optimisation of this index is performed iteratively: starting with the initial centroid matrix $M$, objects are assigned to the clusters with the least differing centroids. The dissimilarity of the object $x_i$ with respect to the centroid $m_j$ equals
$$d(x_i, m_j) = \sum_{l=1}^{n_1}(x_{il}-m_{jl})^2 + \gamma\sum_{l=n_1+1}^{n}\delta(x_{il}, m_{jl}) \qquad (3.42)$$
Subsequently, new cluster centres are determined. Their real-valued components are computed as in the classic k-means algorithm, see Eq. (3.3), while for the nominal-valued components Lemma 3.1.1 is applied.
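The mixed dissimilarity (3.42) and the corresponding centre update can be sketched as follows. This is an illustration only (the function names are ours), assuming the first n1 columns of X hold the ratio-scale features and the remaining columns hold integer-coded nominal features.

```python
import numpy as np

def kproto_dissimilarity(X, centres, n1, gamma):
    """Eq. (3.42): squared Euclidean distance on the first n1 (numeric) features
    plus gamma times the number of mismatches on the remaining (nominal) features."""
    num = ((X[:, None, :n1] - centres[None, :, :n1]) ** 2).sum(axis=2)
    nom = (X[:, None, n1:] != centres[None, :, n1:]).sum(axis=2)
    return num + gamma * nom

def kproto_update(X, labels, k, n1):
    """Centre update: means for the numeric part (as in classical k-means, Eq. (3.3)),
    per-feature most frequent value for the nominal part (Lemma 3.1.1)."""
    centres = np.empty((k, X.shape[1]))
    for j in range(k):
        members = X[labels == j]                 # assumes no cluster is empty
        centres[j, :n1] = members[:, :n1].mean(axis=0)
        centres[j, n1:] = [np.bincount(col.astype(int)).argmax()
                           for col in members[:, n1:].T]
    return centres
```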
According to Huang [263], there are three principal differences between this algorithm (called by the author k-prototypes) and the algorithm CLARA, mentioned in the previous section:
• The k-prototypes algorithm processes the entire data set, while CLARA exploits sampling when applied to large data sets.
• The k-prototypes algorithm optimises the cost function on the entire data set, so that at least a local optimum for the entire data collection is reached. CLARA performs optimization on samples only and hence runs the risk of missing an optimum, especially when for some reason the samples are biased.
• Sample sizes required by CLARA grow with the size of the entire data set and the complexity of the interrelationships within it. At the same time, the efficiency of CLARA decreases with the sample size: it cannot handle more than tens of thousands of objects in a sample. By contrast, the k-prototypes algorithm has no such limitation.
It is worth mentioning that the dissimilarity measure used here fulfils the triangle inequality, which implies that the accelerations mentioned in Sect. 3.1.4 are applicable here.