3.6 Clustering in Subspaces via k-means

It happens frequently that the dataset contains attributes that are not relevant for the problem under investigation. In such a case the clustering may be distorted in a number of ways, as the non-relevant attributes act like noise.

To mitigate such effects, a number of approaches have been tried out, including the so-called Tandem Clustering (described and discussed e.g. by [232]), which consists of: (1) Principal Component Analysis to filter out appropriate subspaces in which the actual clustering can be performed, (2) applying conventional k-means clustering to the first few eigenvectors, corresponding to the highest absolute eigenvalues. Chang [232] criticised this approach, demonstrating that sometimes eigenvectors related to low eigenvalues may contain the vital information for clustering.

De Soete and Carroll [141] proposed an interesting extension of the k-means algorithm that can handle this issue in a more appropriate way, a method called reduced k-means (or RKM for short). Its stability properties are discussed by Terada [457].

The algorithm seeks to minimise the objective function

$$J(U,M,A)=\sum_{i=1}^{m}\sum_{j=1}^{k} u_{ij}\,\lVert \mathbf{x}_i - A\boldsymbol{\mu}_j\rVert^2 = \lVert X - U M A^T\rVert_F^2 \qquad (3.135)$$

where A is a column-wise orthonormal matrix of dimensions n × q (with q < min(n, k − 1)), responsible for the search for clusters in a lower-dimensional space, and



M is a matrix containing, as rows, the k cluster means μ_j ∈ R^q in the lower, q-dimensional space. ‖·‖_F denotes the Frobenius norm.

The algorithm consists of the following steps:

1. Initialise A, M, U.

2. Compute the singular value decomposition QΣP^T of the matrix (U M)^T X, where Q is a q × q orthonormal matrix, Σ is a q × q diagonal matrix, and P is an n × q column-wise orthonormal matrix.

3. Compute the next approximation of A as A := P Q^T.

4. Compute a new U matrix by assigning each x_i to the cluster j with the nearest transformed centre A μ_j.

5. Compute a new M matrix as M = (U^T U)^{-1} U^T X A.

6. Compute the new value of J(U, M, A) according to Eq. (3.135). If this value has decreased, replace the old U, M, A with the new ones and go to step 2. Otherwise stop.

As the matrix U is constrained to be binary, the algorithm may get stuck in local minima. Hence multiple starts of the algorithm are recommended.

Note that this algorithm not only finds a partition U around appropriate cluster centres M, but also identifies, via A, the subspace in which the clusters are best separated.
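The alternation of these steps can be sketched compactly in code. The following is a minimal NumPy sketch of RKM, not the authors' implementation; the initialisation strategy (a random orthonormal A and a random assignment U) and the function name reduced_kmeans are illustrative assumptions.

```python
import numpy as np

def reduced_kmeans(X, k, q, max_iter=100, seed=0):
    """Minimal sketch of reduced k-means (RKM), following steps 1-6 above."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # step 1: initialise A (column-orthonormal, n x q), U (random assignment), M (k x q)
    A = np.linalg.qr(rng.standard_normal((n, q)))[0]
    U = np.eye(k)[rng.integers(0, k, size=m)]
    M = np.linalg.pinv(U.T @ U) @ U.T @ X @ A
    J_old = np.inf
    for _ in range(max_iter):
        # steps 2-3: SVD of (U M)^T X gives the Procrustes update A := P Q^T
        Q, _, Pt = np.linalg.svd((U @ M).T @ X, full_matrices=False)  # Q: q x q, Pt: q x n
        A = Pt.T @ Q.T
        # step 4: assign each x_i to the cluster with the nearest transformed centre A mu_j
        D = ((X[:, None, :] - (M @ A.T)[None, :, :]) ** 2).sum(axis=2)
        U = np.eye(k)[D.argmin(axis=1)]
        # step 5: recompute the centres in the q-dimensional subspace
        M = np.linalg.pinv(U.T @ U) @ U.T @ X @ A
        # step 6: stop when the objective (3.135) no longer decreases
        J = np.linalg.norm(X - U @ M @ A.T, 'fro') ** 2
        if J >= J_old:
            break
        J_old = J
    return U, M, A, J_old
```

As recommended above, in practice one would run this sketch from several random starts and keep the solution with the lowest objective value.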

A drawback is that both k and q must be provided by the user. However, [457] proposes a criterion for the best choice of q.46

Coates and Ng [124] exploit the above idea in an interesting way in order to learn an extremely simplified representation of data points, with application to image processing. One may say that they look for a clustering of the data points such that each data point belongs, to some degree, to only one cluster. Their optimisation objective is to minimise:

$$J(M,S)=\sum_{i=1}^{m}\lVert \mathbf{x}_i - M^T \mathbf{s}_i\rVert^2 \qquad (3.136)$$

where S = (s_1, s_2, …, s_m)^T is a matrix of m so-called code vectors s_i, each with k entries but only one non-zero entry, each encoding the corresponding data vector x_i, and M, called the "dictionary", is a matrix containing the k cluster centres μ_j ∈ R^n as rows. As the cluster centres are normalised to unit length (they lie on the unit sphere), we can speak of spherical k-means here.

The computational algorithm, after initialisation, consists of alternating the computation of M and S until some stop condition is met.

46 Still another approach, called subspace clustering, was proposed by Timmerman et al. [460]. In this approach, the coordinates of the cluster centres are no longer required to lie in a single q-flat, but rather have two components, one "outer" (in a q_b-flat) and one "inner" (in a q_w-flat), that are orthogonal. So not only are subspaces sought in which the clustering can be performed better, but a different subspace may be detected for each cluster. Regrettably, this method is equivalent to (q_w + q_b)-flat based reduced k-means clustering in terms of the clusters found, while additional ambiguities are introduced when identifying the two flats.

Given M, we compute for each i the vector s_i as follows. Compute the vector s = M x_i. Let l be the index of the component of s with the maximal absolute value, l = arg max_{l'} |s_{l'}|. Let s′ be the vector such that s′_l = s_l and s′_{l'} = 0 for l' ≠ l. Then s_i = s′.

Given S, we compute M as follows: set M := S^T X (equivalently, M^T = X^T S), and then normalise M row-wise.
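The two alternating updates just described can be sketched as follows. This is a minimal NumPy illustration; the initialisation (random unit-length dictionary rows) and the function name spherical_kmeans are assumptions rather than part of the original description.

```python
import numpy as np

def spherical_kmeans(X, k, max_iter=50, seed=0):
    """Sketch of the dictionary-learning variant of spherical k-means:
    alternate the S-update (one non-zero code entry per data point) and the
    M-update (unnormalised centres followed by row-wise normalisation)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # initialise the dictionary M (k x n) with random unit-length rows
    M = rng.standard_normal((k, n))
    M /= np.linalg.norm(M, axis=1, keepdims=True)
    S = np.zeros((m, k))
    for _ in range(max_iter):
        # S-update: s = M x_i, keep only the entry with the largest absolute value
        P = X @ M.T                            # m x k, row i holds M x_i
        idx = np.abs(P).argmax(axis=1)         # index l per data point
        S = np.zeros((m, k))
        S[np.arange(m), idx] = P[np.arange(m), idx]
        # M-update: unnormalised centres S^T X, then normalise each row to unit length
        M_new = S.T @ X                        # k x n
        norms = np.linalg.norm(M_new, axis=1, keepdims=True)
        nonzero = norms.ravel() > 0
        M[nonzero] = M_new[nonzero] / norms[nonzero]
    return M, S
```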

In this case, instead of representing the cluster centres in a low-dimensional subspace, the data points are represented in a one-dimensional subspace, separate for each cluster.

Witten et al. [504] approached this problem in yet another way. They propose to assign weights to features in their sparse clustering k-means algorithm. The weights are subject to a squared-sum constraint, and the method alternates the application of traditional k-means with weight optimisation for a fixed cluster assignment.

They maximise the between-cluster sum of squares:

$$J(U,M,\mathbf{w})=\sum_{i=1}^{m}\sum_{l=1}^{n} w_l (x_{il}-\mu_l)^2 \;-\; \sum_{i=1}^{m}\sum_{j=1}^{k}\sum_{l=1}^{n} u_{ij}\, w_l (x_{il}-\mu_{jl})^2 \qquad (3.137)$$

where μ_l is the overall mean for feature l, and w is a vector of non-negative weights of the individual features of the objects x_i, subject to Σ_{l=1}^{n} w_l^2 ≤ 1 and, additionally, Σ_{l=1}^{n} w_l ≤ s, where s is a user-defined parameter. Both constraints prevent the elimination of too many features. Initially, the weights w_l are set to 1/√n. Then, in a loop, the k-means algorithm is first performed while keeping w constant. Upon reaching an optimum, the cluster assignment is fixed and w is optimised via a quadratic optimisation procedure under the mentioned constraints (for a fixed cluster assignment this is a convex optimisation problem).

As a result we obtain both a partition of the data set and a selection (weighting) of individual features.
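For concreteness, the w-step for a fixed cluster assignment can be sketched as below. The soft-thresholding solution with a binary search on the threshold follows Witten and Tibshirani's treatment of the constrained problem; the function name and the numerical details are illustrative assumptions, not the book's notation.

```python
import numpy as np

def update_weights(X, labels, s, tol=1e-6):
    """Sketch of the w-step of sparse k-means for a fixed cluster assignment:
    maximise sum_l w_l * a_l  s.t.  ||w||_2 <= 1, ||w||_1 <= s, w >= 0,
    where a_l is the between-cluster sum of squares of feature l."""
    mu = X.mean(axis=0)
    a = ((X - mu) ** 2).sum(axis=0)                  # total sum of squares per feature
    for j in np.unique(labels):                      # subtract the within-cluster part
        Xj = X[labels == j]
        a -= ((Xj - Xj.mean(axis=0)) ** 2).sum(axis=0)
    a = np.maximum(a, 0)

    def w_of(delta):
        v = np.maximum(a - delta, 0)                 # soft-thresholding of the BCSS vector
        nrm = np.linalg.norm(v)
        return v / nrm if nrm > 0 else v

    w = w_of(0.0)
    if w.sum() <= s:                                 # L1 constraint already satisfied
        return w
    lo, hi = 0.0, a.max()                            # binary search for the threshold
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if w_of(mid).sum() > s:
            lo = mid
        else:
            hi = mid
    return w_of(hi)
```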

Hastie et al. [238, Sect. 8.5.4] point out that Witten's approach performs the optimisation over a space that is bi-convex (in the cluster centres and weights) but not jointly convex, and hence guarantees of obtaining a global solution are hard to achieve.

Hence they propose a modification based on prototypes. It may be viewed as a kind of k-means in which each element initially forms its own cluster, and one attempts to collapse the cluster centres onto one another (so that the number of clusters drops below m). They minimise

$$J(M)=0.5\sum_{i=1}^{m}\lVert \mathbf{x}_i-\boldsymbol{\mu}_i\rVert^2+\lambda\sum_{i<j} w_{ij}\,\lVert \boldsymbol{\mu}_i-\boldsymbol{\mu}_j\rVert_q \qquad (3.138)$$

where q = 1 or 2, and λ > 0 is a user-defined parameter. The weights w_ij may all be equal to 1 or may be fixed as a function of the distance between the observations x_i and x_j, for example w_ij = e^{−‖x_i−x_j‖^2}. See the book by Hastie et al. [238], Sects. 8.4 and 8.5, for details.
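A naive sketch of the prototype objective (3.138) is given below. It evaluates the objective with Gaussian weights and, for small data sets, minimises it with a generic solver; the function name, the weight choice and the use of scipy.optimize.minimize are assumptions made for illustration, as dedicated (e.g. ADMM-type) solvers are used in practice for this non-smooth problem.

```python
import numpy as np
from scipy.optimize import minimize

def convex_clustering(X, lam, q=2):
    """Sketch of the prototype objective (3.138):
    J(M) = 0.5 * sum_i ||x_i - mu_i||^2 + lam * sum_{i<j} w_ij ||mu_i - mu_j||_q."""
    m, n = X.shape
    iu, ju = np.triu_indices(m, k=1)                  # all pairs i < j
    w = np.exp(-((X[iu] - X[ju]) ** 2).sum(axis=1))   # fixed Gaussian pairwise weights

    def J(flat_M):
        M = flat_M.reshape(m, n)                      # one prototype mu_i per observation
        fit = 0.5 * ((X - M) ** 2).sum()
        fuse = (w * np.linalg.norm(M[iu] - M[ju], ord=q, axis=1)).sum()
        return fit + lam * fuse

    res = minimize(J, X.ravel().copy(), method="L-BFGS-B")  # start from M = X
    # prototypes that (nearly) coincide after the fusion penalty form the clusters
    return res.x.reshape(m, n)
```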


Regrettably, the sparse clustering k-means algorithm is quite sensitive to outliers.47 Therefore Kondo et al. [299] propose a modification, called the robust sparse k-means algorithm. The modification consists in computing the cluster centres twice: first in the traditional manner, and then after rejecting the elements most distant from their cluster centres, either in the original space or in the space with weighted coordinates.
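One possible reading of this trimming scheme is sketched below, under the assumption of a fixed trimming fraction alpha and distances measured in the weighted-coordinate space; the function name and these details are hypothetical illustrations, not the exact procedure of [299].

```python
import numpy as np

def trimmed_centres(X, labels, w, alpha=0.1):
    """Sketch of the robust centre update: compute centres, reject the fraction
    alpha of points farthest from their centre (here in the weighted-coordinate
    space), then recompute the centres from the remaining points.
    Assumes every cluster in `labels` is non-empty."""
    k = labels.max() + 1
    centres = np.vstack([X[labels == j].mean(axis=0) for j in range(k)])
    # squared distances in the space with weighted coordinates
    d = (w * (X - centres[labels]) ** 2).sum(axis=1)
    keep = d <= np.quantile(d, 1 - alpha)             # reject the most distant points
    new = centres.copy()
    for j in range(k):
        mask = keep & (labels == j)
        if mask.any():                                # keep the old centre if nothing remains
            new[j] = X[mask].mean(axis=0)
    return new
```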
