
3.15 Semisupervised Clustering

While much research effort goes into the development of both clustering and classification algorithms (unsupervised and supervised learning, respectively), there is a large area of investigation in between, where part of the elements of the sample are pre-labelled and the rest are not. Depending on the perspective, one speaks of partially supervised classification or of semi-supervised clustering: classification algorithms may benefit from unlabelled elements through a better exploration of the geometry of the sample space (via exploitation of the similarity function), while clustering may benefit not only from the application of external quality criteria but also from a better understanding of the rules underlying the similarity function. The labelling may be a direct assignment to some predefined classes, or it may take the form of hints such as: these two elements belong to the same cluster, those two cannot belong to the same cluster. Whatever labelling method is used, the information contained in the labels and in the similarity function should not contradict one another.

The basic idea behind the exploitation of both labelled and unlabelled examples may be described as follows. On the one hand, there exists an underlying probability distribution $P(y|x)$ of assigning a label $y \in Y$ to an exemplar $x \in X$. On the other hand, the data lie on a low-dimensional manifold in a high-dimensional space, so that it makes sense to consider distances different from the ones induced by the high-dimensional space. One then makes the following assumption: given a similarity function over $X$, the aforementioned conditional probability is hoped to change continuously with the similarity.

Under these circumstances a labelling function is sought that is penalised both for deviating from the given labels on the labelled part of the examples and for violating this continuity over the whole set of examples.

One may distinguish several fundamental approaches to semisupervised clustering:

• similarity-adapting methods, where the similarity function is adapted to fit the label information (Sect. 3.15.1)

• search-adapting methods, where the clustering algorithm itself is modified to respect the label information (Sect. 3.15.2)

• target variable controlled methods (Sect. 3.15.3)

• weakened classification methods, where the unlabelled data is used to modify class boundaries (Sect. 3.15.4)

• information spreading algorithms, where the labels are spread with similarity-based “speed” over the set of objects (Sect. 3.15.5).

3.15.1 Similarity-Adapting Methods

As most clustering algorithms rely on some measure of similarity, dissimilarity or distance, the simplest way to take into account the pre-classified elements is to modify the similarity matrix (or more generally the similarity function).

Klein et al. [292] developed an algorithm called Constrained Complete-Link (CCL), which takes into account various kinds of information on common cluster membership via a specific modification of the similarity matrix; the modified matrix can later be used by a clustering algorithm such as k-means, which in this way takes advantage of user knowledge about relations between the clustered objects. Klein's method consists in modifying the similarity matrix to reflect the imposed constraints (e.g. increasing the dissimilarity between elements that are known to lie in distinct clusters) and in propagating these constraints, wherever applicable, to other entries of the similarity matrix (if, e.g., points $x_i$ and $x_j$ are close to one another, then a point close to $x_i$ must also be close to $x_j$; on the other hand, if points $x_i$ and $x_j$ are far apart, then a point close to $x_i$ must also be far away from $x_j$). Klein's algorithm takes into account two types of labelling information:

• must-link (two data points must belong to the same cluster)

• cannot-link (two data items must not belong to the same cluster).

The must-link constraint is imposed via setting the dissimilarity between the elements to zero. If the dissimilarity has to be a (metric) distance, corrective actions are applied.

From the original distance matrix with the modified entries, a shortest-path matrix is constructed; it is metric and relatively close to the original distance matrix, so it can replace it. The cannot-link constraint is not so easy to impose, because even deciding whether all such constraints can be respected may be an NP-complete task. Therefore Klein et al. propose a heuristic approach. After imposing the must-link constraints and replacing the distances with shortest-path distances, they model each cannot-link constraint simply by replacing the distance between the two data points with the maximum distance plus one. Subsequently they suggest using an algorithm like complete link for clustering, so that metricity is partially restored.
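As an illustration, the following is a minimal sketch of this constraint-handling step, assuming a symmetric distance matrix `D` and lists of index pairs `must_link` / `cannot_link`; the function name and the tiny positive value standing in for the "zero" distance are illustrative choices, not taken from [292].

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def constrained_complete_link(D, must_link, cannot_link, k):
    """Sketch: fold must-/cannot-link constraints into the distances, then cluster."""
    D = D.astype(float).copy()
    # Must-link: set the dissimilarity of the constrained pairs (almost) to zero.
    # A tiny positive value is used so that scipy keeps the entry as a graph edge.
    eps = 1e-12
    for i, j in must_link:
        D[i, j] = D[j, i] = eps
    # Replace D by all-pairs shortest paths: this (approximately) restores metricity
    # and propagates the must-link constraints to nearby points.
    D = shortest_path(D, method="FW", directed=False)
    # Cannot-link: push the constrained pair further apart than any other pair.
    far = D.max() + 1.0
    for i, j in cannot_link:
        D[i, j] = D[j, i] = far
    # Complete-link agglomeration, as suggested above, partially restores consistency.
    Z = linkage(squareform(D, checks=False), method="complete")
    return fcluster(Z, t=k, criterion="maxclust")
```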

Xing et al. [513] consider constraints of the type “data point $x_i$ must be close to data point $x_j$”. They suggest looking for a (positive semi-definite) matrix $A$ such that the Mahalanobis distance

$$d_A(x_i,x_j)=\|x_i-x_j\|_A=\sqrt{(x_i-x_j)^T A\,(x_i-x_j)}$$

reflects this requirement. In particular, if $S$ is the set of (all) pairs of points which should be “close” to one another, then one needs to minimise

$$\sum_{(x_i,x_j)\in S} d_A(x_i,x_j)^2$$

subject to the constraint

$$\sum_{(x_i,x_j)\in D} d_A(x_i,x_j)^2 \ge 1,$$

where $D$ is a set of pairs of objects known to be dissimilar. This constraint is necessary to avoid the trivial solution $A=0$. The problem is solved by a gradient descent method.
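A rough sketch of this optimisation, restricted for simplicity to a diagonal matrix $A=\mathrm{diag}(a)$, is given below. It is a crude projected-gradient heuristic, not the full procedure of [513]; all names and the learning-rate value are illustrative.

```python
import numpy as np

def learn_diagonal_metric(X, similar, dissimilar, lr=0.01, n_steps=500):
    """Sketch: minimise the sum of squared d_A over similar pairs subject to the
    sum over dissimilar pairs being at least 1, for A = diag(a) with a >= 0."""
    # Per-dimension squared differences accumulated over the two pair sets;
    # both the objective and the constraint are linear in a, so these are their gradients.
    s = sum((X[i] - X[j]) ** 2 for i, j in similar)
    d = sum((X[i] - X[j]) ** 2 for i, j in dissimilar)
    a = np.ones(X.shape[1])
    for _ in range(n_steps):
        a -= lr * s                   # descend on the objective
        a = np.maximum(a, 0.0)        # keep A positive semi-definite
        scale = float(d @ a)          # current value of the dissimilar-pair sum
        if scale > 0:
            a /= scale                # rescale so that the constraint sum equals 1
    return np.diag(a)

def mahalanobis(A, x, y):
    diff = x - y
    return float(np.sqrt(diff @ A @ diff))
```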

3.15.2 Search-Adapting Methods

The search-adapting methods modify the clustering algorithm itself. Examples of such approaches are the constrained k-means algorithm, the seeded k-means algorithm, k-means with preferential data selection, k-means with a penalty based on the Gini index, etc. Though many authors illustrate these approaches with k-means modifications, the underlying principles apply to other algorithms as well.

The seeded k-means algorithm, proposed by Basu et al. [55], modifies the basic k-means algorithm in the initialisation step. Instead of some kind of randomised initialisation, initial clusters are formed from the labelled data only, and these initial clusters form the basis for the initial cluster centroid determination. Further steps of the algorithm are the same as in k-means.
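A minimal sketch of this scheme is shown below, assuming labels in $0,\dots,k-1$ for seed points and $-1$ for unlabelled points; the code is illustrative and assumes every cluster receives at least one seed.

```python
import numpy as np

def seeded_kmeans(X, labels, k, n_iter=100):
    """Seeded k-means sketch: only the initialisation differs from plain k-means.
    labels[i] is a cluster index in 0..k-1 for labelled points and -1 otherwise."""
    # Initial centroids from the labelled points only.
    centroids = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
    for _ in range(n_iter):
        # Standard k-means iterations: assign every point to the closest centroid.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        centroids = np.stack([X[assign == j].mean(axis=0) for j in range(k)])
    return assign, centroids
```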

The constrained k-means algorithm [55] takes the labels of the labelled examples into account also during the iterations, that is, the labelled examples constitute a fixed part of their clusters even if some of them become closer to the centre of some other cluster.

Demiriz et al. [143] propose to amend the k-means target function with a penalty factor based on the Gini index, representing a mismatch of class membership of the labelled examples.

$$J(U,M,P)=\alpha\left(\sum_{i=1}^{m}\sum_{j=1}^{k} u_{ij}\,\|x_i-\mu_j\|^{2}\right)+\beta\sum_{j=1}^{k}\left(\sum_{i=1}^{m}u_{ij}-\frac{\sum_{c=1}^{r_c}\bigl(\sum_{i=1}^{m}u_{ij}\,p_{ic}\bigr)^{2}}{\sum_{i=1}^{m}u_{ij}}\right)\qquad(3.148)$$

where $P$ is a matrix with entries $p_{ic}=1$ if element $i$ belongs to the intrinsic class $c$ and $p_{ic}=0$ otherwise, with $\sum_{c=1}^{r_c} p_{ic}\le 1$. If $\sum_{c=1}^{r_c} p_{ic}=1$ then the element is labelled, otherwise it is unlabelled. $\alpha$ and $\beta$ are non-negative constants, specific to a given problem, which determine which part of the data (labelled or unlabelled) shall play a stronger role.

Instead of the classical k-means algorithm, a genetic algorithm is applied, with the target function equal to J(U,M,P). For each chromosome, one step of the classical k-means algorithm is performed, with the exception of the cluster assignment step, where the genetic operations of mutation and crossover replace the classical assignment of non-labelled elements to the closest cluster centre. Also, clusters with too few labelled elements are discarded and new random clusters are formed instead.
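For illustration, one natural reading of the Gini-based penalty is the Gini impurity of the labelled members of each cluster, weighted by their number; the sketch below computes such a term (the exact weighting used in [143] may differ from this illustration).

```python
import numpy as np

def gini_penalty(U, P):
    """Sketch of a Gini-type mismatch penalty for labelled examples.
    U: (m, k) 0/1 cluster assignments; P: (m, r_c) 0/1 class indicators
    (all-zero rows for unlabelled elements)."""
    labelled_mass = P.sum(axis=1)                         # 1 for labelled elements, 0 otherwise
    penalty = 0.0
    for j in range(U.shape[1]):
        n_lab = float((U[:, j] * labelled_mass).sum())    # labelled elements in cluster j
        if n_lab == 0:
            continue
        n_jc = U[:, j] @ P                                # per-class counts inside cluster j
        penalty += n_lab * (1.0 - ((n_jc / n_lab) ** 2).sum())  # size-weighted Gini impurity
    return penalty
```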

Wagstaff and Cardie [487] take into account slightly weaker information about cluster membership: must-link constraints, which specify that two instances have to be in the same cluster, and cannot-link constraints, which specify that two instances cannot be in the same cluster. They illustrate their proposal with the COBWEB algorithm, which is incremental in nature. They also propose a modification of the k-means algorithm in this spirit (called COP-KMEANS). A respective extension of k-means is quite straightforward. Prior to running a weighted version of k-means, all elements bound by “must-link” constraints are pulled together and are represented for the rest of the process by the respective mean vector, weighted by the number of elements pulled together. Then, during the iterations, the cluster assignment is performed as follows: among the unassigned elements we seek the one that is closest to any cluster centre and assign it to that closest cluster. Then we consider the remaining elements iteratively, each time taking the one that is closest to some cluster centre whose cluster does not contain an element with which it is in a “cannot-link” relation. This algorithm will work if no element is engaged in more than $k-1$ “cannot-link” relations. If the constraints cannot be satisfied, the algorithm fails.
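A sketch of the greedy constrained assignment step described above is given below; must-link groups are assumed to have been merged beforehand, and the code is an illustration rather than the original COP-KMEANS pseudocode.

```python
import numpy as np

def assign_with_cannot_link(X, centroids, cannot_link):
    """Greedy assignment respecting cannot-link constraints: repeatedly pick the
    still-unassigned point that is closest to some cluster centre it is allowed
    to join. Returns -1 for unassigned points if the constraints cannot be satisfied."""
    m, k = X.shape[0], centroids.shape[0]
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (m, k)
    forbidden = [set() for _ in range(m)]
    for a, b in cannot_link:
        forbidden[a].add(b)
        forbidden[b].add(a)
    assign = np.full(m, -1)
    unassigned = set(range(m))
    while unassigned:
        best = None  # (distance, point, cluster)
        for i in unassigned:
            # Clusters already containing a point that i cannot link with are blocked.
            blocked = {assign[j] for j in forbidden[i] if assign[j] >= 0}
            for j in range(k):
                if j not in blocked and (best is None or dist[i, j] < best[0]):
                    best = (dist[i, j], i, j)
        if best is None:               # no feasible cluster left: the algorithm fails
            return assign
        _, i, j = best
        assign[i] = j
        unassigned.remove(i)
    return assign
```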

Basu et al. [56] proposed another algorithm, PCKMeans, taking into account these weaker types of constraints, that is, pairwise “must-link” and “cannot-link” relations. Their algorithm exhibits behaviour similar to either seeded or constrained k-means, depending on the parameter setting. The algorithm starts with creating a consistent set of “must-link” relations, that is, adding the appropriate transitive closures. Then the “cannot-link” relations are extended appropriately (if $x$ cannot link $y$ and $y$ must link $z$, then $x$ cannot link $z$). Groups of elements connected by the “must-link” relation are formed. If their number equals $k$, the centres of these groups become the initial cluster centres. If there are more of them, only the $k$ groups of highest cardinality provide cluster centres. If there are fewer than $k$ such groups, further cluster centres are formed from elements having a “cannot-link” relation to these groups, and any remaining clusters are initialised at random. Then, iteratively, cluster assignment and computation of cluster centres are performed. Cluster centres are computed as in traditional k-means. Cluster assignment takes into account the squared distance to the cluster centre and the number of violations of the “must-link” and “cannot-link” constraints with respect to elements already assigned to clusters, with user-defined weights. As this step depends on the order in which elements are considered, the authors use a random sequence of elements at each step.
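The following sketch illustrates the initialisation part only, i.e. forming the transitive closure of the “must-link” relations and seeding the centres with the largest neighbourhoods; cannot-link handling and the random completion of missing centres are omitted, and the names are illustrative.

```python
import numpy as np

def pckmeans_init(X, must_link, k):
    """Sketch of a PCKMeans-style initialisation from must-link neighbourhoods."""
    parent = list(range(X.shape[0]))
    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for a, b in must_link:            # transitive closure of the must-link relation
        parent[find(a)] = find(b)
    groups = {}
    for i in range(X.shape[0]):
        groups.setdefault(find(i), []).append(i)
    # Largest must-link neighbourhoods first; their means become the initial centres.
    biggest = sorted(groups.values(), key=len, reverse=True)[:k]
    return np.stack([X[idx].mean(axis=0) for idx in biggest])
```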

3.15.3 Target Variable Driven Methods

The methods of this type may be considered as a variation on both previously mentioned approaches.

In methods of this type one can take into account not only the labelling of some elements of the data set, but, more generally, data on any variable considered to be an “outcome” variable (or rather a noisy version of the true outcome of interest). The “outcome” variable is the one that the clustering should help to predict. It is not taken into account when clustering, but preferably the clusters should say something about the value of this variable for the objects classified into the individual clusters.

Bair and Tibshirani [42] proposed subspacing methods, where features are selected to reflect the intended clustering prior to the application of the proper algorithm. They suggest the following targeted feature selection procedure: for each feature compute its correlation with the outcome variable and use for clustering only the features whose correlation is significant and above some threshold.

Thereafter, either unsupervised clustering or one of the semi-supervised clustering methods can be applied.
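A minimal sketch of this feature screening step is given below; the significance test is replaced here by a simple correlation threshold, and the threshold value is illustrative.

```python
import numpy as np

def targeted_feature_selection(X, y, threshold=0.3):
    """Sketch of targeted feature selection: keep only the features whose absolute
    correlation with the outcome variable y exceeds a threshold."""
    corr = np.array([np.corrcoef(X[:, d], y)[0, 1] for d in range(X.shape[1])])
    keep = np.abs(corr) >= threshold
    return X[:, keep], keep
```

The reduced data matrix can then be handed to plain k-means or to one of the semi-supervised procedures discussed above.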

Still another approach relies on the so-called information bottleneck approach [461]. The general idea is that our data shall predict some variable of interest, say $Y$. So we can compute the mutual information between the data $X$ and the variable of interest,

$$I(X;Y)=\sum_{x}\sum_{y} p(x,y)\log_2\frac{p(x,y)}{p(x)\,p(y)}.$$

Clustering is now understood as a data compression technique performed in such a way that the cluster representations $\hat X$ can be used instead of the original data. The mutual information $I(\hat X;Y)\le I(X;Y)$ between the target variable and the representations will necessarily be decreased, but we are interested in keeping it close (up to a user-defined parameter) to the mutual information between the target variable and the original data. That is, one seeks to minimise the function

$$L[p(\hat x|x)]=I(\hat X;X)-\beta\, I(\hat X;Y)\qquad(3.149)$$

where $\beta$ is a parameter weighing the extent to which information on $Y$ shall not be lost while performing the compression. Such an understanding can drive both the number of clusters and other aspects of the cluster representation. An algorithm for the minimisation is proposed by Tishby et al. in [461]. Its relation to k-means is explored in [445].
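For concreteness, the small sketch below evaluates the quantities involved for discrete $X$, $Y$ and a given soft assignment $p(\hat x|x)$; the helper names are illustrative and the code only evaluates the functional, it does not minimise it.

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in bits computed from a joint distribution given as a matrix p_xy."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])).sum())

def ib_lagrangian(p_xhat_given_x, p_xy, beta):
    """Value of L[p(xhat|x)] = I(Xhat;X) - beta * I(Xhat;Y), Eq. (3.149),
    for a given soft assignment of data items x to representatives xhat."""
    p_x = p_xy.sum(axis=1)                      # marginal of X
    p_x_xhat = p_xhat_given_x * p_x[:, None]    # joint p(x, xhat)
    p_xhat_y = p_xhat_given_x.T @ p_xy          # joint p(xhat, y)
    return mutual_information(p_x_xhat) - beta * mutual_information(p_xhat_y)
```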

3.15.4 Weakened Classification Methods

A number of classification algorithms, like SVM, have been adapted to work under incomplete data labelling (semi-supervised classification) so that they can serve also as semi-supervised clustering algorithms.

A classifier $h: X\to L$ assigns to each element of a domain $X$ a label from a set $L$. One defines a loss function $\ell: L\times L\to\mathbb{R}$ specifying the punishment for a mismatch between the real label of an instance and the one assigned by the classifier. A classification algorithm seeks to minimise the value of $\sum_{(x_i,l_i)\in X_t}\ell(l_i,h(x_i))$, where $X_t$ is the labelled training set containing the examples $x_i$ together with their intrinsic labels $l_i$. Under semi-supervised settings, an additional loss function $\ell_u: L\to\mathbb{R}$ is introduced, punishing the classifier for the label it assigns to an unlabelled instance. So one now seeks to minimise

$$\sum_{(x_i,l_i)\in X_t}\ell(l_i,h(x_i))+\sum_{x_j\in X_u}\ell_u(h(x_j)),$$

where $X_u$ is the unlabelled training set containing the examples $x_j$ without labels.

In algorithms based on the manifold assumption, like label propagation or Laplacian SVMs, the unlabelled loss function can take the form

$$\ell_u(h(x_j))=\sum_{x_i\in X_t,\; x_i\ne x_j} s(x_i,x_j)\,\bigl(h(x_i)-h(x_j)\bigr)^{2}.$$
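The sketch below evaluates such a combined objective, assuming a 0/1 loss on the labelled part and the smoothness penalty above on the unlabelled part; `h` may hold discrete labels or real-valued scores, and all names are illustrative.

```python
import numpy as np

def semisupervised_loss(h, labels, S, labelled):
    """Sketch of the combined objective: labelled loss plus manifold smoothness penalty.
    h: predictions for all points; labels: true labels (used only where labelled);
    S: similarity matrix; labelled: boolean mask of labelled points."""
    supervised = np.sum(h[labelled] != labels[labelled])       # sum of ell(l_i, h(x_i))
    smoothness = 0.0
    for j in np.where(~labelled)[0]:                           # unlabelled points x_j
        diff = (h[labelled] - h[j]) ** 2
        smoothness += float(S[j, labelled] @ diff)             # ell_u(h(x_j)), sum over X_t
    return supervised + smoothness
```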

Zhung [544] proposes to train a classifier using the labelled data, whereby this classifier returns some confidence level for each unseen example. Then cleverly chosen subsets of the originally unlabelled data, labelled by the classifier, are used to extend the training set of the classifier in order to improve its accuracy. The data selection process proceeds as follows: first, the originally unlabelled examples are clustered, e.g. using k-means. Within each cluster the examples are ordered by confidence and split, using the “quantile” method, into the same number $z$ of subsets, and the subsets are ranked from 1 to $z$. Then the subsets of the same rank $r$ from all the clusters are joined into the $r$-th bin, so that we get $z$ “bins”. Then, iteratively, one picks each bin and checks its suitability to be added to the labelled set for training a new classifier. The process is stopped if the classification accuracy cannot be improved (is worsened).

Tanha et al. [454] apply a strategy in the same spirit, but specifically to decision trees. They train a base classifier on the labelled data. Then they classify the unlabelled data with this classifier, pick those examples that were classified with the highest confidence, add them to the labelled set, and repeat the process until a stopping criterion is met. The major problem they need to cope with is how to estimate the confidence in a label assigned by the classifier. Decision trees have the characteristic that decisions are made at leaf nodes, where there are frequently quite few examples, so that an estimate of the confidence (the ratio of the count of the main class of a leaf to the total number of examples in the leaf) is unreliable. Various proposals for such estimates are made, like the Laplacian correction, a Naive Bayes classifier, etc.

Criminisi et al. [127, Chap. 7] deal with the extension of random forest algorithms to semi-supervised learning. They introduce the concept of a transductive tree, which generalises the concepts of the decision tree and the clustering tree. When growing a transductive tree, the mutual information

$$I_j=I_j^{U}+\alpha\, I_j^{S}$$

is optimised (maximised), where $I_j^{U}$ is the clustering mutual information as defined by Eq. (3.147), computed for all the data elements, and $I_j^{S}$ is the classical mutual information computed only for the labelled examples. $\alpha$ is a user-provided parameter.

Another extension of a classifier to semi-supervised learning is the so-called Transductive Support Vector Machine (TSVM) method proposed by Vapnik [478]. The Support Vector Machine (SVM) classifier seeks to separate two classes of objects in space by a hyperplane in such a way that there exists a large margin between them. So one seeks to minimise the inner product $w^T w$ of the weight vector subject to the constraints $(w^T x_i+b)\,l_i\ge 1$ for each data vector $x_i$ with label $l_i$, where the labels can be either $1$ or $-1$ (two-class classification problem). The transductive approach, with unlabelled data, consists in imposing the additional constraints $(w^T x_u+b)\,l_u\ge 1$, where the $x_u$ are the unlabelled examples and $l_u\in\{-1,1\}$ is an additional variable created for each unlabelled example. This approach will work only if the labelled examples are linearly separable. If they are not, so-called slack variables $\eta\ge 0$ are introduced and one minimises the sum

$$w^T w+c_L\sum_i \eta_i + c_U\sum_u \eta_u$$

subject to $(w^T x_i+b)\,l_i\ge 1-\eta_i$ for each labelled data vector $x_i$ with label $l_i$, and $(w^T x_u+b)\,l_u\ge 1-\eta_u$ with $l_u\in\{-1,1\}$ for each unlabelled example $x_u$; $c_L$, $c_U$ are user-defined constants. The unlabelled loss function is here of the form $\ell_u=\max(0,\,1-|h(x_u)|)$.
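The sketch below evaluates these losses and the resulting soft-margin objective; it only evaluates, the combinatorial optimisation over the unlabelled labels $l_u$ is not shown, and the function names are illustrative.

```python
import numpy as np

def labelled_hinge(w, b, X_l, y_l):
    """Standard hinge slack max(0, 1 - l_i (w^T x_i + b)) summed over labelled examples."""
    return np.maximum(0.0, 1.0 - y_l * (X_l @ w + b)).sum()

def unlabelled_hinge(w, b, X_u):
    """TSVM-style unlabelled loss max(0, 1 - |w^T x_u + b|): an unlabelled point is
    penalised only when it falls inside the margin, whichever side it lies on."""
    return np.maximum(0.0, 1.0 - np.abs(X_u @ w + b)).sum()

def tsvm_objective(w, b, X_l, y_l, X_u, c_L, c_U):
    """Soft-margin transductive objective w^T w + c_L * labelled slack + c_U * unlabelled slack."""
    return float(w @ w) + c_L * labelled_hinge(w, b, X_l, y_l) + c_U * unlabelled_hinge(w, b, X_u)
```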

3.15.5 Information Spreading Algorithms

In this category of algorithms the labels are spread, with a similarity-based “speed”, over the set of objects. As described in the previous chapter, there exist many ways to transform distances into similarity measures, and any of them can be applied.

One of the approaches, proposed by Kong and Ding [300], relies on the computation of the so-called Personalized PageRank for random walks on a neighbourhood graph.

As a first step we need to obtain measures of similarity between the objects to be clustered. One can build a graph with nodes being the clustered objects and edges connecting objects with similarity above some threshold, possibly with edge weights reflecting the degree of similarity.

In this graph we have $k+1$ types of nodes: the set of nodes can be split into disjoint parts $V=V_1\cup V_2\cup\ldots\cup V_k\cup V_u$, where $V_1,\ldots,V_k$ contain the objects labelled as $1,\ldots,k$ respectively, while $V_u$ contains all the unlabelled objects.

Now $k$ random walkers will move from node $v$ to node $u$, $u,v\in V$, with probability equal to the weight of the link $(v,u)$ divided by the sum of weights of all outgoing links of node $v$, times $1-b$. With probability $b$ the random walker $i$ jumps to any node in the set $V_i$. Under these circumstances the probability distribution of walker $i$ being at any node reaches some stationary distribution $\pi_i$; the probability of walker $i$ being at node $v$ amounts to $\pi_i(v)$.

As a result of this process, any unlabelled object $v\in V_u$ will be assigned the probabilities $\pi_1(v),\ldots,\pi_k(v)$, and it will be given the label $\arg\max_i \pi_i(v)$.

To formalise this, let $S$ be the similarity matrix and let $D$ be the diagonal degree matrix $D=\mathrm{diag}(S\mathbf{1})$. Define the (column-stochastic) transition probability matrix as $P=SD^{-1}$. Hence for each walker

$$\pi_i=(1-b)P\pi_i+b\,h_i,$$

where the preferential jump vector is given by $h_i(v)=1/|V_i|$ for $v\in V_i$ and equals $0$ otherwise.

So for the population of the random walkers we have

$$\Pi=(1-b)P\,\Pi+b\,H,$$

where $\Pi$ consists of the stationary distribution column vectors $\pi_i$ and $H$ of the preferential jump vectors $h_i$. Hence

$$\Pi=b\,(I-(1-b)P)^{-1}H.$$
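A sketch of this propagation by simple power iteration is shown below; the restart probability `b`, the iteration count and the function name are illustrative choices.

```python
import numpy as np

def personalized_pagerank_labels(S, seed_sets, b=0.15, n_iter=200):
    """Sketch of the label-spreading scheme above: one personalised PageRank walker
    per class, restarting uniformly over that class's labelled nodes; unlabelled
    nodes get the label of the walker with the highest stationary probability.
    S: similarity (weight) matrix; seed_sets: list of k index lists V_1,...,V_k."""
    P = S / S.sum(axis=0, keepdims=True)       # column-stochastic P = S D^{-1}
    n, k = S.shape[0], len(seed_sets)
    H = np.zeros((n, k))
    for i, seeds in enumerate(seed_sets):
        H[seeds, i] = 1.0 / len(seeds)         # preferential jump vector h_i
    Pi = np.full((n, k), 1.0 / n)
    for _ in range(n_iter):                    # power iteration of pi_i = (1-b) P pi_i + b h_i
        Pi = (1.0 - b) * (P @ Pi) + b * H
    return Pi.argmax(axis=1), Pi
```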

Another approach, proposed by Zhu et al. [542], assumes that a smooth function $f:V\to\mathbb{R}$ can be defined over the graph nodes, assigning values to the nodes in such a way that for labelled nodes $f(v)=l_v$, where $l_v$ is the label of node $v$, and that the labels of similar nodes are similar, that is, the sum

$$E(f)=\frac12\sum_{i,j\in V} s_{ij}\,\bigl(f(i)-f(j)\bigr)^{2}$$

is minimised.

The function $f$ minimising $E(f)$ is harmonic, that is, for unlabelled examples $Lf=0$, where $L$ is the Laplacian $L=D-S$ with $D=\mathrm{diag}(S\mathbf{1})$.

As $f$ is harmonic, the value of $f$ at each unlabelled point is the (weighted) average of $f$ at the neighbouring points, that is,

$$f(j)=\frac{1}{d_j-s_{jj}}\sum_{i\ne j} s_{ij}\,f(i)$$

for $j$ being an unlabelled point. This implies that $f=(D-\mathrm{maindiag}(S))^{-1}(S-\mathrm{maindiag}(S))f$, where $\mathrm{maindiag}(S)$ denotes the diagonal part of $S$. For simplicity, denote $D-\mathrm{maindiag}(S)$ by $F$ and $S-\mathrm{maindiag}(S)$ by $W$. Let $W_{ll}$, $W_{lu}$, $W_{ul}$, $W_{uu}$ denote the submatrices of $W$ corresponding to the pairs (labelled, labelled), (labelled, unlabelled), (unlabelled, labelled) and (unlabelled, unlabelled) respectively. A similar notation applies to $F$, and $f_l$, $f_u$ denote the vectors of function values for the labelled and unlabelled examples. Now one can see that

$$f_u=(F_{uu}-W_{uu})^{-1}W_{ul}\,f_l.$$
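A direct sketch of this closed-form solution, assuming a dense similarity matrix and a boolean mask of labelled nodes, looks as follows.

```python
import numpy as np

def harmonic_labels(S, f_l, labelled):
    """Sketch of the harmonic solution f_u = (F_uu - W_uu)^{-1} W_ul f_l, with
    W = S - diag(S), F = D - diag(S) and D = diag(S 1).
    S: similarity matrix; f_l: labels of labelled nodes; labelled: boolean mask."""
    W = S - np.diag(np.diag(S))                       # off-diagonal similarities
    F = np.diag(S.sum(axis=1)) - np.diag(np.diag(S))  # degrees minus self-similarity
    u = ~labelled
    W_uu = W[np.ix_(u, u)]
    W_ul = W[np.ix_(u, labelled)]
    F_uu = F[np.ix_(u, u)]
    return np.linalg.solve(F_uu - W_uu, W_ul @ f_l)
```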

3.15.6 Further Considerations

The ideas of semi-supervised clustering may also be combined with soft clustering, as done e.g. in [81, 82, 388]. Other possible generalisations concern the relationship between a class and a cluster. So far we have assumed that the clusters to be generated are related in some way to the classes indicated in the labelled data; reference [81] considers the case when several clusters constitute one class. Still another assumption made so far is that the distance between objects is independent of the actual clustering; reference [82] drops this assumption and develops an algorithm adapting the distance measure to the actual content of the clusters.

In order to incorporate semi-supervised learning into the FCM algorithm, [388] proposes to extend the optimisation goal given by Eq. (3.54) to the form

$$J_\alpha(U,M)=\sum_{i=1}^{m}\sum_{j=1}^{k} u_{ij}^{\alpha}\,\|x_i-\mu_j\|^{2}+\beta\sum_{i=1}^{m}\sum_{j=1}^{k}\bigl(u_{ij}-f_{ij}\,b_i\bigr)^{\alpha}\,\|x_i-\mu_j\|^{2}\qquad(3.150)$$

where $b_i$ tells whether the sample element $i$ is labelled ($b_i=1$) or not ($b_i=0$), and $f_{ij}$ is meaningful for labelled sample elements and reflects the degree of membership of the labelled element $i$ in the class/cluster $j$. The parameter $\beta$ governs the impact of the supervised part of the sample. The algorithm reduces to FCM if $\beta$ is zero. For $\beta>0$ the second sum penalises deviations of the cluster membership degrees $u_{ij}$ from the prescribed class memberships $f_{ij}$.

The above approach is valid if the number of classes is identical with the number of clusters that are formed, so that distinct clusters are labelled with distinct labels.
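For illustration, the sketch below evaluates the objective (3.150) for given memberships and prototypes; the membership and prototype update equations used by the actual algorithm are not shown, and `alpha=2` is assumed so that the penalty term is well defined for negative differences.

```python
import numpy as np

def ssfcm_objective(U, M, X, F, b, alpha=2.0, beta=1.0):
    """Value of the objective (3.150): the FCM term plus the penalty pulling the
    memberships u_ij of labelled elements (b_i = 1) towards the prescribed f_ij.
    U: (m, k) memberships; M: (k, n) prototypes; X: (m, n) data;
    F: (m, k) prescribed memberships; b: (m,) 0/1 labelled flags."""
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)   # ||x_i - mu_j||^2
    fcm_term = (U ** alpha * d2).sum()
    penalty = ((U - F * b[:, None]) ** alpha * d2).sum()
    return fcm_term + beta * penalty
```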
