3.11 Clustering Evolving Over Time

In a number of applications the data to be clustered is not available all at once; rather, data points become available over time, for example when we handle data streams or when the clustered objects change their position in space over time. So one can consider a number of distinct tasks, related to a particular application domain:

• produce clusters incrementally, as the data flows in, frequently with the limitation that storage is insufficient to keep all the data or it is impractical to store all of it

• generate partitions of data that occurred in a sequence of (non-overlapping) time frames

• generate clusters incrementally, but data points from older time points are to be taken into account with lower weights

• detect new emerging clusters (novelty detection) and vanishing ones

• detect clusters moving in the data space (topical drift detection)

• the above under two different scenarios: (a) each object is assumed to occur only once in the data stream, (b) objects may re-occur in the data stream cancelling out their previous occurrence.

Especially in the latter case the task of clustering is not limited to partitioning the data; tracing corresponding clusters from frame to frame is also of interest.

Furthermore, it may be less advantageous to the user to obtain the best clustering for each individual window; somewhat less optimal but more stable clusters across time frames may be preferred.

3.11.1 Evolutionary Clustering

Chakrabarti et al. [101] introduce the task of evolutionary clustering as clustering under a scenario in which new portions of data arrive periodically and need to be incorporated into an existing clustering, which has to be modified if the new data exhibit a shift in cluster structure. Under such circumstances it is necessary to balance the consistency of the clustering over time against the loss of precision of the clustering of the current data. Note that we are interested in the clusterings obtained at each step and want to observe how they evolve over time.

Formally, at any point in time $t$, let $C^{(t)}$ be the clustering of all data seen by the algorithm up to time point $t$. If $C^{(t)}$ is the clustering at time point $t$, then the snapshot quality $sq(C^{(t)}, X^{(t)})$ measures how well $C^{(t)}$ reflects the cluster structure of the data $X^{(t)}$ that arrived at time point $t$ (neglecting the earlier data), while the history cost $hc(C^{(t-1)}, C^{(t)})$ measures the distance between $C^{(t-1)}$ and $C^{(t)}$ and is computed on all data seen by the algorithm up to time point $t-1$, that is $X^{(1)} \cup \cdots \cup X^{(t-1)}$. They develop algorithms maximising

$$sq(C^{(t)}, X^{(t)}) - \gamma\, hc(C^{(t-1)}, C^{(t)}) \qquad (3.144)$$

where $\gamma$ is a user-defined parameter balancing historical consistency and current accuracy.

Chakrabarti et al. illustrate their approach, among others, with the spherical $k$-means algorithm, but it applies equally well to the classical $k$-means algorithm. Let us recall the cost function of $k$-means clustering, which can be formulated as follows (a rewritten version of (3.1)):

$$J(M, X) = \sum_{i=1}^{m} \min_{j \in \{1, \ldots, k\}} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2 \qquad (3.145)$$

In their framework we would set $sq(C^{(t)}, X^{(t)}) = -J(M^{(t)}, X^{(t)})$ and $hc(C^{(t-1)}, C^{(t)}) = J(M^{(t)}, X^{(1)} \cup \cdots \cup X^{(t-1)}) - J(M^{(t-1)}, X^{(1)} \cup \cdots \cup X^{(t-1)})$. As at time point $t$ the clustering $C^{(t-1)}$ is already fixed, maximising $sq(C^{(t)}, X^{(t)}) - \gamma\, hc(C^{(t-1)}, C^{(t)})$ is equivalent to minimising $J(M^{(t)}, X^{(t)}) + \gamma\, J(M^{(t)}, X^{(1)} \cup \cdots \cup X^{(t-1)})$, that is, performing the classical $k$-means in which the historical data is down-weighted by the factor $\gamma$.
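Spelling out this equivalence step by step (a short derivation using only the definitions above; the shorthand $X^{\mathrm{hist}} = X^{(1)} \cup \cdots \cup X^{(t-1)}$ is introduced here only for brevity):

$$\begin{aligned}
\max_{M^{(t)}} \; & sq(C^{(t)}, X^{(t)}) - \gamma\, hc(C^{(t-1)}, C^{(t)}) \\
= \max_{M^{(t)}} \; & -J(M^{(t)}, X^{(t)}) - \gamma \bigl( J(M^{(t)}, X^{\mathrm{hist}}) - J(M^{(t-1)}, X^{\mathrm{hist}}) \bigr) \\
= \max_{M^{(t)}} \; & -J(M^{(t)}, X^{(t)}) - \gamma\, J(M^{(t)}, X^{\mathrm{hist}}) \qquad \text{(the term } \gamma\, J(M^{(t-1)}, X^{\mathrm{hist}}) \text{ does not depend on } M^{(t)}\text{)} \\
\equiv \min_{M^{(t)}} \; & J(M^{(t)}, X^{(t)}) + \gamma\, J(M^{(t)}, X^{\mathrm{hist}}).
\end{aligned}$$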

However, as one wants to keep as little historical data as possible, one may replace the data up to the point $t-1$ with the cluster centres $M^{(t-1)}$, weighted by the cardinalities of their clusters, and perform the clustering on the data redefined this way.
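A minimal sketch of one such evolutionary step in Python follows. It assumes Euclidean data held in NumPy arrays and uses a plain weighted Lloyd iteration; the function names `weighted_kmeans` and `evolutionary_kmeans_step`, as well as the bookkeeping of cluster sizes, are illustrative choices of this sketch and are not taken from [101].

```python
import numpy as np

def weighted_kmeans(points, weights, centres, n_iter=50):
    """Lloyd iterations on weighted points; returns updated centres."""
    for _ in range(n_iter):
        # assign every point to its nearest centre
        d2 = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # recompute each centre as the weighted mean of its members
        for j in range(len(centres)):
            mask = labels == j
            if weights[mask].sum() > 0:
                centres[j] = np.average(points[mask], axis=0, weights=weights[mask])
    return centres

def evolutionary_kmeans_step(X_t, old_centres, old_sizes, gamma, n_iter=50):
    """One evolutionary step: the historical data X(1)..X(t-1) is replaced by
    the previous centres M(t-1), weighted by gamma times the cardinalities of
    their clusters, and clustered together with the new chunk X(t)."""
    points = np.vstack([X_t, old_centres])
    weights = np.concatenate([np.ones(len(X_t)), gamma * old_sizes])
    centres = old_centres.astype(float).copy()        # warm start from M(t-1)
    new_centres = weighted_kmeans(points, weights, centres, n_iter)
    # update cluster cardinalities with the newly arrived points only
    d2 = ((X_t[:, None, :] - new_centres[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    new_sizes = old_sizes + np.bincount(labels, minlength=len(new_centres))
    return new_centres, new_sizes
```

The previous centres enter the optimisation as surrogate points whose weights $\gamma$ times the cluster cardinality implement the down-weighting of history described above.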

3.11.2 Streaming Clustering

Ailon et al. [11] consider the problem of an efficient approximation of the $k$-means clustering objective when the data arrive in portions, as in the preceding subsection.

But this time we are interested only in the final result and not in the intermediate clusterings, so historical consistency plays no role. Rather, we are interested in guarantees on the final clustering while investigating a single data snapshot $X^{(t)}$ at a time.

They introduce the concept of an $(\alpha, \beta)$-approximation to the $k$-means problem (their Definition 1.2) as an algorithm B such that for a data set $X$ it outputs a clustering $C$ containing $\alpha k$ clusters with centres described by the matrix $M$ such that $J(M, X) \le \beta\, J_{opt}(k, X)$, where $J_{opt}(k, X)$ is the optimal value of the objective for $X$ being clustered into $k$ clusters, $\alpha > 1$, $\beta > 1$.

Obviously, the k-means++ algorithm from Sect. 3.1.3.1 is an example of a $(1, 8(\ln k + 2))$-approximation. They modify this algorithm to obtain the so-called k-means# algorithm, described by pseudo-code 3.11.

Algorithm 3.11 k-means# algorithm, [11]

Input: Data set $X$ and number of groups $k$.

Output: Final number of groups $k$, gravity centres of classes $\{\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k\}$ along with the assignment of objects to classes $U = [u_{ij}]_{m \times k}$.

1: Pick randomly $3 \ln k$ objects $\mathbf{x} \in X$ and make them the set $C$.
2: for $j = 2$ to $k$ do
3: For each object $\mathbf{x}$ compute the distance to the nearest centre from the set $C$, that is $u(\mathbf{x}) = \min_{\boldsymbol{\mu} \in C} \|\mathbf{x} - \boldsymbol{\mu}\|^2$.
4: Create a set $C_j$ consisting of $3 \ln k$ objects $\mathbf{x} \in X$ picked randomly according to the distribution (3.5).
5: Add $C_j$ to the set of centres, $C \leftarrow C \cup C_j$.
6: end for
7: Run the $|C|$-means algorithm. Set $k = |C|$.
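The seeding part of this pseudo-code can be sketched in Python as follows. This is a hedged sketch: the helper name `kmeans_sharp_seeding` is illustrative, the $D^2$-sampling stands in for the distribution (3.5) used in step 4, and the final $|C|$-means run is left to any standard k-means routine.

```python
import numpy as np

def kmeans_sharp_seeding(X, k, rng=None):
    """k-means# seeding: in each of k rounds draw 3*ln(k) candidate centres
    with probability proportional to the squared distance to the nearest
    centre chosen so far (D^2-sampling)."""
    rng = rng or np.random.default_rng()
    m = len(X)
    batch = max(1, int(np.ceil(3 * np.log(k))))
    # round 1: 3 ln k objects chosen uniformly at random
    C = X[rng.choice(m, size=batch, replace=False)]
    for _ in range(2, k + 1):
        # squared distance of every object to its nearest current centre
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()
        idx = rng.choice(m, size=batch, replace=False, p=probs)
        C = np.vstack([C, X[idx]])
    return C   # roughly 3*k*ln(k) candidates; run |C|-means on them afterwards
```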

Ailon et al. [11] prove that their algorithm, with probability of at least 1/4, is a $(3 \ln k, 64)$-approximation to the $k$-means problem.

Based on k-means++ and k-means# they develop the Streaming divide-and-conquer clustering algorithm (pseudo-code 3.12).

Algorithm 3.12 Streaming divide-and-conquer clustering, version after [11]

Input: (a) Stream of data sets $X^{(1)}, \ldots, X^{(T)}$. (b) Number of desired clusters, $k \in \mathbb{N}$. (c) A, being an $(\alpha, \beta)$-approximation algorithm to the $k$-means objective. (d) A', being an $(\alpha', \beta')$-approximation algorithm to the $k$-means objective.

Output: Final number of groups $k$, set $M$ of gravity centres of classes $\{\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k\}$.

1: Create an empty set $R$ of pairs (object, weight).
2: for $t = 1$ to $T$ do
3: Run the algorithm A on the set $X^{(t)}$, obtaining the set $R^{(t)}$ of $\alpha k$ cluster centres with weights being the cardinality of the respective cluster.
4: Add $R^{(t)}$ to the set of centres, $R \leftarrow R \cup R^{(t)}$.
5: end for
6: Run the A' algorithm on $R$ to obtain $\alpha' k$ cluster centres $M$.

This algorithm uses algorithms A and A', being $(\alpha, \beta)$ and $(\alpha', \beta')$ approximations to the $k$-means objective, and is itself an $(\alpha', 2\beta + 4\beta'(\beta + 1))$-approximation of the $k$-means objective (see their Theorem 3.1). Hereby the algorithm A is defined as follows:

"Run k-means# on the data $3 \ln m$ times independently, and pick the clustering with the smallest cost", where $m$ is the number of data items in $X^{(1)}, \ldots, X^{(T)}$ taken together. Apparently, with probability $(1 - (3/4)^{3 \ln m})$ it is a $(3 \ln k, 64)$ algorithm.

The algorithm A' is defined as "Run the k-means++ algorithm on the data".

In all, in over 50% of runs the algorithm yields a $(1, O(\ln k))$-approximation to the $k$-means objective.
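The divide-and-conquer scheme itself amounts to a few lines. In the sketch below, `approx_A` and `approx_Aprime` stand for the two approximation algorithms (e.g. repeated k-means# for A and k-means++ for A'); their signatures, in particular that `approx_A` returns cluster cardinalities and that `approx_Aprime` accepts point weights, are assumptions of this illustration, not prescribed by [11].

```python
import numpy as np

def streaming_divide_and_conquer(stream, k, approx_A, approx_Aprime):
    """Cluster a stream of data chunks X(1), ..., X(T).

    approx_A(X, k)         -> (centres, sizes)  run on a raw, unweighted chunk
    approx_Aprime(P, w, k) -> centres           run on weighted representatives
    """
    reps, weights = [], []
    for X_t in stream:                        # process one chunk at a time
        centres, sizes = approx_A(X_t, k)     # about alpha*k centres per chunk
        reps.append(centres)                  # keep only the representatives,
        weights.append(sizes)                 # weighted by cluster cardinality
    R = np.vstack(reps)
    w = np.concatenate(weights)
    return approx_Aprime(R, w, k)             # final clustering of the summary
```

Only the weighted representatives $R$ are ever kept in memory, which is what makes the scheme applicable to streams.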

3.11.3 Incremental Clustering

When speaking about incremental clustering, we assume that the intrinsic clustering does not change over time, and the issue is whether or not we can properly detect the clustering using limited resources for the storage of intermediate results.

Reference [5] sheds some interesting light on the problems induced by algorithms that are sequential in nature, that is, algorithms to which data is presented one object at a time. They consider a sequential $k$-means that during its execution stores only the $k$ cluster centres and the cardinality of each cluster, and assigns each new element to the closest cluster, updating the respective statistics. If one defines a "nice clustering" as one in which any two elements of a cluster are closer to each other than to any other point, then this algorithm is unable to detect such a clustering even if it exists in the data. What is more, it cannot even detect a "perfect clustering", that is, one in which the distance between two elements of the same cluster is always lower than the distance between any two elements in different clusters.
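The per-object update of this sequential k-means is essentially a one-liner; a minimal sketch (assuming the centres have already been initialised somehow, e.g. from the first $k$ objects):

```python
import numpy as np

def sequential_kmeans_update(x, centres, counts):
    """Assign a single new object x to the nearest of the k stored centres and
    update that centre as a running mean; only the k centres and the k cluster
    cardinalities are kept in memory."""
    d2 = ((centres - x) ** 2).sum(axis=1)
    j = int(d2.argmin())
    counts[j] += 1
    centres[j] += (x - centres[j]) / counts[j]   # incremental mean update
    return centres, counts
```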

Another sequential algorithm, the nearest neighbour clustering algorithm, can surprisingly discover the perfect clustering, but the nice one remains undetected. This algorithm maintains $k$ data points; upon seeing a new data object it stores it in its memory, seeks the pair of closest points (in the set of the previously stored objects plus the new one) and randomly throws out one of the two objects of that pair.
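An analogous sketch of the nearest neighbour update step follows (the brute-force closest-pair search is quadratic in the number of stored points, which is acceptable for illustration):

```python
import numpy as np

def nearest_neighbour_update(x, kept, rng=None):
    """Keep at most k representative points: add the new object, find the
    closest pair among the stored points and discard one member of that pair
    at random."""
    rng = rng or np.random.default_rng()
    kept = np.vstack([kept, x])
    d2 = ((kept[:, None, :] - kept[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)                  # ignore self-distances
    a, b = np.unravel_index(d2.argmin(), d2.shape)
    drop = a if rng.random() < 0.5 else b         # throw out one of the pair
    return np.delete(kept, drop, axis=0)
```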

A "nice" clustering can be discovered by incremental algorithms only if $2^{k-1}$ cluster centres are maintained. For high $k$ this requirement may be prohibitive even if a clear cluster structure exists in the data.
