Finally, we show the challenges to be faced and promising future directions for the area in Section 8, and provide the complexity analysis of the main data stream clustering algorithms in the Electronic Appendix.

The feature vector. The use of a feature vector for summarizing large amounts of data was first introduced in the BIRCH algorithm [Zhang et al. 1996]. A clustering feature (CF) vector stores three components: the number of objects (N), their linear sum (LS), and their sum of squares (SS). These three components allow the calculation of clustering measures, such as the cluster mean (Eq. 3). The CF vector exhibits important growth and addition properties, as described below.
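To make the growth and additivity properties concrete, here is a minimal sketch of a BIRCH-style CF vector, assuming the standard three components (N, LS, SS); the class and method names are illustrative.

```python
# Minimal sketch of a BIRCH-style clustering feature (CF) vector with
# the usual three components: N (object count), LS (linear sum), and
# SS (sum of squared values per dimension).

class CFVector:
    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim
        self.ss = [0.0] * dim

    def add(self, x):
        """Growth (incrementality): absorb a new object x."""
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v

    def merge(self, other):
        """Additivity: the CF of the union is the component-wise sum."""
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]

    def mean(self):
        """Cluster centroid: LS / N (as in Eq. 3)."""
        return [v / self.n for v in self.ls]

cf = CFVector(dim=2)
for point in [(1.0, 2.0), (3.0, 4.0)]:
    cf.add(point)
print(cf.mean())  # [2.0, 3.0]
```

Additivity is what allows two microclusters to be merged in constant time, since the CF of the union is simply the component-wise sum of the two CF vectors.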
Each of the k2 clusters is evaluated according to a compactness criterion that verifies whether the variance of the cluster is below a threshold β (an input parameter). The statistical summary (CF vector) of the clusters meeting this criterion is stored in the buffer, together with the CF vectors obtained from the primary compression. The additional components are the sum of the time stamps (LST) and the sum of the squares of the time stamps (SST). At bounded time periods Tp (given by Eq. 9), the set of p-microclusters is checked to verify whether each p-microcluster should become an o-microcluster.
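A hedged sketch of a temporal microcluster in the style of CluStream, extending the CF vector with the time-stamp sums LST and SST mentioned above; names and structure are illustrative, not the original design.

```python
# CluStream-style temporal microcluster: a CF vector augmented with
# LST (sum of time stamps) and SST (sum of squared time stamps).
import math

class TemporalMicroCluster:
    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim   # linear sum of objects
        self.ss = [0.0] * dim   # sum of squared values
        self.lst = 0.0          # LST: sum of time stamps
        self.sst = 0.0          # SST: sum of squared time stamps

    def add(self, x, t):
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v
        self.lst += t
        self.sst += t * t

    def mean_time(self):
        """Average arrival time, usable to judge a microcluster's recency."""
        return self.lst / self.n

    def time_std(self):
        """Standard deviation of the time stamps."""
        return math.sqrt(max(self.sst / self.n - self.mean_time() ** 2, 0.0))

mc = TemporalMicroCluster(dim=1)
for t, x in [(1, (1.0,)), (2, (2.0,)), (3, (3.0,))]:
    mc.add(x, t)
print(mc.mean_time())  # 2.0
```

The time statistics are what let an algorithm decide, for instance, which microclusters are stale enough to be deleted or demoted.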
Window Models
If the grid cell is not in the list of valid grid cells (structured as a hash table), it is inserted into the list and its corresponding summary is updated. The local site communicates the local state change by sending the number of the grid cell that has been updated. Each time a new value x_it is read, the counter of the corresponding bin is incremented in both the first and second layers.
The number of bins in the first layer can change, provided the following condition is met: if the value of the counter associated with a bin in the first layer is greater than a user-defined threshold α, the bin is split into two. If a split occurred, all bins in the second layer are sent to the central site; otherwise, only the updated bin is sent. Newer objects weigh more than older ones, and an object's weight decreases over time.
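A minimal sketch of this damped weighting, assuming the common base-2 exponential form w = 2^(−λ·Δt) (the exact functional form varies between algorithms):

```python
# Exponentially decayed weight of an object observed at time t_obj,
# evaluated at the current time t_now; lam is the decay rate λ.
# The base-2 form is an assumption borrowed from common usage.

def decayed_weight(t_now, t_obj, lam):
    return 2.0 ** (-lam * (t_now - t_obj))

# A fresh object has full weight; the larger lam is, the faster it fades.
w_now = decayed_weight(5.0, 5.0, lam=0.5)    # 1.0
w_slow = decayed_weight(10.0, 0.0, lam=0.1)  # 0.5
w_fast = decayed_weight(10.0, 0.0, lam=1.0)  # ~0.001
```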
An illustrative example of the damped window model is shown in Figure 7, where the weight of objects decays exponentially from black (most recent) to white (expired). The higher the value of λ, the lower the importance of past data relative to the most recent data. Landmark window model. Processing a stream based on landmark windows requires handling disjoint portions of the stream (chunks), which are separated by landmarks (reference objects).
Landmarks can be defined either in terms of time (e.g., on a daily or weekly basis) or in terms of the number of elements observed since the previous landmark [Metwally et al. 2005]. On the other hand, in stable phases of the stream, the abrupt forgetting of past data at each landmark can hurt the performance of the learning algorithm.
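A count-based landmark window of the kind described above can be sketched as a simple chunking routine (names and the toy stream are illustrative):

```python
# Toy count-based landmark window: the stream is cut into disjoint
# chunks every `size` elements, and each chunk would be processed
# independently of the previous ones.

def landmark_chunks(stream, size):
    chunk = []
    for x in stream:
        chunk.append(x)
        if len(chunk) == size:  # landmark reached: emit and forget
            yield chunk
            chunk = []
    if chunk:                   # trailing partial window
        yield chunk

chunks = list(landmark_chunks(range(7), size=3))
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6]]
```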
Outlier Detection Mechanisms
One of the most popular algorithms for data clustering is k-means [MacQueen 1967], due to its simplicity, scalability, and empirical success in many real-world applications [Wu et al. 2008]. The first strategy is the simplest to use in practice, as no further modification of the clustering algorithm is required. The third strategy requires modification of the clustering algorithm to properly handle the CF vectors as objects.
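Running a conventional algorithm over the summaries, with each microcluster acting as a pseudo-object weighted by its object count, can be sketched roughly as follows. The function, its parameters, and the weight-proportional seeding heuristic are illustrative assumptions, not any specific published variant.

```python
# k-means over microcluster summaries: each summary is a (centroid,
# weight) pair, where the weight is the number of absorbed objects.
import random

def weighted_kmeans(micro, k, iters=10, seed=0):
    """micro: list of (centroid, weight) pairs summarizing microclusters."""
    rng = random.Random(seed)
    pts = [p for p, _ in micro]
    ws = [w for _, w in micro]
    dim = len(pts[0])
    # initial prototypes sampled with probability proportional to weight
    protos = rng.choices(pts, weights=ws, k=k)
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        wsum = [0.0] * k
        for p, w in micro:
            # assign each microcluster to its nearest prototype
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, protos[c])))
            for i in range(dim):
                sums[j][i] += w * p[i]
            wsum[j] += w
        # weighted centroid update; keep the old prototype if a cluster is empty
        protos = [tuple(sums[j][i] / wsum[j] for i in range(dim))
                  if wsum[j] > 0 else protos[j] for j in range(k)]
    return protos
```

Because each microcluster stands for many objects, both the seeding and the update step use the microcluster weight rather than a unit count.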
2003], the employed k-means variant uses an adapted version of the second strategy, which selects initial prototypes with probability proportional to the number of objects in a microcluster (expanded CF vector). In the update step, the new prototype is defined as the weighted center of the objects assigned to it. The cost function is the sum of the costs associated with open facilities and the service costs.
It is thus a combination of the sum of squared error (SSE) and the cost of inserting a medoid into a partition, which provides more flexibility in finding the number of clusters. The purpose of the central site is to find k clusters and to keep the data partition continuously up to date. The central object of each of the most common global states is used in the final clustering.
While the system is in a non-converged state, whenever a new state s(t) is reached, the cluster centers are updated according to the top-m states. If the system has already converged and the current state becomes part of the top states, the system updates the centers and switches back to the non-converged state.
TEMPORAL ASPECTS
Time-Aware Clustering
Algorithms for clustering data streams should make adequate use of the inherent temporal element in data streams. For example, these algorithms should be able to implicitly or explicitly consider the influence of time during the clustering process (time-aware clustering). Current data stream clustering algorithms perform time-aware clustering by assigning different importance levels to objects (given that recent data is more important than old data) or by modeling the behavior of incoming data in such a way that objects can be clustered according to different temporal patterns instead of the traditional spatial approach.
In the first case, the merging process is influenced by the age of the objects, which is explicitly modeled by a decay function [Aggarwal et al.]. For the second case, a typical example is the Temporal Structure Learning algorithm for Clustering Massive Data Streams in Real Time (TRACDS) [Hahsler and Dunham 2011]. With the Markov chain (MC) model, TRACDS can model the behavior of continuously arriving objects via state-change probabilities, with the temporal element implicitly accounted for through the different time patterns of object sequences.
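The state-change idea can be illustrated with a simple transition-count model in the spirit of TRACDS; this sketch treats the cluster assignments of consecutive objects as Markov-chain transitions and is not the actual TRACDS data structure.

```python
# Estimate state-change probabilities between clusters by counting
# transitions between consecutive objects' cluster assignments.
from collections import defaultdict

class TransitionModel:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.prev = None

    def observe(self, cluster_id):
        """Record the transition from the previous object's cluster."""
        if self.prev is not None:
            self.counts[self.prev][cluster_id] += 1
        self.prev = cluster_id

    def prob(self, src, dst):
        """Maximum-likelihood estimate of the transition probability."""
        total = sum(self.counts[src].values())
        return self.counts[src][dst] / total if total else 0.0

tm = TransitionModel()
for c in ["A", "A", "B", "A", "B"]:
    tm.observe(c)
print(tm.prob("A", "B"))  # 2/3
```

The resulting probabilities capture temporal patterns (which cluster tends to follow which) that a purely spatial clustering would ignore.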
Outlier-Evolution Dilemma
Cluster Tracking
ASSESSING CLUSTER STRUCTURES
The SSE criterion can be formally described by Eq. (16), where ci is the center of cluster Ci. To calculate the purity criterion, each cluster is assigned to its majority class, as described in Eq. (17), where vj is the number of objects in cluster Cj belonging to its majority class. Note that criteria such as SSE and purity are typically used in a sliding-window model, meaning that the clustering partition is evaluated over the data within the sliding window.
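The two criteria can be computed directly from their usual definitions; this sketch assumes Euclidean SSE with centroids (as in Eq. 16) and majority-class purity (as in Eq. 17), with made-up data.

```python
# SSE: sum of squared distances of objects to their cluster centroid.
# Purity: fraction of objects covered by each cluster's majority class.
from collections import Counter

def sse(clusters):
    """clusters: list of lists of points (tuples)."""
    total = 0.0
    for pts in clusters:
        dim = len(pts[0])
        center = [sum(p[i] for p in pts) / len(pts) for i in range(dim)]
        total += sum(sum((p[i] - center[i]) ** 2 for i in range(dim))
                     for p in pts)
    return total

def purity(cluster_labels):
    """cluster_labels: list of lists of class labels, one list per cluster."""
    n = sum(len(ls) for ls in cluster_labels)
    return sum(Counter(ls).most_common(1)[0][1] for ls in cluster_labels) / n

print(sse([[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]))  # 2.0
print(purity([["a", "a", "b"], ["b", "b"]]))          # 0.8
```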
However, if the algorithm does not use the sliding-window model (for example, if it uses the concept of representative objects), evaluating a partition built only from the most recent objects of the stream may not be a good idea, since representative objects capture both previous and current information together. In this sense, it is not sufficient to evaluate the quality of the generated partition (spatial criterion); it is also necessary to evaluate the changes that occur in the partition over time (temporal criterion). Although the quality of the partition may indicate that there have been changes in the data distribution (for example, a quality loss due to a newly emerging cluster), it does not clearly indicate what is causing that loss.
Therefore, spatial and temporal criteria must be combined for a correct evaluation of the quality of the partitions and of their behavior over the course of the stream. In Kremer et al. [2011], the authors propose an external criterion for the evaluation of clustering algorithms, called the Cluster Mapping Measure (CMM), which takes the age of objects into account. Clusters that are constantly moving can eventually "lose" objects, so CMM penalizes these missed objects. Clusters may also eventually overlap during the stream, and thus CMM penalizes erroneously assigned objects. CMM is an external criterion, and therefore requires a "gold standard" partition, which is not available in many practical applications.
DATA STREAM CLUSTERING IN PRACTICE
Applications
Due to its large size, it has also been consistently used to assess data stream clustering algorithms (e.g., [Aggarwal et al. 2003; Aggarwal and Yu 2008; Aggarwal 2010]). Examples of artificially generated data sets include: (i) data generated by varying Gaussian distributions [Aggarwal et al.]. They use a platform developed within the Department of Automatic Control and Micro-Mechatronic Systems of the FEMTO-ST Institute to generate data for the testing and validation of bearing prognostics approaches.
Vibration and temperature measurements of the rolling bearing during its operation are collected by various sensors. A dataset commonly used by the data stream clustering research community is the Charitable Donations dataset (KDD-CUP'98) [Aggarwal et al.]. Another commonly used dataset in stream clustering is the Forest Cover Type dataset [Aggarwal et al.].
2009] and Zhang and Wang [2010], the authors propose to cluster data streams from real-time network monitoring. To diagnose the EGEE grid (Enabling Grids for E-sciencE), they leverage gLite reports on the job life cycle and on the behavior of the middleware components to provide summarized information on the grid's operating status. Clustering the time series generated by each sensor is one of the learning tasks required in this scenario, given that it allows the identification of consumption profiles and the distinction among urban, rural, and industrial consumers.
Grouping this kind of information can help understand electricity demand patterns at different times of the day. Each record contains 15 attributes corresponding to several speech characteristics, vocal tract model, pitch, and arousal.
Data Repositories
Software Packages
CHALLENGES AND FUTURE DIRECTIONS
The authors would like to express their gratitude to the anonymous reviewers of the original manuscript for their constructive comments.