4.3.8 Exact Batch Incremental Clustering (Addition)
Given a dataset D along with its initial clustering f : D → C, where C ⊆ P(D), an insertion sequence of B batches with k points per batch takes place. After k′ ≤ kB insertions, where kB (mod k′) ≡ 0, let D′ be the updated dataset.
Then the incremental clustering given by a mapping h : D′ → C′, with C′ ⊆ P(D′), is isomorphic to the one-time clustering f(D′) produced by the non-incremental algorithm.
The KNN list of any data point consists of its top-K nearest neighbors. Please refer to Figure 4.1 for the representation of the KNN list of a data point P (say). In this figure, let the points P1, P2, P3, P4, P5 be at distances of 3, 6, 4, 2 and 5 units respectively from P. Then for K = 5, the KNN list for P is the set {P4, P1, P3, P5, P2}.
Figure 4.1: KNN list for point P where K = 5
[Figure: point P with neighbors P1–P5 at 3, 6, 4, 2 and 5 units respectively; P4 is the closest and P2 the farthest point within the KNN list of P, giving the ordered list {P4, P1, P3, P5, P2}.]
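The KNN-list construction described above can be sketched as follows. This is a minimal illustrative sketch, not the thesis implementation; the function name and the use of a distance table in place of coordinates are assumptions, with the distances taken from the Figure 4.1 example.

```python
def knn_list(point, others, dist, k):
    """Return the k nearest neighbors of `point`, closest first."""
    return sorted(others, key=lambda q: dist(point, q))[:k]

# Distances from P as quoted in Figure 4.1 (illustrative values).
distances = {"P1": 3, "P2": 6, "P3": 4, "P4": 2, "P5": 5}
neighbors = knn_list("P", list(distances), lambda p, q: distances[q], 5)
# neighbors == ["P4", "P1", "P3", "P5", "P2"], matching the figure
```

Sorting by distance and truncating to K reproduces the ordered KNN list, with the closest point (P4) first and the farthest (P2) last.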
The concept of shared nearest neighbors or SNN is inherited from the clustering scheme proposed by Jarvis and Patrick [52]. The SNN clustering technique does not use any distance metric for deciding the measure of closeness between any two data points. Instead, it relies on the number of shared data points between the KNN lists of any pair of points (p, q) to evaluate their proximity. The proximity score obtained is treated as the similarity value between p and q. While constructing the SNN graph, the data points are treated as nodes while the edge weight is equivalent to the similarity value between the pairs of points. This step is followed by the “K-Nearest Neighbor Sparsification” [24, 52] of the SNN graph. While building a K-SNN graph, an edge is formed between any two nodes p and q iff the following two conditions are satisfied:
1. Points p and q are present in each other's KNN lists.
2. The similarity value between p and q is greater than or equal to a certain threshold δsim (say).
Each edge constructed between a pair of points (p, q) satisfying the above two conditions is considered a strong link. Figure 4.2 demonstrates the similarity value calculation and strong link formation between two points P and P3. The KNN list of P contains {P4, P1, P3, P5, P2} while the KNN list of P3 consists of {P7, P8, P, P2, P4}. We observe that both P and P3 are included in each other's KNN list. The proximity score or the degree of closeness between P and P3 is therefore given as 2. This is because points P and P3 share two elements, {P2, P4}, between their KNN lists. If the value of δsim is set to 2, then the edge between points P and P3 is considered a strong link² since |{P2, P4}| ≥ δsim.
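The two strong-link conditions can be sketched as below. The function names are assumptions for illustration; the KNN lists and δsim = 2 follow the Figure 4.2 example.

```python
def similarity(knn_p, knn_q):
    """Proximity score: number of shared points between two KNN lists."""
    return len(set(knn_p) & set(knn_q))

def is_strong_link(p, q, knn, delta_sim):
    # Condition 1: p and q appear in each other's KNN list.
    if q not in knn[p] or p not in knn[q]:
        return False
    # Condition 2: the similarity value meets the threshold.
    return similarity(knn[p], knn[q]) >= delta_sim

knn = {"P":  ["P4", "P1", "P3", "P5", "P2"],
       "P3": ["P7", "P8", "P", "P2", "P4"]}
is_strong_link("P", "P3", knn, delta_sim=2)  # True: shared {P2, P4}
```

Note that the intersection here contains only P2 and P4, since P and P3 themselves each appear in only one of the two lists; this matches the similarity value of 2 derived in the text.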
Figure 4.2: Similarity value between points P and P3 in the K-SNN graph, given that P ∈ KNN(P3), P3 ∈ KNN(P) and K = 5.
[Figure: the KNN list of P with K = 5 is {P4, P1, P3, P5, P2} and the KNN list of P3 with K = 5 is {P7, P8, P, P2, P4}; with δsim = 2 (say), KNN(P) ∩ KNN(P3) = {P2, P4} and similarity(P, P3) = 2, so the edge between P and P3 is a strong link. Weak links are shown as dotted lines.]
The graph obtained by this mechanism is known as the K-sparsified SNN (K-SNN) graph [24, 52]. In the K-SNN graph, all the existing edges between any pair of nodes are strong links. While constructing an edge between p and q, if any one of the above two conditions is violated, an edge is not formed. All the connected components contained in the K-SNN graph are now treated as the final set of clusters by the SNN [52] algorithm.
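As a hedged sketch of the SNN [52] clustering scheme just described, the K-SNN graph and its connected components can be computed as follows; the helper names and union-find structure are illustrative choices, not the cited implementation.

```python
from itertools import combinations

def build_ksnn_graph(knn, delta_sim):
    """Strong-link edges weighted by shared-neighbor similarity."""
    edges = {}
    for p, q in combinations(knn, 2):
        # Condition 1: mutual KNN membership.
        if q in knn[p] and p in knn[q]:
            w = len(set(knn[p]) & set(knn[q]))
            # Condition 2: similarity meets the threshold.
            if w >= delta_sim:
                edges[(p, q)] = w
    return edges

def connected_components(nodes, edges):
    """Components of the K-SNN graph, via union-find."""
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n
    for p, q in edges:
        parent[find(p)] = find(q)
    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())

knn = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"],
       "D": ["E"], "E": ["D"]}
comps = connected_components(list(knn), build_ksnn_graph(knn, 1))
```

In this toy example A, B and C form one component (each pair shares one neighbor), while D and E stay apart: they are in each other's KNN list but share no third point, so condition 2 fails and no edge is formed.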
However, the SNNDB [24] algorithm produces the K-SNN graph without considering its connected components as clusters. Instead, SNNDB adopts a clustering scheme similar to the DBSCAN [23] algorithm. SNNDB identifies the dense (core) and border (non-core) points to find its final set of clusters. In the K-SNN graph, for any given point p (say), SNNDB detects the number of strong links adjacent to p (denoted as adj(p)). If adj(p) > δcore (a certain threshold), then p is designated as a core point; otherwise p is a non-core point. The number of strong links associated with point p provides a measure of its density.
²Having a point in the KNN list does not guarantee the formation of a shared strong link between the concerned point and its neighbor. For a shared strong link to exist, each of the two conditions for strong link formation must be satisfied.
Similar to DBSCAN [23], if p and q are two core points connected by a strong link, then both these points obtain the same cluster membership (first point under the Clustering definition in Section 4.3). However, if one of them is a non-core point, then that point is allocated to the cluster containing its nearest core point (second point under the Clustering definition in Section 4.3). The nearest core point is the one that shares a strong link with the concerned non-core point and has the highest edge weight compared to the other adjacent core points. The set of points that fail to obtain any cluster membership are classified as noise points (third point under the Clustering definition in Section 4.3).
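The SNNDB-style core/non-core labelling and border attachment can be sketched as follows. This is an illustrative sketch, not the SNNDB [24] code; `edges` maps strong links (p, q) to their similarity weights as in the K-SNN graph above, and the function names are assumptions.

```python
def classify_points(nodes, edges, delta_core):
    """Label each node core or non-core from its strong-link count."""
    adj = {n: 0 for n in nodes}
    for p, q in edges:
        adj[p] += 1  # each strong link raises the density
        adj[q] += 1  # of both of its endpoints
    return {n: "core" if adj[n] > delta_core else "non-core"
            for n in nodes}

def attach_non_core(p, edges, labels):
    """Cluster a non-core point with the adjacent core point whose
    strong link has the highest weight; None means p is noise."""
    candidates = [(w, q) for (a, b), w in edges.items()
                  for q in ((b,) if a == p else (a,) if b == p else ())
                  if labels[q] == "core"]
    return max(candidates)[1] if candidates else None

edges = {("P", "P3"): 2, ("P", "P4"): 3}
labels = classify_points(["P", "P3", "P4"], edges, delta_core=1)
# labels["P"] == "core"; P3 and P4 are non-core and attach to P
```

A non-core point with no strong link to any core point receives no cluster membership and is therefore classified as noise.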
Figure 4.3: Cluster containing core points P and P3 in the K-SNN graph. If δcore is set as 4, then adj(P) > δcore and adj(P3) > δcore.
[Figure: in the K-SNN graph all links except P3–P4 are strong links. With δcore = 3 (say), adj(P) = 5 > δcore and adj(P3) = 5 > δcore, so P and P3 are core points and become a part of the same cluster. Weak links are shown as dotted lines.]
In Figure 4.3, let us assume that the core point formation threshold (δcore) is set to 3. Now for point P, the number of adjacent strong links is five; therefore adj(P) equals 5. Similarly for point P3, adj(P3) is also determined as 5. Points P3 and P4 share a weak link³ which is not considered as a link; it is shown for representational purposes only. Since the densities of points P and P3 exceed the threshold value δcore, P and P3 are designated as core points. As per the DBSCAN [23] clustering scheme, points P and P3 become a part of the same cluster.
³A weak link is only a virtual link (dotted line), represented to show its difference from a strong link.
IncSNN-DBSCAN [1] or InSDB is an incremental extension of the SNNDB [24] clustering algorithm. InSDB facilitates the detection of clusters dynamically while points are added to the base dataset D one at a time. InSDB tags each data point p ∈ D with the following properties: its KNN list, the strengths of its shared strong links, the number of adjacent strong links, and its core or non-core status. When a new data point arrives, InSDB identifies only those among the old points which undergo changes in their properties. Only the affected points are targeted by the algorithm, while the unaffected points are allowed to exist in their previous state.
Let Npt be a new data point entering D. Upon entry of Npt, D changes to D′. Now, for any point p ∈ D, if p exhibits changes in its properties (as stated above), then InSDB targets p. The changes that p incurs in its properties may lead to the creation of new SNN connections or the removal of existing ones. New SNN connections could merge the existing clusters, and their removal could split them. The selective handling of affected data points ensures that the reconstruction time of the updated KNN lists and the K-SNN graph is drastically reduced. InSDB shows that only a very small percentage of existing points ultimately gets affected, due to which it becomes more efficient than SNNDB. However, InSDB is a point-based insertion technique, which might slow down as the size of D increases. This is because when insertions are made upon a larger base dataset, the time required to find the affected points increases. Moreover, the repetitive construction of necessary algorithmic components such as KNN lists and the K-SNN graph for every insertion may slow down the overall cluster detection process.
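The affected-point detection at the heart of this approach can be sketched as below. This is a minimal sketch of the idea, not the InSDB implementation; the distance function, the K = 1 lists, and the 1-D coordinates are illustrative assumptions.

```python
def affected_points(new_pt, points, knn, dist, k):
    """Old points whose KNN list changes when `new_pt` is inserted."""
    affected = []
    for p in points:
        current = knn[p]  # existing KNN list, closest first
        # new_pt enters p's KNN list iff p has fewer than k neighbors,
        # or new_pt is closer than p's current farthest neighbor.
        if len(current) < k or dist(p, new_pt) < dist(p, current[-1]):
            affected.append(p)
    return affected

# 1-D toy example: A, B, C at coordinates 0, 1, 10; new point N at 9.
coords = {"A": 0, "B": 1, "C": 10, "N": 9}
dist = lambda p, q: abs(coords[p] - coords[q])
knn = {"A": ["B"], "B": ["A"], "C": ["B"]}  # K = 1 lists
affected_points("N", ["A", "B", "C"], knn, dist, 1)  # only ["C"]
```

Only C's KNN list changes, so only C's strong links, density, and core status need re-evaluation; A and B retain their previous state, which is what makes the incremental update cheaper than reclustering from scratch.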