
4.7 Batch-Incremental SNNDB Clustering Algorithms for Addition

4.7.1 The Batch−Inc1 clustering algorithm

The Batch−Inc1 clustering algorithm incrementally builds the updated KNN lists of the individual points in the base dataset. When a batch of new points arrives, some of the old points may be affected, in that their property values change.

Only these affected points have their KNN lists rebuilt; the points left unaffected by the new insertions retain their existing KNN lists. The new similarity matrix (the updated K-SNN graph, or K-SNNupdated graph) and the new core and non-core points are determined non-incrementally. The steps of the Batch−Inc1 algorithm are as follows:

1. Step 1 - Set the parameters: The algorithm takes three parameters: K, δsim and δcore. The parameters have the following meanings:

(a) K denotes the size of KNN list for each data point.

(b) Given that two data points p and q are present in each other's KNN lists, δsim is the minimum SNN value required for p and q to form a strong link between them.

(c) δcore is the threshold on the number of strong links adjacent to a point p; p becomes a core point when this number exceeds δcore.
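The roles of K and δsim can be illustrated with a minimal Python sketch. It assumes the usual SNN definition of similarity (the number of neighbors common to both KNN lists), which this section does not restate; the function names and the dictionary representation of the KNN lists are hypothetical and chosen only for illustration.

```python
def shared_link_strength(knn, p, q):
    """SNN value of (p, q): shared neighbors, or 0 if p and q are not mutual neighbors."""
    if q not in knn[p] or p not in knn[q]:
        return 0
    return len(set(knn[p]) & set(knn[q]))


def is_strong_link(knn, p, q, delta_sim):
    """A link is strong when its SNN value reaches the minimum delta_sim (assumed inclusive)."""
    return shared_link_strength(knn, p, q) >= delta_sim


# Toy KNN lists with K = 3 (illustrative data, not taken from the source).
knn = {
    "a": ["b", "c", "d"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["c", "b", "e"],
    "e": ["d", "c", "b"],
}
ab_strength = shared_link_strength(knn, "a", "b")  # a and b share c and d
ad_strength = shared_link_strength(knn, "a", "d")  # a is not in KNN(d)
```

Here a and d cannot be linked at all, because membership in each other's KNN lists is a precondition for any shared link.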

2. Step 2 - Obtain the required data from prior SNNDB execution:

(a) Get the base dataset D where |D| = n (say).

(b) Get the KNN list ∀pi ∈D, i= 1,2,3, . . . , n.

(c) Get the similarity matrix Sim_Mat(D).

3. Step 3 - Insert a batch of new points: Add a batch of k new data points to D. D changes to D0, where |D0| = n+k.

4. Step 4 - Compute the KNN list of newly entered points: In this step, the KNN lists of all the newly added data points are computed non-incrementally. If k data points are added in a single batch, then ∀pj ∈ D0, j = n+1, n+2, n+3, ..., n+k, we find KNN(pj).
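As an illustration of Step 4, the following sketch computes the KNN lists of the new points by brute force over the whole of D0. Euclidean distance, the tie-breaking by point id, and the name `knn_list` are assumptions made for this example; the source does not prescribe a distance measure or a search method.

```python
import math


def knn_list(points, pid, K):
    """Return the K nearest neighbors of pid, nearest first (brute-force scan)."""
    others = [q for q in points if q != pid]
    # Ties in distance are broken by point id so the result is deterministic.
    others.sort(key=lambda q: (math.dist(points[pid], points[q]), q))
    return others[:K]


base = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (0.0, 2.0)}  # D, with n = 3
batch = {4: (0.5, 0.0), 5: (5.0, 5.0)}                # k = 2 new points
points = {**base, **batch}                            # D0, with n + k = 5

# Only the new points get a full (non-incremental) KNN computation here.
new_knn = {pid: knn_list(points, pid, K=2) for pid in batch}
```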

5. Step 5 - Compute the updated KNN list for old data points in D∩D0 incrementally: The number of existing data points in D (the base dataset) prior to any insertion is n. When k new points are added, D changes to D0 (|D0| = n+k). From the set D∩D0 (the set of old points), the algorithm identifies those points that can accommodate a newly added point in their KNN list by replacing an old one. If the size of the nearest-neighbor list is K, then at most K old points in a KNN list can be replaced by new ones. The old points in D∩D0 that contain at least one newly added point in their KNN list are categorized as KN−Sadd type affected points.

The term KN−Sadd means that both the KNN list and the similarity measures of the affected data point may be altered. KN stands for change(s) in the KNN list, while S signifies a possible change in the similarity values (shared link strength) of the affected data point with the points in its updated KNN list (KNNupdated(.)). If the new link strength falls below δsim, the link ceases to exist. The new points and the unaffected old points are not categorized as KN−Sadd type. The unaffected old points retain their previous KNN lists. Batch−Inc1 therefore focuses only on rebuilding the KNNupdated(.) lists of KN−Sadd type points; the KNNupdated(.) lists of unaffected old points in D∩D0 are not constructed separately.
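The incremental update of Step 5 might be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Euclidean distance, that every old KNN list is stored nearest-first with exactly K entries, and that a new point enters a list only if it is strictly closer than the current K-th neighbor. The name `update_old_knn` is hypothetical.

```python
import math


def update_old_knn(points, old_knn, new_ids, K):
    """Return (updated KNN lists of old points, set of KN-S_add type affected points)."""
    updated, affected = {}, set()
    for p, neighbors in old_knn.items():
        kth_dist = math.dist(points[p], points[neighbors[-1]])
        # A new point can displace an old neighbor only if it beats the K-th one.
        entrants = [q for q in new_ids if math.dist(points[p], points[q]) < kth_dist]
        if not entrants:
            updated[p] = list(neighbors)  # unaffected: the old list is retained
            continue
        merged = sorted(neighbors + entrants,
                        key=lambda q: math.dist(points[p], points[q]))
        updated[p] = merged[:K]
        affected.add(p)
    return updated, affected


points = {"a": (0, 0), "b": (1, 0), "c": (2, 0), "d": (10, 0), "n": (0.5, 0)}
old_knn = {"a": ["b", "c"], "b": ["a", "c"], "c": ["b", "a"], "d": ["c", "b"]}
updated, affected = update_old_knn(points, old_knn, new_ids=["n"], K=2)
```

In this toy dataset the new point n enters the lists of a, b and c, while the distant point d is unaffected and keeps its previous list.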

Running example: Figure 4.4 illustrates this step (Step 5). Consider the point P with KNN(P) = {P4, P1, P3, P5, P2}, assuming K = 5 (topmost image in Figure 4.4), before any new points enter the dataset. Let three new points N1, N2 and N3 (yellow) enter the dataset. For our purpose, we consider N1 and N3 to be at distances of 1 and 2.5 units respectively from P, while N2 is at a distance of 8 units (say). Comparing these distances with those of the other nearest neighbors of P shows that N1 and N3 can enter the KNN list of P, displacing points P2 and P5. This creates two vacant slots in KNN(P) (second image in Figure 4.4), and the links between the pairs (P, P2) and (P, P5) are broken⁵. Points N1 and N3 then occupy the two vacant slots in KNN(P).

Figure 4.4: The formation of KN−Sadd type affected points upon entry of new points. [Panels, top to bottom: KNN(P) with K=5; points P2, P5 removed from KNN(P) with K=5; updated KNN(P) with K=5. Legend: strong link, weak link, broken link.]

Of N1, N2 and N3, we consider only N1 and N3 to share a strong link with P. Sorting the current set of points in increasing order of distance to P, the updated KNN list of P obtained incrementally is KNNupdated(P) = {N1, P4, N3, P1, P3} (bottom image in Figure 4.4). Point P (green) is therefore a KN−Sadd type affected point, since it accommodates the new points N1 and N3 in its updated KNN list. The new point N2 has no influence on the KNN list of P and is therefore not a member of KNNupdated(P). The newly entered points and the old unaffected points are not classified as KN−Sadd type. The non-KN−Sadd type old points (say P1, P3 or P4) retain their previous KNN lists, so we have:

KNNupdated(P1) = KNN(P1), KNNupdated(P3) = KNN(P3), KNNupdated(P4) = KNN(P4).

⁵ The link is broken because the points are no longer present in each other's KNN lists, a necessary condition for constructing a shared strong link (refer to Section 4.5).
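The running example can be replayed numerically. The distances of N1, N2 and N3 from P are those stated in the text; the distances of P1 to P5 (2 to 6 units) are an assumption, inferred from the distance labels in Figure 4.4 and the order of the points in KNN(P).

```python
K = 5

# Distances from P. Old-point values are assumed from Figure 4.4 (see lead-in).
old = {"P4": 2.0, "P1": 3.0, "P3": 4.0, "P5": 5.0, "P2": 6.0}
new = {"N1": 1.0, "N2": 8.0, "N3": 2.5}

# Merge old neighbors with the new arrivals and keep the K closest to P.
candidates = {**old, **new}
knn_updated = sorted(candidates, key=candidates.get)[:K]

# Old neighbors pushed out of the list, and whether P is a KN-S_add type point.
displaced = [q for q in old if q not in knn_updated]
is_affected = any(q in new for q in knn_updated)

print(knn_updated)  # ['N1', 'P4', 'N3', 'P1', 'P3']
print(displaced)    # ['P5', 'P2']
```

N2 never enters the list because its distance of 8 units exceeds that of every current neighbor of P.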

6. Step 6 - Construct the updated K-SNN graph: The algorithm constructs the updated K-SNN (K-SNNupdated) graph, that is, the new similarity matrix Sim_Mat(D0), non-incrementally. The updated dataset D0 now consists of n+k points. Therefore, ∀pj ∈ D0, j = 1, 2, 3, ..., n+k, Batch−Inc1 determines whether a shared strong link can be constructed with each q ∈ KNNupdated(pj).
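A non-incremental construction of the strong links, as in Step 6, could look like the sketch below. It again assumes that link strength is the number of shared neighbors and that δsim is an inclusive minimum; storing the links in a dictionary keyed by unordered pairs (rather than a dense matrix) is a choice made for this example, and `build_snn_graph` is a hypothetical name.

```python
def build_snn_graph(knn, delta_sim):
    """Return {frozenset({p, q}): strength} for every strong link in the K-SNN graph."""
    strong = {}
    for p, neighbors in knn.items():
        p_set = set(neighbors)
        for q in neighbors:
            if p not in knn[q]:
                continue  # not mutual neighbors, so no shared link is possible
            strength = len(p_set & set(knn[q]))
            if strength >= delta_sim:
                strong[frozenset((p, q))] = strength
    return strong


# Toy updated KNN lists with K = 3 (illustrative data only).
knn_updated = {
    "a": ["b", "c", "d"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["c", "b", "e"],
    "e": ["d", "c", "b"],
}
strong = build_snn_graph(knn_updated, delta_sim=2)
```

The pair (b, d) is mutual but shares only one neighbor, so with δsim = 2 it remains a weak link and is left out of the result.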

7. Step 7 - Identify new core and non-core points: For each point in the K-SNNupdated graph (Sim_Mat(D0)), if the number of adjacent strong links is greater than δcore, the point obtains core status; otherwise it is a non-core point. The new sets of core and non-core points are stored in Core(D0) and Non-Core(D0) respectively.
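Step 7 amounts to counting the strong links adjacent to each point and comparing the count with δcore. A minimal sketch, with the hypothetical name `split_core` and strong links given as unordered pairs:

```python
from collections import Counter


def split_core(points, strong_links, delta_core):
    """Return (core, non_core): a point is core if its strong-link count exceeds delta_core."""
    degree = Counter()
    for link in strong_links:
        for p in link:
            degree[p] += 1
    core = {p for p in points if degree[p] > delta_core}
    return core, set(points) - core


points = ["a", "b", "c", "d", "e"]
strong_links = [
    frozenset(("a", "b")),
    frozenset(("a", "c")),
    frozenset(("b", "c")),
    frozenset(("d", "e")),
]
core, non_core = split_core(points, strong_links, delta_core=1)
```

The comparison is strict, matching the "greater than δcore" condition stated in the step.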

8. Step 8 - Form clusters: Two connected core points are placed in the same cluster. A non-core point is assigned to the cluster of its nearest core point⁶.

9. Step 9 - Discard noise points: The non-core points that are not connected to any core point become noise points. Such points do not obtain any cluster membership.
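Steps 8 and 9 can be sketched together. The example below assumes that "connected" means joined by a strong link, merges core points with a small union-find structure, and assigns each non-core point to the core point with which its link strength is highest. The function name and the data are hypothetical.

```python
def form_clusters(core, non_core, strong):
    """strong maps frozenset({p, q}) to link strength. Returns (labels, noise)."""
    parent = {p: p for p in core}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    # Step 8a: core points joined by a strong link share a cluster.
    for link in strong:
        p, q = tuple(link)
        if p in core and q in core:
            parent[find(p)] = find(q)
    labels = {p: find(p) for p in core}

    # Step 8b and Step 9: attach non-core points, or mark them as noise.
    noise = set()
    for p in non_core:
        best = None  # (strength, core point)
        for link, strength in strong.items():
            if p in link:
                (q,) = link - {p}
                if q in core and (best is None or strength > best[0]):
                    best = (strength, q)
        if best is None:
            noise.add(p)
        else:
            labels[p] = labels[best[1]]
    return labels, noise


core = {"a", "b", "c", "x"}
non_core = {"d", "e"}
strong = {
    frozenset(("a", "b")): 3,
    frozenset(("b", "c")): 2,
    frozenset(("c", "d")): 2,
    frozenset(("x", "d")): 4,
}
labels, noise = form_clusters(core, non_core, strong)
```

Point d is linked to both c and x, and joins the cluster of x because that link is stronger; point e has no link to any core point and becomes noise.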

10. Step 10 - Retain the updated values:

(a) D = D0

(b) n = n+k

(c) ∀pi ∈ D, i = 1, 2, 3, ..., n+k: KNN(pi) = KNNupdated(pi)

(d) Sim_Mat(D) = Sim_Mat(D0)

(e) Core(D) = Core(D0)

(f) Non-Core(D) = Non-Core(D0)

11. Repeat Steps 3 to 10 for the next batch of entering points.

⁶ The core point with which the shared link strength is highest becomes the “nearest” core