
\[
\text{F1-score} = \frac{2pr}{p + r}, \quad \text{where } p = \frac{TP}{TP + FP} \text{ and } r = \frac{TP}{TP + FN}
\tag{3.14}
\]

Here p and r (Equation 3.14) denote Precision and Recall, respectively.
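As a quick worked check of Equation 3.14, the short Python sketch below computes p, r and the F1-score from illustrative TP, FP and FN counts (the counts are made up for the example, not taken from any dataset in this chapter):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Equation 3.14: F1 = 2pr / (p + r)."""
    p = tp / (tp + fp)  # Precision
    r = tp / (tp + fn)  # Recall
    return 2 * p * r / (p + r)

# Example: TP = 80, FP = 20, FN = 40 gives p = 0.8, r = 2/3, F1 ≈ 0.727.
print(f1_score(tp=80, fp=20, fn=40))
```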

Table 3.11: Cluster quality evaluation metrics, computed using Equations 3.11, 3.12, 3.13 and 3.14.

Dataset    #Classes    iMass                           MBSCAN
                       NMI      RI       F1-score     NMI      RI       F1-score
Libras     15          0.272    0.80438  0.10896      0.272    0.80438  0.10896
Segment    7           1.0      0.17848  0.30290      1.0      0.17848  0.30290
Wine       3           1.0      0.38488  0.55583      0        0.52105  0.68512
Seeds      3           1.0      0.38684  0.55787      1.0      0.38743  0.55848
Iris       3           1.0      0.38743  0.55848      0.81497  0.54598  0.42210

For most of the class-labeled datasets in Table 3.11, we observe that the iMass clustering algorithm either retains or improves upon the NMI value of MBSCAN. However, for the Wine and Seeds datasets, MBSCAN has a higher RI and F1-score than iMass. For the other three unlabeled datasets (Aggregation, S1 and S2), a mean cluster accuracy of 60.375% was achieved.
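For reference, scores of the kind reported in Table 3.11 can be reproduced for a labeled dataset along the following lines. This is a minimal sketch, assuming scikit-learn is available for NMI and RI and using toy label vectors; the pairwise F1 is computed directly from the pair counts of Equation 3.14. It is not the evaluation code used in the thesis.

```python
from itertools import combinations
from sklearn.metrics import normalized_mutual_info_score, rand_score

def pairwise_f1(true_labels, pred_labels):
    """Pair-counting F1 (Equation 3.14): TP = pairs grouped together in both labelings."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        tp += same_true and same_pred
        fp += same_pred and not same_true
        fn += same_true and not same_pred
    p = tp / (tp + fp)  # Precision over pairs
    r = tp / (tp + fn)  # Recall over pairs
    return 2 * p * r / (p + r)

true_labels = [0, 0, 1, 1, 2, 2]   # toy ground-truth classes
pred_labels = [0, 0, 1, 2, 2, 2]   # toy cluster assignments
print(normalized_mutual_info_score(true_labels, pred_labels))  # NMI
print(rand_score(true_labels, pred_labels))                    # RI
print(pairwise_f1(true_labels, pred_labels))                   # F1-score
```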

to identify in constant time whether a new point has penetrated a lowest-level node. Moreover, by retaining the exactness of clusters for certain datasets and maintaining an overall mean accuracy of about 60.375% for unlabeled data, we showed that C₀(IM) ≈ C₀(M). For labeled data, we showed that the iMass algorithm achieved similar or improved results over MBSCAN in terms of NMI, RI and F1-score, thereby satisfying the objectives stated in Section 3.4.

The MBSCAN [2] clustering algorithm is built on the strength of random entities: the split-attribute (q) and split-point (p) values. As a result, producing an exact incremental extension to MBSCAN proved to be a major challenge. The construction of the iForest is heavily reliant on the size of the subsample 𝒟 ⊂ D and the number of iTrees (t). In our proposed algorithm iMass, we chose not to increase t, because increasing the number of iTrees with every new insertion would erode the advantages gained by maintaining a consistent number of iTrees. The creation of internal nodes within an iTree also depends on the random entities used by the MBSCAN algorithm. Since a prior execution of MBSCAN is performed before iMass is run, at no point can we guarantee that two independent runs of the iMass algorithm will produce an identical set of clusters. However, iMass achieved a speedup of up to 2.28 orders of magnitude (about 191 times) across datasets, which shows it to be a worthy approximate incremental extension of MBSCAN.

Chapter 4

BISDBadd: Towards Exact Incremental Clustering in Batch-Mode for Insertion using SNN-DBSCAN

In the previous chapter, we presented an intelligent tuning of the expensive components of the baseline algorithm. The proposed scheme, however, was limited to single-point insertions. Given the repeated reconstruction of the heavier algorithmic components, point-based updates may eventually prove inefficient when dealing with a larger base dataset. Moreover, it is desirable for an incremental algorithm to produce results identical to the naive, non-incremental approach. The efforts laid out in our first contribution therefore motivated us to expand our research towards an exact incremental solution. We chose to incrementally extend a robust density-based clustering algorithm known as SNN-DBSCAN [24] (SNNDB), where updates (insertions) are made in batches to the base dataset.

We initially proposed two sub-variant algorithms, viz. Batch-Inc1 and Batch-Inc2. While Batch-Inc1 solves only a single component of SNNDB incrementally, Batch-Inc2 deals with two components. Both algorithms process insertions in batch mode, leading to the design of the most effective variant, BISDBadd (Batch Incremental Shared nearest neighbor Density Based clustering algorithm for addition). The BISDBadd algorithm targets all the components of SNNDB incrementally.

4.1 Motivation

Dynamic datasets undergo frequent changes in their size upon periodic insertion.

A naive method to obtain an exact clustering over the changed dataset requires a redundant execution of the clustering algorithm. Moreover, for minor changes in the input, the variation in the output is also expected to be minimal. These changes inflicted upon the dataset cannot be ignored, as they might be significant for some data points and their neighborhoods. As the frequency of such updates increases, the problem of redundant computation may lead to efficiency and latency issues.

Table 4.1: Motivation behind developing the BISDBadd clustering algorithm.

Motivation: Redundant computation
Description: Non-incremental algorithms fail to address the issue of redundant computation while handling dynamic datasets. They involve the entire set of data points for every new update made to the dataset.

Motivation: Small frequent updates
Description: When a minimal number of insertions is made to a larger base dataset, the change in the clustering is also expected to be small. There is therefore a need for intelligent algorithms that handle such frequent updates efficiently, without redundant computation.

Motivation: InSDB [1] handles pointwise addition
Description: InSDB handles the addition of points one at a time. The process may slow down as the base dataset grows with new insertions, because finding the affected points for every insertion requires a scan of the whole dataset. This scanning time is bound to rise with the size of the base dataset. There is thus a need to process updates in batch mode to speed up cluster detection for new updates.

SNNDB [24] is a robust graph-based clustering algorithm that can find clusters of arbitrary shapes, sizes and densities. The existing incremental extension to SNNDB, i.e., IncSNN-DBSCAN [1] (InSDB), facilitates the addition of data points one at a time. As a result, rebuilding the expensive components of SNNDB for every single point insertion incurs a high computational cost. To address this issue, we propose an exact incremental solution to SNNDB that processes updates in batch mode. The entry of data points in batches enables faster processing of updates in one attempt, which is not possible with a point-based insertion scheme; the sketch below illustrates the difference in scan patterns. Table 4.1 provides a brief description of the motivation behind our work.
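To make the contrast concrete, the following minimal Python sketch (a hypothetical helper, not the thesis implementation; NumPy and Euclidean distance are assumptions here) shows how one pass over the base dataset can locate every point whose k-nearest-neighbor list is affected by an entire batch of insertions, whereas pointwise insertion would repeat this scan once per new point.

```python
import numpy as np

def affected_points(base: np.ndarray, kth_dist: np.ndarray,
                    batch: np.ndarray) -> np.ndarray:
    """Indices of base points whose k-NN lists change after inserting `batch`.

    kth_dist[i] holds base point i's distance to its current k-th nearest
    neighbor; any new point closer than that displaces an entry in the list.
    A single pass over `base` covers the whole batch at once.
    """
    # distances from every base point to every new point: shape (n_base, n_new)
    d = np.linalg.norm(base[:, None, :] - batch[None, :, :], axis=-1)
    return np.where((d < kth_dist[:, None]).any(axis=1))[0]

# Toy usage: 1000 base points, one batch of 50 insertions, k = 10.
rng = np.random.default_rng(0)
base, batch, k = rng.random((1000, 4)), rng.random((50, 4)), 10
pairwise = np.linalg.norm(base[:, None, :] - base[None, :, :], axis=-1)
np.fill_diagonal(pairwise, np.inf)
kth_dist = np.sort(pairwise, axis=1)[:, k - 1]
print(len(affected_points(base, kth_dist, batch)), "base points affected")
```

Only the returned indices then need their k-NN lists, similarity entries and core status refreshed, which is the intuition the batch-incremental variants exploit.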

4.1.1 Chapter contributions

The key contributions made in this chapter may be summarized as follows:

1. We propose three incremental variants of SNNDB, each of which processes updates made by the addition of data points in batch mode. These three algorithms are Batch-Inc1, Batch-Inc2 and BISDBadd (see Table 4.2). Experimentally, we observed that the third variant, BISDBadd, is the most efficient of the three.

2. We showed the effectiveness of our fastest incremental variant, BISDBadd, over SNNDB [24] while handling minimal changes made to the dataset.

3. We demonstrated that as the size of the base dataset increases, point-wise insertion of data no longer remains an effective option for detecting clusters dynamically. Updates made to a larger base dataset in batch mode prove more efficient than both the naive method (SNNDB [24]) and the point-based incremental method (InSDB [1]).

4. A thorough cluster analysis is provided.

Table 4.2: Brief overview of our proposed batch incremental clustering algorithms for addition (refer to Section 4.3 for definitions of related concepts).

Algorithm: Batch-Inc1
Brief working mechanism: Computes the KNN lists incrementally, detects the same clusters as SNNDB, and performs batch-wise insertion.
Advantage/Improvement: Reduces the time taken to compute the KNN lists after new insertions.

Algorithm: Batch-Inc2
Brief working mechanism: Computes the KNN lists and the similarity matrix incrementally, detects the same clusters as SNNDB, and performs batch-wise insertion.
Advantage/Improvement: Reduces the time taken to compute the KNN lists and construct the similarity matrix after new insertions.

Algorithm: BISDBadd
Brief working mechanism: Computes the KNN lists, the similarity matrix, and the core and non-core points incrementally, detects the same clusters as SNNDB, and performs batch-wise insertion.
Advantage/Improvement: Reduces the time taken to compute the KNN lists, construct the similarity matrix, and identify the core and non-core points after new insertions.
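Since all three variants detect the same clusters as SNNDB and differ only in how many components they maintain incrementally, their relationship can be summarized in a short runnable Python sketch. The stage and variant names mirror Table 4.2, while the functions themselves are schematic placeholders rather than the thesis implementation.

```python
# Schematic view of Table 4.2: each variant keeps one more SNNDB component
# incremental; any stage not kept incremental is rebuilt in full after a
# batch arrives, which is why BISDBadd is the fastest of the three.
PIPELINE = ["knn_lists", "similarity_matrix", "core_points", "clusters"]

INCREMENTAL = {
    "Batch-Inc1": {"knn_lists"},
    "Batch-Inc2": {"knn_lists", "similarity_matrix"},
    "BISDBadd":   {"knn_lists", "similarity_matrix", "core_points"},
}

def plan_batch_update(variant: str) -> list:
    """Describe how each pipeline stage is handled after a batch insertion."""
    kept = INCREMENTAL[variant]
    return [f"{stage}: {'incremental update' if stage in kept else 'full rebuild'}"
            for stage in PIPELINE]

for variant in INCREMENTAL:
    print(variant)
    for line in plan_batch_update(variant):
        print("  " + line)
```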