
Organization of the Thesis

advertisement of a search engine. KAGO reduced memory consumption by about 51.57% on average. Outlier evaluation on these datasets using the Rand index and F1-score showed a mean accuracy improvement of around 3.3% over KNNOD.

4. Chapter 4: In this chapter, an exact incremental solution to SNN-DBSCAN [24] is proposed in the form of the BISDadd algorithm, which facilitates insertion of points in batch mode. It is a combination of two sub-variant algorithms, Batch-Inc1 and Batch-Inc2, and is comparatively the most efficient.

5. Chapter 5: In this chapter, an exact incremental solution to SNN-DBSCAN [24] is proposed in the form of the BISDdel algorithm. BISDdel facilitates deletion of points in batch mode. It is a combination of two sub-variant algorithms, Batch-Dec1 and Batch-Dec2, and is comparatively the most efficient.

6. Chapter 6: Presents the KAGO algorithm, which leverages kernel density estimation (KDE) to find local outliers. These outliers are aggregated to report at most the top-N global outliers upon every point insertion.

7. Chapter 7: Provides concluding remarks along with the future scope of the contributions made in this thesis.

Chapter 2

Literature Survey

In this chapter, we first present our study of previous work on incremental density-based clustering and outlier detection. In addition, we provide a background study of some naive algorithms that are related to the contributions made in this thesis.

2.1 Related density-based incremental clustering algorithms

1. Incremental DBSCAN: Inc-DBSCAN [43] is the incremental version of the DBSCAN [23] clustering algorithm. Patterns in a database, e.g., a log database, change over time as new logs are added to and old records are deleted from the database. The algorithm identifies the parts of existing clusters affected by an update to the database. Based on this underlying idea of selectively handling the updated portion of the dataset, Inc-DBSCAN proves to be more efficient than DBSCAN. After the insertion of new points, some non-core (non-dense) objects may turn into core (dense) objects, forming novel density connections. Points that were not density reachable [23] earlier might become density reachable. Similarly, upon deletion, some core objects may turn into non-core objects, resulting in the removal of existing connections.

If an object p is inserted or deleted, then N_Eps(p) [23] (the Eps-neighborhood of p) becomes the affected region. The unaffected points retain their old cluster membership. The number of region queries performed by Inc-DBSCAN is determined experimentally. Let r_i and r_d denote the mean number of region queries per insertion and per deletion respectively, and let f_i and f_d be the percentages of insertions and deletions. Then the cost incurred by Inc-DBSCAN for making r updates to the dataset incrementally is r × (f_i × r_i + f_d × r_d).

Inc-DBSCAN is unable to handle bulk insertion or deletion of data objects. The algorithm is also sensitive to changes in parameter values. A minimal sketch of the affected-region idea is given below.
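To make the affected-region idea concrete, here is a minimal Python sketch of the insertion case. It assumes a brute-force region query; the names `region_query`, `newly_core_after_insert`, `eps` and `min_pts` are illustrative choices for this example, not identifiers from [43].

```python
import numpy as np

def region_query(data, p, eps):
    """Brute-force N_Eps(p): indices of all points within eps of p."""
    return np.flatnonzero(np.linalg.norm(data - p, axis=1) <= eps)

def newly_core_after_insert(data, p, eps, min_pts):
    """Indices of existing points that turn from non-core to core when p
    is inserted.  Only N_Eps(p) needs to be checked: the neighborhood of
    every other point is unchanged, so its core status cannot change."""
    candidates = region_query(data, p, eps)
    new_data = np.vstack([data, p])
    return [int(q) for q in candidates
            if len(region_query(data, data[q], eps)) < min_pts
            <= len(region_query(new_data, new_data[q], eps))]

# Toy usage: insert one point and list the objects whose status flips.
rng = np.random.default_rng(0)
data = rng.random((200, 2))
print(newly_core_after_insert(data, np.array([0.5, 0.5]), eps=0.1, min_pts=8))
```

Under the cost model above, each such insertion is charged r_i region queries on average, which gives the total cost r × (f_i × r_i + f_d × r_d) over r updates.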

2. IncSNN-DBSCAN: IncSNN-DBSCAN [1] (InSDB) is an extension of the SNN-DBSCAN [24] clustering algorithm. InSDB detects clusters dynamically while points are added to the base dataset D one at a time. InSDB associates each data point p ∈ D with the following properties: its KNN (K-Nearest Neighbors) list, the strengths of its shared links [1], its number of adjacent links, and its dense or non-dense status. When a new data point arrives, InSDB identifies only those points whose property values change. The algorithm targets only the affected points, while the rest of the points are allowed to remain in their previous state. This selective handling of data points ensures that the reconstruction time of the updated KNN lists and the shared nearest neighbor (SNN) graph [1] is significantly reduced. InSDB shows that a very small percentage of points is ultimately affected, which makes it more efficient than SNN-DBSCAN [24].

Since InSDB is a point-based insertion technique, it might slow down as the size of the base dataset increases. It is also sensitive to changes in parameter values. The selective-update idea is sketched below.
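The following is a simplified Python illustration of this selective update, not InSDB itself; the names `kth_nn_dist`, `affected_by_insertion` and the parameter `k` are assumed for the sketch.

```python
import numpy as np

def kth_nn_dist(data, i, k):
    """Distance from point i to its current k-th nearest neighbor."""
    d = np.linalg.norm(data - data[i], axis=1)
    d[i] = np.inf                       # a point is not its own neighbor
    return np.sort(d)[k - 1]

def affected_by_insertion(data, p, k):
    """Existing points whose KNN list changes when p arrives: exactly
    those for which p is closer than their current k-th neighbor.  Only
    these points need their shared links, link counts and dense status
    recomputed; the rest of the SNN graph keeps its previous state."""
    dist_to_p = np.linalg.norm(data - p, axis=1)
    return [i for i in range(len(data))
            if dist_to_p[i] < kth_nn_dist(data, i, k)]

# Toy usage: which of 100 random points must be updated when p arrives?
rng = np.random.default_rng(1)
data = rng.random((100, 2))
print(affected_by_insertion(data, np.array([0.5, 0.5]), k=5))
```

In practice InSDB would maintain the KNN lists incrementally rather than recomputing distances as done here, but the affected set is the same: only points that admit p into their KNN list can see their properties change.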

3. IGDCA: The incremental grid density-based clustering algorithm (IGDCA) [44] enables the discovery of clusters of arbitrary shape. IGDCA is an incremental extension of the GDCA [44] algorithm. The clusters obtained through GDCA are modified after a sequence of insertions δadd and deletions δdel of data points. Let D be the base dataset and D′ the updated dataset, where D′ = D + δadd − δdel. Since the data space is partitioned into grid cells, a cell is updated only when a data point is added to or removed from it. Once the affected grid cells are identified, the clusters are subjected to modification: new points obtain cluster membership, followed by the modification of existing clusters.

The algorithm is unable to determine its threshold parameters automatically. Moreover, the deletion task also involves efficiency issues. The single-cell bookkeeping behind the grid update is sketched below.
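Below is a hypothetical minimal grid index in Python showing why an update is cheap: each point maps to exactly one cell, so an insertion or deletion changes a single cell count and only that cell (and, in a fuller implementation, its neighboring cells) must be re-examined. The class name and the `cell_width` and `density_threshold` parameters are assumptions of this sketch, not the structures defined in [44].

```python
from collections import defaultdict

class Grid:
    """Minimal grid index.  Each point belongs to exactly one cell, so an
    insertion or deletion updates a single cell count and flags only the
    affected cell for re-clustering."""

    def __init__(self, cell_width, density_threshold):
        self.w = cell_width
        self.tau = density_threshold
        self.cells = defaultdict(list)          # cell key -> its points

    def key(self, p):
        """Cell coordinates of point p (a tuple of floats)."""
        return tuple(int(x // self.w) for x in p)

    def insert(self, p):
        k = self.key(p)
        self.cells[k].append(p)
        return k                                # the only affected cell

    def delete(self, p):
        k = self.key(p)
        self.cells[k].remove(p)
        return k

    def is_dense(self, k):
        """A cell is dense when it holds at least tau points."""
        return len(self.cells[k]) >= self.tau

# Toy usage: one insertion affects exactly one cell.
g = Grid(cell_width=0.25, density_threshold=3)
affected = g.insert((0.1, 0.6))
print(affected, g.is_dense(affected))
```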

4. Dynamic density-based clustering: This work [45] investigates the principles of dynamic clustering by DBSCAN [23] and the ρ-approximate version of DBSCAN. The work proves that the ρ-approximate version suffers from the same computational hardness as exact DBSCAN when the dataset is fully dynamic in nature. However, it also shows that this hardness disappears when a tiny further relaxation is made. The quality of the clusters obtained is the same as that achieved on static data; this phenomenon is known as the "sandwich guarantee" of ρ-approximate DBSCAN. The proposed algorithms guarantee near-constant update processing: the approximate version takes O(N) time (N being the data size), whereas the unit spherical emptiness checking (USEC) approach consumes O(N^{4/3}) time in the worst case. A factor that may go against this approach is the number of theoretical concepts involved in it.

5. DBCLASD: DBCLASD (Distribution-Based Clustering of LArge Spatial Databases) [46] assumes that the objects within a cluster are uniformly distributed. The algorithm dynamically determines the quantity and conformation of clusters in a database without requiring any input parameter.

DBCLASD incrementally augments an initial cluster with the points in its neighborhood. This procedure continues as long as the set of nearest-neighbor distances of the resulting cluster still fits the estimated distance distribution. A point that is not yet part of the current cluster but needs to be examined for possible cluster membership is called a candidate point. Candidates failing the cluster membership test on their first attempt are called unsuccessful candidates; they are not discarded but are reconsidered at a later time. Objects belonging to one cluster might also shift to another cluster later. The running time of DBCLASD is approximately twice that of DBSCAN. A sketch of this candidate-expansion loop follows.
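The loop below sketches the expansion procedure in Python. It is only an illustration: DBCLASD's actual membership test is a chi-square test of the observed nearest-neighbor distances against the distribution expected for uniformly distributed points, for which the simple `fits_distribution` check here is merely a stand-in.

```python
import numpy as np

def nn_dists(cluster):
    """Nearest-neighbor distance of every point inside the cluster."""
    pts = np.asarray(cluster, dtype=float)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1)

def fits_distribution(dists, factor=3.0):
    """Stand-in for DBCLASD's chi-square test: accept the cluster while
    no nearest-neighbor distance is far larger than the typical one,
    i.e. while the distances still look like a uniform cluster's."""
    return dists.max() <= factor * np.median(dists)

def expand_cluster(cluster, candidates):
    """Grow the cluster point by point.  Candidates failing the test are
    kept as 'unsuccessful candidates' and retried later instead of being
    discarded, mirroring DBCLASD's treatment of them."""
    unsuccessful = []
    for p in candidates:
        trial = cluster + [p]
        if fits_distribution(nn_dists(trial)):
            cluster = trial
        else:
            unsuccessful.append(p)
    for p in unsuccessful:              # second chance for late fits
        trial = cluster + [p]
        if fits_distribution(nn_dists(trial)):
            cluster = trial
    return cluster
```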

2.2 Related density-based incremental outlier detection algorithms