Jitendriya Swain for their timely and constructive suggestions. I also thank the other faculty members of the CSE department, Bhriguraj Borah, and all other staff of the CSE department who have helped me at various times.
Incremental Algorithms
Some examples of incremental algorithms in data mining
Incremental K-means clustering algorithm [6]: This incremental extension of the K-means clustering algorithm adds cluster centers one at a time during clustering. The scheme adopted in this work is designed to reduce cluster distortion by moving the cluster centers.
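As an illustration only (a hedged sketch, not the exact scheme of [6]), the following Python snippet grows the set of cluster centers one at a time and, after each addition, runs a few Lloyd-style iterations that move all centers to reduce cluster distortion; the function name incremental_kmeans and the farthest-point placement of new centers are assumptions made for this sketch.

```python
import numpy as np

def incremental_kmeans(X, k_max, n_iters=10):
    """Hedged sketch: grow the set of cluster centers one at a time.

    After each new center is added, a few Lloyd-style iterations move all
    centers to reduce cluster distortion (sum of squared errors). This
    illustrates the general idea only, not the exact scheme of [6]."""
    centers = [X[np.random.randint(len(X))]]                     # start with one center
    for _ in range(1, k_max):
        dists = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1)
        centers.append(X[np.argmax(dists.min(axis=1))])          # farthest point becomes a new center
        centers = np.array(centers)
        for _ in range(n_iters):                                 # move centers to reduce distortion
            labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(axis=1)
            centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(len(centers))])
        centers = list(centers)
    return np.array(centers)

X = np.random.rand(200, 2)
print(incremental_kmeans(X, k_max=4).shape)                      # (4, 2)
```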
Dynamic data and its applications
Any choice made by a user can be treated as a change in the data, and predictions can be updated accordingly. Dynamic data, such as the number of users in different regions in each time period, can be mined for commercial interest, for example to determine congestion patterns on roads.
Data mining tasks of choice
- Density-based clustering
- Applications of density-based clustering algorithms
- Some applications of outlier detection algorithms
- Improve efficiency
- Reduce latency
- Limit resource usage
Using DBCLAs, clusters are identified as regions of higher density than the rest of the data space [23]. However, as shown in Figure 1.7(b) (bottom figure), the incremental algorithm A computes the output dynamically in response to changes made to the input Ik.
Motivation
Reasons for the inefficiency of naive algorithms
This establishes that no additional waiting time is involved before the next update is processed. Similarly, the processing of the (k+2)-th change finishes before the (k+3)-th change arrives, i.e. tf(k+2) < ti(k+3), resulting in no waiting time before handling the third element of change.
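The no-wait condition can be stated compactly; the arrival/finish notation below is a reconstruction from context and may differ from the thesis's symbols:

```latex
% Reconstructed no-wait condition: if processing of the j-th change finishes
% before the (j+1)-th change arrives, no waiting time is incurred.
t_{f_{j}} < t_{i_{j+1}} \;\;\Longrightarrow\;\; \text{no waiting before processing change } j+1
```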
Reasons for requiring robustness in data mining tasks
Excessive consumption of computing resources, e.g., CPU usage and additional buffering requirements arising from unintelligent handling of continuous updates, can prove detrimental.
Objectives
Primary Contributions
- The iMass clustering algorithm, providing an approximate incremental extension of MBSCAN
- The BISDBadd clustering algorithm, providing an exact incremental extension of SNN-DBSCAN for batchwise addition
- The BISDBdel clustering algorithm, providing an exact incremental extension of SNN-DBSCAN for batchwise deletion
- The KAGO outlier detection algorithm, providing an approximate incremental solution to KNNOD
BISDBadd outperformed SNN-DBSCAN by more than three orders of magnitude (≈ 1000 times) on three real and two synthetic datasets. BISDBdel (Batchwise Deletion Nearest Neighbor Density Clustering Algorithm) incrementally extends SNN-DBSCAN with support for batch-mode deletion of data points.
Summary
Organization of the Thesis
The algorithm targets only the affected points, while the rest of the points can remain in their previous state. This selective handling of data points ensures that the time to reconstruct the updated KNN lists and the shared nearest neighbor (SNN) graph [1] is significantly reduced.
Related density-based incremental outlier detection algorithms
Within the constraints of limited memory, the algorithm is able to detect outliers from high-volume data streams. The algorithm introduced the concept of abstract kernel centers (aKDE) [51] to accurately estimate the local data density.
Other related naive algorithms
However, the preset value of the MinPts parameter prevents the algorithm from detecting clusters with variable densities. If either of the points p or q is not present in the other's KNN list, no link is formed between them.
Mixture of other related naive and incremental clustering algorithms
- MBSCAN
- SNN-DBSCAN
- KNNOD
- Chapter contributions
However, the scheme adopted by SNN-DBSCAN is slightly different from that of the SNN clustering algorithm. Therefore, to efficiently extract clusters after new updates, we provide an approximate incremental extension of the MBSCAN clustering algorithm, known as iMass.
Related work and background
The next set of related works presents concepts that lead to the use of data-dependent dissimilarity measures in MBSCAN as well as in the iMass algorithm. Notation: Core(·) denotes the set of core points in the dataset; Non-core(·) denotes the set of non-core points in the dataset.
Preliminaries and Definitions
- Modeling a region
- Mass of a region
- Mass of smallest local region
- Mass-based dissimilarity
- Mass-based neighborhood
- Clustering
- Approximate Incremental Clustering
- Core and Non-core points
- Noise points
The mass of a region is defined as the number of data points within that region. The mass of the smallest local region is the number of data points in the lowest-level node containing the pair of points a and b.
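To make these definitions concrete, here is a minimal Python sketch, assuming axis-parallel random splits (Isolation-Forest-style trees standing in for the iTrees); the names Node, lowest_node_mass, and mass_dissimilarity are illustrative and not the thesis implementation.

```python
import random

class Node:
    """One node of an illustrative iTree built by axis-parallel random splits."""
    def __init__(self, points, depth, max_depth):
        self.mass = len(points)                 # mass of the region = number of points in it
        self.left = self.right = None
        self.split_dim = self.split_val = None
        if depth < max_depth and len(points) > 1:
            self.split_dim = random.randrange(len(points[0]))
            lo = min(p[self.split_dim] for p in points)
            hi = max(p[self.split_dim] for p in points)
            if lo < hi:
                self.split_val = random.uniform(lo, hi)
                self.left = Node([p for p in points if p[self.split_dim] < self.split_val],
                                 depth + 1, max_depth)
                self.right = Node([p for p in points if p[self.split_dim] >= self.split_val],
                                  depth + 1, max_depth)

def lowest_node_mass(node, a, b):
    """Mass of the smallest local region (lowest node) containing both a and b."""
    while node.left is not None:
        go_a = a[node.split_dim] < node.split_val
        go_b = b[node.split_dim] < node.split_val
        if go_a != go_b:                        # a and b separate at this node
            break
        node = node.left if go_a else node.right
    return node.mass

def mass_dissimilarity(trees, a, b, n):
    """Average relative mass of the lowest node containing (a, b) over all iTrees."""
    return sum(lowest_node_mass(t, a, b) for t in trees) / (len(trees) * n)

data = [(random.random(), random.random()) for _ in range(256)]
trees = [Node(data, 0, max_depth=8) for _ in range(20)]     # a small illustrative iForest
print(mass_dissimilarity(trees, data[0], data[1], len(data)))
```

The dissimilarity here is simply the mass of the smallest region containing both points, averaged over the forest and normalized by the dataset size.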
Problem formulation
For any point a ∈ D, if the mass of its µ-neighborhood Mµ(a) exceeds a core threshold δcore, then a is designated as a core point; otherwise it is a non-core point. For a non-core point a ∈ D, if it obtains no cluster membership, then that point qualifies as a noise point.
The MBSCAN Clustering Algorithm
Consequently, the mass of the smallest local region containing the pair of points (c, d) is 2. For the pairs of points (b, g) and (a, c), the mass of the lowest node is 3, since it contains three data points.
Experimental evaluation of MBSCAN in brief
Inferences drawn from experiments on MBSCAN
The mass matrix turned out to be the most expensive component of the MBSCAN algorithm. In designing the iMass algorithm, we therefore aim to build this matrix incrementally.
The iMass Clustering Algorithm
- Theoretical Model
- Assumptions made for the iMass clustering algorithm
- Retain the components of the MBSCAN algorithm
- Steps of the iMass clustering algorithm
Based on this comparison, the new point On+1 traces itself to the right child of the root node (since 2.7 ≥ 2.5). The mass value for a new point On+i ∈ D′ (the updated dataset) with each of the old points x ∈ D is calculated in the same way as in MBSCAN.
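A hedged sketch of the incremental step just described: the new point descends an iTree along the existing split conditions and increments the mass of every node on its path, so later lowest-node-mass lookups reflect the insertion without rebuilding the tree. TNode and insert_point are illustrative names, and leaving the split boundaries untouched is an assumption consistent with iMass being an approximate extension.

```python
class TNode:
    """Minimal iTree node for illustration: a mass count plus an axis-parallel split."""
    def __init__(self, mass, split_dim=None, split_val=None, left=None, right=None):
        self.mass, self.split_dim, self.split_val = mass, split_dim, split_val
        self.left, self.right = left, right

def insert_point(root, p):
    """Route the new point down one iTree and increment the mass of every node on
    its path; existing split boundaries are left untouched (approximate update)."""
    node = root
    while node is not None:
        node.mass += 1
        if node.left is None:                       # reached a leaf
            break
        node = node.left if p[node.split_dim] < node.split_val else node.right
    return root

# Toy tree: the root splits dimension 0 at 2.5, as in the example above (2.7 >= 2.5 -> right child).
tree = TNode(4, split_dim=0, split_val=2.5, left=TNode(2), right=TNode(2))
insert_point(tree, (2.7, 1.0))
print(tree.mass, tree.right.mass)                   # 5 3
```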
Time complexity comparison between MBSCAN and iMass
However, to calculate the average probability mass between a newly inserted point On+i and each of the older points, we use a methodology identical to that of the MBSCAN algorithm. Calculating the lowest node masses between the new point and each of the old points takes O(t(n + 1) log₂ Ψ) time.
Theoretical analysis of the iMass clustering algorithm
Cases related to updated mass-matrix
The updated mass becomes independent of the number of iTrees and reduces to a function of D and the old mass value me(x, y). Since the value of the term me(x, y) is already known from the execution of MBSCAN, corresponding to case 1, the updated mass matrix becomes independent of the number of iTrees involved.
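A plausible reconstruction of the case-1 relation, assuming the pairwise mass me(x, y) is the lowest-node mass averaged over the t iTrees and normalized by the dataset size, and that the new point falls outside the lowest node of (x, y) in every iTree:

```latex
% Case 1 (reconstruction): the inserted point does not enter the lowest node of (x, y)
% in any iTree, so only the normalizing dataset size changes from |D| to |D| + 1.
\tilde{m}_e(x, y)
  \;=\; \frac{1}{t}\sum_{i=1}^{t}\frac{\lvert R_i(x, y)\rvert}{\lvert D\rvert + 1}
  \;=\; \frac{\lvert D\rvert}{\lvert D\rvert + 1}\, m_e(x, y)
```

which depends only on |D| and the old value me(x, y), not on the number of iTrees t.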
Lemmas related to the iMass clustering algorithm
Proof: The threshold µ (refer to Equation 3.5) can change after inserting a point, but despite the reduction of the average values of the pairwise probability measure (Lemma 3.2), it cannot be guaranteed that the size of the µ-neighborhood of each point will consistently increase or decrease.
Experimental evaluation
Experimental procedure
Along with the above-mentioned metrics used to evaluate iMass, i.e., CPU execution times and the percentage of affected nodes relative to the total number of nodes in the iForest (Equation 3.9, Equation 3.10), we also determined the degree of reduction achieved by iMass in the time required to construct the mass matrix and to calculate the nodal masses via iForest.
Experimental results
[Figures: percentage of nodes affected in the iForest after each insertion vs. total number of points, for the Libras data (no. of iTrees: 20, basis point threshold: 5) and Dataset S1 (no. of iTrees: 20, basis point threshold: 9).]
Cluster analysis
Whenever a new point places itself in the correct nodes of an iTree, the iMass algorithm approximates the lowest node ID for (x, y) within that iTree. For most datasets with class labels in Table 3.11, we see that the iMass clustering algorithm maintains or improves the NMI value relative to MBSCAN.
Conclusion
Chapter contributions
Computes the KNN lists and the similarity matrix incrementally, detects the same clusters as SNNDB, and performs batchwise insertion. Reduces the time needed to compute the KNN lists and construct the similarity matrix after new insertions.
Related work and background
Preliminaries and Definitions
- K-nearest neighbor (KNN) list
- Shared nearest neighbors (SNN)
- Similarity matrix or SNN graph
- K-SNN graph
- Core and non-core points
- Noise points
- Clustering
- Exact Batch Incremental Clustering (Addition)
The remaining edges present in the SNN graph are identified as strong links. This method of obtaining the residual graph from the original SNN graph is known as K-nearest neighbor sparsification of the SNN graph [24].
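A minimal Python sketch of the KNN lists, SNN similarity, and the sparsification step (illustrative only; knn_lists, ksnn_graph, and the tie-breaking used here are assumptions, while K and δsim follow the thesis notation): an edge survives only if the two points appear in each other's KNN lists, its weight is the number of shared nearest neighbors, and edges with weight at least δsim are the strong links.

```python
import numpy as np

def knn_lists(X, K):
    """K-nearest neighbor list for every point (excluding the point itself)."""
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)
    return [set(np.argsort(row)[:K]) for row in d]

def ksnn_graph(X, K, delta_sim):
    """Sparsified SNN graph: keep edge (p, q) only if p and q are in each other's
    KNN lists; its weight is |KNN(p) ∩ KNN(q)|; weights >= delta_sim are strong links."""
    knn = knn_lists(X, K)
    edges, strong = {}, []
    for p in range(len(X)):
        for q in knn[p]:
            if p < q and p in knn[q]:               # mutual KNN membership
                w = len(knn[p] & knn[q])            # shared nearest neighbors
                edges[(p, q)] = w
                if w >= delta_sim:
                    strong.append((p, q))
    return edges, strong

X = np.random.rand(50, 2)
edges, strong = ksnn_graph(X, K=5, delta_sim=2)
print(len(edges), "edges,", len(strong), "strong links")
```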
Problem formulation
Then an incremental clustering is given by a mapping h : D′ → C′, with C′ ⊆ P(D′), isomorphic to the one-time clustering f(D′) produced by the non-incremental algorithm.
The SNNDB and InSDB clustering algorithms
If the value of δsim is set to 2, the edge between points P and P3 is considered a strong link because |{P2, P4}| ≥ δsim. Since the density of points P and P3 exceeds the threshold value δcore, P and P3 are marked as core points.
Structure of the proposed batch incremental SNNDB clustering algorithm
InSDB facilitates the detection of clusters dynamically while adding points to the base dataset D one at a time. Only the affected points are targeted by the algorithm, while the unaffected points are allowed to remain in their previous state.
Batch-Incremental SNNDB Clustering Algorithms for Addition
- The Batch-Inc1 clustering algorithm
- The Batch-Inc2 clustering algorithm
- The BISDBadd clustering algorithm
- Shared link properties between affected points post insertion
However, they can be part of the updated KNN list of any KN-Sadd type point. Sadd type points can be determined from the updated KNN list of any KN-Sadd type point.
Time complexity analysis of the BISDBadd clustering algorithm
Since Sadd type points are determined from the updated KNN lists of the KN-Sadd type points, we have the following equation. A maximum of K points can be displaced from the updated KNN list by a KN-Sadd type point.
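A plausible reading of the relation referred to above (the original equation is not reproduced in this excerpt, so this reconstruction is hedged): since every Sadd type point must appear in the updated KNN list of some KN-Sadd type point, and each such list contributes at most K points,

```latex
% Reconstructed bound: S_add points are drawn from the updated KNN lists
% of the KN-S_add type points, each of which holds at most K entries.
\lvert S_{add} \rvert \;\le\; K \cdot \lvert KN\text{-}S_{add} \rvert
```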
Experimental evaluation
- Phase-1: Finding the most effective batch incremental variant
- Phase-2: Prove the efficiency of the most effective batch incremental variant
- Phase-3: BISDBadd and SNNDB are more effective than InSDB
- Phase-4: Prove the efficiency of BISDBadd over SNNDB
Phase-3: Show that InSDB [1] becomes inefficient when major updates (addition) are made to the base dataset. [Figure: CPU time vs. variable updates (addition), as a percentage of points added to the base dataset (Mopsi12 dataset).]
Cluster analysis
Clustering results in brief
About 81% of the data points were core points, resulting in 91.4% of the data points being assigned cluster membership, while the rest were treated as noise points. Any change in these values can change the cluster output, along with the set of core, non-core, and noise points.
Conclusion
Chapter contributions
Computes the KNN lists and the similarity matrix incrementally, discovers the same clusters as SNNDB, and supports batchwise deletion. Reduces the time needed to calculate the KNN lists and build the K-SNN graph after new removals.
Related work and background
Computes the KNN lists, the similarity matrix, and the set of core and non-core points incrementally; detects the same clusters as SNNDB; supports batchwise deletion. Reduces the time needed to calculate the KNN lists, build the K-SNN graph, and identify core and non-core points after new deletions.
Preliminaries and Definitions
- K-nearest neighbor (KNN) list
- Shared nearest neighbors (SNN)
- Similarity matrix or SNN graph
- K-SNN graph
- Core, non-core and noise points
- Clustering
- Batch Incremental Clustering (Deletion)
According to the second condition, the similarity between the points x and y is greater than or equal to a threshold value δsim, and x is a core point while y is non-core. The third condition states that if x is a non-core point and no core point y exists with which x has a similarity value greater than or equal to δsim, then x is categorized as a noise point.
Problem formulation
As per the first condition, if the degree of closeness or similarity between the points x and y is greater than or equal to a threshold value δsim and x and y are both core points, then x and y are part of the same cluster. If another core point z exists and the similarity between y and z is greater than or equal to δsim, then z also belongs to the same cluster.
Structure of the proposed batch incremental SNNDB clustering algorithm
The nearest core point is the one that has a strong link to the non-core point in question and has a higher edge weight compared to other neighboring core points (see Chapter 4).
Batch-Incremental SNNDB Clustering Algorithms for Deletion
- The Batch-Dec1 clustering algorithm
- The Batch-Dec2 clustering algorithm
- The BISDBdel clustering algorithm
- Shared link properties between affected points post deletion
The Batch-Dec1 algorithm builds the updated e-KNN list incrementally for each of the existing data points. Any surviving non-KN-Sdel point in the updated KNN list of a KN-Sdel type point is designated as Sdel type.
Time complexity analysis of the BISDBdel clustering algorithm
Time complexity proof of BISDBdel
Let Te-KNN be the time it takes to compute the updated e-KNN lists of the existing data points in D′. The vacant position(s) are filled with the points found in the additional (w−1)·K positions of the e-KNN list.
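The refill step can be illustrated with a short Python sketch, assuming the extended e-KNN list simply stores the w·K nearest neighbors in order (repair_knn and the exact bookkeeping are illustrative, not the BISDBdel implementation): positions vacated by deleted neighbors among the first K entries are filled by the nearest surviving points from the remaining (w−1)·K entries, avoiding a full KNN recomputation.

```python
def repair_knn(e_knn, deleted, K):
    """Hedged sketch: rebuild a K-length KNN list from an extended e-KNN list
    (length w*K, nearest first) after a batch of points is deleted.

    Vacated slots among the first K entries are filled by the nearest
    surviving neighbors drawn from the remaining (w-1)*K entries."""
    survivors = [p for p in e_knn if p not in deleted]   # keep original nearness order
    return survivors[:K]                                 # may be shorter if too many neighbors were deleted

# Toy example: K = 3, w = 2, so the e-KNN list stores 6 neighbors of some point.
e_knn = [7, 2, 9, 4, 1, 8]                               # nearest first
print(repair_knn(e_knn, deleted={2, 9}, K=3))            # [7, 4, 1]
```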
Experimental evaluation
- Phase-1: Finding the most effective batch incremental variant
- Phase-2: Prove the efficiency of the most effective batch incremental variant
- Phase-3: BISDBdel and SNNDB are more effective than InSDB
- Phase-4: Prove the efficiency of BISDBdel over SNNDB
[Figure: CPU time vs. variable updates (deletion), as a percentage of points deleted from the base dataset (Mopsi12 dataset).] For constant updates, a fixed number of points was deleted from the base dataset in several batches.
Cluster analysis
Clustering results in brief
This is because the removal of data points leads to the splitting of clusters and the simultaneous breaking of strong shared links. In the case of the Birch3 dataset, about 87.9% of the data points belong to a cluster, with an average cluster size of 7.68.
Conclusion
Chapter contributions
We propose an approximate incremental solution to KNNOD [5] in the form of the KAGO algorithm (Table 6.2).
Related work and background
O: set of outliers before any changes to the dataset; O′: set of outliers after the dataset has been changed.
Preliminaries and Definitions
- Local outlier
- Neighborhood
- Kernel centers
- Kernel Density Estimate (KDE)
- Kernel functions
- Grid local outlier score (glos)
- Mean grid local outlier score (mglos)
- Incremental anomaly detection
Calculate the density (local density) at a point xi together with the densities at the neighboring points of xi. Typically, S represents the set of points within a grid cell gc, and Lcj is a data point sampled from S.
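A hedged Python sketch of the grid-local density idea (the kernel bandwidth, the sampling of kernel centers, and the ratio-style score named grid_local_outlier_score below are assumptions for illustration, not the exact glos definition): a Gaussian KDE at xi is computed from kernel centers sampled within its grid cell, and the point scores high when it is much less dense than its neighbors.

```python
import numpy as np

def kde(x, centers, h):
    """Gaussian kernel density estimate at x from sampled kernel centers."""
    x, centers = np.asarray(x, float), np.asarray(centers, float)
    d2 = ((centers - x) ** 2).sum(axis=1)
    norm = len(centers) * (h * np.sqrt(2 * np.pi)) ** x.size
    return np.exp(-d2 / (2 * h * h)).sum() / norm

def grid_local_outlier_score(xi, neighbors, centers, h=0.2):
    """Illustrative glos-style score: average neighbor density divided by the
    density at xi (higher means xi is locally sparser than its neighbors)."""
    dens_xi = kde(xi, centers, h)
    dens_nb = np.mean([kde(p, centers, h) for p in neighbors])
    return dens_nb / max(dens_xi, 1e-12)

# Toy grid cell: S is the set of points inside the cell, centers are sampled from S.
rng = np.random.default_rng(0)
S = rng.normal(0.5, 0.1, size=(100, 2))                      # points inside one grid cell gc
centers = S[rng.choice(len(S), size=10, replace=False)]      # sampled kernel centers Lcj
print(grid_local_outlier_score([0.95, 0.95], S[:5], centers))  # sparse point -> large score
```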
Problem formulation
The KNNOD algorithm in brief
Framework of the KAGO algorithm
The KAGO algorithm
Theoretical model
Steps of the KAGO outlier detection algorithm
Time Complexity of the KAGO algorithm
Experimental evaluation
Experimental setup and datasets used
Experimental Results and Analysis
Memory usage
Brief outlier analysis
Key properties of the KAGO algorithm
Lemmas related to the KAGO algorithm
Conclusion
Future scopes
Representation of various clustering domains
Dense areas represent the clusters and subsequently the noises are
Detecting clusters of arbitrary shapes and densities
Dense grid cells accumulate to form clusters
Illustration of outliers in a 2-D data
Representation of various outlier detection paradigms
Latency scenarios due to changes in the input data
Cluster formation scheme in DBSCAN
Mass-based dissimilarity matrix
Sequence of execution for the iMass clustering algorithm
Compute the node mass incrementally upon new point insertion
Updated mass-matrix post insertion of a new point
Efficiency comparison between iMass and MBSCAN along with
Efficiency comparison between iMass and MBSCAN along with
Efficiency comparison between iMass and MBSCAN along with
KNN list for point P where K = 5
Similarity value between points P and P3 in the K-SNN graph given
Cluster containing core points P and P3 in the K-SNN graph. If