Jitendriya Swain for their timely and constructive suggestions. I also thank the other faculty members of the CSE department, Bhriguraj Borah, and all other staff of the CSE department who have helped me at various times.
Incremental Algorithms
Some examples of incremental algorithms in data mining
Incremental K-means clustering algorithm [6]: This incremental extension of the K-means clustering algorithm adds cluster centers one at a time during clustering. The scheme adopted in this work is designed to reduce cluster distortion by moving the cluster centers.
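As an illustration only (a hedged sketch, not the exact scheme of [6]), the following Python snippet grows the set of cluster centers one at a time and, after each addition, runs a few Lloyd-style iterations that move all centers to reduce cluster distortion; the function name incremental_kmeans and the farthest-point placement of new centers are assumptions made for this sketch.

```python
import numpy as np

def incremental_kmeans(X, k_max, n_iters=10):
    """Hedged sketch: grow the set of cluster centers one at a time.

    After each new center is added, a few Lloyd-style iterations move all
    centers to reduce cluster distortion (sum of squared errors). This
    illustrates the general idea only, not the exact scheme of [6]."""
    centers = [X[np.random.randint(len(X))]]                     # start with one center
    for _ in range(1, k_max):
        dists = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1)
        centers.append(X[np.argmax(dists.min(axis=1))])          # farthest point becomes a new center
        centers = np.array(centers)
        for _ in range(n_iters):                                 # move centers to reduce distortion
            labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(axis=1)
            centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(len(centers))])
        centers = list(centers)
    return np.array(centers)

X = np.random.rand(200, 2)
print(incremental_kmeans(X, k_max=4).shape)                      # (4, 2)
```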
Dynamic data and its applications
Any choice made by a user can be treated as a change in the data, and predictions can be updated accordingly. Dynamic data, such as the number of users in different regions in each time period, can be mined for commercial interest, for example to determine congestion patterns on roads.
Data mining tasks of choice
- Density-based clustering
- Applications of density-based clustering algorithms
- Some applications of outlier detection algorithms
- Improve efficiency
- Reduce latency
- Limit resource usage
Using DBCLAs, clusters are identified as regions of higher density than the rest of the data space [23]. However, as shown in Figure 1.7(b) (bottom figure), the incremental algorithm A computes the output dynamically in response to changes made to the input Ik.
Motivation
Reasons for the inefficiency of naive algorithms
This establishes that no additional waiting time is involved before the next update is processed. Similarly, the processing of the (k+2)-th change finishes before the (k+3)-th change arrives, i.e. tf(k+2) < ti(k+3), resulting in no waiting time before handling the third element of change.
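The no-wait condition can be stated compactly; the arrival/finish notation below is a reconstruction from context and may differ from the thesis's symbols:

```latex
% Reconstructed no-wait condition: if processing of the j-th change finishes
% before the (j+1)-th change arrives, no waiting time is incurred.
t_{f_{j}} < t_{i_{j+1}} \;\;\Longrightarrow\;\; \text{no waiting before processing change } j+1
```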
Reasons for requiring robustness in data mining tasks
Excessive consumption of computing resources, e.g., CPU usage and additional buffering requirements arising from unintelligent handling of continuous updates, can prove detrimental.
Objectives
Primary Contributions
- The iMass clustering algorithm, providing an approximate incremental extension of MBSCAN
- The BISDBadd clustering algorithm, providing an exact incremental extension of SNN-DBSCAN for batchwise addition
- The BISDBdel clustering algorithm, providing an exact incremental extension of SNN-DBSCAN for batchwise deletion
- The KAGO outlier detection algorithm, providing an approximate incremental solution to KNNOD
BISDBadd outperformed SNN-DBSCAN by more than three orders of magnitude (≈ 1000 times) on three real and two synthetic datasets. BISDBdel (Batchwise Deletion Nearest Neighbor Density Clustering Algorithm) incrementally extends SNN-DBSCAN with support for batch-mode deletion of data points.
Summary
Organization of the Thesis
The algorithm targets only the affected points, while the rest of the points can remain in their previous state. This selective handling of data points ensures that the time to reconstruct the updated KNN lists and the shared nearest neighbor (SNN) graph [1] is significantly reduced.
Related density-based incremental outlier detection algorithms
Within the constraints of limited memory, the algorithm is able to detect outliers from high-volume data streams. The algorithm introduced the concept of abstract kernel centers (aKDE) [51] to accurately estimate the local data density.
Other related naive algorithms
However, the preset value of the MinPts parameter prevents the algorithm from detecting clusters with variable densities. If either of the points p or q is not present in the other's KNN list, no link is formed between them.
Mixture of other related naive and incremental clustering algorithms
- MBSCAN
- SNN-DBSCAN
- KNNOD
- Chapter contributions
However, the scheme adopted by SNN-DBSCAN is slightly different from that of the SNN clustering algorithm. Therefore, to efficiently extract clusters after new updates, we provide an approximate incremental extension of the MBSCAN clustering algorithm, known as iMass.
Related work and background
The next set of related works presents concepts that lead to the use of data-dependent dissimilarity measures in MBSCAN as well as in the iMass algorithm. Notation: Core(·) denotes the set of core points in the dataset; Non-core(·) denotes the set of non-core points in the dataset.
Preliminaries and Definitions
- Modeling a region
- Mass of a region
- Mass of smallest local region
- Mass-based dissimilarity
- Mass-based neighborhood
- Clustering
- Approximate Incremental Clustering
- Core and Non-core points
- Noise points
The mass of a region is defined as the number of data points within that region. The mass of the smallest local region is the number of data points in the lowest-level node containing the pair of points a and b.
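To make these definitions concrete, here is a minimal Python sketch, assuming axis-parallel random splits (Isolation-Forest-style trees standing in for the iTrees); the names Node, lowest_node_mass, and mass_dissimilarity are illustrative and not the thesis implementation.

```python
import random

class Node:
    """One node of an illustrative iTree built by axis-parallel random splits."""
    def __init__(self, points, depth, max_depth):
        self.mass = len(points)                 # mass of the region = number of points in it
        self.left = self.right = None
        self.split_dim = self.split_val = None
        if depth < max_depth and len(points) > 1:
            self.split_dim = random.randrange(len(points[0]))
            lo = min(p[self.split_dim] for p in points)
            hi = max(p[self.split_dim] for p in points)
            if lo < hi:
                self.split_val = random.uniform(lo, hi)
                self.left = Node([p for p in points if p[self.split_dim] < self.split_val],
                                 depth + 1, max_depth)
                self.right = Node([p for p in points if p[self.split_dim] >= self.split_val],
                                  depth + 1, max_depth)

def lowest_node_mass(node, a, b):
    """Mass of the smallest local region (lowest node) containing both a and b."""
    while node.left is not None:
        go_a = a[node.split_dim] < node.split_val
        go_b = b[node.split_dim] < node.split_val
        if go_a != go_b:                        # a and b separate at this node
            break
        node = node.left if go_a else node.right
    return node.mass

def mass_dissimilarity(trees, a, b, n):
    """Average relative mass of the lowest node containing (a, b) over all iTrees."""
    return sum(lowest_node_mass(t, a, b) for t in trees) / (len(trees) * n)

data = [(random.random(), random.random()) for _ in range(256)]
trees = [Node(data, 0, max_depth=8) for _ in range(20)]     # a small illustrative iForest
print(mass_dissimilarity(trees, data[0], data[1], len(data)))
```

The dissimilarity here is simply the mass of the smallest region containing both points, averaged over the forest and normalized by the dataset size.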
Problem formulation
For any point a ∈ D, if the mass of its µ-neighborhood Mµ(a) exceeds a core threshold δcore, then a is designated as a core point; otherwise it is a non-core point. For a non-core point a ∈ D, if it obtains no cluster membership, then that point qualifies as a noise point.
The MBSCAN Clustering Algorithm
Consequently, the mass of the smallest local region containing the pair of points (c, d) is 2. For the pairs of points (b, g) and (a, c), the mass of the lowest node is 3, since it contains three data points.
Experimental evaluation of MBSCAN in brief
Inferences drawn from experiments on MBSCAN
The mass matrix turned out to be the most expensive component of the MBSCAN algorithm. In designing the iMass algorithm, we therefore aim to build this matrix incrementally.
The iMass Clustering Algorithm
- Theoretical Model
- Assumptions made for the iMass clustering algorithm
- Retain the components of the MBSCAN algorithm
- Steps of the iMass clustering algorithm
Based on this comparison, the new point On+1 traces itself to the right child of the root node (since 2.7 ≥ 2.5). The mass value for a new point On+i ∈ D′ (the updated dataset) with each of the old points x ∈ D is calculated in the same way as in MBSCAN.
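A hedged sketch of the incremental step just described: the new point descends an iTree along the existing split conditions and increments the mass of every node on its path, so later lowest-node-mass lookups reflect the insertion without rebuilding the tree. TNode and insert_point are illustrative names, and leaving the split boundaries untouched is an assumption consistent with iMass being an approximate extension.

```python
class TNode:
    """Minimal iTree node for illustration: a mass count plus an axis-parallel split."""
    def __init__(self, mass, split_dim=None, split_val=None, left=None, right=None):
        self.mass, self.split_dim, self.split_val = mass, split_dim, split_val
        self.left, self.right = left, right

def insert_point(root, p):
    """Route the new point down one iTree and increment the mass of every node on
    its path; existing split boundaries are left untouched (approximate update)."""
    node = root
    while node is not None:
        node.mass += 1
        if node.left is None:                       # reached a leaf
            break
        node = node.left if p[node.split_dim] < node.split_val else node.right
    return root

# Toy tree: the root splits dimension 0 at 2.5, as in the example above (2.7 >= 2.5 -> right child).
tree = TNode(4, split_dim=0, split_val=2.5, left=TNode(2), right=TNode(2))
insert_point(tree, (2.7, 1.0))
print(tree.mass, tree.right.mass)                   # 5 3
```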
Time complexity comparison between MBSCAN and iMass
However, to calculate the average probability mass between a newly inserted point On+i and each of the older points, we use a methodology identical to that of the MBSCAN algorithm. Calculating the lowest node masses between the new point and each of the old points takes O(t(n + 1) log₂ Ψ) time.
Theoretical analysis of the iMass clustering algorithm
Cases related to updated mass-matrix
The updated mass becomes independent of the number of iTrees and reduces to a function of D and the old mass value me(x, y). Since the value of the term me(x, y) is already known from the execution of MBSCAN, corresponding to case 1, the updated mass matrix becomes independent of the number of iTrees involved.
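A plausible reconstruction of the case-1 relation, assuming the pairwise mass me(x, y) is the lowest-node mass averaged over the t iTrees and normalized by the dataset size, and that the new point falls outside the lowest node of (x, y) in every iTree:

```latex
% Case 1 (reconstruction): the inserted point does not enter the lowest node of (x, y)
% in any iTree, so only the normalizing dataset size changes from |D| to |D| + 1.
\tilde{m}_e(x, y)
  \;=\; \frac{1}{t}\sum_{i=1}^{t}\frac{\lvert R_i(x, y)\rvert}{\lvert D\rvert + 1}
  \;=\; \frac{\lvert D\rvert}{\lvert D\rvert + 1}\, m_e(x, y)
```

which depends only on |D| and the old value me(x, y), not on the number of iTrees t.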
Lemmas related to the iMass clustering algorithm
Proof: The threshold µ (refer to Equation 3.5) can change after inserting a point, but despite the reduction of the average values of the pairwise probability measure (Lemma 3.2), it cannot be guaranteed that the size of the µ-neighborhood of each point will consistently increase or decrease.
Experimental evaluation
Experimental procedure
Along with the above-mentioned metrics used to evaluate iMass, i.e., CPU execution times and the percentage of affected nodes relative to the total number of nodes in the iForest (Equation 3.9, Equation 3.10), we also determined the degree of reduction achieved by iMass in the time required to construct the mass matrix and to calculate the nodal masses via iForest.
Experimental results
[Figures: percentage of nodes affected in the iForest after each insertion vs. total number of points, for the Libras data (no. of iTrees: 20, basis point threshold: 5) and Dataset S1 (no. of iTrees: 20, basis point threshold: 9).]
Cluster analysis
Whenever a new point places itself in the correct nodes of an iTree, the iMass algorithm approximates the lowest node ID for (x, y) within that iTree. For most datasets with class labels in Table 3.11, we see that the iMass clustering algorithm maintains or improves the NMI value relative to MBSCAN.
Conclusion
Chapter contributions
Computes the KNN lists and the similarity matrix incrementally, detects the same clusters as SNNDB, and performs batchwise insertion. Reduces the time needed to compute the KNN lists and construct the similarity matrix after new insertions.
Related work and background
Preliminaries and Definitions
- K-nearest neighbor (KNN) list
- Shared nearest neighbors (SNN)
- Similarity matrix or SNN graph
- K-SNN graph
- Core and non-core points
- Noise points
- Clustering
- Exact Batch Incremental Clustering (Addition)
The remaining edges present in the SNN graph are identified as strong links. This method of obtaining the residual graph from the original SNN graph is known as K-nearest neighbor sparsification of the SNN graph [24].
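A minimal Python sketch of the KNN lists, SNN similarity, and the sparsification step (illustrative only; knn_lists, ksnn_graph, and the tie-breaking used here are assumptions, while K and δsim follow the thesis notation): an edge survives only if the two points appear in each other's KNN lists, its weight is the number of shared nearest neighbors, and edges with weight at least δsim are the strong links.

```python
import numpy as np

def knn_lists(X, K):
    """K-nearest neighbor list for every point (excluding the point itself)."""
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)
    return [set(np.argsort(row)[:K]) for row in d]

def ksnn_graph(X, K, delta_sim):
    """Sparsified SNN graph: keep edge (p, q) only if p and q are in each other's
    KNN lists; its weight is |KNN(p) ∩ KNN(q)|; weights >= delta_sim are strong links."""
    knn = knn_lists(X, K)
    edges, strong = {}, []
    for p in range(len(X)):
        for q in knn[p]:
            if p < q and p in knn[q]:               # mutual KNN membership
                w = len(knn[p] & knn[q])            # shared nearest neighbors
                edges[(p, q)] = w
                if w >= delta_sim:
                    strong.append((p, q))
    return edges, strong

X = np.random.rand(50, 2)
edges, strong = ksnn_graph(X, K=5, delta_sim=2)
print(len(edges), "edges,", len(strong), "strong links")
```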
Problem formulation
Then an incremental clustering is given by a mapping h : D′ → C′, with C′ ⊆ P(D′), isomorphic to the one-time clustering f(D′) produced by the non-incremental algorithm.
The SNNDB and InSDB clustering algorithms
If the value of δsim is set to 2, the edge between points P and P3 is considered a strong link because |{P2, P4}| ≥ δsim. Since the density of points P and P3 exceeds the threshold value δcore, P and P3 are marked as core points.
Structure of the proposed batch incremental SNNDB clustering algorithm
InSDB facilitates the detection of clusters dynamically while adding points to the base dataset D one at a time. Only the affected points are targeted by the algorithm, while the unaffected points are allowed to remain in their previous state.
Batch-Incremental SNNDB Clustering Algorithms for Addition
- The Batch-Inc1 clustering algorithm
- The Batch-Inc2 clustering algorithm
- The BISDBadd clustering algorithm
- Shared link properties between affected points post insertion
However, they can be part of the updated KNN list of any KN-Sadd type point. Sadd type points can be determined from the updated KNN list of any KN-Sadd type point.
Time complexity analysis of the BISDBadd clustering algorithm
Since Sadd type points are determined from the updated KNN lists of the KN-Sadd type points, we have the following equation. A maximum of K points can be displaced from the updated KNN list by a KN-Sadd type point.
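A plausible reading of the relation referred to above (the original equation is not reproduced in this excerpt, so this reconstruction is hedged): since every Sadd type point must appear in the updated KNN list of some KN-Sadd type point, and each such list contributes at most K points,

```latex
% Reconstructed bound: S_add points are drawn from the updated KNN lists
% of the KN-S_add type points, each of which holds at most K entries.
\lvert S_{add} \rvert \;\le\; K \cdot \lvert KN\text{-}S_{add} \rvert
```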
Experimental evaluation
- Phase-1: Finding the most effective batch incremental variant
- Phase-2: Prove the efficiency of the most effective batch incremental variant
- Phase-3: BISDBadd and SNNDB are more effective than InSDB
- Phase-4: Prove the efficiency of BISDBadd over SNNDB
Phase-3: Show that InSDB [1] becomes inefficient when major updates (addition) are made to the base dataset. [Figure: CPU time vs. variable updates (addition), as a percentage of points added to the base dataset (Mopsi12 dataset).]
Cluster analysis
Clustering results in brief
About 81% of the data points were core points, resulting in 91.4% of the data points being assigned cluster membership, while the rest were treated as noise points. Any change in these values can change the cluster output, along with the set of core, non-core, and noise points.
Conclusion
Chapter contributions
Computes the KNN lists and the similarity matrix incrementally, discovers the same clusters as SNNDB, and supports batchwise deletion. Reduces the time needed to calculate the KNN lists and build the K-SNN graph after new removals.
Related work and background
Computes the KNN lists, the similarity matrix, and the set of core and non-core points incrementally; detects the same clusters as SNNDB; supports batchwise deletion. Reduces the time needed to calculate the KNN lists, build the K-SNN graph, and identify core and non-core points after new deletions.
Preliminaries and Definitions
- K-nearest neighbor (KNN) list
- Shared nearest neighbors (SNN)
- Similarity matrix or SNN graph
- K-SNN graph
- Core, non-core and noise points
- Clustering
- Batch Incremental Clustering (Deletion)
According to the second condition, the similarity between the points x and y is greater than or equal to a threshold value δsim, and x is a core point while y is non-core. The third condition states that if x is a non-core point and no core point y exists with which x has a similarity value greater than or equal to δsim, then x is categorized as a noise point.
Problem formulation
As per the first condition, if the degree of closeness or similarity between the points x and y is greater than or equal to a threshold value δsim and x and y are both core points, then x and y are part of the same cluster. If another core point z exists and the similarity between y and z is greater than or equal to δsim, then z also belongs to the same cluster.
Structure of the proposed batch incremental SNNDB clustering algorithm
The nearest core point is the one that has a strong link to the non-core point in question and has a higher edge weight compared to other neighboring core points (see Chapter 4).
Batch-Incremental SNNDB Clustering Algorithms for Deletion
- The Batch-Dec1 clustering algorithm
- The Batch-Dec2 clustering algorithm
- The BISDBdel clustering algorithm
- Shared link properties between affected points post deletion
The Batch-Dec1 algorithm builds the updated e-KNN list incrementally for each of the existing data points. Any surviving non-KN-Sdel point in the updated KNN list of a KN-Sdel type point is designated as Sdel type.
Time complexity analysis of the BISDBdel clustering algorithm
Time complexity proof of BISDBdel
Let Te-KNN be the time it takes to compute the updated e-KNN lists of the existing data points in D′. The vacant position(s) are filled with the points found in the additional (w−1)·K positions of the e-KNN list.
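The refill step can be illustrated with a short Python sketch, assuming the extended e-KNN list simply stores the w·K nearest neighbors in order (repair_knn and the exact bookkeeping are illustrative, not the BISDBdel implementation): positions vacated by deleted neighbors among the first K entries are filled by the nearest surviving points from the remaining (w−1)·K entries, avoiding a full KNN recomputation.

```python
def repair_knn(e_knn, deleted, K):
    """Hedged sketch: rebuild a K-length KNN list from an extended e-KNN list
    (length w*K, nearest first) after a batch of points is deleted.

    Vacated slots among the first K entries are filled by the nearest
    surviving neighbors drawn from the remaining (w-1)*K entries."""
    survivors = [p for p in e_knn if p not in deleted]   # keep original nearness order
    return survivors[:K]                                 # may be shorter if too many neighbors were deleted

# Toy example: K = 3, w = 2, so the e-KNN list stores 6 neighbors of some point.
e_knn = [7, 2, 9, 4, 1, 8]                               # nearest first
print(repair_knn(e_knn, deleted={2, 9}, K=3))            # [7, 4, 1]
```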
Experimental evaluation
- Phase-1: Finding the most effective batch incremental variant
- Phase-2: Prove the efficiency of the most effective batch incremental variant
- Phase-3: BISDBdel and SNNDB are more effective than InSDB
- Phase-4: Prove the efficiency of BISDBdel over SNNDB
[Figure: CPU time vs. variable updates (deletion), as a percentage of points deleted from the base dataset (Mopsi12 dataset).] For constant updates, a fixed number of points was deleted from the base dataset in several batches.
Cluster analysis
Clustering results in brief
This is because the removal of data points leads to the splitting of clusters and the simultaneous breaking of strong shared links. In the case of the Birch3 dataset, about 87.9% of the data points belong to a cluster, with an average cluster size of 7.68.
Conclusion
Chapter contributions
We propose an approximate incremental solution to KNNOD [5] in the form of the KAGO algorithm (Table 6.2).
Related work and background
O: set of outliers before any changes to the dataset; O′: set of outliers after the dataset has been changed.
Preliminaries and Definitions
- Local outlier
- Neighborhood
- Kernel centers
- Kernel Density Estimate (KDE)
- Kernel functions
- Grid local outlier score (glos)
- Mean grid local outlier score (mglos)
- Incremental anomaly detection
Calculate the density (local density) at a point xi together with the densities at the neighboring points of xi. Typically, S represents the set of points within a grid cell gc, and Lcj is a data point sampled from S.
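A hedged Python sketch of the grid-local density idea (the kernel bandwidth, the sampling of kernel centers, and the ratio-style score named grid_local_outlier_score below are assumptions for illustration, not the exact glos definition): a Gaussian KDE at xi is computed from kernel centers sampled within its grid cell, and the point scores high when it is much less dense than its neighbors.

```python
import numpy as np

def kde(x, centers, h):
    """Gaussian kernel density estimate at x from sampled kernel centers."""
    x, centers = np.asarray(x, float), np.asarray(centers, float)
    d2 = ((centers - x) ** 2).sum(axis=1)
    norm = len(centers) * (h * np.sqrt(2 * np.pi)) ** x.size
    return np.exp(-d2 / (2 * h * h)).sum() / norm

def grid_local_outlier_score(xi, neighbors, centers, h=0.2):
    """Illustrative glos-style score: average neighbor density divided by the
    density at xi (higher means xi is locally sparser than its neighbors)."""
    dens_xi = kde(xi, centers, h)
    dens_nb = np.mean([kde(p, centers, h) for p in neighbors])
    return dens_nb / max(dens_xi, 1e-12)

# Toy grid cell: S is the set of points inside the cell, centers are sampled from S.
rng = np.random.default_rng(0)
S = rng.normal(0.5, 0.1, size=(100, 2))                      # points inside one grid cell gc
centers = S[rng.choice(len(S), size=10, replace=False)]      # sampled kernel centers Lcj
print(grid_local_outlier_score([0.95, 0.95], S[:5], centers))  # sparse point -> large score
```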
Problem formulation
The KNNOD algorithm in brief
Framework of the KAGO algorithm
The KAGO algorithm
Theoretical model
Steps of the KAGO outlier detection algorithm
Time Complexity of the KAGO algorithm
Experimental evaluation
Experimental setup and datasets used
Experimental Results and Analysis
Memory usage
Brief outlier analysis
Key properties of the KAGO algorithm
Lemmas related to the KAGO algorithm
Conclusion
Future scopes
Representation of various clustering domains
Dense areas represent the clusters and subsequently the noises are
Detecting clusters of arbitrary shapes and densities
Dense grid cells accumulate to form clusters
Illustration of outliers in a 2-D data
Representation of various outlier detection paradigms
Latency scenarios due to changes in the input data
Cluster formation scheme in DBSCAN
Mass-based dissimilarity matrix
Sequence of execution for the iMass clustering algorithm
Compute the node mass incrementally upon new point insertion
Updated mass-matrix post insertion of a new point
Efficiency comparison between iMass and MBSCAN along with
Efficiency comparison between iMass and MBSCAN along with
Efficiency comparison between iMass and MBSCAN along with
KNN list for point P where K = 5
Similarity value between points P and P3 in the K-SNN graph given
Cluster containing core points P and P3 in the K-SNN graph. If