In this section, we present the definitions of the terms and concepts used in this chapter (see Table 3.2 for the notation used henceforth).
Let D be a sample of data drawn from a probability density function F, and let H ∈ ℋ be a hierarchical model that partitions the data space into non-empty regions; then the following concepts may be defined:
¹See Section 3.5 for an explanation of the mass-based neighborhood, or µ-neighborhood mass, as part of the MBSCAN algorithm.
3.3.1 Modeling a region
A recursive partitioning methodology known as iForest (isolation Forest) [65] is used to model regions. An existing study [70] has shown that iForest is a special case of the mass estimation technique. MBSCAN [2] uses a method based on completely random trees to construct an iTree (isolation Tree) (refer to Section 3.5 for details).
An iForest is a combination of multiple such iTrees. Each iTree is a binary tree that represents a particular hierarchical partitioning model H_i, i = 1, 2, 3, ..., t, where t denotes the total number of iTrees.
Let R represent a region; then we have the following interpretations:
• iTree_j, j = 1, 2, 3, ..., t, models a sub-region r_j ⊂ R.
• r_1 ∪ r_2 ∪ ... ∪ r_t = R; r_j ≠ ∅, j = 1, 2, 3, ..., t.
• ∀ i, j with i ≠ j, r_i ∩ r_j = ∅, where 1 ≤ i, j ≤ t.
• If the points within any r_j, j = 1, 2, 3, ..., t, belong to a set D, then the root node of the corresponding iTree_j may contain the elements of D, and its internal nodes are created based on a certain split condition (see Section 3.5 for details).
• The root node of any iTree_j, j = 1, 2, 3, ..., t, effectively represents the whole sub-region r_j, and its internal nodes denote r_j's division into smaller sub-regions, as sketched below.
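To make this construction concrete, the following minimal Python sketch grows one such partitioning. It assumes axis-parallel splits at uniformly random split points, as in the standard iForest construction, and grows every tree on the full dataset for simplicity; the exact split condition used by MBSCAN is described in Section 3.5, and the names build_itree and build_iforest are illustrative only.

import random

def build_itree(data, idx, depth, max_depth, rng):
    # Each node stores the indices of the points that fall in its region,
    # mirroring how an iTree node models a sub-region r_j of R.
    node = {"idx": list(idx), "left": None, "right": None}
    if len(idx) <= 1 or depth >= max_depth:
        return node                              # leaf: region is not split further
    q = rng.randrange(len(data[0]))              # pick a random attribute
    vals = [data[i][q] for i in idx]
    lo, hi = min(vals), max(vals)
    if lo == hi:                                 # all points identical on q: stop
        return node
    p = rng.uniform(lo, hi)                      # random split point on that attribute
    node["left"] = build_itree(data, [i for i in idx if data[i][q] < p],
                               depth + 1, max_depth, rng)
    node["right"] = build_itree(data, [i for i in idx if data[i][q] >= p],
                                depth + 1, max_depth, rng)
    return node

def build_iforest(data, t=100, max_depth=10, seed=0):
    # An iForest is simply a collection of t independently grown iTrees.
    rng = random.Random(seed)
    return [build_itree(data, range(len(data)), 0, max_depth, rng) for _ in range(t)]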
3.3.2 Mass of a region
The mass of a region is defined as the number of data points within that region.
The following relation (Equation 3.1) defines the mass of a region containing a and b, ∀ a, b ∈ D:

M_r(a, b \mid H; D) = \sum_{r \subseteq H \,\text{s.t.}\, \{a,b\} \in r} \; \sum_{c \in D} \mathbf{1}(c \in r)    (3.1)
where 1(·) is an indicator function, r is any region, H is any hierarchical partitioning model represented by an iTree, and D is the set of elements involved. If any node of the iTree modeling the region r represents a sub-region within r containing the data points a and b, then the number of elements within that node gives the mass of that sub-region, inclusive of the pair of points. The number of elements in the root node of the iTree gives the mass of the whole region r.
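As an illustration, the following sketch (reusing the node layout of build_itree above) walks down a single iTree and reports the mass of every nested region that contains a given pair of points; the first entry corresponds to the root node (the whole region), and summing these values corresponds to Equation 3.1.

def masses_along_path(node, a, b):
    # Collect the mass (number of points) of every nested region of this iTree
    # that contains both point indices a and b, starting from the root.
    masses = []
    while node is not None and a in node["idx"] and b in node["idx"]:
        masses.append(len(node["idx"]))          # mass of this region = #points inside it
        nxt = None
        for child in (node["left"], node["right"]):
            if child is not None and a in child["idx"] and b in child["idx"]:
                nxt = child                      # descend while both points stay together
        node = nxt
    return masses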
3.3.3 Mass of smallest local region
We have the following relation (Equation 3.2) defining the mass of the smallest local region [2] containing points a and b, ∀ a, b ∈ D:
R(a, b \mid H; D) = \operatorname*{arg\,min}_{r \subset H \,\text{s.t.}\, \{a,b\} \in r} \; \sum_{c \in D} \mathbf{1}(c \in r)    (3.2)
where 1(·) is an indicator function, r is the smallest local region, and H is any hierarchical partitioning model represented by an iTree. The smallest local region covering a and b is represented by the lowest-level node of the iTree containing that pair of points. The mass of the smallest local region r is the number of elements in that lowest-level node, inclusive of the pair of points a and b.
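In code, the smallest local region of Equation 3.2 is simply the last region on the containment path, i.e., the deepest node that still holds both points (a small sketch building on masses_along_path above):

def smallest_local_region_mass(root, a, b):
    # The deepest node containing both a and b is the smallest local region
    # R(a, b | H; D); its mass is the last entry on the containment path.
    return masses_along_path(root, a, b)[-1]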
3.3.4 Mass-based dissimilarity
Mass-based dissimilarity [2], also called the probability mean-mass of a and b w.r.t. D and F, is defined as the expected probability of R(a, b | H; D) (Equation 3.3) and is given as:
m_e(a, b \mid H; D) = E_{\mathcal{H}(D)}\!\left[ P_F\big( R(a, b \mid H; D) \big) \right]    (3.3)

where P_F(·) is the probability w.r.t. F and E_{ℋ(D)} is the expectation taken over all hierarchical models in ℋ(D). In practice the mass-based dissimilarity would be estimated from a finite number of hierarchical models (iTrees) H_i ∈ ℋ(D), i = 1, 2, 3, ..., t, as follows (Equation 3.4):
m_e(a, b \mid H; D) = \frac{1}{t} \sum_{i=1}^{t} \tilde{P}\big( R(a, b \mid H_i; D) \big)    (3.4)
where P̃(R) = R(a, b | H_i; D) / |D| denotes the probability mass w.r.t. a given H_i. It is to be noted that R(a, b | H; D) is the mass of the smallest local region covering a and b. It is analogous to the shortest distance between a and b used in the geometric model.
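A minimal sketch of the estimator in Equation 3.4, assuming the forest and helper functions from the previous sketches (n denotes |D|):

def mass_dissimilarity(forest, a, b, n):
    # m_e(a, b): average, over the t iTrees, of the probability mass
    # P~(R) = R(a, b | H_i; D) / |D| of the smallest local region covering a and b.
    total = sum(smallest_local_region_mass(root, a, b) for root in forest)
    return total / (len(forest) * n)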
3.3.5 Mass-based neighborhood
For a real value µ (determined by the MBSCAN algorithm), the mass-based neighborhood or µ-neighborhood mass [2] for a point a ∈ D is given as:
M_\mu(a) = \{\, b \in D \mid m_e(a, b) \le \mu \,\}    (3.5)

Equation 3.5 states that the µ-neighborhood of a point a is the set of points whose probability mean-mass with a is less than or equal to µ; its size |M_µ(a)| counts those points.
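For illustration, the µ-neighborhood of Equation 3.5 can be computed directly from the estimated dissimilarities (a sketch; in MBSCAN the value of µ is determined by the algorithm itself, whereas here it is passed in explicitly):

def mu_neighborhood(forest, a, n, mu):
    # M_mu(a): all points whose mass-based dissimilarity to a is at most mu.
    # Its size |M_mu(a)| is later compared against delta_core to decide core points.
    return {b for b in range(n) if mass_dissimilarity(forest, a, b, n) <= mu}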
3.3.6 Clustering
Given a dataset D, if ∀ a, b ∈ D there exist a mass-based dissimilarity function m_e(a, b) and a mass-based neighborhood set M_µ(a), then we define clustering by a mapping f : D → C, where C ⊆ P(D) and P(D) denotes the power set of D. If a ≠ b and there exists a threshold δ_core, then we have the following interpretations:
1. If |M_µ(a)| > δ_core and |M_µ(b)| > δ_core, where b ∈ M_µ(a) and a ∈ M_µ(b), then f(a) = f(b).
2. If |M_µ(a)| > δ_core and |M_µ(b)| ≤ δ_core, where b ∈ M_µ(a) and a ∈ M_µ(b), and ∃ c ∈ D with a ≠ b ≠ c, b ∈ M_µ(c), and |M_µ(c)| > δ_core, then: if m_e(a, b) < m_e(b, c), then f(b) = f(a); otherwise f(b) = f(c).
3. If |M_µ(a)| ≤ δ_core and ∄ c ∈ M_µ(a) with c ≠ a and |M_µ(c)| > δ_core, then {a} ∉ C.
According to the above definitions, the first point states that if two data points a and b are dense or core², and each belongs to the other's mass-based neighborhood (µ-neighborhood), then they are part of the same cluster.
The second point states that if two data points a and b are part of each other's mass-based neighborhood such that a is core while b is non-core (non-dense), then b is associated with the cluster of its nearest core point.
The third point states that if a non-core point, e.g., a, fails to find any core point within its mass-based neighborhood, it does not obtain any cluster membership. These three rules are sketched in code below.
²A detailed explanation of core and non-core points is presented in Section 3.5 as part of the MBSCAN algorithm.
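The three membership rules above can be summarised in the following sketch, which takes precomputed neighborhoods and dissimilarities as input; the function name and argument layout are illustrative and not the MBSCAN implementation described in Section 3.5.

def assign_clusters(points, neighborhoods, dissim, delta_core):
    # neighborhoods[a] is the set M_mu(a); dissim(a, b) returns m_e(a, b).
    core = {a for a in points if len(neighborhoods[a]) > delta_core}
    label, cluster_id = {}, 0
    # Rule 1: mutually neighboring core points end up in the same cluster.
    for a in core:
        if a in label:
            continue
        cluster_id += 1
        label[a] = cluster_id
        stack = [a]
        while stack:
            p = stack.pop()
            for q in neighborhoods[p]:
                if q in core and q not in label and p in neighborhoods[q]:
                    label[q] = cluster_id
                    stack.append(q)
    # Rule 2: a non-core point joins the cluster of its nearest core neighbor.
    for b in points:
        if b in core or b in label:
            continue
        cores_nearby = [c for c in neighborhoods[b] if c in core]
        if cores_nearby:
            label[b] = label[min(cores_nearby, key=lambda c: dissim(b, c))]
    # Rule 3: points that remain unlabeled obtain no cluster membership (noise).
    return label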
3.3.7 Approximate Incremental Clustering
Let the initial clustering, defined by a mapping f : D → C, represent the set of clusters obtained from the static algorithm. Let an insertion sequence of k points be made over a base dataset D (|D| = n, k ≪ n). After the k insertions, let D′ be the updated dataset, and let an incremental clustering be defined as a mapping h : D′ → C′, where C′ ⊆ P(D′) represents the clusters produced by the incremental version. Now, if the updated dataset D′ is fed to the naive algorithm in its entirety and the resulting clustering is given by a mapping f : D′ → C″, then in the case of approximate incremental clustering we have C′ ≈ C″.
3.3.8 Core and Non-core points
For any point a ∈ D, if the size |M_µ(a)| of the µ-neighborhood exceeds a core-point formation threshold δ_core, then a is designated as a core point; otherwise it is a non-core point.
3.3.9 Noise points
For a non-core point a ∈ D, if it fails to obtain any cluster membership, then that point qualifies as a noise point.