In this section, we present the definitions of the terms and concepts used in this chapter (see Table 3.2 for the notation used henceforth).
Let D be a sample of data drawn from a probability density function F, and let H ∈ ℋ be a hierarchical model that partitions the data space into non-empty regions; then the following concepts may be defined:
¹See Section 3.5 for an explanation of the mass-based neighborhood, or µ-neighborhood mass, as part of the MBSCAN algorithm.
3.3.1 Modeling a region
A recursive partitioning methodology known as iForest (isolation Forest) [65] is used to model regions. An existing study [70] has shown that iForest is a special case of the mass estimation technique. MBSCAN [2] uses a method based on completely random trees to construct an iTree (isolation Tree) (refer to Section 3.5 for details).
An iForest is a combination of multiple such iTrees. Each iTree is a binary tree that represents a particular hierarchical partitioning model H_i, i = 1, 2, 3, ..., t, where t denotes the total number of iTrees.
Let R represent a region; then we have the following interpretations:
• iTree_j, j = 1, 2, 3, ..., t, models a sub-region r_j ⊂ R.
• r_1 ∪ r_2 ∪ ... ∪ r_t = R; r_j ≠ ∅, j = 1, 2, 3, ..., t.
• ∀ i, j with i ≠ j, r_i ∩ r_j = ∅, where 1 ≤ i, j ≤ t.
• If the points within any r_j, j = 1, 2, 3, ..., t, belong to a set D, then the root node of the corresponding iTree_j may contain the elements of D, and its internal nodes are created based on a certain split condition (see Section 3.5 for details).
• The root node of any iTree_j, j = 1, 2, 3, ..., t, effectively represents the whole sub-region r_j, and its internal nodes denote r_j's division into smaller sub-regions, as sketched below.
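To make this construction concrete, the following minimal Python sketch grows one such partitioning. It assumes axis-parallel splits at uniformly random split points, as in the standard iForest construction, and grows every tree on the full dataset for simplicity; the exact split condition used by MBSCAN is described in Section 3.5, and the names build_itree and build_iforest are illustrative only.

import random

def build_itree(data, idx, depth, max_depth, rng):
    # Each node stores the indices of the points that fall in its region,
    # mirroring how an iTree node models a sub-region r_j of R.
    node = {"idx": list(idx), "left": None, "right": None}
    if len(idx) <= 1 or depth >= max_depth:
        return node                              # leaf: region is not split further
    q = rng.randrange(len(data[0]))              # pick a random attribute
    vals = [data[i][q] for i in idx]
    lo, hi = min(vals), max(vals)
    if lo == hi:                                 # all points identical on q: stop
        return node
    p = rng.uniform(lo, hi)                      # random split point on that attribute
    node["left"] = build_itree(data, [i for i in idx if data[i][q] < p],
                               depth + 1, max_depth, rng)
    node["right"] = build_itree(data, [i for i in idx if data[i][q] >= p],
                                depth + 1, max_depth, rng)
    return node

def build_iforest(data, t=100, max_depth=10, seed=0):
    # An iForest is simply a collection of t independently grown iTrees.
    rng = random.Random(seed)
    return [build_itree(data, range(len(data)), 0, max_depth, rng) for _ in range(t)]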
3.3.2 Mass of a region
The mass of a region is defined as the number of data points within that region.
The following relation (Equation 3.1) defines the mass of a region containing a and b, ∀ a, b ∈ D:

M_r(a, b \mid H; D) = \sum_{r \subseteq H \,\text{s.t.}\, \{a,b\} \in r} \; \sum_{c \in D} \mathbf{1}(c \in r)    (3.1)
where 1(·) is an indicator function, r is any region, H is any hierarchical partitioning model represented by an iTree, and D is the set of elements involved. If any node of the iTree modeling the region r represents a sub-region within r containing the data points a and b, then the number of elements within that node gives the mass of that sub-region, inclusive of the pair of points. The number of elements in the root node of the iTree gives the mass of the whole region r.
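As an illustration, the following sketch (reusing the node layout of build_itree above) walks down a single iTree and reports the mass of every nested region that contains a given pair of points; the first entry corresponds to the root node (the whole region), and summing these values corresponds to Equation 3.1.

def masses_along_path(node, a, b):
    # Collect the mass (number of points) of every nested region of this iTree
    # that contains both point indices a and b, starting from the root.
    masses = []
    while node is not None and a in node["idx"] and b in node["idx"]:
        masses.append(len(node["idx"]))          # mass of this region = #points inside it
        nxt = None
        for child in (node["left"], node["right"]):
            if child is not None and a in child["idx"] and b in child["idx"]:
                nxt = child                      # descend while both points stay together
        node = nxt
    return masses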
3.3.3 Mass of smallest local region
We have the following relation (Equation 3.2) defining the mass of the smallest local region [2] containing points a and b, ∀ a, b ∈ D:
R(a, b \mid H; D) = \operatorname*{arg\,min}_{r \subset H \,\text{s.t.}\, \{a,b\} \in r} \; \sum_{c \in D} \mathbf{1}(c \in r)    (3.2)
where 1(·) is an indicator function, r is the smallest local region, and H is any hierarchical partitioning model represented by an iTree. The smallest local region covering a and b is represented by the lowest-level node of the iTree containing that pair of points. The mass of the smallest local region r is the number of elements in that lowest-level node, inclusive of the pair of points a and b.
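In code, the smallest local region of Equation 3.2 is simply the last region on the containment path, i.e., the deepest node that still holds both points (a small sketch building on masses_along_path above):

def smallest_local_region_mass(root, a, b):
    # The deepest node containing both a and b is the smallest local region
    # R(a, b | H; D); its mass is the last entry on the containment path.
    return masses_along_path(root, a, b)[-1]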
3.3.4 Mass-based dissimilarity
Mass-based dissimilarity [2], also called the probability mean-mass of a and b w.r.t. D and F, is defined as the expected probability of R(a, b | H; D) (Equation 3.3) and is given as:
m_e(a, b \mid H; D) = E_{\mathcal{H}(D)}\!\left[ P_F\big( R(a, b \mid H; D) \big) \right]    (3.3)

where P_F(·) is the probability w.r.t. F and E_{ℋ(D)} is the expectation taken over all hierarchical models in ℋ(D). In practice the mass-based dissimilarity would be estimated from a finite number of hierarchical models (iTrees) H_i ∈ ℋ(D), i = 1, 2, 3, ..., t, as follows (Equation 3.4):
m_e(a, b \mid H; D) = \frac{1}{t} \sum_{i=1}^{t} \tilde{P}\big( R(a, b \mid H_i; D) \big)    (3.4)
where P̃(R) = R(a, b | H_i; D) / |D| denotes the probability mass w.r.t. a given H_i. It is to be noted that R(a, b | H; D) is the mass of the smallest local region covering a and b. It is analogous to the shortest distance between a and b used in the geometric model.
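A minimal sketch of the estimator in Equation 3.4, assuming the forest and helper functions from the previous sketches (n denotes |D|):

def mass_dissimilarity(forest, a, b, n):
    # m_e(a, b): average, over the t iTrees, of the probability mass
    # P~(R) = R(a, b | H_i; D) / |D| of the smallest local region covering a and b.
    total = sum(smallest_local_region_mass(root, a, b) for root in forest)
    return total / (len(forest) * n)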
3.3.5 Mass-based neighborhood
For a real value µ (determined by the MBSCAN algorithm), the mass-based neighborhood or µ-neighborhood mass [2] for a point a ∈ D is given as:
M_\mu(a) = \{\, b \in D \mid m_e(a, b) \le \mu \,\}    (3.5)

Equation 3.5 states that the µ-neighborhood of a point a is the set of points whose probability mean-mass with a is less than or equal to µ; its size |M_µ(a)| counts those points.
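For illustration, the µ-neighborhood of Equation 3.5 can be computed directly from the estimated dissimilarities (a sketch; in MBSCAN the value of µ is determined by the algorithm itself, whereas here it is passed in explicitly):

def mu_neighborhood(forest, a, n, mu):
    # M_mu(a): all points whose mass-based dissimilarity to a is at most mu.
    # Its size |M_mu(a)| is later compared against delta_core to decide core points.
    return {b for b in range(n) if mass_dissimilarity(forest, a, b, n) <= mu}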
3.3.6 Clustering
Given a dataset D, if ∀ a, b ∈ D there exist a mass-based dissimilarity function m_e(a, b) and a mass-based neighborhood set M_µ(a), then we define clustering by a mapping f : D → C, where C ⊆ P(D) and P(D) denotes the power set of D. If a ≠ b and there exists a threshold δ_core, then we have the following interpretations:
1. If |M_µ(a)| > δ_core and |M_µ(b)| > δ_core, where b ∈ M_µ(a) and a ∈ M_µ(b), then f(a) = f(b).
2. If |M_µ(a)| > δ_core and |M_µ(b)| ≤ δ_core, where b ∈ M_µ(a) and a ∈ M_µ(b), and ∃ c ∈ D with a ≠ b ≠ c, b ∈ M_µ(c), and |M_µ(c)| > δ_core, then: if m_e(a, b) < m_e(b, c), then f(b) = f(a); otherwise f(b) = f(c).
3. If |M_µ(a)| ≤ δ_core and ∄ c ∈ M_µ(a) with c ≠ a and |M_µ(c)| > δ_core, then {a} ∉ C.
According to the above definitions, the first point states that if two data points a and b are dense or core², and each belongs to the other's mass-based neighborhood (µ-neighborhood), then they are part of the same cluster.
The second point states that if two data points a and b are part of each other's mass-based neighborhood such that a is core while b is non-core (non-dense), then b is associated with the cluster of its nearest core point.
The third point states that if a non-core point, e.g., a, fails to find any core point within its mass-based neighborhood, it does not obtain any cluster membership. These three rules are sketched in code below.
²A detailed explanation of core and non-core points is presented in Section 3.5 as part of the MBSCAN algorithm.
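The three membership rules above can be summarised in the following sketch, which takes precomputed neighborhoods and dissimilarities as input; the function name and argument layout are illustrative and not the MBSCAN implementation described in Section 3.5.

def assign_clusters(points, neighborhoods, dissim, delta_core):
    # neighborhoods[a] is the set M_mu(a); dissim(a, b) returns m_e(a, b).
    core = {a for a in points if len(neighborhoods[a]) > delta_core}
    label, cluster_id = {}, 0
    # Rule 1: mutually neighboring core points end up in the same cluster.
    for a in core:
        if a in label:
            continue
        cluster_id += 1
        label[a] = cluster_id
        stack = [a]
        while stack:
            p = stack.pop()
            for q in neighborhoods[p]:
                if q in core and q not in label and p in neighborhoods[q]:
                    label[q] = cluster_id
                    stack.append(q)
    # Rule 2: a non-core point joins the cluster of its nearest core neighbor.
    for b in points:
        if b in core or b in label:
            continue
        cores_nearby = [c for c in neighborhoods[b] if c in core]
        if cores_nearby:
            label[b] = label[min(cores_nearby, key=lambda c: dissim(b, c))]
    # Rule 3: points that remain unlabeled obtain no cluster membership (noise).
    return label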
3.3.7 Approximate Incremental Clustering
Let the initial clustering, defined by a mapping f : D → C, represent the set of clusters obtained from the static algorithm. Let an insertion sequence of k points be made over a base dataset D (|D| = n, k ≪ n). After the k insertions, let D′ be the updated dataset, and let an incremental clustering be defined as a mapping h : D′ → C′, where C′ ⊆ P(D′) represents the clusters produced by the incremental version. Now, if the updated dataset D′ is fed to the naive algorithm in its entirety and the resulting clustering is given by a mapping f : D′ → C″, then in the case of approximate incremental clustering we have C′ ≈ C″.
3.3.8 Core and Non-core points
For any point a ∈ D, if the size |M_µ(a)| of the µ-neighborhood exceeds a core-point formation threshold δ_core, then a is designated as a core point; otherwise it is a non-core point.
3.3.9 Noise points
For a non-core point a ∈ D, if it fails to obtain any cluster membership, then that point qualifies as a noise point.