Machine Learning: Clustering: DBSCAN

(1)

Informatics Engineering Study Program

Faculty of Engineering, Universitas Surabaya

Clustering: DBSCAN

Week 10

1604C055 - Machine Learning

(2)

What is DBSCAN Clustering?

• DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise

• Proposed by Ester, Kriegel, Sander, and Xu (KDD96)

• Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points.

• Discovers clusters of arbitrary shape in spatial databases with noise

(3)

Density-Based Clustering Methods

• Density is the number of points within a specified radius (Eps)

• In density-based clustering we partition points into dense regions separated by not-so-dense regions.

• Major features:

– Discover clusters of arbitrary shape

– Handle noise

– One scan

– Need density parameters

(4)

DBSCAN Parameter

DBSCAN requires two input parameters before performing the clustering process: Epsilon (Eps) and Minimum Points (MinPts).

• Eps is a fixed value for the maximum distance at which two points are still considered neighbors.

– In the figure, the dotted line represents that distance

– Eps acts like a radius that extends from a data point in all directions, forming a perimeter around that data point.

(5)

DBSCAN Parameter

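• MinPts is the minimum number of points that must lie within the Eps-neighborhood of a point for that point to be treated as part of a dense region.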

(6)

Eps-Neighborhood

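• The Eps-neighborhood of a point $p$ is the set of points in the dataset $D$ that lie within distance Eps of $p$: $N_{Eps}(p) = \{\, q \in D \mid \mathrm{dist}(p, q) \le Eps \,\}$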

(7)

Characterization of Points

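• Core point: its Eps-neighborhood contains at least MinPts points.

• Border point: not a core point, but it lies within the Eps-neighborhood of a core point.

• Noise point: neither a core point nor a border point.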

(8)

Reachability

• Directly density reachable:

– An object (or instance) q is directly density-reachable from object p if q is within the Eps-neighborhood of p and p is a core object.

– Direct density-reachability is not symmetric: in the figure, object p is not directly density-reachable from object q because q is not a core object.
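The asymmetry described above can be checked on a few points. The snippet below is a minimal sketch, not part of the original slides: the helper names, the toy coordinates, and the choice to count a point as a member of its own Eps-neighborhood are all assumptions made for illustration.

```python
import numpy as np

def eps_neighborhood(X, p, eps):
    """Indices of all points within distance eps of point p (p itself included)."""
    return np.flatnonzero(np.linalg.norm(X - X[p], axis=1) <= eps)

def is_core(X, p, eps, min_pts):
    """A point is a core point if its Eps-neighborhood holds at least min_pts points."""
    return len(eps_neighborhood(X, p, eps)) >= min_pts

def directly_density_reachable(X, q, p, eps, min_pts):
    """True if q is directly density-reachable from p."""
    return bool(is_core(X, p, eps, min_pts) and q in eps_neighborhood(X, p, eps))

# Hypothetical toy data: points 0, 1, 2 are close together, point 3 sits at the edge.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.5, 0.0]])
EPS, MIN_PTS = 2.0, 3

# Point 1 is a core point, point 3 is not.
print(directly_density_reachable(X, q=3, p=1, eps=EPS, min_pts=MIN_PTS))
print(directly_density_reachable(X, q=1, p=3, eps=EPS, min_pts=MIN_PTS))
```

Point 3 is directly density-reachable from point 1, but not the other way round, because point 3 has only two points in its Eps-neighborhood.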

(9)

Reachability

• Density reachable:

– An object q is density-reachable from p w.r.t. Eps and MinPts if there is a chain of objects $q_1, q_2, \dots, q_n$ with $q_1 = p$ and $q_n = q$ such that $q_{i+1}$ is directly density-reachable from $q_i$ w.r.t. Eps and MinPts for all $1 \le i < n$.

(10)

Connectivity

• Density connectivity: Object q is density-connected to object p w.r.t. Eps and MinPts if there is an object o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.

• Density connectivity is symmetric: if object q is density-connected to object p, then object p is also density-connected to object q.
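Density-reachability and density-connectivity can be sketched as a breadth-first search that only continues through core points. This is an illustrative sketch under assumptions: the function names and the toy points are hypothetical, and a point counts as a member of its own Eps-neighborhood.

```python
from collections import deque
import numpy as np

def eps_neighborhood(X, p, eps):
    """Indices of all points within distance eps of point p (p itself included)."""
    return np.flatnonzero(np.linalg.norm(X - X[p], axis=1) <= eps)

def density_reachable_set(X, p, eps, min_pts):
    """Indices of all points density-reachable from p (empty if p is not a core point)."""
    reached, visited, queue = set(), {p}, deque([p])
    while queue:
        current = queue.popleft()
        neighbors = eps_neighborhood(X, current, eps)
        if len(neighbors) < min_pts:
            continue  # not a core point: the chain cannot continue through it
        for j in map(int, neighbors):
            reached.add(j)
            if j not in visited:
                visited.add(j)
                queue.append(j)
    return reached

def density_connected(X, p, q, eps, min_pts):
    """True if some object o exists from which both p and q are density-reachable."""
    return any({p, q} <= density_reachable_set(X, o, eps, min_pts)
               for o in range(len(X)))

# Hypothetical toy data: 0, 1, 2 are core points, 3 and 4 are border points, 5 is far away.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [2.5, 0.0], [-1.8, 0.0], [10.0, 10.0]])
EPS, MIN_PTS = 2.0, 3

print(4 in density_reachable_set(X, 3, EPS, MIN_PTS))   # border point 3 reaches nothing
print(density_connected(X, 3, 4, EPS, MIN_PTS))          # connected through the core points
print(density_connected(X, 3, 5, EPS, MIN_PTS))          # point 5 is isolated
```

The two border points are not density-reachable from each other, yet they are density-connected because both are density-reachable from core point 0.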

(11)

Cluster and Noise

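• Cluster: a maximal set of density-connected points w.r.t. Eps and MinPts.

• Noise: the points that do not belong to any cluster.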

(12)

DBSCAN Steps

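• Pick a point that has not yet been classified.

• If it is a core point, start a new cluster and add every point that is density-reachable from it.

• Otherwise, mark it as noise for now; it may later join a cluster as a border point.

• Repeat until every point has been classified.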

(13)

Example

• Parameters

– Eps = 2

– MinPts = 3

for each x ∈ D do
    if x is not yet classified then
        if x is a core object then
            collect all objects density-reachable from x and assign them to a new cluster
        else
            assign x to NOISE
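The pseudocode above can be turned into a short from-scratch implementation. The version below is a minimal sketch and not the lecture's own code. The function names and the toy dataset are assumptions, it uses a brute-force neighbor search, and a point counts as a member of its own Eps-neighborhood. Only Eps = 2 and MinPts = 3 come from the slide.

```python
import numpy as np

NOISE = -1
UNCLASSIFIED = -2

def region_query(X, i, eps):
    """Indices of all points within distance eps of point i (i itself included)."""
    return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

def dbscan(X, eps, min_pts):
    """Return one label per point: 0, 1, 2, ... for clusters and -1 for noise."""
    X = np.asarray(X, dtype=float)
    labels = np.full(len(X), UNCLASSIFIED)
    cluster_id = 0
    for i in range(len(X)):
        if labels[i] != UNCLASSIFIED:
            continue
        neighbors = region_query(X, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        # i is a core point: grow a new cluster from it
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # former noise becomes a border point
            if labels[j] != UNCLASSIFIED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(X, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also a core point: keep expanding
                seeds.extend(j_neighbors)
        cluster_id += 1
    return labels

# Hypothetical toy data: two dense groups and one isolated point.
X = [[1, 1], [2, 1], [1, 2], [2, 2], [8, 8], [9, 8], [8, 9], [5, 15]]
labels = dbscan(X, eps=2, min_pts=3)
print(labels.tolist())  # [0, 0, 0, 0, 1, 1, 1, -1]
```

The first four points form cluster 0, the next three form cluster 1, and the last point has no neighbors within Eps, so it stays labelled as noise.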


(16)

Determining MinPts

• The larger the data set, the larger the value of MinPts should be

• If the data set is noisier, choose a larger value of MinPts.

• Generally, MinPts should be greater than or equal to the dimensionality of the dataset

• For 2-dimensional data, use DBSCAN’s default value of MinPts = 4 (Ester et al., 1996).

• If your data has more than 2 dimensions, choose MinPts = 2*dim, where dim = the dimensions of your data set (Sander et al., 1998).
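The last two rules of thumb can be written as a small helper. This is only a sketch of the heuristic as stated above. The function name is hypothetical, and applying MinPts = 4 to one-dimensional data is an assumption, since the slide only covers two dimensions and more.

```python
def suggest_min_pts(n_features):
    """Rule-of-thumb MinPts: 4 for data with up to 2 dimensions, otherwise 2 * dim."""
    if n_features < 1:
        raise ValueError("n_features must be at least 1")
    if n_features <= 2:
        return 4  # value suggested for 2-dimensional data (assumed for 1-D as well)
    return 2 * n_features

print(suggest_min_pts(2))  # 4
print(suggest_min_pts(5))  # 10
```

Treat the result as a starting point and raise it for larger or noisier datasets, as the earlier bullets suggest.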

(17)

Determining Eps

• Cluster: Point density higher than specified by Eps and MinPts

• Heuristic: look at the distances to the k-nearest neighbors

• Function k-distance(p): distance from p to its k-th nearest neighbor

[Figure: 3-distance(p) and 3-distance(q) for two example points p and q]

(18)

Determining Eps

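• The value of k can be equal to or greater than MinPts.

• k-distance plot: plot the k-distances of all points, sorted in increasing order, then look for the elbow of the curve. The k-distance at the elbow is a candidate value for Eps.

[Figure: k-distance plot for k = 4; the elbow suggests Eps ≈ 29]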

(19)

When DBSCAN Works Well

• Resistant to Noise

• Can handle clusters of different shapes and sizes

[Figure: original points and the clusters found by DBSCAN]

(20)

When DBSCAN Does Not Work Well

• Varying Densities

• High Dimensional Data

[Figure: original points, and DBSCAN results for (MinPts=4, Eps=9.75) and (MinPts=4, Eps=9.92)]

(21)

DBSCAN using sklearn
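A minimal sketch of running DBSCAN with scikit-learn is given below. The six-point toy array and the parameter values are assumptions chosen for illustration, not the dataset used in the lecture. In scikit-learn, `eps` corresponds to Eps and `min_samples` to MinPts, where `min_samples` counts the point itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two small groups and one far-away point.
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

model = DBSCAN(eps=3, min_samples=2).fit(X)
labels = model.labels_  # one label per sample; -1 marks noise

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print(labels)                # cluster labels: 0, 0, 0, 1, 1, -1
print(n_clusters, n_noise)   # 2 clusters, 1 noise point
print(model.core_sample_indices_)  # indices of the core points
```

Cluster labels start at 0, and the label -1 is reserved for noise, so it has to be excluded when counting clusters.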


• There are four clusters (labels 0, 1, 2, 3)

• Noise is represented by the label -1


(26)

Assignment

• Download the dataset here:

https://drive.google.com/file/d/1NLKHMoq71lNdoSdnq6D1uwjD88RSwkAz/view?usp=sharing

• Suppose you are working for a government that wants to launch a development program for society. The program is expected to empower people to improve their welfare.

• Given the sample dataset people.csv (that you have downloaded), you are asked to segment the data. Use K-Means and DBSCAN to segment the data.

(27)

Assignment

• The data attributes are: name, age, marital status, income range, gender, total children, children at home, education, occupation, home owner, cars.

• Perform data pre-processing that you think is necessary.

• You may define your own variables, choose the number of segments and attributes that are used.

• Justify your choice of variables, and show (or visualize) the segments you create, with at least a brief explanation for both K-Means and DBSCAN.

• Evaluate your segments using the Silhouette Score. A higher Silhouette Score indicates a model with better-defined segments.

(28)

Assignment

• The Silhouette Coefficient is defined for each sample and is composed of two scores:

– $a$: the mean distance between a sample and all other points in the same cluster.

– $b$: the mean distance between a sample and all other points in the next nearest cluster.

• The Silhouette Coefficient for a single sample is then given as: $s = \dfrac{b - a}{\max(a, b)}$

• The Silhouette Score is the mean Silhouette Coefficient over all samples.

• The best value is 1 and the worst value is -1.

• Values near 0 indicate overlapping clusters.

• Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
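The Silhouette Score for both models can be computed with scikit-learn's `silhouette_score`. The sketch below uses synthetic blobs as stand-in data, which is an assumption: replace `X` with your own pre-processed features from people.csv. The parameter values are illustrative only. Dropping DBSCAN's noise points before scoring is one possible choice, not a rule from the slides, and the score is only defined when at least two clusters remain.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption): three well-separated groups of 50 points.
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 0], [5, 10]],
                  cluster_std=0.6, random_state=0)

# K-Means: every sample gets a cluster, so all samples are scored.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
kmeans_score = silhouette_score(X, kmeans_labels)

# DBSCAN: label -1 is noise; here it is excluded before scoring.
dbscan_labels = DBSCAN(eps=1.0, min_samples=4).fit_predict(X)
mask = dbscan_labels != -1
n_clusters = len(set(dbscan_labels[mask]))
dbscan_score = (silhouette_score(X[mask], dbscan_labels[mask])
                if n_clusters >= 2 else None)

print("K-Means silhouette:", kmeans_score)
print("DBSCAN silhouette (noise excluded):", dbscan_score)
```

When comparing the two scores, report how many points DBSCAN labelled as noise, because excluding them tends to make its score look better.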
