Program Studi Teknik Informatika
Fakultas Teknik – Universitas Surabaya
Clustering: DBSCAN
Week 10
1604C055 - Machine Learning
What is DBSCAN Clustering?
• DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise
• Proposed by Ester, Kriegel, Sander, and Xu (KDD96)
• Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points.
• Discovers clusters of arbitrary shape in spatial databases with noise
Density-Based Clustering Methods
• Density is the number of points within a specified radius r (Eps)
• In density-based clustering we partition points into dense regions separated by not-so-dense regions.
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters
DBSCAN Parameter
DBSCAN requires two input parameters before performing the clustering process: Epsilon (Eps) and Minimum Points (MinPts).
• Eps is a fixed value for the maximum distance between two points for them to be considered neighbors.
– In the figure, the dotted line represents that distance.
– Eps acts like a radius that extends from a data point in all directions, forming a perimeter around that data point.
DBSCAN Parameter
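• MinPts is the minimum number of points that must lie within the Eps-neighborhood of a point for that point to count as part of a dense region.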
Eps-Neighborhood
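• The Eps-neighborhood of a point p is the set of all points q in the data set D whose distance from p is at most Eps: N_Eps(p) = {q ∈ D | dist(p, q) <= Eps}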
Characterization of Points
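• Each point is characterized as one of three types:
– Core point: its Eps-neighborhood contains at least MinPts points.
– Border point: not a core point, but lies within the Eps-neighborhood of a core point.
– Noise point: neither a core point nor a border point.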
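The three point types can be illustrated with a minimal plain-Python sketch. The helper names `eps_neighborhood` and `classify_points` and the sample points are hypothetical, and the sketch assumes that a point counts as a member of its own Eps-neighborhood (the convention in Ester et al. and in sklearn).

```python
from math import dist

def eps_neighborhood(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""
    neighborhoods = [eps_neighborhood(points, i, eps) for i in range(len(points))]
    is_core = [len(n) >= min_pts for n in neighborhoods]
    labels = []
    for i, n in enumerate(neighborhoods):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in n):
            labels.append("border")  # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels

# Illustrative data only (not taken from the slides)
points = [(0, 0), (1, 0), (0, 1), (2.5, 0), (10, 10)]
labels = classify_points(points, eps=2, min_pts=3)
print(labels)  # ['core', 'core', 'core', 'border', 'noise']
```

The point (2.5, 0) has only two points in its neighborhood, but one of them is a core point, so it becomes a border point rather than noise.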
Reachability
• Directly density-reachable:
– An object (or instance) q is directly density-reachable from object p if q is within the Eps-neighborhood of p and p is a core point.
– Here, direct density-reachability is not symmetric: object p is not directly density-reachable from object q, as q is not a core point.
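The asymmetry can be checked with a small sketch. The function names and sample points are hypothetical; a point is again assumed to count toward its own neighborhood.

```python
from math import dist

def is_core(points, p, eps, min_pts):
    """True if points[p] has at least min_pts points within eps (itself included)."""
    return sum(dist(points[p], q) <= eps for q in points) >= min_pts

def directly_density_reachable(points, q, p, eps, min_pts):
    """True if points[q] is directly density-reachable from points[p]."""
    return dist(points[p], points[q]) <= eps and is_core(points, p, eps, min_pts)

# Illustrative data: index 1 is a core point, index 3 is a border point
points = [(0, 0), (1, 0), (0, 1), (2.5, 0)]
forward = directly_density_reachable(points, q=3, p=1, eps=2, min_pts=3)
backward = directly_density_reachable(points, q=1, p=3, eps=2, min_pts=3)
print(forward, backward)  # True False
```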
Reachability
• Density-reachable:
– An object q is density-reachable from p w.r.t. Eps and MinPts if there is a chain of objects q1, q2, …, qn, with q1 = p and qn = q, such that qi+1 is directly density-reachable from qi w.r.t. Eps and MinPts for all 1 <= i <= n - 1.
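One way to test for such a chain is a breadth-first search that only continues through core points. This is a sketch under that reading of the definition; `density_reachable` and the sample points are hypothetical names and data.

```python
from collections import deque
from math import dist

def density_reachable(points, p, q, eps, min_pts):
    """True if points[q] is density-reachable from points[p]."""
    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    visited = {p}
    queue = deque([p])
    while queue:
        current = queue.popleft()
        nbrs = neighbors(current)
        if len(nbrs) < min_pts:
            continue  # not a core point: the chain cannot continue from here
        for j in nbrs:
            if j == q:
                return True
            if j not in visited:
                visited.add(j)
                queue.append(j)
    return False

# Illustrative points on a line; indices 0 and 5 are border points
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5.5, 0)]
print(density_reachable(points, p=1, q=5, eps=1.5, min_pts=3))  # True
print(density_reachable(points, p=5, q=1, eps=1.5, min_pts=3))  # False
```

The second call returns False because the chain would have to start at a non-core point, which shows that density-reachability is not symmetric either.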
Connectivity
• Density connectivity: Object q is density-connected to object p w.r.t. Eps and MinPts if there is an object o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.
• Density connectivity is symmetric: if object q is density-connected to object p, then object p is also density-connected to object q.
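A brute-force sketch of this definition tries every object o and checks whether both p and q lie in the set of objects density-reachable from o. The helper names and data are hypothetical, and the approach is meant for illustration only, not efficiency.

```python
from math import dist

def reachable_set(points, o, eps, min_pts):
    """All indices density-reachable from points[o] (empty if o is not a core point)."""
    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    reached, expanded, frontier = set(), set(), [o]
    while frontier:
        current = frontier.pop()
        if current in expanded:
            continue
        expanded.add(current)
        nbrs = neighbors(current)
        if len(nbrs) < min_pts:
            continue  # only core points extend the chain
        for j in nbrs:
            reached.add(j)
            if j not in expanded:
                frontier.append(j)
    return reached

def density_connected(points, p, q, eps, min_pts):
    """True if some object o density-reaches both points[p] and points[q]."""
    return any({p, q} <= reachable_set(points, o, eps, min_pts)
               for o in range(len(points)))

# Illustrative data: indices 0 and 5 are border points, index 6 is isolated
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5.5, 0), (20, 0)]
print(density_connected(points, 0, 5, eps=1.5, min_pts=3))  # True
print(density_connected(points, 0, 6, eps=1.5, min_pts=3))  # False
```

The two border points are not density-reachable from each other, yet they are density-connected through the core points between them.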
Cluster and Noise
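• A cluster is a maximal set of density-connected points; noise is the set of points that do not belong to any cluster.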
DBSCAN Steps
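• Pick a point x that has not yet been classified.
• If x is a core object, collect all objects density-reachable from x and assign them to a new cluster.
• Otherwise, assign x to noise.
• Repeat until every point has been classified.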
Example
• Parameter
– Eps = 2
– MinPts = 3
for each x ∈ D do
    if x is not yet classified then
        if x is a core-object then
            collect all objects density-reachable from x and assign them to a new cluster
        else
            assign x to NOISE
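The pseudocode can be turned into a short plain-Python sketch. This is one possible reading, not a reference implementation: the function name `dbscan`, the integer cluster labels, and the sample points are assumptions, and a point labeled noise is allowed to become a border point later if a core point reaches it. Only Eps = 2 and MinPts = 3 come from the example.

```python
from math import dist

NOISE = -1
UNCLASSIFIED = None

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, 2, ... for clusters, -1 for noise."""
    labels = [UNCLASSIFIED] * len(points)

    def region_query(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for x in range(len(points)):
        if labels[x] is not UNCLASSIFIED:
            continue
        neighbors = region_query(x)
        if len(neighbors) < min_pts:
            labels[x] = NOISE  # may be relabeled as a border point later
            continue
        # x is a core object: collect everything density-reachable from it
        labels[x] = cluster_id
        seeds = [j for j in neighbors if j != x]
        while seeds:
            y = seeds.pop()
            if labels[y] == NOISE:
                labels[y] = cluster_id  # border point reached from a core point
            if labels[y] is not UNCLASSIFIED:
                continue
            labels[y] = cluster_id
            y_neighbors = region_query(y)
            if len(y_neighbors) >= min_pts:
                seeds.extend(y_neighbors)  # y is core, so keep expanding
        cluster_id += 1
    return labels

# Illustrative data (not the points shown in the slides)
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (11, 10), (10, 11), (5, 5)]
print(dbscan(points, eps=2, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, -1]
```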
Determining MinPts
• The larger the data set, the larger the value of MinPts should be
• If the data set is noisier, choose a larger value of MinPts.
• Generally, MinPts should be at least the dimensionality of the data set plus one (MinPts >= dim + 1).
• For 2-dimensional data, use DBSCAN’s default value of MinPts = 4 (Ester et al., 1996).
• If your data has more than 2 dimensions, choose MinPts = 2*dim, where dim = the dimensions of your data set (Sander et al., 1998).
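The last two rules of thumb can be written as a tiny helper. The function name `suggest_min_pts` is hypothetical, and treating 1-dimensional data the same as 2-dimensional data is an assumption not stated above. The result is only a starting value to be adjusted for data set size and noise.

```python
def suggest_min_pts(n_dims):
    """Starting value for MinPts from the rules of thumb above."""
    if n_dims <= 2:  # assumption: 1-D data is handled like 2-D data
        return 4
    return 2 * n_dims

print(suggest_min_pts(2))  # 4
print(suggest_min_pts(5))  # 10
```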
Determining Eps
• Cluster: Point density higher than specified by Eps and MinPts
• Heuristic: look at the distances to the k-nearest neighbors
• Function k-distance(p): the distance from p to its k-th nearest neighbor
[Figure: 3-distance(p) and 3-distance(q) for two example points p and q]
Determining Eps
• The value of k can be equal to or greater than MinPts.
• k-distance plot: plot the k-distances of all points, sorted in increasing order, then find the elbow.
[Figure: k-distance plot for k = 4; the elbow suggests Eps ≈ 29]
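The sorted k-distance curve can be computed with NumPy as in the sketch below. The function name `k_distances` and the synthetic data are assumptions; the data is not the data set behind the figure, so the elbow will sit at a different value.

```python
import numpy as np

def k_distances(X, k):
    """Sorted distance from each point to its k-th nearest neighbor (self excluded)."""
    X = np.asarray(X, dtype=float)
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))  # full pairwise distance matrix
    dists.sort(axis=1)  # column 0 is each point's distance to itself (0)
    return np.sort(dists[:, k])

# Synthetic data: one dense blob plus a few scattered outliers
rng = np.random.default_rng(0)
blob = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
outliers = rng.uniform(low=-10, high=10, size=(5, 2))
X = np.vstack([blob, outliers])

kd = k_distances(X, k=4)
print(kd[:3], kd[-3:])  # smallest and largest 4-distances
```

Plotting `kd` against its index (for example with matplotlib's `plt.plot(kd)`) gives the k-distance plot; the value where the curve bends sharply upward is a candidate for Eps. The pairwise matrix needs O(n²) memory, so this form only suits small data sets.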
When DBSCAN Works Well
• Resistant to Noise
• Can handle clusters of different shapes and sizes
[Figure: original points and the clusters found by DBSCAN]
When DBSCAN Does Not Work Well
• Varying Densities
• High Dimensional Data
[Figure: original points, and the DBSCAN results for (MinPts=4, Eps=9.75) and (MinPts=4, Eps=9.92)]
DBSCAN using sklearn
• There are four clusters, labeled 0, 1, 2, and 3
• Noise represented by (-1)
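The cluster labels and the noise label can be inspected with a sketch like the following. It uses synthetic data from `make_blobs`, not the data set from the slides, so the number of clusters differs; the parameter values are illustrative. In sklearn, Eps is passed as `eps` and MinPts as `min_samples`, which counts the point itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data: three well-separated blobs plus two far-away points
X, _ = make_blobs(n_samples=300, centers=[(0, 0), (10, 10), (-10, 10)],
                  cluster_std=0.8, random_state=42)
X = np.vstack([X, [[30.0, 30.0], [-30.0, -30.0]]])

model = DBSCAN(eps=2, min_samples=4)
labels = model.fit_predict(X)  # one label per sample; -1 marks noise

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("clusters:", n_clusters)
print("noise points:", n_noise)
```

The same labels are also available as `model.labels_` after fitting.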
Assignment
• Download the dataset here:
https://drive.google.com/file/d/1NLKHMoq71lNdoSdnq6D1uwjD88RSwkAz/view?usp=sharing
• Suppose you work for a government that wants to launch a development program for society. The program is expected to empower people to improve their welfare.
• Given the sample dataset people.csv (which you have downloaded), you are asked to segment the data. Use K-Means and DBSCAN to segment the data.
Assignment
• The data attributes are: name, age, marital status, income range, gender, total children, children at home, education, occupation, home owner, cars.
• Perform whatever data pre-processing you think is necessary.
• You may define your own variables and choose the number of segments and the attributes to use.
• Explain why you chose those variables, and show (or visualize) the segments you create with at least a brief explanation (for both K-Means and DBSCAN).
• Evaluate your segments using the Silhouette Score. A higher Silhouette Score indicates a model with better-defined segments.
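The evaluation step might look like the sketch below. It runs on synthetic data rather than people.csv, and every parameter value is illustrative. Excluding DBSCAN's noise points before scoring is one possible choice, not a requirement of the assignment, and it should be stated in the report if used.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the pre-processed numeric features
X, _ = make_blobs(n_samples=300, centers=[(0, 0), (10, 10), (-10, 10)],
                  cluster_std=0.8, random_state=0)
X = StandardScaler().fit_transform(X)  # put all features on the same scale

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
kmeans_score = silhouette_score(X, kmeans_labels)

dbscan_labels = DBSCAN(eps=0.5, min_samples=4).fit_predict(X)
mask = dbscan_labels != -1  # assumption: score only the clustered points
dbscan_score = silhouette_score(X[mask], dbscan_labels[mask])

print("K-Means silhouette:", round(kmeans_score, 3))
print("DBSCAN silhouette:", round(dbscan_score, 3))
```

`silhouette_score` needs at least two distinct labels, so it raises an error if DBSCAN puts every point into a single cluster.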
Assignment
• The Silhouette Coefficient is defined for each sample and is composed of two scores:
– a: the mean distance between a sample and all other points in the same class.
– b: the mean distance between a sample and all other points in the next nearest cluster.
• The Silhouette Coefficient for a single sample is then given as: s = (b - a) / max(a, b)
• The Silhouette Score is the mean Silhouette Coefficient over all samples.
• The best value is 1 and the worst value is -1.
• Values near 0 indicate overlapping clusters.
• Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
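The coefficient for one sample can be worked out by hand in a small sketch. It uses the standard formula s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the mean distance to the points in the nearest other cluster. The function name and the points are hypothetical.

```python
from math import dist

def silhouette_sample(point, own_cluster, other_clusters):
    """Silhouette Coefficient of one sample.

    own_cluster: the other points in the sample's cluster (sample excluded).
    other_clusters: a list of clusters, each a list of points.
    """
    a = sum(dist(point, p) for p in own_cluster) / len(own_cluster)
    b = min(sum(dist(point, p) for p in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)

s = silhouette_sample(
    (0, 0),
    own_cluster=[(0, 1), (1, 0)],         # a = (1 + 1) / 2 = 1
    other_clusters=[[(10, 0), (0, 10)]],  # b = (10 + 10) / 2 = 10
)
print(s)  # 0.9
```

A value of 0.9 is close to the best value of 1: the sample is much nearer to its own cluster than to the next nearest one.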