
Machine Learning: Clustering: DBSCAN

Khairul Huda

Academic year: 2023


(1)

Program Studi Teknik Informatika

Fakultas Teknik – Universitas Surabaya

Clustering: DBSCAN

Week 10

1604C055 - Machine Learning

(2)

What is DBSCAN Clustering?

• DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise

• Proposed by Ester, Kriegel, Sander, and Xu (KDD96)

• Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points.

• Discovers clusters of arbitrary shape in spatial databases with noise

(3)

Density-Based Clustering Methods

• Density is the number of points within a specified radius r (Eps)

• In density-based clustering we partition points into dense regions separated by not-so-dense regions.

• Major features:

– Discover clusters of arbitrary shape

– Handle noise

– One scan

– Need density parameters

(4)

DBSCAN Parameter

DBSCAN requires two input parameters before clustering can start: Epsilon (Eps) and Minimum Points (MinPts).

Eps is a fixed value for the maximum distance between two points for them to be considered neighbors.

– In the figure, the dotted line represents that distance.

– Eps acts like a radius that extends from a data point in all directions, forming a perimeter around that point.

(5)

DBSCAN Parameter

(6)

Eps-Neighborhood

(7)

Characterization of Points

(8)

Reachability

Directly density-reachable:

– An object (or instance) q is directly density-reachable from object p if q is within the Eps-neighborhood of p and p is a core point.

– Direct density-reachability is not symmetric. In the example, object p is not directly density-reachable from object q because q is not a core point.
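The definition above can be checked numerically. The following is a minimal sketch, not the lecture's own code: the helper names and the toy data are hypothetical, distances are assumed to be Euclidean, and a point is assumed to count as a member of its own Eps-neighborhood.

```python
import numpy as np

def eps_neighborhood(X, i, eps):
    """Indices of all points within distance eps of point i (point i included)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    return np.flatnonzero(dists <= eps)

def is_core(X, i, eps, min_pts):
    """Point i is a core point if its Eps-neighborhood holds at least min_pts points."""
    return len(eps_neighborhood(X, i, eps)) >= min_pts

def directly_density_reachable(X, q, p, eps, min_pts):
    """True if q lies in the Eps-neighborhood of p and p is a core point."""
    return is_core(X, p, eps, min_pts) and q in eps_neighborhood(X, p, eps)

# Hypothetical toy data: three points close together and one point on the edge.
X = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [1.4, 0.0]])
eps, min_pts = 1.0, 3

# Point 3 is reachable from the core point 1, but not the other way round.
print(directly_density_reachable(X, q=3, p=1, eps=eps, min_pts=min_pts))
print(directly_density_reachable(X, q=1, p=3, eps=eps, min_pts=min_pts))
```

Point 1 has four points within Eps and is therefore a core point, while point 3 has only two, which is why the relation holds in one direction only.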

(9)

Reachability

Density-reachable:

– An object q is density-reachable from p w.r.t. Eps and MinPts if there is a chain of objects q1, q2, …, qn, with q1 = p and qn = q, such that qi+1 is directly density-reachable from qi w.r.t. Eps and MinPts for all 1 <= i < n.
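One way to test density-reachability is to follow chains of directly density-reachable steps with a breadth-first search. This is a sketch under assumptions: the function names and the one-dimensional toy data are hypothetical, and distances are Euclidean.

```python
from collections import deque
import numpy as np

def neighbors(X, i, eps):
    """Indices of all points within distance eps of point i (point i included)."""
    return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

def density_reachable(X, q, p, eps, min_pts):
    """True if a chain of directly density-reachable steps leads from p to q."""
    seen, queue = {p}, deque([p])
    while queue:
        cur = queue.popleft()
        nbrs = neighbors(X, cur, eps)
        if len(nbrs) < min_pts:
            continue  # cur is not a core point, so the chain cannot continue from it
        for n in nbrs:
            n = int(n)
            if n == q:
                return True
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return False

# Hypothetical toy data: five points on a line, one unit apart.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
eps, min_pts = 1.0, 3

print(density_reachable(X, q=4, p=1, eps=eps, min_pts=min_pts))  # chain 1 -> 2 -> 3 -> 4
print(density_reachable(X, q=1, p=4, eps=eps, min_pts=min_pts))  # 4 is not a core point
```

The end point 4 has only two points in its neighborhood, so it can be reached from the core points but cannot start a chain itself.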

(10)

Connectivity

Density connectivity: Object q is density-connected to object p w.r.t. Eps and MinPts if there is an object o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.

Density connectivity is symmetric: if object q is density-connected to object p, then object p is also density-connected to object q.

(11)

Cluster and Noise

(12)

DBSCAN Steps

(13)

Example

• Parameter

Eps = 2, MinPts = 3

for each x ∈ D do
    if x is not yet classified then
        if x is a core object then
            collect all objects density-reachable from x and assign them to a new cluster
        else
            assign x to NOISE
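The pseudocode can be turned into a small runnable version. The sketch below is one possible reading, not the original authors' implementation: the function names, the label constants and the toy data are hypothetical, distances are Euclidean, and a point first marked as noise is relabelled as a border point if a later cluster reaches it.

```python
import numpy as np

NOISE, UNCLASSIFIED = -1, -2

def region_query(X, i, eps):
    """Indices of all points within distance eps of point i (point i included)."""
    return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

def dbscan(X, eps, min_pts):
    labels = np.full(len(X), UNCLASSIFIED)
    cluster_id = 0
    for x in range(len(X)):
        if labels[x] != UNCLASSIFIED:
            continue
        nbrs = region_query(X, x, eps)
        if len(nbrs) < min_pts:
            labels[x] = NOISE  # may be relabelled later as a border point
            continue
        # x is a core object: collect everything density-reachable from it.
        labels[x] = cluster_id
        seeds = list(nbrs)
        while seeds:
            y = seeds.pop()
            if labels[y] == NOISE:
                labels[y] = cluster_id  # border point of the current cluster
            if labels[y] != UNCLASSIFIED:
                continue
            labels[y] = cluster_id
            y_nbrs = region_query(X, y, eps)
            if len(y_nbrs) >= min_pts:  # only core objects extend the cluster
                seeds.extend(y_nbrs)
        cluster_id += 1
    return labels

# Hypothetical toy data: two dense groups and one isolated point.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
              [10, 10], [11, 10], [10, 11],
              [5, 5]], dtype=float)

labels = dbscan(X, eps=2, min_pts=3)
print(labels.tolist())
```

With the slide's parameters (Eps = 2, MinPts = 3), the first four points form one cluster, the next three form a second, and the isolated point is labelled as noise (-1).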

(14)

Example


(15)

Example


(16)

Determining MinPts

• The larger the data set, the larger the value of MinPts should be

• If the data set is noisier, choose a larger value of MinPts.

• Generally, MinPts should be greater than or equal to the dimensionality of the dataset

• For 2-dimensional data, use DBSCAN’s default value of MinPts = 4 (Ester et al., 1996).

• If your data has more than 2 dimensions, choose MinPts = 2*dim, where dim = the dimensions of your data set (Sander et al., 1998).
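The last two rules of thumb can be written as a tiny helper. The function name is hypothetical, and the result is only a starting value that still has to be adjusted for data set size and noise level, as the first two bullets point out.

```python
def suggest_min_pts(n_dims):
    """Starting value for MinPts: 4 for data with up to 2 dimensions, else 2 * dim."""
    return 4 if n_dims <= 2 else 2 * n_dims

print(suggest_min_pts(2))
print(suggest_min_pts(5))
```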

(17)

Determining Eps

• Cluster: Point density higher than specified by Eps and MinPts

• Heuristic: look at the distances to the k-nearest neighbors

• Function k-distance(p): distance from p to its k-th nearest neighbor

[Figure: 3-distance(p) and 3-distance(q) for two example points p and q]

(18)

Determining Eps

• The value of k can be equal to or greater than MinPts.

• k-distance plot: compute the k-distance of every point, sort the values in increasing order, and look for the elbow of the curve.

[Figure: k-distance plot with k = 4; the elbow suggests Eps ≈ 29]
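The sorted k-distances behind such a plot can be computed with NumPy alone. This is a sketch under assumptions: the function name and the random toy data are hypothetical, distances are Euclidean, the point itself is not counted as one of its k neighbors, and the full pairwise distance matrix is built, which is only practical for small data sets.

```python
import numpy as np

def k_distances(X, k):
    """Distance from every point to its k-th nearest neighbor, sorted in increasing order."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))  # full pairwise distance matrix
    dists.sort(axis=1)                          # column 0 is the point itself (distance 0)
    return np.sort(dists[:, k])

# Hypothetical toy data: two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])

kd = k_distances(X, k=4)
print(kd[:3], kd[-3:])  # smallest and largest 4-distances
```

Plotting `kd` against its index (for example with matplotlib) gives the k-distance curve; the value at which the curve bends sharply upward is a candidate for Eps.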

(19)

When DBSCAN Works Well

• Resistant to Noise

• Can handle clusters of different shapes and sizes

[Figure: original points and the resulting clusters]

(20)

When DBSCAN Does Not Work Well

• Varying Densities

• High Dimensional Data

[Figure: original points, and clustering results with (MinPts=4, Eps=9.75) and (MinPts=4, Eps=9.92)]

(21)

DBSCAN using sklearn

(22)

DBSCAN using sklearn

(23)

DBSCAN using sklearn

(24)

DBSCAN using sklearn

• There are four clusters (0, 1, 2, 3)

• Noise represented by (-1)
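Since the code on these slides did not survive as text, the following is a minimal sketch of how cluster labels and noise can be read from scikit-learn's `DBSCAN`. It is not the lecture's original code; the toy data and parameter values are hypothetical. In scikit-learn, `eps` corresponds to Eps and `min_samples` to MinPts (the point itself is counted).

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two dense groups and one isolated point.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
              [10, 10], [11, 10], [10, 11],
              [5, 5]], dtype=float)

model = DBSCAN(eps=2.0, min_samples=3).fit(X)
labels = model.labels_  # one label per sample; -1 marks noise

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print("labels:", labels.tolist())
print("clusters:", n_clusters, "noise points:", n_noise)
```

Cluster labels start at 0, so the number of clusters is the number of distinct labels excluding -1.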


(25)

DBSCAN using sklearn

(26)

Assignment

• Download dataset here:

https://drive.google.com/file/d/1NLKHMoq71lNdoSdnq6D1uwjD88RS wkAz/view?usp=sharing

• Suppose you work for a government that wants to launch a development program for society. The program is expected to empower people to improve their welfare.

• Given the sample dataset people.csv (which you have downloaded), segment the data using both K-Means and DBSCAN.

(27)

Assignment

• The data attributes are: name, age, marital status, income range, gender, total children, children at home, education, occupation, home owner, cars.

• Perform data pre-processing that you think is necessary.

• You may define your own variables, choose the number of segments and attributes that are used.

• Justify your choice of variables, and show (or visualize) the segments you create, with at least a brief explanation for both K-Means and DBSCAN.

• Evaluate your segments using the Silhouette Score. A higher Silhouette Score indicates a model with better-defined segments.

(28)

Assignment

• The Silhouette Coefficient is defined for each sample and is composed of two scores:

– a: the mean distance between a sample and all other points in the same cluster.

– b: the mean distance between a sample and all other points in the next nearest cluster.

• The Silhouette Coefficient s for a single sample is then given as:

s = (b - a) / max(a, b)

• The Silhouette Score is the mean Silhouette Coefficient over all samples.

• The best value is 1 and the worst value is -1.

• Values near 0 indicate overlapping clusters.

• Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
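To make the definition concrete, the coefficient can be computed by hand for a small example. This is a sketch under assumptions: the function name and the toy data are hypothetical, distances are Euclidean, and every cluster is assumed to contain at least two samples. In practice `sklearn.metrics.silhouette_score` computes the mean over all samples directly.

```python
import numpy as np

def silhouette_sample(X, labels, i):
    """Silhouette Coefficient s = (b - a) / max(a, b) for sample i."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # the sample itself is excluded from a
    a = dists[same].mean()
    # b: smallest mean distance to the points of any other cluster
    b = min(dists[labels == c].mean()
            for c in set(labels.tolist()) if c != labels[i])
    return (b - a) / max(a, b)

# Hypothetical toy data: two well-separated clusters on a line.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

scores = [silhouette_sample(X, labels, i) for i in range(len(X))]
print(scores)
print("Silhouette Score:", np.mean(scores))
```

For sample 0, a = 1 and b = 10.5, so s = 9.5 / 10.5, which is close to 1 as expected for well-separated clusters. When scoring DBSCAN output, note that noise points carry the label -1 and would be treated as one more cluster unless they are filtered out first.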

(29)

Assignment
