
(1)

clustering

(2)

K-means Clustering

• Strengths

– Simple iterative method

– User provides “K”

• Weaknesses

– Often too simple → bad results

– Difficult to guess the correct “K”

(3)

K-means Clustering

Basic Algorithm:

• Step 0: select K

• Step 1: randomly select initial cluster seeds

[Figure: data values on a line with the initial seeds marked — Seed 1 = 650, Seed 2 = 200]

(4)

K-means Clustering

• An initial cluster seed represents the “mean value” of its cluster.

• In the preceding figure:

– Cluster seed 1 = 650

– Cluster seed 2 = 200

(5)

K-means Clustering

• Step 2: calculate distance from each object to each cluster seed.

• What type of distance should we use?

– Squared Euclidean distance

• Step 3: Assign each object to the closest cluster

(6)

K-means Clustering

[Figure: each object assigned to its closest seed, forming the Seed 1 and Seed 2 clusters]

(7)

K-means Clustering

• Step 4: Compute the new centroid for each cluster

– Cluster 1: new centroid = 708.9

– Cluster 2: new centroid = 214.2

(8)

K-means Clustering

• Iterate:

– Calculate distance from objects to cluster centroids.

– Assign objects to closest cluster

– Recalculate new centroids

• Stop based on convergence criteria

– No change in clusters

– Max iterations
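The full loop can be written down compactly. A minimal numpy sketch of the algorithm as described above (not a reference implementation from the lecture); it assumes the observations are the rows of a 2-D array X and uses squared Euclidean distance:

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    # Basic K-means: squared Euclidean distance, stop when assignments
    # no longer change or after max_iter iterations.
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k observations as the initial cluster seeds.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Steps 2-3: squared Euclidean distance to each centroid, assign to closest.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # convergence: no change in clusters
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its assigned objects.
        for j in range(k):
            if np.any(labels == j):  # guard against an empty cluster
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids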

(9)

K-means Issues

• Distance measure is squared Euclidean

– Scale should be similar in all dimensions

• Rescale data?

– Not good for nominal data. Why?

• Approach tries to minimize the within-cluster sum of squares error (WCSS)

– Implicit assumption that the SSE is similar for each group
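Rescaling usually means standardizing each dimension so that no single variable dominates the squared Euclidean distance. A minimal sketch (assuming observations are the rows of a 2-D numpy array X):

import numpy as np

def standardize(X):
    # Rescale every column (dimension) to mean 0 and standard deviation 1.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # leave constant columns unchanged
    return (X - mu) / sigma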

(10)

WCSS

• The overall WCSS is given by:

$\mathrm{WCSS} \;=\; \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^{2}$

where $\mu_i$ is the centroid (mean) of cluster $C_i$.
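The same quantity is easy to compute from a clustering result; a small numpy sketch, reusing the labels/centroids naming from the K-means sketch earlier:

import numpy as np

def wcss(X, labels, centroids):
    # Sum over clusters of the squared distances of members to their centroid.
    return sum(((X[labels == i] - c) ** 2).sum()
               for i, c in enumerate(centroids))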

(11)

Bottom Line

• K-means

– Easy to use

– Need to know K

– May need to scale data

– Good initial method

• Local optima

– No guarantee of optimal solution

– Repeat with different starting values
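In practice the usual remedy is to rerun K-means from several random starting values and keep the solution with the smallest WCSS. For example, scikit-learn's KMeans does this through its n_init parameter (shown here only as an illustration, not as part of the lecture):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))    # toy data
km = KMeans(n_clusters=3, n_init=10, random_state=0)  # 10 random restarts
labels = km.fit_predict(X)
print(km.inertia_)  # WCSS of the best of the 10 runs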

(12)–(18)

Mean shift

[Figure sequence: a region of interest, its center of mass, and the mean shift vector pointing from the window's center to that center of mass; across slides 12–18 the window is shifted along the mean shift vector, step by step, until it no longer moves.]
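The figure sequence corresponds to a very simple update rule: move the window to the mean of the points it currently covers, and repeat. A minimal sketch with a flat (uniform) kernel of radius bandwidth; the flat kernel and the names used here are assumptions for illustration:

import numpy as np

def mean_shift_mode(X, start, bandwidth, max_iter=100, tol=1e-6):
    # Track one window from `start` until the mean shift vector vanishes.
    center = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        # Points inside the region of interest (flat kernel of radius bandwidth).
        in_window = np.linalg.norm(X - center, axis=1) <= bandwidth
        if not np.any(in_window):
            break  # empty window: nothing to shift toward
        new_center = X[in_window].mean(axis=0)  # center of mass
        shift = new_center - center             # mean shift vector
        center = new_center
        if np.linalg.norm(shift) < tol:         # window has stopped moving
            break
    return center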

(19)

Kernel density estimation

[Slide showed the kernel density estimation function and the Gaussian kernel; standard forms are given below.]
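For reference, the standard definitions (the usual textbook notation, which may differ from the slide's) are:

$\hat{f}(x) = \dfrac{1}{n h^{d}} \sum_{i=1}^{n} K\!\left(\dfrac{x - x_i}{h}\right), \qquad K(u) = (2\pi)^{-d/2} \exp\!\left(-\tfrac{1}{2}\lVert u \rVert^{2}\right)$

with n observations $x_1, \ldots, x_n$ in $\mathbb{R}^d$, bandwidth $h$, and Gaussian kernel $K$.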

(20)

4. Ad-hoc I: Hierarchical clustering

Hierarchical versus Flat

Flat methods generate a single partition into k clusters. The number k of clusters has to be determined by the user ahead of time.

Hierarchical methods generate a hierarchy of partitions, i.e.

• a partition P1 into 1 cluster (the entire collection)

• a partition P2 into 2 clusters

• …

• a partition Pn into n clusters (each object forms its own cluster)

It is then up to the user to decide which of the partitions reflects actual sub-populations in the data.

(21)

Note: A sequence of partitions is called "hierarchical" if each cluster in a given partition is the union of clusters in the next larger partition.

[Figure: two sequences of partitions P4, P3, P2, P1 — top: a hierarchical sequence of partitions; bottom: a non-hierarchical sequence]

(22)

Hierarchical methods again come in two varieties, agglomerative and divisive.

Agglomerative methods:

• Start with partition Pn, where each object forms its own cluster.

• Merge the two closest clusters, obtaining Pn-1.

• Repeat merge until only one cluster is left.

Divisive methods

• Start with P1.

• Split the collection into two clusters that are as homogeneous (and as different from each other) as possible.

• Apply splitting procedure recursively to the clusters.

(23)

Note:

Agglomerative methods require a rule to decide which clusters to merge.

Typically one defines a distance between clusters and then merges the two clusters that are closest.

Divisive methods require a rule for splitting a cluster.

(24)

4.1 Hierarchical agglomerative clustering

Need to define a distance d(P,Q) between groups, given a distance measure d(x,y) between observations.

Commonly used distance measures:

1. d1(P,Q) = min d(x,y) for x in P, y in Q (single linkage)

2. d2(P,Q) = ave d(x,y) for x in P, y in Q (average linkage)

3. d3(P,Q) = max d(x,y) for x in P, y in Q (complete linkage)

4. $d_4(P,Q) = \lVert \bar{x}_P - \bar{x}_Q \rVert^{2}$ (centroid method)

5. $d_5(P,Q) = \dfrac{|P|\,|Q|}{|P| + |Q|} \, \lVert \bar{x}_P - \bar{x}_Q \rVert^{2}$ (Ward's method)

d5 is called Ward's distance.
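These between-cluster distances map onto the method argument of scipy's agglomerative clustering routine (up to scipy's exact distance-update conventions); a brief usage sketch with made-up data:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(30, 2))  # toy observations
# d1/d2/d3/d4/d5 above correspond to method='single', 'average',
# 'complete', 'centroid', and 'ward' respectively.
Z = linkage(X, method='ward')
labels = fcluster(Z, t=3, criterion='maxclust')    # cut into 3 clusters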

(25)

Motivation for Ward’s distance:

• Let $\mathcal{P}^k = \{P_1, \ldots, P_k\}$ be a partition of the observations into k groups.

• Measure goodness of a partition by the sum of squared distances of observations from their cluster means:

$\mathrm{RSS}(\mathcal{P}^k) \;=\; \sum_{i=1}^{k} \sum_{x_j \in P_i} \lVert x_j - \bar{x}_{P_i} \rVert^{2}$

• Consider all possible (k-1)-partitions obtainable from Pk by a merge

• Merging two clusters with smallest Ward’s distance optimizes goodness of new partition.
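The connection can be checked numerically: merging P and Q increases RSS by exactly d5(P,Q). A small numpy sketch of that check, with made-up toy clusters:

import numpy as np

def rss(X):
    # Sum of squared distances of the rows of X from their mean.
    return ((X - X.mean(axis=0)) ** 2).sum()

rng = np.random.default_rng(0)
P = rng.normal(size=(10, 2))
Q = rng.normal(loc=3.0, size=(15, 2))

ward = (len(P) * len(Q) / (len(P) + len(Q))) * \
       ((P.mean(axis=0) - Q.mean(axis=0)) ** 2).sum()
increase = rss(np.vstack([P, Q])) - rss(P) - rss(Q)
print(np.isclose(ward, increase))  # True: merging raises RSS by d5(P, Q)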

(26)

4.2 Hierarchical divisive clustering

There are divisive versions of single linkage, average linkage, and Ward’s method.

Divisive version of single linkage:

• Compute the minimal spanning tree (the graph connecting all the objects with the smallest total edge length).

• Break longest edge to obtain 2 subtrees, and a corresponding partition of the objects.

• Apply process recursively to the subtrees.

Agglomerative and divisive versions of single linkage give identical results (more later).
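The first split of this procedure can be carried out with scipy's minimum spanning tree; a sketch under the assumption that the full pairwise distance matrix fits in memory:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def single_linkage_split(X):
    # Split the rows of X into 2 groups by breaking the longest MST edge.
    D = squareform(pdist(X))                  # pairwise Euclidean distances
    mst = minimum_spanning_tree(D).toarray()  # MST edge lengths (0 = no edge)
    i, j = np.unravel_index(np.argmax(mst), mst.shape)
    mst[i, j] = 0.0                           # break the longest edge
    # The two remaining subtrees define the 2-cluster partition.
    _, labels = connected_components(mst, directed=False)
    return labels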

(27)

Divisive version of Ward’s method.

Given cluster R.

Need to find split of R into 2 groups P,Q to minimize

$\mathrm{RSS}(P, Q) \;=\; \sum_{i \in P} \lVert x_i - \bar{x}_P \rVert^{2} \;+\; \sum_{j \in Q} \lVert x_j - \bar{x}_Q \rVert^{2}$

or, equivalently, to maximize Ward’s distance between P and Q.

Note: No computationally feasible method to find optimal P, Q for large |R|. Have to use approximation.

(28)

Iterative algorithm to search for the optimal Ward's split:

• Project the observations in R onto the largest principal component.

• Split at the median to obtain initial clusters P, Q.

• Repeat {

Assign each observation to the cluster with the closest mean.

Re-compute the cluster means.

} Until convergence

Note:

• Each step reduces RSS(P, Q)

• No guarantee to find optimal partition.
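A sketch of this search in numpy (the principal component comes from an SVD of the centered data; variable names are mine, not the lecture's):

import numpy as np

def ward_split(R, max_iter=100):
    # Approximate the best 2-way Ward split of the rows of R.
    Xc = R - R.mean(axis=0)
    # Largest principal component = first right singular vector of the centered data.
    pc1 = np.linalg.svd(Xc, full_matrices=False)[2][0]
    proj = Xc @ pc1
    labels = (proj > np.median(proj)).astype(int)  # split at the median
    for _ in range(max_iter):
        # Assumes both clusters stay non-empty during the refinement.
        means = np.array([R[labels == g].mean(axis=0) for g in (0, 1)])
        d2 = ((R[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)             # assign to the closest mean
        if np.array_equal(new_labels, labels):
            break                                  # convergence
        labels = new_labels
    return labels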

(29)

Divisive version of average linkage

Algorithm DIANA (Struyf, Hubert, and Rousseeuw, p. 22)

(30)

4.3 Dendrograms

Result of hierarchical clustering can be represented as binary tree:

• Root of tree represents entire collection

• Terminal nodes represent observations

• Each interior node represents a cluster

• Each subtree represents a partition

Note: The tree defines many more partitions than the n-2 nontrivial ones constructed during the merge (or split) process.

Note: For HAC methods, the merge order defines a sequence of n subtrees of the full tree. For HDC methods a sequence of subtrees can be defined if there is a figure of merit for each split.

(31)

If the distance between daughter clusters is monotonically increasing as we move up the tree, we can draw a dendrogram:

y-coordinate of vertex = distance between daughter clusters.

[Figure: a point set (axes x[,1], x[,2], observations labeled 1–4) and the corresponding single linkage dendrogram; the y-axis gives the merge distances.]
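A dendrogram of this kind can be drawn directly from the linkage output; a brief sketch (the four 2-D points below are made up, not the ones in the original plot):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0.3, 0.4], [0.5, 0.5], [1.8, 2.1], [2.6, 2.8]])  # 4 observations
Z = linkage(X, method='single')     # single linkage merge heights
dendrogram(Z, labels=[1, 2, 3, 4])  # y-coordinate = distance between merged clusters
plt.ylabel('distance')
plt.show()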
