DATA MINING WITH CLUSTERING AND CLASSIFICATION
Overview
• Definition of Clustering
• Existing clustering methods
• Clustering examples
• Classification
Definition
• Clustering can be considered the most important unsupervised learning technique; like every other problem of this kind, it deals with finding a structure in a collection of unlabeled data.
• Clustering is “the process of organizing objects into
groups whose members are similar in some way”.
• A cluster is therefore a collection of objects that are "similar" to one another and "dissimilar" to the objects belonging to other clusters.
Why clustering?
A few good reasons ...
• Simplifications
• Pattern detection
Where to use clustering?
• Data mining
• Information retrieval
• Text mining
• Web analysis
• Marketing
Which method should I use?
• Type of attributes in the data
• Scalability to larger datasets
• Ability to work with irregular data
• Time cost
• Complexity
Major Existing clustering
methods
• Distance-based
• Hierarchical
Measuring Similarity
• Dissimilarity/Similarity metric: similarity is expressed in terms of a distance function d(i, j), which is typically a metric.
• There is a separate “quality” function that measures
the “goodness” of a cluster.
• The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical,
ordinal and ratio variables.
• Weights should be associated with different variables
based on applications and data semantics.
• It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.
Distance-based method
Hierarchical clustering
Agglomerative (bottom up)
1. Start with each point as a singleton cluster.
2. Recursively merge the most appropriate (closest) clusters.
3. Stop when k clusters are obtained.
Divisive (top down)
1. Start with one big cluster.
2. Recursively divide it into smaller clusters.
general steps of hierarchical
clustering
Given a set of N items to be clustered, and an N×N distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S.C. Johnson in 1967) is this:
• Start by assigning each item to its own cluster, so that if you have N items you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
• Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster less.
• Compute the distances (similarities) between the new cluster and each of the old clusters.
• Repeat steps 2 and 3 until all items are clustered into K clusters (see the sketch below).
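A minimal Python sketch of this procedure, assuming a precomputed distance matrix and the single-link rule (cluster distance = shortest distance between any pair of members); all names are illustrative.

```python
# Minimal sketch of agglomerative (bottom-up) hierarchical clustering.
# Assumes a precomputed symmetric distance matrix `dist` (a list of lists)
# and stops when `k` clusters remain.

def hierarchical_clustering(dist, k):
    n = len(dist)
    # Step 1: every item starts in its own singleton cluster.
    clusters = [[i] for i in range(n)]

    def cluster_distance(a, b):
        # Single-link rule: distance between clusters is the shortest
        # distance between any pair of their members.
        return min(dist[i][j] for i in a for j in b)

    while len(clusters) > k:
        # Step 2: find the closest (most similar) pair of clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        # Step 3: merge that pair and repeat until k clusters remain.
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```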
Exclusive vs. non
exclusive clustering
• In the first case data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster then it cannot be included in another cluster. A simple example is the separation of points achieved by a straight line on a two-dimensional plane.
• The second type, overlapping clustering, instead uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership.
Partitioning clustering
1. Divide the data into proper subsets.
2. Recursively go through each subset and relocate points between clusters (in contrast to the visit-once approach of hierarchical methods).
Probabilistic clustering
1. Data are assumed to be drawn from a mixture of probability distributions.
2. Use the mean and variance of each distribution as the parameters of a cluster (see the sketch below).
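A minimal sketch of probabilistic clustering with a Gaussian mixture model, assuming scikit-learn is available; the data points are made up for illustration.

```python
# Probabilistic clustering: fit a mixture of Gaussians and read off the
# per-cluster mean and (co)variance parameters.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 2.0], [1.2, 1.8], [8.0, 9.0], [8.3, 8.7]])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)      # most likely cluster for each point
print(labels)
print(gmm.means_)            # per-cluster mean (a cluster parameter)
print(gmm.covariances_)      # per-cluster covariance (variance parameter)
```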
Single-Linkage Clustering (hierarchical)
• The N*N proximity matrix is D = [d(i,j)]
• The clusterings are assigned sequence
numbers 0,1,..., (n-1)
• L(k) is the level of the kth clustering
• A cluster with sequence number m is
denoted (m)
• The proximity between clusters (r) and
(s) is denoted d [(r),(s)]
The algorithm is composed of
the following steps:
• Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
• Find the least dissimilar pair of clusters in the current clustering, say pair (r), (s), according to
  d[(r),(s)] = min d[(i),(j)]
  where the minimum is taken over all pairs of clusters in the current clustering.
The algorithm is composed of the
following steps:(cont.)
• Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form the next clustering m. Set the level of this clustering to L(m) = d[(r),(s)].
• Update the proximity matrix D by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined as
  d[(k),(r,s)] = min { d[(k),(r)], d[(k),(s)] }
Hierarchical clustering example
• Let’s now see a simple example: a hierarchical
clustering of distances in kilometers between some
Italian cities. The method used is single-linkage.
• The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1.
Then we compute the distance from this new compound object to all other objects. In single-link clustering the rule is that the distance from the compound object to another object is equal to the shortest distance from any member of the cluster to the outside object.
• min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a
new cluster called NA/RM
• min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM
into a new cluster called BA/NA/RM
• min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and
FI into a new cluster called BA/FI/NA/RM
• Finally, we merge the last two clusters, MI/TO and BA/FI/NA/RM, at level 295.
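A small sketch reproducing this example with SciPy's single-linkage routine. The distance matrix below is the one commonly quoted for this classic example and is assumed here; it yields exactly the merge levels above (138, 219, 255, 268, 295).

```python
# Single-linkage clustering of the Italian-cities example with SciPy.
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = [
    [0,   662, 877, 255, 412, 996],   # BA
    [662, 0,   295, 468, 268, 400],   # FI
    [877, 295, 0,   754, 564, 138],   # MI
    [255, 468, 754, 0,   219, 869],   # NA
    [412, 268, 564, 219, 0,   669],   # RM
    [996, 400, 138, 869, 669, 0],     # TO
]

# linkage() expects the condensed (upper-triangular) form of the matrix.
Z = linkage(squareform(D), method="single")
print(Z[:, 2])   # merge levels: [138. 219. 255. 268. 295.]
```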
K-means algorithm
1. It accepts the number of clusters to group the data into, and the dataset to cluster, as input values.
2. It then creates the first K initial clusters (K = number of clusters needed) from the dataset by choosing K rows of data randomly from the dataset. For example, if there are 10,000 rows of data in the dataset and 3 clusters need to be formed, then the first K = 3 initial clusters will be created by selecting 3 records randomly.
3. The K-Means algorithm calculates the Arithmetic Mean of each cluster formed in the dataset. The Arithmetic Mean of a cluster is the mean of all the individual records in the cluster. In each of the first K initial clusters there is only one record, and the Arithmetic Mean of a cluster with one record is simply the set of values that make up that record. For example, if the dataset we are discussing is a set of Height, Weight and Age measurements for students in a university, where a record P in the dataset S is represented by a Height, Weight and Age measurement, then P = {Age, Height, Weight}. A record containing the measurements of a student John would be represented as John = {20, 170, 80}, where John's Age = 20 years, Height = 170 cm and Weight = 80 pounds. Since there is only one record in each initial cluster, the Arithmetic Mean of each initial cluster is just that record.
4. Next, K-Means assigns each record in the dataset to only one of the initial clusters. Each record is assigned to the nearest cluster (the cluster which it is most similar to) using a measure of distance or
similarity like the Euclidean Distance Measure or Manhattan/City-Block Distance Measure.
5. K-Means re-assigns each record in the dataset to the most similar cluster and re-calculates the arithmetic mean of all the clusters in the dataset. The arithmetic mean of a cluster is the arithmetic mean of all the records in that cluster. For example, if a cluster contains two records, John = {20, 170, 80} and Henry = {30, 160, 120}, then the arithmetic mean Pmean is represented as Pmean = {Agemean, Heightmean, Weightmean} = {25, 165, 100}.
6. K-Means re-assigns each record in the dataset to only one of the new clusters formed. A record or data point is assigned to the nearest cluster (the cluster which it is most similar to) using a measure of distance or similarity
7. The preceding steps are repeated until stable clusters are formed and the K-Means clustering procedure is completed. Stable clusters are formed when new iterations of the K-Means algorithm no longer create new clusters, i.e. the cluster center (Arithmetic Mean) of each cluster is the same as the old cluster center. There are different techniques for determining when this stable state has been reached and the algorithm can stop (see the sketch below).
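A minimal NumPy sketch of the loop described in steps 1-7, under the rules stated above (random initial rows, Euclidean distance, stop when the means no longer move); the records are illustrative {Age, Height, Weight} tuples.

```python
# K-Means loop: pick K random rows as initial centres, assign every record
# to the nearest centre by Euclidean distance, recompute the means, and
# stop when the centres stop changing. (Empty clusters are not handled here.)
import numpy as np

def k_means(data, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: choose K rows at random as the initial cluster centres.
    centres = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 4 and 6: assign each record to the nearest centre.
        dists = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: recompute the arithmetic mean of every cluster.
        new_centres = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        # Step 7: stable clusters -> the means no longer move.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

# e.g. {Age, Height, Weight} records such as John = {20, 170, 80}
records = np.array([[20, 170, 80], [30, 160, 120], [22, 175, 85], [35, 158, 125]], float)
labels, centres = k_means(records, k=2)
print(labels, centres)
```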
Classification
• Classification Problem Overview
• Classification Techniques
• Provide an overview of the classification problem and introduce some of the basic algorithms
Classification Examples
• Teachers classify students’ grades
as A, B, C, D, or F.
• Identify mushrooms as poisonous
or edible.
• Predict when a river will flood.
• Identify individuals who are credit risks.
• Speech recognition
Classification Ex: Grading
• If x >= 90 then grade = A.
• If 80 <= x < 90 then grade = B.
• If 70 <= x < 80 then grade = C.
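A direct translation of these rules into code; the slide gives only the A, B and C thresholds, so grades below 70 are left undecided here.

```python
# Rule-based grading classifier, following the thresholds above.
def grade(x):
    if x >= 90:
        return "A"
    if 80 <= x < 90:
        return "B"
    if 70 <= x < 80:
        return "C"
    return None   # D/F thresholds not given on the slide

print(grade(85))   # -> "B"
```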
Classification Techniques
• Approach:
1. Create a specific model by evaluating training data (or using domain experts' knowledge).
2. Apply the model developed to new data.
• Classes must be predefined
Defining Classes
Partitioning Based
Classification Using
Regression
• Division: use the regression function to divide the area into regions.
• Prediction: use the regression function to predict a class (see the sketch below).
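One possible sketch of this idea, assuming scikit-learn: fit a regression function to 0/1 class labels and divide the space where the prediction crosses 0.5; the data and the 0.5 threshold are illustrative assumptions, not the original example.

```python
# Regression-based classification: threshold a fitted regression function.
import numpy as np
from sklearn.linear_model import LinearRegression

heights = np.array([[1.6], [1.9], [1.88], [1.7], [1.85], [1.6]])
labels  = np.array([0, 1, 1, 0, 1, 0])     # 0 = "short", 1 = "tall" (illustrative)

reg = LinearRegression().fit(heights, labels)
predicted_class = (reg.predict(np.array([[1.95]])) >= 0.5).astype(int)
print(predicted_class)   # falls in the region on the "tall" side of the divide
```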
Height Example Data
(table with columns Name, Gender, Height, Output1, Output2; data not shown)
Classification Using
Distance
• Place items in class to which they are
“closest”.
• Must determine distance between an
item and a class.
• Classes represented by
– Centroid: central value.
– Medoid: representative point.
– Individual points
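A minimal sketch of centroid-based classification: each class is represented by the mean of its training points, and a new item is placed in the class whose centroid is closest; the data is made up.

```python
# Distance-based classification with class centroids (Euclidean distance).
import numpy as np

def nearest_centroid(train_X, train_y, x):
    classes = np.unique(train_y)
    centroids = {c: train_X[train_y == c].mean(axis=0) for c in classes}
    return min(classes, key=lambda c: np.linalg.norm(x - centroids[c]))

X = np.array([[1.6, 55], [1.7, 60], [1.9, 90], [1.85, 85]], float)
y = np.array(["short", "short", "tall", "tall"])
print(nearest_centroid(X, y, np.array([1.88, 80.0])))   # -> "tall"
```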
K Nearest Neighbor
(KNN):
• Training set includes classes.
• Examine K items near item to be
classified.
• New item placed in class with the
most number of close items.
• O(q) for each tuple to be classified. (Here q is the size of the training set.)
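A minimal KNN sketch matching the description above; scanning the whole labeled training set for each query is what makes classification O(q) in the training-set size q. The data is made up.

```python
# K Nearest Neighbor: take the K closest training items and vote.
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, x, k=3):
    dists = np.linalg.norm(train_X - x, axis=1)   # distance to every training item
    nearest = np.argsort(dists)[:k]               # indices of the K closest items
    return Counter(train_y[nearest]).most_common(1)[0][0]

X = np.array([[1.6, 55], [1.7, 60], [1.9, 90], [1.85, 85], [1.65, 58]], float)
y = np.array(["short", "short", "tall", "tall", "short"])
print(knn_classify(X, y, np.array([1.8, 75.0]), k=3))
```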
Classification Using
Decision Trees
•
Partitioning based:
Divide search
space into rectangular regions.
• Tuple placed into class based on the
region within which it falls.
• DT approaches differ in how the tree is
built:
DT Induction
Decision Tree
Given:
– D = {t1, …, tn} where ti = <ti1, …, tih>
– Database schema contains {A1, A2, …, Ah}
– Classes C = {C1, …, Cm}
A Decision (or Classification) Tree is a tree associated with D such that
– Each internal node is labeled with an attribute Ai
– Each arc is labeled with a predicate that can be applied to the attribute at its parent
– Each leaf node is labeled with a class Cj
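A short decision-tree sketch, assuming scikit-learn; the tuples and class labels are illustrative stand-ins for D and C.

```python
# Decision tree: internal nodes test attributes, leaves carry class labels.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[1.6, 55], [1.7, 60], [1.9, 90], [1.85, 85], [1.65, 58]]   # tuples ti
y = ["short", "short", "tall", "tall", "short"]                  # classes C

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["height", "weight"]))     # the learned tests
print(tree.predict([[1.8, 75]]))
```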
Comparing DTs
• Balanced (comparison figure not shown)
ID3
• Creates the tree using information-theory concepts and tries to reduce the expected number of comparisons (see the sketch below).
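A sketch of the information-theory quantities ID3 relies on: the entropy of a class distribution and the information gain of splitting on an attribute; the sample data is made up.

```python
# Entropy and information gain, the quantities ID3 uses to pick attributes.
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attr_index):
    # Expected reduction in entropy after partitioning on one attribute.
    total = len(labels)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in by_value.values())
    return entropy(labels) - remainder

rows = [["sunny", "hot"], ["rain", "mild"], ["sunny", "mild"], ["rain", "hot"]]
classes = ["no", "yes", "yes", "no"]
print(information_gain(rows, classes, 0))   # attribute 0 gives no gain here
print(information_gain(rows, classes, 1))   # attribute 1 separates the classes
```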
Conclusion
• Clustering is very useful in data mining
• Applicable to both text and graphical data
• Helps simplify data complexity
• Classification assigns items to predefined classes
Reference
• Dr. M.H. Dunham, http://engr.smu.edu/~mhd/dmbook/part2.ppt
• Dr. Lee, Sin-Min, San Jose State University
• Mu-Yu Lu, SJSU