
The maximum return on investment with $5M is therefore $18M. It can be achieved by investing $2M in firm 2 and $3M in firms 0 and 1. The optimal choice is marked with a star in each table. Note that determining how much money has to be allocated to maximize the return on investment requires storing the past tables in order to look up the solutions to the subproblems.

We can generalize eq. (3.92) and eq. (3.94) to any number of investment firms (decision stages):

f[i,k] = max_{j | c[i,j] ≤ k} ( r[i,j] + f[i−1, k − c[i,j]] )    (3.96)

3.9 Artificial intelligence and machine learning

3.9.1 Clustering algorithms

• Distribution-based clustering: These algorithms are based on statistics (more than the other two categories). They assume that the points are generated from a distribution (which must be known a priori) and determine the parameters of that distribution. This provides a clustering because the distribution may be a sum of more than one localized distribution (each being a cluster).

Both k-means and distribution-based clustering assume a priori knowledge about the data, which often defeats the purpose of using clustering: to learn something we do not know about the data using an empirical algorithm.

They also require that the points be represented by vectors in a Euclidean space, which is not always the case. Consider the case of clustering DNA sequences or financial time series. Technically the latter can be represented as vectors, but their dimensionality can be very large, thus making the algorithms impractical.

Hierarchical clustering only requires a notion of distance between points, and only for some pairs of points.
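For non-vector data, all that is needed is a function that measures the distance between two points. As an illustration, here is a minimal sketch of such a metric for DNA sequences (the function dna_metric is hypothetical and not part of nlib.py); it returns the fraction of mismatching characters, and None when two sequences cannot be compared:

def dna_metric(a, b):
    # distance between two equal-length DNA sequences:
    # the fraction of positions where the characters differ
    if len(a) != len(b):
        return None  # no known way to compare sequences of different length
    return float(sum(1 for x, y in zip(a, b) if x != y)) / len(a)

For example, dna_metric('GATTACA', 'GACTACA') is 1/7, while dna_metric('GATTACA', 'GACT') is None.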

Figure 3.6: Example of a dendrogram.

The following is a hierarchical clustering algorithm with these characteristics:

• Individual points do not need to be vectors (although they can be).

• Points may have a weight used to determine their relative importance in identifying the characteristics of the cluster (think of clustering financial assets based on the time series of their returns; the weight could be the average traded volume, as in the sketch after this list).

• The distance between points is computed by a metric function provided by the user. The metric can return None if there is no known connection between two points.

• The algorithm can be used to build the entire dendrogram, or it can stop at a given value of k, a target number of clusters.

• For points that are vectors and a given k, the result is similar to the result of the k-means clustering.
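To make the role of the metric and of the weights concrete, here is a minimal sketch of how the Cluster class of Listing 3.20 below might be used on financial return series (the return series, the volumes, and the Euclidean metric are made-up illustrations, not data from the book):

import math
from nlib import Cluster   # assumes nlib.py, with the Cluster class below, is importable

# three made-up daily return series and their average traded volumes
returns = [[ 0.010, -0.020,  0.030],   # asset 0
           [ 0.012, -0.018,  0.028],   # asset 1
           [-0.030,  0.020, -0.010]]   # asset 2
volumes = [2.0, 1.0, 5.0]              # used as weights

def metric(a, b):
    # Euclidean distance between two return series
    return math.sqrt(sum((x - y)**2 for x, y in zip(a, b)))

c = Cluster(returns, metric, weights=volumes)
r, clusters = c.find(2)                # stop when two clusters remain
print(clusters)                        # expect assets 0 and 1 grouped together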

The algorithm works like any other hierarchical clustering algorithm. At the beginning, all-to-all distances are computed and stored in a list d. Each point starts as its own cluster. At each iteration, the two clusters that are closest to each other are merged into one bigger cluster. The distance between every other cluster and the merged cluster is computed as a weighted average of the distances between the other cluster and the two clusters being merged. The weight factors are provided as input. This is analogous to what the k-means algorithm does when computing the position of a centroid from the vectors of the member points.

The attribute self.q implements disjoint sets representing the set of clusters; self.q is a dictionary. If self.q[i] is a list, then i is its own cluster, and the list contains the IDs of the member points. If self.q[i] is an integer, then cluster i is no longer its own cluster: it was merged into the cluster represented by that integer.

At any point in time, each cluster is represented by one element, which can be found by following the parent links with self.parent(i). This function returns the ID of the cluster containing element i together with the list of IDs of all the points in the same cluster:

Listing 3.20: in file: nlib.py

class Cluster(object):
    def __init__(self,points,metric,weights=None):
        self.points, self.metric = points, metric
        self.k = len(points)
        self.w = weights or [1.0]*self.k
        self.q = dict((i,[i]) for i,e in enumerate(points))
        self.d = []
        for i in xrange(self.k):
            for j in xrange(i+1,self.k):
                m = metric(points[i],points[j])
                if m is not None:
                    self.d.append((m,i,j))
        self.d.sort()
        self.dd = []
    def parent(self,i):
        while isinstance(i,int): (parent, i) = (i, self.q[i])
        return parent, i
    def step(self):
        if self.k>1:
            # find the next two clusters to join (smallest distance)
            (self.r,i,j),self.d = self.d[0],self.d[1:]
            # join them
            i,x = self.parent(i) # find members of cluster i
            j,y = self.parent(j) # find members of cluster j
            x += y               # join members
            self.q[j] = i        # make cluster j point to cluster i
            self.k -= 1          # decrease cluster count
            # update all distances to the new joined cluster
            new_d = [] # links not related to the joined clusters
            old_d = {} # old links related to the joined clusters
            for (r,h,k) in self.d:
                if h in (i,j):
                    a,b = old_d.get(k,(0.0,0.0))
                    old_d[k] = a+self.w[k]*r, b+self.w[k]
                elif k in (i,j):
                    a,b = old_d.get(h,(0.0,0.0))
                    old_d[h] = a+self.w[h]*r, b+self.w[h]
                else:
                    new_d.append((r,h,k))
            new_d += [(a/b,i,k) for k,(a,b) in old_d.items()]
            new_d.sort()
            self.d = new_d
            # update the weight of the new cluster
            self.w[i] = self.w[i]+self.w[j]
            # get the new list of cluster members
            self.v = [s for s in self.q.values() if isinstance(s,list)]
            self.dd.append((self.r,len(self.v)))
        return self.r, self.v

    def find(self,k):
        # if necessary, start again
        if self.k<k: self.__init__(self.points,self.metric)
        # step until we get k clusters
        while self.k>k: self.step()
        # return the last merge distance and the list of cluster members
        return self.r, self.v
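To see the bookkeeping in self.q at work, here is a minimal sketch of a single merge step on three one-dimensional points (assuming the Cluster class above has been saved in nlib.py):

>>> from nlib import Cluster
>>> points = [[0.0], [0.1], [5.0]]
>>> c = Cluster(points, metric=lambda a,b: abs(a[0]-b[0]))
>>> r, clusters = c.step()     # merges the two closest points, 0 and 1
>>> c.q[0], c.q[1], c.q[2]     # cluster 1 now points to cluster 0
([0, 1], 0, [2])
>>> c.parent(1)                # representative and members of the cluster containing 1
(0, [0, 1])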

Given a set of points, we can determine the most likely number of clusters representing the data: we make a plot of the number of clusters versus the merge distance and look for a plateau in the plot. At the plateau, we can read the number of clusters off the y-coordinate.

This is done with the find method of the preceding Cluster class, which returns the (weighted average) distance of the last merge and the list of clusters; the successive (distance, number of clusters) pairs needed for the plot are stored in the attribute dd.
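Instead of reading the plateau off the plot by eye, one can also apply a simple heuristic to the dd attribute directly; the following guess_k helper is a hypothetical sketch (not part of nlib.py) that returns the cluster count recorded just before the largest jump in merge distance:

def guess_k(dd):
    # dd is a list of (merge distance, number of clusters) pairs, as built in Cluster.dd;
    # return the cluster count recorded just before the largest jump in merge distance
    gaps = [(b[0] - a[0], a[1]) for a, b in zip(dd, dd[1:])]
    return max(gaps)[1]

With well-separated clusters, guess_k(c.dd) computed after a full c.find(1) should agree with the number read from the plateau of the plot.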

For example:

Listing 3.21: in file: nlib.py

>>> def metric(a,b):
...     return math.sqrt(sum((x-b[i])**2 for i,x in enumerate(a)))
>>> points = [[random.gauss(i % 5,0.3) for j in xrange(10)] for i in xrange(200)]
>>> c = Cluster(points,metric)
>>> r, clusters = c.find(1) # cluster all points until one cluster only
>>> Canvas(title='clustering example',xlab='distance',ylab='number of clusters'
...        ).plot(c.dd[150:]).save('clustering1.png')
>>> Canvas(title='clustering example (2d projection)',xlab='p[0]',ylab='p[1]'
...        ).ellipses([p[:2] for p in points]).save('clustering2.png')

With our sample data, we obtain the following plot (“clustering1.png”):

and the location where the curve bends corresponds to five clusters. Although our points live in 10 dimensions, we can try to project them into two dimensions and see the five clusters (“clustering2.png”):
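Because the plateau suggests five clusters, we can ask the same object for exactly five of them; find(5) automatically restarts the clustering because c currently holds fewer than five clusters:

>>> r, clusters = c.find(5)   # restart and stop when five clusters remain
>>> len(clusters)
5

Each of the five clusters should contain roughly 40 of the 200 points, one for each of the five Gaussian centers used to generate the data.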
