Algorithms of Combinatorial Cluster Analysis
3.2 EM Algorithm
Subsequently, new cluster centres are determined. Their real-valued components are computed as in the classic k-means algorithm, see Eq. (3.3), while for the nominal-valued components Lemma 3.1.1 is applied.
According to Huang [263], there are three main differences between this algorithm (called by the author k-prototypes) and the CLARA algorithm, mentioned in the previous section:
• The k-prototypes algorithm processes the entire data set, while CLARA exploits sampling when applied to large data sets.
• The k-prototypes algorithm optimises the cost function on the entire data set, so that at least a local optimum for the entire data collection is reached. CLARA performs optimisation on samples only and hence runs the risk of missing an optimum, especially when for some reason the samples are biased.
• Sample sizes required by CLARA grow with the size of the entire data set and the complexity of the interrelationships within it. At the same time, the efficiency of CLARA decreases with the sample size: it cannot handle more than a few tens of thousands of objects in a sample. In contrast, the k-prototypes algorithm has no such limitations.
It is worth mentioning that the dissimilarity measure used here fulfils the triangle inequality, which implies that the accelerations mentioned in Sect. 3.1.4 are applicable here.
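For illustration only, the prototype update described above can be sketched in Python as follows; the code is ours (not Huang's), and it assumes that Lemma 3.1.1 prescribes taking the most frequent category (the mode) as the nominal component of a prototype, while the real-valued components are cluster means as in the k-means algorithm.

```python
# Illustrative sketch (not from [263]) of a single k-prototypes centre update:
# real-valued attributes are averaged as in k-means, nominal attributes are set
# to the most frequent category (the mode), as assumed for Lemma 3.1.1.
from collections import Counter

def update_prototype(cluster, numeric_idx, nominal_idx):
    """cluster: list of mixed-type tuples; returns the new prototype as a dict."""
    prototype = {}
    for j in numeric_idx:
        prototype[j] = sum(row[j] for row in cluster) / len(cluster)            # mean
    for j in nominal_idx:
        prototype[j] = Counter(row[j] for row in cluster).most_common(1)[0][0]  # mode
    return prototype

# Example: two numeric attributes (0, 1) and one nominal attribute (2)
cluster = [(1.0, 2.0, 'a'), (3.0, 4.0, 'a'), (5.0, 0.0, 'b')]
print(update_prototype(cluster, numeric_idx=[0, 1], nominal_idx=[2]))
# {0: 3.0, 1: 2.0, 2: 'a'}
```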
In this approach we assume that the data were generated by a mixture of k probability distributions, p(x) = \sum_{j=1}^{k} p(C_j)\, p(x \mid C_j) (cf. Eq. (3.43)), with each component C_j modelled here by an n-dimensional Gaussian distribution:

p(x \mid C_j; \mu_j, \Sigma_j) = \frac{1}{(2\pi)^{n/2}\, |\Sigma_j|^{1/2}} \exp\Big( -\frac{1}{2} (x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j) \Big) \qquad (3.44)

The assumption of Gaussian (or normal) distributions of elements in a cluster is a "classical" variant of the discussed approach, but it is nonetheless sufficiently general: it can be demonstrated that every continuous and bounded probability density function can be approximated with any required precision by a mixture of Gaussian distributions [303]. Sample sets generated from a mixture of two normal distributions were already presented in Fig. 2.13 on p. 64. In general, however, mixtures of any type of distributions can be considered. A rich bibliography on this subject can be found on the Web page of David Dowe, http://www.csse.monash.edu.au/~dld/mixture.modelling.page.html.
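To make formula (3.44) concrete, a minimal sketch evaluating the Gaussian component density and a small two-component mixture is given below; the function name, the example point and the parameter values are illustrative only.

```python
# A minimal sketch of evaluating the Gaussian component density, Eq. (3.44).
import numpy as np

def gaussian_density(x, mu, sigma):
    """Value of the n-dimensional normal density p(x | C_j; mu_j, Sigma_j)."""
    n = mu.shape[0]
    diff = x - mu
    # 1 / ((2*pi)^(n/2) * |Sigma|^(1/2))
    norm_const = 1.0 / (np.power(2.0 * np.pi, n / 2.0) * np.sqrt(np.linalg.det(sigma)))
    # exp(-1/2 * (x - mu)^T Sigma^{-1} (x - mu))
    exponent = -0.5 * diff @ np.linalg.solve(sigma, diff)
    return norm_const * np.exp(exponent)

# Example: a two-component mixture density p(x) = sum_j p(C_j) p(x | C_j)
x = np.array([0.5, 1.0])
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.eye(2), 0.5 * np.eye(2)]
priors = [0.6, 0.4]
p_x = sum(w * gaussian_density(x, m, s) for w, m, s in zip(priors, mus, sigmas))
print(p_x)
```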
Let us reformulate the task of clustering as follows: we will treat the vectors of expected values as cluster centres (rows of the matrix M). Instead of the cluster membership matrix U, we will consider the matrix P = [p_ij]_{m×k}, where p_ij = p(C_j|x_i) denotes the probability that the i-th object belongs to the j-th cluster. Now let us define the target function of clustering as follows:
J_{EM}(P, M) = -\sum_{i=1}^{m} \log \sum_{j=1}^{k} p(x_i \mid C_j)\, p(C_j) \qquad (3.45)

The function log(·) was introduced to simplify the computations. The minus sign allows us to treat the problem of parameter estimation as a task of minimisation of the expression (3.45).
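The criterion (3.45) is straightforward to compute once the component densities are known. The sketch below assumes (our convention, not the book's) that the densities p(x_i|C_j) are stored in an m×k array `dens` and the a priori probabilities p(C_j) in a length-k vector `priors`.

```python
# A minimal sketch of the clustering criterion, Eq. (3.45).
import numpy as np

def j_em(dens, priors):
    """J_EM = - sum_i log( sum_j p(x_i|C_j) p(C_j) ), with dens[i, j] = p(x_i|C_j)."""
    mixture = dens @ np.asarray(priors)   # m-vector of mixture densities p(x_i)
    return -np.sum(np.log(mixture))
```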
Assuming that the set of observations really consists of k clusters, we can imagine that each observation x_i, i = 1, …, m, was generated by exactly one of the k distributions, though we do not know by which. Hence, the feature that we shall subsequently call "class" is a feature with unknown values.22 We can say that we have incomplete data of the form y = (x, class). Further, we do not know the parameters of the constituent distributions of the mixture (3.43). In order to estimate them, we can use the EM (Expectation Maximization) method, elaborated by Dempster, Laird and Rubin [144], which is, in fact, a maximum likelihood method adapted to the situation when the data are incomplete. The method consists in repeating a sequence of two steps (see [75, 348]):
• Step E (Expectation): the known feature values of the objects, together with the estimated model parameters, are used to compute the expected values of the unknown ones.
• Step M (Maximization): both the known (observed) and the estimated (unobserved) feature values of the objects are used to estimate the model parameters by maximising the model likelihood given the data.
22 More precisely, it is a so-called "hidden feature" or "hidden variable".
The EM algorithm is one more example of the broad class of hill-climbing algorithms.
Its execution starts with a choice of mixture parameters and proceeds by stepwise parameter re-estimation until the function (3.45) reaches a local optimum.
Step E consists in estimating the probability that the i-th object belongs to the class C_j. If we assume that we know the parameters of the distributions belonging to the mixture (3.43), then we can estimate this probability using Bayes' theorem:

p(C_j \mid x_i) = \frac{p(x_i \mid C_j)\, p(C_j)}{p(x_i)} = \frac{p(x_i \mid C_j)\, p(C_j)}{\sum_{l=1}^{k} p(x_i \mid C_l)\, p(C_l)} \qquad (3.46)

Here p(x_i|C_j) is the value of the density function φ(x_i; μ, Σ), with mean vector μ and covariance matrix Σ, at the point x_i, i.e. p(x_i|C_j) = φ(x_i; μ_j, Σ_j). The value p(x_i) is computed according to the total probability theorem.
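Step E, i.e. Eq. (3.46), can be implemented in a vectorised way as sketched below; again the array layout and names (dens[i, j] = p(x_i|C_j), priors[j] = p(C_j)) are our assumptions.

```python
# A minimal sketch of step E, Eq. (3.46).
import numpy as np

def e_step(dens, priors):
    """Posterior probabilities P[i, j] = p(C_j|x_i) obtained via Bayes' theorem."""
    joint = dens * np.asarray(priors)        # p(x_i|C_j) p(C_j)
    p_x = joint.sum(axis=1, keepdims=True)   # total probability p(x_i)
    return joint / p_x                       # rows sum to one
```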
To execute step M, we assume that we know the class membership of all objects, described by the a posteriori distribution p(C_j|x_i). Under these circumstances the maximum likelihood method can be used to estimate the distribution parameters. The estimator of the vector of expected values of the j-th cluster has the form

\mu_j = \frac{1}{m\, p(C_j)} \sum_{i=1}^{m} p(C_j \mid x_i)\, x_i \qquad (3.47)

Note that this equation has the same form as Eq. (2.44) when the weights w(x_i) are constant, e.g. w(x_i) = 1. If we additionally assume hard (deterministic) class membership, i.e. p(C_j|x_i) ∈ {0, 1}, then formula (3.47) reduces to the classical formula for the centre of gravity of a class.
Similarly, we compute the covariance matrix from the formula:
\Sigma_j = \frac{1}{m\, p(C_j)} \sum_{i=1}^{m} p(C_j \mid x_i)\, (x_i - \mu_j)(x_i - \mu_j)^T \qquad (3.48)
and the a priori class membership probability from the equation:

p(C_j) = \frac{1}{m} \sum_{i=1}^{m} p(C_j \mid x_i) \qquad (3.49)
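Step M, i.e. the updates (3.47)–(3.49), may then be sketched as follows, with P standing for the m×k matrix of posterior probabilities p(C_j|x_i) and X for the m×n data matrix; all names are illustrative.

```python
# A minimal sketch of step M, implementing Eqs. (3.47)-(3.49).
import numpy as np

def m_step(X, P):
    m, n = X.shape
    priors = P.mean(axis=0)                    # Eq. (3.49): p(C_j) = (1/m) sum_i p(C_j|x_i)
    mus, sigmas = [], []
    for j in range(P.shape[1]):
        w = P[:, j]                            # weights p(C_j|x_i); note sum_i w_i = m * priors[j]
        mu_j = (w @ X) / (m * priors[j])       # Eq. (3.47)
        diff = X - mu_j
        sigma_j = (diff * w[:, None]).T @ diff / (m * priors[j])   # Eq. (3.48)
        mus.append(mu_j)
        sigmas.append(sigma_j)
    return np.array(mus), np.array(sigmas), priors
```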
We can summarise these remarks with Algorithm 3.9.
If we speak in the language of the k-means algorithm, we will say that step E is equivalent to an update of the assignment of points to clusters, while step M can be viewed as determining the characteristics of the new partition of the set X. A convenient way to implement this pseudo-code (see e.g. [134]) consists in updating the a priori probabilities right after step E, then determining the centres μ_j^{t+1}, and finally computing the matrices Σ_j^{t+1}.
Algorithm 3.9 EM algorithm for cluster analysis
Input: Matrix X representing the set of objects, number of clusters k.
Output: A set of pairs (μ_1, Σ_1), …, (μ_k, Σ_k) representing the components C_j of a mixture of Gaussian distributions fitting the data, and, for each object from X, the probabilities of belonging to each of the components C_j.
1: Initialisation. Set the iteration counter to t = 0. Give the initial estimates of the distribution parameters μ_j^t, Σ_j^t and of the a priori probabilities p(C_j), j = 1, …, k.
2: (Step E): Using Bayes' theorem, compute the membership of the i-th object in the j-th class, p^{t+1}(C_j|x_i), i = 1, …, m, j = 1, …, k; compare Eq. (3.46).
3: (Step M): Knowing the class memberships of the objects, update the distribution parameters μ_j^{t+1}, Σ_j^{t+1} and the a priori probabilities p^{t+1}(C_j), j = 1, …, k; Eqs. (3.47)–(3.49).
4: t = t + 1
5: Repeat steps 2–4 until the estimated values stabilise.
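A compact, self-contained sketch of Algorithm 3.9 is given below. It is only an illustration under additional assumptions of ours: the centres are initialised with randomly chosen objects, the covariances with identity matrices and the priors uniformly, and the loop stops when the centres move by less than a fixed tolerance. Following the remark made before the algorithm (cf. [134]), within step M the a priori probabilities are updated first, then the centres, and finally the covariance matrices.

```python
# An illustrative, self-contained sketch of Algorithm 3.9 (EM for cluster analysis).
import numpy as np
from scipy.stats import multivariate_normal

def em_clustering(X, k, max_iter=100, tol=1e-6, rng=None):
    rng = np.random.default_rng(rng)
    m, n = X.shape
    # Initialisation (t = 0): centres drawn from the data, unit covariances, uniform priors.
    mus = X[rng.choice(m, size=k, replace=False)]
    sigmas = np.array([np.eye(n) for _ in range(k)])
    priors = np.full(k, 1.0 / k)
    for _ in range(max_iter):
        # Step E, Eq. (3.46): posterior membership probabilities p(C_j|x_i).
        dens = np.column_stack([
            multivariate_normal(mean=mus[j], cov=sigmas[j]).pdf(X) for j in range(k)
        ])
        joint = dens * priors
        P = joint / joint.sum(axis=1, keepdims=True)
        # Step M, Eqs. (3.47)-(3.49): priors first, then centres, then covariances (cf. [134]).
        priors = P.mean(axis=0)
        new_mus = (P.T @ X) / (m * priors[:, None])
        for j in range(k):
            diff = X - new_mus[j]
            sigmas[j] = (diff * P[:, j, None]).T @ diff / (m * priors[j])
        shift = np.max(np.abs(new_mus - mus))
        mus = new_mus
        if shift < tol:          # the estimates have stabilised
            break
    return mus, sigmas, priors, P

# Example usage on synthetic data (illustrative):
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(5, 1, size=(100, 2))])
mus, sigmas, priors, P = em_clustering(X, k=2, rng=0)
```

In practice a small constant is often added to the diagonal of each Σ_j to keep the covariance matrices well conditioned; this safeguard is omitted here for brevity.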
The WEKA system [505] implements the EM algorithm for cluster analysis. Many other systems exploit mixtures of distributions, e.g. Snob [491], AutoClass [106] and MClust [186].
Experiments presented in [229] show that the EM algorithm is comparable, with respect to quality, to the k-means algorithm. Just as in the case of the latter, the performance of the EM algorithm depends on initialisation.23 However, as noticed in [15], the EM algorithm is not suitable for processing high-dimensional data, due to losses in computational precision. Let us remember, too, that determining a covariance matrix requires the computation of (n² + n)/2 elements, which becomes quite expensive as the number of dimensions grows. Finally, a disadvantage of the EM algorithm in the form presented here is its slow convergence [144]. On the other hand, it is much more universal than the k-means algorithm: it can be applied to the analysis of non-numerical data. The fast algorithm developed by Moore in [360] allows one to reduce, at least partially, the deficiencies listed above.

23 One of the initialisation methods for the EM algorithm (estimating the mixture parameters) consists in applying the k-means algorithm.
The structural similarity between the EM and k-means algorithms permits to state that if the distribution p(C_j|x) is approximated by the rule "the winner takes all", then the EM algorithm with a mixture of Gaussian distributions with covariance matrices Σ_j = εI, j = 1, …, k, where I denotes the identity matrix, becomes the k-means algorithm as ε → 0 [285].
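This limit behaviour can be observed numerically. The small sketch below (ours, assuming equal a priori probabilities) shows that for Σ_j = εI the posterior p(C_j|x) approaches the hard, "winner takes all" assignment to the nearest centre as ε decreases.

```python
# A small illustrative check (not from [285]) of the "winner takes all" limit.
import numpy as np

def responsibilities_isotropic(x, centres, eps):
    # With Sigma_j = eps*I (and equal priors) the Gaussian densities reduce to
    # exp(-||x - mu_j||^2 / (2*eps)) up to a common factor that cancels on normalisation.
    d2 = np.array([np.sum((x - mu) ** 2) for mu in centres])
    logits = -d2 / (2.0 * eps)
    w = np.exp(logits - logits.max())   # subtract the maximum for numerical stability
    return w / w.sum()

x = np.array([1.0, 0.2])
centres = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
for eps in (1.0, 0.1, 0.01):
    print(eps, responsibilities_isotropic(x, centres, eps))
# As eps decreases, the distribution approaches the indicator of the closest centre,
# i.e. the hard assignment used by the k-means algorithm.
```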
Dasgupta and Schulman [134] analyse in depth the case of a mixture of spherical normal distributions, that is, those with Σ_j = σ_j I. They prove that if the clusters are sufficiently well separated, see Eq. (2.58), then the estimation of the mixture parameters requires only two iterations of a slightly modified Algorithm 3.9. This is particularly important in the case of high-dimensional data (n ≫ ln k). They also developed a simple procedure applicable when the exact value of k is not known in advance.
They propose an initialisation method, which is a variant of method (c) from Sect. 3.1.3.
In recent years we have been witnessing a significant advancement in the theoretical description and understanding of the problems related to learning the parameters of a mixture of distributions. A reader interested in these aspects is recommended to study the paper [278].
The slow convergence of the standard version of the EM algorithm spurred the development of its various modifications, such as greedy EM,24 stochastic EM,25 or random swap EM. A brief review of them can be found in Sections 4.4 and 4.5 of the Ph.D. thesis [535] and in the paper [392], where a genetic variant of the EM algorithm has been proposed; it allows one to discover not only the mixture parameters but also the number of constituent components.