2.5 Other Methods of Cluster Analysis
2.5.7 Potential (Kernel) Function Methods
These methods originate from the work of Ajzerman, Braverman and Rozonoer [12], where the authors used the term potential function. A short introduction to this approach can be found in Sect. 5.6 of the monograph [463]. Along with the development of the theory of support vector machines (SVM) [477], the term "kernel function" replaced the earlier term "potential function". Below, following [176], we present the fundamental assumptions of cluster analysis methods using the notion of the kernel function.
We assume (for simplicity and in view of practical applications) that we deal with $n$-dimensional vectors having real-valued components (and not complex numbers, as assumed in the general theory). Hence, as until now, $X = \{x_1, \ldots, x_m\}$ denotes the non-empty set of objects, with $x_i \in \mathbb{R}^n$.
Definition 2.5.1 A function $K : X \times X \to \mathbb{R}$ is called a positive definite kernel function (Mercer kernel, or simply kernel) if: (i) $K(x_i, x_j)$ is a symmetric function, and (ii) for any vectors $x_i, x_j \in X$ and any real-valued constants $c_1, \ldots, c_m$ the following inequality holds:46
46 An introduction to kernel functions should start with the definition of a vector space. Let $V$ be a set, $\oplus : V \times V \to V$ be the so-called inner operator (vector addition) and $\odot : \mathbb{R} \times V \to V$ be the so-called outer operator (scalar-vector multiplication). Then $(V, \oplus, \odot)$ is called a vector space over the real numbers if the following properties hold: $u \oplus (v \oplus w) = (u \oplus v) \oplus w$; there exists $0_V \in V$ such that $v \oplus 0_V = 0_V \oplus v = v$; $v \oplus u = u \oplus v$; $\alpha \odot (u \oplus v) = (\alpha \odot u) \oplus (\alpha \odot v)$; $(\alpha + \beta) \odot v = (\alpha \odot v) \oplus (\beta \odot v)$; $(\alpha \cdot \beta) \odot v = \alpha \odot (\beta \odot v)$; $1 \odot v = v$.
Given the vector space, one can define an inner product space as a vector space $V$ over the real numbers endowed with a scalar product $\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}$ satisfying: $\langle v, v \rangle \ge 0$; $\langle v, v \rangle = 0 \Leftrightarrow v = 0$; $\langle u, v \rangle = \langle v, u \rangle$; $\langle u, \lambda v \rangle = \lambda \langle u, v \rangle$; $\langle u, v + w \rangle = \langle u, v \rangle + \langle u, w \rangle$.
Now let $X$ be the space of our input data. A mapping $K : X \times X \to \mathbb{R}$ is called a kernel if there exists an inner product space $(F, \langle \cdot, \cdot \rangle)$ ($F$ being called the feature space) and a mapping $\Phi : X \to F$ such that in this inner product space $K(x, y) = \langle \Phi(x), \Phi(y) \rangle$ for all $x, y \in X$. As $\langle \Phi(x), \Phi(y) \rangle = \langle \Phi(y), \Phi(x) \rangle$ holds for the inner product, obviously $K(x, y) = K(y, x)$.
Mercer has shown that if $\int_{x \in X} \int_{y \in X} K^2(x, y)\,dx\,dy < +\infty$ (compactness of $K$) and for each function $f : X \to \mathbb{R}$, $\int_{x \in X} \int_{y \in X} K(x, y) f(x) f(y)\,dx\,dy \ge 0$ (semi-positive-definiteness of $K$), then there exists a sequence of non-negative real numbers (eigenvalues) $\lambda_1, \lambda_2, \ldots$ and a sequence of functions $\phi_1, \phi_2, \ldots : X \to \mathbb{R}$ such that $K(x, y) = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(y)$, where the sum on the right-hand side is absolutely convergent. Moreover, $\int_{x \in X} \phi_i(x) \phi_j(x)\,dx$ equals 1 if $i = j$ and 0 otherwise.
Obviously, the mapping $\Phi$ may then take the form of an infinite vector $\Phi = (\sqrt{\lambda_1}\,\phi_1, \sqrt{\lambda_2}\,\phi_2, \ldots)$. The kernel definition above constitutes a special application of this general formulation to the case of a finite set $X$. The function $K$ over a finite set $X$ can in such a case be represented as a matrix, which must therefore be semi-positive definite, and the function $\Phi$ can be expressed for each $x \in X$ as the vector of the corresponding eigenvector components multiplied by the square roots of the respective eigenvalues.
$$\sum_{i=1}^{m} \sum_{j=1}^{m} c_i c_j K(x_i, x_j) \ge 0$$
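As a simple numerical check of this condition, the following Python sketch (with purely illustrative data and the Gaussian kernel introduced later in this section) verifies that the quadratic form above is non-negative, which is equivalent to the positive semi-definiteness of the kernel (Gram) matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # m = 5 illustrative objects from R^3

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2)); see item (c) below
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

# Gram matrix k_ij = K(x_i, x_j)
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

c = rng.normal(size=len(X))          # arbitrary real constants c_1, ..., c_m
print(c @ K @ c >= -1e-12)           # the quadratic form of Definition 2.5.1 is non-negative
print(np.all(np.linalg.eigvalsh(K) >= -1e-12))   # equivalently: K is positive semi-definite
```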
If $K_1(x, y)$, $K_2(x, y)$ are kernel functions, then their sum, their product, and $aK(x, y)$, where $a > 0$, are also kernel functions. The typical kernel functions used in machine learning are the following (a short illustrative sketch in Python follows the list):
(a) Linear kernel $K_l(x, y) = x^T y + c$. In the majority of cases, the algorithms that use linear kernel functions are close to equivalent to their "non-kernel" counterparts (thus, e.g., the kernel-based variant of principal component analysis with the linear kernel is equivalent to the classical PCA algorithm).
(b) Polynomial kernel $K(x, y) = (\alpha x^T y + c)^d$, where $\alpha$, $c$ and the degree of the polynomial, $d$, are parameters. The polynomial kernel functions are applied, first of all, in situations in which normalised data are used.
(c) Gaussian kernel $K(x, y) = \exp\bigl(-\|x - y\|^2/(2\sigma^2)\bigr)$, where $\sigma > 0$ is a parameter whose choice requires special care. If its value is overestimated, the exponent behaves almost linearly and the highly nonlinear projection loses its properties. In the case of underestimation, the function $K$ loses its regularisation capacity and the borders of the decision area become sensitive to noisy data. The Gaussian kernel functions belong to the so-called radial basis functions of the form
$$K(x, y) = \exp\left(-\frac{\sum_{j=1}^{n} |x_{a_j} - y_{a_j}|^b}{2\sigma^2}\right), \quad b \le 2$$
The strong point of the Gaussian kernel is that (for a correctly chosen value of the parameter $\sigma$) it effectively filters out noisy data and outliers.
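The three kernel families listed above can be written down directly; the sketch below is only illustrative, and the parameter defaults ($c$, $\alpha$, $d$, $\sigma$) are assumptions made for the example:

```python
import numpy as np

def linear_kernel(x, y, c=0.0):
    # (a) K_l(x, y) = x^T y + c
    return x @ y + c

def polynomial_kernel(x, y, alpha=1.0, c=1.0, d=3):
    # (b) K(x, y) = (alpha x^T y + c)^d
    return (alpha * (x @ y) + c) ** d

def gaussian_kernel(x, y, sigma=1.0):
    # (c) K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

x, y = np.array([1.0, 0.0]), np.array([0.5, 0.5])
print(linear_kernel(x, y), polynomial_kernel(x, y), gaussian_kernel(x, y))
```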
Every Mercer kernel function can be represented as a scalar product
$$K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j) \qquad (2.49)$$
where $\Phi : X \to F$ is a nonlinear mapping of the space of objects into a high-dimensional space of features $F$. An important consequence of this representation is the possibility of calculating the Euclidean distance in the space $F$ without knowledge of the explicit form of the function $\Phi$. In fact,
$$\begin{aligned}
\|\Phi(x_i) - \Phi(x_j)\|^2 &= \bigl(\Phi(x_i) - \Phi(x_j)\bigr)^T \bigl(\Phi(x_i) - \Phi(x_j)\bigr)\\
&= \Phi(x_i)^T \Phi(x_i) + \Phi(x_j)^T \Phi(x_j) - 2\,\Phi(x_i)^T \Phi(x_j)\\
&= K(x_i, x_i) + K(x_j, x_j) - 2K(x_i, x_j)
\end{aligned} \qquad (2.50)$$
In view of the finiteness of the set $X$ it is convenient to form a matrix $K$ having elements $k_{ij} = K(x_i, x_j)$. Since $k_{ij} = \Phi(x_i)^T \Phi(x_j)$, then, from the formal point of view, $K$ is a Gram matrix, see Definition B.2.3. Given this notation, we can write down the last equality in the form
$$\|\Phi(x_i) - \Phi(x_j)\|^2 = k_{ii} + k_{jj} - 2k_{ij} \qquad (2.51)$$
The kernel functions are used in cluster analysis in three ways, referred to by the following terms, see [176, 515]:
(a) kernelisation of the metric, (b) clustering in the feature space $F$, (c) description via support vectors.
In the first case we look for the prototypes in the space $X$, but the distance between objects and prototypes is calculated in the space of features, with the use of Eq. (2.50).
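A minimal sketch of this computation, assuming a Gaussian kernel and purely illustrative vectors (the helper name feature_space_sq_distance is not taken from the cited literature), may look as follows:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

def feature_space_sq_distance(x, mu, kernel):
    # Eq. (2.50): ||Phi(x) - Phi(mu)||^2 = K(x, x) + K(mu, mu) - 2 K(x, mu)
    return kernel(x, x) + kernel(mu, mu) - 2.0 * kernel(x, mu)

x  = np.array([1.0, 2.0])     # an object in the input space X
mu = np.array([0.5, 1.5])     # a prototype, also kept in the input space X
print(feature_space_sq_distance(x, mu, gaussian_kernel))
```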
The counterpart to the criterion function (2.32) is now
$$J_1 = \sum_{j=1}^{k} \sum_{i=1}^{m} u_{ij}\,\|\Phi(x_i) - \Phi(\mu_j)\|^2 = \sum_{j=1}^{k} \sum_{i=1}^{m} u_{ij}\,\bigl[K(x_i, x_i) + K(\mu_j, \mu_j) - 2K(x_i, \mu_j)\bigr] \qquad (2.52)$$
If, in addition, $K(x_i, x_i) = 1$, e.g. when $K$ is a Gaussian kernel, the above function simplifies to the form
$$J_1 = 2 \sum_{j=1}^{k} \sum_{i=1}^{m} u_{ij}\,\bigl[1 - K(x_i, \mu_j)\bigr] \qquad (2.53)$$
In effect, in this case the function $d(x, y) = \sqrt{1 - K(x, y)}$ is a distance, and if, in addition, $K$ is a Gaussian kernel, then for $\sigma \to \infty$ the distance $d(x, y)$ becomes proportional to the Euclidean distance $\|x - y\|$.
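For illustration, the simplified criterion (2.53) can be evaluated using kernel values only; the data set, the prototypes and the crisp assignment matrix U in the sketch below are assumptions made solely for the example:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

def j1_gaussian(X, prototypes, U, sigma=1.0):
    # J_1 = 2 * sum_j sum_i u_ij * (1 - K(x_i, mu_j)), i.e. Eq. (2.53)
    total = 0.0
    for j, mu in enumerate(prototypes):
        for i, x in enumerate(X):
            total += U[i, j] * (1.0 - gaussian_kernel(x, mu, sigma))
    return 2.0 * total

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
prototypes = np.array([[0.05, 0.0], [5.0, 5.0]])
U = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # illustrative crisp assignments
print(j1_gaussian(X, prototypes, U, sigma=1.0))
```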
An example of such an algorithm is considered in more detail in Sect. 3.3.5.6.
The idea of calculating distances in the space of features was also made use of in the kernelised and effective algorithm of hierarchical grouping, as well as in the kernelised version of the mountain algorithm,47 see also [288].
47 The mountain algorithm is a fast algorithm for determining approximate locations of centroids. See R.R. Yager and D.P. Filev. Approximate clustering via the mountain method. IEEE Trans. on Systems, Man and Cybernetics, 24 (1994), 1279–1284.
In the second case we operate with the images $\Phi(x_i)$ of the objects and we look for the prototypes $\mu_j$ in the space of features. The criterion function (2.32) now takes the form
$$J_2 = \sum_{j=1}^{k} \sum_{i=1}^{m} u_{ij}\,\|\Phi(x_i) - \mu_j\|^2 \qquad (2.54)$$
where $\mu_j \in F$. In Sect. 3.1.5.5 we show how this concept is applied to the classical $k$-means algorithm, and in Sect. 3.3.5.6.2 to the fuzzy $k$-means algorithm (FCM).
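A standard way of realising this idea is the kernel k-means assignment step, in which the prototypes $\mu_j$ are the (implicit) means of the images $\Phi(x_i)$ belonging to a cluster and the squared distances are expanded so that only entries of the Gram matrix appear. The sketch below is a minimal illustration of this expansion (initialisation and stopping rule are arbitrary choices), not a reproduction of the algorithms discussed in Sects. 3.1.5.5 and 3.3.5.6.2:

```python
import numpy as np

def kernel_kmeans(K, k, n_iter=20, seed=0):
    """Assign m objects to k clusters, given only their Gram matrix K (m x m)."""
    m = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=m)            # arbitrary initial assignment
    for _ in range(n_iter):
        dist = np.full((m, k), np.inf)
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue
            # ||Phi(x_i) - mu_j||^2 expanded in kernel values only:
            # K_ii - (2/|C_j|) sum_{l in C_j} K_il + (1/|C_j|^2) sum_{l,l' in C_j} K_ll'
            dist[:, j] = (np.diag(K)
                          - 2.0 * K[:, members].mean(axis=1)
                          + K[np.ix_(members, members)].mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

Note that only the assignment step is iterated here; the prototypes themselves are never materialised in $F$, which is precisely what distinguishes clustering in the feature space from the kernelised metric of the first case.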
Finally, the description based on support vectors refers to the single-class variant of the support vector machine (SVM), making it possible to find in the space of features the sphere of minimum radius containing almost all the data, that is, the data with the exclusion of the outliers [63]. Denoting the centre of the sphere by the symbol $v$ and its radius by the symbol $R$, we obtain constraints of the form
$$\|\Phi(x_i) - v\|^2 \le R^2 + \xi_i, \qquad i = 1, \ldots, m \qquad (2.55)$$
where $\xi_i$ are artificial (slack) variables. A more extensive treatment of this subject is presented in Sect. 3.5 of the book [176].
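As an illustration of this third approach one may use the one-class SVM available in scikit-learn; note that it implements a hyperplane-based formulation which, for the Gaussian kernel, is closely related (but not identical) to the minimum enclosing sphere with slack variables $\xi_i$ described above. The data and the parameters nu and gamma below are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(95, 2)),   # the bulk of the data
               rng.normal(8.0, 1.0, size=(5, 2))])   # a few far-away points (outliers)

# nu bounds the fraction of points allowed to fall outside the description
model = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(X)
labels = model.predict(X)        # +1 inside the description, -1 for the outliers
print(int(np.sum(labels == -1)))
```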
The basic characteristics of the kernel-based clustering algorithms are as follows:
(a) They enable formation and description of clusters having shapes different from spherical or ellipsoidal.
(b) They are well adapted to analysing incomplete data and data containing outliers as well as disturbances (noise).
(c) Their shortcoming consists in the necessity of estimating additional parameters, e.g. the value of $\sigma$ in the case of the Gaussian kernel.
Even though characteristic (a) sounds highly encouraging, it turns out that the classical partitional algorithms from Sect. 2.4 may also be applied in such situations.
We deal with this subject at greater length below.