

3.2 Support Vector Machine Classifier

The SVM is a relatively new computational learning method based on statistical learning theory (SLT) (Vapnik, 1998, 1999), and can serve as an expert system (ES). The SLT deals with supervised learning problems. The original SVM algorithm was developed by Vapnik (Vapnik, 1995), and the current standard incarnation (the soft margin approach) was proposed by Cortes and Vapnik (Cortes and Vapnik, 1995; Vapnik, 1998).

Many real-world problems involve binary classifications or decisions: for example, whether or not maintenance of a machine system is required, or whether or not a patient has a disease.

Basically, these decisions involve a true/false, positive/negative or yes/no answer. One of the best ways of solving such binary problems is the SVM classifier, which can be introduced as a non-probabilistic binary linear classifier. The SVM is an abstract learning machine that learns from a training data set and attempts to make correct predictions on novel data sets. The training data comprise input vectors denoted by $\mathbf{x}_i$, and each input vector has a number of elements called features. Each input vector is paired with a label denoted by $y_i$, and there are $p$ such pairs ($i = 1, 2, \ldots, p$). In a binary classification there are two qualities (or classes) of data, which can be called the positive class and the negative class, with data labels $y_i = +1$ and $y_i = -1$, respectively.
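As a concrete illustration of this data layout, the short Python sketch below builds a small labeled training set; the array shapes and the $\pm 1$ label convention follow the description above, while the feature values themselves are arbitrary and purely for illustration.

```python
import numpy as np

# p = 6 training pairs (x_i, y_i); each x_i has N = 2 features (values are illustrative only).
X = np.array([[2.1, 3.0],
              [1.8, 2.7],
              [2.4, 3.3],    # positive-class samples, y_i = +1
              [-1.5, -2.0],
              [-2.2, -1.1],
              [-1.9, -2.6]]) # negative-class samples, y_i = -1
y = np.array([+1, +1, +1, -1, -1, -1])

print(X.shape, y.shape)  # (6, 2) (6,)
```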

The training data sets can be viewed as labeled data points in an input space, as depicted in Figure 3.1. The plane that separates the two classes of data is called the hyperplane. There could be infinitely many orientations of the hyperplane that separate the two classes of data, as shown in Figure 3.1(a). However, the directed hyperplane found by an SVM is very intuitive: it is the hyperplane that is maximally distant from the two classes of labeled points located on each side. The closest data points on both sides of the hyperplane have the greatest effect on its position and orientation, and are therefore called support vectors.

They are denoted by the dark markers in Figure 3.1(b). The classifying hyperplane is given by $\mathbf{w} \cdot \mathbf{x} + b = 0$, where $b$ is the bias or offset of the hyperplane from the origin in the input space, $\mathbf{x}$ are points located on the hyperplane, and the weight vector $\mathbf{w}$ is normal to the hyperplane and determines its orientation.

In Figure 3.1, the two data clusters can be cleanly separated by a linear hyperplane in a 2-D space. Unfortunately, the clusters are not always linearly separable and may be highly intermeshed, with overlapping data points, as shown in Figure 3.2. This situation is the impetus for the introduction of kernel tricks later in this chapter.

w.xw.xw x  .b   bb 101

Negative class

b w Margin

Support vectors

Test sample

Origin

x1

Origin

(a) Separating (b)

planes

Optim ized h

yperp lane Positive

class

x1

x2 x2

Figure 3.1 (a) Various orientations of the data separating planes; (b) the optimal orientation of the SVM hyperplane

The reason for considering the SVM classifier comes from the theoretical upper bound on the generalization error, that is, the theoretical prediction error when applying the classifier to novel or unseen data points (not the training samples). This generalization bound has two important attributes:

a) The bound is minimized by maximizing the margin, where the margin is defined as the Euclidean distance between the separating hyperplane and the closest data point (support vector) of each class.

b) The bound is independent of the dimensionality of the input space.

Figure 3.2 (a) Intermeshed, non-linearly separable data; (b) using a Gaussian kernel, the data become separable by a non-linear boundary in the feature space
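To make the point of Figure 3.2 concrete, the sketch below uses the scikit-learn library purely as an illustrative tool: it generates intermeshed ring-shaped data, shows that a linear SVM fails on it, and that a Gaussian RBF kernel SVM separates it almost perfectly. The dataset and parameter values are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two intermeshed, non-linearly separable rings, in the spirit of Figure 3.2(a).
X_rings, y_rings = make_circles(n_samples=400, noise=0.08, factor=0.4, random_state=0)
y_rings = 2 * y_rings - 1  # relabel the classes as -1 / +1

# A linear hyperplane cannot separate the rings ...
linear_svm = SVC(kernel="linear").fit(X_rings, y_rings)
print("linear kernel accuracy:", linear_svm.score(X_rings, y_rings))  # roughly 0.5

# ... but a Gaussian RBF kernel finds a non-linear boundary in feature space.
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X_rings, y_rings)
print("RBF kernel accuracy:", rbf_svm.score(X_rings, y_rings))        # close to 1.0
```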

Consider a binary classification case with $p$ sets of training data $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_p, y_p)$ to be classified into two classes, where $\mathbf{x}_i \in \mathbb{R}^N$, $i = 1, 2, \ldots, p$, and $y_i \in \{-1, +1\}$ specifies the class label. Now, let the decision function be

$f(\mathbf{x}) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b)$   (3.1)

From this decision function, it is clear that a data point is correctly classified if it satisfies

$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) > 0$,

since $(\mathbf{w} \cdot \mathbf{x}_i + b)$ is positive for $y_i = +1$ and negative for $y_i = -1$. Note that the decision function is invariant under a positive rescaling of $(\mathbf{w} \cdot \mathbf{x}_i + b)$. Hence, an implicit scale for $(\mathbf{w}, b)$ is defined by setting $\mathbf{w} \cdot \mathbf{x} + b = 1$ for the closest data points on one side and $\mathbf{w} \cdot \mathbf{x} + b = -1$ for the closest data points on the other side. The hyperplanes $\mathbf{w} \cdot \mathbf{x} + b = 1$ and $\mathbf{w} \cdot \mathbf{x} + b = -1$ are called the canonical hyperplanes, and the region between them is called the margin band.

Let $\mathbf{x}_1$ and $\mathbf{x}_2$ be two points on the canonical hyperplanes on either side, that is, $\mathbf{w} \cdot \mathbf{x}_1 + b = +1$ and $\mathbf{w} \cdot \mathbf{x}_2 + b = -1$. For the separating hyperplane $\mathbf{w} \cdot \mathbf{x} + b = 0$, the unit normal giving its orientation is $\mathbf{w}/\|\mathbf{w}\|_2$ (where $\|\mathbf{w}\|_2$ is the square root of $\mathbf{w}^{\mathrm{T}}\mathbf{w}$). The distance between the two canonical hyperplanes is equal to the projection of $(\mathbf{x}_1 - \mathbf{x}_2)$ onto $\mathbf{w}/\|\mathbf{w}\|_2$, that is, $(\mathbf{x}_1 - \mathbf{x}_2) \cdot \mathbf{w}/\|\mathbf{w}\|_2 = 2/\|\mathbf{w}\|_2$. As the margin $\gamma$ is half the distance between the canonical hyperplanes, it is given by $\gamma = 1/\|\mathbf{w}\|_2$. Therefore, maximizing the margin implies minimizing the function given by Eq. (3.2), subject to the constraints given in Eq. (3.3) (Cortes and Vapnik, 1995; Vapnik, 1995).

Minimize

$\dfrac{1}{2}\|\mathbf{w}\|_2^2$   (3.2)

subject to the constraints

$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 \quad \forall\, i$   (3.3)
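For readers who prefer to see Eqs. (3.2)-(3.3) solved numerically, the sketch below solves the hard-margin primal for the linearly separable toy data X, y defined earlier, using the cvxpy modelling library as an assumed tool; it is an illustration of the optimization problem, not part of the methodology itself.

```python
import numpy as np
import cvxpy as cp

# Hard-margin primal of Eqs. (3.2)-(3.3): minimize (1/2)||w||^2  s.t.  y_i (w.x_i + b) >= 1.
w = cp.Variable(X.shape[1])
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w* =", w.value, "  b* =", b.value)
print("margin = 1/||w*|| =", 1.0 / np.linalg.norm(w.value))
```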

Eqs. (3.2) and (3.3) represent a constrained optimization problem. It can be converted into an unconstrained problem using the following Lagrange function, comprising the sum of the objective function and the $p$ constraints multiplied by their respective Lagrange multipliers (Everett III, 1963; Bertsekas, 2014). This function is called the primal and is given as

$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \dfrac{1}{2}\,\mathbf{w} \cdot \mathbf{w} - \displaystyle\sum_{i=1}^{p} \alpha_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right]$   (3.4)

where $\alpha_i$ are the Lagrange multipliers and $\alpha_i \geq 0$. To minimize Eq. (3.4), the partial derivatives with respect to $\mathbf{w}$ and $b$ can be equated to zero. That gives

$\dfrac{\partial L(\mathbf{w}, b, \boldsymbol{\alpha})}{\partial \mathbf{w}} = \mathbf{w} - \displaystyle\sum_{i=1}^{p} \alpha_i y_i \mathbf{x}_i = 0$   (3.5)

$\dfrac{\partial L(\mathbf{w}, b, \boldsymbol{\alpha})}{\partial b} = -\displaystyle\sum_{i=1}^{p} \alpha_i y_i = 0$   (3.6)

From Eq. (3.5) we have

$\mathbf{w} = \displaystyle\sum_{i=1}^{p} \alpha_i y_i \mathbf{x}_i$   (3.7)

and from Eq. (3.6) we have $\displaystyle\sum_{i=1}^{p} \alpha_i y_i = 0$. Substituting these two results back into $L(\mathbf{w}, b, \boldsymbol{\alpha})$, we get the Wolfe dual formulation (Wolfe, 1961). It can be given as

$W(\boldsymbol{\alpha}) = \displaystyle\sum_{i=1}^{p} \alpha_i - \dfrac{1}{2} \displaystyle\sum_{i,j=1}^{p} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$   (3.8)

which must be maximized with respect to the $\alpha_i$, subject to the following constraints,

$\displaystyle\sum_{i=1}^{p} \alpha_i y_i = 0 \quad \text{and} \quad \alpha_i \geq 0$   (3.9)

This dual objective function is quadratic in the Lagrange multipliers $\alpha_i$, and it is subject to the linear constraints of Eq. (3.9). Hence the problem is a constrained quadratic programming problem.
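As a numerical illustration of this quadratic programming problem, the sketch below solves the dual of Eqs. (3.8)-(3.9) for the toy data X, y with a linear kernel, again using cvxpy as an assumed tool. For the linear kernel the quadratic term equals $\frac{1}{2}\|\sum_i \alpha_i y_i \mathbf{x}_i\|^2$, i.e. $\frac{1}{2}\|\mathbf{w}\|^2$ by Eq. (3.7), which keeps the formulation convex for the solver.

```python
import numpy as np
import cvxpy as cp

# Dual hard-margin SVM, Eqs. (3.8)-(3.9), for the linear kernel x_i . x_j.
p = X.shape[0]
alpha = cp.Variable(p)

# For a linear kernel, sum_ij alpha_i alpha_j y_i y_j (x_i . x_j) = || sum_i alpha_i y_i x_i ||^2,
# so the quadratic term can be written as a simple squared norm.
G = X.T @ np.diag(y)                     # columns of G are y_i * x_i
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(G @ alpha))
constraints = [alpha >= 0, alpha @ y == 0]
cp.Problem(objective, constraints).solve()

alpha_star = alpha.value
w_star = G @ alpha_star                  # recover w via Eq. (3.7)
print("Lagrange multipliers:", np.round(alpha_star, 4))  # non-zero only for support vectors
print("w* =", w_star)
```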

Note, however, that the second attribute of the generalization bound stated earlier, namely that the bound is independent of the dimensionality of the input space, has not yet been used in the development of this formulation.

From the dual objective function in Eq. (3.8), it can be seen that the input vectors $\mathbf{x}_i$ only appear in a scalar product. To obtain a different representation, these data vectors can be mapped into another space of different dimensionality, called the feature space, by

$\mathbf{x}_i \cdot \mathbf{x}_j \rightarrow \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$   (3.10)

where $\phi$ represents the mapping function. Here, if we revisit Figure 3.2(a), we can see that if a transformation were possible that shifted the data represented by the circles into a plane parallel to that of the paper, towards the reader, and the data represented by the parallelograms into a plane parallel to the paper, away from the reader, the two clusters of data could be accurately classified. That is the beauty of changing the dimension of the input space. Also, as the bound is independent of the dimensionality of the space, this is a valid transformation. This mapping transformation is called a kernel transformation and is given as

$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$   (3.11)

There are limits on the possible choice of kernel; one obvious restriction is that the feature space has to be an inner product space, called a Hilbert space, in which the inner product is consistently defined. Therefore, a proper choice of kernel function is necessary to minimize the training error. The introduction of a kernel, with its implied mapping to the feature space, is called kernel substitution. Some of the common kernel choices are

Linear kernel: $K(\mathbf{a}, \mathbf{b}) = \mathbf{a} \cdot \mathbf{b}$

Polynomial kernel: $K(\mathbf{a}, \mathbf{b}) = (\mathbf{a} \cdot \mathbf{b} + c)^d$

Gaussian RBF kernel: $K(\mathbf{a}, \mathbf{b}) = e^{-\|\mathbf{a} - \mathbf{b}\|^2 / 2\sigma^2}$

Sigmoid kernel: $K(\mathbf{a}, \mathbf{b}) = \tanh\!\left(\kappa\,(\mathbf{a} \cdot \mathbf{b}) + c\right)$

(3.12)

where $K(\mathbf{a}, \mathbf{b}) = \phi(\mathbf{a}) \cdot \phi(\mathbf{b})$ for the vectors $\mathbf{a}$ and $\mathbf{b}$, and RBF stands for radial basis function.
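The kernels of Eq. (3.12) are straightforward to express in code. The sketch below is a direct transcription, where the parameters c, d, sigma and kappa are user-chosen values; the defaults shown are arbitrary.

```python
import numpy as np

def linear_kernel(a, b):
    """Eq. (3.12), linear: K(a, b) = a . b"""
    return np.dot(a, b)

def polynomial_kernel(a, b, c=1.0, d=3):
    """Eq. (3.12), polynomial: K(a, b) = (a . b + c)^d"""
    return (np.dot(a, b) + c) ** d

def gaussian_rbf_kernel(a, b, sigma=1.0):
    """Eq. (3.12), Gaussian RBF: K(a, b) = exp(-||a - b||^2 / (2 sigma^2))"""
    return np.exp(-np.linalg.norm(a - b) ** 2 / (2.0 * sigma ** 2))

def sigmoid_kernel(a, b, kappa=1.0, c=0.0):
    """Eq. (3.12), sigmoid: K(a, b) = tanh(kappa (a . b) + c)"""
    return np.tanh(kappa * np.dot(a, b) + c)
```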

For a binary classification with a given choice of kernel, the dual objective function in Eq. (3.8) becomes

$W(\boldsymbol{\alpha}) = \displaystyle\sum_{i=1}^{p} \alpha_i - \dfrac{1}{2} \displaystyle\sum_{i,j=1}^{p} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$   (3.13)

subject to the constraints given in Eq. (3.9). For a data point with $y_i = +1$ we note that

$\displaystyle\min_{\{i\,:\,y_i = +1\}} \left( \mathbf{w} \cdot \mathbf{x}_i + b \right) = \min_{\{i\,:\,y_i = +1\}} \left( \sum_{j=1}^{p} \alpha_j y_j K(\mathbf{x}_j, \mathbf{x}_i) + b \right) = 1$   (3.14)

Similarly, an expression can be written for a data point with $y_i = -1$. On combining both equations, we get

$b = -\dfrac{1}{2} \left[ \displaystyle\min_{\{i\,:\,y_i = +1\}} \sum_{j=1}^{p} \alpha_j y_j K(\mathbf{x}_j, \mathbf{x}_i) + \max_{\{i\,:\,y_i = -1\}} \sum_{j=1}^{p} \alpha_j y_j K(\mathbf{x}_j, \mathbf{x}_i) \right]$   (3.15)

Therefore, to construct an SVM binary classifier, the data $(\mathbf{x}_i, y_i)$ are substituted into Eq. (3.13), which is maximized subject to the constraints given in Eq. (3.9). After finding the optimal values of the Lagrange multipliers ($\alpha_i^*$), the bias can be calculated from Eq. (3.15). Thus, for a new input vector $\mathbf{n}$, the predicted class is based on the sign of

$f(\mathbf{n}) = \displaystyle\sum_{i=1}^{p} \alpha_i^* y_i K(\mathbf{x}_i, \mathbf{n}) + b^*$   (3.16)

where $b^*$ denotes the optimum value of the bias. With the hyperplane of maximal margin, only those points that lie closest to the hyperplane have $\alpha_i^* > 0$, and these points are the support vectors. All other points have $\alpha_i^* = 0$, and the decision function is independent of these samples. That is, even if some of these samples were removed, there would be no change in the position or orientation of the hyperplane.
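Putting the pieces together, the short sketch below computes the bias of Eq. (3.15) and the decision function of Eq. (3.16) from a set of optimal multipliers. It assumes `alpha_star` (from the dual solution above), the toy training data X, y, and one of the kernel functions defined earlier are available, so it is a schematic of the procedure rather than a complete training routine.

```python
import numpy as np

def bias_from_dual(alpha_star, X, y, kernel):
    """Eq. (3.15): b* = -1/2 [ min over y_i=+1 of sum_j a_j y_j K(x_j, x_i)
                               + max over y_i=-1 of the same sum ]."""
    wx = np.array([sum(alpha_star[j] * y[j] * kernel(X[j], X[i]) for j in range(len(y)))
                   for i in range(len(y))])
    return -0.5 * (wx[y == +1].min() + wx[y == -1].max())

def predict(n, alpha_star, X, y, b_star, kernel):
    """Eq. (3.16): class of a new input vector n is the sign of
       sum_i alpha_i* y_i K(x_i, n) + b*."""
    f = sum(alpha_star[i] * y[i] * kernel(X[i], n) for i in range(len(y))) + b_star
    return int(np.sign(f))

# Example usage with the linear kernel and the dual solution found earlier.
b_star = bias_from_dual(alpha_star, X, y, linear_kernel)
print(predict(np.array([1.5, 2.0]), alpha_star, X, y, b_star, linear_kernel))  # expected: +1
```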