2.2 SVM Classifier
2.2.1 Binary SVM
The binary SVM deals with a two-class problem: it separates the two classes in a space by a gap that is as wide as possible. When an unseen example is mapped into the same space, the SVM predicts which of the two categories it falls into. The basic model of the SVM is the maximal margin classifier, which works only for data that are linearly separable in the feature space.
However, it cannot be used directly for most real-world problems; it is, nevertheless, the main building block for more complex SVMs. The SVM constructs a hyperplane or a set of hyperplanes, defined by a number of support vectors, in a high- or infinite-dimensional space, which can then be used for classification (Burges, 1998).
The basic principle of the SVM is demonstrated in the two-dimensional plane shown in Figure 2.1 and Figure 2.2. A set of data of two classes, represented by a positive sign (class A) and a negative sign (class B), is considered here. The SVM tries to place a linear boundary (also called the hyperplane) between the two classes in such a way that the distance between the boundary and the closest data point of each class is maximized. These closest data points define the margin and are known as support vectors (SVs). The SVs contain all the information necessary for the prediction of unseen examples; hence, the other data points can be discarded. The margin is chosen as a trade-off between the margin width and the generalization error.
Figure 2.1 depicts a number of possible hyperplanes; a good separation is achieved by the hyperplane that has the largest distance to the closest training data point of either class. Figure 2.2 shows the optimal separating hyperplane that separates the two classes, i.e. the positive and the negative. The SVM classification problem is solved as an optimization problem that maximizes the margin between the two classes; hence, the larger the margin, the smaller the generalization error.
Finally, the SVM algorithm builds a classification model that predicts whether new data fall into the positive class or the negative class (Vapnik, 1999).
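As an informal illustration of this idea, the short sketch below (using Python with scikit-learn, which is an assumption of this illustration rather than part of the original formulation) fits a linear maximal-margin boundary on a toy two-class dataset and predicts the class of an unseen example.

```python
# A minimal sketch (not from the thesis) of the binary SVM idea with
# scikit-learn: fit a linear boundary on two-class data, then predict
# the class of an unseen example.
import numpy as np
from sklearn.svm import SVC

# Two toy classes in the plane: class A (+1) and class B (-1)
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],          # class A
              [-2.0, -2.0], [-2.5, -3.0], [-3.0, -2.5]])   # class B
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear")           # maximal-margin linear boundary
clf.fit(X, y)

print(clf.support_vectors_)          # the SVs that define the margin
print(clf.predict([[1.0, 1.5]]))     # unseen example -> +1 or -1
```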
For an optimal separating hyperplane, consider the problem of separating a set of training vectors belonging to two different classes,
$S = \{(\mathbf{x}_i, y_i) \mid \mathbf{x}_i \in \mathbb{R}^N,\ y_i \in \{-1, +1\},\ i = 1, \ldots, m\}$ (2.1)
where $\mathbf{x}_i$ denotes the $i$th training vector in the dataset, $y_i$ is the label associated with $\mathbf{x}_i$, and $m$ is the number of samples. A separating hyperplane is defined by,
$\mathbf{w} \cdot \mathbf{x} + b = 0; \quad \mathbf{w} \in \mathbb{R}^N,\ b \in \mathbb{R}$ (2.2)

Figure 2.1 Optimal separating hyperplane

Figure 2.2 Linear separating hyperplane for the non-separable data
where $\mathbf{w}$ (a vector normal to the hyperplane) is known as the weight vector, which defines the boundary, $\mathbf{x}$ is the input vector of dimension $N$, and $b$ is a scalar threshold. If the set of training vectors is separated without any error and the vectors closest to the boundary have maximal distance to the
hyperplane, then the set of vectors is said to be optimally separated by the hyperplane. The following decision function can be used to classify a training sample into the positive or the negative class.
$f(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b)$ (2.3)
For linearly separable data, we have
$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) > 0$ (2.4)
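As a quick numerical illustration of Equation 2.4 (with an assumed, purely illustrative weight vector and bias), the following sketch checks the separability condition for a few toy points:

```python
# Sketch: checking the linear-separability condition of Equation 2.4,
# y_i (w . x_i + b) > 0 for every training point, with an assumed w and b.
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 2.5], [-2.0, -2.0], [-3.0, -2.5]])
y = np.array([+1, +1, -1, -1])
w = np.array([1.0, 1.0])       # illustrative weight vector
b = 0.0                        # illustrative bias

margins = y * (X @ w + b)
print(margins)                     # all entries positive -> separated without error
print(bool(np.all(margins > 0)))
```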
Consider the canonical hyperplanes, which are defined by the closest points (the support vectors) on either side of the separating hyperplane, as described in Equation 2.5 and Equation 2.6, whereas Equation 2.7 describes the separating hyperplane itself:
$\mathbf{w} \cdot \mathbf{x} + b = +1$ (2.5)
$\mathbf{w} \cdot \mathbf{x} + b = -1$ (2.6)
and
$\mathbf{w} \cdot \mathbf{x} + b = 0$ (2.7)
A canonical separating hyperplane must satisfy the following constraint,
$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1; \quad i = 1, 2, \ldots, n$ (2.8)
Consider a data point $\mathbf{x}_1$ that is a support vector on the positive side of the hyperplane $(\mathbf{w}, b)$; then the distance $d$ of $\mathbf{x}_1$ from the hyperplane is,
$d = \dfrac{\mathbf{w} \cdot \mathbf{x}_1 + b}{\|\mathbf{w}\|}$ (2.9)
By maximizing the margin $\rho(\mathbf{w}, b)$, the optimal separating hyperplane can be obtained. With $\mathbf{x}_1$ and $\mathbf{x}_2$ denoting support vectors on the positive and negative sides, respectively, the margin is,
$\rho(\mathbf{w}, b) = \dfrac{\mathbf{w} \cdot \mathbf{x}_1 + b}{\|\mathbf{w}\|} - \dfrac{\mathbf{w} \cdot \mathbf{x}_2 + b}{\|\mathbf{w}\|}$ (2.10)
$\rho(\mathbf{w}, b) = \dfrac{2}{\|\mathbf{w}\|}$ (2.11)
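A minimal numerical sketch of Equations 2.9 and 2.11, assuming an illustrative canonical pair $(\mathbf{w}, b)$, shows that the distance of a support vector from the hyperplane is half the margin $2/\|\mathbf{w}\|$:

```python
# Sketch of Equations 2.9-2.11: distance of a support vector from the
# hyperplane and the resulting margin 2/||w||, for an assumed canonical (w, b).
import numpy as np

w = np.array([1.0, 1.0])
b = -1.0
x1 = np.array([1.0, 1.0])              # lies on the positive canonical plane: w.x1 + b = +1

d = (w @ x1 + b) / np.linalg.norm(w)   # Equation 2.9
margin = 2.0 / np.linalg.norm(w)       # Equation 2.11
print(d, margin)                       # d is half the margin
```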
In order to maximize the margin, $\|\mathbf{w}\|$ should be minimized; hence, the optimization problem for the optimal hyperplane becomes,
$\min_{\mathbf{w}} \; \Phi(\mathbf{w}) = \dfrac{1}{2}\|\mathbf{w}\|^2$ (2.12)
subject to,
$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1; \quad i = 1, 2, \ldots, n$ (2.13)
The optimization problem is solved using Lagrange multipliers; the Lagrangian is
$L(\mathbf{w}, b, \alpha) = \dfrac{1}{2}\,\mathbf{w} \cdot \mathbf{w} - \sum_{i=1}^{n} \alpha_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right]$ (2.14)
where $\alpha_i$ are the Lagrange multipliers. The Lagrangian has to be minimized with respect to $b$ and $\mathbf{w}$, and maximized with respect to $\alpha_i \geq 0$.
$\dfrac{\partial L}{\partial b} = \sum_{i=1}^{n} \alpha_i y_i = 0$ (2.15)
and
$\dfrac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i = 0$ (2.16)
This problem is very difficult to solve directly because the constraints are quite complex. Hence, Lagrangian duality theory is used for simplifying this problem. This approach leads to solving the following dual problem:
$\max_{\alpha} \; L_D(\alpha) = \sum_{k=1}^{m} \alpha_k - \dfrac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle$ (2.17)
subject to the constraints,
$\sum_{j=1}^{n} \alpha_j y_j = 0 \quad \text{and} \quad \alpha_i \geq 0, \; i = 1, 2, \ldots, n$ (2.18)
This dual problem is computationally easier because its constraints are much simpler. After solving the above problem, the optimal separating hyperplane is given by,
$\mathbf{w}^* = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i$ (2.19)
and
$b^* = -\dfrac{1}{2}\,\langle \mathbf{w}^*, \mathbf{x}_p + \mathbf{x}_q \rangle$ (2.20)
where $\mathbf{x}_p$ and $\mathbf{x}_q$ are any support vectors from each class satisfying,
$\alpha_p, \alpha_q > 0; \quad y_p = 1; \quad y_q = -1$ (2.21)
The linear discriminant function can then be written as,
$f(\mathbf{x}) = \mathbf{w}^* \cdot \mathbf{x} + b^* = \sum_{i=1}^{m} \alpha_i y_i \langle \mathbf{x}_i, \mathbf{x} \rangle + b^*$ (2.22)
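The relations in Equations 2.19 and 2.22 can be verified numerically. The sketch below assumes scikit-learn (not prescribed by this text) and approximates the hard-margin problem with a very large penalty value; the stored dual coefficients $\alpha_i y_i$ recover both $\mathbf{w}^*$ and the discriminant value:

```python
# A sketch (scikit-learn assumed for illustration) verifying Equations 2.19 and
# 2.22 numerically: w* = sum_i alpha_i*y_i*x_i and f(x) = w*.x + b*.
# A very large C approximates the hard-margin problem.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.array([+1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors (Equation 2.19)
w_star = clf.dual_coef_ @ clf.support_vectors_
b_star = clf.intercept_

x_new = np.array([[0.5, 1.0]])
f_manual = x_new @ w_star.ravel() + b_star                    # Equation 2.22
print(np.allclose(f_manual, clf.decision_function(x_new)))    # True
print(np.allclose(w_star, clf.coef_))                         # True
```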
The above optimization problem can handle only the linearly separable case. However, there are cases where the data are not linearly separable; in that case the margin is made "soft", which means that in-sample classification errors occur and also have to be minimized. In order to handle the non-linearly separable case, let $\xi_i$ be slack variables, which allow misclassification of some data points in order to decrease the computational complexity, and let $C$ be a penalty parameter (soft-margin parameter), which controls the trade-off between misclassification and boundary complexity. If $\xi_i = 0$ for all $i$, all the data points are perfectly classified. The following constrained optimization problem has to be solved in order to deal with the non-linearly separable case (Gunn, 1998):
$\min_{\mathbf{w}, \xi} \; \Phi(\mathbf{w}, \xi) = \dfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$ (2.23)
subject to
$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i; \quad i = 1, 2, \ldots, n$ (2.24)
$\xi_i \geq 0$ (2.25)
This formulation is known as the soft-margin SVM. Here, $C$ controls the classification behaviour of the SVM. If $C$ is large, a higher weight is given to in-sample misclassifications; consequently, the classifier may fit the training data very closely but fail to generalize, i.e. it may work well on the training data but perform poorly on unseen examples. This is called overfitting. Hence, by choosing a lower $C$, the risk of overfitting the SVM to the training data is reduced. Note that $C$ is directly related to the width of the margin: the smaller $C$ is, the wider the margin and the more classification errors are permitted.
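The effect of $C$ on the margin can be observed directly; the sketch below (again assuming scikit-learn purely for illustration) trains the soft-margin SVM of Equations 2.23 to 2.25 on overlapping classes for several values of $C$:

```python
# Sketch of the soft-margin trade-off in Equations 2.23-2.25: a smaller C
# tolerates more margin violations (wider margin), a larger C penalises them.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1, 1.0, (50, 2)), rng.normal(-1, 1.0, (50, 2))])
y = np.array([+1] * 50 + [-1] * 50)        # overlapping classes -> not separable

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)            # Equation 2.11
    print(f"C={C}: margin width={margin:.3f}, "
          f"support vectors={clf.n_support_.sum()}")
```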
For the non-linearly separable case, the SVM can create a hyperplane that allows linear separation in a higher dimension. This is achieved by a transformation $\phi(\mathbf{x})$ of the data from the $N$-dimensional input space to a $Q$-dimensional feature space, as shown in Figure 2.3.
$\mathbf{s} = \phi(\mathbf{x}); \quad \text{where } \mathbf{x} \in \mathbb{R}^N, \; \mathbf{s} \in \mathbb{R}^Q$ (2.26)
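As a toy illustration of this transformation (the mapping $\phi(x) = (x, x^2)$ is assumed here purely for demonstration), one-dimensional data that cannot be separated linearly become separable in a two-dimensional feature space:

```python
# Sketch of Equation 2.26: one-dimensional data that is not linearly separable
# becomes separable after the illustrative mapping phi(x) = (x, x^2).
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0]).reshape(-1, 1)
y = np.array([+1, +1, -1, -1, -1, +1, +1])   # outer points vs inner points

phi = np.hstack([x, x ** 2])                 # N=1 -> Q=2 feature space

clf = SVC(kernel="linear").fit(phi, y)
print(clf.score(phi, y))                     # expected 1.0: separable in the feature space
```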
The mapping, or transformation, can be incorporated into the SVM by replacing the inner product,
$\langle \mathbf{x}, \mathbf{x}_i \rangle \rightarrow \langle \phi(\mathbf{x}), \phi(\mathbf{x}_i) \rangle$ (2.27)
A kernel function can be used to perform the mapping and the dot product in a single step, provided the mapping can be replaced by an equivalent kernel function (Boser et al., 1992). This reduces the computational load while retaining the effect of the higher-dimensional transformation. The kernel function is given as:
$k(\mathbf{x}, \mathbf{x}_i) = \phi(\mathbf{x}) \cdot \phi(\mathbf{x}_i)$ (2.28)
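The kernel trick of Equations 2.27 and 2.28 can be demonstrated with a small example: for the illustrative mapping $\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, the feature-space dot product equals the polynomial kernel $(\mathbf{x} \cdot \mathbf{z})^2$, so the mapping never has to be computed explicitly:

```python
# Sketch of Equations 2.27-2.28: phi(x).phi(z) equals the kernel (x.z)^2
# for the illustrative mapping phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
import numpy as np

def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = phi(x) @ phi(z)          # explicit mapping, then dot product
rhs = (x @ z) ** 2             # kernel evaluated in the input space
print(np.isclose(lhs, rhs))    # True
```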
Many kernel functions are available, such as the Gaussian radial basis function (RBF), polynomial, and certain sigmoid kernels. In this work, the RBF kernel is adopted because it is widely accepted and well established (Hsu and Lin, 2002). The RBF kernel can be expressed as,
$k(\mathbf{x}, \mathbf{x}_i) = \exp\left( -\dfrac{\|\mathbf{x} - \mathbf{x}_i\|^2}{2\sigma_b^2} \right); \quad \gamma = \dfrac{1}{2\sigma_b^2} > 0$ (2.29)
where $\sigma_b$ is the width of the RBF kernel and $\gamma$ is the kernel parameter. After solving the optimization problem, the basic form of the SVM will be
$f(\mathbf{x}) = \sum_{i=1}^{m} \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + b$ (2.30)
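The sketch below illustrates Equations 2.29 and 2.30 under the assumption that scikit-learn is used: the library's gamma parameter is set to $1/(2\sigma_b^2)$, and the SVM output is rebuilt from the stored support vectors and compared with the value returned by the library:

```python
# Sketch of Equations 2.29-2.30 (scikit-learn assumed for illustration only):
# the RBF kernel with gamma = 1/(2*sigma_b^2), and the SVM output rebuilt from
# the support vectors as f(x) = sum_i alpha_i*y_i*K(x, x_i) + b.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

sigma_b = 1.5                          # assumed width of the RBF kernel
gamma = 1.0 / (2.0 * sigma_b ** 2)     # the kernel parameter of Equation 2.29

# Hand-computed kernel value versus the library implementation
x = np.array([[1.0, 2.0]])
xi = np.array([[0.0, -1.0]])
manual = np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma_b ** 2))
print(np.isclose(manual, rbf_kernel(x, xi, gamma=gamma)[0, 0]))   # True

# Train an RBF SVM and rebuild its output from the support vectors (Eq. 2.30)
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(+1, 1.0, (30, 2)), rng.normal(-1, 1.0, (30, 2))])
y = np.array([+1] * 30 + [-1] * 30)
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

x_new = np.array([[0.2, -0.3]])
K = rbf_kernel(x_new, clf.support_vectors_, gamma=gamma)          # K(x, x_i)
f_manual = K @ clf.dual_coef_.ravel() + clf.intercept_            # alpha_i*y_i in dual_coef_
print(np.allclose(f_manual, clf.decision_function(x_new)))        # True
```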
Figure 2.3 Mapping from the input space to the high dimensional space
The above formulation of the SVM is based on binary classification; the multiclass case can be handled by combining several binary classifiers.
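A minimal sketch of this idea, assuming scikit-learn and a one-vs-rest decomposition (one of several possible combination schemes), is given below:

```python
# Sketch: extending the binary SVM to several classes by combining binary
# classifiers, here with a one-vs-rest scheme (one binary SVM per class).
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.5, (20, 2)) for c in (-3.0, 0.0, 3.0)])
y = np.repeat([0, 1, 2], 20)                  # three classes

clf = OneVsRestClassifier(SVC(kernel="rbf", gamma=0.5, C=1.0)).fit(X, y)
print(clf.predict([[0.1, -0.2], [2.8, 3.1]]))
```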