Lecture-03: Support Vector Machines – Separable Case

(1)

Lecture-03: Support Vector Machines – Separable Case

1 Linear binary classification

Let the input space beX=_R^Nfor the number of dimensionsN⩾1, the output spaceY={−1, 1}, and the target function be some mappingc:X→_Y.

Definition 1.1. For a vectorw∈_R^Nand scalarb∈R, we define a hyperplaneE_w,b≜ⁿx∈_R^N:^⟨w,x⟩_∥w∥

2 =−_∥w∥^b

2

o. Definition 1.2. Distance of a point from a set is defined asd(x,A)≜min{d(x,y)_:y∈A}. Forx,y∈R^N, the distanced(x,y)≜∥x−y∥²₂.

Lemma 1.3. For a vector w∈_R^Nand b∈R, we have d(0,E_w,b) =−_∥w∥^b .

Proof. For a vectorw, we define a unit vectoru≜_∥w∥^w . It follows thatx0≜−_∥w∥^b ulies on the hyperplane Ew,b, which is parallel to the unit vectoruand at distanced(0,x0) =−_∥w∥^b from the origin. Any point x∈E_w,bon the hyperplane can be written as a sum of two orthogonal vectorsx=x0+x−x0where<

x−x₀,w>=0. Therefore,d(_0,x)²=d(_0,x₀)²+d(x₀,x)²⩾d(_0,x₀²), and henced(_0,E_w,b) =d(_0,x₀)_.

Remark1. A hyperplaneE_w,b=x∈_R^N:⟨w,x⟩+b=0 is defined in terms of the unit vectorw/∥w∥₂ and its distance−b/∥w∥₂from the origin.

Lemma 1.4. The distance of any point x∈R^Nto a hyperplane Ew,bis given by d(x,Ew,b) = ^{|⟨w,x⟩+b|}_∥w∥ . Proof. Letu=w/∥w∥be the unit vector in the direction ofw. Any pointyon a hyperplaneE_w,b, can be written as sum of two orthogonal vectorsy=x0+y−x0where<y−x0,x0>=0 andx0=−_∥w∥^b u.

Any pointx∈R^Ncan be represented asx=⟨x,u⟩u+v, such that⟨v,w⟩=0. Therefore, d(x,E_w,b)²= min

y∈E_w,bd(x,y)²= min

y∈E_w,bd(x0+y−x0,⟨x,u⟩u+v)²⩾⟨x,w⟩+b

∥w∥ 2

.

Remark2. The distance of a pointx∈_R^Nfrom the hyperplaneE_w,bis given byd(x,E_w,b). If⟨w,x⟩+b>

0, then the point x lies above the hyperplaneE_w,b, and if⟨w,x⟩+b<0, then point x lies below the hyperplaneE_w,b.

Assumption 1.5. We are given a training samplez∈(_X×_Y)^mconsisting ofmlabeled training examples z_i= (x_i,y_i)∈_X×Y, where each examplex_i is generatedi.i.d.by a fixed but unknown distributionD, and the labelyi=c(xi)for an unknown conceptc:X→_Y.

Assumption 1.6. We define the hypothesis set as a collection of separating hyperplanes H≜ⁿx7→sign(⟨w,x⟩+b):w∈R^N,b∈Ro

.

Remark3. Any hypothesish∈His identified by the pair(_w,_b)_{such that}_h(_x) =_sign(⟨w,x⟩+_b)_{for all} x∈R^N. A hypothesish∈ Hlabels positively all points falling on one side of the hyperplaneE_w,b≜ x∈_R^N:⟨w,x⟩+b=0 and labels negatively all others. This problem is referred to aslinear binary classification problem.

Remark4. For the Hamming loss functionL(y,y^′) =₁_{y̸=y′}, the generalization error isRD(h) =P{h(x)̸=c(x)}. The objective is to select anh∈Hsuch that the generalization errorRD(h)is minimized.

1

(2)

2 SVMs — separable case

Support vector machines are one of the most theoretically well motivated and practically most effective binary classification algorithms. We first introduce this algorithm for separable datasets, then present its general version for non-separable datasets.

Assumption 2.1. LetT1≜{i∈[m]:yi=−1}andT2≜{i∈[m]:yi=1}. The training samplezcan be separated into two non-empty sets by a hyperplane. That is, there exists a hyperplaneE_w,b such that [m] =T₁∪T₂and

T₁={i∈[m]:h(x_i) =⟨w,x_i⟩+b>0}, T₂={i∈[m]:h(x_i) =⟨w,x_i⟩+b<0}. Remark5. For such a hyperplaneE_w,b, we havey_ih(x_i)>0 for alli∈[m].

Remark6. LetE_w,bbe one of infinite such planes. Which hyperplane should a learning algorithm select?

The solutionE_w^∗_,b^∗ returned by the SVM algorithm is the hyperplane with the maximummargin, or the distance to the closest points, and is thus known as themaximum-margin hyperplane.

2.1 Primal optimization problem

Assumption??confirms the existence of at least one pair(w,b)and labeled sample(x,y)∈Tsuch that

⟨w,x⟩+b̸=0. We can normalize the pair(w,b)by the scalar min_i∈[m]|⟨w,x⟩+b|, such that if the closest point isx0∈S, then

|⟨w,x0⟩+b|=1.

We define this representation of the hyperplaneEw,b as thecanonical hyperplane. Letρbe the mini- mum distance of any point to the canonical hyperplane, i.eρ≜mini∈[m]|⟨w,x_i⟩+b|

∥w∥ = _∥w∥¹ . The maximiz- ing the margin is equivalent to minimizing the norm∥w∥or ¹₂∥w∥².

We can show in a figure the margin for a maximum-margin hyperplane with a canonical representation(w,b). We also see themarginal hyperplanes, parallel to the separating hyperplane and passing through the closest points on the negative or positive sides. Since they are parallel to the separating hyperplane, they admit the same normal vector w. By the definition of a canonical representation, for a pointxon a marginal hyperplane,|⟨w,x⟩+_b|=1, and thus the marginal hyperplanes are

⟨w,x⟩+b=±1. Correct classification is achieved for a labeled point(_x_i_,y_i)_wheny_i=_sign(⟨w,xi⟩+b)_. Since|⟨w,x_i⟩+b|⩾1 for all labeled points(_x_i,y_i)by the definition of canonical hyperplanes, a correct classification is achieved when

y_i(⟨w,x_i⟩+b)⩾1 for alli∈[m].

Hence our original problem statement translates to finding (w,b)so as to maximize the marginρ such that all points are correctly separated is equivalent to

minw,b

1

2∥w∥² (1)

subject to: yi(⟨w,xi⟩+b)⩾1 for alli∈[m].

The objective function F:w7→ ¹₂∥w∥² is infinitely differentiable, its gradient is∇w(F) =_w _{and its} Hessian is the identity matrix∇²F(_w) =_Iwith strictly positive eigenvalues. Therefore,∇²F(_w)≻0 and Fis strictly convex. The constraints are all defined by the affine functionsg_i:(w,b)7→1−y_i(⟨w,xi⟩+b) and are thus qualified. Thus the optimization problem in (??) has a unique solution, and can be solved by aquadratic program.

2.2 Support vectors

In this section, we will show that the normal vectorwto the resulting hyperplane is a linear combination of some feature vectors, referred to assupport vectors. Consider the dual variablesα_i⩾0 for alli∈[m] associated to themaffine constraints and letα≜(α_i:i∈[m]). Then, we can define the Lagrangian for all canonical pairs(w,b)∈_R^N+1and Lagrange dual variablesα∈_R^m₊as

L(w,b,α)≜¹

2∥w∥²−

∑

m i=1

α_i[yi(⟨w,xi⟩+b)−1].

2

(3)

Since the primal problem in (??) has convex cost function with affine constraints,E_w^∗_,b^∗ is the optimal separating cannonical hyperplane if and only if there existsα^∗∈_R^m₊that satisfies the following three KKT conditions:

∇_w_L|_w=w^∗=w^∗−

∑

m i=1

α_iy_ix_i=0, ∇_b_L|_b=b∗=−

∑

m i=1

α^∗_iy_i=0, α^∗_i[y_i(⟨w^∗,x_i⟩+b^∗)−1] =0.

Remark7. The complementary condition implies thatα^∗_i =0 if the labeled points are not on the supporting hyperplane, i.e.yi(⟨w^∗,xi⟩+b^∗)̸=1.

Definition 2.2 (Support vectors). We can define thesupport vectorsas the examples or feature vectors

for which the corresponding Lagrange variableα^∗_i ̸=0, i.e.S≜i∈[m]:α_i^∗̸=0 ⊆ {i∈[m]:yi(⟨w^∗,xi⟩+b^∗) =1}. Remark 8. The optimal primal variables (w^∗,b^∗) in the SVM solution are the stationary points of the

associated Lagrangian, and hence we can write the normal vector as a linear combination of support vectors, i.e.

w^∗=

∑

m i=1

α^∗_iyixi=

∑

i∈S

α^∗_iyixi. (2)

Remark9. Support vectors completely determine the maximum-margin hyperplane solution. Vectors not lying on the marginal hyperplane do not affect the definition of these hyperplanes.

Remark10. The slope of the hyperplanew^∗is unique but the support vectors are not unique. A hyperplane is sufficiently determined byN+1 points inNdimensions. Thus, when more thanN+1 points lie on a marginal hyperplane, different choices are possible for theN+1 support vectors.

Remark11. We have expressed the normal vectorw^∗for the optimal hyperplane, in terms of the optimal dual variableα^∗∈_R^m₊. We have not yet found the optimal dual variables, or the normalized distance b^∗.

2.3 Dual optimization problem

In this section, we will show that the hypothesish∈Hand distancebcan be expressed as inner products. To this end, we look at the the dual form of the constrained primal optimization problem (??).

Recall that the dual function F(α) =inf_w,bL(w,b,α). The LagrangianLis minimized at the optimal primal variables(w^∗,b^∗)such that∇_wL(w^∗,b^∗) =∇_bL(w^∗,b^∗) =0 to write the optimal normal vector w^∗=_∑^m_i=1α_iyixiin terms of the dual variablesα∈R^m₊as expressed in (??), together with the constraint

∑^m_i=1α_iyi=0.

Definition 2.3 (Gram matrix). For a labeled samplez∈(_X×_Y)^m, we can define a Gram matrix A∈ R^m×mdefined by the(i,j)th entriesA_ij≜y_ix_i,y_jx_j

for alli,j∈[m].

Remark12. The matrixAis the Gram matrix associated with vectors(y₁x1, . . . ,ymxm)and hence is positive semidefinite. We can easily check that for anyα∈_R^m, we have

α^TAα=

∑

i,j∈[m]

α_iy_ix_i,α_jy_jx_j

=

i∈[m]

∑

α_iy_ix_i

2

⩾0.

Substitutingw^∗=_∑^m_i=1α_iyixi,∑^m_i=1α_iyi=0, and the definition of Gram matrixA, in the Lagrangian L(w^∗,b^∗,α), we can write the dual function asF(α) =L(w^∗,b^∗,α) =_∑^m_i=1α_i−¹₂_∑^m_i=1_∑^m_j=1α_iAijα_j. There- fore, we can write the dual SVM optimization problem as

max

α

∥α∥₁−¹

2α^TAα (3)

subject to: αi⩾0, for alli∈[m], and

∑

m i=1

αiyi=0.

The objective functionG:α7→ ∥α∥₁− ¹₂α^TAα is infinitely differentiable, and its Hessian is given by

∇²G=−A⪯0, and hence Gis a concave function. Since the constraints are affine and convex, the dual maximization problem (??) is equivalent to a convex optimization problem. SinceGis a quadratic function of Lagrange variablesα, this dual optimization problem is also a quadratic program, as in the case of the primal optimization. Since the constraints are affine, they are qualified and strong duality

3

(4)

holds. Thus, the primal and dual problems are equivalent, i.e., the solutionα^∗of the dual problem (??) can be used directly to determine the hypothesis returned by SVMs. Using (??) for the normal to the supporting hyperplane, we can write the hypothesis

h(x) =sign(⟨w^∗,x⟩+b^∗) =sign

∑

m i=1

α^∗_iyi⟨xi,x⟩+b^∗

! .

For any support vectorx_ifori∈S, we havey_i=⟨w^∗,xi⟩+b^∗, and hence we can write for allj.∈S b^∗=y_j−

∑

m i=1

α_i^∗y_i x_i,x_j

. (4)

Combining the above two results, we get for anyj∈S h(x) =sign

∑

i∈S

α^∗_iy_i

x_i,x−x_j +y_j

! .

Remark13. The hypothesis solution depends only on inner products between vectors and not directly on the vectors themselves.

Remark14. Since (??) holds for alli∈S, that is for allisuch thatα^∗_i ̸=0, we can write 0=

∑

m i=1

α^∗_iyib^∗=

∑

m i=1

α^∗_iy²_i −

∑

m i,j=1

α^∗_iAi,jα^∗_j =

∑

m i=1

α^∗_i − ∥w^∗∥². That is, we can write the optimal marginρasρ²= ¹

∥w^∗∥²₂ = _∥α¹∗∥₁.

4