Lecture-12: Multi-class classification: Generalization bounds

(1)

Lecture-12: Multi-class classification: Generalization bounds

1 Generalization bounds: mono-label case

In the binary setting, classifiers are often defined based on the sign of a scoring function. In the multi- class setting, a hypothesis is defined based on a scoring functionh:X×Y→R. The label associated to pointxis the one resulting in the largest scoreh(x,y), which defines the following mapping fromXto Y

x7→arg max{h(x,y):y∈Y}.

Definition 1.1. The marginρ_h(x,y)for a scoring functionh:X→Yat a labeled example(x,y)_{is defined} as

ρ_h(x,y)≜h(x,y)−max

y̸=y^′h(x,y^′).

A scoring functionhmisclassifies an examplexiffρ_h(x,y)⩽0. For anyρ>0, we can define theempir- ical margin lossof a hypothesishfor multi-class classification as

Rˆ_S,ρ(h) = ¹ m

∑

m i=1

Φρ(ρ_h(x_i,y_i)), whereΦρis the margin loss function defined asΦρ(x) =₁_{x⩽0}+ (1−^x

ρ)₁_{0⩽x⩽ρ}. Remark1. SinceΦρ(x)⩽1_{x⩽ρ}, we obtain ˆRS,ρ(h)⩽_m¹∑^mi=11_{ρ_h_(x_i_,y_i_)⩽ρ}.

Lemma 1.2. LetF1, . . . ,FLbe L hypothesis sets inR^X, and letG≜{max{h₁, . . . ,h_L}:h_i∈Fi,i∈[L]}. Then, for any sample S of size m, the empirical Rademacher complexity ofGcan be upper bounded as

RˆS(G)⩽

∑

L ℓ=1

RˆS(Fℓ).

Proof. Let S= (x₁, . . . ,xm)∈_X^m be a sample of size m. We show this for L=2, and then it follows inductively. We observe that forh1∈F1,h2∈F2, we haveh1∨h2= ¹₂(h1+h2+|h1−h2|). Therefore, we can write from the definition of Rademacher complexity, that

RˆS(G) = ¹

mEσ[ sup

h₁∈F1,h₂∈F2

∑

m i=1

σ_imax{h1(xi),h2(xi)}]

= ¹

2mEσ[ sup

h₁∈F1,h₂∈F2

∑

m i=1

σ_i(h1+h2+|h1−h2|)(xi)]

⩽¹

2(R^ˆS(F1) +R^ˆS(F2)) + ¹

2mEσ[ sup

h₁∈F1,h₂∈F2

∑

m i=1

σ_i|h1−h2|(xi)]. The result follows from Talagrand’s Lemma sincex7→ |x|is 1-Lipschitz.

Definition 1.3. For any familyHof hypotheses mappingX×Y→R, we define Π1(_H)≜{x7→h(x,y):y∈_Y,h∈_H}.

Remark2. Recall that for a family of functionsG⊆[0, 1]^Z and anyi.i.d.sampleS∈_Z^m, we have with probability greater than or equal to 1−δ, for any functiong∈_G

Eg(z)⩽ ¹ m

∑

m i=1

g(zi) +2Rm(G) + s

ln¹_δ 2m. 1

(2)

This follows from the application of McDiarmid’s inequality. In addition, recall that the empirical Rademacher complexity of familyGforσ:Ω→ {−1, 1}^mi.i.d.Rademacher random sequence, is given by

Rm(G)≜ ¹ mEσ

h sup

g∈G

∑

m i=1

σ_ig(zi)ⁱ.

Theorem 1.4 (Margin bound for multi-class classification). LetH⊆_R^X^×^Ybe a hypothesis set withY= [k]. Fix ρ>0. Then, for any δ>0, with probability at least 1−δ, the following multi-class classification generalization bound holds for all h∈_H

R(h)⩽R^ˆS,ρ(h) +^4k

ρ Rm(_Π₁(_H)) + s

ln¹_δ 2m.

Proof. Let us define the marginρ_θ,h(x,y)≜min_y^′∈Y[h(x,y)−h(x,y^′) +θ1_{y=y^′_}]for some constantθ>0.

Then, we observe that

ρ_θ,h(x,y)⩽min

y^′̸=y[h(x,y)−h(x,y^′) +_θ1_{y=y′}] =ρ_h(x,y)_.

Therefore, it follows that1{ρ_h(x,y)⩽0}⩽1{ρθ,h(x,y)⩽0}^{. Since}¹^{u⩽0}⩽Φρ(u)for allu∈_{R, we have}R(h) =

E1{ρ_h(x,y)⩽0}⩽E1{ρ_θ,h(x,y)⩽0}⩽EΦρ(ρ_θ,h(x,y)). Defining the family of functions ˜H≜(x,y)7→ρ_θ,h(x,y):h∈_H and family of composition of functions ˜H≜Φρ◦h^˜: ˜h∈_H^˜ . Applying the remark to family ˜H, we get

R(h)⩽EΦρ(ρ_θ,h(x,y))⩽ ¹ m

∑

m i=1

Φρ(ρ_θ,h(xi,yi)) +2Rm(H) +^˜ s

ln¹_δ 2m.

Fixingθ=2ρ, we observe thatρ_θ,h(x_i,y_i) =ρ_h(x_i,y_i)if ρ_h(x_i,y_i)<0, andρ_θ,h(x_i,y_i) =2ρ⩽ρ_h(x_i,y_i) otherwise. This implies that

Φρ(ρ_θ,h(x_i,y_i)) =₁{ρθ,h(x_i,y_i)⩽0}+ (1−^ρ^θ,h(x_i,y_i)

ρ )₁{0⩽ρθ,h(x_i,y_i)⩽ρ}=₁{ρθ,h(x_i,y_i)⩽0}=₁_{ρ

h(x_i,y_i)⩽0}=_Φ_ρ(ρ_h(x_i,y_i)). From Talagrand’s Lemma, we haveRm(H)^˜ ⩽¹_ρRm(H^˜)sinceΦρis¹_ρ-Lipschitz function. Therefore, with

probability at least 1−δ, we have for allh∈_H R(h)⩽R^ˆ_S,ρ(h) + ²

ρRm(_H^˜) + s

ln¹_δ 2m. It suffices to show thatRm(_H^˜)⩽2kRm(_Π₁(_H)). To this end, we write

Rm(_H^˜) = ¹ mES,σ

hsup

h∈H

∑

m i=1

σ_i h(x_i,y_i)−max

y (h(x_i,y)−_2ρ1_{y=y

i})ⁱ

⩽ ¹ mES,σ

hsup

h∈H

∑

m i=1

σ_ih(x_i,y_i)ⁱ+ ¹ mES,σ

hsup

h∈H

∑

m i=1

σ_imax

y (h(x_i,y)−_2ρ1_{y=y

i})ⁱ.

We bound both the terms on the right hand side of the above equation individually. Definingϵ_i ≜ 21{y_i=y}−1∈ {−1, 1}, and observing thatσ_iϵ_iandσ_ihave identical distribution, we can write the first term as

1 2mES,σ

hsup

h∈H

∑

m i=1

∑

y∈Y

σ_ih(x_i,y)(ϵ_i+1)ⁱ⩽

∑

y∈Y

1 2mES,σ

hsup

h∈H

∑

m i=1

σ_ih(x_i,y)(ϵ_i+1)ⁱ⩽kRm(_Π₁(_H)). We apply Lemma??to the second term, to obtain

1 mES,σ

h sup

h∈H

∑

m i=1

σ_imax

y (h(x_i,y)−2ρ1{y=y_i})ⁱ⩽

∑

y∈Y

1 mES,σ

h sup

h∈H

∑

m i=1

σ_i(h(x_i,y)−2ρ1{y=y_i})ⁱ

=

∑

y∈Y

1

mES,σsup

h∈H

∑

m i=1

σ_i(h(x_i,y)⩽kRm(_Π₁(_H)).

Remark3. Larger margin means smaller second term and larger first term. That is, there is a trade-off between empirical error and complexity.

2

(3)

1.1 Rademacher complexity of family Π

1

( _H )

LetK:X×X→Rbe a PDS kernel and letΦ:X→Hbe the associated feature map. In multi-class classification, a kernel-based hypothesis is based onkweight vectorsw1, . . . ,wk∈H, where each weight vectorwidefines a scoring functionx7→ ⟨wi,Φ(x)⟩for eachi∈[k]and the class associated to pointx∈X is given by arg maxy∈Y

wy,Φ(x). LetW≜w₁ · · · w_kT

and forp⩾1 we define theL_H,pgroup norm ofWas

∥W∥_H,p≜

∑

k i=1

∥wi∥^p_H

1 p.

For anyp⩾1, the family of kernel-bases hypotheses under consideration is HK,p≜ⁿ(x,y)7→wy,Φ(x):∥W∥_H_,p⩽Λo

.

Proposition 1.5 (Rademacher complexity of multi-class kernel-based hypotheses). Let K:X×X→R be a PDS kernel and letΦ:X→Hbe the associated feature mapping. Assume that there exists r>0such that K(x,x)⩽r²for all x∈X. Then, for any m∈N, we have

Rm(_Π₁(_H_K,p))⩽

rr²Λ² m .

Proof. LetS∈X^mbe ani.i.d.sample. We observe that for each weight vector, we have∥wi∥_H⩽∥W∥_H,p for alli∈[_k]. Thus, for anyW∈HK,p, we havewi⩽Λfor alli∈[_k]. Therefore, form Cauchy-Schwarz and Jensen’s inequality, we have

Rm(Π1(HK,p)) = ¹ mES,σ

h sup

y∈Y,∥W∥⩽Λ

* wy,

∑

m i=1

σ_iΦ(xi) +

i

⩽^Λ mES,σ

∑

m i=1

σ_iΦ(xi) _H

⩽ ^Λ m

ES,σ

∑

m i=1

σ_iΦ(xi)

2

H

¹₂ .

Since the Rademacher random sequenceσisi.i.d.zero mean, we getES,σ∥_∑^m_i=1σ_iΦ(x_i)∥²_H=_E_S_∑^m_i=1∥Φ(x_i)∥²_H= ES∑^mi=1K(x_i,x_i)⩽mr², and the result follows.

Corollary 1.6 (Margin bound for multi-class classification with kernel-based hypotheses). Let K: X×X→Rbe a PDS kernel and letΦ:X→Hbe an associated feature map. Assume that there exists r>0 such that K(x,x)⩽r²for all x∈X. Fixρ>0. Then, with probability at least1−δfor all h∈HK,p

R(h)⩽R^ˆS,ρ(h) +4k s

r²Λ² ρ²m +

s ln¹_δ

2m.

3

Lecture-12: Multi-class classification: Generalization bounds