AN OPTIMAL BOOSTING ALGORITHM BASED ON NONLINEAR CONJUGATE GRADIENT METHOD
JOOYEON CHOI, BORA JEONG, YESOM PARK, JIWON SEO, AND CHOHONG MIN†

ABSTRACT. Boosting, one of the most successful algorithms for supervised learning, searches for the most accurate weighted sum of weak classifiers. The search corresponds to a convex program with a non-negativity constraint and an affine constraint. In this article, we propose a novel conjugate gradient algorithm with the modified Polak-Ribière-Polyak conjugate direction. The convergence of the algorithm is proved, and we report its successful applications to boosting.
1. INTRODUCTION
Boosting refers to constructing a strong classifier based on the given training set and weak classifiers, and has been one of the most successful algorithms for supervised learning [1, 9, 8].
The first and seminal boosting algorithm, named AdaBoost, was introduced by [3]. AdaBoost can be understood as a gradient descent algorithm that minimizes the margin, a measure of the confidence of the strong classifier [10, 7, 3].
Though simple and explicit, Adaboost is still one of the most popular boosting algorithms for classification and supervised learning. According to the analysis in [10], Adaboost tries to minimize a smooth margin. The hard margin is a direct sum of the confidences on the examples, and the soft margin applies the log-sum-exponential function instead. LPBoost, invented by [4, 2], minimizes the hard margin, resulting in a linear program. It has been observed that LPBoost does not perform as well as Adaboost in most cases [11].
The strong classifier is a weighted sum of the weak classifiers. Adaboost determines the weight by a stagewise, unconstrained gradient descent, enlarging the support of the weight by one element at each iteration. Because the search is stagewise and stops once the support is large enough, Adaboost does not perform an optimal search.
The optimal solution needs to be sought among all linear combinations of the weak classifiers. The optimization becomes well posed with a constraint that the sum of the weights is bounded, and the bound was observed to be proportional to the support size of the weight [11].
In this article, we propose a new and efficient algorithm that solves the constrained optimization problem. Our algorithm is based on the conjugate gradient (CG) method with a non-negativity constraint by [5], where the convergence of CG with the modified Polak-Ribière-Polyak (MPRP) conjugate direction was shown.
Received by the editors 2018; Revised 2018; Accepted in revised form 2018.
2010 Mathematics Subject Classification. 47N10, 34A45.
Key words and phrases. convex programming, boosting, machine learning, convergence analysis.
†Corresponding author: [email protected].
The optimization that arises in boosting has a non-negativity constraint and an affine constraint. Our novel algorithm extends the CG method with non-negativity to also enforce the affine constraint.
Adding the affine constraint is as substantial a modification as adding the non-negativity constraint.
We present the mathematical setting of boosting in Section 2, introduce the novel CG method and prove its convergence in Section 3, and report its applications to benchmark problems of boosting in Section 4.
2. MATHEMATICAL FORMULATION OF BOOSTING
In boosting, one is given a set of training examples $\{x_1,\dots,x_M\}$ with binary labels $\{y_1,\dots,y_M\} \subset \{\pm 1\}$, and weak classifiers $\{h_1, h_2,\dots,h_N\}$. Each weak classifier $h_j$ assigns a label to each example, and hence is a function $h_j : \{x_1,\dots,x_M\} \to \{\pm 1\}$.
A strong classifier $F$ is a weighted sum of the weak classifiers, so that $F(x) := \sum_{j=1}^{N} w_j h_j(x)$ for some $w \in \mathbb{R}^N$ with $w \ge 0$.
For each example $x_i$, the label $+1$ is assigned when $F(x_i) > 0$, and $-1$ otherwise. Hence the strong classifier is successful on $x_i$ if the sign of $F(x_i)$ matches the given label $y_i$, that is, $\operatorname{sign}(F(x_i))\cdot y_i = +1$, and unsuccessful on $x_i$ if $\operatorname{sign}(F(x_i))\cdot y_i = -1$.
The hard margin, which is a measure of the fidelity of the strong classifier, is thus given as
\[ \text{(Hard margin)}: \quad -\sum_{i=1}^{M} \operatorname{sign}(F(x_i))\cdot y_i. \]
When the margin is smaller, more of the $\operatorname{sign}(F(x_i))\cdot y_i$ are $+1$, and $F$ can be said to be more reliable. Due to the discontinuity present in the hard margin, the soft margin of Adaboost takes the form, via the monotonicity of the logarithm and the exponential,
\[ \text{(Soft margin)}: \quad \log\sum_{i=1}^{M} e^{-F(x_i)\cdot y_i}. \]
The log-sum-exponential composition is referred to as $\operatorname{lse}$. Let us denote by $A \in \{\pm 1\}^{M\times N}$ the matrix whose entries are $a_{ij} = h_j(x_i)\cdot y_i$. Then the soft margin can be written simply as $\operatorname{lse}(-Aw)$, where $w = [w_1,\dots,w_N]^T$.
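For concreteness, the soft margin $\operatorname{lse}(-Aw)$ and its gradient can be evaluated in a numerically stable way. The sketch below is an illustration only, not part of the proposed algorithm; the matrix `A` and the weight `w` are hypothetical placeholders, and `scipy.special.logsumexp`/`softmax` are used for stability.

```python
import numpy as np
from scipy.special import logsumexp, softmax

def soft_margin(A, w):
    """Soft margin lse(-A w) = log(sum_i exp(-(A w)_i))."""
    return logsumexp(-A @ w)

def soft_margin_grad(A, w):
    """Gradient of lse(-A w) with respect to w, namely -A^T softmax(-A w)."""
    return -A.T @ softmax(-A @ w)

# Hypothetical toy data: M = 4 examples, N = 3 weak classifiers.
A = np.array([[ 1, -1,  1],
              [ 1,  1, -1],
              [-1,  1,  1],
              [ 1,  1,  1]])
w = np.array([0.1, 0.2, 0.2])   # a hypothetical nonnegative weight vector
print(soft_margin(A, w), soft_margin_grad(A, w))
```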
The main goal of this work is to find a weight that minimizes the soft margin, that is, to solve the following optimization problem:
\[ \text{minimize } \operatorname{lse}(-Aw) \quad \text{subject to } w \ge 0 \text{ and } w\cdot\mathbf{1} = \tfrac{1}{T}. \tag{1} \]
Here $A \in \{\pm 1\}^{M\times N}$ is the matrix given by the training data and the weak classifiers, and $T$ is a parameter that controls the support size of $w$. We finish this section with a lemma showing that the optimization is a convex program; in the next section we introduce a novel algorithm to solve it.
Lemma 1. $\operatorname{lse}(-Aw)$ is a convex function with respect to $w$.
Proof. Given any $w, \tilde w \in \mathbb{R}^N$ and any $\theta \in (0,1)$, let $z = -Aw$ and $\tilde z = -A\tilde w$. Then
\[
\begin{aligned}
(1-\theta)\operatorname{lse}(z) + \theta\operatorname{lse}(\tilde z)
&= \log\left[ \left(\sum_{i=1}^{M} e^{z_i}\right)^{1-\theta} \cdot \left(\sum_{i=1}^{M} e^{\tilde z_i}\right)^{\theta} \right] \\
&= \log\left[ \left(\sum_{i=1}^{M} \left(e^{z_i(1-\theta)}\right)^{\frac{1}{1-\theta}}\right)^{1-\theta} \cdot \left(\sum_{i=1}^{M} \left(e^{\tilde z_i\theta}\right)^{\frac{1}{\theta}}\right)^{\theta} \right] \\
&\ge \log\left[ \sum_{i=1}^{M} e^{z_i(1-\theta)} \cdot e^{\tilde z_i\theta} \right] \qquad \text{by H\"older's inequality} \\
&= \operatorname{lse}\left((1-\theta)z + \theta\tilde z\right) \\
&= \operatorname{lse}\left(-A\left((1-\theta)w + \theta\tilde w\right)\right). \qquad\square
\end{aligned}
\]
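As a quick numerical sanity check of Lemma 1 (an illustrative sketch with randomly generated data, not part of the paper), the convexity inequality $(1-\theta)\operatorname{lse}(z) + \theta\operatorname{lse}(\tilde z) \ge \operatorname{lse}((1-\theta)z + \theta\tilde z)$ can be verified directly:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
M, N = 50, 10
A = rng.choice([-1, 1], size=(M, N))       # hypothetical matrix of h_j(x_i) * y_i

for _ in range(1000):
    w, w_tilde = rng.random(N), rng.random(N)
    theta = rng.uniform(0.0, 1.0)
    z, z_tilde = -A @ w, -A @ w_tilde
    lhs = (1 - theta) * logsumexp(z) + theta * logsumexp(z_tilde)
    rhs = logsumexp((1 - theta) * z + theta * z_tilde)
    assert lhs >= rhs - 1e-12              # convexity inequality of Lemma 1
```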
3. CONJUGATE GRADIENT METHOD
In this section, we introduce a conjugate gradient method for solving the convex program (1),
\[ \min f(w) \quad \text{subject to } w \ge 0 \text{ and } w\cdot\mathbf{1} = \tfrac{1}{T}. \]
Throughout this section, $f(w)$ denotes the convex function $\operatorname{lse}(-Aw)$, and $g(w)$ denotes its gradient $\nabla f(w)$. Let $d$ be a direction at a position $w$ along which the next position is sought. When $w$ is located on the boundary of the constraint set
\[ \{\, w \in \mathbb{R}^N \mid w \ge 0 \text{ and } w\cdot\mathbf{1} = \tfrac{1}{T} \,\}, \]
$w$ cannot be moved in certain directions $d$ due to the constraints.
We say that $d$ is feasible at $w$ if $w + \alpha d$ stays in the constraint set for all sufficiently small $\alpha > 0$.
Definition 1. (Feasible direction) Given a direction $d \in \mathbb{R}^N$ at a position $w \in \mathbb{R}^N$ with $w \ge 0$ and $w\cdot\mathbf{1} = \tfrac{1}{T}$, the feasible direction $d^f = d^f(d, w)$ associated with $d$ is defined as the vector nearest to $d$ among the feasible directions at that position. Precisely, it is defined by the minimization
\[ d^f = \operatorname*{argmin}_{\,y_{I(w)} \ge 0 \text{ and } y\cdot\mathbf{1} = 0} \|d - y\|, \tag{2} \]
where $I(w) = \{\, i \mid w_i = 0 \,\}$. The domain of the minimization is convex, and the functional is strictly convex and coercive, so that $d^f$ is determined uniquely.
Define the index set $J(w) = \{\, j \mid w_j > 0 \,\}$.
Lemma 2. For all $w \ge 0$ with $w\cdot\mathbf{1} = \tfrac{1}{T}$ and for all $d$, let $d^f = d^f(d, w)$; then $w + \alpha d^f \ge 0$ and $(w + \alpha d^f)\cdot\mathbf{1} = \tfrac{1}{T}$ for all sufficiently small $\alpha \ge 0$.
FIGURE 1. For a given direction $d$ at a position $w$, the colored region is the feasible region for $d$. Since $d^f$ is the nearest vector to $d$ among the feasible directions at $w$, it is the orthogonal projection of $d$ onto the colored region (a). $d^f$ is decomposed into two orthogonal components, $d^f = d^t + d^w$, where $d^t$ is the orthogonal projection of $d^f$ onto the tangent space (b).
Proof. Clearly, for all $\alpha$, $(w + \alpha d^f)\cdot\mathbf{1} = w\cdot\mathbf{1} + 0 = \tfrac{1}{T}$. For all $\alpha \ge 0$: if $i \in I(w)$, then $w_i + \alpha d^f_i = 0 + \alpha d^f_i \ge 0$; if $j \in J(w)$, then $w_j + \alpha d^f_j \ge w_j - \alpha\left(|d^f_j| + 1\right)$. Thus $w + \alpha d^f \ge 0$ for any $\alpha \ge 0$ with $\alpha \le \min_{j\in J(w)} \frac{w_j}{|d^f_j| + 1}$. $\square$
Proposition 1. (Calculation of the feasible direction) For a given direction $d$ at a position $w$, $d^f$ is calculated as
\[ d^f_i = (d_i - r)_+, \; i \in I, \qquad d^f_j = d_j - r, \; j \in J, \]
where $r$ is a zero of $(d_J - r\cdot\mathbf{1}_J)\cdot\mathbf{1}_J + (d_{I_1} - r)_+ + \cdots + (d_{I_k} - r)_+$, with $k = |I|$.
Proof. Since $d^f$ is the KKT point of the minimization (2), there exist $\lambda_I$ and $r$ such that
\[ d^f - d = \begin{bmatrix}\lambda_I\\0\end{bmatrix} - r\cdot\mathbf{1}, \qquad d^f_I \ge 0, \quad \lambda_I\cdot d^f_I = 0, \quad d^f\cdot\mathbf{1} = 0. \]
From these conditions we get $d^f_J = d_J - r\cdot\mathbf{1}_J$ and $d_i - r = d^f_i - \lambda_i$ for $i \in I$. If $d_i - r > 0$, then $d^f_i > 0$ and $\lambda_i = 0$, so $d^f_i = d_i - r$. If $d_i - r \le 0$, then $d^f_i = 0$ and $\lambda_i \ge 0$.
Algorithm 1. Computing the feasible direction $d^f$.
Input: $w$, $d$.
Output: $d^f$.
Procedure:
1: Form the index sets $I(w) := \{\, i \mid w_i = 0 \,\}$ and $J(w) := \{\, j \mid w_j > 0 \,\}$.
2: Define the function $p(r) = \sum_{i\in I}(d_i - r)_+ + \sum_{j\in J}(d_j - r)$, which is piecewise linear and monotonically decreasing, and locate the interval between consecutive breakpoints $\{d_i \mid i \in I\}$ on which $p$ changes sign.
3: $r \leftarrow$ the zero of $p$ on that interval; $p$ is linear there, so the zero is explicit.
4: Compute $d^f$ as follows: $d^f_i = (d_i - r)_+$ for $i \in I$, and $d^f_j = d_j - r$ for $j \in J$.
Combining these two cases, we have $d^f_i = (d_i - r)_+$ for $i \in I$. Since $d^f\cdot\mathbf{1} = 0$,
\[ d^f\cdot\mathbf{1} = d^f_J\cdot\mathbf{1}_J + d^f_I\cdot\mathbf{1}_I = (d_J - r\cdot\mathbf{1}_J)\cdot\mathbf{1}_J + (d_{I_1} - r)_+ + (d_{I_2} - r)_+ + \cdots + (d_{I_k} - r)_+ = 0, \]
so $r$ is a root of this monotonically decreasing function of $r$. The function is piecewise linear, so the root can be obtained easily by probing the intervals between $\{d_{I_1},\dots,d_{I_k}\}$ where the function changes sign. After $r$ is obtained, $d^f$ is given as stated. $\square$
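A minimal NumPy sketch of this computation (an illustration of Proposition 1 and Algorithm 1, not the authors' code) is given below. It finds the root of the piecewise-linear, monotonically decreasing function by bisection rather than by probing the breakpoint intervals, which is an equivalent but simpler-to-code strategy.

```python
import numpy as np

def feasible_direction(d, w, tol=1e-12):
    """Feasible direction d^f of Definition 1 / Proposition 1.

    I = {i : w_i = 0} is the active set, J = {j : w_j > 0} the free set.
    The returned vector satisfies d^f_I >= 0 and d^f . 1 = 0.
    """
    d, w = np.asarray(d, float), np.asarray(w, float)
    I = w <= tol
    J = ~I

    # p(r) = sum_{j in J}(d_j - r) + sum_{i in I}(d_i - r)_+ is continuous,
    # piecewise linear and monotonically decreasing (J is nonempty since w.1 = 1/T > 0).
    def p(r):
        return np.sum(d[J] - r) + np.sum(np.maximum(d[I] - r, 0.0))

    lo, hi = d.min() - 1.0, d.max()        # bracket: p(lo) > 0 >= p(hi)
    for _ in range(200):                   # bisection to high precision
        mid = 0.5 * (lo + hi)
        if p(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    r = 0.5 * (lo + hi)

    df = np.empty_like(d)
    df[I] = np.maximum(d[I] - r, 0.0)      # d^f_i = (d_i - r)_+ ,  i in I
    df[J] = d[J] - r                       # d^f_j =  d_j - r    ,  j in J
    return df
```

For instance, `feasible_direction(-g, w)` gives the component $(-g)^f$ used throughout the rest of this section.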
Definition 2. (Tangent space) The domain for $w$ is the simplex $\{\, w \mid w \ge 0 \text{ and } w\cdot\mathbf{1} = \tfrac{1}{T} \,\}$. When $w > 0$, $w$ lies in the interior and the tangent space is $T_w = \mathbf{1}^\perp$. When $w_i = 0$ and $w_j > 0$ for all $j \ne i$, $w$ lies on the boundary, and the tangent space becomes smaller: $T_w = \{\mathbf{1}, e_i\}^\perp$. In general, we define the tangent space at $w$ as $T_w := \left[\{\mathbf{1}\} \cup \{\, e_i \mid w_i = 0 \,\}\right]^\perp \subset \mathbb{R}^N$.
Definition 3. (Orthogonal decomposition of a direction) Given a direction $d \in \mathbb{R}^N$ at a position $w \in \mathbb{R}^N$ with $w \ge 0$ and $\mathbf{1}\cdot w = \tfrac{1}{T}$, the direction is decomposed into three mutually orthogonal vectors, the tangential, wall, and non-feasible components:
\[ d = d^f + \left(d - d^f\right) = d^t + d^w + \left(d - d^f\right). \]
Here $d^f = d^f(d, w)$ is the feasible direction, $d^t$ is its orthogonal projection onto the tangent space $T_w$, and $d^w = d^f - d^t \in T_w^\perp$. Their mutual orthogonality is proved below.
Lemma 3. The vectors $d^t$, $d^w$, and $d - d^f$ above are orthogonal to each other. Furthermore, $d - d^f \in T_w^\perp$.
Proof. By the definition of the orthogonal projection, $d^t \perp d^w$. The KKT condition of the minimization (2) is
\[ d^f - d = \begin{bmatrix}\lambda_I\\0\end{bmatrix} + r\mathbf{1} \quad \text{for some } \lambda_I \ge 0 \text{ with } \lambda_I\cdot d^f_I = 0 \text{ and some } r, \]
where $I = I(w)$. Since $d^t \in T_w = \{\mathbf{1}, e_I\}^\perp$, we have $d^t\cdot\begin{bmatrix}\lambda_I\\0\end{bmatrix} = 0$ and $d^t\cdot\mathbf{1} = 0$, and thus $d^t \perp d - d^f$. From $d^f\cdot(d - d^f) = -\lambda_I\cdot d^f_I - r\,(d^f\cdot\mathbf{1}) = 0$, we have $d^f \perp d - d^f$, and hence $d^w = d^f - d^t \perp d - d^f$, which completes the proof of the mutual orthogonality. Since $T_w = \{\mathbf{1}, e_I\}^\perp$ and $d - d^f \in \operatorname{span}\{\mathbf{1}, e_I\}$, $d - d^f$ is orthogonal to the tangent space. $\square$
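As an illustrative check of Definition 3 and Lemma 3 (a sketch that reuses the `feasible_direction` helper from the previous code block), the projection onto $T_w$ can be realized by zeroing the active coordinates and removing the mean over the free coordinates, and the pairwise inner products of the three components then vanish up to round-off:

```python
import numpy as np

def tangent_project(v, w, tol=1e-12):
    """Orthogonal projection onto T_w = [{1} U {e_i : w_i = 0}]^perp:
    zero on the active coordinates, mean removed on the free coordinates."""
    v = np.asarray(v, float)
    J = np.asarray(w, float) > tol
    vt = np.zeros_like(v)
    vt[J] = v[J] - v[J].mean()
    return vt

# Hypothetical boundary point with w . 1 = 1/2 (T = 2) and a random direction.
rng = np.random.default_rng(1)
w = np.array([0.0, 0.2, 0.0, 0.3])
d = rng.standard_normal(4)
df = feasible_direction(d, w)      # assumes the helper sketched after Proposition 1
dt = tangent_project(df, w)
dw = df - dt
for a, b in [(dt, dw), (dt, d - df), (dw, d - df)]:
    assert abs(a @ b) < 1e-10      # mutual orthogonality of Lemma 3
```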
Definition 4. (MPRP direction) Let $w$ be a point with $w \ge 0$ and $w\cdot\mathbf{1} = \tfrac{1}{T}$, and let $g = \nabla f(w)$. Putting a tilde on the variables of the previous step, let $\tilde g$ be the gradient and $\tilde d$ the search direction of the previous step. Then the modified Polak-Ribière-Polyak direction $d^{MPRP} = d^{MPRP}(w, \tilde g, \tilde d)$ is defined as
\[ d^{MPRP} = (-g)^f - \frac{(-g)^t\cdot(g-\tilde g)^t}{\tilde g\cdot\tilde g}\,\tilde d^t + \frac{(-g)^t\cdot\tilde d^t}{\tilde g\cdot\tilde g}\,(g-\tilde g)^t. \]
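The definition translates directly into code. The sketch below assumes the `feasible_direction` and `tangent_project` helpers from the earlier sketches, and takes all superscripts $f$ and $t$ at the current point $w$ (an assumption of this illustration, not a statement from the paper).

```python
import numpy as np

def mprp_direction(w, g, g_prev, d_prev):
    """Modified Polak-Ribiere-Polyak direction of Definition 4.

    w      : current point (w >= 0, w . 1 = 1/T)
    g      : gradient at w
    g_prev : gradient at the previous step (tilde g)
    d_prev : search direction of the previous step (tilde d)
    """
    neg_g_f  = feasible_direction(-g, w)       # (-g)^f
    neg_g_t  = tangent_project(-g, w)          # (-g)^t
    y_t      = tangent_project(g - g_prev, w)  # (g - tilde g)^t
    d_prev_t = tangent_project(d_prev, w)      # tilde d^t
    denom = g_prev @ g_prev                    # tilde g . tilde g
    return (neg_g_f
            - (neg_g_t @ y_t) / denom * d_prev_t
            + (neg_g_t @ d_prev_t) / denom * y_t)
```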
Theorem 1. (KKT condition) For all $w \ge 0$ with $w\cdot\mathbf{1} = \tfrac{1}{T}$, all $\tilde g$, and all $\tilde d$, let $g = \nabla f(w)$ and $d = d^{MPRP}(w, \tilde g, \tilde d)$; then $(-g)^f\cdot d \ge 0$. Moreover, $(-g)^f\cdot d = 0$ if and only if $w$ is a KKT point of the minimization problem (1).
Proof.
\[
\begin{aligned}
(-g)^f\cdot d &= (-g)^f\cdot\left[ (-g)^f - \frac{(-g)^t\cdot(g-\tilde g)^t}{\tilde g\cdot\tilde g}\,\tilde d^t + \frac{(-g)^t\cdot\tilde d^t}{\tilde g\cdot\tilde g}\,(g-\tilde g)^t \right] \\
&= \|(-g)^f\|^2 + \frac{1}{\tilde g\cdot\tilde g}\left[ -\left[(-g)^t\cdot(g-\tilde g)^t\right]\left[(-g)^f\cdot\tilde d^t\right] + \left[(-g)^f\cdot(g-\tilde g)^t\right]\left[(-g)^t\cdot\tilde d^t\right] \right].
\end{aligned}
\]
Since $(-g)^w \perp T_w$, we have $(-g)^w\cdot(g-\tilde g)^t = 0$ and hence $(-g)^f\cdot(g-\tilde g)^t = (-g)^t\cdot(g-\tilde g)^t$. Similarly, $(-g)^f\cdot\tilde d^t = (-g)^t\cdot\tilde d^t$, so the bracket vanishes and $(-g)^f\cdot d = \|(-g)^f\|^2 \ge 0$.
The KKT condition for (1) is that
\[ g = \lambda + r\cdot\mathbf{1} \quad \text{for some } \lambda \ge 0 \text{ with } \lambda\cdot w = 0 \text{ and some } r \text{ with } r\left(w\cdot\mathbf{1} - \tfrac{1}{T}\right) = 0. \]
Since $w_J > 0$ and $\lambda \ge 0$, we have $\lambda_J = 0$. Since $w\cdot\mathbf{1} = \tfrac{1}{T}$ and $w_I = 0$, the conditions $r\left(w\cdot\mathbf{1} - \tfrac{1}{T}\right) = 0$ and $\lambda\cdot w = \lambda_I\cdot w_I + \lambda_J\cdot w_J = 0$ hold automatically and can be dropped.
Algorithm 2. Nonlinear conjugate gradient method for (1).
Input: constants $\rho \in (0,1)$, $\delta > 0$, $\epsilon > 0$, and an initial point $w^0 \ge 0$ with $w^0\cdot\mathbf{1} = \tfrac{1}{T}$. Set $k = 0$ and $g = \nabla f(w^0)$, where $f = \operatorname{lse}(-Aw)$.
Output: $w$.
Procedure:
1: Compute the search direction $d = d^{MPRP}$ of Definition 4, using Algorithm 1 for the feasible components. If $\left|(-g)^f\cdot d\right| \le \epsilon$, stop; otherwise go to the next step.
2: Determine $\alpha = \max\left\{ \frac{-d\cdot\nabla f(w)}{d\cdot(\nabla^2 f(w)\, d)}\,\rho^j \,:\, j = 0,1,2,\dots \right\}$ satisfying $w + \alpha d \ge 0$ and $f(w + \alpha d) \le f(w) - \delta\alpha^2\|d\|^2$.
3: $w \leftarrow w + \alpha d$.
4: $k \leftarrow k + 1$, update $g = \nabla f(w)$, and go to Step 1.
Therefore, the KKT condition is simplified as
\[ g = \begin{bmatrix}\lambda_I\\0\end{bmatrix} + r\cdot\mathbf{1} \quad \text{for some } \lambda_I \ge 0 \text{ and some } r. \]
On the other hand, $(-g)^f\cdot d = \|(-g)^f\|^2 = 0$ if and only if
\[ 0 = (-g)^f = \operatorname*{argmin}_{\,y_I\ge 0 \text{ and } y\cdot\mathbf{1}=0} \|(-g) - y\|, \]
whose KKT condition is that
\[ g = \begin{bmatrix}\lambda_I\\0\end{bmatrix} + r\cdot\mathbf{1} \quad \text{for some } \lambda_I \ge 0 \text{ and some } r. \]
Each of the two minimization problems has a unique minimum point, and accordingly a unique KKT condition. Since their KKT conditions coincide, we have
\[ (-g)^f\cdot d = 0 \iff w \text{ is the KKT point of the minimization problem (1)}. \qquad\square \]
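Putting the pieces together, a minimal Python sketch of Algorithm 2 is shown below. It assumes the `feasible_direction`, `tangent_project`, and `mprp_direction` helpers from the earlier sketches are in scope; the parameter values are illustrative choices, not the authors'. The initial trial step length follows the Newton-like quotient of Step 2, computed with the exact Hessian-vector product of $\operatorname{lse}(-Aw)$, namely $d\cdot\nabla^2 f(w)\,d = u^T(\operatorname{diag}(p)-pp^T)u$ with $u = Ad$ and $p = \operatorname{softmax}(-Aw)$.

```python
import numpy as np
from scipy.special import logsumexp, softmax

def cg_boost(A, T=2.0, rho=0.5, delta=1e-4, eps=1e-8, max_iter=500):
    """Sketch of Algorithm 2: minimize lse(-A w) over {w >= 0, w . 1 = 1/T}."""
    _, N = A.shape
    f = lambda w: logsumexp(-A @ w)
    grad = lambda w: -A.T @ softmax(-A @ w)

    w = np.full(N, 1.0 / (T * N))                # feasible starting point
    g = grad(w)
    g_prev, d_prev = g.copy(), feasible_direction(-g, w)

    for _ in range(max_iter):
        d = mprp_direction(w, g, g_prev, d_prev)
        if abs(feasible_direction(-g, w) @ d) <= eps:   # stopping test of Step 1
            break

        # Step 2: Newton-like initial step, backtracked by rho until feasible
        # and until the sufficient-decrease condition of the line search holds.
        p = softmax(-A @ w)
        u = A @ d
        curvature = u @ (p * u) - (p @ u) ** 2   # d . (Hessian of f at w) . d
        alpha = -(d @ g) / max(curvature, 1e-16)
        fw = f(w)
        for _ in range(60):                      # success guaranteed by Lemma 5
            w_new = w + alpha * d
            if (w_new >= -1e-15).all() and f(w_new) <= fw - delta * alpha**2 * (d @ d):
                break
            alpha *= rho

        w = w_new                                # Step 3
        g_prev, d_prev = g, d                    # Step 4
        g = grad(w)
    return w
```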
Next, we introduce some properties of $f(w)$ and Algorithm 2 that are needed to prove the global convergence of Algorithm 2.

Properties. Let $V = \{\, w \in \mathbb{R}^N \mid w \ge 0 \text{ and } w\cdot\mathbf{1} = \tfrac{1}{T} \,\}$.
(1) Since the feasible set $V$ is bounded, the level set $\{\, w \in \mathbb{R}^N \mid f(w) \le f(w^0) \,\}$ is bounded. Thus $f$ is bounded from below.
(2) The sequence $\{w_k\}$ generated by Algorithm 2 is a sequence of feasible points, and the function value sequence $\{f(w_k)\}$ is decreasing. In addition, since $f(w)$ is bounded below, the sufficient-decrease condition of the line search gives
\[ \sum_{k=0}^{\infty} \alpha_k^2\|d_k\|^2 < \infty, \]
and thus
\[ \lim_{k\to\infty} \alpha_k\|d_k\| = 0. \]
(3) $f$ is continuously differentiable and its gradient is Lipschitz continuous: there exists a constant $L > 0$ such that
\[ \|\nabla f(w) - \nabla f(y)\| \le L\,\|w - y\|, \quad \forall\, w, y \in V. \]
This implies that there exists a constant $\gamma_1$ such that
\[ \|\nabla f(w)\| \le \gamma_1, \quad \forall\, w \in V. \]
Lemma 4. If there exists a constant $\epsilon > 0$ such that $\|g(w_k)\| \ge \epsilon$ for all $k$, then there exists a constant $M > 0$ such that $\|d_k\| \le M$ for all $k$.

Proof.
\[ \|d_k^{MPRP}\| \le \|(-g)^f\| + \frac{2\,\|(-g)^t\|\,\|(g-\tilde g)^t\|\,\|\tilde d^t_k\|}{\|\tilde g\|^2} \le \gamma_1 + \frac{2\gamma_1 L\,\alpha_{k-1}\|\tilde d^t_k\|}{\epsilon^2}\,\|d_{k-1}\|. \]
Since $\lim_{k\to\infty}\alpha_k\|d_k\| = 0$, there exist a constant $\gamma \in (0,1)$ and $k_0 \in \mathbb{Z}$ such that $\frac{2L\gamma_1}{\epsilon^2}\,\alpha_{k-1}\|\tilde d^t_k\| \le \gamma$ for all $k \ge k_0$. Hence, for any $k \ge k_0$,
\[ \|d_k^{MPRP}\| \le 2\gamma_1 + \gamma\,\|d_{k-1}\| \le 2\gamma_1\left(1 + \gamma + \cdots + \gamma^{k-k_0-1}\right) + \gamma^{k-k_0}\|d_{k_0}\| \le \frac{2\gamma_1}{1-\gamma} + \|d_{k_0}\|. \]
Let $M = \max\left\{ \|d_1\|, \|d_2\|, \dots, \|d_{k_0}\|, \frac{2\gamma_1}{1-\gamma} + \|d_{k_0}\| \right\}$. Then $\|d_k^{MPRP}\| \le M$ for all $k$. $\square$
Lemma 5. (Success of the line search) In Algorithm 2, the line search step is guaranteed to succeed for each $k$. Precisely speaking,
\[ f(w_k + \alpha_k d_k) \le f(w_k) - \delta\alpha_k^2\|d_k\|^2 \]
for all sufficiently small $\alpha_k$.

Proof. By the Mean Value Theorem,
\[ f(w_k + \alpha_k d_k) - f(w_k) = \alpha_k\, g(w_k + \alpha_k\theta_k d_k)\cdot d_k \]
for some $\theta_k \in (0,1)$. The line search is performed only if $(-g(w_k))^f\cdot d_k > \epsilon$. In Lemma 3, we showed that $(-g(w_k)) - (-g(w_k))^f \perp T_w$ and $(-g(w_k)) - (-g(w_k))^f \perp (-g(w_k))^f$. Since $d_k \in (-g(w_k))^f + T_w$, we have $\left[(-g(w_k)) - (-g(w_k))^f\right]\cdot d_k = 0$, and hence
\[ -g(w_k)\cdot d_k = (-g(w_k))^f\cdot d_k > \epsilon. \]
From the continuity of $g(w)$,
\[ -g(w_k + \alpha_k\theta_k d_k)\cdot d_k > \frac{\epsilon}{2} \]
for sufficiently small $\alpha_k$. Choosing $\alpha_k \in \left(0, \frac{\epsilon}{2\delta\|d_k\|^2}\right)$, we get
\[ f(w_k + \alpha_k d_k) = f(w_k) + \alpha_k\, g(w_k + \alpha_k\theta_k d_k)\cdot d_k < f(w_k) - \frac{\epsilon}{2}\,\alpha_k \le f(w_k) - \delta\alpha_k^2\|d_k\|^2. \qquad\square \]
Theorem 2. Let $\{w_k\}$ and $\{d_k\}$ be the sequences generated by Algorithm 2. Then
\[ \liminf_{k\to\infty}\, (-g_k)^f\cdot d_k = 0. \]
Thus the minimum point $w^*$ of our main problem (1) is a limit point of the set $\{w_k\}$, and Algorithm 2 is convergent.
Proof. We first note that $(-g_k)^f\cdot d_k = -g_k\cdot d_k$, as shown in the proof of Lemma 5. We prove the theorem by contradiction. Assume that the theorem is not true; then there exists an $\epsilon > 0$ such that
\[ \|(-g_k)^f\|^2 = (-g_k)^f\cdot d_k > \epsilon \quad \text{for all } k. \]
By Lemma 4, there exists a constant $M$ such that $\|d_k\| \le M$ for all $k$.

If $\liminf_{k\to\infty}\alpha_k > 0$, then $\lim_{k\to\infty}\|d_k\| = 0$. Since $\|g_k\| \le \gamma_1$, it follows that $\lim_{k\to\infty}(-g_k)^f\cdot d_k = 0$, which contradicts the assumption.

If $\liminf_{k\to\infty}\alpha_k = 0$, then there is an infinite index set $K$ such that
\[ \lim_{k\in K,\, k\to\infty}\alpha_k = 0. \]
It follows from Step 2 of Algorithm 2 that, when $k \in K$ is sufficiently large, $\rho^{-1}\alpha_k$ does not satisfy the sufficient-decrease condition $f(w_k + \alpha d_k) \le f(w_k) - \delta\alpha^2\|d_k\|^2$, that is,
\[ f\left(w_k + \rho^{-1}\alpha_k d_k\right) - f(w_k) > -\delta\rho^{-2}\alpha_k^2\|d_k\|^2. \tag{3} \]
By the Mean Value Theorem and the Lipschitz continuity of $g$, there is $h_k \in (0,1)$ such that
\[
\begin{aligned}
f(w_k) - f\left(w_k + \rho^{-1}\alpha_k d_k\right)
&= -\rho^{-1}\alpha_k\, g\left(w_k + h_k\rho^{-1}\alpha_k d_k\right)\cdot d_k \\
&= -\rho^{-1}\alpha_k\, g(w_k)\cdot d_k - \rho^{-1}\alpha_k\left[ g\left(w_k + h_k\rho^{-1}\alpha_k d_k\right) - g(w_k) \right]\cdot d_k \\
&\ge -\rho^{-1}\alpha_k\, g(w_k)\cdot d_k - L\rho^{-2}\alpha_k^2\|d_k\|^2.
\end{aligned}
\]
Substituting this into (3) and applying $-g(w_k)\cdot d_k = (-g)^f(w_k)\cdot d_k$, we get, for all $k \in K$ sufficiently large,
\[ 0 \le (-g)^f(w_k)\cdot d_k \le \rho^{-1}(L+\delta)\,\alpha_k\|d_k\|^2. \]
Taking the limit on both sides, combining $\|d_k\| \le M$ with $\lim_{k\in K,\, k\to\infty}\alpha_k = 0$, we obtain $\lim_{k\in K,\, k\to\infty}\left|(-g)^f(w_k)\cdot d_k\right| = 0$. This also yields a contradiction. $\square$
Remark 1. To assert the existence of a $k$ satisfying (3), we should verify that $w_k + \rho^{-1}\alpha_k d_k$ is feasible. Since $d_k\cdot\mathbf{1} = 0$, we have $(w_k + \rho^{-1}\alpha_k d_k)\cdot\mathbf{1} = w_k\cdot\mathbf{1} = \tfrac{1}{T}$. So we only need to check $w_k + \rho^{-1}\alpha_k d_k \ge 0$. Since $\lim_{k\in K,\, k\to\infty}\alpha_k = 0$, $\alpha_k$ is close to zero for sufficiently large $k$. Thus $w_k + \rho^{-1}\alpha_k d_k \ge 0$ except in very special cases.
4. NUMERICAL RESULTS
In this section, we test our proposed CG algorithm on two boosting examples of non-negligible size. Through the tests, we check whether the numerical results match the analyses presented in Section 3.
Our algorithm is supposed to generate a sequence $\{w_k\}$ along which the soft margin monotonically decreases, which is the first check point. According to Theorem 2, the stopping criterion $(-g_k)^f\cdot d_k < \epsilon$ should be satisfied after a finite number of iterations for any given threshold $\epsilon > 0$, which is the second check point. According to Theorem 1, the solution $w_k$ satisfying the stopping criterion is a KKT point, which is the third check point. The KKT point is the global minimizer of the soft margin, that is, the optimal strong classifier, which is the final check point.
4.1. Low-dimensional example. We solve a boosting problem that minimizes $\operatorname{lse}(-Aw)$ with $w \ge 0$ and $w\cdot\mathbf{1} = \tfrac{1}{2}$, where $A$ is the $4\times 3$ matrix
\[ A = \begin{bmatrix} -1 & 1 & 1 \\ -1 & 1 & 1 \\ -1 & 1 & -1 \\ 1 & -1 & -1 \end{bmatrix}. \]
As shown in Figure 2, the soft margin $\operatorname{lse}(-Aw)$ monotonically decreases and the stopping criterion $(-g_k)^f\cdot d_k$ drops to a very small value within a finite number of iterations, which is consistent with the statement of Theorem 2, $\liminf_{k\to\infty}(-g_k)^f\cdot d_k = 0$.
FIGURE 2. The convergence of the CG method for Example 4.1.
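For reference, the same small problem can be cross-checked with an off-the-shelf constrained solver. The sketch below is an independent verification under the stated setup (it uses `scipy.optimize.minimize` with SLSQP, not the proposed CG method):

```python
import numpy as np
from scipy.special import logsumexp, softmax
from scipy.optimize import minimize

# The 4 x 3 matrix of Example 4.1 and the constraint w . 1 = 1/2 (i.e. T = 2).
A = np.array([[-1,  1,  1],
              [-1,  1,  1],
              [-1,  1, -1],
              [ 1, -1, -1]], dtype=float)
T = 2.0

res = minimize(
    fun=lambda w: logsumexp(-A @ w),
    jac=lambda w: -A.T @ softmax(-A @ w),
    x0=np.full(3, 1.0 / (T * 3)),
    method="SLSQP",
    bounds=[(0.0, None)] * 3,
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0 / T}],
)
print(res.x, res.fun)   # minimizer and minimal soft margin, for comparison
```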
TABLE 1. Statistics from the basketball league. For each of the home team ($+1$) and the road team ($-1$): Field Goals Made, Field Goals Attempted, Field Goals Percentage, 3 Point Goals, 3 Point Goals Attempted, 3 Point Goals Percentage, Free Throws Made, Free Throws Attempted, Free Throws Percentage, Offensive Rebounds, Defensive Rebounds, Total Rebounds, Assists, Personal Fouls, Steals, Turnovers.
4.2. Classifying win/loss of sports games. One of the primary applications of boosting is to classify win/loss of sports games [6]. As an example, we take a vast amount of statistics from the basketball league of a certain country (for a patent issue, we do not disclose the details). The statistics of each game are represented by the 36 numbers listed in Table 1. In a whole year, there were 538 games with win/loss results, from which we take training data $\{x_1,\dots,x_{M=269}\}$ with the win/loss of the home team $\{y_1,\dots,y_M\} \subset \{\pm 1\}$. Each $x_i$ represents the statistics of a game, so that $x_i \in \mathbb{R}^{36}$ and the training data form a matrix in $\mathbb{R}^{269\times 36}$.
Similarly to the previous example, Figure 3 shows that the soft margin monotonically decreases and the stopping criterion drops to a very small value within a finite number of iterations, matching the analyses in Section 3.
5. CONCLUSION
We proposed a new conjugate gradient method for solving convex programs with non-negativity constraints and a linear constraint, and successfully applied the method to boosting problems. We also presented a convergence analysis for the method. Our analysis shows that the method converges within a finite number of iterations for any small stopping threshold. The solution satisfying the stopping criterion is shown to be the KKT point of the convex program, and hence its global minimizer. We solved two benchmark boosting problems by the CG method and obtained numerical results that fully agree with the analysis. Our algorithm, with its guaranteed convergence, can be successful in other boosting problems as well as in other convex programs.
REFERENCES
[1] Rich Caruana and Alexandru Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning, pages 161–168. ACM, 2006.
[2] Ayhan Demiriz, Kristin P. Bennett, and John Shawe-Taylor. Linear programming boosting via column generation. Machine Learning, 46(1):225–254, 2002.
[3] Yoav Freund and Robert E. Schapire. A desicion-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23–37. Springer, 1995.
[4] Adam J. Grove and Dale Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In AAAI/IAAI, pages 692–699, 1998.
FIGURE 3. The convergence of the CG method for Example 4.2.
[5] Can Li. A conjugate gradient type method for the nonnegative constraints optimization problems. Journal of Applied Mathematics, 2013, 2013.
[6] B. Loeffelholz, B. Earl, and B. W. Kenneth. Predicting NBA games using neural networks. Journal of Quantitative Analysis in Sports, pages 1–15, 2009.
[7] N. Duffy and D. Helmbold. A geometric approach to leveraging weak learners. In Computational Learning Theory, Lecture Notes in Comput. Sci., pages 18–33. Springer, 1999.
[8] R. E. Schapire. The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification, Lecture Notes in Statist., volume 171, pages 149–171. Springer, 2003.
[9] R. Meir and G. Ratsch. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning, volume 2600, pages 119–183. Springer, 2003.
[10] Cynthia Rudin, Robert E. Schapire, Ingrid Daubechies, et al. Analysis of boosting algorithms using the smooth margin function. The Annals of Statistics, 35(6):2723–2768, 2007.
[11] Chunhua Shen and Hanxi Li. On the dual formulation of boosting algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(12):2216–2231, 2010.