Chapter III: Second-Order Algorithms for Time-Varying Optimization

3.2 Approximate Newton Method: A Special Case

We first consider the special case where there are no explicit equality and inequality constraints in (3.1), i.e.,

$$\min_{x\in\mathbb{R}^n} \; c_\tau(x) + h_\tau(x) \tag{3.2}$$

for each $\tau\in\mathcal{T}$. By Lemma 2.8, its optimality condition is given by

$$-\nabla c_\tau(x_\tau) \in \partial h_\tau(x_\tau), \tag{3.3}$$

where $x_\tau$ is any local optimal solution to (3.2). Just as in Chapter 2, in the following discussion we use $x_\tau$ to denote an arbitrarily chosen local optimal solution in case (3.2) has multiple local optimal solutions, and we will mainly focus on this particular sequence $(x_\tau)_{\tau\in\mathcal{T}}$.

The approximate Newton method for (3.2) is given by the iteration

$$\hat{x}_\tau = \arg\min_{x\in\mathbb{R}^n} \; \nabla c_\tau(\hat{x}_{\tau-1})^T (x-\hat{x}_{\tau-1}) + \frac{1}{2}(x-\hat{x}_{\tau-1})^T B_\tau (x-\hat{x}_{\tau-1}) + h_\tau(x) \tag{3.4}$$

or equivalently in the form of a generalized equation

$$-\nabla c_\tau(\hat{x}_{\tau-1}) - B_\tau(\hat{x}_\tau-\hat{x}_{\tau-1}) \in \partial h_\tau(\hat{x}_\tau) \tag{3.5}$$

for each $\tau\in\mathcal{T}$, where $B_\tau\in\mathbb{R}^{n\times n}$ is a positive definite matrix. The rationale behind (3.4) is that we approximate the smooth part $c_\tau$ of the objective function by a convex quadratic function for each $\tau$; when $h_\tau$ is simple or has special structure, we expect that the resulting convex program can be solved much more efficiently than finding the batch solution.

We do not put restrictions on the method of producing the matrix $B_\tau$ in (3.4) apart from the positive definiteness of $B_\tau$. Therefore (3.4) gives a class of algorithms. For example, we can set $B_\tau$ to be a scalar matrix, and in this case (3.4) is a special case of the first-order proximal gradient method proposed in Chapter 2. We can also set $B_\tau = \nabla^2 c_\tau(\hat{x}_{\tau-1})$, or use any quasi-Newton method to calculate $B_\tau$, and in this case we obtain a generalization of the (quasi-)Newton method to the time-varying situation, which can be seen more clearly by noting that (3.4) is equivalent to

$$\hat{x}_\tau = \hat{x}_{\tau-1} - B_\tau^{-1}\nabla c_\tau(\hat{x}_{\tau-1})$$

when $h_\tau = 0$.

Since we use a quadratic approximation of cτ in deriving (3.4), we can tell from intuition that the quality of this quadratic approximation plays a significant role in the tracking performance. We now proceed to capture this intuition mathematically.
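As a minimal sketch of the iteration (3.4) in the case $h_\tau = 0$, the following Python snippet applies the update $\hat{x}_\tau = \hat{x}_{\tau-1} - B_\tau^{-1}\nabla c_\tau(\hat{x}_{\tau-1})$ to a toy problem. The quadratic cost, the drifting target `target(tau)`, and all function names are illustrative assumptions and do not come from the text.

```python
import numpy as np

def approx_newton_step(x_prev, grad, B):
    """One step of (3.4) with h_tau = 0: x_prev - B^{-1} grad(x_prev)."""
    return x_prev - np.linalg.solve(B, grad(x_prev))

# Hypothetical cost c_tau(x) = 0.5 (x - a_tau)^T A (x - a_tau); its minimizer is a_tau.
A = np.array([[4.0, 1.0], [1.0, 3.0]])

def target(tau):
    """Hypothetical drifting optimum a_tau."""
    return np.array([np.cos(0.1 * tau), np.sin(0.1 * tau)])

def run(B, steps=50):
    """Return the Euclidean tracking errors for tau = 1, ..., steps."""
    x_hat = np.zeros(2)
    errors = []
    for tau in range(1, steps + 1):
        a = target(tau)
        x_hat = approx_newton_step(x_hat, lambda x: A @ (x - a), B)
        errors.append(np.linalg.norm(x_hat - a))
    return np.array(errors)

errors_newton = run(B=A)                 # B_tau equal to the Hessian of c_tau
errors_scalar = run(B=5.0 * np.eye(2))   # scalar matrix: a plain gradient step
```

For this quadratic toy cost the Hessian choice recovers the optimum in a single step, while the scalar choice leaves a residual error driven by the drift of the target. This is only meant to illustrate why the quality of $B_\tau$ matters.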

Tracking Performance

We define the tracking error by

$$\|\hat{x}_\tau - x_\tau\|_W$$

for each $\tau\in\mathcal{T}$, where we denote $\|x\|_W := \sqrt{x^T W x}$ for any $x\in\mathbb{R}^n$ and any positive definite $W\in\mathbb{R}^{n\times n}$. Tracking performance will then be evaluated quantitatively by the tracking error.

We then define

$$\sigma_W := \sup_{\tau\in\mathcal{T}\setminus\{0\}} \|x_\tau - x_{\tau-1}\|_W,$$

which upper bounds the distance between consecutive optimal solutions. It characterizes how fast the optimal solution $x_\tau$ drifts as time proceeds.
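The two quantities just defined can be evaluated numerically over a finite horizon. The sketch below is only illustrative: the function names are hypothetical, and the supremum in $\sigma_W$ is replaced by a maximum over the stored sequence.

```python
import numpy as np

def w_norm(x, W):
    """Weighted norm ||x||_W = sqrt(x^T W x) for a positive definite W."""
    return float(np.sqrt(x @ W @ x))

def tracking_errors(x_hat, x_opt, W):
    """Tracking error ||x_hat_tau - x_tau||_W for each stored time step."""
    return [w_norm(xh - xo, W) for xh, xo in zip(x_hat, x_opt)]

def drift_bound(x_opt, W):
    """Finite-horizon sigma_W: largest W-norm distance between consecutive optima."""
    return max(w_norm(x_opt[t] - x_opt[t - 1], W) for t in range(1, len(x_opt)))

# Hypothetical sequence of optimal solutions, one row per time step.
x_opt = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]])
sigma = drift_bound(x_opt, np.eye(2))
```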

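Lemma 3.1. Let $\tau\in\mathcal{T}$ be arbitrary. If $\hat{x}_\tau$ is generated by (3.4), then

$$\|\hat{x}_\tau - x_\tau\|_{B_\tau} \le \rho_\tau \|\hat{x}_{\tau-1} - x_\tau\|_{B_\tau}, \tag{3.6}$$

where

$$\rho_\tau := \frac{\left\| B_\tau^{-1}\left(\nabla c_\tau(\hat{x}_{\tau-1}) - \nabla c_\tau(x_\tau)\right) - (\hat{x}_{\tau-1} - x_\tau) \right\|_{B_\tau}}{\|\hat{x}_{\tau-1} - x_\tau\|_{B_\tau}}.$$

Proof. The generalized equation (3.5) implies that

$$h_\tau(x_\tau) - h_\tau(\hat{x}_\tau) \ge (x_\tau - \hat{x}_\tau)^T \left( -\nabla c_\tau(\hat{x}_{\tau-1}) - B_\tau(\hat{x}_\tau - \hat{x}_{\tau-1}) \right).$$

On the other hand, by the optimality condition (3.3), we have

$$h_\tau(\hat{x}_\tau) - h_\tau(x_\tau) \ge -(\hat{x}_\tau - x_\tau)^T \nabla c_\tau(x_\tau).$$

We then have

$$(\hat{x}_\tau - x_\tau)^T \left( \nabla c_\tau(\hat{x}_{\tau-1}) - \nabla c_\tau(x_\tau) + B_\tau(\hat{x}_\tau - \hat{x}_{\tau-1}) \right) \le 0,$$

which leads to

$$\begin{aligned}
\|\hat{x}_\tau - x_\tau\|_{B_\tau}^2 &= (\hat{x}_\tau - x_\tau)^T B_\tau (\hat{x}_\tau - x_\tau) \\
&\le (\hat{x}_\tau - x_\tau)^T B_\tau \left[ B_\tau^{-1}\left(\nabla c_\tau(x_\tau) - \nabla c_\tau(\hat{x}_{\tau-1})\right) - (x_\tau - \hat{x}_{\tau-1}) \right] \\
&\le \|\hat{x}_\tau - x_\tau\|_{B_\tau} \cdot \left\| B_\tau^{-1}\left(\nabla c_\tau(x_\tau) - \nabla c_\tau(\hat{x}_{\tau-1})\right) - (x_\tau - \hat{x}_{\tau-1}) \right\|_{B_\tau} \\
&= \|\hat{x}_\tau - x_\tau\|_{B_\tau} \cdot \rho_\tau \|\hat{x}_{\tau-1} - x_\tau\|_{B_\tau}.
\end{aligned}$$

The inequality (3.6) now follows directly.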
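Lemma 3.1 can be checked numerically in the simplest setting $h_\tau = 0$, where (3.4) reduces to $\hat{x}_\tau = \hat{x}_{\tau-1} - B_\tau^{-1}\nabla c_\tau(\hat{x}_{\tau-1})$ and (3.6) in fact holds with equality. The following sketch assumes a hypothetical quadratic cost with Hessian `A` and minimizer `a` (playing the role of $x_\tau$) and takes $B_\tau$ to be the diagonal of `A`; these choices are illustrative only.

```python
import numpy as np

def b_norm(v, B):
    """Weighted norm ||v||_B = sqrt(v^T B v)."""
    return float(np.sqrt(v @ B @ v))

def rho_tau(grad, x_prev, x_opt, B):
    """rho_tau from Lemma 3.1 for a gradient map, previous iterate, optimum and B."""
    e = x_prev - x_opt
    r = np.linalg.solve(B, grad(x_prev) - grad(x_opt)) - e
    return b_norm(r, B) / b_norm(e, B)

rng = np.random.default_rng(0)
n = 4
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)      # Hessian of the hypothetical quadratic cost
a = rng.standard_normal(n)       # its minimizer, playing the role of x_tau

def grad(x):
    return A @ (x - a)

B = np.diag(np.diag(A))          # a cheap positive definite stand-in for the Hessian
x_prev = rng.standard_normal(n)
x_new = x_prev - np.linalg.solve(B, grad(x_prev))   # (3.4) with h_tau = 0

lhs = b_norm(x_new - a, B)                               # left-hand side of (3.6)
rhs = rho_tau(grad, x_prev, a, B) * b_norm(x_prev - a, B)  # right-hand side of (3.6)
```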

Lemma 3.1 directly leads to the following theorem on the tracking error bound for (3.4).

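Theorem 3.1. Let $W\in\mathbb{R}^{n\times n}$ be a positive definite matrix, and

$$\lambda_M := \sup_{\tau\in\mathcal{T}} \inf\{\lambda\in\mathbb{R} : \lambda W \succeq B_\tau\}, \qquad \lambda_m := \inf_{\tau\in\mathcal{T}} \sup\{\lambda\in\mathbb{R} : \lambda W \preceq B_\tau\}.$$

If

$$\rho := \sup_{\tau\in\mathcal{T}} \frac{\left\| B_\tau^{-1}\left(\nabla c_\tau(\hat{x}_{\tau-1}) - \nabla c_\tau(x_\tau)\right) - (\hat{x}_{\tau-1} - x_\tau) \right\|_{B_\tau}}{\|\hat{x}_{\tau-1} - x_\tau\|_{B_\tau}} < \sqrt{\frac{\lambda_m}{\lambda_M}}, \tag{3.7}$$

then

$$\|\hat{x}_\tau - x_\tau\|_W \le \frac{\rho\,\sigma_W}{\sqrt{\lambda_m/\lambda_M} - \rho} + \left(\rho\sqrt{\frac{\lambda_M}{\lambda_m}}\right)^{\tau} \left( \|\hat{x}_0 - x_1\|_W - \frac{\sigma_W\sqrt{\lambda_m/\lambda_M}}{\sqrt{\lambda_m/\lambda_M} - \rho} \right) \tag{3.8}$$

for any $\tau\in\mathcal{T}$.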

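Proof. The definition of $\rho$ implies that $\rho_\tau \le \rho$ for all $\tau\in\mathcal{T}$, and the definitions of $\lambda_m$ and $\lambda_M$ give $\lambda_m W \preceq B_\tau \preceq \lambda_M W$ for all $\tau$. Therefore, by Lemma 3.1, for $\tau = 1$ we have

$$\|\hat{x}_1 - x_1\|_W \le \sqrt{\lambda_m^{-1}}\,\|\hat{x}_1 - x_1\|_{B_1} \le \rho\sqrt{\lambda_m^{-1}}\,\|\hat{x}_0 - x_1\|_{B_1} \le \rho\sqrt{\frac{\lambda_M}{\lambda_m}}\,\|\hat{x}_0 - x_1\|_W,$$

while if (3.8) holds for some $\tau$, then

$$\begin{aligned}
\|\hat{x}_{\tau+1} - x_{\tau+1}\|_W
&\le \sqrt{\lambda_m^{-1}}\,\|\hat{x}_{\tau+1} - x_{\tau+1}\|_{B_{\tau+1}} \le \rho\sqrt{\lambda_m^{-1}}\,\|\hat{x}_\tau - x_{\tau+1}\|_{B_{\tau+1}} \\
&\le \rho\sqrt{\frac{\lambda_M}{\lambda_m}} \left( \|\hat{x}_\tau - x_\tau\|_W + \|x_\tau - x_{\tau+1}\|_W \right) \\
&\le \rho\sqrt{\frac{\lambda_M}{\lambda_m}} \left( \frac{\rho\,\sigma_W}{\sqrt{\lambda_m/\lambda_M} - \rho} + \left(\rho\sqrt{\frac{\lambda_M}{\lambda_m}}\right)^{\tau} \left( \|\hat{x}_0 - x_1\|_W - \frac{\sigma_W\sqrt{\lambda_m/\lambda_M}}{\sqrt{\lambda_m/\lambda_M} - \rho} \right) + \sigma_W \right) \\
&= \frac{\rho\,\sigma_W}{\sqrt{\lambda_m/\lambda_M} - \rho} + \left(\rho\sqrt{\frac{\lambda_M}{\lambda_m}}\right)^{\tau+1} \left( \|\hat{x}_0 - x_1\|_W - \frac{\sigma_W\sqrt{\lambda_m/\lambda_M}}{\sqrt{\lambda_m/\lambda_M} - \rho} \right).
\end{aligned}$$

The bound (3.8) then follows from mathematical induction.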
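To get a feel for the bound (3.8), the following sketch evaluates its right-hand side for given constants. The function name and the argument `e0` (standing for $\|\hat{x}_0 - x_1\|_W$) are hypothetical; the numbers in the usage lines are arbitrary and serve only as an illustration.

```python
import math

def tracking_bound(tau, rho, sigma_w, lam_m, lam_M, e0):
    """Right-hand side of (3.8); e0 stands for ||x_hat_0 - x_1||_W."""
    r = math.sqrt(lam_m / lam_M)
    if not rho < r:
        raise ValueError("condition (3.7) requires rho < sqrt(lam_m / lam_M)")
    steady = rho * sigma_w / (r - rho)                          # limit as tau grows
    transient = (rho / r) ** tau * (e0 - sigma_w * r / (r - rho))
    return steady + transient

# Arbitrary illustrative constants.
bound_first = tracking_bound(1, rho=0.3, sigma_w=0.1, lam_m=1.0, lam_M=4.0, e0=1.0)
bound_late = tracking_bound(200, rho=0.3, sigma_w=0.1, lam_m=1.0, lam_M=4.0, e0=1.0)
```

Since (3.7) makes $\rho\sqrt{\lambda_M/\lambda_m} < 1$, the second term decays geometrically and the bound approaches $\rho\,\sigma_W / (\sqrt{\lambda_m/\lambda_M} - \rho)$, which vanishes when either the drift $\sigma_W$ or the approximation quality measure $\rho$ is zero.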

The condition (3.7) and the tracking error bound (3.8) suggest that $\rho$ should be sufficiently small so that good tracking performance can be achieved. However, in practice, it is difficult to evaluate the quantity $\rho$ unless we actually carry out the iterations (3.4), which makes Theorem 3.1 not very useful for estimating the tracking error a priori. Rather, Theorem 3.1 justifies our intuition that the quality of the quadratic approximation in (3.4), which is now quantitatively assessed by $\rho$, plays a central role in the tracking performance. Furthermore, in order to make $\rho$ as small as possible, we need

$$\nabla c_\tau(\hat{x}_{\tau-1}) - \nabla c_\tau(x_\tau) \approx B_\tau(\hat{x}_{\tau-1} - x_\tau)$$

for each $\tau$; in other words, the matrices $B_\tau$ in (3.4) should approximate the (averaged) Hessian of $c_\tau$ along the direction $\hat{x}_{\tau-1} - x_\tau$. This implication partially substantiates our motivation for introducing second-order methods: although (3.4) is in general harder to compute than the proximal gradient method, the tracking error can potentially be reduced by employing a more sophisticated $B_\tau$.
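A small numerical illustration of this point: when the gradient difference equals an averaged Hessian applied to $\hat{x}_{\tau-1} - x_\tau$, the quantity $\rho_\tau$ vanishes for $B_\tau$ equal to that Hessian and is strictly positive for a generic scalar matrix. The matrix `A`, the vector `e`, and the function name below are hypothetical.

```python
import numpy as np

def rho_tau(hess_avg, e, B):
    """rho_tau when grad c(x_prev) - grad c(x_opt) = hess_avg @ e, e = x_prev - x_opt."""
    r = np.linalg.solve(B, hess_avg @ e) - e
    return float(np.sqrt(r @ B @ r) / np.sqrt(e @ B @ e))

A = np.array([[10.0, 2.0], [2.0, 1.0]])   # hypothetical averaged Hessian of c_tau
e = np.array([1.0, 1.0])                  # hypothetical x_hat_{tau-1} - x_tau

rho_hessian = rho_tau(A, e, A)            # B_tau equal to the averaged Hessian
lam_max = np.linalg.eigvalsh(A).max()
rho_scalar = rho_tau(A, e, lam_max * np.eye(2))   # B_tau a scalar matrix
```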

A Centralized Algorithm Based on L-BFGS-B

As Theorem 3.1 points out, the approach of producing the matrices $B_\tau$ has a major effect on the tracking performance of the approximate Newton method (3.4). However, the Hessian of $c_\tau$ is in general a dense $n\times n$ matrix, which makes the scalability of (3.4) questionable as $n$ increases. Another issue is that (3.4) involves optimization of a potentially nonsmooth function, and in the special case where $h_\tau$ is the indicator function of some convex set $X_\tau$, we obtain a constrained optimization problem which cannot be solved by directly applying algorithms for unconstrained optimization.

In this subsection, we provide a specific algorithm based on L-BFGS-B to deal with these issues, when the nonsmooth convex function $h_\tau$ is the indicator function of a rectangular set in $\mathbb{R}^n$ with a nonempty interior:

$$h_\tau = I_{X_\tau}, \qquad X_\tau = \{ x\in\mathbb{R}^n : \underline{x}_\tau \le x \le \overline{x}_\tau \}.$$

The L-BFGS-B algorithm [26, 69] is a quasi-Newton method that solves nonlinear programs with box constraints. It employs two key techniques: the limited-memory BFGS method to produce the approximate Hessian, and the gradient projection method¹ to handle the box constraints. By applying these two key techniques to the time-varying optimization setting, we propose the following procedure for each $\tau\in\mathcal{T}$:

1. Produce $B_\tau$ by the limited-memory BFGS method.

2. Use the gradient projection method to get an approximate solution to the quadratic program

$$\min_{x\in\mathbb{R}^n} \; q_\tau(x) := g_\tau^T(x-\hat{x}_{\tau-1}) + \frac{1}{2}(x-\hat{x}_{\tau-1})^T B_\tau (x-\hat{x}_{\tau-1}) \quad \text{s.t.} \quad \underline{x}_\tau \le x \le \overline{x}_\tau, \tag{3.9}$$

where $g_\tau = \nabla c_\tau(\hat{x}_{\tau-1})$.

¹ Not to be confused with the projected gradient method.

Note that the gradient projection method, which will be presented shortly, produces only an approximate solution to the quadratic program (3.9) in general. Nevertheless, simulation shows that such an approximate solution usually suffices.

We now give a succinct introduction to these two key techniques under the time-varying optimization setting.

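Limited-memory BFGS. In limited-memory BFGS, the approximate Hessian $B_\tau$ takes the form

$$B_\tau = \vartheta_\tau I - K_\tau M_\tau K_\tau^T, \tag{3.10}$$

where $K_\tau\in\mathbb{R}^{n\times 2d}$ and $M_\tau\in\mathbb{R}^{2d\times 2d}$ for some $d\in\mathbb{N}$. In practice $d$ is typically between 3 and 20 and is usually much smaller than $n$. It can be seen that $B_\tau$ is a low-rank correction of a scalar matrix.

To be specific, we denote

$$s_\tau = \hat{x}_\tau - \hat{x}_{\tau-1}, \qquad y_\tau = \nabla c_\tau(\hat{x}_\tau) - \nabla c_\tau(\hat{x}_{\tau-1}),$$

and

$$Y_\tau = \begin{bmatrix} y_{\tau-d} & \cdots & y_{\tau-1} \end{bmatrix}, \qquad S_\tau = \begin{bmatrix} s_{\tau-d} & \cdots & s_{\tau-1} \end{bmatrix}.$$

The matrices $K_\tau$ and $M_\tau$ are then given by

$$K_\tau = \begin{bmatrix} Y_\tau & \vartheta_\tau S_\tau \end{bmatrix}, \qquad M_\tau = \begin{bmatrix} -D_\tau & L_\tau^T \\ L_\tau & \vartheta_\tau S_\tau^T S_\tau \end{bmatrix}^{-1},$$

where $D_\tau\in\mathbb{R}^{d\times d}$ and $L_\tau\in\mathbb{R}^{d\times d}$ are given by

$$D_\tau = \operatorname{diag}\left( s_{\tau-d}^T y_{\tau-d}, \ldots, s_{\tau-1}^T y_{\tau-1} \right),$$

$$(L_\tau)_{ij} = \begin{cases} s_{\tau-d-1+i}^T\, y_{\tau-d-1+j}, & \text{if } i > j, \\ 0, & \text{otherwise.} \end{cases}$$

The scalar $\vartheta_\tau$ is given by

$$\vartheta_\tau = \frac{y_{\tau-1}^T y_{\tau-1}}{y_{\tau-1}^T s_{\tau-1}}.$$

For $1 < \tau \le d$, the matrices $Y_\tau$, $S_\tau$, $D_\tau$ and $L_\tau$ will only be constructed from $s_1,\ldots,s_{\tau-1}$ and $y_1,\ldots,y_{\tau-1}$, and consequently $Y_\tau\in\mathbb{R}^{n\times(\tau-1)}$, $S_\tau\in\mathbb{R}^{n\times(\tau-1)}$, $K_\tau\in\mathbb{R}^{n\times 2(\tau-1)}$ and $M_\tau\in\mathbb{R}^{2(\tau-1)\times 2(\tau-1)}$. For $\tau = 1$, we let $B_\tau = \vartheta_0 I$ where $\vartheta_0 > 0$ is an initial estimate. For the rationale behind the limited-memory BFGS method and minor modifications to guarantee the positive definiteness of $B_\tau$, we refer to [26, 27, 72].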
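The compact representation (3.10) can be sketched in a few lines of NumPy. The code below forms the dense matrix $B_\tau$ explicitly, which is only sensible for illustration (a practical implementation would keep $K_\tau$ and $M_\tau$ and never build the $n\times n$ matrix), and it omits the safeguards for positive definiteness mentioned above. The function names and the convention that columns are ordered oldest first are assumptions of this sketch. As a sanity check, it is compared against the standard BFGS update applied pair by pair starting from $\vartheta_\tau I$.

```python
import numpy as np

def lbfgs_compact(S, Y):
    """Build B = theta*I - K M K^T as in (3.10) from stored pairs.

    S, Y: n-by-d arrays whose columns are the stored s and y vectors, oldest first.
    Returns the dense matrix B and the scalar theta.
    """
    n = S.shape[0]
    s_last, y_last = S[:, -1], Y[:, -1]
    theta = (y_last @ y_last) / (y_last @ s_last)
    StY = S.T @ Y
    D = np.diag(np.diag(StY))        # D_ii = s_i^T y_i
    L = np.tril(StY, k=-1)           # L_ij = s_i^T y_j for i > j, zero otherwise
    K = np.hstack([Y, theta * S])
    M_inv = np.block([[-D, L.T], [L, theta * (S.T @ S)]])
    B = theta * np.eye(n) - K @ np.linalg.solve(M_inv, K.T)
    return B, theta

def bfgs_recursive(S, Y, theta):
    """Reference: standard BFGS updates applied pair by pair, starting from theta*I."""
    B = theta * np.eye(S.shape[0])
    for s, y in zip(S.T, Y.T):
        Bs = B @ s
        B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
    return B

# Hypothetical data: pairs generated from a quadratic with Hessian A, so s^T y > 0.
rng = np.random.default_rng(1)
n, d = 6, 3
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)
S = rng.standard_normal((n, d))
Y = A @ S
B, theta = lbfgs_compact(S, Y)
```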

The gradient projection method. The gradient projection method estimates the free variables of the box constraints in (3.9) and then performs a subspace minimization to provide an approximate solution to (3.9). The estimation of the free variables is via the generalized Cauchy point. Let

$$\tilde{x}_\tau(t) = P_{X_\tau}\left[ w_{\tau-1} - t\,\nabla q_\tau(w_{\tau-1}) \right], \qquad w_{\tau-1} := P_{X_\tau}(\hat{x}_{\tau-1}),$$

and let $t_\tau^c$ be the smallest minimizer of the piecewise quadratic function $q_\tau(\tilde{x}_\tau(t))$ over $t > 0$. The generalized Cauchy point is then given by $x_\tau^c := \tilde{x}_\tau(t_\tau^c)$. Let

$$\mathcal{I} = \left\{ i\in\{1,\ldots,n\} : x_{\tau,i}^c = \underline{x}_{\tau,i} \ \text{or} \ x_{\tau,i}^c = \overline{x}_{\tau,i} \right\}.$$

We then solve

$$u_\tau = \arg\min_{u\in\mathbb{R}^n} \; q_\tau(x_\tau^c + u) \quad \text{s.t.} \quad u_{\mathcal{I}} = 0. \tag{3.11}$$

It can be immediately recognized that (3.11) is essentially an unconstrained quadratic program. After obtaining $u_\tau$, we compute²

$$\tilde{u}_\tau = P_{X_\tau}\left( x_\tau^c + u_\tau \right) - w_{\tau-1}, \tag{3.12}$$

and generate $\hat{x}_\tau$ by

$$\hat{x}_\tau = w_{\tau-1} + \alpha_\tau \tilde{u}_\tau,$$

where $\alpha_\tau$ is equal to 1 or is determined by some backtracking strategy.

Because of the compact representation of the limited-memory BFGS method (3.10), the search for $t_\tau^c$ and the minimization (3.11) can be computed in a very efficient manner that has time complexity $O(d^2 n + d^3 + n\log n)$. More details of the gradient projection method can be found in [72, Section 16.7] and also in [26, 69].

² In the case where the $\tilde{u}_\tau$ generated by (3.12) does not satisfy $\nabla q_\tau(w_{\tau-1})^T \tilde{u}_\tau < 0$, we set $\tilde{u}_\tau = x_\tau^c + u_\tau - w_{\tau-1}$, which is a descent direction of $q_\tau$ as shown in [26]. A backtracking is then needed to ensure that $\hat{x}_\tau$ is feasible.
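The gradient projection step described above can be sketched as follows for a dense $B_\tau$. This is a simplified illustration under stated assumptions, not the efficient L-BFGS-B implementation: it does not exploit the compact representation (3.10), it stops at the first local minimizer met along the projected path (the usual way of computing the generalized Cauchy point), and it omits the descent check and backtracking of footnote 2. The function and argument names (`lo` and `hi` for the lower and upper bounds of the box) are hypothetical.

```python
import numpy as np

def gradient_projection_step(g, B, x_prev, lo, hi, alpha=1.0):
    """Approximate solution of (3.9): generalized Cauchy point, then subspace minimization."""
    def grad_q(x):
        return g + B @ (x - x_prev)

    w = np.clip(x_prev, lo, hi)
    gw = grad_q(w)

    # Breakpoints: step length at which each coordinate of w - t*gw hits its bound.
    t_break = np.full(w.shape, np.inf)
    neg, pos = gw < 0, gw > 0
    t_break[neg] = (w[neg] - hi[neg]) / gw[neg]
    t_break[pos] = (w[pos] - lo[pos]) / gw[pos]

    x_c = w.copy()
    d = np.where(t_break > 0, -gw, 0.0)   # coordinates blocked at a bound stay fixed
    finite = np.unique(t_break[(t_break > 0) & np.isfinite(t_break)])
    t_prev = 0.0
    for t_next in np.append(finite, np.inf):
        slope = grad_q(x_c) @ d
        curv = d @ B @ d
        if slope >= 0.0 or curv <= 0.0:
            break                          # x_c is a local minimizer along the path
        step = -slope / curv
        if step < t_next - t_prev:
            x_c = x_c + step * d           # minimizer lies inside this segment
            break
        x_c = x_c + (t_next - t_prev) * d  # move to the next breakpoint
        hit = (t_break <= t_next) & (d != 0.0)
        x_c[hit] = np.where(gw[hit] < 0, hi[hit], lo[hit])   # snap exactly onto the bound
        d[hit] = 0.0
        t_prev = t_next

    # Subspace minimization (3.11): Newton step on the free coordinates only.
    free = (x_c > lo) & (x_c < hi)
    u = np.zeros_like(x_c)
    if free.any():
        u[free] = np.linalg.solve(B[np.ix_(free, free)], -grad_q(x_c)[free])

    u_tilde = np.clip(x_c + u, lo, hi) - w   # (3.12)
    return w + alpha * u_tilde

# Hypothetical data for a two-dimensional instance of (3.9).
B = np.array([[2.0, 0.5], [0.5, 1.0]])
g = np.array([1.0, -1.0])
x_prev = np.zeros(2)
x_loose = gradient_projection_step(g, B, x_prev, np.full(2, -10.0), np.full(2, 10.0))
x_tight = gradient_projection_step(g, B, x_prev, np.full(2, -0.5), np.full(2, 0.5))
```

With the loose box no bound becomes active, so the step coincides with the unconstrained minimizer of $q_\tau$; with the tight box both coordinates end up on their bounds.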

3.3 Approximate Newton Method: The General Case