Approximate Newton Method: A Special Case

Chapter III: Second-Order Algorithms for Time-Varying Optimization

3.2 Approximate Newton Method: A Special Case

We first consider the special case where there are no explicit equality and inequality constraints in (3.1), i.e.,

$$\min_{x \in \mathbb{R}^n} \; c_\tau(x) + h_\tau(x) \tag{3.2}$$

for each $\tau \in \mathcal{T}$. By Lemma 2.8, its optimality condition is given by

$$-\nabla c_\tau(x_\tau^*) \in \partial h_\tau(x_\tau^*), \tag{3.3}$$

where $x_\tau^*$ is any local optimal solution to (3.2). Just as in Chapter 2, in the following discussion we use $x_\tau^*$ to denote an arbitrarily chosen local optimal solution in case (3.2) has multiple local optimal solutions, and we will mainly focus on this particular sequence $(x_\tau^*)_{\tau \in \mathcal{T}}$.

The approximate Newton method for (3.2) is given by the iteration

$$\hat{x}_\tau = \arg\min_{x \in \mathbb{R}^n} \; \nabla c_\tau(\hat{x}_{\tau-1})^T (x - \hat{x}_{\tau-1}) + \frac{1}{2} (x - \hat{x}_{\tau-1})^T B_\tau (x - \hat{x}_{\tau-1}) + h_\tau(x) \tag{3.4}$$

or equivalently in the form of a generalized equation

$$-\nabla c_\tau(\hat{x}_{\tau-1}) - B_\tau(\hat{x}_\tau - \hat{x}_{\tau-1}) \in \partial h_\tau(\hat{x}_\tau) \tag{3.5}$$

for each $\tau \in \mathcal{T}$, where $B_\tau \in \mathbb{R}^{n \times n}$ is a positive definite matrix. The rationale behind (3.4) is that we approximate the smooth part $c_\tau$ of the objective function by a convex quadratic function for each $\tau$, and when $h_\tau$ is simple or has special structure, we expect that the resulting convex program can be solved much more efficiently than finding the batch solution.

We do not put restrictions on the method of producing the matrix $B_\tau$ in (3.4) apart from the positive definiteness of $B_\tau$. Therefore (3.4) gives a class of algorithms. For example, we can set $B_\tau$ to be a scalar matrix, and in this case (3.4) is a special case of the first-order proximal gradient method proposed in Chapter 2. We can also set $B_\tau = \nabla^2 c_\tau(\hat{x}_{\tau-1})$, or use any quasi-Newton method to calculate $B_\tau$, and in this case we obtain a generalization of the (quasi-)Newton method to the time-varying situation. This can be seen more clearly by noting that, when $h_\tau = 0$, (3.4) is equivalent to

$$\hat{x}_\tau = \hat{x}_{\tau-1} - B_\tau^{-1} \nabla c_\tau(\hat{x}_{\tau-1}).$$
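As an illustration, the following minimal NumPy sketch carries out a single step of (3.4) in two cases where the subproblem has a closed form: $h_\tau = 0$ (a Newton-type step) and $B_\tau = \theta I$ with $h_\tau(x) = \mu \|x\|_1$ (a proximal gradient step computed by soft-thresholding). The function names, the choice of an $\ell_1$ regularizer, and the test data are assumptions made for this example only.

```python
import numpy as np

def newton_type_step(x_prev, grad, B):
    """One step of (3.4) when h = 0: x_prev - B^{-1} grad(x_prev)."""
    return x_prev - np.linalg.solve(B, grad(x_prev))

def scalar_prox_step(x_prev, grad, theta, mu):
    """One step of (3.4) when B = theta * I and h(x) = mu * ||x||_1 (assumed h).

    The minimizer is the soft-thresholding of a gradient step with
    step size 1 / theta, i.e. a proximal gradient step.
    """
    z = x_prev - grad(x_prev) / theta
    return np.sign(z) * np.maximum(np.abs(z) - mu / theta, 0.0)

# Illustrative quadratic c(x) = 0.5 * (x - a)^T Q (x - a).
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
a = np.array([1.0, -2.0])
grad = lambda x: Q @ (x - a)

x0 = np.zeros(2)
x_newton = newton_type_step(x0, grad, Q)        # B equal to the Hessian
x_prox = scalar_prox_step(x0, grad, 4.0, 0.5)   # B = 4 I, mu = 0.5
```

With $B_\tau$ equal to the exact Hessian of a quadratic $c_\tau$ and $h_\tau = 0$, the step lands on the minimizer `a` in one iteration, whereas the scalar choice only moves part of the way.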

Since we use a quadratic approximation of cτ in deriving (3.4), we can tell from intuition that the quality of this quadratic approximation plays a significant role in the tracking performance. We now proceed to capture this intuition mathematically.

Tracking Performance

We define the tracking error by

$$\|\hat{x}_\tau - x_\tau^*\|_W$$

for each $\tau \in \mathcal{T}$, where we denote $\|x\|_W := \sqrt{x^T W x}$ for any $x \in \mathbb{R}^n$ and any positive definite $W \in \mathbb{R}^{n \times n}$. Tracking performance will then be evaluated quantitatively by the tracking error.

We then define

$$\sigma_W := \sup_{\tau \in \mathcal{T} \setminus \{0\}} \|x_\tau^* - x_{\tau-1}^*\|_W,$$

which upper bounds the distance between consecutive optimal solutions. It characterizes how fast the optimal solution $x_\tau^*$ drifts as time proceeds.
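The two quantities above can be evaluated on a finite horizon as in the following sketch. The helper names and the toy trajectories are hypothetical, and the maximum over a finite list only stands in for the supremum in the definition of $\sigma_W$.

```python
import numpy as np

def w_norm(v, W):
    """Weighted norm ||v||_W = sqrt(v^T W v) for positive definite W."""
    return float(np.sqrt(v @ W @ v))

def tracking_errors(x_hat, x_star, W):
    """||x_hat[k] - x_star[k]||_W for each time index k."""
    return [w_norm(xh - xs, W) for xh, xs in zip(x_hat, x_star)]

def drift_bound(x_star, W):
    """Largest W-distance between consecutive optimal solutions."""
    return max(w_norm(b - a, W) for a, b in zip(x_star[:-1], x_star[1:]))

# Toy data (assumed): three time steps in R^2.
W = np.diag([1.0, 4.0])
x_star = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([3.0, 2.0])]
x_hat = [np.array([1.0, 0.0]), np.array([3.0, 1.0]), np.array([3.0, 2.0])]

errors = tracking_errors(x_hat, x_star, W)
sigma_W = drift_bound(x_star, W)
```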

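Lemma 3.1. Let $\tau \in \mathcal{T}$ be arbitrary. If $\hat{x}_\tau$ is generated by (3.4), then

$$\|\hat{x}_\tau - x_\tau^*\|_{B_\tau} \le \rho_\tau \, \|\hat{x}_{\tau-1} - x_\tau^*\|_{B_\tau}, \tag{3.6}$$

where

$$\rho_\tau := \frac{\left\| B_\tau^{-1}\left( \nabla c_\tau(\hat{x}_{\tau-1}) - \nabla c_\tau(x_\tau^*) \right) - \left( \hat{x}_{\tau-1} - x_\tau^* \right) \right\|_{B_\tau}}{\|\hat{x}_{\tau-1} - x_\tau^*\|_{B_\tau}}.$$

Proof. The generalized equation (3.5) implies that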

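$$h_\tau(x_\tau^*) - h_\tau(\hat{x}_\tau) \ge (x_\tau^* - \hat{x}_\tau)^T \left( -\nabla c_\tau(\hat{x}_{\tau-1}) - B_\tau(\hat{x}_\tau - \hat{x}_{\tau-1}) \right).$$

On the other hand, by the optimality condition (3.3), we have

$$h_\tau(\hat{x}_\tau) - h_\tau(x_\tau^*) \ge -(\hat{x}_\tau - x_\tau^*)^T \nabla c_\tau(x_\tau^*).$$

Adding the two inequalities, we then have

$$(\hat{x}_\tau - x_\tau^*)^T \left( \nabla c_\tau(\hat{x}_{\tau-1}) - \nabla c_\tau(x_\tau^*) + B_\tau(\hat{x}_\tau - \hat{x}_{\tau-1}) \right) \le 0,$$

which leads to

$$\begin{aligned}
\|\hat{x}_\tau - x_\tau^*\|_{B_\tau}^2
&= (\hat{x}_\tau - x_\tau^*)^T B_\tau (\hat{x}_\tau - x_\tau^*) \\
&\le (\hat{x}_\tau - x_\tau^*)^T B_\tau \left( B_\tau^{-1}\left( \nabla c_\tau(x_\tau^*) - \nabla c_\tau(\hat{x}_{\tau-1}) \right) - \left( x_\tau^* - \hat{x}_{\tau-1} \right) \right) \\
&\le \|\hat{x}_\tau - x_\tau^*\|_{B_\tau} \cdot \left\| B_\tau^{-1}\left( \nabla c_\tau(x_\tau^*) - \nabla c_\tau(\hat{x}_{\tau-1}) \right) - \left( x_\tau^* - \hat{x}_{\tau-1} \right) \right\|_{B_\tau} \\
&= \|\hat{x}_\tau - x_\tau^*\|_{B_\tau} \cdot \rho_\tau \, \|\hat{x}_{\tau-1} - x_\tau^*\|_{B_\tau},
\end{aligned}$$

where the second inequality is the Cauchy-Schwarz inequality for the inner product induced by $B_\tau$.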

The inequality (3.6) now follows directly.
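Lemma 3.1 can be checked numerically on a small instance. The sketch below assumes a separable quadratic $c_\tau$, a diagonal $B_\tau$, and $h_\tau$ equal to the indicator function of a box, so that both $x_\tau^*$ and the solution of (3.4) are obtained exactly by clipping; all data are randomly generated for illustration.

```python
import numpy as np

def b_norm(v, b):
    """Norm ||v||_B for a diagonal B stored as the vector b of its diagonal."""
    return float(np.sqrt(np.sum(b * v * v)))

rng = np.random.default_rng(1)
n = 5
q_diag = rng.uniform(1.0, 4.0, n)    # diagonal Hessian of the assumed quadratic c
a = rng.normal(size=n)               # unconstrained minimizer of c
b = rng.uniform(1.0, 4.0, n)         # diagonal of B_tau
lo, hi = -0.5 * np.ones(n), 0.5 * np.ones(n)

grad = lambda x: q_diag * (x - a)
x_star = np.clip(a, lo, hi)                          # minimizer of c over the box
x_prev = rng.uniform(lo, hi)                         # previous iterate
x_new = np.clip(x_prev - grad(x_prev) / b, lo, hi)   # exact solution of (3.4)

# Both sides of (3.6).
v = x_prev - x_star
rho_tau = b_norm((grad(x_prev) - grad(x_star)) / b - v, b) / b_norm(v, b)
lhs = b_norm(x_new - x_star, b)
rhs = rho_tau * b_norm(v, b)
```

When the box is inactive the two sides coincide; the projection onto the box can only shrink the left-hand side.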

Lemma 3.1 directly leads to the following theorem on the tracking error bound for (3.4).

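Theorem 3.1. Let $W \in \mathbb{R}^{n \times n}$ be a positive definite matrix, and

$$\lambda_M := \sup_{\tau \in \mathcal{T}} \inf\{\lambda \in \mathbb{R} : \lambda W \succeq B_\tau\}, \qquad \lambda_m := \inf_{\tau \in \mathcal{T}} \sup\{\lambda \in \mathbb{R} : \lambda W \preceq B_\tau\}.$$

If

$$\rho := \sup_{\tau \in \mathcal{T}} \frac{\left\| B_\tau^{-1}\left( \nabla c_\tau(\hat{x}_{\tau-1}) - \nabla c_\tau(x_\tau^*) \right) - \left( \hat{x}_{\tau-1} - x_\tau^* \right) \right\|_{B_\tau}}{\|\hat{x}_{\tau-1} - x_\tau^*\|_{B_\tau}} < \sqrt{\frac{\lambda_m}{\lambda_M}}, \tag{3.7}$$

then

$$\|\hat{x}_\tau - x_\tau^*\|_W \le \frac{\rho \sigma_W}{\sqrt{\lambda_m/\lambda_M} - \rho} + \left( \rho \sqrt{\frac{\lambda_M}{\lambda_m}} \right)^{\tau} \left( \|\hat{x}_0 - x_1^*\|_W - \frac{\sigma_W \sqrt{\lambda_m/\lambda_M}}{\sqrt{\lambda_m/\lambda_M} - \rho} \right) \tag{3.8}$$

for any $\tau \in \mathcal{T}$.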

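Proof. The definition of $\rho$ implies that $\rho_\tau \le \rho$ for all $\tau \in \mathcal{T}$. Therefore by Lemma 3.1, for $\tau = 1$, we have

$$\|\hat{x}_1 - x_1^*\|_W \le \sqrt{\lambda_m^{-1}} \, \|\hat{x}_1 - x_1^*\|_{B_1} \le \rho \sqrt{\lambda_m^{-1}} \, \|\hat{x}_0 - x_1^*\|_{B_1} \le \rho \sqrt{\frac{\lambda_M}{\lambda_m}} \, \|\hat{x}_0 - x_1^*\|_W,$$

which is (3.8) for $\tau = 1$, while if (3.8) holds for some $\tau$, then

$$\begin{aligned}
\|\hat{x}_{\tau+1} - x_{\tau+1}^*\|_W
&\le \sqrt{\lambda_m^{-1}} \, \|\hat{x}_{\tau+1} - x_{\tau+1}^*\|_{B_{\tau+1}}
\le \rho \sqrt{\lambda_m^{-1}} \, \|\hat{x}_\tau - x_{\tau+1}^*\|_{B_{\tau+1}} \\
&\le \rho \sqrt{\frac{\lambda_M}{\lambda_m}} \left( \|\hat{x}_\tau - x_\tau^*\|_W + \|x_\tau^* - x_{\tau+1}^*\|_W \right) \\
&\le \rho \sqrt{\frac{\lambda_M}{\lambda_m}} \left( \frac{\rho \sigma_W}{\sqrt{\lambda_m/\lambda_M} - \rho} + \left( \rho \sqrt{\frac{\lambda_M}{\lambda_m}} \right)^{\tau} \left( \|\hat{x}_0 - x_1^*\|_W - \frac{\sigma_W \sqrt{\lambda_m/\lambda_M}}{\sqrt{\lambda_m/\lambda_M} - \rho} \right) + \sigma_W \right) \\
&= \frac{\rho \sigma_W}{\sqrt{\lambda_m/\lambda_M} - \rho} + \left( \rho \sqrt{\frac{\lambda_M}{\lambda_m}} \right)^{\tau+1} \left( \|\hat{x}_0 - x_1^*\|_W - \frac{\sigma_W \sqrt{\lambda_m/\lambda_M}}{\sqrt{\lambda_m/\lambda_M} - \rho} \right).
\end{aligned}$$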

The bound (3.8) then follows from mathematical induction.
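To see how the bound behaves, the following sketch evaluates the right-hand side of (3.8) and compares it with the worst-case recursion $e_1 = a e_0$, $e_{\tau+1} = a (e_\tau + \sigma_W)$ with $a = \rho \sqrt{\lambda_M / \lambda_m}$ that the induction in the proof unrolls. It also computes $\lambda_m$ and $\lambda_M$ for a finite list of matrices through the eigenvalues of $L^{-1} B_\tau L^{-T}$, where $W = L L^T$. All function names and numerical values are illustrative assumptions.

```python
import math
import numpy as np

def lambda_range(B_list, W):
    """(lambda_m, lambda_M) over a finite list of B matrices, with W = L L^T."""
    L_inv = np.linalg.inv(np.linalg.cholesky(W))
    eigs = [np.linalg.eigvalsh(L_inv @ B @ L_inv.T) for B in B_list]
    return min(e[0] for e in eigs), max(e[-1] for e in eigs)

def tracking_bound(tau, rho, sigma_W, lam_m, lam_M, e0):
    """Right-hand side of (3.8); e0 stands for ||x_hat_0 - x_1^*||_W."""
    r = math.sqrt(lam_m / lam_M)
    if not rho < r:
        raise ValueError("condition (3.7) is violated")
    a = rho / r  # equals rho * sqrt(lam_M / lam_m)
    return rho * sigma_W / (r - rho) + a**tau * (e0 - sigma_W * r / (r - rho))

def worst_case_recursion(tau, rho, sigma_W, lam_m, lam_M, e0):
    """Iterate e_1 = a * e0 and e_{k+1} = a * (e_k + sigma_W)."""
    a = rho * math.sqrt(lam_M / lam_m)
    e = a * e0
    for _ in range(tau - 1):
        e = a * (e + sigma_W)
    return e

lam_m, lam_M = lambda_range([np.diag([1.0, 4.0]), np.diag([2.0, 3.0])], np.eye(2))
params = dict(rho=0.3, sigma_W=0.1, lam_m=lam_m, lam_M=lam_M, e0=2.0)
bounds = [tracking_bound(t, **params) for t in range(1, 6)]
limit = params["rho"] * params["sigma_W"] / (math.sqrt(lam_m / lam_M) - params["rho"])
```

As $\tau$ grows, the second term of (3.8) decays geometrically and the bound approaches `limit`, the steady-state term $\rho \sigma_W / (\sqrt{\lambda_m/\lambda_M} - \rho)$.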

The condition (3.7) and the tracking error bound (3.8) suggest that $\rho$ should be sufficiently small so that good tracking performance can be achieved. However, in practice, it is difficult to evaluate $\rho$ without carrying out all of the iterations (3.4), which makes Theorem 3.1 not very useful for estimating the tracking error a priori. Rather, Theorem 3.1 justifies our intuition that the quality of the quadratic approximation in (3.4), which is now quantitatively assessed by $\rho$, plays a central role in the tracking performance. Furthermore, in order to make $\rho$ as small as possible, we need

$$\nabla c_\tau(\hat{x}_{\tau-1}) - \nabla c_\tau(x_\tau^*) \approx B_\tau \left( \hat{x}_{\tau-1} - x_\tau^* \right)$$

for each $\tau$; in other words, the matrices $B_\tau$ in (3.4) should approximate the (averaged) Hessian of $c_\tau$ along the direction $\hat{x}_{\tau-1} - x_\tau^*$. This implication partially substantiates our motivation for introducing second-order methods: although (3.4) is in general harder to compute than the proximal gradient method, the tracking error can potentially be reduced by employing a more sophisticated $B_\tau$.
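The effect of the choice of $B_\tau$ on $\rho_\tau$ can be illustrated on a quadratic $c_\tau$, for which the gradient difference is exactly the Hessian applied to $\hat{x}_{\tau-1} - x_\tau^*$. The sketch below, with an assumed Hessian `Q` and direction `v`, compares a scalar matrix with the exact Hessian.

```python
import numpy as np

def rho_tau(B, grad_diff, v):
    """rho_tau of Lemma 3.1, where v = x_hat_{tau-1} - x_tau^* and
    grad_diff = grad c(x_hat_{tau-1}) - grad c(x_tau^*)."""
    b_norm = lambda z: np.sqrt(z @ B @ z)
    return float(b_norm(np.linalg.solve(B, grad_diff) - v) / b_norm(v))

Q = np.array([[10.0, 2.0], [2.0, 1.0]])   # Hessian of an assumed quadratic c
v = np.array([1.0, -1.0])
grad_diff = Q @ v                          # exact for a quadratic c

lip = np.linalg.eigvalsh(Q)[-1]            # largest eigenvalue of Q
rho_scalar = rho_tau(lip * np.eye(2), grad_diff, v)   # scalar matrix B
rho_hessian = rho_tau(Q, grad_diff, v)                # B equal to the Hessian
```

In this instance the exact Hessian gives $\rho_\tau = 0$ up to rounding, while the scalar matrix gives a value strictly between 0 and 1 that is governed by the conditioning of `Q`.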

A Centralized Algorithm Based on L-BFGS-B

As Theorem 3.1 points out, the approach of producing the matrices $B_\tau$ has a major effect on the tracking performance of the approximate Newton method (3.4). However, the Hessian of $c_\tau$ is in general a dense $n \times n$ matrix, which makes the scalability of (3.4) questionable as $n$ increases. Another issue is that (3.4) involves optimization of a potentially nonsmooth function, and in the special case where $h_\tau$ is the indicator function of some convex set $X_\tau$, we obtain a constrained optimization problem which cannot be solved by directly applying algorithms for unconstrained optimization.

In this subsection, we provide a specific algorithm based on L-BFGS-B to deal with these issues, when the nonsmooth convex function $h_\tau$ is the indicator function of a rectangular set in $\mathbb{R}^n$ with a nonempty interior:

$$h_\tau = I_{X_\tau}, \qquad X_\tau = \left\{ x \in \mathbb{R}^n : \underline{x}_\tau \le x \le \overline{x}_\tau \right\}.$$

The L-BFGS-B algorithm [26, 69] is a quasi-Newton method that solves nonlinear programs with box constraints. It employs two key techniques: the limited-memory BFGS method to produce the approximate Hessian, and the gradient projection method¹ to handle the box constraints. By applying these two key techniques to the time-varying optimization setting, we propose the following procedure for each $\tau \in \mathcal{T}$:

1. Produce $B_\tau$ by the limited-memory BFGS method.

2. Use the gradient projection method to get an approximate solution to the quadratic program

$$\min_{x \in \mathbb{R}^n} \; q_\tau(x) := g_\tau^T (x - \hat{x}_{\tau-1}) + \frac{1}{2} (x - \hat{x}_{\tau-1})^T B_\tau (x - \hat{x}_{\tau-1}) \quad \text{s.t.} \quad \underline{x}_\tau \le x \le \overline{x}_\tau, \tag{3.9}$$

where $g_\tau = \nabla c_\tau(\hat{x}_{\tau-1})$.

¹ Not to be confused with the projected gradient method.

Note that the gradient projection method, which will be presented shortly, will in general only produce an approximate solution to the quadratic program (3.9). Nevertheless, simulation shows that such an approximate solution usually suffices.

We now give a succinct introduction to these two key techniques under the time-varying optimization setting.

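Limited-memory BFGS. In limited-memory BFGS, the approximate Hessian $B_\tau$ takes the form

$$B_\tau = \vartheta_\tau I - K_\tau M_\tau K_\tau^T, \tag{3.10}$$

where $K_\tau \in \mathbb{R}^{n \times 2d}$ and $M_\tau \in \mathbb{R}^{2d \times 2d}$ for some $d \in \mathbb{N}$. In practice $d$ is typically between 3 and 20 and is usually much smaller than $n$. It can be seen that $B_\tau$ is a low-rank correction of a scalar matrix.

To be specific, we denote

$$s_\tau = \hat{x}_\tau - \hat{x}_{\tau-1}, \qquad y_\tau = \nabla c_\tau(\hat{x}_\tau) - \nabla c_\tau(\hat{x}_{\tau-1}),$$

and

$$Y_\tau = \begin{bmatrix} y_{\tau-d} & \cdots & y_{\tau-1} \end{bmatrix}, \qquad S_\tau = \begin{bmatrix} s_{\tau-d} & \cdots & s_{\tau-1} \end{bmatrix}.$$

The matrices $K_\tau$ and $M_\tau$ are then given by

$$K_\tau = \begin{bmatrix} Y_\tau & \vartheta_\tau S_\tau \end{bmatrix}, \qquad M_\tau = \begin{bmatrix} -D_\tau & L_\tau^T \\ L_\tau & \vartheta_\tau S_\tau^T S_\tau \end{bmatrix}^{-1},$$

where $D_\tau \in \mathbb{R}^{d \times d}$ and $L_\tau \in \mathbb{R}^{d \times d}$ are given by

$$D_\tau = \operatorname{diag}\left( s_{\tau-d}^T y_{\tau-d}, \ldots, s_{\tau-1}^T y_{\tau-1} \right), \qquad (L_\tau)_{ij} = \begin{cases} s_{\tau-d-1+i}^T \, y_{\tau-d-1+j}, & \text{if } i > j, \\ 0, & \text{otherwise.} \end{cases}$$

The scalar $\vartheta_\tau$ is given by

$$\vartheta_\tau = \frac{y_{\tau-1}^T y_{\tau-1}}{y_{\tau-1}^T s_{\tau-1}}.$$

For $1 < \tau \le d$, the matrices $Y_\tau$, $S_\tau$, $D_\tau$ and $L_\tau$ will only be constructed from $s_1, \ldots, s_{\tau-1}$ and $y_1, \ldots, y_{\tau-1}$, and consequently $Y_\tau \in \mathbb{R}^{n \times (\tau-1)}$, $S_\tau \in \mathbb{R}^{n \times (\tau-1)}$, $K_\tau \in \mathbb{R}^{n \times 2(\tau-1)}$ and $M_\tau \in \mathbb{R}^{2(\tau-1) \times 2(\tau-1)}$. For $\tau = 1$, we let $B_\tau = \vartheta_0 I$ where $\vartheta_0 > 0$ is an initial estimate. For the rationale behind the limited-memory BFGS method and minor modifications to guarantee the positive definiteness of $B_\tau$, we refer to [26, 27, 72].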
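The compact representation (3.10) can be assembled directly from the stored pairs, as in the following sketch. It forms the dense $n \times n$ matrix only for illustration (a practical implementation would work with $K_\tau$ and $M_\tau$ and never build $B_\tau$ explicitly), and it generates the pairs from an assumed constant positive definite Hessian `H` so that every $s_i^T y_i$ is positive. The function name and the data are hypothetical.

```python
import numpy as np

def lbfgs_matrix(S, Y):
    """Dense B = theta * I - K M K^T as in (3.10).

    Columns of S and Y hold the pairs (s_i, y_i), ordered oldest to newest.
    """
    n = S.shape[0]
    s_last, y_last = S[:, -1], Y[:, -1]
    theta = (y_last @ y_last) / (y_last @ s_last)
    StY = S.T @ Y
    D = np.diag(np.diag(StY))        # D = diag(s_i^T y_i)
    L = np.tril(StY, k=-1)           # L_ij = s_i^T y_j if i > j, else 0
    K = np.hstack([Y, theta * S])
    M_inv = np.block([[-D, L.T], [L, theta * (S.T @ S)]])
    return theta * np.eye(n) - K @ np.linalg.solve(M_inv, K.T)

rng = np.random.default_rng(0)
n, d = 6, 3
A = rng.normal(size=(n, n))
H = A @ A.T + n * np.eye(n)          # assumed constant positive definite Hessian
S = rng.normal(size=(n, d))
Y = H @ S                            # guarantees s_i^T y_i > 0
B = lbfgs_matrix(S, Y)
```

A quick sanity check on the result is the secant equation for the most recent pair, `B @ S[:, -1] == Y[:, -1]` up to rounding, together with symmetry and positive definiteness of `B`.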

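The gradient projection method. The gradient projection method estimates the free variables of the box constraints in (3.9) and then performs a subspace minimization to provide an approximate solution to (3.9). The estimation of the free variables is via the generalized Cauchy point. Let

$$\tilde{x}_\tau(t) = P_{X_\tau}\left[ w_{\tau-1} - t \nabla q_\tau(w_{\tau-1}) \right], \qquad w_{\tau-1} := P_{X_\tau}(\hat{x}_{\tau-1}),$$

and let $t_\tau^c$ be the smallest minimizer of the piecewise quadratic function $q_\tau(\tilde{x}_\tau(t))$ over $t > 0$. The generalized Cauchy point is then given by $x_\tau^c := \tilde{x}_\tau(t_\tau^c)$. Let

$$\mathcal{I} = \left\{ i \in \{1, \ldots, n\} : x_{\tau,i}^c = \underline{x}_{\tau,i} \ \text{or} \ x_{\tau,i}^c = \overline{x}_{\tau,i} \right\}.$$

We then solve

$$u_\tau = \arg\min_{u \in \mathbb{R}^n} \; q_\tau(x_\tau^c + u) \quad \text{s.t.} \quad u_i = 0 \ \ \text{for all } i \in \mathcal{I}. \tag{3.11}$$

It can be immediately recognized that (3.11) is essentially an unconstrained quadratic program in the remaining free variables. After obtaining $u_\tau$, we compute²

$$\tilde{u}_\tau = P_{X_\tau}\left( x_\tau^c + u_\tau \right) - w_{\tau-1}, \tag{3.12}$$

and generate $\hat{x}_\tau$ by

$$\hat{x}_\tau = w_{\tau-1} + \alpha_\tau \tilde{u}_\tau,$$

where $\alpha_\tau$ is equal to 1 or is determined by some backtracking strategy.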

Because of the compact representation of the limited-memory BFGS method (3.10), the search for $t_\tau^c$ and the minimization (3.11) can be computed very efficiently, with time complexity $O(d^2 n + d^3 + n \log n)$. More details of the gradient projection method can be found in [72, Section 16.7] and also in [26, 69].

² In the case where the $\tilde{u}_\tau$ generated by (3.12) does not satisfy $\nabla q_\tau(w_{\tau-1})^T \tilde{u}_\tau < 0$, we set $\tilde{u}_\tau = x_\tau^c + u_\tau - w_{\tau-1}$, which is a descent direction of $q_\tau$ as shown in [26]. A backtracking is then needed to ensure that $\hat{x}_\tau$ is feasible.
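The gradient projection procedure described above can be sketched with dense linear algebra as follows. This is a simplified illustration, not the efficient implementation based on (3.10): it uses a dense $B_\tau$, takes $\alpha_\tau = 1$, and replaces the rule of footnote 2 by a simpler safeguard assumed here, namely returning the generalized Cauchy point whenever the projected subspace step fails to decrease $q_\tau$ further. All function names and test data are hypothetical.

```python
import numpy as np

def cauchy_point(w, g, B, lo, hi):
    """Generalized Cauchy point of q along the path clip(w - t * g, lo, hi).

    w must be feasible and g must be the gradient of q at w.
    """
    t_break = np.full(w.size, np.inf)       # time at which each coordinate hits a bound
    neg, pos = g < 0, g > 0
    t_break[neg] = (w[neg] - hi[neg]) / g[neg]
    t_break[pos] = (w[pos] - lo[pos]) / g[pos]
    d = np.where(t_break > 0, -g, 0.0)      # coordinates already at a bound stay put
    x, t_old = w.copy(), 0.0
    for t_next in np.unique(t_break[t_break > 0]):   # sorted breakpoints
        slope = (g + B @ (x - w)) @ d       # directional derivative of q at x
        if slope >= 0:
            return x
        dt = -slope / (d @ B @ d)           # minimizer along the current segment
        if dt < t_next - t_old:
            return x + dt * d
        x = np.clip(x + (t_next - t_old) * d, lo, hi)
        d[t_break <= t_next] = 0.0          # these coordinates are now fixed
        t_old = t_next
    return x

def gradient_projection_step(x_prev, g_tau, B, lo, hi):
    """Approximate solution of (3.9) with alpha = 1 and a simplified safeguard."""
    q = lambda x: g_tau @ (x - x_prev) + 0.5 * (x - x_prev) @ B @ (x - x_prev)
    grad_q = lambda x: g_tau + B @ (x - x_prev)
    w = np.clip(x_prev, lo, hi)
    x_c = cauchy_point(w, grad_q(w), B, lo, hi)
    free = (x_c > lo) & (x_c < hi)          # complement of the active set
    u = np.zeros_like(w)
    if free.any():                          # subspace minimization (3.11)
        u[free] = np.linalg.solve(B[np.ix_(free, free)], -grad_q(x_c)[free])
    u_tilde = np.clip(x_c + u, lo, hi) - w  # projection as in (3.12)
    candidate = w + u_tilde
    # Assumed safeguard (not the rule of footnote 2): never do worse than x_c.
    return candidate if q(candidate) <= q(x_c) else x_c

B = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
g_tau = np.array([1.0, -2.0, 0.5])
x_prev = np.zeros(3)

tight = gradient_projection_step(x_prev, g_tau, B, -0.2 * np.ones(3), 0.2 * np.ones(3))
loose = gradient_projection_step(x_prev, g_tau, B, -10.0 * np.ones(3), 10.0 * np.ones(3))
```

When the box is loose enough that no bound becomes active, the step reduces to the unconstrained minimizer $\hat{x}_{\tau-1} - B_\tau^{-1} g_\tau$ of $q_\tau$; with the tight box in this example every coordinate ends at a bound and the Cauchy point already satisfies the optimality conditions of (3.9).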

3.3 Approximate Newton Method: The General Case
