• Tidak ada hasil yang ditemukan

Regret bounded by the pathlength of ๐‘ค and the energy of ๐‘ฃ

Dalam dokumen Regret-Optimal Control (Halaman 58-65)

Chapter IV: Regret-Optimal Measurement-Feedback Control

4.3 Regret bounded by the pathlength of ๐‘ค and the energy of ๐‘ฃ

Suppose ๐‘ƒ

2 is chosen to be this solution, and define ๐พ

2 = ๐พ

2(๐‘ƒ

2), ฮฃ2 = ฮฃ2(๐‘ƒ

2). We immediately obtain the factorization (4.8), where we define

ฮ”2(๐‘ง) = ฮฃ1/2

2 (๐ผ +๐พ

2(๐‘ง ๐ผ โˆ’ ๐ดe)โˆ’1๐ต๐‘ค). (4.10) We have

ฮ”โˆ’1

2 (๐‘ง) = (๐ผโˆ’๐พ

2(๐‘ง ๐ผ โˆ’ (๐ดeโˆ’ ๐ต๐‘ค๐พ

2))โˆ’1๐ต๐‘ค)ฮฃโˆ’1/2

2 . We note that ๐ดeโˆ’๐ต๐‘ค๐พ

2 is stable and henceฮ”โˆ’1(๐‘ง) is causal and bounded since its poles are strictly contained in the unit circle. Define

๏ฃฑ๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด

๏ฃฒ

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃณ ๐ดb=

๏ฃฎ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฐ

๐ด โˆ’๐ต๐‘ค๐พ

2

0 ๐ดeโˆ’๐ต๐‘ค๐พ

2

๏ฃน

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃป ,

๐ตb๐‘ค =

๏ฃฎ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฐ

๐ต๐‘คฮฃโˆ’1/2

2

๐ต๐‘คฮฃโˆ’1/2

2

๏ฃน

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃป ,

๐ตb๐‘ข =

๏ฃฎ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฐ ๐ต๐‘ข

0

๏ฃน

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃป ,

๐ถb= h

๐›พ๐ถ 0 i

,

b๐ฟ = h

๐ฟ 0 i

.

(4.11)

Recall that ๐นb= ๐น ,๐บb = ๐บฮ”โˆ’1

2

,๐ปb = ๐›พ ๐ป ,๐ฝb= ๐›พ ๐ฝฮ”โˆ’1

2 ; it follows that ๐น ,b ๐บ ,b ๐ป ,b ๐ฝbare given by

๐นb(๐‘ง) =b๐ฟ(๐‘ง ๐ผ โˆ’ ๐ดb)โˆ’1๐ตb๐‘ข, ๐บb(๐‘ง)=b๐ฟ(๐‘ง ๐ผโˆ’ ๐ดb)๐ตb๐‘ค, ๐ปb(๐‘ง) =๐ถb(๐‘ง ๐ผ โˆ’ ๐ดb)โˆ’1๐ตb๐‘ข, ๐ฝb(๐‘ง) =๐ถb(๐‘ง ๐ผ โˆ’ b๐ด)๐ตb๐‘ค.

โ–ก

of the ๐‘ค is zero and the energy of๐‘ฃ is zero are pairs where๐‘ค is constant and ๐‘ฃ is zero. It is easy to guarantee that๐œ‹matches the offline controller๐œ‹

0on these specific instances, without constraining the behavior of๐œ‹on all other instances.

Analogously with the full-information setting, we derive the measurement-feedback controller whose regret has optimal dependence on the pathlength of the driving disturbance and the energy of the measurement disturbance via a reduction to ๐ปโˆž measurement-feedback control.

Theorem 14. Fix ๐›พ > 0 and define ๐ด,b ๐ตb๐‘ข,๐ตb๐‘ค,๐ถ ,b b๐ฟ as in (4.19). There exists a causal measurement-feedback controller๐œ‹such that

๐ฝ(๐œ‹, ๐‘ค , ๐‘ฃ) โˆ’๐ฝ(๐œ‹

0, ๐‘ค) < ๐›พ2( โˆฅ๐ท๐‘๐‘คโˆฅ2

2+ โˆฅ๐‘ฃโˆฅ2

2) (4.12)

if and only if the control DARE

๐‘ƒ๐‘ = ๐ดbโˆ—๐‘ƒ๐‘๐ดb+b๐ฟโˆ—b๐ฟโˆ’๐พโˆ—

๐‘๐‘…๐‘๐พ๐‘, and the estimation DARE

๐‘ƒ๐‘’ = ๐ด๐‘ƒb ๐‘’๐ดbโˆ—+๐ตb๐‘ค๐ตbโˆ—

๐‘คโˆ’๐พ๐‘’๐‘…๐‘’๐พโˆ—

๐‘’, where we define

๏ฃฑ๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด

๏ฃฒ

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃณ

๐พ๐‘= ๐‘…โˆ’1 ๐‘

๏ฃฎ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฐ ๐ตbโˆ—

๐‘ข

๐ตbโˆ—

๐‘ค

๏ฃน

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃป ๐‘ƒ๐‘๐ดb

๐‘…๐‘ =

๏ฃฎ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฐ

๐ผ๐‘š 0 0 โˆ’๐ผ๐‘

๏ฃน

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃป +

๏ฃฎ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฐ ๐ตbโˆ—

๐‘ข

๐ตbโˆ—

๐‘ค

๏ฃน

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃป ๐‘ƒ๐‘

h

๐ตb๐‘ข ๐ตb๐‘ค i

,

๐พ๐‘’ = b๐ด๐‘ƒ๐‘‘ h

๐ถb ๐ฟ i

๐‘…โˆ’1

๐‘’ ,

๐‘…๐‘’ =

๏ฃฎ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฐ ๐ผ๐‘Ÿ 0

0 โˆ’๐ผ๐‘›

๏ฃน

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃป +

๏ฃฎ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฐ ๐ถb b๐ฟ

๏ฃน

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃป ๐‘ƒ๐‘

h

๐ถbโˆ— b๐ฟโˆ— i

,

have solutions๐‘ƒ๐‘ โชฐ 0and๐‘ƒ๐‘’ โชฐ0such that 1. The matrix ๐ดbโˆ’ h

๐ตb๐‘ข ๐ตb๐‘ค i

๐พ๐‘ is stable.

2. The matrix๐‘…๐‘has๐‘špositive eigenvalues and๐‘negative eigenvalues.

3. The matrix ๐ดbโˆ’๐พ๐‘’

"

๐ถb b๐ฟ

#

is stable.

4. The matrix๐‘…๐‘’has๐‘Ÿ positive eigenvalues and๐‘›negative eigenvalues.

5. ๐œŒ(๐‘ƒ๐‘๐‘ƒ๐‘’) < 1.

If these conditions are satisfied, then one possible choice of๐œ‹is given by ๐‘ข๐‘ก =โˆ’๐พ๐‘ข(๐œ‰๐‘ก+๐‘ƒ๐ถbโˆ—(๐ผ๐‘Ÿ +๐ถ ๐‘ƒb ๐ถbโˆ—)โˆ’1(๐‘ฆ๐‘กโˆ’๐ถ ๐œ‰b ๐‘ก)),

where the synthetic state๐œ‰ โˆˆR2๐‘›+๐‘evolves according to the linear dynamics equa- tion

๐œ‰๐‘ก+

1 =(๐ดbโˆ’ ๐ตb๐‘ค๐พ๐‘ค) (๐œ‰๐‘ก+๐‘ƒ๐ถbโˆ—(๐ผ๐‘Ÿ +๐ถ ๐‘ƒb ๐ถbโˆ—)โˆ’1(๐‘ฆ๐‘กโˆ’๐ถ ๐œ‰b ๐‘ก)) +๐ตb๐‘ข๐‘ข๐‘ก. The matrices๐พ๐‘ข โˆˆR๐‘šร—(2๐‘›+๐‘) and๐พ๐‘ค โˆˆR๐‘ร—(2๐‘›+๐‘) are defined as

"

๐พ๐‘ข ๐พ๐‘ค

#

=๐พ๐‘,

and we define๐‘ƒ โˆˆR(2๐‘›+๐‘)ร—(2๐‘›+๐‘) as

๐‘ƒ =๐‘ƒ๐‘’(๐ผ๐‘›โˆ’๐‘ƒ๐‘๐‘ƒ๐‘’)โˆ’1.

We note that the regret-optimal controller can be easily obtained via bisection on๐›พ. Proof. The regret condition (14) can be rewritten in matrix form as

"

๐‘ค ๐‘ฃ

#โˆ—

๐‘‡โˆ—

๐พ๐‘‡๐พ

"

๐‘ค ๐‘ฃ

#

<

"

๐‘ค ๐‘ฃ

#โˆ—

๐›พ2

"

๐ทโˆ—

๐‘๐ท๐‘ 0

0 ๐ผ

# +๐‘‡โˆ—

๐พ0

๐‘‡๐พ

0

! "

๐‘ค ๐‘ฃ

#

. (4.13)

Letฮ”2be the unique causal and causally invertible operator such that ๐›พ2๐ทโˆ—

๐‘๐ท๐‘+๐บโˆ—(๐ผ +๐น ๐นโˆ—)โˆ’1๐บ = ฮ”โˆ—

2ฮ”2. (4.14)

Then

๐›พ2

"

๐ทโˆ—

๐‘๐ท๐‘ 0

0 ๐ผ

# +๐‘‡โˆ—

๐พ0

๐‘‡๐พ

0 =

"

ฮ”2 0 0 ๐›พ ๐ผ

#โˆ—"

ฮ”2 0 0 ๐›พ ๐ผ

# .

Define "

๐‘คb b๐‘ฃ

#

=

"

ฮ”2 0 0 ๐›พ ๐ผ

# "

๐‘ค ๐‘ฃ

# .

Condition (4.13) can be rewritten as

"

๐‘คb b๐‘ฃ

#โˆ—

๐‘‡โˆ—

๐พb

๐‘‡

๐พb

"

๐‘คb b๐‘ฃ

#

<

"

๐‘คb b๐‘ฃ

#

2

2

, or equivalently as

โˆฅ๐‘‡

๐พbโˆฅ < 1, where we define

๐‘‡

๐พb=๐‘‡๐พ

"

ฮ”2 0 0 ๐›พ ๐ผ

#โˆ’1

. Using the parameterization (4.1) of๐‘‡๐พ, we see that

๐‘‡

๐พb=

"

๐บฮ”โˆ’1

2 0

0 0

# +

"

๐น ๐ผ

#

(๐›พโˆ’1๐‘„)h ๐›พ ๐ฝฮ”โˆ’1

2 ๐ผ

i . Notice that๐‘‡

๐พbitself has the form of a transfer operator described in (4.1); it is the transfer operator with Youla parameter๐‘„b=๐›พโˆ’1๐‘„in the system

b๐‘  =๐นb b๐‘ข+๐บb

๐‘ค ,b

b๐‘ฆ=๐ปb b๐‘ข+๐ฝb

๐‘คb+b๐‘ฃ , (4.15)

where we define ๐นb = ๐น ,๐บb = ๐บฮ”โˆ’1

2 ,๐ปb = ๐›พ ๐ป ,๐ฝb= ๐›พ ๐ฝฮ”โˆ’1

2 . It is now clear that a controller๐พ satisfying (4.13) exists if and only if there exists a controller ๐พbin the system (4.15) such that โˆฅ๐‘‡

๐พbโˆฅ < 1. If such a controller๐พbexists, then we can easily recover ๐พ from ๐พbby setting ๐พ = ๐›พ๐พb; notice that this is the unique choice of ๐พ which is consistent with the relations๐ปb=๐›พ ๐ป ,๐‘„b=๐›พโˆ’1๐‘„.

In order to assign state-space structure to ๐น ,b ๐บ ,b ๐ป ,b ๐ฝb, we must first findฮ”2(๐‘ง). Let ฮ”be the unique causal and causally invertible operator such that

๐ผ+๐น ๐นโˆ— = ฮ”ฮ”โˆ—. Theorem 2 shows that

ฮ”(๐‘ง) = (๐ผ+๐ฟ(๐‘ง ๐ผ โˆ’ ๐ด)โˆ’1๐พ)ฮฃ1/2. It follows thatฮ”โˆ’1(๐‘ง)๐บ(๐‘ง)is given by

ฮ”โˆ’1(๐‘ง)๐บ(๐‘ง) = ฮฃโˆ’1/2๐ฟ(๐‘ง ๐ผ โˆ’ (๐ดโˆ’๐พ ๐ฟ))โˆ’1๐ต๐‘ค. Define

e๐ฟ =

"

ฮฃโˆ’1/2๐ฟ 0 0 ๐›พ ๐ผ๐‘

# , ๐‘†e=

"

0 ๐›พ2๐ผ๐‘

#

, ๐ดe=

"

๐ดโˆ’๐พ ๐ฟ 0

0 0

#

, ๐ตe๐‘ค =

"

๐ต๐‘ค

โˆ’๐ผ๐‘

#

. (4.16)

Notice that we can rewrite

"

๐›พ2๐ทโˆ—

๐‘(๐‘งโˆ’โˆ—)๐ท๐‘(๐‘ง) +๐บโˆ—ฮ”โˆ’โˆ—ฮ”โˆ’1๐บ 0

0 ๐›พ2๐ผ

#

as

h ๐ตeโˆ—

๐‘ค(๐‘งโˆ’โˆ—๐ผโˆ’ ๐ดe)โˆ’โˆ— ๐ผ i

"

e๐ฟโˆ—e๐ฟ e๐‘† e๐‘†โˆ— ๐›พ2๐ผ

# "

(๐‘ง ๐ผ โˆ’๐ดe)โˆ’1๐ตe๐‘ค ๐ผ

# . Applying Lemma 1, we see that this equals

h ๐ตeโˆ—

๐‘ค(๐‘งโˆ’โˆ—๐ผโˆ’ ๐ดe)โˆ’โˆ— ๐ผ i

ฮ›2(๐‘ƒ

2)

"

(๐‘ง ๐ผโˆ’ ๐ดe)โˆ’1๐ตe๐‘ค ๐ผ

# , where๐‘ƒ

2is an arbitrary Hermitian matrix and we define ฮ›2(๐‘ƒ

2) =

"

e๐ฟโˆ—e๐ฟโˆ’๐‘ƒ

2+๐ดeโˆ—๐‘ƒ

2๐ดe ๐‘†e+๐ดeโˆ—๐‘ƒ

2๐ตe๐‘ค e๐‘†โˆ—+๐ตeโˆ—

๐‘ค๐‘ƒ

2๐ดe ๐›พ2๐ผ+๐ตeโˆ—

๐‘ค๐‘ƒ

2๐ตe๐‘ค

# . Notice that theฮ›2(๐‘ƒ

2) can be factored as

"

๐ผ ๐พโˆ—

2(๐‘ƒ

2)

0 ๐ผ

# "

ฮ“2(๐‘ƒ

2) 0

0 ฮฃ2(๐‘ƒ

2)

# "

๐ผ 0

๐พ2(๐‘ƒ

2) ๐ผ

# , where we define

ฮ“2(๐‘ƒ

2)=e๐ฟโˆ—e๐ฟโˆ’๐‘ƒ

2+๐ดeโˆ—๐‘ƒ

2e๐ดโˆ’๐พโˆ—

2(๐‘ƒ

2)ฮฃ2๐พ

2(๐‘ƒ

2), ๐พ2(๐‘ƒ

2) = ฮฃโˆ’1

2 (๐‘ƒ

2) (e๐‘†โˆ—+๐ตeโˆ—

๐‘ค๐‘ƒ

2e๐ด), ฮฃ2(๐‘ƒ

2) =๐›พ2๐ผ+๐ตeโˆ—

๐‘ค๐‘ƒ

2๐ตe๐‘ค. (4.17) It is clear that (๐ด,e ๐ตe๐‘ค) is stabilizable (this follows from the stability of ๐ดโˆ’ ๐พ ๐ฟ), therefore the Riccati equationฮ“2(๐‘ƒ

2) =0 has a unique stabilizing solution (Theorem E.6.2 in [KSH00]). Suppose ๐‘ƒ

2 is chosen to be this solution, and define ๐พ

2 = ๐พ2(๐‘ƒ

2), ฮฃ2 = ฮฃ2(๐‘ƒ

2). We immediately obtain the factorization (4.14), where we define

ฮ”2(๐‘ง) = ฮฃ1/2

2 (๐ผ +๐พ

2(๐‘ง ๐ผ โˆ’ ๐ดe)โˆ’1๐ตe๐‘ค). (4.18) We have

ฮ”โˆ’1

2 (๐‘ง) = (๐ผโˆ’๐พ

2(๐‘ง ๐ผ โˆ’ (๐ดeโˆ’ ๐ตe๐‘ค๐พ

2))โˆ’1๐ตe๐‘ค)ฮฃโˆ’1/2

2 .

We note that e๐ดโˆ’๐ตe๐‘ค๐พ

2 is stable and henceฮ”โˆ’1

2 (๐‘ง) is causal and bounded since its poles are strictly contained in the unit circle. Define

๏ฃฑ๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด

๏ฃฒ

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃด๏ฃด

๏ฃณ ๐ดb=

๏ฃฎ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฐ

๐ด โˆ’๐ต๐‘ค๐พ

2

0 ๐ดeโˆ’๐ตe๐‘ค๐พ

2

๏ฃน

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃป

๐ตb๐‘ข =

๏ฃฎ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฐ ๐ต๐‘ข

0

๏ฃน

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃป ,

๐ตb๐‘ค =

๏ฃฎ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฏ

๏ฃฐ

๐ต๐‘คฮฃโˆ’1/2

2

๐ตe๐‘คฮฃโˆ’1/2

2

๏ฃน

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃบ

๏ฃป ,

๐ถb= h

๐›พ๐ถ 0 i

,

b๐ฟ = h ๐ฟ 0

i .

(4.19)

Recall that ๐นb= ๐น ,๐บb = ๐บฮ”โˆ’1

2

,๐ปb = ๐›พ ๐ป ,๐ฝb= ๐›พ ๐ฝฮ”โˆ’1

2 ; it follows that ๐น ,b ๐บ ,b ๐ป ,b ๐ฝbare given by

๐นb(๐‘ง) =b๐ฟ(๐‘ง ๐ผ โˆ’ ๐ดb)โˆ’1๐ตb๐‘ข, ๐บb(๐‘ง)=b๐ฟ(๐‘ง ๐ผโˆ’ ๐ดb)๐ตb๐‘ค, ๐ปb(๐‘ง) =๐ถb(๐‘ง ๐ผ โˆ’ ๐ดb)โˆ’1๐ตb๐‘ข, ๐ฝb(๐‘ง) =๐ถb(๐‘ง ๐ผ โˆ’ b๐ด)๐ตb๐‘ค.

โ–ก

C h a p t e r 5

CONNECTIONS TO ONLINE LEARNING

In this section we demonstrate a surprising connection between regret-optimal con- trol and online learning. Specifically, we show that the competitive controller is closely approximated by disturbance-action-control (DAC) policies, which general- ize linear state-feedback policies. A DAC policy has the form

๐‘ข๐‘ก =๐พ ๐‘ฅ๐‘ก+

๐ป

โˆ‘๏ธ

๐‘–=1

๐‘€[๐‘–โˆ’1]๐‘ค๐‘กโˆ’

1,

where๐พ is a stabilizing state-feedback controller and๐‘€ = (๐‘€[0]

, . . . , ๐‘€[๐ปโˆ’1])is a series of geometrically decreasing weights. The class of DAC policies has attracted much recent attention in the online learning community [Aga+19; FS20; Sim20;

CH21] due to the fact that that for any fixed๐พ, the states generated by a DAC policy have a linear dependence on the weights ๐‘€; it follows that the problem of selecting the weights to minimize cost can be formulated as a convex program. In particular, [Aga+19] showed that the weights can be optimized online via a gradient-based algorithm using a reduction to online convex optimization with memory. This problem, introduced in [AHM15], is a generalization of OCO where the costs incurred by the learner depend not only on the learnerโ€™s most recent decision but rather on the learnerโ€™s past๐ป decisions.

Using this approximation of the competitive controller by DAC policies, we show that online algorithms with sublinear policy regret against DAC policies can be in- stantiated so as to retain sublinear policy regret while also achieving an approximate competitive ratio guarantee on bounded disturbance sequences. The proof can be summarized as follows. We first construct a DAC policy หœ๐œ‹๐‘ which๐œ€-approximates the strictly causal competitive policy๐œ‹๐‘, in the sense that

๐ฝ๐‘‡(๐œ‹หœ๐‘, ๐‘ค) โˆ’๐ฝ๐‘‡(๐œ‹๐‘, ๐‘ค) โ‰ค๐œ€

for all bounded disturbances๐‘ค. Our construction of หœ๐œ‹๐‘ involves rewriting the com- petitive controller as the sum of a stabilizing component๐พb

0and a linear combination of all previous disturbances; the weights on the most recent๐ปdisturbances are used to instantiate หœ๐œ‹๐‘. We then observe that an algorithm which obtains sublinear pol- icy regret against a DAC class containing หœ๐œ‹๐‘ will automatically attain the optimal

competitive ratio of the strictly causal competitive controller ๐œ‹๐‘, up to a sublinear correction term. Surprisingly, these results can be extended even to the โ€œadaptive control" setting, where the online algorithm does not know the system dynamics ahead of time and must perform online system identification.

Our results shows that combining regret-optimal controllers, which are derived using๐ปโˆž control, can be fruitfully combined with gradient-based algorithms from online learning to give new performance guarantees in control. While the results we present here are stated in terms of the competitive controller, we expect that analogous results can be derived for other flavors of regret-optimal control using similar techniques.

5.1 Approximation of the competitive controller by DAC policies

We prove that we can find a DAC policy which generates a sequence of states and control actions which closely tracks the sequence of states and control actions generated by the optimal competitive policy by taking the history๐ปof the DAC to be sufficiently large, and choosing the stabilizing component and weights appropriately.

Before we state our approximation result, we introduce some convenient notation.

We introduce the partition

๐พb= h

๐พb

0 ๐พb

1

i

of๐พbinto two๐‘›ร—๐‘›block matrices, where๐พbis the state-feedback matrix appearing in the strictly causal controller we obtained in Theorem 6. Recall that ๐ดb+๐ตb๐พbis stable; this matrix is block upper-triangular, and the matrix in the (1, 1) block is ๐ด+๐ต๐พb

0. It follows that๐ด+๐ต๐พb

0is also stable, and hence there exist matrices๐‘†

1, ๐ฟ

1

such

๐ด+๐ต๐พb

0=๐‘†

1๐ฟ

1๐‘†โˆ’1

1

andโˆฅ๐ฟ

1โˆฅ < 1. Similarly, there exist matrices๐‘†

2, ๐ฟ

2such that๐ดโˆ’๐พ ๐‘„1/2 =๐‘†

2๐ฟ

2๐‘†โˆ’1 2

andโˆฅ๐ฟ

2โˆฅ <1.

We now show that the competitive controller can be arbitrarily well-approximated by DAC policies:

Theorem 15. Fix a time horizon๐‘‡ and a disturbance bound๐‘Š. Define ๐œ… =max(1,๐พb

0,๐พb

1,โˆฅ๐ต๐‘ขโˆฅ,โˆฅ๐ต๐‘คโˆฅ,โˆฅ๐‘†

1โˆฅ โˆฅ๐‘†โˆ’1

1 โˆฅ,โˆฅ๐‘†

2โˆฅ โˆฅ๐‘†โˆ’1

2 โˆฅ) and

๐›ฟ =1โˆ’max(1/2,โˆฅ๐ฟ

1โˆฅ,โˆฅ๐ฟ

2โˆฅ).

For any๐œ€ > 0, set

๐ป = 1

log(1โˆ’๐›ฟ/2) log

988๐‘Š2๐œ…12max(1,โˆฅ๐‘„โˆฅ2) ๐›ฟ4๐œ€

๐‘‡

and defineM be the set of (๐ป , ๐œ…2(1+ โˆฅ๐‘„1/2โˆฅ), ๐›ฟ/2)-DAC policies with stabilizing component ๐พb

0. Let ๐œ‹๐‘ be the strictly causal competitive controller described in Theorem 6. There exists a DAC policy๐œ‹หœ๐‘ โˆˆ M such that

๐ฝ๐‘‡(๐œ‹หœ๐‘, ๐‘ค) โˆ’๐ฝ๐‘‡(๐œ‹๐‘, ๐‘ค) โ‰ค๐œ€ for all disturbance sequences๐‘ค= (๐‘ค

0, . . . , ๐‘ค๐‘‡) such thatโˆฅ๐‘คโˆฅโˆž โ‰ค๐‘Š. Proof. Unrolling the recursive definition of๐œˆ๐‘กin Theorem 6, we see that

๐œˆ๐‘ก =

๐‘ก

โˆ‘๏ธ

๐‘–=1

(๐ดโˆ’๐พ ๐‘„1/2)๐‘–โˆ’1๐ต๐‘ค๐‘ค๐‘กโˆ’๐‘–.

Let{๐‘ฅ๐‘ก}๐‘‡

๐‘ก=0be the state sequence generated by the strictly causal competitive control policy ๐œ‹๐‘ in the original ๐‘š-dimensional system and let {๐œ‰๐‘ก}๐‘‡

๐‘ก=0 is the sequence of states generated by ๐œ‹๐‘ in the 2๐‘š-dimensional synthetic system. Recall that ๐‘คb๐‘ก = ฮฃโˆ’1/2๐‘„1/2๐œˆ๐‘ก and๐‘ข๐‘ก = ๐พ ๐œ‰ห† ๐‘ก, so Theorem 6 implies that the first๐‘› states of ๐œ‰๐‘ก are precisely๐‘ฅ๐‘กโˆ’ ๐œˆ๐‘ก. We decompose ๐‘ฅ๐‘ก as a linear combination of the disturbance terms:

๐‘ฅ๐‘ก+

1= ๐ด๐‘ฅ๐‘ก+๐ต๐‘ข๐พ ๐œ‰b ๐‘ก+๐ต๐‘ค๐‘ค๐‘ก

= ๐ด๐‘ฅ๐‘ก+๐ต๐‘ข h

๐พb

0 ๐พb

1

i

"

๐‘ฅ๐‘กโˆ’๐œˆ๐‘ก ฮฃโˆ’1/2๐‘„1/2๐œˆ๐‘ก

#

+๐ต๐‘ค๐‘ค๐‘ก

= (๐ด+๐ต๐‘ข๐พb

0)๐‘ฅ๐‘ก+๐ต๐‘ข(๐พb

1ฮฃโˆ’1/2๐‘„1/2โˆ’๐พb

0)๐œˆ๐‘ก+๐ต๐‘ค๐‘ค๐‘ก

=

๐‘ก

โˆ‘๏ธ

๐‘–=0

(๐ด+๐ต๐‘ข๐พb

0)๐‘–

๐ต๐‘ค๐‘ค๐‘กโˆ’๐‘–+๐ต๐‘ข(๐พb

1ฮฃโˆ’1/2๐‘„1/2โˆ’๐พb

0)๐‘ฃ๐‘กโˆ’๐‘–

=

๐‘ก

โˆ‘๏ธ

๐‘–=0

(๐ด+๐ต๐‘ข๐พb

0)๐‘–ยฉ

ยญ

ยซ

๐ต๐‘ค๐‘ค๐‘กโˆ’๐‘–+๐ต๐‘ข

๐‘กโˆ’๐‘–

โˆ‘๏ธ

๐‘—=1

๐‘€[๐‘—โˆ’1]๐ต๐‘ค๐‘ค๐‘กโˆ’๐‘–โˆ’๐‘—ยช

ยฎ

ยฌ

=

๐‘ก

โˆ‘๏ธ

๐‘–=0

ยฉ

ยญ

ยซ

(๐ด+๐ต๐‘ข๐พb

0)๐‘–+

๐‘–

โˆ‘๏ธ

๐‘—=1

(๐ด+๐ต๐‘ข๐พb

0)๐‘–โˆ’๐‘—๐ต๐‘ข๐‘€[๐‘—โˆ’1]ยช

ยฎ

ยฌ

๐ต๐‘ค๐‘ค๐‘กโˆ’๐‘–,

where we define the weights ๐‘€[๐‘–โˆ’1] = (๐พb

1ฮฃโˆ’1/2๐‘„1/2โˆ’๐พb

0) (๐ดโˆ’๐พ ๐‘„1/2)๐‘–โˆ’1

for all ๐‘– โ‰ฅ 1. We now describe a DAC policy หœ๐œ‹๐‘ which approximates ๐œ‹๐‘; recall that every DAC policy is parameterized by a stabilizing controller and a set of ๐ป weights. We take the stabilizing controller to be ๐พb

0 and the set of weights to be ๐‘€ = (๐‘€[0], . . . , ๐‘€[๐ปโˆ’1]). The control action selected by หœ๐œ‹๐‘ in each timestep is therefore

หœ ๐‘ข๐‘ก =๐พb

0๐‘ฅหœ๐‘ก +

๐ป

โˆ‘๏ธ

๐‘–=1

๐‘€[๐‘–โˆ’1]๐ต๐‘ค๐‘ค๐‘กโˆ’๐‘–, (5.1) where{๐‘ฅหœ๐‘ก}๐‘‡

๐‘ก=0is the state sequence generated by หœ๐œ‹๐‘. We now show that the weights {๐‘€[๐‘–โˆ’1]}โˆž

๐‘–=1 decay geometrically in time. Recall that ๐ด+ ๐ต๐พb

0 = ๐‘†

1๐ฟ

1๐‘†โˆ’1

1 , where โˆฅ๐‘†

1โˆฅ โˆฅ๐‘†โˆ’1

1 โˆฅ โ‰ค ๐œ… and โˆฅ๐ฟ

1โˆฅ โ‰ค 1โˆ’ ๐›ฟ. Similarly, ๐ดโˆ’๐พ ๐‘„1/2=๐‘†

2๐ฟ

2๐‘†โˆ’1

2 , where โˆฅ๐‘†

2โˆฅ โˆฅ๐‘†โˆ’1

2 โˆฅ โ‰ค ๐œ…andโˆฅ๐ฟ

2โˆฅ โ‰ค 1โˆ’๐›ฟ. It follows that max

โˆฅ (๐ด+๐ต๐พb

0)๐‘–โˆ’1โˆฅ,(๐ดโˆ’๐พ ๐‘„1/2)๐‘–โˆ’1) โˆฅ

โ‰ค ๐œ…(1โˆ’๐›ฟ)๐‘–โˆ’1. (5.2) Recall thatฮฃ = ๐ผ +๐‘„1/2๐‘ƒ๐‘„1/2โชฐ ๐ผ, therefore โˆฅฮฃโˆ’1/2โˆฅ โ‰ค1. It follows that

โˆฅ๐พb

1ฮฃโˆ’1/2๐‘„1/2โˆ’๐พb

0โˆฅ โ‰ค โˆฅ๐พb

1ฮฃโˆ’1/2๐‘„1/2โˆฅ + โˆฅ๐พb

0โˆฅ

โ‰ค ๐œ…(1+ โˆฅ๐‘„1/2โˆฅ) (5.3) Using (5.2) and (5.3) together, we immediately obtain

โˆฅ๐‘€[๐‘–โˆ’1]โˆฅ โ‰ค ๐œ…2(1+ โˆฅ๐‘„1/2โˆฅ) (1โˆ’๐›ฟ)๐‘–โˆ’1. (5.4)

We now bound the distance between๐‘ฅ๐‘ก and หœ๐‘ฅ๐‘ก . Plugging our choice of หœ๐‘ข๐‘ก given in (5.1) into the dynamics

๐‘ฅ๐‘ก+

1 = ๐ด๐‘ฅ๐‘ก+๐ต๐‘ข๐‘ข๐‘ก+๐ต๐‘ค๐‘ค๐‘ก, we see that

หœ ๐‘ฅ๐‘ก+

1= ๐ด๐‘ฅหœ๐‘ก +๐ต๐‘ข ๐พb

0๐‘ฅหœ๐‘ก+

๐ป

โˆ‘๏ธ

๐‘–=1

๐‘€[๐‘–โˆ’1]๐ต๐‘ค๐‘ค๐‘กโˆ’๐‘–

!

+๐ต๐‘ค๐‘ค๐‘ก

=

๐‘ก

โˆ‘๏ธ

๐‘–=0

(๐ด+๐ต๐‘ข๐พb

0)๐‘–ยฉ

ยญ

ยซ

๐ต๐‘ค๐‘ค๐‘กโˆ’๐‘–+๐ต๐‘ข

๐ป

โˆ‘๏ธ

๐‘—=1

๐‘€[๐‘—โˆ’1]๐ต๐‘ค๐‘ค๐‘กโˆ’๐‘–โˆ’๐‘—ยช

ยฎ

ยฌ

=

๐‘ก

โˆ‘๏ธ

๐‘–=0

ยฉ

ยญ

ยซ

(๐ด+๐ต๐‘ข๐พb

0)๐‘–+

min{๐ป ,๐‘–}

โˆ‘๏ธ

๐‘—=1

(๐ด+๐ต๐‘ข๐พb

0)๐‘–โˆ’๐‘—๐ต๐‘ข๐‘€[๐‘—โˆ’1]ยช

ยฎ

ยฌ

๐ต๐‘ค๐‘ค๐‘กโˆ’๐‘–.

Applying (5.2) and (5.4) we obtain

โˆฅ๐‘ฅ๐‘ก+

1โˆ’๐‘ฅหœ๐‘ก+

1โˆฅ โ‰ค

๐‘ก

โˆ‘๏ธ

๐‘–=๐ป+1 ๐‘–

โˆ‘๏ธ

๐‘—=๐ป+1

โˆฅ (๐ด+๐ต๐‘ข๐พb

0)๐‘–โˆ’๐‘—โˆฅ โˆฅ๐ต๐‘ขโˆฅ โˆฅ๐‘€[๐‘—โˆ’1]โˆฅ โˆฅ๐ต๐‘คโˆฅ๐‘Š

โ‰ค๐‘Š ๐œ…5(1+ โˆฅ๐‘„1/2โˆฅ)

๐‘ก

โˆ‘๏ธ

๐‘–=๐ป+1 ๐‘–

โˆ‘๏ธ

๐‘—=๐ป+1

(1โˆ’๐›ฟ)๐‘–โˆ’1

โ‰ค๐‘Š ๐œ…5(1+ โˆฅ๐‘„1/2โˆฅ) โˆ‘๏ธ

๐‘–โ‰ฅ๐ป+1

๐‘–(1โˆ’๐›ฟ)๐‘–โˆ’1

โ‰ค 4๐‘Š ๐œ…5(1+ โˆฅ๐‘„1/2โˆฅ) ๐›ฟ

โˆ‘๏ธ

๐‘–โ‰ฅ๐ป+1

(1โˆ’๐›ฟ/2)๐‘–,

โ‰ค 8๐‘Š ๐œ…5(1+ โˆฅ๐‘„1/2โˆฅ)

๐›ฟ2 (1โˆ’๐›ฟ/2)๐ป (5.5)

where in the penultimate line we applied Lemma 4 in Appendix A .

We now bound the distance between๐‘ข๐‘ก and หœ๐‘ข๐‘ก. Unrolling the dynamics, we see that ๐‘ข๐‘ก =๐พ ๐œ‰b ๐‘ก

=๐พb

0๐‘ฅ๐‘ก+ (๐พb

1ฮฃโˆ’1/2๐‘„1/2โˆ’๐พb

0)๐œˆ๐‘ก

=๐พb

0๐‘ฅ๐‘ก+

๐‘กโˆ’1

โˆ‘๏ธ

๐‘–=1

(๐พb

1ฮฃโˆ’1/2๐‘„1/2โˆ’๐พb

0) (๐ดโˆ’๐พ ๐‘„1/2)๐‘–โˆ’1๐‘ค๐‘กโˆ’๐‘–

=๐พb

0๐‘ฅ๐‘ก+

๐‘กโˆ’1

โˆ‘๏ธ

๐‘–=1

๐‘€[๐‘–โˆ’1]๐‘ค๐‘กโˆ’๐‘–.

The definition of หœ๐‘ข๐‘ก given in (5.1) and the bounds (5.4) and (5.5) imply that

โˆฅ๐‘ข๐‘กโˆ’๐‘ขหœ๐‘กโˆฅ โ‰ค โˆฅ๐พb

0โˆฅ โˆฅ๐‘ฅ๐‘กโˆ’๐‘ฅหœ๐‘กโˆฅ +

๐‘กโˆ’1

โˆ‘๏ธ

๐‘–=๐ป+1

โˆฅ๐‘€[๐‘–โˆ’1]โˆฅ๐‘ค๐‘กโˆ’๐‘–

โ‰ค 9๐‘Š ๐œ…6(1+ โˆฅ๐‘„1/2โˆฅ)

๐›ฟ2 (1โˆ’๐›ฟ/2)๐ป, where we used the fact that๐›ฟ <1 and ๐œ… โ‰ฅ max(1,โˆฅ๐พb

0โˆฅ).

We now show that by taking๐ปto be sufficiently large we can ensure that ๐ฝ๐‘‡(๐œ‹หœ๐‘, ๐‘ค) โˆ’๐ฝ๐‘‡(๐œ‹๐‘, ๐‘ค) โ‰ค ๐œ€ .

Lemma 5 shows that

max( โˆฅ๐‘ฅ๐‘กโˆฅ,โˆฅ๐‘ข๐‘กโˆฅ) โ‰ค 3๐‘Š ๐œ…5(1+ โˆฅ๐‘„1/2โˆฅ) ๐›ฟ2

.

We put the pieces together to bound the difference in costs incurred by the compet- itive policy and our DAC approximation in each timestep. Using the easily-verified inequality โˆฅ๐‘ฅโˆฅ2 โˆ’ โˆฅ๐‘ฆโˆฅ2 โ‰ค ( โˆฅ๐‘ฅโˆ’๐‘ฆโˆฅ) ( โˆฅ2๐‘ฅโˆฅ + โˆฅ๐‘ฅโˆ’๐‘ฆโˆฅ) and the fact that ๐œ… โ‰ฅ 1, 1โˆ’๐›ฟ/2< 1, we see that

โˆฅ๐‘„1/2๐‘ฅ๐‘กโˆฅ2โˆ’ โˆฅ๐‘„1/2๐‘ฅหœ๐‘กโˆฅ2 โ‰ค โˆฅ๐‘„โˆฅ โˆฅ๐‘ฅ๐‘กโˆ’๐‘ฅหœ๐‘กโˆฅ (2โˆฅ๐‘ฅ๐‘กโˆฅ + โˆฅ๐‘ฅ๐‘ก โˆ’๐‘ฅหœ๐‘กโˆฅ)

โ‰ค 112๐‘Š2๐œ…10(1+ โˆฅ๐‘„1/2โˆฅ)2โˆฅ๐‘„โˆฅ

๐›ฟ4 (1โˆ’๐›ฟ/2)๐ป Similarly,

โˆฅ๐‘ข๐‘กโˆฅ2โˆ’ โˆฅ๐‘ขหœ๐‘กโˆฅ2 โ‰ค โˆฅ๐‘ข๐‘กโˆ’๐‘ขหœ๐‘กโˆฅ (2โˆฅ๐‘ข๐‘กโˆฅ + โˆฅ๐‘ข๐‘กโˆ’๐‘ขหœ๐‘กโˆฅ)

โ‰ค 135๐‘Š2๐œ…12(1+ โˆฅ๐‘„1/2โˆฅ)2

๐›ฟ4 (1โˆ’๐›ฟ/2)๐ป. We observe that

(1+ โˆฅ๐‘„1/2|)2max(1,โˆฅ๐‘„โˆฅ) โ‰ค 4 max(1,โˆฅ๐‘„โˆฅ2). The difference in aggregate cost is therefore

๐ฝ๐‘‡(๐œ‹หœ๐‘, ๐‘ค) โˆ’๐ฝ๐‘‡(๐œ‹๐‘, ๐‘ค) =

๐‘‡

โˆ‘๏ธ

๐‘ก=1

(๐‘ฅโŠค

๐‘ก ๐‘„ ๐‘ฅ๐‘ก + โˆฅ๐‘ข๐‘กโˆฅ2โˆ’ (๐‘ฅหœโŠค

๐‘ก ๐‘„๐‘ฅหœ๐‘ก+ โˆฅ๐‘ขหœ๐‘กโˆฅ2)

โ‰ค 988๐‘Š2๐œ…12max(1,โˆฅ๐‘„โˆฅ2)

๐›ฟ4 (1โˆ’๐›ฟ/2)๐ป๐‘‡ . Taking

๐ป โ‰ฅ 1

log(1โˆ’๐›ฟ/2) log

988๐‘Š2๐œ…12max(1,โˆฅ๐‘„โˆฅ2) ๐›ฟ4๐œ€

๐‘‡

is thus sufficient to guarantee that

๐ฝ๐‘‡(๐œ‹หœ๐‘, ๐‘ค) โˆ’๐ฝ๐‘‡(๐œ‹๐‘, ๐‘ค) โ‰ค ๐œ€ .

โ–ก

We now show that this result implies best-of-both-worlds.

5.2 Best of both worlds: sublinear policy regret implies approximate compet- itive ratio

We now show that any online learning algorithmAwhich minimizes regret relative to the class M of DAC policies described in Theorem 15 automatically achieves the best of both worlds: sublinear regret against the best DAC policy selected in hindsight fromM, and optimal competitive ratio, up to a sublinear regret term:

Theorem 16. Fix ๐œ€, ๐‘Š > 0 and the system parameters (๐ด, ๐ต๐‘ข, ๐ต๐‘ค, ๐‘„), and let M be defined as in Theorem 15. LetAbe any algorithm such that

๐ฝ๐‘‡(A, ๐‘ค) โˆ’min

๐œ‹โˆˆM

๐ฝ๐‘‡(๐œ‹, ๐‘ค) โ‰ค ๐‘…(๐‘‡)

for all disturbances ๐‘ค such that โˆฅ๐‘คโˆฅโˆž โ‰ค ๐‘Š, where ๐‘…(๐‘‡) is a bound on the regret incurred by A over the time horizon๐‘‡. Then A also satisfies the โ€œapproximate competitive ratioโ€ condition

๐ฝ๐‘‡(A, ๐‘ค) < ๐›พ2ยท ๐ฝ๐‘‡(๐œ‹

0, ๐‘ค) +๐‘…(๐‘‡) +๐‘‚(1),

where๐›พ2is the optimal competitive ratio attained by the strictly causal competitive controller.

Before turning to the proof, we note that the approximate competitive ratio guarantee described in Theorem 16 is substantially weaker than that offered by the standard competitive controller๐œ‹๐‘. Recall that๐œ‹๐‘offers the following guarantee:

sup

๐‘คโˆˆโ„“2

๐ฝ(๐œ‹๐‘, ๐‘ค) ๐ฝ(๐œ‹

0, ๐‘ค) < ๐›พ2.

Note that this holds for all disturbances ๐‘ค, including those which grow over time.

As we showed in Theorem 7, this guarantee can also be obtained in time-varying systems. By contrast, Theorem 16 holds only for bounded disturbances and LTI systems. Furthermore, the competitive ratio bound described in Theorem 16 is not necessarily constant in the time horizon๐‘‡; it says that

sup

โˆฅ๐‘คโˆฅโˆž<๐‘Š

๐ฝ๐‘‡(A, ๐‘ค) ๐ฝ๐‘‡(๐œ‹

0, ๐‘ค) < ๐›พ2+ sup

โˆฅ๐‘คโˆฅโˆž<๐‘Š

๐‘…(๐‘‡) +๐‘‚(1) ๐ฝ๐‘‡(๐œ‹

0, ๐‘ค) .

This bound on the competitive ratio can grow rapidly as๐‘‡ โ†’ โˆž, provided that๐‘คis constructed such that๐ฝ๐‘‡(๐œ‹

0, ๐‘ค) โ‰ช๐‘…(๐‘‡). It is an open question whether there exist control algorithms with constant competitive ratio and sublinear policy regret.

We now present the proof of Theorem 16.

Proof. By Theorem 15, we know that there exists a DAC policy หœ๐œ‹๐‘ โˆˆ M such that ๐ฝ๐‘‡(๐œ‹หœ๐‘, ๐‘ค) โˆ’๐ฝ๐‘‡(๐œ‹๐‘, ๐‘ค) โ‰ค ๐œ€,

where๐œ‹๐‘ is the strictly causal competitive controller described in Theorem 6. The cost incurred byAcan be bounded as follows:

๐ฝ๐‘‡(A, ๐‘ค) โ‰ค min

๐œ‹โ˜…โˆˆM

๐ฝ๐‘‡(๐œ‹โ˜…, ๐‘ค) +๐‘…(๐‘Š , ๐‘‡)

โ‰ค ๐ฝ๐‘‡(๐œ‹หœ๐‘, ๐‘ค) +๐‘…(๐‘‡)

โ‰ค ๐ฝ๐‘‡(๐œ‹๐‘, ๐‘ค) +๐‘…(๐‘‡) +๐œ€

โ‰ค ๐ฝ(๐œ‹๐‘, ๐‘ค) +๐‘…(๐‘‡) +๐œ€

โ‰ค ๐›พ2ยท๐ฝ(๐œ‹

0, ๐‘ค) +๐‘…(๐‘‡) +๐œ€

โ‰ค ๐›พ2ยท๐ฝ๐‘‡(๐œ‹

0, ๐‘ค) +๐‘…(๐‘‡) +๐‘‚(1),

where in the penultimate line we applied the competitive ratio bound of the infinite- horizon strictly causal competitive controller, and in the last line we used the fact that๐œ€is constant with respect to๐‘‡ and used the following bound which relates the finite-horizon cost of the optimal noncausal controller to its infinite-horizon cost:

๐ฝ(๐œ‹

0, ๐‘ค) โ‰ค ๐ฝ๐‘‡(๐œ‹

0, ๐‘ค) +๐‘‚(1).

This bound can be proven as follows. The infinite-horizon cost incurred by ๐œ‹

0 on the disturbance sequence๐‘ค is the sum of the cost incurred by ๐œ‹

0up to time๐‘‡ and the cost incurred by๐œ‹

0from time๐‘‡ +1 to infinity. Recall that we define๐‘ค๐‘ก =0 for all๐‘ก โ‰ฅ ๐‘‡ +1. Using the characterization of๐œ‹

0 we obtained in Theorem 3, we see that๐œ‹

0 simply selects the ๐ป

2-optimal action starting at time๐‘‡ +1. The ๐ป

2policy is stabilizing and the state๐‘ฅ๐‘‡+

1generated by the optimal noncausal policy has norm which is๐‘‚(1), so the residual๐ฝ(๐œ‹

0, ๐‘ค) โˆ’๐ฝ๐‘‡(๐œ‹

0, ๐‘ค) is also๐‘‚(1). โ–ก By leveraging recently proposed online control algorithms with sublinear policy regret against DAC classes, we immediately obtain several concrete best-of-both- worlds results as corollaries of Theorem 16. We begin by considering the case when the learner knows the linear system (๐ด, ๐ต๐‘ข, ๐ต๐‘ค); in this case, we can apply the algorithm from [Sim20], which has๐‘‚(polylog(๐‘‡))regret against the best DAC policy selected in hindsight from M. This algorithm hence exhibit the following best-of-both-worlds behavior, where we suppress dependence on ๐œ– , ๐‘Š, and the system parameters and focus on how the cost incurred by the algorithm scales in the time horizon:

Corollary 1. Fix ๐œ€, ๐‘Š > 0 and the system parameters (๐ด, ๐ต๐‘ข, ๐ต๐‘ค), and let M be defined as in Theorem 15. There exists a computationally efficient online control al- gorithmAwhich, given the system parameters(๐ด, ๐ต๐‘ข, ๐ต๐‘ค), simultaneously achieves the following performance guarantees:

1. (Approximate competitive ratio) The cost ofAsatisfies ๐ฝ๐‘‡(A, ๐‘ค) < ๐›พ2ยท๐ฝ๐‘‡(๐œ‹

0, ๐‘ค) +๐‘‚(polylog(๐‘‡))

for all disturbances๐‘ค such thatโˆฅ๐‘คโˆฅโˆž < ๐‘Š, where๐›พ2is the optimal compet- itive ratio attained by the strictly causal competitive controller.

2. (Sublinear regret) The cost ofAsatisfies ๐ฝ๐‘‡(A, ๐‘ค) < min

๐œ‹โˆˆM

๐ฝ๐‘‡(๐œ‹, ๐‘ค) +๐‘‚(polylog(๐‘‡)) for all disturbances๐‘ค such thatโˆฅ๐‘คโˆฅโˆž < ๐‘Š.

We next consider the โ€œadaptive controlโ€ setting, where the online control algorithm does not know the system dynamics. The algorithm proposed in [CH21] achieves ๐‘‚(โˆš

๐‘‡) regret even when the system dynamics are unknown; we note that this is exponentially worse than the ๐‘‚(polylog(๐‘‡)) regret which is attainable when the dynamics are known. The algorithm operates in two phases; in the first phase it performs system identification by exciting the system using the control inputs and then estimating the system parameters by observing the system response to the excitation. In the second phase, it uses the estimates obtained in the first phase to construct an๐ป

2controller in the estimated system; this๐ป

2controller is stabilizing in the true system provided that the estimate is sufficiently accurate. The๐‘‚(โˆš

๐‘‡)regret bound leads to the following corollary of Theorem 16, where we again suppress dependence on๐œ– , ๐‘Š, and the system parameters and focus on how the cost incurred by the algorithm scales in the time horizon:

Corollary 2. Fix ๐œ€, ๐‘Š > 0 and the system parameters (๐ด, ๐ต๐‘ข, ๐ต๐‘ค), and let M be defined as in Theorem 15. There exists a computationally efficient online control algorithmAwhich, without knowing the system parameters simultaneously achieves the following performance guarantees:

1. (Approximate competitive ratio) The cost ofAsatisfies ๐ฝ๐‘‡(A, ๐‘ค) < ๐›พ2ยท๐ฝ๐‘‡(๐œ‹

0, ๐‘ค) +๐‘‚

โˆš ๐‘‡

for all disturbances๐‘ค such thatโˆฅ๐‘คโˆฅโˆž < ๐‘Š, where๐›พ2is the optimal compet- itive ratio attained by the strictly causal competitive controller.

2. (Sublinear regret) The cost ofAsatisfies ๐ฝ๐‘‡(A, ๐‘ค) < min

๐œ‹โˆˆM

๐ฝ๐‘‡(๐œ‹, ๐‘ค) +๐‘‚

โˆš ๐‘‡

for all disturbances๐‘ค such thatโˆฅ๐‘คโˆฅโˆž < ๐‘Š.

We note that [Aga+19] showed that algorithms with low regret relative to a DAC policy classM automatically also achieve low regret relative to the class of stabi- lizing linear state-feedback policiesK; this implies that Corollaries 1 and 2 can also be translated into statements about regret relative toK. We refer to their paper for details.

C h a p t e r 6

NUMERICAL EXPERIMENTS

6.1 Double Integrator

The double integrator is a simple dynamical system that models the one-dimensional kinematics of a moving object. The states of the system are the objectโ€™s position and velocity, which are represented by the variables๐‘ฅ โˆˆRand๐‘ฅยค โˆˆR, respectively. The continuous-time dynamics of the system are represented by the Newtonian equation

๐‘‘ ๐‘‘ ๐‘ก

"

๐‘ฅ(๐‘ก)

ยค ๐‘ฅ(๐‘ก)

#

=

"

ยค ๐‘ฅ(๐‘ก)

0

# +

"

0 ๐‘ข(๐‘ก)

# +

"

0 ๐‘ค(๐‘ก)

# ,

where๐‘ข(๐‘ก) โˆˆRand๐‘ค(๐‘ก) โˆˆRare the control input and the exogenous disturbance at time๐‘ก, respectively. The goal of the controller is to stabilize the object by keeping (๐‘ฅ ,๐‘ฅยค) as close to (0,0) as possible, while using little energy. The discrete-time dynamics are

"

๐‘ฅ๐‘ก+

1

ยค ๐‘ฅ๐‘ก+

1

#

=

"

1 ๐›ฟ๐‘ก 0 1

# "

๐‘ฅ๐‘ก

ยค ๐‘ฅ๐‘ก

# +

"

0 ๐›ฟ๐‘ก

# ๐‘ข๐‘ก+

"

0 ๐›ฟ๐‘ก

# ๐‘ค๐‘ก,

where๐›ฟ๐‘กis the discretization parameter; in our experiments we set๐›ฟ๐‘ก =0.1 seconds.

In our experiments we take๐‘„ =๐ผ

2and initialize๐‘ฅand๐‘ฅยคto zero.

We compare the regret-optimal controllers to standard๐ป

2and๐ปโˆžcontrollers in the causal, full-information setting. Each of the regret-optimal and ๐ปโˆž controllers is associated with a parameter๐›พ which quantifies its performance guarantee:

1. The competitive controller has๐›พ =2.14; in other words, the competitive ratio of the competitive controller is 4.58.

2. The energy-optimal controller has๐›พ =0.63; in other words, the regret of the energy-optimal controller against the optimal noncausal controller is bounded by 0.4 times the energy of the disturbance.

3. The pathlength-optimal controller has๐›พ =934.99; in other words, the regret of the pathlength-optimal controller against the optimal noncausal controller is bounded by 8.74ร—105times the pathlength of the disturbance.

4. The ๐ปโˆž-optimal controller has ๐›พ = 1.00, so the cost incurred by the ๐ปโˆž controller is at most the energy of the disturbance.

0 /2 3 /2 2 0

0.5 1 1.5

2 2.5

3

Figure 6.1: Frequency responses of causal controllers in the double integrator system.

Frequency Response

In Figure 6.1, we plot the frequency responseโˆฅ๐‘‡๐พ(๐‘’๐‘–๐œ”) โˆฅ2at different frequencies๐œ”, where๐‘‡๐พ is the transfer operator associated to the controller๐พand๐พ is alternately taken to be the competitive, energy-optimal, pathlength-optimal, ๐ป

2, ๐ปโˆž, and op- timal noncausal controller. The frequency response measures how much energy is transferred to the closed-loop control system from a sinusoidal disturbance with frequency๐œ”. The optimal noncausal controller has the lowest frequency response at all frequencies; its frequency response is a lower bound on the frequency response of any other controller. The๐ปโˆžcontroller is the causal controller which minimizes

sup

๐œ”โˆˆ[0,2๐œ‹]

ยฏ ๐œŽ ๐‘‡โˆ—

๐พ(๐‘’๐‘–๐œ”)๐‘‡๐พ(๐‘’๐‘–๐œ”) .

We see that the frequency response of the ๐ปโˆž controller peaks at ๐œ” = 0, where it matches the frequency response of the optimal noncausal controller; all other causal controllers have a higher peak frequency response. The ๐ป

2controller is the causal controller which minimizes

1 2๐œ‹

โˆซ 2๐œ‹ ๐œ”=0

Tr ๐‘‡โˆ—

๐พ(๐‘’๐‘–๐œ”)๐‘‡๐พ(๐‘’๐‘–๐œ”) , therefore the๐ป

2controller is the controller with the smallest area under its frequency response curve; we note that the disturbance in the double integrator system is scalar,

therefore ยฏ๐œŽ ๐‘‡โˆ—

๐พ(๐‘’๐‘–๐œ”)๐‘‡๐พ(๐‘’๐‘–๐œ”)

= Tr ๐‘‡โˆ—

๐พ(๐‘’๐‘–๐œ”)๐‘‡๐พ(๐‘’๐‘–๐œ”)

. The competitive ratio of a controller๐พ at frequency๐œ”is the spectral radius of the matrix

๐‘‡โˆ—

๐พ(๐‘’๐‘–๐œ”)๐‘‡๐พ(๐‘’๐‘–๐œ”) ๐‘‡โˆ—

๐พ0

(๐‘’๐‘–๐œ”)๐‘‡๐พ

0(๐‘’๐‘–๐œ”)โˆ’1

, where๐‘‡๐พ is the transfer operator associated to ๐พ and๐‘‡๐พ

0 is the transfer operator associated with the optimal noncausal controller. The disturbance is scalar, so the competitive ratio simplifies to

๐‘‡โˆ—

๐พ(๐‘’๐‘–๐œ”)๐‘‡๐พ(๐‘’๐‘–๐œ”) ๐‘‡โˆ—

๐พ0

(๐‘’๐‘–๐œ”)๐‘‡๐พ

0(๐‘’๐‘–๐œ”), which is simply the ratio of frequency responses.

We see that the competitive controller has a lower frequency response than all other causal controllers at all frequencies except near ๐œ” = 0 and ๐œ” = 2๐œ‹, where its frequency response is highest. Intuitively, this is because the competitive controller is constrained to maintain a frequency response within a factor of 4.58 of the frequency response of the optimal noncausal controller, so it has more leeway at frequencies near ๐œ” = 0 and๐œ” = 2๐œ‹, where the optimal noncausal controller also has a high frequency response.

The energy-optimal controller minimizes sup

๐œ”โˆˆ[0,2๐œ‹]

ยฏ ๐œŽ

๐‘‡โˆ—

๐พ(๐‘’๐‘–๐œ”)๐‘‡๐พ(๐‘’๐‘–๐œ”) โˆ’๐‘‡โˆ—

๐พ0(๐‘’๐‘–๐œ”)๐‘‡๐พ

0(๐‘’๐‘–๐œ”) .

In other words, the energy-optimal controller minimizes the gap between its own frequency response and the frequency response of the optimal noncausal controller.

Consulting Figure 6.1, we see that every other causal controller indeed has a larger peak difference in frequency response relative to the optimal noncausal controller.

Time Domain Analysis

We next benchmark our regret-optimal controllers in the double integrator system across a wide variety of disturbances. We first consider an i.i.d. standard Gaussian disturbance. In Figure 6.2 we see that the competitive controller closely tracks the performance of the ๐ป

2controller, while the pathlength-optimal and ๐ปโˆž controller incur almost identical cost. Intuitively, an i.i.d. Gaussian disturbance has no correlation across timesteps and hence is expected to have high pathlength, so it is unsurprising that the pathlength-optimal controller performs poorly. We see that the energy-optimal controller incurs cost which is roughly halfway between that of the

๐ป2and๐ปโˆžcontrollers. In Figure 6.3 we plot the costs of only the competitive and ๐ป2 controllers to better illustrate how closely the competitive controller is able to approximate the performance of the๐ป

2controller.

150 300 450 600

Time (sec) 0

0.5 1 1.5

2 2.5

Time-Averaged Cost

Figure 6.2: Relative performance of causal controllers in the double integrator system driven by an i.i.d. Gaussian disturbance.

We next consider sinusoidal disturbances of frequency๐œ” =0, i.e., constant distur- bances. The frequency responses shown in Figure 6.1 predict that the competitive controller will perform the worst, while the pathlength-optimal and ๐ปโˆžcontrollers will match the performance of the optimal noncausal controller; this prediction is confirmed in Figure 6.4. As expected, the pathlength-optimal controller incurs zero regret, since the pathlength of the disturbance is zero. We note that while the competitive controller incurs the most cost, its cost is still less than 4.58 times the optimal noncausal cost, which is consistent with its competitive ratio bound.

We also note that the time-averaged cost of the๐ปโˆž controller converges to 1.00 in steady-state, which is consistent with the fact that its๐ปโˆž gain is 1.00.

We next consider a sinusoidal disturbance of frequency๐œ” =๐œ‹, i.e., the disturbances alternate between+1 andโˆ’1 in each timestep. Due to the very large variation in the costs incurred by the causal controllers, we plot the costs on a log scale in Figure 6.5.

The frequency responses shown in Figure 6.1 predict that the competitive controller will perform the best out of all causal controllers; this prediction is confirmed

Dalam dokumen Regret-Optimal Control (Halaman 58-65)