Chapter IV: Regret-Optimal Measurement-Feedback Control
4.3 Regret bounded by the pathlength of ๐ค and the energy of ๐ฃ
Suppose ๐
2 is chosen to be this solution, and define ๐พ
2 = ๐พ
2(๐
2), ฮฃ2 = ฮฃ2(๐
2). We immediately obtain the factorization (4.8), where we define
ฮ2(๐ง) = ฮฃ1/2
2 (๐ผ +๐พ
2(๐ง ๐ผ โ ๐ดe)โ1๐ต๐ค). (4.10) We have
ฮโ1
2 (๐ง) = (๐ผโ๐พ
2(๐ง ๐ผ โ (๐ดeโ ๐ต๐ค๐พ
2))โ1๐ต๐ค)ฮฃโ1/2
2 . We note that ๐ดeโ๐ต๐ค๐พ
2 is stable and henceฮโ1(๐ง) is causal and bounded since its poles are strictly contained in the unit circle. Define
๏ฃฑ๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด
๏ฃฒ
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃณ ๐ดb=
๏ฃฎ
๏ฃฏ
๏ฃฏ
๏ฃฏ
๏ฃฏ
๏ฃฐ
๐ด โ๐ต๐ค๐พ
2
0 ๐ดeโ๐ต๐ค๐พ
2
๏ฃน
๏ฃบ
๏ฃบ
๏ฃบ
๏ฃบ
๏ฃป ,
๐ตb๐ค =
๏ฃฎ
๏ฃฏ
๏ฃฏ
๏ฃฏ
๏ฃฏ
๏ฃฐ
๐ต๐คฮฃโ1/2
2
๐ต๐คฮฃโ1/2
2
๏ฃน
๏ฃบ
๏ฃบ
๏ฃบ
๏ฃบ
๏ฃป ,
๐ตb๐ข =
๏ฃฎ
๏ฃฏ
๏ฃฏ
๏ฃฏ
๏ฃฏ
๏ฃฐ ๐ต๐ข
0
๏ฃน
๏ฃบ
๏ฃบ
๏ฃบ
๏ฃบ
๏ฃป ,
๐ถb= h
๐พ๐ถ 0 i
,
b๐ฟ = h
๐ฟ 0 i
.
(4.11)
Recall that ๐นb= ๐น ,๐บb = ๐บฮโ1
2
,๐ปb = ๐พ ๐ป ,๐ฝb= ๐พ ๐ฝฮโ1
2 ; it follows that ๐น ,b ๐บ ,b ๐ป ,b ๐ฝbare given by
๐นb(๐ง) =b๐ฟ(๐ง ๐ผ โ ๐ดb)โ1๐ตb๐ข, ๐บb(๐ง)=b๐ฟ(๐ง ๐ผโ ๐ดb)๐ตb๐ค, ๐ปb(๐ง) =๐ถb(๐ง ๐ผ โ ๐ดb)โ1๐ตb๐ข, ๐ฝb(๐ง) =๐ถb(๐ง ๐ผ โ b๐ด)๐ตb๐ค.
โก
of the ๐ค is zero and the energy of๐ฃ is zero are pairs where๐ค is constant and ๐ฃ is zero. It is easy to guarantee that๐matches the offline controller๐
0on these specific instances, without constraining the behavior of๐on all other instances.
Analogously with the full-information setting, we derive the measurement-feedback controller whose regret has optimal dependence on the pathlength of the driving disturbance and the energy of the measurement disturbance via a reduction to ๐ปโ measurement-feedback control.
Theorem 14. Fix ๐พ > 0 and define ๐ด,b ๐ตb๐ข,๐ตb๐ค,๐ถ ,b b๐ฟ as in (4.19). There exists a causal measurement-feedback controller๐such that
๐ฝ(๐, ๐ค , ๐ฃ) โ๐ฝ(๐
0, ๐ค) < ๐พ2( โฅ๐ท๐๐คโฅ2
2+ โฅ๐ฃโฅ2
2) (4.12)
if and only if the control DARE
๐๐ = ๐ดbโ๐๐๐ดb+b๐ฟโb๐ฟโ๐พโ
๐๐ ๐๐พ๐, and the estimation DARE
๐๐ = ๐ด๐b ๐๐ดbโ+๐ตb๐ค๐ตbโ
๐คโ๐พ๐๐ ๐๐พโ
๐, where we define
๏ฃฑ๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด
๏ฃฒ
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃณ
๐พ๐= ๐ โ1 ๐
๏ฃฎ
๏ฃฏ
๏ฃฏ
๏ฃฏ
๏ฃฏ
๏ฃฐ ๐ตbโ
๐ข
๐ตbโ
๐ค
๏ฃน
๏ฃบ
๏ฃบ
๏ฃบ
๏ฃบ
๏ฃป ๐๐๐ดb
๐ ๐ =
๏ฃฎ
๏ฃฏ
๏ฃฏ
๏ฃฏ
๏ฃฏ
๏ฃฐ
๐ผ๐ 0 0 โ๐ผ๐
๏ฃน
๏ฃบ
๏ฃบ
๏ฃบ
๏ฃบ
๏ฃป +
๏ฃฎ
๏ฃฏ
๏ฃฏ
๏ฃฏ
๏ฃฏ
๏ฃฐ ๐ตbโ
๐ข
๐ตbโ
๐ค
๏ฃน
๏ฃบ
๏ฃบ
๏ฃบ
๏ฃบ
๏ฃป ๐๐
h
๐ตb๐ข ๐ตb๐ค i
,
๐พ๐ = b๐ด๐๐ h
๐ถb ๐ฟ i
๐ โ1
๐ ,
๐ ๐ =
๏ฃฎ
๏ฃฏ
๏ฃฏ
๏ฃฏ
๏ฃฏ
๏ฃฐ ๐ผ๐ 0
0 โ๐ผ๐
๏ฃน
๏ฃบ
๏ฃบ
๏ฃบ
๏ฃบ
๏ฃป +
๏ฃฎ
๏ฃฏ
๏ฃฏ
๏ฃฏ
๏ฃฏ
๏ฃฐ ๐ถb b๐ฟ
๏ฃน
๏ฃบ
๏ฃบ
๏ฃบ
๏ฃบ
๏ฃป ๐๐
h
๐ถbโ b๐ฟโ i
,
have solutions๐๐ โชฐ 0and๐๐ โชฐ0such that 1. The matrix ๐ดbโ h
๐ตb๐ข ๐ตb๐ค i
๐พ๐ is stable.
2. The matrix๐ ๐has๐positive eigenvalues and๐negative eigenvalues.
3. The matrix ๐ดbโ๐พ๐
"
๐ถb b๐ฟ
#
is stable.
4. The matrix๐ ๐has๐ positive eigenvalues and๐negative eigenvalues.
5. ๐(๐๐๐๐) < 1.
If these conditions are satisfied, then one possible choice of๐is given by ๐ข๐ก =โ๐พ๐ข(๐๐ก+๐๐ถbโ(๐ผ๐ +๐ถ ๐b ๐ถbโ)โ1(๐ฆ๐กโ๐ถ ๐b ๐ก)),
where the synthetic state๐ โR2๐+๐evolves according to the linear dynamics equa- tion
๐๐ก+
1 =(๐ดbโ ๐ตb๐ค๐พ๐ค) (๐๐ก+๐๐ถbโ(๐ผ๐ +๐ถ ๐b ๐ถbโ)โ1(๐ฆ๐กโ๐ถ ๐b ๐ก)) +๐ตb๐ข๐ข๐ก. The matrices๐พ๐ข โR๐ร(2๐+๐) and๐พ๐ค โR๐ร(2๐+๐) are defined as
"
๐พ๐ข ๐พ๐ค
#
=๐พ๐,
and we define๐ โR(2๐+๐)ร(2๐+๐) as
๐ =๐๐(๐ผ๐โ๐๐๐๐)โ1.
We note that the regret-optimal controller can be easily obtained via bisection on๐พ. Proof. The regret condition (14) can be rewritten in matrix form as
"
๐ค ๐ฃ
#โ
๐โ
๐พ๐๐พ
"
๐ค ๐ฃ
#
<
"
๐ค ๐ฃ
#โ
๐พ2
"
๐ทโ
๐๐ท๐ 0
0 ๐ผ
# +๐โ
๐พ0
๐๐พ
0
! "
๐ค ๐ฃ
#
. (4.13)
Letฮ2be the unique causal and causally invertible operator such that ๐พ2๐ทโ
๐๐ท๐+๐บโ(๐ผ +๐น ๐นโ)โ1๐บ = ฮโ
2ฮ2. (4.14)
Then
๐พ2
"
๐ทโ
๐๐ท๐ 0
0 ๐ผ
# +๐โ
๐พ0
๐๐พ
0 =
"
ฮ2 0 0 ๐พ ๐ผ
#โ"
ฮ2 0 0 ๐พ ๐ผ
# .
Define "
๐คb b๐ฃ
#
=
"
ฮ2 0 0 ๐พ ๐ผ
# "
๐ค ๐ฃ
# .
Condition (4.13) can be rewritten as
"
๐คb b๐ฃ
#โ
๐โ
๐พb
๐
๐พb
"
๐คb b๐ฃ
#
<
"
๐คb b๐ฃ
#
2
2
, or equivalently as
โฅ๐
๐พbโฅ < 1, where we define
๐
๐พb=๐๐พ
"
ฮ2 0 0 ๐พ ๐ผ
#โ1
. Using the parameterization (4.1) of๐๐พ, we see that
๐
๐พb=
"
๐บฮโ1
2 0
0 0
# +
"
๐น ๐ผ
#
(๐พโ1๐)h ๐พ ๐ฝฮโ1
2 ๐ผ
i . Notice that๐
๐พbitself has the form of a transfer operator described in (4.1); it is the transfer operator with Youla parameter๐b=๐พโ1๐in the system
b๐ =๐นb b๐ข+๐บb
๐ค ,b
b๐ฆ=๐ปb b๐ข+๐ฝb
๐คb+b๐ฃ , (4.15)
where we define ๐นb = ๐น ,๐บb = ๐บฮโ1
2 ,๐ปb = ๐พ ๐ป ,๐ฝb= ๐พ ๐ฝฮโ1
2 . It is now clear that a controller๐พ satisfying (4.13) exists if and only if there exists a controller ๐พbin the system (4.15) such that โฅ๐
๐พbโฅ < 1. If such a controller๐พbexists, then we can easily recover ๐พ from ๐พbby setting ๐พ = ๐พ๐พb; notice that this is the unique choice of ๐พ which is consistent with the relations๐ปb=๐พ ๐ป ,๐b=๐พโ1๐.
In order to assign state-space structure to ๐น ,b ๐บ ,b ๐ป ,b ๐ฝb, we must first findฮ2(๐ง). Let ฮbe the unique causal and causally invertible operator such that
๐ผ+๐น ๐นโ = ฮฮโ. Theorem 2 shows that
ฮ(๐ง) = (๐ผ+๐ฟ(๐ง ๐ผ โ ๐ด)โ1๐พ)ฮฃ1/2. It follows thatฮโ1(๐ง)๐บ(๐ง)is given by
ฮโ1(๐ง)๐บ(๐ง) = ฮฃโ1/2๐ฟ(๐ง ๐ผ โ (๐ดโ๐พ ๐ฟ))โ1๐ต๐ค. Define
e๐ฟ =
"
ฮฃโ1/2๐ฟ 0 0 ๐พ ๐ผ๐
# , ๐e=
"
0 ๐พ2๐ผ๐
#
, ๐ดe=
"
๐ดโ๐พ ๐ฟ 0
0 0
#
, ๐ตe๐ค =
"
๐ต๐ค
โ๐ผ๐
#
. (4.16)
Notice that we can rewrite
"
๐พ2๐ทโ
๐(๐งโโ)๐ท๐(๐ง) +๐บโฮโโฮโ1๐บ 0
0 ๐พ2๐ผ
#
as
h ๐ตeโ
๐ค(๐งโโ๐ผโ ๐ดe)โโ ๐ผ i
"
e๐ฟโe๐ฟ e๐ e๐โ ๐พ2๐ผ
# "
(๐ง ๐ผ โ๐ดe)โ1๐ตe๐ค ๐ผ
# . Applying Lemma 1, we see that this equals
h ๐ตeโ
๐ค(๐งโโ๐ผโ ๐ดe)โโ ๐ผ i
ฮ2(๐
2)
"
(๐ง ๐ผโ ๐ดe)โ1๐ตe๐ค ๐ผ
# , where๐
2is an arbitrary Hermitian matrix and we define ฮ2(๐
2) =
"
e๐ฟโe๐ฟโ๐
2+๐ดeโ๐
2๐ดe ๐e+๐ดeโ๐
2๐ตe๐ค e๐โ+๐ตeโ
๐ค๐
2๐ดe ๐พ2๐ผ+๐ตeโ
๐ค๐
2๐ตe๐ค
# . Notice that theฮ2(๐
2) can be factored as
"
๐ผ ๐พโ
2(๐
2)
0 ๐ผ
# "
ฮ2(๐
2) 0
0 ฮฃ2(๐
2)
# "
๐ผ 0
๐พ2(๐
2) ๐ผ
# , where we define
ฮ2(๐
2)=e๐ฟโe๐ฟโ๐
2+๐ดeโ๐
2e๐ดโ๐พโ
2(๐
2)ฮฃ2๐พ
2(๐
2), ๐พ2(๐
2) = ฮฃโ1
2 (๐
2) (e๐โ+๐ตeโ
๐ค๐
2e๐ด), ฮฃ2(๐
2) =๐พ2๐ผ+๐ตeโ
๐ค๐
2๐ตe๐ค. (4.17) It is clear that (๐ด,e ๐ตe๐ค) is stabilizable (this follows from the stability of ๐ดโ ๐พ ๐ฟ), therefore the Riccati equationฮ2(๐
2) =0 has a unique stabilizing solution (Theorem E.6.2 in [KSH00]). Suppose ๐
2 is chosen to be this solution, and define ๐พ
2 = ๐พ2(๐
2), ฮฃ2 = ฮฃ2(๐
2). We immediately obtain the factorization (4.14), where we define
ฮ2(๐ง) = ฮฃ1/2
2 (๐ผ +๐พ
2(๐ง ๐ผ โ ๐ดe)โ1๐ตe๐ค). (4.18) We have
ฮโ1
2 (๐ง) = (๐ผโ๐พ
2(๐ง ๐ผ โ (๐ดeโ ๐ตe๐ค๐พ
2))โ1๐ตe๐ค)ฮฃโ1/2
2 .
We note that e๐ดโ๐ตe๐ค๐พ
2 is stable and henceฮโ1
2 (๐ง) is causal and bounded since its poles are strictly contained in the unit circle. Define
๏ฃฑ๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด
๏ฃฒ
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃด๏ฃด
๏ฃณ ๐ดb=
๏ฃฎ
๏ฃฏ
๏ฃฏ
๏ฃฏ
๏ฃฏ
๏ฃฐ
๐ด โ๐ต๐ค๐พ
2
0 ๐ดeโ๐ตe๐ค๐พ
2
๏ฃน
๏ฃบ
๏ฃบ
๏ฃบ
๏ฃบ
๏ฃป
๐ตb๐ข =
๏ฃฎ
๏ฃฏ
๏ฃฏ
๏ฃฏ
๏ฃฏ
๏ฃฐ ๐ต๐ข
0
๏ฃน
๏ฃบ
๏ฃบ
๏ฃบ
๏ฃบ
๏ฃป ,
๐ตb๐ค =
๏ฃฎ
๏ฃฏ
๏ฃฏ
๏ฃฏ
๏ฃฏ
๏ฃฐ
๐ต๐คฮฃโ1/2
2
๐ตe๐คฮฃโ1/2
2
๏ฃน
๏ฃบ
๏ฃบ
๏ฃบ
๏ฃบ
๏ฃป ,
๐ถb= h
๐พ๐ถ 0 i
,
b๐ฟ = h ๐ฟ 0
i .
(4.19)
Recall that ๐นb= ๐น ,๐บb = ๐บฮโ1
2
,๐ปb = ๐พ ๐ป ,๐ฝb= ๐พ ๐ฝฮโ1
2 ; it follows that ๐น ,b ๐บ ,b ๐ป ,b ๐ฝbare given by
๐นb(๐ง) =b๐ฟ(๐ง ๐ผ โ ๐ดb)โ1๐ตb๐ข, ๐บb(๐ง)=b๐ฟ(๐ง ๐ผโ ๐ดb)๐ตb๐ค, ๐ปb(๐ง) =๐ถb(๐ง ๐ผ โ ๐ดb)โ1๐ตb๐ข, ๐ฝb(๐ง) =๐ถb(๐ง ๐ผ โ b๐ด)๐ตb๐ค.
โก
C h a p t e r 5
CONNECTIONS TO ONLINE LEARNING
In this section we demonstrate a surprising connection between regret-optimal con- trol and online learning. Specifically, we show that the competitive controller is closely approximated by disturbance-action-control (DAC) policies, which general- ize linear state-feedback policies. A DAC policy has the form
๐ข๐ก =๐พ ๐ฅ๐ก+
๐ป
โ๏ธ
๐=1
๐[๐โ1]๐ค๐กโ
1,
where๐พ is a stabilizing state-feedback controller and๐ = (๐[0]
, . . . , ๐[๐ปโ1])is a series of geometrically decreasing weights. The class of DAC policies has attracted much recent attention in the online learning community [Aga+19; FS20; Sim20;
CH21] due to the fact that that for any fixed๐พ, the states generated by a DAC policy have a linear dependence on the weights ๐; it follows that the problem of selecting the weights to minimize cost can be formulated as a convex program. In particular, [Aga+19] showed that the weights can be optimized online via a gradient-based algorithm using a reduction to online convex optimization with memory. This problem, introduced in [AHM15], is a generalization of OCO where the costs incurred by the learner depend not only on the learnerโs most recent decision but rather on the learnerโs past๐ป decisions.
Using this approximation of the competitive controller by DAC policies, we show that online algorithms with sublinear policy regret against DAC policies can be in- stantiated so as to retain sublinear policy regret while also achieving an approximate competitive ratio guarantee on bounded disturbance sequences. The proof can be summarized as follows. We first construct a DAC policy ห๐๐ which๐-approximates the strictly causal competitive policy๐๐, in the sense that
๐ฝ๐(๐ห๐, ๐ค) โ๐ฝ๐(๐๐, ๐ค) โค๐
for all bounded disturbances๐ค. Our construction of ห๐๐ involves rewriting the com- petitive controller as the sum of a stabilizing component๐พb
0and a linear combination of all previous disturbances; the weights on the most recent๐ปdisturbances are used to instantiate ห๐๐. We then observe that an algorithm which obtains sublinear pol- icy regret against a DAC class containing ห๐๐ will automatically attain the optimal
competitive ratio of the strictly causal competitive controller ๐๐, up to a sublinear correction term. Surprisingly, these results can be extended even to the โadaptive control" setting, where the online algorithm does not know the system dynamics ahead of time and must perform online system identification.
Our results shows that combining regret-optimal controllers, which are derived using๐ปโ control, can be fruitfully combined with gradient-based algorithms from online learning to give new performance guarantees in control. While the results we present here are stated in terms of the competitive controller, we expect that analogous results can be derived for other flavors of regret-optimal control using similar techniques.
5.1 Approximation of the competitive controller by DAC policies
We prove that we can find a DAC policy which generates a sequence of states and control actions which closely tracks the sequence of states and control actions generated by the optimal competitive policy by taking the history๐ปof the DAC to be sufficiently large, and choosing the stabilizing component and weights appropriately.
Before we state our approximation result, we introduce some convenient notation.
We introduce the partition
๐พb= h
๐พb
0 ๐พb
1
i
of๐พbinto two๐ร๐block matrices, where๐พbis the state-feedback matrix appearing in the strictly causal controller we obtained in Theorem 6. Recall that ๐ดb+๐ตb๐พbis stable; this matrix is block upper-triangular, and the matrix in the (1, 1) block is ๐ด+๐ต๐พb
0. It follows that๐ด+๐ต๐พb
0is also stable, and hence there exist matrices๐
1, ๐ฟ
1
such
๐ด+๐ต๐พb
0=๐
1๐ฟ
1๐โ1
1
andโฅ๐ฟ
1โฅ < 1. Similarly, there exist matrices๐
2, ๐ฟ
2such that๐ดโ๐พ ๐1/2 =๐
2๐ฟ
2๐โ1 2
andโฅ๐ฟ
2โฅ <1.
We now show that the competitive controller can be arbitrarily well-approximated by DAC policies:
Theorem 15. Fix a time horizon๐ and a disturbance bound๐. Define ๐ =max(1,๐พb
0,๐พb
1,โฅ๐ต๐ขโฅ,โฅ๐ต๐คโฅ,โฅ๐
1โฅ โฅ๐โ1
1 โฅ,โฅ๐
2โฅ โฅ๐โ1
2 โฅ) and
๐ฟ =1โmax(1/2,โฅ๐ฟ
1โฅ,โฅ๐ฟ
2โฅ).
For any๐ > 0, set
๐ป = 1
log(1โ๐ฟ/2) log
988๐2๐ 12max(1,โฅ๐โฅ2) ๐ฟ4๐
๐
and defineM be the set of (๐ป , ๐ 2(1+ โฅ๐1/2โฅ), ๐ฟ/2)-DAC policies with stabilizing component ๐พb
0. Let ๐๐ be the strictly causal competitive controller described in Theorem 6. There exists a DAC policy๐ห๐ โ M such that
๐ฝ๐(๐ห๐, ๐ค) โ๐ฝ๐(๐๐, ๐ค) โค๐ for all disturbance sequences๐ค= (๐ค
0, . . . , ๐ค๐) such thatโฅ๐คโฅโ โค๐. Proof. Unrolling the recursive definition of๐๐กin Theorem 6, we see that
๐๐ก =
๐ก
โ๏ธ
๐=1
(๐ดโ๐พ ๐1/2)๐โ1๐ต๐ค๐ค๐กโ๐.
Let{๐ฅ๐ก}๐
๐ก=0be the state sequence generated by the strictly causal competitive control policy ๐๐ in the original ๐-dimensional system and let {๐๐ก}๐
๐ก=0 is the sequence of states generated by ๐๐ in the 2๐-dimensional synthetic system. Recall that ๐คb๐ก = ฮฃโ1/2๐1/2๐๐ก and๐ข๐ก = ๐พ ๐ห ๐ก, so Theorem 6 implies that the first๐ states of ๐๐ก are precisely๐ฅ๐กโ ๐๐ก. We decompose ๐ฅ๐ก as a linear combination of the disturbance terms:
๐ฅ๐ก+
1= ๐ด๐ฅ๐ก+๐ต๐ข๐พ ๐b ๐ก+๐ต๐ค๐ค๐ก
= ๐ด๐ฅ๐ก+๐ต๐ข h
๐พb
0 ๐พb
1
i
"
๐ฅ๐กโ๐๐ก ฮฃโ1/2๐1/2๐๐ก
#
+๐ต๐ค๐ค๐ก
= (๐ด+๐ต๐ข๐พb
0)๐ฅ๐ก+๐ต๐ข(๐พb
1ฮฃโ1/2๐1/2โ๐พb
0)๐๐ก+๐ต๐ค๐ค๐ก
=
๐ก
โ๏ธ
๐=0
(๐ด+๐ต๐ข๐พb
0)๐
๐ต๐ค๐ค๐กโ๐+๐ต๐ข(๐พb
1ฮฃโ1/2๐1/2โ๐พb
0)๐ฃ๐กโ๐
=
๐ก
โ๏ธ
๐=0
(๐ด+๐ต๐ข๐พb
0)๐ยฉ
ยญ
ยซ
๐ต๐ค๐ค๐กโ๐+๐ต๐ข
๐กโ๐
โ๏ธ
๐=1
๐[๐โ1]๐ต๐ค๐ค๐กโ๐โ๐ยช
ยฎ
ยฌ
=
๐ก
โ๏ธ
๐=0
ยฉ
ยญ
ยซ
(๐ด+๐ต๐ข๐พb
0)๐+
๐
โ๏ธ
๐=1
(๐ด+๐ต๐ข๐พb
0)๐โ๐๐ต๐ข๐[๐โ1]ยช
ยฎ
ยฌ
๐ต๐ค๐ค๐กโ๐,
where we define the weights ๐[๐โ1] = (๐พb
1ฮฃโ1/2๐1/2โ๐พb
0) (๐ดโ๐พ ๐1/2)๐โ1
for all ๐ โฅ 1. We now describe a DAC policy ห๐๐ which approximates ๐๐; recall that every DAC policy is parameterized by a stabilizing controller and a set of ๐ป weights. We take the stabilizing controller to be ๐พb
0 and the set of weights to be ๐ = (๐[0], . . . , ๐[๐ปโ1]). The control action selected by ห๐๐ in each timestep is therefore
ห ๐ข๐ก =๐พb
0๐ฅห๐ก +
๐ป
โ๏ธ
๐=1
๐[๐โ1]๐ต๐ค๐ค๐กโ๐, (5.1) where{๐ฅห๐ก}๐
๐ก=0is the state sequence generated by ห๐๐. We now show that the weights {๐[๐โ1]}โ
๐=1 decay geometrically in time. Recall that ๐ด+ ๐ต๐พb
0 = ๐
1๐ฟ
1๐โ1
1 , where โฅ๐
1โฅ โฅ๐โ1
1 โฅ โค ๐ and โฅ๐ฟ
1โฅ โค 1โ ๐ฟ. Similarly, ๐ดโ๐พ ๐1/2=๐
2๐ฟ
2๐โ1
2 , where โฅ๐
2โฅ โฅ๐โ1
2 โฅ โค ๐ andโฅ๐ฟ
2โฅ โค 1โ๐ฟ. It follows that max
โฅ (๐ด+๐ต๐พb
0)๐โ1โฅ,(๐ดโ๐พ ๐1/2)๐โ1) โฅ
โค ๐ (1โ๐ฟ)๐โ1. (5.2) Recall thatฮฃ = ๐ผ +๐1/2๐๐1/2โชฐ ๐ผ, therefore โฅฮฃโ1/2โฅ โค1. It follows that
โฅ๐พb
1ฮฃโ1/2๐1/2โ๐พb
0โฅ โค โฅ๐พb
1ฮฃโ1/2๐1/2โฅ + โฅ๐พb
0โฅ
โค ๐ (1+ โฅ๐1/2โฅ) (5.3) Using (5.2) and (5.3) together, we immediately obtain
โฅ๐[๐โ1]โฅ โค ๐ 2(1+ โฅ๐1/2โฅ) (1โ๐ฟ)๐โ1. (5.4)
We now bound the distance between๐ฅ๐ก and ห๐ฅ๐ก . Plugging our choice of ห๐ข๐ก given in (5.1) into the dynamics
๐ฅ๐ก+
1 = ๐ด๐ฅ๐ก+๐ต๐ข๐ข๐ก+๐ต๐ค๐ค๐ก, we see that
ห ๐ฅ๐ก+
1= ๐ด๐ฅห๐ก +๐ต๐ข ๐พb
0๐ฅห๐ก+
๐ป
โ๏ธ
๐=1
๐[๐โ1]๐ต๐ค๐ค๐กโ๐
!
+๐ต๐ค๐ค๐ก
=
๐ก
โ๏ธ
๐=0
(๐ด+๐ต๐ข๐พb
0)๐ยฉ
ยญ
ยซ
๐ต๐ค๐ค๐กโ๐+๐ต๐ข
๐ป
โ๏ธ
๐=1
๐[๐โ1]๐ต๐ค๐ค๐กโ๐โ๐ยช
ยฎ
ยฌ
=
๐ก
โ๏ธ
๐=0
ยฉ
ยญ
ยซ
(๐ด+๐ต๐ข๐พb
0)๐+
min{๐ป ,๐}
โ๏ธ
๐=1
(๐ด+๐ต๐ข๐พb
0)๐โ๐๐ต๐ข๐[๐โ1]ยช
ยฎ
ยฌ
๐ต๐ค๐ค๐กโ๐.
Applying (5.2) and (5.4) we obtain
โฅ๐ฅ๐ก+
1โ๐ฅห๐ก+
1โฅ โค
๐ก
โ๏ธ
๐=๐ป+1 ๐
โ๏ธ
๐=๐ป+1
โฅ (๐ด+๐ต๐ข๐พb
0)๐โ๐โฅ โฅ๐ต๐ขโฅ โฅ๐[๐โ1]โฅ โฅ๐ต๐คโฅ๐
โค๐ ๐ 5(1+ โฅ๐1/2โฅ)
๐ก
โ๏ธ
๐=๐ป+1 ๐
โ๏ธ
๐=๐ป+1
(1โ๐ฟ)๐โ1
โค๐ ๐ 5(1+ โฅ๐1/2โฅ) โ๏ธ
๐โฅ๐ป+1
๐(1โ๐ฟ)๐โ1
โค 4๐ ๐ 5(1+ โฅ๐1/2โฅ) ๐ฟ
โ๏ธ
๐โฅ๐ป+1
(1โ๐ฟ/2)๐,
โค 8๐ ๐ 5(1+ โฅ๐1/2โฅ)
๐ฟ2 (1โ๐ฟ/2)๐ป (5.5)
where in the penultimate line we applied Lemma 4 in Appendix A .
We now bound the distance between๐ข๐ก and ห๐ข๐ก. Unrolling the dynamics, we see that ๐ข๐ก =๐พ ๐b ๐ก
=๐พb
0๐ฅ๐ก+ (๐พb
1ฮฃโ1/2๐1/2โ๐พb
0)๐๐ก
=๐พb
0๐ฅ๐ก+
๐กโ1
โ๏ธ
๐=1
(๐พb
1ฮฃโ1/2๐1/2โ๐พb
0) (๐ดโ๐พ ๐1/2)๐โ1๐ค๐กโ๐
=๐พb
0๐ฅ๐ก+
๐กโ1
โ๏ธ
๐=1
๐[๐โ1]๐ค๐กโ๐.
The definition of ห๐ข๐ก given in (5.1) and the bounds (5.4) and (5.5) imply that
โฅ๐ข๐กโ๐ขห๐กโฅ โค โฅ๐พb
0โฅ โฅ๐ฅ๐กโ๐ฅห๐กโฅ +
๐กโ1
โ๏ธ
๐=๐ป+1
โฅ๐[๐โ1]โฅ๐ค๐กโ๐
โค 9๐ ๐ 6(1+ โฅ๐1/2โฅ)
๐ฟ2 (1โ๐ฟ/2)๐ป, where we used the fact that๐ฟ <1 and ๐ โฅ max(1,โฅ๐พb
0โฅ).
We now show that by taking๐ปto be sufficiently large we can ensure that ๐ฝ๐(๐ห๐, ๐ค) โ๐ฝ๐(๐๐, ๐ค) โค ๐ .
Lemma 5 shows that
max( โฅ๐ฅ๐กโฅ,โฅ๐ข๐กโฅ) โค 3๐ ๐ 5(1+ โฅ๐1/2โฅ) ๐ฟ2
.
We put the pieces together to bound the difference in costs incurred by the compet- itive policy and our DAC approximation in each timestep. Using the easily-verified inequality โฅ๐ฅโฅ2 โ โฅ๐ฆโฅ2 โค ( โฅ๐ฅโ๐ฆโฅ) ( โฅ2๐ฅโฅ + โฅ๐ฅโ๐ฆโฅ) and the fact that ๐ โฅ 1, 1โ๐ฟ/2< 1, we see that
โฅ๐1/2๐ฅ๐กโฅ2โ โฅ๐1/2๐ฅห๐กโฅ2 โค โฅ๐โฅ โฅ๐ฅ๐กโ๐ฅห๐กโฅ (2โฅ๐ฅ๐กโฅ + โฅ๐ฅ๐ก โ๐ฅห๐กโฅ)
โค 112๐2๐ 10(1+ โฅ๐1/2โฅ)2โฅ๐โฅ
๐ฟ4 (1โ๐ฟ/2)๐ป Similarly,
โฅ๐ข๐กโฅ2โ โฅ๐ขห๐กโฅ2 โค โฅ๐ข๐กโ๐ขห๐กโฅ (2โฅ๐ข๐กโฅ + โฅ๐ข๐กโ๐ขห๐กโฅ)
โค 135๐2๐ 12(1+ โฅ๐1/2โฅ)2
๐ฟ4 (1โ๐ฟ/2)๐ป. We observe that
(1+ โฅ๐1/2|)2max(1,โฅ๐โฅ) โค 4 max(1,โฅ๐โฅ2). The difference in aggregate cost is therefore
๐ฝ๐(๐ห๐, ๐ค) โ๐ฝ๐(๐๐, ๐ค) =
๐
โ๏ธ
๐ก=1
(๐ฅโค
๐ก ๐ ๐ฅ๐ก + โฅ๐ข๐กโฅ2โ (๐ฅหโค
๐ก ๐๐ฅห๐ก+ โฅ๐ขห๐กโฅ2)
โค 988๐2๐ 12max(1,โฅ๐โฅ2)
๐ฟ4 (1โ๐ฟ/2)๐ป๐ . Taking
๐ป โฅ 1
log(1โ๐ฟ/2) log
988๐2๐ 12max(1,โฅ๐โฅ2) ๐ฟ4๐
๐
is thus sufficient to guarantee that
๐ฝ๐(๐ห๐, ๐ค) โ๐ฝ๐(๐๐, ๐ค) โค ๐ .
โก
We now show that this result implies best-of-both-worlds.
5.2 Best of both worlds: sublinear policy regret implies approximate compet- itive ratio
We now show that any online learning algorithmAwhich minimizes regret relative to the class M of DAC policies described in Theorem 15 automatically achieves the best of both worlds: sublinear regret against the best DAC policy selected in hindsight fromM, and optimal competitive ratio, up to a sublinear regret term:
Theorem 16. Fix ๐, ๐ > 0 and the system parameters (๐ด, ๐ต๐ข, ๐ต๐ค, ๐), and let M be defined as in Theorem 15. LetAbe any algorithm such that
๐ฝ๐(A, ๐ค) โmin
๐โM
๐ฝ๐(๐, ๐ค) โค ๐ (๐)
for all disturbances ๐ค such that โฅ๐คโฅโ โค ๐, where ๐ (๐) is a bound on the regret incurred by A over the time horizon๐. Then A also satisfies the โapproximate competitive ratioโ condition
๐ฝ๐(A, ๐ค) < ๐พ2ยท ๐ฝ๐(๐
0, ๐ค) +๐ (๐) +๐(1),
where๐พ2is the optimal competitive ratio attained by the strictly causal competitive controller.
Before turning to the proof, we note that the approximate competitive ratio guarantee described in Theorem 16 is substantially weaker than that offered by the standard competitive controller๐๐. Recall that๐๐offers the following guarantee:
sup
๐คโโ2
๐ฝ(๐๐, ๐ค) ๐ฝ(๐
0, ๐ค) < ๐พ2.
Note that this holds for all disturbances ๐ค, including those which grow over time.
As we showed in Theorem 7, this guarantee can also be obtained in time-varying systems. By contrast, Theorem 16 holds only for bounded disturbances and LTI systems. Furthermore, the competitive ratio bound described in Theorem 16 is not necessarily constant in the time horizon๐; it says that
sup
โฅ๐คโฅโ<๐
๐ฝ๐(A, ๐ค) ๐ฝ๐(๐
0, ๐ค) < ๐พ2+ sup
โฅ๐คโฅโ<๐
๐ (๐) +๐(1) ๐ฝ๐(๐
0, ๐ค) .
This bound on the competitive ratio can grow rapidly as๐ โ โ, provided that๐คis constructed such that๐ฝ๐(๐
0, ๐ค) โช๐ (๐). It is an open question whether there exist control algorithms with constant competitive ratio and sublinear policy regret.
We now present the proof of Theorem 16.
Proof. By Theorem 15, we know that there exists a DAC policy ห๐๐ โ M such that ๐ฝ๐(๐ห๐, ๐ค) โ๐ฝ๐(๐๐, ๐ค) โค ๐,
where๐๐ is the strictly causal competitive controller described in Theorem 6. The cost incurred byAcan be bounded as follows:
๐ฝ๐(A, ๐ค) โค min
๐โ โM
๐ฝ๐(๐โ , ๐ค) +๐ (๐ , ๐)
โค ๐ฝ๐(๐ห๐, ๐ค) +๐ (๐)
โค ๐ฝ๐(๐๐, ๐ค) +๐ (๐) +๐
โค ๐ฝ(๐๐, ๐ค) +๐ (๐) +๐
โค ๐พ2ยท๐ฝ(๐
0, ๐ค) +๐ (๐) +๐
โค ๐พ2ยท๐ฝ๐(๐
0, ๐ค) +๐ (๐) +๐(1),
where in the penultimate line we applied the competitive ratio bound of the infinite- horizon strictly causal competitive controller, and in the last line we used the fact that๐is constant with respect to๐ and used the following bound which relates the finite-horizon cost of the optimal noncausal controller to its infinite-horizon cost:
๐ฝ(๐
0, ๐ค) โค ๐ฝ๐(๐
0, ๐ค) +๐(1).
This bound can be proven as follows. The infinite-horizon cost incurred by ๐
0 on the disturbance sequence๐ค is the sum of the cost incurred by ๐
0up to time๐ and the cost incurred by๐
0from time๐ +1 to infinity. Recall that we define๐ค๐ก =0 for all๐ก โฅ ๐ +1. Using the characterization of๐
0 we obtained in Theorem 3, we see that๐
0 simply selects the ๐ป
2-optimal action starting at time๐ +1. The ๐ป
2policy is stabilizing and the state๐ฅ๐+
1generated by the optimal noncausal policy has norm which is๐(1), so the residual๐ฝ(๐
0, ๐ค) โ๐ฝ๐(๐
0, ๐ค) is also๐(1). โก By leveraging recently proposed online control algorithms with sublinear policy regret against DAC classes, we immediately obtain several concrete best-of-both- worlds results as corollaries of Theorem 16. We begin by considering the case when the learner knows the linear system (๐ด, ๐ต๐ข, ๐ต๐ค); in this case, we can apply the algorithm from [Sim20], which has๐(polylog(๐))regret against the best DAC policy selected in hindsight from M. This algorithm hence exhibit the following best-of-both-worlds behavior, where we suppress dependence on ๐ , ๐, and the system parameters and focus on how the cost incurred by the algorithm scales in the time horizon:
Corollary 1. Fix ๐, ๐ > 0 and the system parameters (๐ด, ๐ต๐ข, ๐ต๐ค), and let M be defined as in Theorem 15. There exists a computationally efficient online control al- gorithmAwhich, given the system parameters(๐ด, ๐ต๐ข, ๐ต๐ค), simultaneously achieves the following performance guarantees:
1. (Approximate competitive ratio) The cost ofAsatisfies ๐ฝ๐(A, ๐ค) < ๐พ2ยท๐ฝ๐(๐
0, ๐ค) +๐(polylog(๐))
for all disturbances๐ค such thatโฅ๐คโฅโ < ๐, where๐พ2is the optimal compet- itive ratio attained by the strictly causal competitive controller.
2. (Sublinear regret) The cost ofAsatisfies ๐ฝ๐(A, ๐ค) < min
๐โM
๐ฝ๐(๐, ๐ค) +๐(polylog(๐)) for all disturbances๐ค such thatโฅ๐คโฅโ < ๐.
We next consider the โadaptive controlโ setting, where the online control algorithm does not know the system dynamics. The algorithm proposed in [CH21] achieves ๐(โ
๐) regret even when the system dynamics are unknown; we note that this is exponentially worse than the ๐(polylog(๐)) regret which is attainable when the dynamics are known. The algorithm operates in two phases; in the first phase it performs system identification by exciting the system using the control inputs and then estimating the system parameters by observing the system response to the excitation. In the second phase, it uses the estimates obtained in the first phase to construct an๐ป
2controller in the estimated system; this๐ป
2controller is stabilizing in the true system provided that the estimate is sufficiently accurate. The๐(โ
๐)regret bound leads to the following corollary of Theorem 16, where we again suppress dependence on๐ , ๐, and the system parameters and focus on how the cost incurred by the algorithm scales in the time horizon:
Corollary 2. Fix ๐, ๐ > 0 and the system parameters (๐ด, ๐ต๐ข, ๐ต๐ค), and let M be defined as in Theorem 15. There exists a computationally efficient online control algorithmAwhich, without knowing the system parameters simultaneously achieves the following performance guarantees:
1. (Approximate competitive ratio) The cost ofAsatisfies ๐ฝ๐(A, ๐ค) < ๐พ2ยท๐ฝ๐(๐
0, ๐ค) +๐
โ ๐
for all disturbances๐ค such thatโฅ๐คโฅโ < ๐, where๐พ2is the optimal compet- itive ratio attained by the strictly causal competitive controller.
2. (Sublinear regret) The cost ofAsatisfies ๐ฝ๐(A, ๐ค) < min
๐โM
๐ฝ๐(๐, ๐ค) +๐
โ ๐
for all disturbances๐ค such thatโฅ๐คโฅโ < ๐.
We note that [Aga+19] showed that algorithms with low regret relative to a DAC policy classM automatically also achieve low regret relative to the class of stabi- lizing linear state-feedback policiesK; this implies that Corollaries 1 and 2 can also be translated into statements about regret relative toK. We refer to their paper for details.
C h a p t e r 6
NUMERICAL EXPERIMENTS
6.1 Double Integrator
The double integrator is a simple dynamical system that models the one-dimensional kinematics of a moving object. The states of the system are the objectโs position and velocity, which are represented by the variables๐ฅ โRand๐ฅยค โR, respectively. The continuous-time dynamics of the system are represented by the Newtonian equation
๐ ๐ ๐ก
"
๐ฅ(๐ก)
ยค ๐ฅ(๐ก)
#
=
"
ยค ๐ฅ(๐ก)
0
# +
"
0 ๐ข(๐ก)
# +
"
0 ๐ค(๐ก)
# ,
where๐ข(๐ก) โRand๐ค(๐ก) โRare the control input and the exogenous disturbance at time๐ก, respectively. The goal of the controller is to stabilize the object by keeping (๐ฅ ,๐ฅยค) as close to (0,0) as possible, while using little energy. The discrete-time dynamics are
"
๐ฅ๐ก+
1
ยค ๐ฅ๐ก+
1
#
=
"
1 ๐ฟ๐ก 0 1
# "
๐ฅ๐ก
ยค ๐ฅ๐ก
# +
"
0 ๐ฟ๐ก
# ๐ข๐ก+
"
0 ๐ฟ๐ก
# ๐ค๐ก,
where๐ฟ๐กis the discretization parameter; in our experiments we set๐ฟ๐ก =0.1 seconds.
In our experiments we take๐ =๐ผ
2and initialize๐ฅand๐ฅยคto zero.
We compare the regret-optimal controllers to standard๐ป
2and๐ปโcontrollers in the causal, full-information setting. Each of the regret-optimal and ๐ปโ controllers is associated with a parameter๐พ which quantifies its performance guarantee:
1. The competitive controller has๐พ =2.14; in other words, the competitive ratio of the competitive controller is 4.58.
2. The energy-optimal controller has๐พ =0.63; in other words, the regret of the energy-optimal controller against the optimal noncausal controller is bounded by 0.4 times the energy of the disturbance.
3. The pathlength-optimal controller has๐พ =934.99; in other words, the regret of the pathlength-optimal controller against the optimal noncausal controller is bounded by 8.74ร105times the pathlength of the disturbance.
4. The ๐ปโ-optimal controller has ๐พ = 1.00, so the cost incurred by the ๐ปโ controller is at most the energy of the disturbance.
0 /2 3 /2 2 0
0.5 1 1.5
2 2.5
3
Figure 6.1: Frequency responses of causal controllers in the double integrator system.
Frequency Response
In Figure 6.1, we plot the frequency responseโฅ๐๐พ(๐๐๐) โฅ2at different frequencies๐, where๐๐พ is the transfer operator associated to the controller๐พand๐พ is alternately taken to be the competitive, energy-optimal, pathlength-optimal, ๐ป
2, ๐ปโ, and op- timal noncausal controller. The frequency response measures how much energy is transferred to the closed-loop control system from a sinusoidal disturbance with frequency๐. The optimal noncausal controller has the lowest frequency response at all frequencies; its frequency response is a lower bound on the frequency response of any other controller. The๐ปโcontroller is the causal controller which minimizes
sup
๐โ[0,2๐]
ยฏ ๐ ๐โ
๐พ(๐๐๐)๐๐พ(๐๐๐) .
We see that the frequency response of the ๐ปโ controller peaks at ๐ = 0, where it matches the frequency response of the optimal noncausal controller; all other causal controllers have a higher peak frequency response. The ๐ป
2controller is the causal controller which minimizes
1 2๐
โซ 2๐ ๐=0
Tr ๐โ
๐พ(๐๐๐)๐๐พ(๐๐๐) , therefore the๐ป
2controller is the controller with the smallest area under its frequency response curve; we note that the disturbance in the double integrator system is scalar,
therefore ยฏ๐ ๐โ
๐พ(๐๐๐)๐๐พ(๐๐๐)
= Tr ๐โ
๐พ(๐๐๐)๐๐พ(๐๐๐)
. The competitive ratio of a controller๐พ at frequency๐is the spectral radius of the matrix
๐โ
๐พ(๐๐๐)๐๐พ(๐๐๐) ๐โ
๐พ0
(๐๐๐)๐๐พ
0(๐๐๐)โ1
, where๐๐พ is the transfer operator associated to ๐พ and๐๐พ
0 is the transfer operator associated with the optimal noncausal controller. The disturbance is scalar, so the competitive ratio simplifies to
๐โ
๐พ(๐๐๐)๐๐พ(๐๐๐) ๐โ
๐พ0
(๐๐๐)๐๐พ
0(๐๐๐), which is simply the ratio of frequency responses.
We see that the competitive controller has a lower frequency response than all other causal controllers at all frequencies except near ๐ = 0 and ๐ = 2๐, where its frequency response is highest. Intuitively, this is because the competitive controller is constrained to maintain a frequency response within a factor of 4.58 of the frequency response of the optimal noncausal controller, so it has more leeway at frequencies near ๐ = 0 and๐ = 2๐, where the optimal noncausal controller also has a high frequency response.
The energy-optimal controller minimizes sup
๐โ[0,2๐]
ยฏ ๐
๐โ
๐พ(๐๐๐)๐๐พ(๐๐๐) โ๐โ
๐พ0(๐๐๐)๐๐พ
0(๐๐๐) .
In other words, the energy-optimal controller minimizes the gap between its own frequency response and the frequency response of the optimal noncausal controller.
Consulting Figure 6.1, we see that every other causal controller indeed has a larger peak difference in frequency response relative to the optimal noncausal controller.
Time Domain Analysis
We next benchmark our regret-optimal controllers in the double integrator system across a wide variety of disturbances. We first consider an i.i.d. standard Gaussian disturbance. In Figure 6.2 we see that the competitive controller closely tracks the performance of the ๐ป
2controller, while the pathlength-optimal and ๐ปโ controller incur almost identical cost. Intuitively, an i.i.d. Gaussian disturbance has no correlation across timesteps and hence is expected to have high pathlength, so it is unsurprising that the pathlength-optimal controller performs poorly. We see that the energy-optimal controller incurs cost which is roughly halfway between that of the
๐ป2and๐ปโcontrollers. In Figure 6.3 we plot the costs of only the competitive and ๐ป2 controllers to better illustrate how closely the competitive controller is able to approximate the performance of the๐ป
2controller.
150 300 450 600
Time (sec) 0
0.5 1 1.5
2 2.5
Time-Averaged Cost
Figure 6.2: Relative performance of causal controllers in the double integrator system driven by an i.i.d. Gaussian disturbance.
We next consider sinusoidal disturbances of frequency๐ =0, i.e., constant distur- bances. The frequency responses shown in Figure 6.1 predict that the competitive controller will perform the worst, while the pathlength-optimal and ๐ปโcontrollers will match the performance of the optimal noncausal controller; this prediction is confirmed in Figure 6.4. As expected, the pathlength-optimal controller incurs zero regret, since the pathlength of the disturbance is zero. We note that while the competitive controller incurs the most cost, its cost is still less than 4.58 times the optimal noncausal cost, which is consistent with its competitive ratio bound.
We also note that the time-averaged cost of the๐ปโ controller converges to 1.00 in steady-state, which is consistent with the fact that its๐ปโ gain is 1.00.
We next consider a sinusoidal disturbance of frequency๐ =๐, i.e., the disturbances alternate between+1 andโ1 in each timestep. Due to the very large variation in the costs incurred by the causal controllers, we plot the costs on a log scale in Figure 6.5.
The frequency responses shown in Figure 6.1 predict that the competitive controller will perform the best out of all causal controllers; this prediction is confirmed