LEARNING AND CONTROL IN LINEAR QUADRATIC REGULATOR (LQR)
3.3 Thompson Sampling-Based Adaptive Control
3.3.3 Proof Outline of Sampling Optimistic Models with Constant Probability

In this section, we provide the precise statement that the probability of sampling an optimistic model is lower bounded by a constant.
Therefore, we can lower bound the probability of being optimistic, $\pi_t^{\mathrm{opt}}$, as

$$\pi_t^{\mathrm{opt}} \geq \mathbb{P}\left\{\tilde{\Theta}_t \in \mathcal{S}_{\mathrm{surr}} \,\middle|\, \mathcal{F}_t^{\mathrm{cnt}}, \hat{E}_t\right\} = \mathbb{P}\left\{J(\tilde{\Theta}_t^\top H_*) \leq J(\Theta_*^\top H_*) \,\middle|\, \mathcal{F}_t^{\mathrm{cnt}}, \hat{E}_t\right\}$$
$$\geq \min_{\hat{\Theta} \in \mathcal{E}_t^{\mathrm{RLS}}} \mathbb{P}_t\left\{J(\hat{\Theta}^\top H_* + \eta^\top \beta_t V_t^{-1/2} H_*) \leq J(\Theta_*^\top H_*)\right\} \tag{3.10}$$
$$= \min_{\hat{\Theta} \in \mathcal{E}_t^{\mathrm{RLS}}} \mathbb{P}_t\left\{J(\hat{\Theta}^\top H_* + \Omega F_t^{1/2}) \leq J(\Theta_*^\top H_*)\right\}, \tag{3.11}$$

where $\mathbb{P}_t\{\cdot\} := \mathbb{P}\{\cdot \mid \mathcal{F}_t^{\mathrm{cnt}}\}$, $F_t := \beta_t^2 H_*^\top V_t^{-1} H_*$, and $\Omega$ is a matrix of size $n \times n$ with i.i.d. $\mathcal{N}(0,1)$ entries. Here, (3.10) considers the worst possible estimate within $\mathcal{E}_t^{\mathrm{RLS}}$, and (3.11) applies the whitening transformation.
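As a numerical sanity check on the whitening step, the following sketch (with illustrative dimensions and a placeholder value for $\beta_t$) verifies that the perturbation $\eta^\top \beta_t V_t^{-1/2} H_*$ and the whitened form $\Omega F_t^{1/2}$ have the same row covariance, because $(\beta_t V_t^{-1/2} H_*)^\top (\beta_t V_t^{-1/2} H_*) = F_t$ exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 2          # illustrative state/input dimensions
beta_t = 1.7         # placeholder confidence-ellipsoid radius beta_t

# Generic positive-definite V_t and a full-column-rank H_*.
M0 = rng.standard_normal((n + d, n + d))
V_t = M0 @ M0.T + (n + d) * np.eye(n + d)
H_star = rng.standard_normal((n + d, n))

# Symmetric inverse square root of V_t via eigendecomposition.
w, U = np.linalg.eigh(V_t)
V_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T

# eta^T (beta_t V_t^{-1/2} H_*) has iid rows with covariance M^T M, where
# M = beta_t V_t^{-1/2} H_*.  The whitened form Omega F_t^{1/2} has iid rows
# with covariance F_t^{1/2} F_t^{1/2} = F_t.  The two agree since M^T M = F_t:
M = beta_t * V_inv_sqrt @ H_star
F_t = beta_t**2 * H_star.T @ np.linalg.solve(V_t, H_star)
assert np.allclose(M.T @ M, F_t)
```

The check is deterministic: the whitening is an equality of covariances, not an approximation.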
Reformulation in Terms of Closed-Loop Matrix
In the second step, we reformulate the probability of sampling optimistic parameters in terms of the closed-loop system matrix $\tilde{A}_c := \tilde{\Theta}^\top H_* = \tilde{A} + \tilde{B} K(\Theta_*)$ of the sampled system $\tilde{\Theta} = (\tilde{A}, \tilde{B})^\top$ driven by the policy $K(\Theta_*)$. Transitioning to the closed-loop formulation allows tighter bounds on the optimistic probability. To complete this reformulation, we construct an estimation confidence set for the closed-loop system matrix $\hat{A}_c := \hat{\Theta}^\top H_* = \hat{A} + \hat{B} K(\Theta_*)$ of the RLS-estimated system $\hat{\Theta} = (\hat{A}, \hat{B})^\top$ and show that the constructed confidence set is a superset of $\mathcal{E}_t^{\mathrm{RLS}}$.
Lemma 3.9 (Closed-loop confidence). Let $F_t(\delta) := \beta_t^2(\delta) H_*^\top V_t^{-1} H_*$. For any $t \geq 0$, define the closed-loop confidence set

$$\mathcal{E}_t^{\mathrm{cl}}(\delta) := \left\{\hat{\Theta} \in \mathbb{R}^{(n+d) \times n} \,\middle|\, \operatorname{tr}\!\left((\hat{\Theta}^\top H_* - \Theta_*^\top H_*)\, F_t^{-1}(\delta)\, (\hat{\Theta}^\top H_* - \Theta_*^\top H_*)^\top\right) \leq 1\right\}.$$

Then, for all times $t \geq 0$ and $\delta \in (0,1)$, we have that $\mathcal{E}_t^{\mathrm{RLS}}(\delta) \subseteq \mathcal{E}_t^{\mathrm{cl}}(\delta)$.
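For concreteness, membership in $\mathcal{E}_t^{\mathrm{cl}}$ is a single trace inequality. A minimal sketch, with a generic positive-definite matrix standing in for $F_t$:

```python
import numpy as np

def in_closed_loop_confidence_set(Theta_hat, Theta_star, H_star, F_t):
    """Test tr(Delta F_t^{-1} Delta^T) <= 1, where
    Delta = Theta_hat^T H_star - Theta_star^T H_star is the closed-loop error."""
    Delta = (Theta_hat - Theta_star).T @ H_star
    return float(np.trace(Delta @ np.linalg.solve(F_t, Delta.T))) <= 1.0

# Illustrative check: the true parameter always lies in its own confidence set.
rng = np.random.default_rng(0)
n, d = 3, 2
Theta_star = rng.standard_normal((n + d, n))
H_star = rng.standard_normal((n + d, n))
F_t = 0.5 * np.eye(n)                      # placeholder positive-definite F_t
assert in_closed_loop_confidence_set(Theta_star, Theta_star, H_star, F_t)
```

Note that only the closed-loop error $\hat{\Theta}^\top H_* - \Theta_*^\top H_*$ enters the test, mirroring the remark below.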
Note that the definition of $\mathcal{E}_t^{\mathrm{cl}}(\delta)$ only involves the closed-loop matrices $\hat{A}_c := \hat{\Theta}^\top H_*$ and $A_{c,*} := \Theta_*^\top H_*$. We can use the result of Lemma 3.9 to reformulate the probability of sampling optimistic parameters, $\tilde{\Theta} = (\tilde{A}, \tilde{B})$, as the probability of sampling optimistic closed-loop system matrices, $\tilde{A}_c$. We bound $\pi_t^{\mathrm{opt}}$ from below as

$$\pi_t^{\mathrm{opt}} \geq \min_{\hat{\Theta} \in \mathcal{E}_t^{\mathrm{cl}}} \mathbb{P}_t\left\{J(\hat{\Theta}^\top H_* + \Omega F_t^{1/2}) \leq J(A_{c,*})\right\} \tag{3.12}$$
$$= \min_{\hat{A}_c : \|\hat{A}_c^\top - A_{c,*}^\top\|_{F_t^{-1}} \leq 1} \mathbb{P}_t\left\{J(\hat{A}_c + \Omega F_t^{1/2}) \leq J(A_{c,*})\right\} \tag{3.13}$$
$$= \min_{\hat{\Upsilon} : \|\hat{\Upsilon}\|_F \leq 1} \mathbb{P}_t\left\{J(A_{c,*} + \hat{\Upsilon} F_t^{1/2} + \Omega F_t^{1/2}) \leq J(A_{c,*})\right\}, \tag{3.14}$$

where (3.12) is due to Lemma 3.9 and (3.13) follows from the fact that $H_*$ has full column rank. Observe that, in (3.14), $\hat{\Upsilon}$ is a matrix of size $n \times n$ with Frobenius norm at most one, and the term $A_{c,*} + \hat{\Upsilon} F_t^{1/2}$ accounts for the confidence ellipsoid of the estimated closed-loop matrix $\hat{A}_c$. The event in (3.14) corresponds to finding the closed-loop matrix $A_{c,*} + (\Omega + \hat{\Upsilon}) F_t^{1/2}$ of the TS-sampled system in the sublevel manifold $\mathcal{M}_* := \{A_c \in \mathcal{M}_n \mid J(A_c) \leq J(A_{c,*})\}$, as illustrated in Figure 3.3.
Local Geometry of Optimistic Set under Perturbations
Next, we further simplify the probability in (3.14) by exploiting the local geometric structure of the function $J : A_c \mapsto \sigma_w^2 \sum_{t=0}^{\infty} \|A_c^t\|_{P_*}^2$ defined over the set of (Schur-)stable matrices, $\mathcal{M}_{\mathrm{Schur}} := \{A_c \in \mathcal{M}_n \mid \rho(A_c) < 1\}$. The following lemma characterizes the perturbative properties of $J$.
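Although $J$ is defined through an infinite series, it can be evaluated in closed form: with $X = \sum_{t \geq 0} A_c^t A_c^{t\top}$, which solves the discrete Lyapunov equation $X = A_c X A_c^\top + I$, we get $J(A_c) = \sigma_w^2 \operatorname{tr}(P_* X)$, assuming the weighted norm $\|M\|_{P_*}^2 = \operatorname{tr}(M^\top P_* M)$. A sketch with placeholder stable $A_c$ and positive-definite $P_*$:

```python
import numpy as np

def J(A_c, P, sigma_w=1.0):
    """J(A_c) = sigma_w^2 * sum_t ||A_c^t||_P^2, assuming ||M||_P^2 = tr(M^T P M).
    Evaluated via X = sum_t A_c^t A_c^{t,T}, the solution of X = A_c X A_c^T + I,
    obtained here by vectorizing the Lyapunov equation with a Kronecker product."""
    n = A_c.shape[0]
    assert max(abs(np.linalg.eigvals(A_c))) < 1, "A_c must be Schur stable"
    x = np.linalg.solve(np.eye(n * n) - np.kron(A_c, A_c), np.eye(n).ravel())
    return sigma_w**2 * np.trace(P @ x.reshape(n, n))

# Cross-check against a truncated version of the infinite series.
A = np.array([[0.5, 0.2], [0.0, 0.4]])       # placeholder stable closed-loop matrix
P = np.array([[2.0, 0.3], [0.3, 1.0]])       # placeholder for P_*
series = sum(np.trace(np.linalg.matrix_power(A, t).T @ P @ np.linalg.matrix_power(A, t))
             for t in range(100))
assert abs(J(A, P) - series) < 1e-9
```

The Kronecker solve costs $O(n^6)$; for larger $n$, a dedicated Lyapunov solver (e.g., `scipy.linalg.solve_discrete_lyapunov`) is the standard choice.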
Lemma 3.10 (Perturbations). The function $J : \mathcal{M}_{\mathrm{Schur}} \to \mathbb{R}_+$ defined as $J(A_c) = \sigma_w^2 \sum_{t=0}^{\infty} \|A_c^t\|_{P_*}^2$ is smooth in its domain. For any $A_c \in \mathcal{M}_{\mathrm{Schur}}$, there exists $\epsilon > 0$ such that for any perturbation $\|G\|_F \leq \epsilon$, the function $J$ admits a quadratic Taylor expansion as

$$J(A_c + G) = J(A_c) + \nabla J(A_c) \bullet G + \tfrac{1}{2}\, G \bullet \mathcal{H}_{A_c + sG}(G) \tag{3.15}$$

for an $s \in [0,1]$, where $\mathcal{H}_{A_c} : \mathcal{M}_n \to \mathcal{M}_n$ is the Hessian operator evaluated at a point $A_c \in \mathcal{M}_{\mathrm{Schur}}$. In particular, we have that $\nabla J(A_{c,*}) = 2 P(\Theta_*) A_{c,*} \Sigma_*$. Furthermore, there exists a constant $\kappa > 0$ such that $|G \bullet \mathcal{H}_{A_c + sG}(G)| \leq \kappa \|G\|_F^2$ for any $s \in [0,1]$ and $\|G\|_F \leq \epsilon$.
Lemma 3.10 guarantees that if a perturbation is sufficiently small, the perturbed function can be locally expressed as a quadratic function of the perturbation. Since the set of stable matrices, $\mathcal{M}_{\mathrm{Schur}}$, is globally non-convex and Taylor's theorem holds only on convex domains, we restrict the perturbations to a ball of radius $\epsilon > 0$. The fact that there is a neighborhood of stable matrices around a matrix $A_c$ enables us to apply Taylor's theorem in this neighborhood.
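The quadratic nature of the remainder can also be observed numerically. Using a truncated version of $J$ (all matrices below are placeholders), the error of the first-order expansion should shrink roughly by a factor of four when the perturbation size is halved:

```python
import numpy as np

def J(A, P, sigma_w=1.0, terms=400):
    # Truncated series for J(A) = sigma_w^2 * sum_t tr(A^t.T P A^t);
    # adequate when the spectral radius of A is well below 1.
    total, M = 0.0, np.eye(A.shape[0])
    for _ in range(terms):
        total += np.trace(M.T @ P @ M)
        M = A @ M
    return sigma_w**2 * total

A = np.array([[0.5, 0.2], [0.0, 0.4]])   # placeholder stable matrix
P = np.array([[2.0, 0.3], [0.3, 1.0]])   # placeholder for P_*
G = np.array([[0.1, -0.2], [0.3, 0.1]])  # perturbation direction

h = 1e-5                                  # central-difference directional derivative
dJ = (J(A + h * G, P) - J(A - h * G, P)) / (2 * h)

def remainder(s):
    # Error of the first-order Taylor expansion at step size s.
    return abs(J(A + s * G, P) - J(A, P) - s * dJ)

# Quadratic-order remainder: halving s scales the error by roughly 1/4.
ratio = remainder(0.004) / remainder(0.002)
assert 2.5 < ratio < 5.5
```

The ratio is close to four rather than exactly four because of the cubic and higher terms hidden in the Hessian evaluated at $A_c + sG$.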
Figure 3.3: A visual representation of the sublevel manifold $\mathcal{M}_*$. $O$ is the origin and $A_{c,*}$ is the optimal closed-loop system matrix. $T_{A_{c,*}}\mathcal{M}_*$ is the tangent space to the manifold $\mathcal{M}_*$ at the point $A_{c,*}$, and $\nabla J_*$ is the Jacobian of the function $J$ at $A_{c,*}$. $\mathcal{M}_*^{\mathrm{qd}}$ is the sublevel manifold of the quadratic approximation to $J$, and $\mathcal{B}_*$ is a small ball of stable matrices around $A_{c,*}$. The intersection $\mathcal{M}_*^{\mathrm{qd}} \cap \mathcal{B}_*$ is a subset of $\mathcal{M}_*$.

Given the optimal closed-loop system matrix $A_{c,*}$, let $\epsilon_* > 0$ be chosen such that the expansion in (3.15) holds for perturbations $\|G\|_F \leq \epsilon_*$ around $A_{c,*}$. Denote the perturbation due to Thompson sampling and estimation error as $G_t = (\Omega + \hat{\Upsilon}) F_t^{1/2}$, and let $\|G_t\|_F \leq \epsilon_*$. Then, we can write
$$J(A_{c,*} + G_t) = J(A_{c,*}) + \nabla J(A_{c,*}) \bullet G_t + \tfrac{1}{2}\, G_t \bullet \mathcal{H}_{A_{c,*} + s G_t}(G_t)$$
$$\leq J(A_{c,*}) + \nabla J(A_{c,*}) \bullet G_t + \tfrac{\kappa_*}{2} \|G_t\|_F^2, \tag{3.16}$$

where $\kappa_* > 0$ is a constant due to Lemma 3.10. Using (3.16), we obtain the following lower bound on (3.14):
$$\pi_t^{\mathrm{opt}} \geq \min_{\hat{\Upsilon} : \|\hat{\Upsilon}\|_F \leq 1} \mathbb{P}_t\left\{\tfrac{\kappa_*}{2} \left\|(\Omega + \hat{\Upsilon}) F_t^{1/2}\right\|_F^2 + \nabla J_* \bullet (\Omega + \hat{\Upsilon}) F_t^{1/2} \leq 0, \text{ and } \left\|(\Omega + \hat{\Upsilon}) F_t^{1/2}\right\|_F \leq \epsilon_*\right\}, \tag{3.17}$$

where $\nabla J_* := \nabla J(A_{c,*})$. The event in (3.17) corresponds to finding $A_{c,*} + (\Omega + \hat{\Upsilon}) F_t^{1/2}$ at the intersection of the stable ball $\mathcal{B}_* := \{A_c \in \mathcal{M}_n \mid \|A_c - A_{c,*}\|_F \leq \epsilon_*\}$ and the sublevel manifold $\mathcal{M}_*^{\mathrm{qd}} := \{A_c \in \mathcal{M}_n \mid \|A_c - A_{c,*} + \kappa_*^{-1} \nabla J_*\|_F \leq \|\kappa_*^{-1} \nabla J_*\|_F\}$, as illustrated in Figure 3.3.
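The ball description of $\mathcal{M}_*^{\mathrm{qd}}$ is just completing the square in the quadratic event of (3.17): for a perturbation $G$, $\frac{\kappa_*}{2}\|G\|_F^2 + \nabla J_* \bullet G \leq 0$ holds if and only if $\|G + \kappa_*^{-1}\nabla J_*\|_F \leq \|\kappa_*^{-1}\nabla J_*\|_F$. This equivalence can be checked directly (placeholder values for $\kappa_*$ and $\nabla J_*$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, kappa = 3, 2.5                       # kappa plays the role of kappa_*
gradJ = rng.standard_normal((n, n))     # stand-in for the gradient nabla J_*

for _ in range(1000):
    G = rng.standard_normal((n, n))     # candidate perturbation A_c - A_{c,*}
    # Quadratic sublevel condition from (3.17):
    quad = 0.5 * kappa * np.sum(G * G) + np.sum(gradJ * G) <= 0
    # Equivalent Frobenius-ball condition defining M_*^qd:
    ball = np.linalg.norm(G + gradJ / kappa) <= np.linalg.norm(gradJ / kappa)
    assert quad == ball                 # completing the square is exact
```

Geometrically, $\mathcal{M}_*^{\mathrm{qd}}$ is a ball of radius $\|\kappa_*^{-1}\nabla J_*\|_F$ centered at $A_{c,*} - \kappa_*^{-1}\nabla J_*$, matching Figure 3.3.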
The intersection $\mathcal{M}_*^{\mathrm{qd}} \cap \mathcal{B}_* \subseteq \mathcal{M}_*$ serves as another surrogate for the sublevel manifold $\mathcal{M}_*$. Switching to the new surrogate $\mathcal{M}_*^{\mathrm{qd}}$ helps us overcome the difficulty of working with the intractable and complicated geometry of $\mathcal{M}_*$, which stems from the infinite sum in $J(A_c)$. Since the geometry of $\mathcal{M}_*^{\mathrm{qd}}$ is described by a quadratic form, we can utilize techniques for Gaussian probabilities.
Final Bound
Equipped with the preceding results, we can tractably bound the optimism probability from below by the probability of a TS-sampled closed-loop system matrix lying inside the intersection of the two balls $\mathcal{M}_*^{\mathrm{qd}} \cap \mathcal{B}_*$, as given in (3.17). By bounding the weighted Frobenius norms in (3.17) from above using $\lambda_{\max,t}$, the maximum eigenvalue of $F_t$, and normalizing by the matrix $\nabla J_* F_t^{1/2}$, we can write

$$\pi_t^{\mathrm{opt}} \geq \min_{\|\hat{\Upsilon}\|_F \leq 1} \mathbb{P}_t\left\{\tfrac{\kappa_* \lambda_{\max,t}}{2} \|\Omega + \hat{\Upsilon}\|_F^2 + (\nabla J_* F_t^{1/2}) \bullet (\Omega + \hat{\Upsilon}) \leq 0, \text{ and } \lambda_{\max,t} \|\Omega + \hat{\Upsilon}\|_F^2 \leq \epsilon_*^2\right\}$$
$$= \min_{\|\hat{\Upsilon}\|_F \leq 1} \mathbb{P}_t\left\{\frac{(\nabla J_* F_t^{1/2}) \bullet (\Omega + \hat{\Upsilon})}{\|\nabla J_* F_t^{1/2}\|_F} \leq -\frac{\lambda_{\max,t}\, \kappa_*\, \|\Omega + \hat{\Upsilon}\|_F^2}{2 \|\nabla J_* F_t^{1/2}\|_F}, \text{ and } \|\Omega + \hat{\Upsilon}\|_F^2 \leq \frac{\epsilon_*^2}{\lambda_{\max,t}}\right\}. \tag{3.18}$$

Observe that the inner product $(\nabla J_* F_t^{1/2}) \bullet \hat{\Upsilon}$ is maximized by $\Upsilon^{\#} := (\nabla J_* F_t^{1/2}) / \|\nabla J_* F_t^{1/2}\|_F$ subject to $\|\hat{\Upsilon}\|_F \leq 1$. Since the probability distribution of $\|\Omega + \hat{\Upsilon}\|_F^2$ is invariant under orthogonal transformations of $\Omega$ and $\hat{\Upsilon}$, (3.18) also attains its minimum at $\Upsilon^{\#}$. Thus, we can rewrite (3.18) as
$$\pi_t^{\mathrm{opt}} \geq \mathbb{P}_t\left\{\frac{(\nabla J_* F_t^{1/2}) \bullet \Omega}{\|\nabla J_* F_t^{1/2}\|_F} + 1 \leq -\frac{\lambda_{\max,t}\, \kappa_*}{2 \|\nabla J_* F_t^{1/2}\|_F} \|\Omega + \Upsilon^{\#}\|_F^2, \text{ and } \|\Omega + \Upsilon^{\#}\|_F^2 \leq \frac{\epsilon_*^2}{\lambda_{\max,t}}\right\}$$
$$= \mathbb{P}_t\left\{X + 1 \leq -\frac{\lambda_{\max,t}\, \kappa_*}{2 \|\nabla J_* F_t^{1/2}\|_F} \left((X+1)^2 + Y\right), \text{ and } (X+1)^2 + Y \leq \frac{\epsilon_*^2}{\lambda_{\max,t}}\right\}, \tag{3.19}$$

where $X \sim \mathcal{N}(0,1)$ and $Y \sim \chi^2_{n^2 - 1}$ are independent standard normal and chi-squared random variables, and (3.19) is derived by rotating $\Omega$ so that its first element lies along the direction of $\nabla J_* F_t^{1/2}$. We use the following lemma to characterize the eigenvalues of $F_t$ and control the lower bound (3.19) on $\pi_t^{\mathrm{opt}}$.
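The reduction to the pair $(X, Y)$ can be checked numerically: writing $X$ for the projection of $\Omega$ onto the unit direction $\Upsilon^{\#}$ and $Y$ for the squared Frobenius norm of the orthogonal remainder, the identity $\|\Omega + \Upsilon^{\#}\|_F^2 = (X+1)^2 + Y$ holds exactly, and by rotation invariance $X \sim \mathcal{N}(0,1)$ and $Y \sim \chi^2_{n^2-1}$ are independent. A sketch with a placeholder direction:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
V = rng.standard_normal((n, n))         # stands in for nabla J_* F_t^{1/2}
U = V / np.linalg.norm(V)               # unit direction; Upsilon^# = U

for _ in range(100):
    Omega = rng.standard_normal((n, n))
    X = np.sum(U * Omega)               # projection onto the direction of V
    Y = np.sum(Omega * Omega) - X**2    # squared norm of the orthogonal part
    # ||Omega + Upsilon^#||_F^2 decomposes exactly as (X+1)^2 + Y:
    assert np.isclose(np.linalg.norm(Omega + U)**2, (X + 1)**2 + Y)
```

The decomposition is an algebraic identity for every realization of $\Omega$; only the distributional claims about $X$ and $Y$ use Gaussianity.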
Lemma 3.11 (Bounded eigenvalues). Suppose $T_w = o((\sqrt{T})^{1+o(1)})$. Denote the minimum and maximum eigenvalues of $F_t$ by $\lambda_{\min,t}$ and $\lambda_{\max,t}$, respectively. Under the event $E_T$, for large enough $T$, we have that $\lambda_{\max,t} \leq \frac{C \log T}{T_w}$ and $\frac{\lambda_{\max,t}}{\lambda_{\min,t}} \leq \frac{C\, T \log T}{T_w}$ for any $T_r < t \leq T$, for a constant $C = \mathrm{poly}(n, d, \log(1/\delta))$.
Lemma 3.11 states that the maximum eigenvalue and the condition number of $F_t$ are controlled inversely by the length of the initial exploration phase, $T_w$, and proportionally by $\log T$ and $T \log T$, provided that the exploration time satisfies the bound in the lemma. The length of the initial exploration $T_w$ relative to the horizon $T$ is critical in guaranteeing an asymptotically constant optimistic probability $\pi_t^{\mathrm{opt}}$. Although a lengthier initial exploration leads to better convergence to a constant optimistic probability, it also incurs higher asymptotic regret due to the linear scaling of the exploration regret with $T_w$.

Using the relation $\|\nabla J_* F_t^{1/2}\|_F \geq \max\!\left(\sigma_{\min,*} \|F_t^{1/2}\|_F,\; \lambda_{\min,t}^{1/2} \|\nabla J_*\|_F\right)$, where $\sigma_{\min,*}$ is the minimum singular value of $\nabla J_*$, we can further bound (3.19) from below. From Lemma 3.10, we can write $\nabla J_* = 2 P(\Theta_*) A_{c,*} \Sigma_*$, where $P(\Theta_*) \succ 0$ is the solution to the DARE in (3.3) and $\Sigma_* = \Sigma(\Theta_*, K_*) \succ 0$ is the stationary state covariance matrix. Notice that the minimum singular value of $\nabla J_*$ is positive (i.e., $\nabla J_*$ is full rank) if and only if the closed-loop system matrix $A_{c,*}$ is non-singular.
In general, $A_{c,*}$ can be singular. Assuming that $T_w = o((\sqrt{T})^{1+o(1)})$, under the event $E_T$, we can use $\|\nabla J_* F_t^{1/2}\|_F \geq \sqrt{\lambda_{\min,t}}\, \|\nabla J_*\|_F$ to obtain the following lower bound on $\pi_t^{\mathrm{opt}}$ for $T_r < t \leq T$:

$$\pi_t^{\mathrm{opt}} \geq \mathbb{P}_t\left\{X + 1 \leq -\frac{\sqrt{\lambda_{\max,t}}}{2 \rho_*} \sqrt{\frac{\lambda_{\max,t}}{\lambda_{\min,t}}} \left((X+1)^2 + Y\right), \text{ and } (X+1)^2 + Y \leq \frac{\epsilon_*^2}{\lambda_{\max,t}}\right\}$$
$$\geq \mathbb{P}\left\{X + 1 \leq -\frac{C}{2 \rho_*} \frac{\sqrt{T} \log T}{T_w} \left((X+1)^2 + Y\right), \text{ and } (X+1)^2 + Y \leq \frac{\epsilon_*^2 T_w}{C \log T}\right\},$$

where $\rho_* := \|\kappa_*^{-1} \nabla J_*\|_F$. Choosing the exploration time as $T_w = \Theta(\sqrt{T} \log T)$ makes the coefficient $\frac{\sqrt{T} \log T}{T_w}$ very small and $\frac{\epsilon_*^2 T_w}{C \log T}$ very large, leading to a constant lower bound on the limiting optimistic probability, $\liminf_{T \to \infty} \pi_T^{\mathrm{opt}} \geq \mathbb{P}\{X + 1 \leq 0\} =: p(1)$.
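To see why a vanishing coefficient and a growing radius give a constant bound, a quick Monte Carlo sketch (with $n = 4$ and illustrative values of the coefficient $c$ and radius $R$) shows the probability approaching $\mathbb{P}\{X + 1 \leq 0\} = \Phi(-1) \approx 0.159$ from below:

```python
import numpy as np

rng = np.random.default_rng(3)
n, N = 4, 200_000
X = rng.standard_normal(N)
Y = rng.chisquare(n**2 - 1, N)
Q = (X + 1)**2 + Y                       # the quantity ||Omega + Upsilon^#||_F^2

def p_lower(c, R):
    # Monte Carlo estimate of P{ X+1 <= -c*Q and Q <= R }
    return np.mean((X + 1 <= -c * Q) & (Q <= R))

# Shrinking c and growing R can only enlarge the event (on the same samples),
# and the event always implies X+1 <= 0, so P{X+1 <= 0} is approached from below.
assert p_lower(1e-3, 1e3) <= p_lower(1e-4, 1e4) <= np.mean(X + 1 <= 0)
assert p_lower(1e-4, 1e4) > 0.14
```

The monotonicity here is deterministic on a fixed sample set, which makes the convergence easy to observe without large-sample statistics.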
On the other hand, if $A_{c,*}$ is non-singular, then we can use the alternative bound $\|\nabla J_* F_t^{1/2}\|_F \geq \sigma_{\min,*} \|F_t^{1/2}\|_F \geq \sigma_{\min,*} \sqrt{\lambda_{\max,t}}$ to obtain the following lower bound for $T_r < t \leq T$:

$$\pi_t^{\mathrm{opt}} \geq \mathbb{P}_t\left\{X + 1 \leq -\frac{\kappa_* \sqrt{\lambda_{\max,t}}}{2 \sigma_{\min,*}} \left((X+1)^2 + Y\right), \text{ and } (X+1)^2 + Y \leq \frac{\epsilon_*^2}{\lambda_{\max,t}}\right\}$$
$$\geq \mathbb{P}\left\{X + 1 \leq -\frac{\kappa_* \sqrt{C}}{2 \sigma_{\min,*}} \sqrt{\frac{\log T}{T_w}} \left((X+1)^2 + Y\right), \text{ and } (X+1)^2 + Y \leq \frac{\epsilon_*^2 T_w}{C \log T}\right\}.$$
Similarly, choosing the exploration time as $T_w = \Theta(\log T)$ makes the coefficient $\sqrt{\frac{\log T}{T_w}}$ very small and $\frac{\epsilon_*^2 T_w}{C \log T}$ very large, leading to a constant lower bound on the limiting optimistic probability, $\liminf_{T \to \infty} \pi_T^{\mathrm{opt}} \geq p(1)$.
Table 3.8: Regret and Maximum State Norm in Boeing 747 Flight Control.

Algorithm   Avg. Regret   Top 95% Regret   Top 90% Regret   Avg. max ||x||_2   Top 95%      Top 90%
TSAC        4.58×10^7     1.43×10^5        9.49×10^4        1.23×10^3          1.07×10^2    9.77×10^1
StabL       1.34×10^4     1.05×10^3        9.60×10^3        3.38×10^1          3.14×10^1    2.98×10^1
OFULQ       1.47×10^8     4.19×10^6        9.89×10^5        1.62×10^3          5.21×10^2    2.78×10^2
TS-LQR      5.63×10^11    3.07×10^7        5.33×10^6        6.26×10^4          1.08×10^3    6.39×10^2
In both cases, the optimistic probability achieves a constant lower bound for large enough $T$, as $\pi_T^{\mathrm{opt}} \geq p(1)(1 + o(1))^{-1}$. This result can be interpreted geometrically as follows. As time passes, the estimates of the system become more accurate, in the sense that the confidence region of the estimate shrinks quickly, as controlled by the eigenvalues of $F_t$. Similarly, the high-probability region of the TS samples also shrinks quickly, as controlled by the covariance matrix $F_t$. Therefore, for large enough $T$, the confidence region of the model estimate and the high-probability region of the TS samples become significantly smaller than the surrogate optimistic set $\mathcal{M}_*^{\mathrm{qd}} \cap \mathcal{B}_*$. This size difference effectively reduces the probability of finding a sampled system in $\mathcal{M}_*^{\mathrm{qd}} \cap \mathcal{B}_*$ to the probability of finding a sampled system in the half-space bounded by the tangent space $T_{A_{c,*}}\mathcal{M}_*$.

3.3.4 Numerical Experiments
Finally, we evaluate the performance of TSAC in the longitudinal flight control of a Boeing 747 with linearized dynamics [123]. We compare TSAC with three adaptive control algorithms from the literature that do not require an initial stabilizing policy: (i) OFULQ of Abbasi-Yadkori and Szepesvári [2], (ii) TS-LQR of Abeille and Lazaric [7], and (iii) StabL. We perform 200 independent runs of 200 time steps for each algorithm and report the average, top 95%, and top 90% values of the regret and of the maximum state norm. We present the performance of the best parameter choices for each algorithm. For a fair comparison, we also adopt slow policy updates in OFULQ and TS-LQR. For further details and the experimental results, please refer to [166].
The results are presented in Table 3.8. Notice that TSAC achieves the second-best performance after StabL. As expected, StabL outperforms TSAC since it performs much heavier computations to find the optimistic controller in the confidence set, whereas TSAC samples optimistic parameters only with some fixed probability.
However, TSAC compares favorably against both OFULQ and TS-LQR, making it the best-performing computationally efficient algorithm.