LEARNING AND CONTROL IN LINEAR QUADRATIC REGULATOR (LQR)
3.3 Thompson Sampling-Based Adaptive Control
3.3.3 Proof Outline of Sampling Optimistic Models with Constant Probability

In this section, we provide the precise statement that the probability of sampling an optimistic model is lower bounded by a constant.
Therefore, we can lower bound the probability of being optimistic, $\pi_t^{\mathrm{opt}}$, as

$$\pi_t^{\mathrm{opt}} \geq \mathbb{P}\left\{\tilde{\Theta}_t \in \mathcal{S}_{\mathrm{surr}} \,\middle|\, \mathcal{F}_t^{\mathrm{cnt}}, \hat{E}_t\right\} = \mathbb{P}\left\{J(\tilde{\Theta}_t^\top H_*) \leq J(\Theta_*^\top H_*) \,\middle|\, \mathcal{F}_t^{\mathrm{cnt}}, \hat{E}_t\right\}$$
$$\geq \min_{\hat{\Theta} \in \mathcal{E}_t^{\mathrm{RLS}}} \mathbb{P}_t\left\{J(\hat{\Theta}^\top H_* + \eta^\top \beta_t V_t^{-1/2} H_*) \leq J(\Theta_*^\top H_*)\right\} \tag{3.10}$$
$$= \min_{\hat{\Theta} \in \mathcal{E}_t^{\mathrm{RLS}}} \mathbb{P}_t\left\{J(\hat{\Theta}^\top H_* + \Omega F_t^{1/2}) \leq J(\Theta_*^\top H_*)\right\}, \tag{3.11}$$

where $\mathbb{P}_t\{\cdot\} := \mathbb{P}\{\cdot \mid \mathcal{F}_t^{\mathrm{cnt}}\}$, $F_t := \beta_t^2 H_*^\top V_t^{-1} H_*$, and $\Omega$ is a matrix of size $n \times n$ with i.i.d. $\mathcal{N}(0,1)$ entries. Here, (3.10) considers the worst possible estimate within $\mathcal{E}_t^{\mathrm{RLS}}$, and (3.11) applies the whitening transformation.
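As a numerical sanity check on the whitening step, the following sketch (with illustrative dimensions and a placeholder value for $\beta_t$) verifies that the perturbation $\eta^\top \beta_t V_t^{-1/2} H_*$ and the whitened form $\Omega F_t^{1/2}$ have the same row covariance, because $(\beta_t V_t^{-1/2} H_*)^\top (\beta_t V_t^{-1/2} H_*) = F_t$ exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 2          # illustrative state/input dimensions
beta_t = 1.7         # placeholder confidence-ellipsoid radius beta_t

# Generic positive-definite V_t and a full-column-rank H_*.
M0 = rng.standard_normal((n + d, n + d))
V_t = M0 @ M0.T + (n + d) * np.eye(n + d)
H_star = rng.standard_normal((n + d, n))

# Symmetric inverse square root of V_t via eigendecomposition.
w, U = np.linalg.eigh(V_t)
V_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T

# eta^T (beta_t V_t^{-1/2} H_*) has iid rows with covariance M^T M, where
# M = beta_t V_t^{-1/2} H_*.  The whitened form Omega F_t^{1/2} has iid rows
# with covariance F_t^{1/2} F_t^{1/2} = F_t.  The two agree since M^T M = F_t:
M = beta_t * V_inv_sqrt @ H_star
F_t = beta_t**2 * H_star.T @ np.linalg.solve(V_t, H_star)
assert np.allclose(M.T @ M, F_t)
```

The check is deterministic: the whitening is an equality of covariances, not an approximation.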
Reformulation in Terms of Closed-Loop Matrix
In the second step, we reformulate the probability of sampling optimistic parameters in terms of the closed-loop system matrix $\tilde{A}_c := \tilde{\Theta}^\top H_* = \tilde{A} + \tilde{B} K(\Theta_*)$ of the sampled system $\tilde{\Theta} = (\tilde{A}, \tilde{B})^\top$ driven by the policy $K(\Theta_*)$. Transitioning to the closed-loop formulation allows tighter bounds on the optimistic probability. To complete this reformulation, we construct an estimation confidence set for the closed-loop system matrix $\hat{A}_c := \hat{\Theta}^\top H_* = \hat{A} + \hat{B} K(\Theta_*)$ of the RLS-estimated system $\hat{\Theta} = (\hat{A}, \hat{B})^\top$ and show that the constructed confidence set is a superset of $\mathcal{E}_t^{\mathrm{RLS}}$.
Lemma 3.9 (Closed-loop confidence). Let $F_t(\delta) := \beta_t^2(\delta) H_*^\top V_t^{-1} H_*$. For any $t \geq 0$, define the closed-loop confidence set

$$\mathcal{E}_t^{\mathrm{cl}}(\delta) := \left\{\hat{\Theta} \in \mathbb{R}^{(n+d) \times n} \,\middle|\, \operatorname{tr}\!\left((\hat{\Theta}^\top H_* - \Theta_*^\top H_*)\, F_t^{-1}(\delta)\, (\hat{\Theta}^\top H_* - \Theta_*^\top H_*)^\top\right) \leq 1\right\}.$$

Then, for all times $t \geq 0$ and $\delta \in (0,1)$, we have that $\mathcal{E}_t^{\mathrm{RLS}}(\delta) \subseteq \mathcal{E}_t^{\mathrm{cl}}(\delta)$.
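For concreteness, membership in $\mathcal{E}_t^{\mathrm{cl}}$ is a single trace inequality. A minimal sketch, with a generic positive-definite matrix standing in for $F_t$:

```python
import numpy as np

def in_closed_loop_confidence_set(Theta_hat, Theta_star, H_star, F_t):
    """Test tr(Delta F_t^{-1} Delta^T) <= 1, where
    Delta = Theta_hat^T H_star - Theta_star^T H_star is the closed-loop error."""
    Delta = (Theta_hat - Theta_star).T @ H_star
    return float(np.trace(Delta @ np.linalg.solve(F_t, Delta.T))) <= 1.0

# Illustrative check: the true parameter always lies in its own confidence set.
rng = np.random.default_rng(0)
n, d = 3, 2
Theta_star = rng.standard_normal((n + d, n))
H_star = rng.standard_normal((n + d, n))
F_t = 0.5 * np.eye(n)                      # placeholder positive-definite F_t
assert in_closed_loop_confidence_set(Theta_star, Theta_star, H_star, F_t)
```

Note that only the closed-loop error $\hat{\Theta}^\top H_* - \Theta_*^\top H_*$ enters the test, mirroring the remark below.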
Note that the definition of $\mathcal{E}_t^{\mathrm{cl}}(\delta)$ only involves the closed-loop matrices $\hat{A}_c := \hat{\Theta}^\top H_*$ and $A_{c,*} := \Theta_*^\top H_*$. We can use the result of Lemma 3.9 to reformulate the probability of sampling optimistic parameters, $\tilde{\Theta} = (\tilde{A}, \tilde{B})$, as the probability of sampling optimistic closed-loop system matrices, $\tilde{A}_c$. We bound $\pi_t^{\mathrm{opt}}$ from below as

$$\pi_t^{\mathrm{opt}} \geq \min_{\hat{\Theta} \in \mathcal{E}_t^{\mathrm{cl}}} \mathbb{P}_t\left\{J(\hat{\Theta}^\top H_* + \Omega F_t^{1/2}) \leq J(A_{c,*})\right\} \tag{3.12}$$
$$= \min_{\hat{A}_c : \|\hat{A}_c^\top - A_{c,*}^\top\|_{F_t^{-1}} \leq 1} \mathbb{P}_t\left\{J(\hat{A}_c + \Omega F_t^{1/2}) \leq J(A_{c,*})\right\} \tag{3.13}$$
$$= \min_{\hat{\Upsilon} : \|\hat{\Upsilon}\|_F \leq 1} \mathbb{P}_t\left\{J(A_{c,*} + \hat{\Upsilon} F_t^{1/2} + \Omega F_t^{1/2}) \leq J(A_{c,*})\right\}, \tag{3.14}$$

where (3.12) is due to Lemma 3.9 and (3.13) follows from the fact that $H_*$ has full column rank. Observe that, in (3.14), $\hat{\Upsilon}$ is a matrix of size $n \times n$ with Frobenius norm at most one, and the term $A_{c,*} + \hat{\Upsilon} F_t^{1/2}$ accounts for the confidence ellipsoid of the estimated closed-loop matrix $\hat{A}_c$. The event in (3.14) corresponds to finding the closed-loop matrix $A_{c,*} + (\Omega + \hat{\Upsilon}) F_t^{1/2}$ of the TS-sampled system in the sublevel manifold $\mathcal{M}_* := \{A_c \in \mathcal{M}_n \mid J(A_c) \leq J(A_{c,*})\}$, as illustrated in Figure 3.3.
Local Geometry of Optimistic Set under Perturbations
Next, we further simplify the probability in (3.14) by exploiting the local geometric structure of the function $J : A_c \mapsto \sigma_w^2 \sum_{t=0}^{\infty} \|A_c^t\|_{P_*}^2$ defined over the set of (Schur-)stable matrices, $\mathcal{M}_{\mathrm{Schur}} := \{A_c \in \mathcal{M}_n \mid \rho(A_c) < 1\}$. The following lemma characterizes the perturbative properties of $J$.
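Although $J$ is defined through an infinite series, it can be evaluated in closed form: with $X = \sum_{t \geq 0} A_c^t A_c^{t\top}$, which solves the discrete Lyapunov equation $X = A_c X A_c^\top + I$, we get $J(A_c) = \sigma_w^2 \operatorname{tr}(P_* X)$, assuming the weighted norm $\|M\|_{P_*}^2 = \operatorname{tr}(M^\top P_* M)$. A sketch with placeholder stable $A_c$ and positive-definite $P_*$:

```python
import numpy as np

def J(A_c, P, sigma_w=1.0):
    """J(A_c) = sigma_w^2 * sum_t ||A_c^t||_P^2, assuming ||M||_P^2 = tr(M^T P M).
    Evaluated via X = sum_t A_c^t A_c^{t,T}, the solution of X = A_c X A_c^T + I,
    obtained here by vectorizing the Lyapunov equation with a Kronecker product."""
    n = A_c.shape[0]
    assert max(abs(np.linalg.eigvals(A_c))) < 1, "A_c must be Schur stable"
    x = np.linalg.solve(np.eye(n * n) - np.kron(A_c, A_c), np.eye(n).ravel())
    return sigma_w**2 * np.trace(P @ x.reshape(n, n))

# Cross-check against a truncated version of the infinite series.
A = np.array([[0.5, 0.2], [0.0, 0.4]])       # placeholder stable closed-loop matrix
P = np.array([[2.0, 0.3], [0.3, 1.0]])       # placeholder for P_*
series = sum(np.trace(np.linalg.matrix_power(A, t).T @ P @ np.linalg.matrix_power(A, t))
             for t in range(100))
assert abs(J(A, P) - series) < 1e-9
```

The Kronecker solve costs $O(n^6)$; for larger $n$, a dedicated Lyapunov solver (e.g., `scipy.linalg.solve_discrete_lyapunov`) is the standard choice.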
Lemma 3.10 (Perturbations). The function $J : \mathcal{M}_{\mathrm{Schur}} \to \mathbb{R}_+$ defined as $J(A_c) = \sigma_w^2 \sum_{t=0}^{\infty} \|A_c^t\|_{P_*}^2$ is smooth in its domain. For any $A_c \in \mathcal{M}_{\mathrm{Schur}}$, there exists $\epsilon > 0$ such that for any perturbation $\|G\|_F \leq \epsilon$, the function $J$ admits a quadratic Taylor expansion as

$$J(A_c + G) = J(A_c) + \nabla J(A_c) \bullet G + \tfrac{1}{2}\, G \bullet \mathcal{H}_{A_c + sG}(G) \tag{3.15}$$

for an $s \in [0,1]$, where $\mathcal{H}_{A_c} : \mathcal{M}_n \to \mathcal{M}_n$ is the Hessian operator evaluated at a point $A_c \in \mathcal{M}_{\mathrm{Schur}}$. In particular, we have that $\nabla J(A_{c,*}) = 2 P(\Theta_*) A_{c,*} \Sigma_*$. Furthermore, there exists a constant $\kappa > 0$ such that $|G \bullet \mathcal{H}_{A_c + sG}(G)| \leq \kappa \|G\|_F^2$ for any $s \in [0,1]$ and $\|G\|_F \leq \epsilon$.
Lemma 3.10 guarantees that if a perturbation is sufficiently small, the perturbed function can be locally expressed as a quadratic function of the perturbation. Since the set of stable matrices, $\mathcal{M}_{\mathrm{Schur}}$, is globally non-convex and Taylor's theorem holds only on convex domains, we restrict the perturbations to a ball of radius $\epsilon > 0$. The fact that there is a neighborhood of stable matrices around a matrix $A_c$ enables us to apply Taylor's theorem in this neighborhood.
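The quadratic nature of the remainder can also be observed numerically. Using a truncated version of $J$ (all matrices below are placeholders), the error of the first-order expansion should shrink roughly by a factor of four when the perturbation size is halved:

```python
import numpy as np

def J(A, P, sigma_w=1.0, terms=400):
    # Truncated series for J(A) = sigma_w^2 * sum_t tr(A^t.T P A^t);
    # adequate when the spectral radius of A is well below 1.
    total, M = 0.0, np.eye(A.shape[0])
    for _ in range(terms):
        total += np.trace(M.T @ P @ M)
        M = A @ M
    return sigma_w**2 * total

A = np.array([[0.5, 0.2], [0.0, 0.4]])   # placeholder stable matrix
P = np.array([[2.0, 0.3], [0.3, 1.0]])   # placeholder for P_*
G = np.array([[0.1, -0.2], [0.3, 0.1]])  # perturbation direction

h = 1e-5                                  # central-difference directional derivative
dJ = (J(A + h * G, P) - J(A - h * G, P)) / (2 * h)

def remainder(s):
    # Error of the first-order Taylor expansion at step size s.
    return abs(J(A + s * G, P) - J(A, P) - s * dJ)

# Quadratic-order remainder: halving s scales the error by roughly 1/4.
ratio = remainder(0.004) / remainder(0.002)
assert 2.5 < ratio < 5.5
```

The ratio is close to four rather than exactly four because of the cubic and higher terms hidden in the Hessian evaluated at $A_c + sG$.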
Figure 3.3: A visual representation of the sublevel manifold $\mathcal{M}_*$. $O$ is the origin and $A_{c,*}$ is the optimal closed-loop system matrix. $T_{A_{c,*}}\mathcal{M}_*$ is the tangent space to the manifold $\mathcal{M}_*$ at the point $A_{c,*}$, and $\nabla J_*$ is the Jacobian of the function $J$ at $A_{c,*}$. $\mathcal{M}_*^{\mathrm{qd}}$ is the sublevel manifold of the quadratic approximation to $J$, and $\mathcal{B}_*$ is a small ball of stable matrices around $A_{c,*}$. The intersection $\mathcal{M}_*^{\mathrm{qd}} \cap \mathcal{B}_*$ is a subset of $\mathcal{M}_*$.

Given the optimal closed-loop system matrix $A_{c,*}$, let $\epsilon_* > 0$ be chosen such that the expansion in (3.15) holds for perturbations $\|G\|_F \leq \epsilon_*$ around $A_{c,*}$. Denote the perturbation due to Thompson sampling and estimation error as $G_t = (\Omega + \hat{\Upsilon}) F_t^{1/2}$, and let $\|G_t\|_F \leq \epsilon_*$. Then, we can write
$$J(A_{c,*} + G_t) = J(A_{c,*}) + \nabla J(A_{c,*}) \bullet G_t + \tfrac{1}{2}\, G_t \bullet \mathcal{H}_{A_{c,*} + s G_t}(G_t)$$
$$\leq J(A_{c,*}) + \nabla J(A_{c,*}) \bullet G_t + \tfrac{\kappa_*}{2} \|G_t\|_F^2, \tag{3.16}$$

where $\kappa_* > 0$ is a constant due to Lemma 3.10. Using (3.16), we obtain the following lower bound on (3.14):
$$\pi_t^{\mathrm{opt}} \geq \min_{\hat{\Upsilon} : \|\hat{\Upsilon}\|_F \leq 1} \mathbb{P}_t\left\{\tfrac{\kappa_*}{2} \left\|(\Omega + \hat{\Upsilon}) F_t^{1/2}\right\|_F^2 + \nabla J_* \bullet (\Omega + \hat{\Upsilon}) F_t^{1/2} \leq 0, \text{ and } \left\|(\Omega + \hat{\Upsilon}) F_t^{1/2}\right\|_F \leq \epsilon_*\right\}, \tag{3.17}$$

where $\nabla J_* := \nabla J(A_{c,*})$. The event in (3.17) corresponds to finding $A_{c,*} + (\Omega + \hat{\Upsilon}) F_t^{1/2}$ at the intersection of the stable ball $\mathcal{B}_* := \{A_c \in \mathcal{M}_n \mid \|A_c - A_{c,*}\|_F \leq \epsilon_*\}$ and the sublevel manifold $\mathcal{M}_*^{\mathrm{qd}} := \{A_c \in \mathcal{M}_n \mid \|A_c - A_{c,*} + \kappa_*^{-1} \nabla J_*\|_F \leq \|\kappa_*^{-1} \nabla J_*\|_F\}$, as illustrated in Figure 3.3.
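The ball description of $\mathcal{M}_*^{\mathrm{qd}}$ is just completing the square in the quadratic event of (3.17): for a perturbation $G$, $\frac{\kappa_*}{2}\|G\|_F^2 + \nabla J_* \bullet G \leq 0$ holds if and only if $\|G + \kappa_*^{-1}\nabla J_*\|_F \leq \|\kappa_*^{-1}\nabla J_*\|_F$. This equivalence can be checked directly (placeholder values for $\kappa_*$ and $\nabla J_*$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, kappa = 3, 2.5                       # kappa plays the role of kappa_*
gradJ = rng.standard_normal((n, n))     # stand-in for the gradient nabla J_*

for _ in range(1000):
    G = rng.standard_normal((n, n))     # candidate perturbation A_c - A_{c,*}
    # Quadratic sublevel condition from (3.17):
    quad = 0.5 * kappa * np.sum(G * G) + np.sum(gradJ * G) <= 0
    # Equivalent Frobenius-ball condition defining M_*^qd:
    ball = np.linalg.norm(G + gradJ / kappa) <= np.linalg.norm(gradJ / kappa)
    assert quad == ball                 # completing the square is exact
```

Geometrically, $\mathcal{M}_*^{\mathrm{qd}}$ is a ball of radius $\|\kappa_*^{-1}\nabla J_*\|_F$ centered at $A_{c,*} - \kappa_*^{-1}\nabla J_*$, matching Figure 3.3.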
The intersection $\mathcal{M}_*^{\mathrm{qd}} \cap \mathcal{B}_* \subseteq \mathcal{M}_*$ serves as another surrogate for the sublevel manifold $\mathcal{M}_*$. Switching to the new surrogate $\mathcal{M}_*^{\mathrm{qd}}$ helps us overcome the difficulty of working with the intractable and complicated geometry of $\mathcal{M}_*$, which stems from the infinite sum in $J(A_c)$. Since the geometry of $\mathcal{M}_*^{\mathrm{qd}}$ is described by a quadratic form, we can utilize techniques for Gaussian probabilities.
Final Bound
Equipped with the preceding results, we can tractably bound the optimism probability from below by the probability of a TS-sampled closed-loop system matrix lying inside the intersection of the two balls $\mathcal{M}_*^{\mathrm{qd}} \cap \mathcal{B}_*$, as given in (3.17). By bounding the weighted Frobenius norms in (3.17) from above using $\lambda_{\max,t}$, the maximum eigenvalue of $F_t$, and normalizing by the matrix $\nabla J_* F_t^{1/2}$, we can write

$$\pi_t^{\mathrm{opt}} \geq \min_{\|\hat{\Upsilon}\|_F \leq 1} \mathbb{P}_t\left\{\tfrac{\kappa_* \lambda_{\max,t}}{2} \|\Omega + \hat{\Upsilon}\|_F^2 + (\nabla J_* F_t^{1/2}) \bullet (\Omega + \hat{\Upsilon}) \leq 0, \text{ and } \lambda_{\max,t} \|\Omega + \hat{\Upsilon}\|_F^2 \leq \epsilon_*^2\right\}$$
$$= \min_{\|\hat{\Upsilon}\|_F \leq 1} \mathbb{P}_t\left\{\frac{(\nabla J_* F_t^{1/2}) \bullet (\Omega + \hat{\Upsilon})}{\|\nabla J_* F_t^{1/2}\|_F} \leq -\frac{\lambda_{\max,t}\, \kappa_*\, \|\Omega + \hat{\Upsilon}\|_F^2}{2 \|\nabla J_* F_t^{1/2}\|_F}, \text{ and } \|\Omega + \hat{\Upsilon}\|_F^2 \leq \frac{\epsilon_*^2}{\lambda_{\max,t}}\right\}. \tag{3.18}$$

Observe that the inner product $(\nabla J_* F_t^{1/2}) \bullet \hat{\Upsilon}$ is maximized by $\Upsilon^{\#} := (\nabla J_* F_t^{1/2}) / \|\nabla J_* F_t^{1/2}\|_F$ subject to $\|\hat{\Upsilon}\|_F \leq 1$. Since the probability distribution of $\|\Omega + \hat{\Upsilon}\|_F^2$ is invariant under orthogonal transformations of $\Omega$ and $\hat{\Upsilon}$, (3.18) also attains its minimum at $\Upsilon^{\#}$. Thus, we can rewrite (3.18) as
$$\pi_t^{\mathrm{opt}} \geq \mathbb{P}_t\left\{\frac{(\nabla J_* F_t^{1/2}) \bullet \Omega}{\|\nabla J_* F_t^{1/2}\|_F} + 1 \leq -\frac{\lambda_{\max,t}\, \kappa_*}{2 \|\nabla J_* F_t^{1/2}\|_F} \|\Omega + \Upsilon^{\#}\|_F^2, \text{ and } \|\Omega + \Upsilon^{\#}\|_F^2 \leq \frac{\epsilon_*^2}{\lambda_{\max,t}}\right\}$$
$$= \mathbb{P}_t\left\{X + 1 \leq -\frac{\lambda_{\max,t}\, \kappa_*}{2 \|\nabla J_* F_t^{1/2}\|_F} \left((X+1)^2 + Y\right), \text{ and } (X+1)^2 + Y \leq \frac{\epsilon_*^2}{\lambda_{\max,t}}\right\}, \tag{3.19}$$

where $X \sim \mathcal{N}(0,1)$ and $Y \sim \chi^2_{n^2 - 1}$ are independent standard normal and chi-squared random variables, and (3.19) is derived by rotating $\Omega$ so that its first element lies along the direction of $\nabla J_* F_t^{1/2}$. We use the following lemma to characterize the eigenvalues of $F_t$ and control the lower bound (3.19) on $\pi_t^{\mathrm{opt}}$.
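The reduction to the pair $(X, Y)$ can be checked numerically: writing $X$ for the projection of $\Omega$ onto the unit direction $\Upsilon^{\#}$ and $Y$ for the squared Frobenius norm of the orthogonal remainder, the identity $\|\Omega + \Upsilon^{\#}\|_F^2 = (X+1)^2 + Y$ holds exactly, and by rotation invariance $X \sim \mathcal{N}(0,1)$ and $Y \sim \chi^2_{n^2-1}$ are independent. A sketch with a placeholder direction:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
V = rng.standard_normal((n, n))         # stands in for nabla J_* F_t^{1/2}
U = V / np.linalg.norm(V)               # unit direction; Upsilon^# = U

for _ in range(100):
    Omega = rng.standard_normal((n, n))
    X = np.sum(U * Omega)               # projection onto the direction of V
    Y = np.sum(Omega * Omega) - X**2    # squared norm of the orthogonal part
    # ||Omega + Upsilon^#||_F^2 decomposes exactly as (X+1)^2 + Y:
    assert np.isclose(np.linalg.norm(Omega + U)**2, (X + 1)**2 + Y)
```

The decomposition is an algebraic identity for every realization of $\Omega$; only the distributional claims about $X$ and $Y$ use Gaussianity.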
Lemma 3.11 (Bounded eigenvalues). Suppose $T_w = o((\sqrt{T})^{1+o(1)})$. Denote the minimum and maximum eigenvalues of $F_t$ by $\lambda_{\min,t}$ and $\lambda_{\max,t}$, respectively. Under the event $E_T$, for large enough $T$, we have that $\lambda_{\max,t} \leq \frac{C \log T}{T_w}$ and $\frac{\lambda_{\max,t}}{\lambda_{\min,t}} \leq \frac{C\, T \log T}{T_w}$ for any $T_r < t \leq T$, for a constant $C = \mathrm{poly}(n, d, \log(1/\delta))$.
Lemma 3.11 states that the maximum eigenvalue and the condition number of $F_t$ are controlled inversely by the length of the initial exploration phase, $T_w$, and proportionally by $\log T$ and $T \log T$, provided that the exploration time satisfies the bound in the lemma. The length of the initial exploration $T_w$ relative to the horizon $T$ is critical in guaranteeing an asymptotically constant optimistic probability $\pi_t^{\mathrm{opt}}$. Although a lengthier initial exploration leads to better convergence to a constant optimistic probability, it also incurs higher asymptotic regret due to the linear scaling of the exploration regret with $T_w$.

Using the relation $\|\nabla J_* F_t^{1/2}\|_F \geq \max\!\left(\sigma_{\min,*} \|F_t^{1/2}\|_F,\; \lambda_{\min,t}^{1/2} \|\nabla J_*\|_F\right)$, where $\sigma_{\min,*}$ is the minimum singular value of $\nabla J_*$, we can further bound (3.19) from below. From Lemma 3.10, we can write $\nabla J_* = 2 P(\Theta_*) A_{c,*} \Sigma_*$, where $P(\Theta_*) \succ 0$ is the solution to the DARE in (3.3) and $\Sigma_* = \Sigma(\Theta_*, K_*) \succ 0$ is the stationary state covariance matrix. Notice that the minimum singular value of $\nabla J_*$ is positive (i.e., $\nabla J_*$ is full rank) if and only if the closed-loop system matrix $A_{c,*}$ is non-singular.
In general, $A_{c,*}$ can be singular. Assuming that $T_w = o((\sqrt{T})^{1+o(1)})$, under the event $E_T$, we can use $\|\nabla J_* F_t^{1/2}\|_F \geq \sqrt{\lambda_{\min,t}}\, \|\nabla J_*\|_F$ to obtain the following lower bound on $\pi_t^{\mathrm{opt}}$ for $T_r < t \leq T$:

$$\pi_t^{\mathrm{opt}} \geq \mathbb{P}_t\left\{X + 1 \leq -\frac{\sqrt{\lambda_{\max,t}}}{2 \rho_*} \sqrt{\frac{\lambda_{\max,t}}{\lambda_{\min,t}}} \left((X+1)^2 + Y\right), \text{ and } (X+1)^2 + Y \leq \frac{\epsilon_*^2}{\lambda_{\max,t}}\right\}$$
$$\geq \mathbb{P}\left\{X + 1 \leq -\frac{C}{2 \rho_*} \frac{\sqrt{T} \log T}{T_w} \left((X+1)^2 + Y\right), \text{ and } (X+1)^2 + Y \leq \frac{\epsilon_*^2 T_w}{C \log T}\right\},$$

where $\rho_* := \|\kappa_*^{-1} \nabla J_*\|_F$. Choosing the exploration time as $T_w = \Theta(\sqrt{T} \log T)$ makes the coefficient $\frac{\sqrt{T} \log T}{T_w}$ very small and $\frac{\epsilon_*^2 T_w}{C \log T}$ very large, leading to a constant lower bound on the limiting optimistic probability, $\liminf_{T \to \infty} \pi_T^{\mathrm{opt}} \geq \mathbb{P}\{X + 1 \leq 0\} =: p(1)$.
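To see why a vanishing coefficient and a growing radius give a constant bound, a quick Monte Carlo sketch (with $n = 4$ and illustrative values of the coefficient $c$ and radius $R$) shows the probability approaching $\mathbb{P}\{X + 1 \leq 0\} = \Phi(-1) \approx 0.159$ from below:

```python
import numpy as np

rng = np.random.default_rng(3)
n, N = 4, 200_000
X = rng.standard_normal(N)
Y = rng.chisquare(n**2 - 1, N)
Q = (X + 1)**2 + Y                       # the quantity ||Omega + Upsilon^#||_F^2

def p_lower(c, R):
    # Monte Carlo estimate of P{ X+1 <= -c*Q and Q <= R }
    return np.mean((X + 1 <= -c * Q) & (Q <= R))

# Shrinking c and growing R can only enlarge the event (on the same samples),
# and the event always implies X+1 <= 0, so P{X+1 <= 0} is approached from below.
assert p_lower(1e-3, 1e3) <= p_lower(1e-4, 1e4) <= np.mean(X + 1 <= 0)
assert p_lower(1e-4, 1e4) > 0.14
```

The monotonicity here is deterministic on a fixed sample set, which makes the convergence easy to observe without large-sample statistics.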
On the other hand, if $A_{c,*}$ is non-singular, then we can use the alternative bound $\|\nabla J_* F_t^{1/2}\|_F \geq \sigma_{\min,*} \|F_t^{1/2}\|_F \geq \sigma_{\min,*} \sqrt{\lambda_{\max,t}}$ to obtain the following lower bound for $T_r < t \leq T$:

$$\pi_t^{\mathrm{opt}} \geq \mathbb{P}_t\left\{X + 1 \leq -\frac{\kappa_* \sqrt{\lambda_{\max,t}}}{2 \sigma_{\min,*}} \left((X+1)^2 + Y\right), \text{ and } (X+1)^2 + Y \leq \frac{\epsilon_*^2}{\lambda_{\max,t}}\right\}$$
$$\geq \mathbb{P}\left\{X + 1 \leq -\frac{\kappa_* \sqrt{C}}{2 \sigma_{\min,*}} \sqrt{\frac{\log T}{T_w}} \left((X+1)^2 + Y\right), \text{ and } (X+1)^2 + Y \leq \frac{\epsilon_*^2 T_w}{C \log T}\right\}.$$
Similarly, choosing the exploration time as $T_w = \Theta(\log T)$ makes the coefficient $\sqrt{\frac{\log T}{T_w}}$ very small and $\frac{\epsilon_*^2 T_w}{C \log T}$ very large, leading to a constant lower bound on the limiting optimistic probability, $\liminf_{T \to \infty} \pi_T^{\mathrm{opt}} \geq p(1)$.
Table 3.8: Regret and Maximum State Norm in Boeing 747 Flight Control.

Algorithm   Avg. Regret   Top 95% Regret   Top 90% Regret   Avg. max ||x||_2   Top 95%      Top 90%
TSAC        4.58×10^7     1.43×10^5        9.49×10^4        1.23×10^3          1.07×10^2    9.77×10^1
StabL       1.34×10^4     1.05×10^3        9.60×10^3        3.38×10^1          3.14×10^1    2.98×10^1
OFULQ       1.47×10^8     4.19×10^6        9.89×10^5        1.62×10^3          5.21×10^2    2.78×10^2
TS-LQR      5.63×10^11    3.07×10^7        5.33×10^6        6.26×10^4          1.08×10^3    6.39×10^2
In both cases, the optimistic probability achieves a constant lower bound for large enough $T$, as $\pi_T^{\mathrm{opt}} \geq p(1)(1 + o(1))^{-1}$. This result can be interpreted geometrically as follows. As time passes, the estimates of the system become more accurate, in the sense that the confidence region of the estimate shrinks quickly, as controlled by the eigenvalues of $F_t$. Similarly, the high-probability region of the TS samples also shrinks quickly, as controlled by the covariance matrix $F_t$. Therefore, for large enough $T$, the confidence region of the model estimate and the high-probability region of the TS samples become significantly smaller than the surrogate optimistic set $\mathcal{M}_*^{\mathrm{qd}} \cap \mathcal{B}_*$. This size difference effectively reduces the probability of finding a sampled system in $\mathcal{M}_*^{\mathrm{qd}} \cap \mathcal{B}_*$ to the probability of finding a sampled system in the half-space bounded by the tangent space $T_{A_{c,*}}\mathcal{M}_*$.

3.3.4 Numerical Experiments
Finally, we evaluate the performance of TSAC in the longitudinal flight control of a Boeing 747 with linearized dynamics [123]. We compare TSAC with three adaptive control algorithms from the literature that do not require an initial stabilizing policy: (i) OFULQ of Abbasi-Yadkori and Szepesvári [2], (ii) TS-LQR of Abeille and Lazaric [7], and (iii) StabL. We perform 200 independent runs of 200 time steps for each algorithm and report the average, top 95%, and top 90% values of the regret and of the maximum state norm. We present the performance of the best parameter choices for each algorithm. For a fair comparison, we also adopt slow policy updates in OFULQ and TS-LQR. For further details and the experimental results, please refer to [166].
The results are presented in Table 3.8. Notice that TSAC achieves the second-best performance after StabL. As expected, StabL outperforms TSAC since it performs much heavier computations to find the optimistic controller in the confidence set, whereas TSAC samples optimistic parameters only with some fixed probability.
However, TSAC compares favorably against both OFULQ and TS-LQR, making it the best-performing computationally efficient algorithm.