

In the document Learning and Control of Dynamical Systems (pages 99-106)

LEARNING AND CONTROL IN LINEAR QUADRATIC REGULATOR (LQR)

3.3 Thompson Sampling-Based Adaptive Control

3.3.3 Proof Outline of Sampling Optimistic Models with Constant Probability

In this section, we provide the precise statement that the probability of sampling an optimistic model is lower bounded by a constant, together with an outline of its proof.

Therefore, we can lower bound the probability of being optimistic as

$$p^{\text{opt}}_t \geq \mathbb{P}\big\{\tilde\Theta_t \in \mathcal{S}^{\text{surr}} \,\big|\, \mathcal{F}^{\text{cnt}}_t, \hat{E}_t\big\} = \mathbb{P}\big\{L(\tilde\Theta_t^\top H_*) \leq L(\Theta_*^\top H_*) \,\big|\, \mathcal{F}^{\text{cnt}}_t, \hat{E}_t\big\}$$
$$\geq \min_{\hat\Theta \in \mathcal{E}^{\text{RLS}}_t} \mathbb{P}_t\big\{L(\hat\Theta^\top H_* + \eta^\top \beta_t V_t^{-1/2} H_*) \leq L(\Theta_*^\top H_*)\big\} \quad (3.10)$$
$$= \min_{\hat\Theta \in \mathcal{E}^{\text{RLS}}_t} \mathbb{P}_t\big\{L(\hat\Theta^\top H_* + \Xi \sqrt{F_t}) \leq L(\Theta_*^\top H_*)\big\}, \quad (3.11)$$

where $\mathbb{P}_t\{\cdot\} := \mathbb{P}\{\cdot \mid \mathcal{F}^{\text{cnt}}_t\}$, $F_t := \beta_t^2 H_*^\top V_t^{-1} H_*$, and $\Xi$ is an $n \times n$ matrix with i.i.d. $\mathcal{N}(0,1)$ entries. Here, (3.10) considers the worst possible estimate within $\mathcal{E}^{\text{RLS}}_t$ and (3.11) applies the whitening transformation.

Reformulation in Terms of Closed-Loop Matrix

In the second step, we reformulate the probability of sampling optimistic parameters in terms of the closed-loop system matrix $\tilde{A}_c := \tilde\Theta^\top H_* = \tilde{A} + \tilde{B}K(\Theta_*)$ of the sampled system $\tilde\Theta = (\tilde{A}, \tilde{B})^\top$ driven by the policy $K(\Theta_*)$. Transitioning to the closed-loop formulation allows tighter bounds on the optimistic probability. To complete this reformulation, we construct an estimation confidence set for the closed-loop system matrix $\hat{A}_c := \hat\Theta^\top H_* = \hat{A} + \hat{B}K(\Theta_*)$ of the RLS-estimated system $\hat\Theta = (\hat{A}, \hat{B})^\top$ and show that the constructed confidence set is a superset of $\mathcal{E}^{\text{RLS}}_t$.

Lemma 3.9 (Closed-loop confidence). Let $F_t(\delta) := \beta_t^2(\delta) H_*^\top V_t^{-1} H_*$. For any $t \geq 0$, define the closed-loop confidence set

$$\mathcal{E}^{\text{cl}}_t(\delta) := \Big\{ \hat\Theta \in \mathbb{R}^{(n+d)\times n} \;\Big|\; \operatorname{tr}\big((\hat\Theta^\top H_* - \Theta_*^\top H_*)\, F_t^{-1}(\delta)\, (\hat\Theta^\top H_* - \Theta_*^\top H_*)^\top\big) \leq 1 \Big\}.$$

Then, for all times $t \geq 0$ and $\delta \in (0,1)$, we have that $\mathcal{E}^{\text{RLS}}_t(\delta) \subseteq \mathcal{E}^{\text{cl}}_t(\delta)$.
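To see why an inclusion of this kind is plausible, suppose $\mathcal{E}^{\text{RLS}}_t$ is the standard RLS ellipsoid $\{\Theta : \operatorname{tr}((\Theta - \Theta_*)^\top V_t (\Theta - \Theta_*)) \leq \beta_t^2\}$ (an assumption here; the exact definition appears earlier in the chapter). Writing $Y = V_t^{1/2}(\Theta - \Theta_*)$ and $\Delta = (\Theta - \Theta_*)^\top H_*$, one gets $\operatorname{tr}(\Delta F_t^{-1}\Delta^\top) = \beta_t^{-2}\operatorname{tr}(Y^\top \Pi Y) \leq \beta_t^{-2}\operatorname{tr}(Y^\top Y) \leq 1$, where $\Pi$ is the orthogonal projection onto the column space of $V_t^{-1/2}H_*$. A Monte Carlo sketch of this inclusion with random illustrative matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, beta = 3, 2, 2.0

M = rng.standard_normal((n + d, n + d))
V = M @ M.T + np.eye(n + d)            # positive definite stand-in for V_t
H = rng.standard_normal((n + d, n))    # full column-rank stand-in for H*
F = beta**2 * H.T @ np.linalg.solve(V, H)
F_inv = np.linalg.inv(F)

ok = True
for _ in range(1000):
    X = rng.standard_normal((n + d, n))
    # Scale X onto the boundary of the RLS ellipsoid: tr(X^T V X) = beta^2.
    X *= beta / np.sqrt(np.trace(X.T @ V @ X))
    D = X.T @ H                        # closed-loop estimation error Delta
    ok &= np.trace(D @ F_inv @ D.T) <= 1.0 + 1e-9
print(ok)   # True: every sampled boundary point also lies in E_t^cl
```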

Note that the definition of $\mathcal{E}^{\text{cl}}_t(\delta)$ only involves the closed-loop matrices $\hat{A}_c := \hat\Theta^\top H_*$ and $A_{c,*} := \Theta_*^\top H_*$. We can use the result of Lemma 3.9 to reformulate the probability of sampling optimistic parameters $\tilde\Theta = (\tilde{A}, \tilde{B})$ as that of sampling optimistic closed-loop system matrices $\tilde{A}_c$. We bound $p^{\text{opt}}_t$ from below as

$$p^{\text{opt}}_t \geq \min_{\hat\Theta \in \mathcal{E}^{\text{cl}}_t} \mathbb{P}_t\big\{L(\hat\Theta^\top H_* + \Xi \sqrt{F_t}) \leq L(A_{c,*})\big\} \quad (3.12)$$
$$= \min_{\hat{A}_c : \|\hat{A}_c^\top - A_{c,*}^\top\|_{F_t^{-1}} \leq 1} \mathbb{P}_t\big\{L(\hat{A}_c + \Xi \sqrt{F_t}) \leq L(A_{c,*})\big\} \quad (3.13)$$
$$= \min_{\hat\Upsilon : \|\hat\Upsilon\|_F \leq 1} \mathbb{P}_t\big\{L(A_{c,*} + \hat\Upsilon \sqrt{F_t} + \Xi \sqrt{F_t}) \leq L(A_{c,*})\big\}, \quad (3.14)$$

where (3.12) is due to Lemma 3.9 and (3.13) follows from the fact that $H_*$ has full column rank. Observe that, in (3.14), $\hat\Upsilon$ is a unit-Frobenius-norm matrix of size $n \times n$ and the term $A_{c,*} + \hat\Upsilon \sqrt{F_t}$ accounts for the confidence ellipsoid of the estimated closed-loop matrix $\hat{A}_c$. The event in (3.14) corresponds to finding the closed-loop matrix $A_{c,*} + (\Xi + \hat\Upsilon)\sqrt{F_t}$ of the TS-sampled system in the sublevel manifold $\mathcal{M}_* := \{A_c \in \mathcal{M}_n \mid L(A_c) \leq L(A_{c,*})\}$, as illustrated in Figure 3.3.

Local Geometry of Optimistic Set under Perturbations

Next, we further simplify the form of the probability in (3.14) by exploiting the local geometric structure of the function $L : A_c \mapsto \sigma_w^2 \sum_{t=0}^{\infty} \|A_c^t\|_{Q_*}^2$, defined over the set of (Schur-)stable matrices $\mathcal{M}_{\text{Schur}} := \{A_c \in \mathcal{M}_n \mid \rho(A_c) < 1\}$. The following lemma characterizes the perturbative properties of $L$.

Lemma 3.10 (Perturbations). The function $L : \mathcal{M}_{\text{Schur}} \to \mathbb{R}_+$ defined as $L(A_c) = \sigma_w^2 \sum_{t=0}^{\infty} \|A_c^t\|_{Q_*}^2$ is smooth in its domain. For any $A_c \in \mathcal{M}_{\text{Schur}}$, there exists $\epsilon > 0$ such that for any perturbation $\|G\|_F \leq \epsilon$, the function $L$ admits a quadratic Taylor expansion as

$$L(A_c + G) = L(A_c) + \nabla L(A_c) \bullet G + \tfrac{1}{2}\, G \bullet \mathcal{H}_{A_c+sG}(G) \quad (3.15)$$

for an $s \in [0,1]$, where $\mathcal{H}_{A_c} : \mathcal{M}_n \to \mathcal{M}_n$ is the Hessian operator evaluated at a point $A_c \in \mathcal{M}_{\text{Schur}}$. In particular, we have that $\nabla L(A_{c,*}) = 2P(\Theta_*)A_{c,*}\Sigma_*$. Furthermore, there exists a constant $r > 0$ such that $\big|G \bullet \mathcal{H}_{A_c+sG}(G)\big| \leq r\|G\|_F^2$ for any $s \in [0,1]$ and $\|G\|_F \leq \epsilon$.
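The gradient formula can be checked numerically. Under the same reading of the norm as above, at any stable $A_c$ the gradient takes the form $\nabla L(A_c) = 2\bar{P} A_c \Sigma$ with $\bar{P} = Q_* + A_c^\top \bar{P} A_c$ and $\Sigma = \sigma_w^2 I + A_c \Sigma A_c^\top$, which at $A_{c,*}$ matches the lemma's $2P(\Theta_*)A_{c,*}\Sigma_*$. A finite-difference sketch with illustrative values (my own derivation of the general form, not from the source):

```python
import numpy as np

def dlyap(A, Q):
    """Solve S = A S A^T + Q by vectorization (row-major vec convention)."""
    n = A.shape[0]
    return np.linalg.solve(np.eye(n * n) - np.kron(A, A), Q.reshape(-1)).reshape(n, n)

def L(Ac, Q, sw2=1.0):
    # L(A_c) = sw2 * tr(Q S), S = sum_t A_c^t (A_c^T)^t
    return sw2 * np.trace(Q @ dlyap(Ac, np.eye(len(Ac))))

Ac = np.array([[0.6, 0.2], [0.1, 0.4]])   # Schur-stable illustrative matrix
Q = np.diag([1.0, 2.0])

P_bar = dlyap(Ac.T, Q)           # P = A^T P A + Q  (closed-loop cost-to-go)
Sigma = dlyap(Ac, np.eye(2))     # Sigma = A Sigma A^T + I  (sigma_w = 1)
grad = 2 * P_bar @ Ac @ Sigma    # claimed gradient, cf. grad L = 2 P A_c Sigma

# Central finite differences, entry by entry.
h, fd = 1e-6, np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        E = np.zeros((2, 2)); E[i, j] = h
        fd[i, j] = (L(Ac + E, Q) - L(Ac - E, Q)) / (2 * h)

print(np.allclose(grad, fd, atol=1e-5))   # True
```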

Lemma 3.10 guarantees that if a perturbation is sufficiently small, the perturbed function can be locally expressed as a quadratic function of the perturbation. Since the set of stable matrices $\mathcal{M}_{\text{Schur}}$ is globally non-convex and Taylor's theorem only holds in convex domains, we restrict the perturbations to a ball of radius $\epsilon > 0$. The fact that there is a neighborhood of stable matrices around a matrix $A_c$ enables us to apply Taylor's theorem in this neighborhood.

Figure 3.3: A visual representation of the sublevel manifold $\mathcal{M}_*$. $O$ is the origin and $A_{c,*}$ is the optimal closed-loop system matrix. $T_{A_{c,*}}\mathcal{M}_*$ is the tangent space to the manifold $\mathcal{M}_*$ at the point $A_{c,*}$ and $\nabla L_*$ is the Jacobian of the function $L$ at $A_{c,*}$. $\mathcal{M}_*^{\text{qd}}$ is the sublevel manifold of the quadratic approximation to $L$ and $\mathcal{B}_*$ is a small ball of stable matrices around $A_{c,*}$. The intersection $\mathcal{M}_*^{\text{qd}} \cap \mathcal{B}_*$ is a subset of $\mathcal{M}_*$.

Given the optimal closed-loop system matrix $A_{c,*}$, let $\epsilon_* > 0$ be chosen such that the expansion in (3.15) holds for perturbations $\|G\|_F \leq \epsilon_*$ around $A_{c,*}$. Denote the perturbation due to Thompson sampling and estimation error as $G_t = (\Xi + \hat\Upsilon)\sqrt{F_t}$ and let $\|G_t\|_F \leq \epsilon_*$. Then, we can write

$$L(A_{c,*} + G_t) = L(A_{c,*}) + \nabla L(A_{c,*}) \bullet G_t + \tfrac{1}{2}\, G_t \bullet \mathcal{H}_{A_{c,*}+sG_t}(G_t)$$
$$\leq L(A_{c,*}) + \nabla L(A_{c,*}) \bullet G_t + \tfrac{r_*}{2}\|G_t\|_F^2, \quad (3.16)$$

where $r_* > 0$ is a constant due to Lemma 3.10. Using (3.16), we have the following lower bound on (3.14),

$$p^{\text{opt}}_t \geq \min_{\hat\Upsilon : \|\hat\Upsilon\|_F \leq 1} \mathbb{P}_t\Big\{ \tfrac{r_*}{2}\big\|(\Xi + \hat\Upsilon)F_t^{1/2}\big\|_F^2 + \nabla L_* \bullet (\Xi + \hat\Upsilon)F_t^{1/2} \leq 0, \text{ and } \big\|(\Xi + \hat\Upsilon)F_t^{1/2}\big\|_F \leq \epsilon_* \Big\}, \quad (3.17)$$

where $\nabla L_* := \nabla L(A_{c,*})$. The event in (3.17) corresponds to finding $A_{c,*} + (\Xi + \hat\Upsilon)\sqrt{F_t}$ at the intersection of the stable ball $\mathcal{B}_* := \{A_c \in \mathcal{M}_n \mid \|A_c - A_{c,*}\|_F \leq \epsilon_*\}$ and the sublevel manifold $\mathcal{M}_*^{\text{qd}} := \{A_c \in \mathcal{M}_n \mid \|A_c - A_{c,*} + r_*^{-1}\nabla L_*\|_F \leq \|r_*^{-1}\nabla L_*\|_F\}$, as illustrated in Figure 3.3. The latter description follows by completing the square: $\tfrac{r_*}{2}\|\Delta\|_F^2 + \nabla L_* \bullet \Delta \leq 0$ is equivalent to $\|\Delta + r_*^{-1}\nabla L_*\|_F \leq \|r_*^{-1}\nabla L_*\|_F$ for $\Delta = A_c - A_{c,*}$.

The intersection $\mathcal{M}_*^{\text{qd}} \cap \mathcal{B}_* \subset \mathcal{M}_*$ serves as another surrogate for the sublevel manifold $\mathcal{M}_*$. Switching to the new surrogate $\mathcal{M}_*^{\text{qd}}$ helps us overcome the difficulty of working with the intractable and complicated geometry of $\mathcal{M}_*$, which stems from the infinite sum in $L(A_c)$. We can utilize techniques relating to Gaussian probabilities, as the geometry of $\mathcal{M}_*^{\text{qd}}$ is described by a quadratic form.

Final Bound

Equipped with the preceding results, we can bound the optimism probability tractably from below by the probability of a TS-sampled closed-loop system matrix lying inside the intersection $\mathcal{M}_*^{\text{qd}} \cap \mathcal{B}_*$ as given in (3.17). By bounding the weighted Frobenius norms in (3.17) from above via $\lambda_{\max,t}$, the maximum eigenvalue of $F_t$, and normalizing by the matrix $\nabla L_* \sqrt{F_t}$, we can write

$$p^{\text{opt}}_t \geq \min_{\|\hat\Upsilon\|_F \leq 1} \mathbb{P}_t\Big\{ \tfrac{r_*}{2}\lambda_{\max,t}\|\Xi + \hat\Upsilon\|_F^2 + (\nabla L_* \sqrt{F_t}) \bullet (\Xi + \hat\Upsilon) \leq 0, \text{ and } \lambda_{\max,t}\|\Xi + \hat\Upsilon\|_F^2 \leq \epsilon_*^2 \Big\}$$
$$= \min_{\|\hat\Upsilon\|_F \leq 1} \mathbb{P}_t\bigg\{ \frac{(\nabla L_* F_t^{1/2}) \bullet (\Xi + \hat\Upsilon)}{\|\nabla L_* F_t^{1/2}\|_F} \leq -\frac{\lambda_{\max,t}\, r_*\, \|\Xi + \hat\Upsilon\|_F^2}{2\|\nabla L_* F_t^{1/2}\|_F}, \text{ and } \|\Xi + \hat\Upsilon\|_F^2 \leq \frac{\epsilon_*^2}{\lambda_{\max,t}} \bigg\}. \quad (3.18)$$

Observe that the inner product $(\nabla L_* F_t^{1/2}) \bullet \hat\Upsilon$ is maximized over $\|\hat\Upsilon\|_F \leq 1$ by $\hat\Upsilon_\# := \nabla L_* F_t^{1/2} / \|\nabla L_* F_t^{1/2}\|_F$. Since the probability distribution of $\|\Xi + \hat\Upsilon\|_F^2$ is invariant under orthogonal transformations of $\Xi$ and $\hat\Upsilon$, (3.18) also attains its minimum at $\hat\Upsilon_\#$. Thus, we can rewrite (3.18) as

$$p^{\text{opt}}_t \geq \mathbb{P}_t\bigg\{ \frac{(\nabla L_* F_t^{1/2}) \bullet \Xi}{\|\nabla L_* F_t^{1/2}\|_F} + 1 \leq -\frac{\lambda_{\max,t}\, r_*}{2\|\nabla L_* F_t^{1/2}\|_F}\|\Xi + \hat\Upsilon_\#\|_F^2, \text{ and } \|\Xi + \hat\Upsilon_\#\|_F^2 \leq \frac{\epsilon_*^2}{\lambda_{\max,t}} \bigg\}$$
$$= \mathbb{P}_t\bigg\{ \xi + 1 \leq -\frac{\lambda_{\max,t}\, r_*}{2\|\nabla L_* F_t^{1/2}\|_F}\big((\xi + 1)^2 + X\big), \text{ and } (\xi + 1)^2 + X \leq \frac{\epsilon_*^2}{\lambda_{\max,t}} \bigg\}, \quad (3.19)$$

where $\xi \sim \mathcal{N}(0,1)$ and $X \sim \chi^2_{n^2-1}$ are independent standard normal and chi-squared random variables, and (3.19) is derived by rotating $\Xi$ so that its first element lies along the direction of $\nabla L_* F_t^{1/2}$. We use the following lemma to characterize the eigenvalues of $F_t$ and control the lower bound (3.19) on $p^{\text{opt}}_t$.

Lemma 3.11 (Bounded eigenvalues). Suppose $T_w = O((\sqrt{T})^{1+o(1)})$. Denote the minimum and maximum eigenvalues of $F_t$ by $\lambda_{\min,t}$ and $\lambda_{\max,t}$, respectively. Under the event $E_T$, for large enough $T$, we have that $\lambda_{\max,t} \leq C\,\frac{\log T}{T_w}$ and $\frac{\lambda_{\max,t}}{\lambda_{\min,t}} \leq C\,\frac{T\log T}{T_w}$ for any $T_r < t \leq T$, for a constant $C = \mathrm{poly}(n, d, \log(1/\delta))$.

Lemma 3.11 states that the maximum eigenvalue and the condition number of $F_t$ are controlled inversely by the length of the initial exploration phase $T_w$ and proportionally by $\log T$ and $T\log T$, respectively, provided the exploration time is suitably bounded. The length of the initial exploration $T_w$ relative to the horizon $T$ is critical in guaranteeing an asymptotically constant optimistic probability $p^{\text{opt}}_t$. Although a longer initial exploration leads to better convergence to a constant optimistic probability, it also incurs higher asymptotic regret due to the linear scaling of the exploration regret with $T_w$. Using the relation $\|\nabla L_* F_t^{1/2}\|_F \geq \max\big(\sigma_{\min,*}\|F_t^{1/2}\|_F,\ \lambda_{\min,t}^{1/2}\|\nabla L_*\|_F\big)$, where $\sigma_{\min,*}$ is the minimum singular value of $\nabla L_*$, we can further bound (3.19) from below. From Lemma 3.10, we can write $\nabla L_* = 2P(\Theta_*)A_{c,*}\Sigma_*$, where $P(\Theta_*) \succ 0$ is the solution to the DARE in (3.3) and $\Sigma_* = \Sigma(\Theta_*, K_*) \succ 0$ is the stationary state covariance matrix.

Notice that the minimum singular value of $\nabla L_*$ is positive (i.e., $\nabla L_*$ is full rank) if and only if the closed-loop system matrix $A_{c,*}$ is non-singular.

In general, $A_{c,*}$ can be singular. Assuming that $T_w = O((\sqrt{T})^{1+o(1)})$, under the event $E_T$, we can use $\|\nabla L_* F_t^{1/2}\|_F \geq \sqrt{\lambda_{\min,t}}\,\|\nabla L_*\|_F$ to obtain the following lower bound on $p^{\text{opt}}_t$ for $T_r < t \leq T$:

$$p^{\text{opt}}_t \geq \mathbb{P}_t\bigg\{ \xi + 1 \leq -\frac{\sqrt{\lambda_{\max,t}}}{2\rho_*}\sqrt{\frac{\lambda_{\max,t}}{\lambda_{\min,t}}}\big((\xi + 1)^2 + X\big), \text{ and } (\xi + 1)^2 + X \leq \frac{\epsilon_*^2}{\lambda_{\max,t}} \bigg\}$$
$$\geq \mathbb{P}\bigg\{ \xi + 1 \leq -\frac{C}{2\rho_*}\frac{\sqrt{T}\log T}{T_w}\big((\xi + 1)^2 + X\big), \text{ and } (\xi + 1)^2 + X \leq \frac{\epsilon_*^2 T_w}{C\log T} \bigg\},$$

where $\rho_* := \|r_*^{-1}\nabla L_*\|_F$. Choosing the exploration time as $T_w = \omega(\sqrt{T}\log T)$ makes the coefficient $\frac{\sqrt{T}\log T}{T_w} = o(1)$ vanishingly small and the radius $\frac{\epsilon_*^2 T_w}{C\log T}$ arbitrarily large, leading to a constant lower bound on the limiting optimistic probability, $\liminf_{T\to\infty} p^{\text{opt}}_T \geq \mathbb{P}\{\xi + 1 \leq 0\} =: Q(1)$.

On the other hand, if $A_{c,*}$ is non-singular, then we can use the alternative bound $\|\nabla L_* \sqrt{F_t}\|_F \geq \sigma_{\min,*}\|\sqrt{F_t}\|_F \geq \sigma_{\min,*}\sqrt{\lambda_{\max,t}}$ to obtain the following lower bound for $T_r < t \leq T$:

$$p^{\text{opt}}_t \geq \mathbb{P}_t\bigg\{ \xi + 1 \leq -\frac{\sqrt{\lambda_{\max,t}}}{2\sigma_{\min,*}}\big((\xi + 1)^2 + X\big), \text{ and } (\xi + 1)^2 + X \leq \frac{\epsilon_*^2}{\lambda_{\max,t}} \bigg\}$$
$$\geq \mathbb{P}\bigg\{ \xi + 1 \leq -\frac{\sqrt{C}}{2\sigma_{\min,*}}\sqrt{\frac{\log T}{T_w}}\big((\xi + 1)^2 + X\big), \text{ and } (\xi + 1)^2 + X \leq \frac{\epsilon_*^2 T_w}{C\log T} \bigg\}.$$

Similarly, choosing the exploration time as $T_w = \omega(\log T)$ makes the coefficient $\sqrt{\frac{\log T}{T_w}} = o(1)$ vanishingly small and $\frac{T_w}{\log T} = \omega(1)$ arbitrarily large, leading to a constant lower bound on the limiting optimistic probability, $\liminf_{T\to\infty} p^{\text{opt}}_T \geq Q(1)$.
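The limiting behavior can be illustrated numerically: as the coefficient in front of $(\xi+1)^2 + X$ vanishes and the radius constraint loosens, the probability of the event in (3.19) approaches $\mathbb{P}\{\xi + 1 \leq 0\} = \Phi(-1) \approx 0.159$. A Monte Carlo sketch with illustrative constants (not from the source), taking $n = 2$ so that $X \sim \chi^2_{n^2-1} = \chi^2_3$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 500_000, 2

xi = rng.standard_normal(N)               # xi ~ N(0, 1)
X = rng.chisquare(n * n - 1, size=N)      # X ~ chi^2_{n^2 - 1}

def p_opt_lb(c, R):
    """Monte Carlo estimate of P{ xi+1 <= -c((xi+1)^2 + X) and (xi+1)^2 + X <= R }."""
    q = (xi + 1) ** 2 + X
    return np.mean((xi + 1 <= -c * q) & (q <= R))

# Shrinking coefficient c and growing radius R mimic increasing T_w:
for c, R in [(0.3, 5.0), (0.03, 50.0), (0.0003, 5000.0)]:
    print(round(p_opt_lb(c, R), 3))

# The estimates increase toward Phi(-1) ~= 0.159, the constant Q(1) in the text.
```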

Table 3.8: Regret and Maximum State Norm in Boeing 747 Flight Control.

Algorithm | Regret: Average | Regret: Top 95% | Regret: Top 90% | max ||x||_2: Average | max ||x||_2: Top 95% | max ||x||_2: Top 90%
TSAC | 4.58Γ—10^7 | 1.43Γ—10^5 | 9.49Γ—10^4 | 1.23Γ—10^3 | 1.07Γ—10^2 | 9.77Γ—10^1
StabL | 1.34Γ—10^4 | 1.05Γ—10^3 | 9.60Γ—10^3 | 3.38Γ—10^1 | 3.14Γ—10^1 | 2.98Γ—10^1
OFULQ | 1.47Γ—10^8 | 4.19Γ—10^6 | 9.89Γ—10^5 | 1.62Γ—10^3 | 5.21Γ—10^2 | 2.78Γ—10^2
TS-LQR | 5.63Γ—10^11 | 3.07Γ—10^7 | 5.33Γ—10^6 | 6.26Γ—10^4 | 1.08Γ—10^3 | 6.39Γ—10^2

In both cases, the optimistic probability achieves a constant lower bound for large enough $T$, namely $p^{\text{opt}}_T \geq Q(1)(1+o(1))^{-1}$. This result can be interpreted geometrically as follows. As time passes, the estimates of the system become more accurate, in the sense that the confidence region of the estimate shrinks quickly, at a rate controlled by the eigenvalues of $F_t$. Similarly, the high-probability region of the TS samples also shrinks quickly, controlled by the covariance matrix $F_t$. Therefore, for large enough $T$, the confidence region of the model estimate and the high-probability region of the TS samples become significantly smaller than the surrogate optimistic set $\mathcal{M}_*^{\text{qd}} \cap \mathcal{B}_*$. This size difference effectively reduces the probability of finding a sampled system in $\mathcal{M}_*^{\text{qd}} \cap \mathcal{B}_*$ to the probability of finding a sampled system in the half-space separated by the tangent space $T_{A_{c,*}}\mathcal{M}_*$.

3.3.4 Numerical Experiments

Finally, we evaluate the performance of TSAC in the longitudinal flight control of a Boeing 747 with linearized dynamics [123]. We compare TSAC with three adaptive control algorithms from the literature that do not require an initial stabilizing policy: (i) OFULQ of Abbasi-Yadkori and SzepesvΓ‘ri [2], (ii) TS-LQR of Abeille and Lazaric [7], and (iii) StabL. We perform 200 independent runs of 200 time steps for each algorithm and report the average, top 95%, and top 90% regret and maximum state norm. We present the performance of the best parameter choice for each algorithm. For a fair comparison, we also adopt slow policy updates in OFULQ and TS-LQR. For further details and the complete experimental results, please refer to [166].

The results are presented in Table 3.8. Notice that TSAC achieves the second-best performance after StabL. As expected, StabL outperforms TSAC since it performs much heavier computations to find the optimistic controller in the confidence set, whereas TSAC samples optimistic parameters only with some fixed probability.

However, TSAC compares favorably against both OFULQ and TS-LQR, making it the best-performing computationally efficient algorithm.
