


LEARNING AND CONTROL IN LINEAR QUADRATIC REGULATOR (LQR)

3.3 Thompson Sampling-Based Adaptive Control

3.3.2 Theoretical Analysis of TSAC

In this section, we study the theoretical guarantees of TSAC. For simplicity of presentation, we consider Gaussian process noise for the system dynamics. In particular, we assume that there exists a filtration $\mathcal{F}_t$ such that for all $t \geq 0$, $x_t, z_t$ are $\mathcal{F}_t$-measurable and $w_t \mid \mathcal{F}_t \sim \mathcal{N}(0, \sigma_w^2 I)$ for some known $\sigma_w > 0$. The following results can be extended to the sub-Gaussian process noise setting, i.e., Assumption 3.1, using the techniques developed in the previous section (see Lemma 3.1 and its proof in Appendix B.1.1). The following theorem states the first order-optimal frequentist regret bound for TS in multidimensional stabilizable LQRs, which is our main result.

Theorem 3.3 (Regret of TSAC). Suppose Assumption 3.2 holds and set $\tau_0 = 2\gamma^{-1}\log(2\kappa\sqrt{2})$ and $T_0 = \mathrm{poly}(\log(1/\delta), \sigma_w^{-1}, n, d, \bar{\alpha}, \gamma^{-1}, \kappa)$. Then, for long enough $T$, TSAC achieves the regret $R_T = \tilde{\mathcal{O}}\big((n+d)^{n+d}\sqrt{T\log(1/\delta)}\big)$ with probability at least $1 - 10\delta$, if $T_w = \max\{T_0,\, c_1(\sqrt{T}\log T)^{1+o(1)}\}$ for a constant $c_1 > 0$. Furthermore, if the closed-loop matrix of the optimally controlled underlying system, $A_{c,*} := A_* + B_*K_*$, is non-singular, i.e., $A_*$ is non-singular, then with probability at least $1 - 10\delta$, TSAC achieves the regret $R_T = \tilde{\mathcal{O}}\big(\mathrm{poly}(n,d)\sqrt{T\log(1/\delta)}\big)$ if $T_w = \max\{T_0,\, c_2(\log T)^{1+o(1)}\}$ for a constant $c_2 > 0$.

This makes TSAC the first efficient adaptive control algorithm that achieves optimal regret in adaptive control of all LQRs without an initial stabilizing policy. To prove this result, we follow a similar approach to StabL in the previous section and [7], and define the high-probability joint event $E_t = \hat{E}_t \cap \tilde{E}_t \cap \bar{E}_t$, where $\hat{E}_t$ states that the RLS estimate $\hat{\Theta}$ concentrates around $\Theta_*$, $\tilde{E}_t$ states that the sampled parameter $\tilde{\Theta}$ concentrates around $\hat{\Theta}$, and $\bar{E}_t$ states that the state remains bounded. Conditioned on this event, we decompose the frequentist regret as

$$R_T\,\mathbb{1}_{E_T} \;\leq\; R^{\mathrm{exp}}_{T_w} + R^{\mathrm{RLS}}_T + R^{\mathrm{mart}}_T + R^{\mathrm{TS}}_T + R^{\mathrm{gap}}_T,$$

where $R^{\mathrm{exp}}_{T_w}$ accounts for the regret attained due to improved exploration, $R^{\mathrm{RLS}}_T$ represents the difference between the value function of the true next state and the predicted next state, $R^{\mathrm{mart}}_T$ is a martingale with bounded differences, $R^{\mathrm{TS}}_T$ measures the difference in optimal average expected cost between the true model $\Theta_*$ and the sampled model $\tilde{\Theta}$, and $R^{\mathrm{gap}}_T$ measures the regret due to policy changes. The decomposition and the exact expressions are given in Appendix B.2.3. In the analysis, we bound each term separately (Appendix B.2.4).

Note that $R^{\mathrm{RLS}}_T$ and $R^{\mathrm{mart}}_T$ appear in the regret analysis of StabL due to the algorithmic and problem-setting construction, and thus their bounds follow directly from the prior analysis. Before discussing the further details of the analysis, we first consider the prior works that use TS for adaptive control of LQRs and discuss their shortcomings. We then highlight the challenges in adaptive control of multidimensional stabilizable LQRs using TS and present our approaches to overcome them.

Prior Work on TS-based Adaptive Control and Challenges

For the frequentist regret minimization problem, the state-of-the-art adaptive control algorithm that uses TS is Abeille and Lazaric [7]. They consider "contractible" LQR systems, i.e., $\|A_* + B_*K(\Theta_*)\| < 1$, and provide an $\tilde{\mathcal{O}}(\sqrt{T})$ regret upper bound for scalar LQRs, i.e., $n = d = 1$. Notice that the set of contractible systems is a small subset of the set $\mathcal{S}$ defined in Assumption 3.2, and the two sets are only equivalent for scalar systems, since then $\rho(A_* + B_*K(\Theta_*)) = |A_* + B_*K(\Theta_*)|$. This simplified setting allows them to reduce the regret analysis to the trade-off between
$$R^{\mathrm{TS}}_T = \sum_{t=0}^{T}\big\{J(\tilde{\Theta}_t) - J(\Theta_*)\big\} \quad\text{and}\quad R^{\mathrm{gap}}_T = \sum_{t=0}^{T} \mathbb{E}\big[x_{t+1}^{\top}\big(P(\tilde{\Theta}_{t+1}) - P(\tilde{\Theta}_t)\big)x_{t+1} \,\big|\, \mathcal{F}_t\big].$$

These regret terms are central in the analysis of several adaptive control algorithms.

In certainty equivalent control approaches, $R^{\mathrm{TS}}_T$ is bounded by the quadratic scaling of the model estimation error after a significantly long exploration with a known stabilizing controller [191, 242]. In optimism-based algorithms such as StabL, $R^{\mathrm{TS}}_T$ is bounded by 0 by design [2, 81]. Similarly, in the Bayesian regret setting, [212] assume that the underlying parameter $\Theta_*$ comes from a known prior with respect to which the expected regret is computed. This true prior yields $\mathbb{E}[R^{\mathrm{TS}}_T] = 0$ in certain restrictive LQRs. In contrast, the conventional approach in the analysis of $R^{\mathrm{gap}}_T$ is to use lazy policy updates, i.e., $O(\log T)$ policy changes as in StabL, via doubling the determinant of $V_t$ or exponentially increasing epoch durations [48, 85].
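
As a concrete illustration (our own sketch, not code from these works), the determinant-doubling trigger behind such lazy updates can be written as follows, where $V_t$ is the regularized design matrix:

```python
# Sketch of the determinant-doubling ("lazy") policy-update trigger used by
# OFU/StabL-style algorithms: recompute the controller only when det(V_t)
# has at least doubled since the last policy update, which yields O(log T)
# policy changes in total. Log-determinants avoid numerical overflow.
import numpy as np

def should_update(V_t: np.ndarray, logdet_at_last_update: float) -> bool:
    """True when log det(V_t) >= log 2 + log det(V) at the last update."""
    _, logdet = np.linalg.slogdet(V_t)
    return logdet >= np.log(2.0) + logdet_at_last_update
```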

On the other hand, Abeille and Lazaric [7] bound $R^{\mathrm{TS}}_T$ by showing that TS samples optimistic parameters $\tilde{\Theta}_t$, i.e., parameters with $J(\tilde{\Theta}_t) \leq J(\Theta_*)$, with a constant probability, which reduces the regret of the non-optimistic steps. Unlike the conventional policy update approaches, the key idea in Abeille and Lazaric [7] is to update the control policy at every time step via TS, which increases the number of optimistic policies during the execution. They show that while this frequent update rule reduces $R^{\mathrm{TS}}_T$, it still results in only $R^{\mathrm{gap}}_T = \tilde{\mathcal{O}}(\sqrt{T})$. However, they were only able to show that this constant probability of optimistic sampling holds for scalar LQRs.

The difficulty in analyzing the probability of sampling an optimistic parameter lies in the challenging characterization of the optimistic set. Since $J(\tilde{\Theta}) = \sigma_w^2\operatorname{tr}(P(\tilde{\Theta}))$, one needs to consider the spectrum of $P(\tilde{\Theta})$ to define optimistic models, which makes the analysis difficult. In particular, decreasing the cost along one direction may result in an increase along other directions. However, for the scalar LQR setting considered in Abeille and Lazaric [7], $J(\tilde{\Theta}) = \sigma_w^2 P(\tilde{\Theta})$ is a scalar, and using standard perturbation results on the DARE suffices. As mentioned in Abeille and Lazaric [7], one can naively consider the surrogate set of models that are optimistic in all directions, i.e., $P(\tilde{\Theta}) \preceq P(\Theta_*)$. Nevertheless, this would result in a probability that decays linearly in time and does not yield sublinear regret. In this study, we propose new surrogate sets to derive a lower bound on the probability of having optimistic samples and show that TS in fact samples optimistic model parameters with constant probability.
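
To make the optimism check concrete, the following minimal sketch (our own construction with toy placeholder matrices, not code from the thesis) evaluates $J(\Theta) = \sigma_w^2\operatorname{tr}(P(\Theta))$ by solving the DARE and tests whether a sampled model is optimistic, i.e., $J(\tilde{\Theta}) \leq J(\Theta_*)$:

```python
# Minimal sketch: evaluate the optimal average cost J(Theta) = sigma_w^2 * tr(P(Theta)),
# where P(Theta) solves the discrete algebraic Riccati equation (DARE), and
# check the optimism condition J(Theta_tilde) <= J(Theta_star).
# All matrices below are toy placeholders; Q, R are the known cost matrices.
import numpy as np
from scipy.linalg import solve_discrete_are

def average_cost(A, B, Q, R, sigma_w):
    """J(Theta) for Theta = (A, B) under i.i.d. N(0, sigma_w^2 I) process noise."""
    P = solve_discrete_are(A, B, Q, R)          # positive semidefinite DARE solution
    return sigma_w**2 * np.trace(P)

def is_optimistic(A_tilde, B_tilde, A_star, B_star, Q, R, sigma_w=1.0):
    """True if the sampled model has a lower optimal average cost than the truth."""
    return (average_cost(A_tilde, B_tilde, Q, R, sigma_w)
            <= average_cost(A_star, B_star, Q, R, sigma_w))

# toy usage
n, d = 2, 1
Q, R = np.eye(n), np.eye(d)
A_star = np.array([[1.0, 0.1], [0.0, 0.9]]); B_star = np.array([[0.0], [1.0]])
A_tilde = 0.95 * A_star; B_tilde = B_star      # a slightly "easier" sampled model
print(is_optimistic(A_tilde, B_tilde, A_star, B_star, Q, R))
```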

In designing TS-based adaptive control algorithms for multidimensional stabilizable LQRs, one needs to maintain a bounded state. In bounding the state, Abeille and Lazaric [7] rely on the fact that the underlying system is contractive, i.e., $\|\tilde{A} + \tilde{B}K(\tilde{\Theta})\| < 1$. However, under Assumption 3.2, even if the learning agent chooses the optimal policy of the underlying system, the closed-loop system may not be contractive: for a general (non-symmetric) square matrix $M$ we only have $\rho(M) \leq \|M\|$, and the gap can be large. Thus, to avoid the dire consequences of unstable dynamics, TS-based adaptive control algorithms should focus on finite-time stabilization of the system dynamics in the early stages.
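
For intuition, the following minimal numpy sketch (our own toy example, not from the thesis) shows a closed-loop matrix whose spectral radius is well below one while its operator norm is above one, so a single step can inflate the state by an order of magnitude even though the system is asymptotically stable:

```python
# Toy illustration: stable (in spectral radius) does not imply contractive.
# The closed-loop matrix below has rho(A_cl) = 0.5 < 1 but ||A_cl|| ~ 10 > 1,
# so the state grows transiently before the eventual geometric decay.
import numpy as np

A_cl = np.array([[0.5, 10.0],
                 [0.0,  0.5]])

rho = max(abs(np.linalg.eigvals(A_cl)))        # spectral radius
op_norm = np.linalg.norm(A_cl, ord=2)          # operator (spectral) norm

x = np.array([0.0, 1.0])
norms = []
for t in range(20):
    norms.append(np.linalg.norm(x))
    x = A_cl @ x                               # noise-free closed-loop rollout

print(f"rho(A_cl) = {rho:.2f}, ||A_cl|| = {op_norm:.2f}")
print("transient peak ||x_t|| =", round(max(norms), 2))
```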

Moreover, the lack of contractive closed-loop mappings in stabilizable LQRs prevents the frequent policy changes used in Abeille and Lazaric [7]. From the definition of $(\kappa, \gamma)$-stabilizability, for any stabilizing controller $K'$ we have $A_* + B_*K' = H' L H'^{-1}$ with $\|L\| < 1$ for some similarity transformation $H'$. Thus, as noted in the analysis of StabL, even if all the policies are stabilizing, changing the policy at every time step can couple these similarity transformations and result in linear growth of the state over time. Thus, TS-based adaptive control algorithms need to strike a balance in the rate of policy updates, so that frequent policy switches are avoided, yet enough optimistic policies are sampled. In light of these observations, our results hinge on the following:

1) Improved exploration that allows fast stabilization of the dynamics;

2) A fixed policy update rule that prevents state blow-up and reduces $R^{\mathrm{gap}}_T$ and $R^{\mathrm{TS}}_T$;

3) A novel result that shows TS samples optimistic model parameters with a constant probability for multidimensional LQRs and gives a novel bound on $R^{\mathrm{TS}}_T$.

Details of the Analysis

The improved exploration along with TS in the early stages allows TSAC to effectively explore the state space in all directions. The following lemma shows that, for a long enough improved exploration phase, TSAC achieves consistent model estimates and guarantees the design of stabilizing policies.

Lemma 3.6 (Model Estimation Error and Stabilizing Policy Design). Suppose Assumption 3.2 holds. After $t \geq 200(n+d)\log(12/\delta)$ time steps of TS with improved exploration, with probability at least $1 - 2\delta$, TSAC obtains model estimates such that $\|\hat{\Theta}_t - \Theta_*\|_2 \leq 7\beta_t(\delta)/(\sigma_w\sqrt{t})$. Moreover, after a TS-with-improved-exploration phase of length $T_w \geq T_0 := \mathrm{poly}(\log(1/\delta), \sigma_w^{-1}, n, d, \bar{\alpha}, \gamma^{-1}, \kappa)$, with probability at least $1 - 3\delta$, TSAC samples controllers $K(\tilde{\Theta}_t)$ such that the closed-loop dynamics on $\Theta_*$ are $(\kappa\sqrt{2}, \gamma/2)$-strongly stable for all $t > T_w$, i.e., there exist $L$ and $H \succ 0$ such that $A_* + B_*K(\tilde{\Theta}_t) = H L H^{-1}$, with $\|L\| \leq 1 - \gamma/2$ and $\|H\|\|H^{-1}\| \leq \kappa\sqrt{2}$.
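
As a concrete reading of this certificate, the small helper below (our illustration; how a valid transformation $H$ is constructed is part of the proof in Appendix B.2.1) checks $(\kappa\sqrt{2}, \gamma/2)$-strong stability of a given closed-loop matrix for a candidate $H$:

```python
# Check the strong-stability certificate of Lemma 3.6 for a given closed-loop
# matrix A_cl and a candidate transformation H: A_cl = H L H^{-1} with
# H positive definite, ||L|| <= 1 - gamma/2 and ||H|| * ||H^{-1}|| <= kappa*sqrt(2).
import numpy as np

def is_strongly_stable(A_cl, H, kappa, gamma):
    if np.any(np.linalg.eigvalsh((H + H.T) / 2.0) <= 0):
        return False                                   # H must be positive definite
    H_inv = np.linalg.inv(H)
    L = H_inv @ A_cl @ H                               # similarity-transformed closed loop
    norm_ok = np.linalg.norm(L, 2) <= 1.0 - gamma / 2.0
    cond_ok = np.linalg.norm(H, 2) * np.linalg.norm(H_inv, 2) <= kappa * np.sqrt(2.0)
    return norm_ok and cond_ok
```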

The proof and the precise expression of $T_w$ can be found in Appendix B.2.1. In the proof, we show that the inputs $u_t = K(\tilde{\Theta}_i)x_t + \nu_t$, with $\nu_t \sim \mathcal{N}(0, 2\kappa^2\sigma_w^2 I)$, guarantee persistence of excitation with high probability, i.e., the smallest eigenvalue of the design matrix $V_t$ scales linearly over time (a toy simulation of this phase is sketched after this paragraph). Combining this result with the confidence-set construction in (3.8), we derive the first claim. Using the first claim and the fact that there exists a stabilizing neighborhood around the model parameter $\Theta_*$, such that the optimal linear controllers of all models within this region stabilize $\Theta_*$, we derive the second claim. Due to the early improved exploration, TSAC stabilizes the system dynamics after $T_w$ samples and then runs stabilizing adaptive control with TS only. By keeping each stabilizing controller for a fixed $\tau_0 = 2\gamma^{-1}\log(2\kappa\sqrt{2})$ time steps, TSAC decays the state magnitude and remedies possible state blow-ups from the first phase. To study the boundedness of the state, define $T_r = T_w + (n+d)\tau_0\log(n+d)$; Lemma 3.7, stated after the sketch, shows that the state is bounded and well controlled.
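
The following is a minimal toy simulation (ours; all matrices and constants are placeholders, not the thesis' implementation) of the improved-exploration inputs and of the resulting growth of $\lambda_{\min}(V_t)$:

```python
# Toy simulation of the improved-exploration phase: inputs u_t = K x_t + nu_t
# with isotropic perturbations nu_t ~ N(0, 2 kappa^2 sigma_w^2 I), tracking the
# design matrix V_t = lam*I + sum_s z_s z_s^T for z_s = (x_s, u_s).
# Persistence of excitation appears as lambda_min(V_t) growing linearly in t.
import numpy as np
from scipy.linalg import solve_discrete_are

rng = np.random.default_rng(0)
n, d = 3, 2
kappa, sigma_w = 2.0, 0.5                      # assumed constants for this toy run
A = np.array([[0.9, 0.3, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 0.7]])                # toy "true" dynamics
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
Q, R = np.eye(n), np.eye(d)

# stand-in for K(Theta_tilde_i): here, the optimal gain of the toy model itself
P = solve_discrete_are(A, B, Q, R)
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # convention u = K x

T, lam = 2000, 1.0
V = lam * np.eye(n + d)
x = np.zeros(n)
for t in range(1, T + 1):
    nu = rng.normal(0.0, np.sqrt(2.0) * kappa * sigma_w, size=d)
    u = K @ x + nu                             # improved-exploration input
    z = np.concatenate([x, u])
    V += np.outer(z, z)                        # design-matrix update
    x = A @ x + B @ u + rng.normal(0.0, sigma_w, size=n)

print("lambda_min(V_T) / T =", np.linalg.eigvalsh(V).min() / T)
```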

Lemma 3.7 (Bounded states). Suppose Assumption 3.2 holds. For the given $T_w$ and $T_r$, TSAC controls the state such that $\|x_t\| = O\big((n+d)^{n+d}\big)$ for $t \leq T_r$, with probability at least $1 - 3\delta$, and $\|x_t\| \leq (12\kappa^2 + 2\kappa\sqrt{2})\,\gamma^{-1}\sigma_w\sqrt{2n\log(n(t - T_w)/\delta)}$ for $T \geq t > T_r$, with probability at least $1 - 4\delta$.

This result is a straightforward extension of Lemma 3.5 for StabL, since rejection sampling (sketched below) guarantees that the sampled model is an element of $\mathcal{S}$ and is therefore $(\kappa, \gamma)$-stabilizable by its corresponding optimal controller, so that $1 - \gamma \geq \max_{t \leq T}\rho(\tilde{A}_t + \tilde{B}_t K(\tilde{\Theta}_t))$. Using this fact and following the proof of Lemma 3.5 in Appendix B.1.3, one can show that for $t \leq T_r$, deploying the same policy for $\tau_0$ time steps in the first phase maintains a well-controlled state except for $n + d$ time steps, under the high-probability event $\hat{E}_t \cap \tilde{E}_t$. For bounding the state after $t > T_r$, the proof of Lemma 3.5 applies directly: after $(n+d)\log(n+d)$ policy updates, the state is well controlled and brought to equilibrium. This result shows that the joint event $E_t = \hat{E}_t \cap \tilde{E}_t \cap \bar{E}_t$ holds with probability at least $1 - 4\delta$ for all $t \leq T$.
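
For concreteness, here is a minimal sketch of such a rejection-sampling step (our simplification: we take $\Theta = [A\ B] \in \mathbb{R}^{n\times(n+d)}$, shape the perturbation by $\beta V^{-1/2}$, and use only the spectral-radius test quoted above; the set $\mathcal{S}$ in the thesis carries additional conditions):

```python
# Sketch of Thompson sampling with rejection: perturb the RLS estimate within
# the confidence ellipsoid and keep the candidate only if its optimal controller
# gives rho(A_tilde + B_tilde K(Theta_tilde)) <= 1 - gamma. The Theta layout,
# the gain convention u = K x, and this simplified membership test are our
# illustrative assumptions, not the thesis' exact construction.
import numpy as np
from scipy.linalg import solve_discrete_are

def dare_gain(A, B, Q, R):
    """Optimal LQR gain with convention u = K x, closed loop A + B K."""
    P = solve_discrete_are(A, B, Q, R)
    return -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def sample_stabilizable_model(Theta_hat, V, beta, n, gamma, Q, R, rng,
                              max_tries=500):
    """Reject candidates until one passes the simplified stabilizability test."""
    evals, evecs = np.linalg.eigh(V)
    V_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T    # symmetric V^{-1/2}
    for _ in range(max_tries):
        eta = rng.standard_normal(Theta_hat.shape)           # n x (n+d) Gaussian
        Theta_tilde = Theta_hat + beta * eta @ V_inv_sqrt
        A_t, B_t = Theta_tilde[:, :n], Theta_tilde[:, n:]
        try:
            K = dare_gain(A_t, B_t, Q, R)
        except (np.linalg.LinAlgError, ValueError):
            continue                                         # DARE failed: reject
        if max(abs(np.linalg.eigvals(A_t + B_t @ K))) <= 1.0 - gamma:
            return Theta_tilde, K                            # accepted sample
    raise RuntimeError("no admissible sample within max_tries")
```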

Conditioned on this event, we analyze the regret terms individually (Appendix B.2.4). We show that, with probability at least $1 - \delta$, $R^{\mathrm{exp}}_{T_w}$ yields $\tilde{\mathcal{O}}\big((n+d)^{n+d} T_w\big)$ regret due to the isotropic perturbations. $R^{\mathrm{RLS}}_T$ and $R^{\mathrm{mart}}_T$ are $\tilde{\mathcal{O}}\big((n+d)^{n+d}\sqrt{T_r} + \mathrm{poly}(n,d)\sqrt{T - T_r}\big)$ with probability at least $1 - \delta$, due to standard arguments based on the event $E_T$. More importantly, conditioned on the event $E_T$, we prove that $R^{\mathrm{gap}}_T = \tilde{\mathcal{O}}\big((n+d)^{n+d}\sqrt{T_r} + \mathrm{poly}(n,d)\sqrt{T - T_r}\big)$ with probability at least $1 - 2\delta$, and $R^{\mathrm{TS}}_T = \tilde{\mathcal{O}}\big(n T_w + \mathrm{poly}(n,d)\sqrt{T - T_w}\big)$ with probability at least $1 - 2\delta$, whose analyses require several novel fundamental results.

To bound $R^{\mathrm{gap}}_T$, we extend the results in Abeille and Lazaric [7] to multidimensional stabilizable LQRs and incorporate the slow update rule and the early improved exploration. We show that while TSAC enjoys a well-controlled state with polynomial dimension dependency in the regret due to slow policy updates, it also maintains the desirable $\tilde{\mathcal{O}}(\sqrt{T})$ regret of frequent updates with only a constant $\tau_0$ scaling. As discussed before, bounding $R^{\mathrm{TS}}_T$ requires selecting optimistic models with constant probability, which has been an open problem in the literature for multidimensional systems. In this study, we provide a solution to this problem and show that TS indeed selects optimistic model parameters with a constant probability for multidimensional LQRs. The precise statement of this result and its proof outline are given in Section 3.3.3. Leveraging this result, we derive the upper bound on $R^{\mathrm{TS}}_T$. Combining all these terms yields the regret upper bound of TSAC given in Theorem 3.3.

3.3.3 Proof Outline of Sampling Optimistic Models with Constant Probability
