LEARNING AND CONTROL IN LINEAR QUADRATIC REGULATOR (LQR)
3.3 Thompson Sampling-Based Adaptive Control
3.3.2 Theoretical Analysis of TSAC
In this section, we study the theoretical guarantees of TSAC. For simplicity of presentation, we consider Gaussian process noise for the system dynamics. In particular, we assume that there exists a filtration $\mathcal{F}_t$ such that for all $t \ge 0$, $x_t$ and $z_t$ are $\mathcal{F}_t$-measurable and $w_t \,|\, \mathcal{F}_t \sim \mathcal{N}(0, \sigma_w^2 I)$ for some known $\sigma_w > 0$. The following results can be extended to the sub-Gaussian process noise setting, i.e., Assumption 3.1, using the techniques developed in the previous section (see Lemma 3.1 and its proof in Appendix B.1.1). The following theorem states the first order-optimal frequentist regret bound for TS in multidimensional stabilizable LQRs, our main result.
Theorem 3.3 (Regret of TSAC). Suppose Assumption 3.2 holds and set $\tau_0 = 2\gamma^{-1}\log(2\kappa\sqrt{2})$ and $T_0 = \mathrm{poly}\big(\log(1/\delta), \sigma_w^{-1}, n, d, \bar{\alpha}, \gamma^{-1}, \kappa\big)$. Then, for long enough $T$, TSAC achieves the regret
$$R_T = \widetilde{\mathcal{O}}\Big((n+d)^{n+d}\sqrt{T\log(1/\delta)}\Big)$$
with probability at least $1 - 10\delta$, if $T_w = \max\big\{T_0,\; c_1(\sqrt{T\log T})^{1+o(1)}\big\}$ for a constant $c_1 > 0$. Furthermore, if the closed-loop matrix of the optimally controlled underlying system, $A_{c,*} \coloneqq A_* + B_*K_*$, is non-singular, then with probability at least $1 - 10\delta$, TSAC achieves the regret
$$R_T = \widetilde{\mathcal{O}}\Big(\mathrm{poly}(n,d)\sqrt{T\log(1/\delta)}\Big)$$
if $T_w = \max\big\{T_0,\; c_2(\log T)^{1+o(1)}\big\}$ for a constant $c_2 > 0$.
This makes TSAC the first efficient adaptive control algorithm that achieves optimal regret in adaptive control of all LQRs without an initial stabilizing policy. To prove this result, we follow a similar approach to StabL in the previous section and [7], and define the high-probability joint event $E_t = \hat{E}_t \cap \tilde{E}_t \cap \bar{E}_t$, where $\hat{E}_t$ states that the RLS estimate $\hat{\Theta}$ concentrates around $\Theta_*$, $\tilde{E}_t$ states that the sampled parameter $\tilde{\Theta}$ concentrates around $\hat{\Theta}$, and $\bar{E}_t$ states that the state remains bounded, respectively. Conditioned on this event, we decompose the frequentist regret as
$$R_T \mathbb{1}_{E_T} \;\le\; R^{\mathrm{exp}}_{T_w} + R^{\mathrm{RLS}}_T + R^{\mathrm{mart}}_T + R^{\mathrm{TS}}_T + R^{\mathrm{gap}}_T,$$
where $R^{\mathrm{exp}}_{T_w}$ accounts for the regret attained due to improved exploration, $R^{\mathrm{RLS}}_T$ represents the difference between the value function of the true next state and the predicted next state, $R^{\mathrm{mart}}_T$ is a martingale with bounded differences, $R^{\mathrm{TS}}_T$ measures the difference in optimal average expected cost between the true model $\Theta_*$ and the sampled model $\tilde{\Theta}$, and $R^{\mathrm{gap}}_T$ measures the regret due to policy changes. The decomposition and expressions are given in Appendix B.2.3. In the analysis, we bound each term separately (Appendix B.2.4).
Note that $R^{\mathrm{RLS}}_T$ and $R^{\mathrm{mart}}_T$ appear in the regret analysis of StabL due to the algorithmic and problem-setting construction, and thus their bounds follow directly from the prior analysis. Before discussing the further details of the analysis, we first consider the prior works that use TS for adaptive control of LQRs and discuss their shortcomings. We then highlight the challenges in adaptive control of multidimensional stabilizable LQRs using TS and present our approaches to overcome them.
Prior Work on TS-based Adaptive Control and Challenges
For the frequentist regret minimization problem, the state-of-the-art adaptive control algorithm that uses TS is Abeille and Lazaric [7]. They consider "contractive" LQR systems, i.e., $\|A_* + B_*K(\Theta_*)\| < 1$, and provide an $\widetilde{\mathcal{O}}(\sqrt{T})$ regret upper bound for scalar LQRs, i.e., $n = d = 1$. Notice that the set of contractive systems is a small subset of the set $\mathcal{S}$ defined in Assumption 3.2, and the two sets are only equivalent for scalar systems, since $\rho(A_* + B_*K(\Theta_*)) = |A_* + B_*K(\Theta_*)|$. This simplified setting allows them to reduce the regret analysis to the trade-off between
$$R^{\mathrm{TS}}_T = \sum_{t=0}^{T}\big\{J(\tilde{\Theta}_t) - J(\Theta_*)\big\} \quad\text{and}\quad R^{\mathrm{gap}}_T = \sum_{t=0}^{T}\mathbb{E}\Big[x_{t+1}^\top\big(P(\tilde{\Theta}_{t+1}) - P(\tilde{\Theta}_t)\big)x_{t+1} \,\Big|\, \mathcal{F}_t\Big].$$
These regret terms are central in the analysis of several adaptive control algorithms.
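As a concrete illustration of these two quantities, the following is a minimal numerical sketch (not the thesis' code; the cost matrices $Q$, $R$, the sampled model sequence, and the state trajectory are hypothetical inputs) that computes the average cost via the identity $J(\Theta) = \sigma_w^2\operatorname{tr}(P(\Theta))$, used later in this section, and accumulates the summands of $R^{\mathrm{TS}}_T$ and $R^{\mathrm{gap}}_T$:
\begin{verbatim}
import numpy as np
from scipy.linalg import solve_discrete_are

def dare_quantities(A, B, Q, R, sigma_w):
    """DARE solution P(Theta), optimal gain K(Theta) (u = K x), and
    average cost J(Theta) = sigma_w^2 * tr(P(Theta)) for Theta = (A, B)."""
    P = solve_discrete_are(A, B, Q, R)
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, K, sigma_w**2 * np.trace(P)

def regret_terms(sampled_models, true_model, Q, R, xs, sigma_w):
    """Accumulate R^TS (cost gap of sampled models) and the raw summands of
    R^gap (value shift caused by switching P(Theta_t) -> P(Theta_{t+1}))."""
    _, _, J_star = dare_quantities(*true_model, Q, R, sigma_w)
    Ps_Js = [dare_quantities(A, B, Q, R, sigma_w) for A, B in sampled_models]
    R_TS = sum(J - J_star for _, _, J in Ps_Js)
    R_gap = sum(xs[t + 1] @ (Ps_Js[t + 1][0] - Ps_Js[t][0]) @ xs[t + 1]
                for t in range(len(Ps_Js) - 1))   # before conditional expectation
    return R_TS, R_gap
\end{verbatim}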
In the certainty equivalent control approaches, $R^{\mathrm{TS}}_T$ is bounded by the quadratic scaling of the model estimation error after a significantly long exploration with a known stabilizing controller [191, 242]. In optimism-based algorithms such as StabL, $R^{\mathrm{TS}}_T$ is bounded by 0 by design [2, 81]. Similarly, in the Bayesian regret setting, [212] assumes that the underlying parameter $\Theta_*$ comes from a known prior with respect to which the expected regret is computed. This true prior yields $\mathbb{E}[R^{\mathrm{TS}}_T] = 0$ in certain restrictive LQRs. In contrast, the conventional approach in the analysis of $R^{\mathrm{gap}}_T$ is to use lazy policy updates, i.e., $O(\log T)$ policy changes as in StabL, via doubling the determinant of $V_t$ or exponentially increasing epoch durations [48, 85].
On the other hand, Abeille and Lazaric [7] bound $R^{\mathrm{TS}}_T$ by showing that TS samples optimistic parameters, i.e., $\tilde{\Theta}_t$ such that $J(\tilde{\Theta}_t) \le J(\Theta_*)$, with a constant probability, which reduces the regret of non-optimistic steps. Unlike the conventional policy update approaches, the key idea in Abeille and Lazaric [7] is to update the control policy at every time step via TS, which increases the number of optimistic policies during the execution. They show that while this frequent update rule reduces $R^{\mathrm{TS}}_T$, it only results in $R^{\mathrm{gap}}_T = \widetilde{\mathcal{O}}(\sqrt{T})$. However, they were only able to show that this constant probability of optimistic sampling holds for scalar LQRs.
The difficulty in analyzing the probability of optimistic parameter sampling lies in the challenging characterization of the optimistic set. Since $J(\tilde{\Theta}) = \sigma_w^2 \operatorname{tr}(P(\tilde{\Theta}))$, one needs to consider the spectrum of $P(\tilde{\Theta})$ to define optimistic models, which makes the analysis difficult. In particular, decreasing the cost along one direction may result in an increase along other directions. However, for the scalar LQR setting considered in Abeille and Lazaric [7], $J(\tilde{\Theta}) = P(\tilde{\Theta})$, and using standard perturbation results on the DARE suffices. As mentioned in Abeille and Lazaric [7], one can naively consider the surrogate set of models that are optimistic in all directions, i.e., $P(\tilde{\Theta}) \preccurlyeq P(\Theta_*)$. Nevertheless, this would result in a probability that decays linearly in time and does not yield sublinear regret. In this study, we propose new surrogate sets to derive a lower bound on the probability of having optimistic samples and show that TS in fact samples optimistic model parameters with constant probability.
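To make the distinction concrete, the following toy sketch (operating directly on hypothetical DARE solutions $P$; names are illustrative) contrasts optimism in average cost, $J(\tilde{\Theta}) \le J(\Theta_*)$, with the much stronger all-directions surrogate $P(\tilde{\Theta}) \preccurlyeq P(\Theta_*)$:
\begin{verbatim}
import numpy as np

def is_cost_optimistic(P_tilde, P_star, sigma_w):
    # Optimism in average cost: J(Theta) = sigma_w^2 * tr(P(Theta)).
    return sigma_w**2 * np.trace(P_tilde) <= sigma_w**2 * np.trace(P_star)

def is_optimistic_in_all_directions(P_tilde, P_star):
    # Surrogate condition P_tilde <= P_star in the positive semidefinite order:
    # every eigenvalue of P_star - P_tilde must be nonnegative.
    return np.min(np.linalg.eigvalsh(P_star - P_tilde)) >= -1e-12

# A sampled model may lower the trace (cost-optimistic) while increasing the
# cost along one direction, so the direction-wise surrogate check fails.
P_star = np.diag([2.0, 2.0])
P_tilde = np.diag([0.5, 2.5])
print(is_cost_optimistic(P_tilde, P_star, 1.0))          # True
print(is_optimistic_in_all_directions(P_tilde, P_star))  # False
\end{verbatim}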
In designing TS-based adaptive control algorithms for multidimensional stabilizable LQRs, one needs to maintain a bounded state. In bounding the state, Abeille and Lazaric [7] rely on the fact that the underlying system is contractive, $\|\tilde{A} + \tilde{B}K(\tilde{\Theta})\| < 1$. However, under Assumption 3.2, even if the optimal policy of the underlying system is chosen by the learning agent, the closed-loop system may not be contractive, since for a (generally non-symmetric) closed-loop matrix $M$ we only have $\rho(M) \le \|M\|$, and the spectral radius can be strictly smaller than the operator norm. Thus, to avoid the dire consequences of unstable dynamics, TS-based adaptive control algorithms should focus on finite-time stabilization of the system dynamics in the early stages.
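A two-dimensional numerical example (hypothetical closed-loop matrix) of the gap between spectral radius and operator norm that drives this issue: the matrix below is Schur stable, yet far from contractive, so the state can grow substantially before it decays.
\begin{verbatim}
import numpy as np

# A stable closed loop that is NOT contractive: rho < 1 but ||.||_2 >> 1.
A_cl = np.array([[0.5, 10.0],
                 [0.0,  0.5]])
rho = np.max(np.abs(np.linalg.eigvals(A_cl)))   # spectral radius = 0.5
op_norm = np.linalg.norm(A_cl, 2)               # operator norm ~ 10.0
x = np.array([0.0, 1.0])
print(rho, op_norm, np.linalg.norm(A_cl @ x))   # one step: ||x|| grows from 1 to ~10
\end{verbatim}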
Moreover, the lack of contractive closed-loop mappings in stabilizable LQRs prevents the frequent policy changes used in Abeille and Lazaric [7]. From the definition of $(\kappa, \gamma)$-stabilizability, for any stabilizing controller $K'$, we have that $A_* + B_*K' = H'LH'^{-1}$ with $\|L\| < 1$, for some similarity transformation $H'$. Thus, as noted in the analysis of StabL, even if all the policies are stabilizing, changing the policy at every time step could cause couplings of these similarity transformations and result in linear growth of the state over time. Thus, TS-based adaptive control algorithms need to find the balance in the rate of policy updates, so that frequent policy switches are avoided, yet enough optimistic policies are sampled. In light of these observations, our results hinge on the following:
1) Improved exploration that allows fast stabilization of the dynamics;
2) A fixed policy update rule that prevents state blow-up and reduces $R^{\mathrm{gap}}_T$ and $R^{\mathrm{TS}}_T$;
3) A novel result showing that TS samples optimistic model parameters with a constant probability for multidimensional LQRs, which gives a novel bound on $R^{\mathrm{TS}}_T$.
Details of the analysis
The improved exploration along with TS in the early stages allows TSAC to effectively explore the state space in all directions. The following shows that, for a long enough improved exploration phase, TSAC achieves consistent model estimates and guarantees the design of stabilizing policies.
Lemma 3.6 (Model Estimation Error and Stabilizing Policy Design). Suppose Assumption 3.2 holds. For $t \ge 200(n+d)\log(12/\delta)$ time-steps of TS with improved exploration, with probability at least $1 - 2\delta$, TSAC obtains model estimates such that $\|\hat{\Theta}_t - \Theta_*\|_2 \le 7\beta_t(\delta)/(\sigma_w\sqrt{t})$. Moreover, after a TS with improved exploration phase of length $T_w \ge T_0 \coloneqq \mathrm{poly}\big(\log(1/\delta), \sigma_w^{-1}, n, d, \bar{\alpha}, \gamma^{-1}, \kappa\big)$, with probability at least $1 - 3\delta$, TSAC samples controllers $K(\tilde{\Theta}_t)$ such that the closed-loop dynamics on $\Theta_*$ are $(\kappa\sqrt{2}, \gamma/2)$-strongly stable for all $t > T_w$, i.e., there exist $L$ and $H \succ 0$ such that $A_* + B_*K(\tilde{\Theta}_t) = HLH^{-1}$, with $\|L\| \le 1 - \gamma/2$ and $\|H\|\|H^{-1}\| \le \kappa\sqrt{2}$.
The proof and the precise expression of $T_w$ can be found in Appendix B.2.1. In the proof, we show that the inputs $u_t = K(\tilde{\Theta}_t)x_t + \nu_t$ for $\nu_t \sim \mathcal{N}(0, 2\kappa^2\sigma_w^2 I)$ guarantee persistence of excitation with high probability, i.e., the smallest eigenvalue of the design matrix $V_t$ scales linearly over time. Combining this result with the confidence set construction in (3.8), we derive the first result. Using the first result and the fact that there exists a stabilizing neighborhood around the model parameter $\Theta_*$, such that all the optimal linear controllers of the models within this region stabilize $\Theta_*$, we derive the final result. Due to the early improved exploration, TSAC stabilizes the system dynamics after $T_w$ samples and starts stabilizing adaptive control with only TS. Using the stabilizing controllers for fixed $\tau_0 = 2\gamma^{-1}\log(2\kappa\sqrt{2})$ time-steps, TSAC decays the state magnitude and remedies possible state blow-ups from the first phase. To study the boundedness of the state, define $T_r = T_w + (n+d)\tau_0\log(n+d)$. The following lemma, stated after the sketch below, shows that the state is bounded and well-controlled.
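To make the first-phase mechanics concrete, here is a minimal, self-contained simulation sketch (not the thesis' implementation; the system matrices, the gain $K$, and all hyperparameters are placeholders) of the improved-exploration inputs $u_t = Kx_t + \nu_t$ with $\nu_t \sim \mathcal{N}(0, 2\kappa^2\sigma_w^2 I)$, the regularized least-squares estimate of $[A_*, B_*]^\top$, and the growth of $\lambda_{\min}(V_t)$:
\begin{verbatim}
import numpy as np

def improved_exploration(A, B, K, T_w, sigma_w=1.0, kappa=2.0, lam=1.0, seed=0):
    """Simulate the improved-exploration phase and return the RLS estimate of
    Theta = [A, B]^T together with the history of lambda_min(V_t)."""
    rng = np.random.default_rng(seed)
    n, d = B.shape
    x = np.zeros(n)
    V = lam * np.eye(n + d)                   # regularized design matrix V_t
    S = np.zeros((n + d, n))                  # running sum of z_t x_{t+1}^T
    lam_min = []
    for _ in range(T_w):
        nu = rng.normal(0.0, np.sqrt(2.0) * kappa * sigma_w, size=d)
        u = K @ x + nu                        # policy + isotropic perturbation
        z = np.concatenate([x, u])
        x_next = A @ x + B @ u + rng.normal(0.0, sigma_w, size=n)
        V += np.outer(z, z)
        S += np.outer(z, x_next)
        lam_min.append(np.min(np.linalg.eigvalsh(V)))   # should grow ~linearly in t
        x = x_next
    Theta_hat = np.linalg.solve(V, S)         # RLS estimate of [A, B]^T
    return Theta_hat, np.array(lam_min)
\end{verbatim}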
Lemma 3.7 (Bounded states). Suppose Assumption 3.2 holds. For given $T_w$ and $T_r$, TSAC controls the state such that $\|x_t\| = O\big((n+d)^{n+d}\big)$ for $t \le T_r$, with probability at least $1 - 3\delta$, and $\|x_t\| \le (12\kappa^2 + 2\kappa\sqrt{2})\gamma^{-1}\sigma_w\sqrt{2n\log(n(t - T_w)/\delta)}$ for $T \ge t > T_r$, with probability at least $1 - 4\delta$.
This result is a trivial extension of Lemma 3.5 for StabL, since rejection sampling guarantees that the sampled model is an element of $\mathcal{S}$ and is therefore $(\kappa, \gamma)$-stabilizable by its corresponding optimal controller, i.e., $1 - \gamma \ge \max_{t \le T}\rho\big(\tilde{A}_t + \tilde{B}_tK(\tilde{\Theta}_t)\big)$. Using this fact and following the proof of Lemma 3.5 in Appendix B.1.3, one can show that for $t \le T_r$, deploying the same policy for $\tau_0$ time-steps in the first phase maintains a well-controlled state except for $n + d$ time-steps, under the high-probability event $\hat{E}_t \cap \tilde{E}_t$. For bounding the state after $t > T_r$, the proof of Lemma 3.5 applies directly, such that after $(n+d)\log(n+d)$ policy updates, the state is well-controlled and brought to equilibrium. This result shows that the joint event $E_t = \hat{E}_t \cap \tilde{E}_t \cap \bar{E}_t$ holds with probability at least $1 - 4\delta$ for all $t \le T$.
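For the second phase, the following schematic sketch (with simplified ingredients: the confidence-ellipsoid radius $\beta$, the perturbation form $\hat{\Theta} + \beta V^{-1/2}\eta$, and the stability test standing in for membership in $\mathcal{S}$ are assumptions, not the thesis' exact construction) illustrates the rejection sampling step and the DARE-based controller that would then be held fixed for $\tau_0$ time-steps:
\begin{verbatim}
import numpy as np
from scipy.linalg import solve_discrete_are, sqrtm

def sample_model_and_gain(Theta_hat, V, beta, n, d, Q, R, rng, max_tries=1000):
    """Rejection-sample Theta_tilde = Theta_hat + beta * V^{-1/2} eta until its
    optimal closed loop is stable; return the accepted model and gain K."""
    V_inv_sqrt = np.real(sqrtm(np.linalg.inv(V)))
    for _ in range(max_tries):
        eta = rng.normal(size=Theta_hat.shape)            # Theta_hat: (n + d) x n
        Theta_tilde = Theta_hat + beta * V_inv_sqrt @ eta
        A_t, B_t = Theta_tilde[:n].T, Theta_tilde[n:].T   # recover (A, B) from [A, B]^T
        try:
            P = solve_discrete_are(A_t, B_t, Q, R)
        except Exception:
            continue                                      # reject: DARE has no solution
        K = -np.linalg.solve(R + B_t.T @ P @ B_t, B_t.T @ P @ A_t)
        if np.max(np.abs(np.linalg.eigvals(A_t + B_t @ K))) < 1.0:
            return Theta_tilde, K                         # accepted sample
    raise RuntimeError("rejection sampling did not find a stabilizing sample")

# In the TS phase, the accepted gain K would be applied for tau_0 consecutive
# time-steps before a new sample is drawn (the fixed policy update rule above).
\end{verbatim}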
Conditioned on this event, we analyze the regret terms individually (Appendix B.2.4). We show that, with probability at least $1 - \delta$, $R^{\mathrm{exp}}_{T_w}$ yields $\widetilde{\mathcal{O}}\big((n+d)^{n+d}T_w\big)$ regret due to the isotropic perturbations. $R^{\mathrm{RLS}}_T$ and $R^{\mathrm{mart}}_T$ are $\widetilde{\mathcal{O}}\big((n+d)^{n+d}\sqrt{T_r} + \mathrm{poly}(n,d)\sqrt{T - T_r}\big)$ with probability at least $1 - \delta$, due to standard arguments based on the event $E_T$. More importantly, conditioned on the event $E_T$, we prove that $R^{\mathrm{gap}}_T = \widetilde{\mathcal{O}}\big((n+d)^{n+d}\sqrt{T_r} + \mathrm{poly}(n,d)\sqrt{T - T_r}\big)$ with probability at least $1 - 2\delta$, and $R^{\mathrm{TS}}_T = \widetilde{\mathcal{O}}\big(\sigma T_w + \mathrm{poly}(n,d)\sqrt{T - T_w}\big)$ with probability at least $1 - 2\delta$; the analyses of these last two terms require several novel fundamental results.
To bound $R^{\mathrm{gap}}_T$, we extend the results in Abeille and Lazaric [7] to multidimensional stabilizable LQRs and incorporate the slow update rule and the early improved exploration. We show that while TSAC enjoys a well-controlled state with polynomial dimension dependency in the regret due to slow policy updates, it also maintains the desirable $\widetilde{\mathcal{O}}(\sqrt{T})$ regret of frequent updates with only a constant $\tau_0$ scaling. As discussed before, bounding $R^{\mathrm{TS}}_T$ requires selecting optimistic models with constant probability, which has been an open problem in the literature for multidimensional systems. In this study, we provide a solution to this problem and show that TS indeed selects optimistic model parameters with a constant probability for multidimensional LQRs. The precise statement of this result and its proof outline are given in Section 3.3.3. Leveraging this result, we derive the upper bound on $R^{\mathrm{TS}}_T$. Combining all these terms yields the regret upper bound of TSAC given in Theorem 3.3.
3.3.3 Proof Outline of Sampling Optimistic Models with Constant Probability