LEARNING AND CONTROL IN LINEAR QUADRATIC REGULATOR (LQR)
3.3 Thompson Sampling-Based Adaptive Control
3.3.2 Theoretical Analysis of TSAC
In this section, we study the theoretical guarantees of TSAC. For simplicity of presentation, we consider Gaussian process noise for the system dynamics. In particular, we assume that there exists a filtration $\mathcal{F}_t$ such that for all $t \ge 0$, $x_t$ and $z_t$ are $\mathcal{F}_t$-measurable and $w_t \,|\, \mathcal{F}_t \sim \mathcal{N}(0, \sigma_w^2 I)$ for some known $\sigma_w > 0$. The following results can be extended to the sub-Gaussian process noise setting, i.e., Assumption 3.1, using the techniques developed in the previous section (see Lemma 3.1 and its proof in Appendix B.1.1). The following theorem states the first order-optimal frequentist regret bound for TS in multidimensional stabilizable LQRs, our main result.
Theorem 3.3 (Regret of TSAC). Suppose Assumption 3.2 holds and set $\tau_0 = 2\gamma^{-1}\log(2\kappa\sqrt{2})$ and $T_0 = \mathrm{poly}\big(\log(1/\delta), \sigma_w^{-1}, n, d, \bar{\alpha}, \gamma^{-1}, \kappa\big)$. Then, for long enough $T$, TSAC achieves the regret
$$R_T = \widetilde{\mathcal{O}}\Big((n+d)^{n+d}\sqrt{T\log(1/\delta)}\Big)$$
with probability at least $1 - 10\delta$, if $T_w = \max\big\{T_0,\; c_1(\sqrt{T\log T})^{1+o(1)}\big\}$ for a constant $c_1 > 0$. Furthermore, if the closed-loop matrix of the optimally controlled underlying system, $A_{c,*} \coloneqq A_* + B_*K_*$, is non-singular, then with probability at least $1 - 10\delta$, TSAC achieves the regret
$$R_T = \widetilde{\mathcal{O}}\Big(\mathrm{poly}(n,d)\sqrt{T\log(1/\delta)}\Big)$$
if $T_w = \max\big\{T_0,\; c_2(\log T)^{1+o(1)}\big\}$ for a constant $c_2 > 0$.
This makes TSAC the first efficient adaptive control algorithm that achieves optimal regret in adaptive control of all LQRs without an initial stabilizing policy. To prove this result, we follow a similar approach to StabL in the previous section and [7], and define the high-probability joint event $E_t = \hat{E}_t \cap \tilde{E}_t \cap \bar{E}_t$, where $\hat{E}_t$ states that the RLS estimate $\hat{\Theta}$ concentrates around $\Theta_*$, $\tilde{E}_t$ states that the sampled parameter $\tilde{\Theta}$ concentrates around $\hat{\Theta}$, and $\bar{E}_t$ states that the state remains bounded, respectively. Conditioned on this event, we decompose the frequentist regret as
$$R_T \mathbb{1}_{E_T} \;\le\; R^{\mathrm{exp}}_{T_w} + R^{\mathrm{RLS}}_T + R^{\mathrm{mart}}_T + R^{\mathrm{TS}}_T + R^{\mathrm{gap}}_T,$$
where $R^{\mathrm{exp}}_{T_w}$ accounts for the regret attained due to improved exploration, $R^{\mathrm{RLS}}_T$ represents the difference between the value function of the true next state and the predicted next state, $R^{\mathrm{mart}}_T$ is a martingale with bounded differences, $R^{\mathrm{TS}}_T$ measures the difference in optimal average expected cost between the true model $\Theta_*$ and the sampled model $\tilde{\Theta}$, and $R^{\mathrm{gap}}_T$ measures the regret due to policy changes. The decomposition and expressions are given in Appendix B.2.3. In the analysis, we bound each term separately (Appendix B.2.4).
Note that $R^{\mathrm{RLS}}_T$ and $R^{\mathrm{mart}}_T$ appear in the regret analysis of StabL due to the algorithmic and problem-setting construction, and thus their bounds follow directly from the prior analysis. Before discussing the further details of the analysis, we first consider the prior works that use TS for adaptive control of LQRs and discuss their shortcomings. We then highlight the challenges in adaptive control of multidimensional stabilizable LQRs using TS and present our approaches to overcome them.
Prior Work on TS-based Adaptive Control and Challenges
For the frequentist regret minimization problem, the state-of-the-art adaptive control algorithm that uses TS is Abeille and Lazaric [7]. They consider "contractive" LQR systems, i.e., $\|A_* + B_*K(\Theta_*)\| < 1$, and provide an $\widetilde{\mathcal{O}}(\sqrt{T})$ regret upper bound for scalar LQRs, i.e., $n = d = 1$. Notice that the set of contractive systems is a small subset of the set $\mathcal{S}$ defined in Assumption 3.2, and the two sets are only equivalent for scalar systems, since $\rho(A_* + B_*K(\Theta_*)) = |A_* + B_*K(\Theta_*)|$. This simplified setting allows them to reduce the regret analysis to the trade-off between
$$R^{\mathrm{TS}}_T = \sum_{t=0}^{T}\big\{J(\tilde{\Theta}_t) - J(\Theta_*)\big\} \quad\text{and}\quad R^{\mathrm{gap}}_T = \sum_{t=0}^{T}\mathbb{E}\Big[x_{t+1}^\top\big(P(\tilde{\Theta}_{t+1}) - P(\tilde{\Theta}_t)\big)x_{t+1} \,\Big|\, \mathcal{F}_t\Big].$$
These regret terms are central in the analysis of several adaptive control algorithms.
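As a concrete illustration of these two quantities, the following is a minimal numerical sketch (not the thesis' code; the cost matrices $Q$, $R$, the sampled model sequence, and the state trajectory are hypothetical inputs) that computes the average cost via the identity $J(\Theta) = \sigma_w^2\operatorname{tr}(P(\Theta))$, used later in this section, and accumulates the summands of $R^{\mathrm{TS}}_T$ and $R^{\mathrm{gap}}_T$:
\begin{verbatim}
import numpy as np
from scipy.linalg import solve_discrete_are

def dare_quantities(A, B, Q, R, sigma_w):
    """DARE solution P(Theta), optimal gain K(Theta) (u = K x), and
    average cost J(Theta) = sigma_w^2 * tr(P(Theta)) for Theta = (A, B)."""
    P = solve_discrete_are(A, B, Q, R)
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, K, sigma_w**2 * np.trace(P)

def regret_terms(sampled_models, true_model, Q, R, xs, sigma_w):
    """Accumulate R^TS (cost gap of sampled models) and the raw summands of
    R^gap (value shift caused by switching P(Theta_t) -> P(Theta_{t+1}))."""
    _, _, J_star = dare_quantities(*true_model, Q, R, sigma_w)
    Ps_Js = [dare_quantities(A, B, Q, R, sigma_w) for A, B in sampled_models]
    R_TS = sum(J - J_star for _, _, J in Ps_Js)
    R_gap = sum(xs[t + 1] @ (Ps_Js[t + 1][0] - Ps_Js[t][0]) @ xs[t + 1]
                for t in range(len(Ps_Js) - 1))   # before conditional expectation
    return R_TS, R_gap
\end{verbatim}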
In the certainty equivalent control approaches, $R^{\mathrm{TS}}_T$ is bounded by the quadratic scaling of the model estimation error after a significantly long exploration with a known stabilizing controller [191, 242]. In optimism-based algorithms such as StabL, $R^{\mathrm{TS}}_T$ is bounded by 0 by design [2, 81]. Similarly, in the Bayesian regret setting, [212] assumes that the underlying parameter $\Theta_*$ comes from a known prior with respect to which the expected regret is computed. This true prior yields $\mathbb{E}[R^{\mathrm{TS}}_T] = 0$ in certain restrictive LQRs. In contrast, the conventional approach in the analysis of $R^{\mathrm{gap}}_T$ is to use lazy policy updates, i.e., $O(\log T)$ policy changes as in StabL, via doubling the determinant of $V_t$ or exponentially increasing epoch durations [48, 85].
On the other hand, Abeille and Lazaric [7] bound $R^{\mathrm{TS}}_T$ by showing that TS samples optimistic parameters, i.e., $\tilde{\Theta}_t$ such that $J(\tilde{\Theta}_t) \le J(\Theta_*)$, with a constant probability, which reduces the regret of non-optimistic steps. Unlike the conventional policy update approaches, the key idea in Abeille and Lazaric [7] is to update the control policy at every time step via TS, which increases the number of optimistic policies during the execution. They show that while this frequent update rule reduces $R^{\mathrm{TS}}_T$, it only results in $R^{\mathrm{gap}}_T = \widetilde{\mathcal{O}}(\sqrt{T})$. However, they were only able to show that this constant probability of optimistic sampling holds for scalar LQRs.
The difficulty in analyzing the probability of optimistic parameter sampling lies in the challenging characterization of the optimistic set. Since $J(\tilde{\Theta}) = \sigma_w^2 \operatorname{tr}(P(\tilde{\Theta}))$, one needs to consider the spectrum of $P(\tilde{\Theta})$ to define optimistic models, which makes the analysis difficult. In particular, decreasing the cost along one direction may result in an increase along other directions. However, for the scalar LQR setting considered in Abeille and Lazaric [7], $J(\tilde{\Theta}) = P(\tilde{\Theta})$, and using standard perturbation results on the DARE suffices. As mentioned in Abeille and Lazaric [7], one can naively consider the surrogate set of models that are optimistic in all directions, i.e., $P(\tilde{\Theta}) \preccurlyeq P(\Theta_*)$. Nevertheless, this would result in a probability that decays linearly in time and does not yield sublinear regret. In this study, we propose new surrogate sets to derive a lower bound on the probability of having optimistic samples and show that TS in fact samples optimistic model parameters with constant probability.
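To make the distinction concrete, the following toy sketch (operating directly on hypothetical DARE solutions $P$; names are illustrative) contrasts optimism in average cost, $J(\tilde{\Theta}) \le J(\Theta_*)$, with the much stronger all-directions surrogate $P(\tilde{\Theta}) \preccurlyeq P(\Theta_*)$:
\begin{verbatim}
import numpy as np

def is_cost_optimistic(P_tilde, P_star, sigma_w):
    # Optimism in average cost: J(Theta) = sigma_w^2 * tr(P(Theta)).
    return sigma_w**2 * np.trace(P_tilde) <= sigma_w**2 * np.trace(P_star)

def is_optimistic_in_all_directions(P_tilde, P_star):
    # Surrogate condition P_tilde <= P_star in the positive semidefinite order:
    # every eigenvalue of P_star - P_tilde must be nonnegative.
    return np.min(np.linalg.eigvalsh(P_star - P_tilde)) >= -1e-12

# A sampled model may lower the trace (cost-optimistic) while increasing the
# cost along one direction, so the direction-wise surrogate check fails.
P_star = np.diag([2.0, 2.0])
P_tilde = np.diag([0.5, 2.5])
print(is_cost_optimistic(P_tilde, P_star, 1.0))          # True
print(is_optimistic_in_all_directions(P_tilde, P_star))  # False
\end{verbatim}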
In designing TS-based adaptive control algorithms for multidimensional stabilizable LQRs, one needs to maintain a bounded state. In bounding the state, Abeille and Lazaric [7] rely on the fact that the underlying system is contractive, $\|\tilde{A} + \tilde{B}K(\tilde{\Theta})\| < 1$. However, under Assumption 3.2, even if the optimal policy of the underlying system is chosen by the learning agent, the closed-loop system may not be contractive, since for a (generally non-symmetric) closed-loop matrix $M$ we only have $\rho(M) \le \|M\|$, and the spectral radius can be strictly smaller than the operator norm. Thus, to avoid the dire consequences of unstable dynamics, TS-based adaptive control algorithms should focus on finite-time stabilization of the system dynamics in the early stages.
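A two-dimensional numerical example (hypothetical closed-loop matrix) of the gap between spectral radius and operator norm that drives this issue: the matrix below is Schur stable, yet far from contractive, so the state can grow substantially before it decays.
\begin{verbatim}
import numpy as np

# A stable closed loop that is NOT contractive: rho < 1 but ||.||_2 >> 1.
A_cl = np.array([[0.5, 10.0],
                 [0.0,  0.5]])
rho = np.max(np.abs(np.linalg.eigvals(A_cl)))   # spectral radius = 0.5
op_norm = np.linalg.norm(A_cl, 2)               # operator norm ~ 10.0
x = np.array([0.0, 1.0])
print(rho, op_norm, np.linalg.norm(A_cl @ x))   # one step: ||x|| grows from 1 to ~10
\end{verbatim}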
Moreover, the lack of contractive closed-loop mappings in stabilizable LQRs prevents the frequent policy changes used in Abeille and Lazaric [7]. From the definition of $(\kappa, \gamma)$-stabilizability, for any stabilizing controller $K'$, we have that $A_* + B_*K' = H'LH'^{-1}$ with $\|L\| < 1$, for some similarity transformation $H'$. Thus, as noted in the analysis of StabL, even if all the policies are stabilizing, changing the policy at every time step could cause couplings of these similarity transformations and result in linear growth of the state over time. Thus, TS-based adaptive control algorithms need to find the balance in the rate of policy updates, so that frequent policy switches are avoided, yet enough optimistic policies are sampled. In light of these observations, our results hinge on the following:
1) Improved exploration that allows fast stabilization of the dynamics;
2) A fixed policy update rule that prevents state blow-up and reduces $R^{\mathrm{gap}}_T$ and $R^{\mathrm{TS}}_T$;
3) A novel result showing that TS samples optimistic model parameters with a constant probability for multidimensional LQRs, which gives a novel bound on $R^{\mathrm{TS}}_T$.
Details of the analysis
The improved exploration along with TS in the early stages allows TSAC to effectively explore the state space in all directions. The following shows that, for a long enough improved exploration phase, TSAC achieves consistent model estimates and guarantees the design of stabilizing policies.
Lemma 3.6 (Model Estimation Error and Stabilizing Policy Design). Suppose Assumption 3.2 holds. For $t \ge 200(n+d)\log(12/\delta)$ time-steps of TS with improved exploration, with probability at least $1 - 2\delta$, TSAC obtains model estimates such that $\|\hat{\Theta}_t - \Theta_*\|_2 \le 7\beta_t(\delta)/(\sigma_w\sqrt{t})$. Moreover, after a TS with improved exploration phase of length $T_w \ge T_0 \coloneqq \mathrm{poly}\big(\log(1/\delta), \sigma_w^{-1}, n, d, \bar{\alpha}, \gamma^{-1}, \kappa\big)$, with probability at least $1 - 3\delta$, TSAC samples controllers $K(\tilde{\Theta}_t)$ such that the closed-loop dynamics on $\Theta_*$ are $(\kappa\sqrt{2}, \gamma/2)$-strongly stable for all $t > T_w$, i.e., there exist $L$ and $H \succ 0$ such that $A_* + B_*K(\tilde{\Theta}_t) = HLH^{-1}$, with $\|L\| \le 1 - \gamma/2$ and $\|H\|\|H^{-1}\| \le \kappa\sqrt{2}$.
The proof and the precise expression of $T_w$ can be found in Appendix B.2.1. In the proof, we show that the inputs $u_t = K(\tilde{\Theta}_t)x_t + \nu_t$ for $\nu_t \sim \mathcal{N}(0, 2\kappa^2\sigma_w^2 I)$ guarantee persistence of excitation with high probability, i.e., the smallest eigenvalue of the design matrix $V_t$ scales linearly over time. Combining this result with the confidence set construction in (3.8), we derive the first result. Using the first result and the fact that there exists a stabilizing neighborhood around the model parameter $\Theta_*$, such that all the optimal linear controllers of the models within this region stabilize $\Theta_*$, we derive the final result. Due to the early improved exploration, TSAC stabilizes the system dynamics after $T_w$ samples and starts stabilizing adaptive control with only TS. Using the stabilizing controllers for fixed $\tau_0 = 2\gamma^{-1}\log(2\kappa\sqrt{2})$ time-steps, TSAC decays the state magnitude and remedies possible state blow-ups from the first phase. To study the boundedness of the state, define $T_r = T_w + (n+d)\tau_0\log(n+d)$. The following lemma, stated after the sketch below, shows that the state is bounded and well-controlled.
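To make the first-phase mechanics concrete, here is a minimal, self-contained simulation sketch (not the thesis' implementation; the system matrices, the gain $K$, and all hyperparameters are placeholders) of the improved-exploration inputs $u_t = Kx_t + \nu_t$ with $\nu_t \sim \mathcal{N}(0, 2\kappa^2\sigma_w^2 I)$, the regularized least-squares estimate of $[A_*, B_*]^\top$, and the growth of $\lambda_{\min}(V_t)$:
\begin{verbatim}
import numpy as np

def improved_exploration(A, B, K, T_w, sigma_w=1.0, kappa=2.0, lam=1.0, seed=0):
    """Simulate the improved-exploration phase and return the RLS estimate of
    Theta = [A, B]^T together with the history of lambda_min(V_t)."""
    rng = np.random.default_rng(seed)
    n, d = B.shape
    x = np.zeros(n)
    V = lam * np.eye(n + d)                   # regularized design matrix V_t
    S = np.zeros((n + d, n))                  # running sum of z_t x_{t+1}^T
    lam_min = []
    for _ in range(T_w):
        nu = rng.normal(0.0, np.sqrt(2.0) * kappa * sigma_w, size=d)
        u = K @ x + nu                        # policy + isotropic perturbation
        z = np.concatenate([x, u])
        x_next = A @ x + B @ u + rng.normal(0.0, sigma_w, size=n)
        V += np.outer(z, z)
        S += np.outer(z, x_next)
        lam_min.append(np.min(np.linalg.eigvalsh(V)))   # should grow ~linearly in t
        x = x_next
    Theta_hat = np.linalg.solve(V, S)         # RLS estimate of [A, B]^T
    return Theta_hat, np.array(lam_min)
\end{verbatim}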
Lemma 3.7 (Bounded states). Suppose Assumption 3.2 holds. For given $T_w$ and $T_r$, TSAC controls the state such that $\|x_t\| = O\big((n+d)^{n+d}\big)$ for $t \le T_r$, with probability at least $1 - 3\delta$, and $\|x_t\| \le (12\kappa^2 + 2\kappa\sqrt{2})\gamma^{-1}\sigma_w\sqrt{2n\log(n(t - T_w)/\delta)}$ for $T \ge t > T_r$, with probability at least $1 - 4\delta$.
This result is a trivial extension of Lemma 3.5 for StabL, since rejection sampling guarantees that the sampled model is an element of $\mathcal{S}$ and is therefore $(\kappa, \gamma)$-stabilizable by its corresponding optimal controller, i.e., $1 - \gamma \ge \max_{t \le T}\rho\big(\tilde{A}_t + \tilde{B}_tK(\tilde{\Theta}_t)\big)$. Using this fact and following the proof of Lemma 3.5 in Appendix B.1.3, one can show that for $t \le T_r$, deploying the same policy for $\tau_0$ time-steps in the first phase maintains a well-controlled state except for $n + d$ time-steps, under the high-probability event $\hat{E}_t \cap \tilde{E}_t$. For bounding the state after $t > T_r$, the proof of Lemma 3.5 applies directly, such that after $(n+d)\log(n+d)$ policy updates, the state is well-controlled and brought to equilibrium. This result shows that the joint event $E_t = \hat{E}_t \cap \tilde{E}_t \cap \bar{E}_t$ holds with probability at least $1 - 4\delta$ for all $t \le T$.
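For the second phase, the following schematic sketch (with simplified ingredients: the confidence-ellipsoid radius $\beta$, the perturbation form $\hat{\Theta} + \beta V^{-1/2}\eta$, and the stability test standing in for membership in $\mathcal{S}$ are assumptions, not the thesis' exact construction) illustrates the rejection sampling step and the DARE-based controller that would then be held fixed for $\tau_0$ time-steps:
\begin{verbatim}
import numpy as np
from scipy.linalg import solve_discrete_are, sqrtm

def sample_model_and_gain(Theta_hat, V, beta, n, d, Q, R, rng, max_tries=1000):
    """Rejection-sample Theta_tilde = Theta_hat + beta * V^{-1/2} eta until its
    optimal closed loop is stable; return the accepted model and gain K."""
    V_inv_sqrt = np.real(sqrtm(np.linalg.inv(V)))
    for _ in range(max_tries):
        eta = rng.normal(size=Theta_hat.shape)            # Theta_hat: (n + d) x n
        Theta_tilde = Theta_hat + beta * V_inv_sqrt @ eta
        A_t, B_t = Theta_tilde[:n].T, Theta_tilde[n:].T   # recover (A, B) from [A, B]^T
        try:
            P = solve_discrete_are(A_t, B_t, Q, R)
        except Exception:
            continue                                      # reject: DARE has no solution
        K = -np.linalg.solve(R + B_t.T @ P @ B_t, B_t.T @ P @ A_t)
        if np.max(np.abs(np.linalg.eigvals(A_t + B_t @ K))) < 1.0:
            return Theta_tilde, K                         # accepted sample
    raise RuntimeError("rejection sampling did not find a stabilizing sample")

# In the TS phase, the accepted gain K would be applied for tau_0 consecutive
# time-steps before a new sample is drawn (the fixed policy update rule above).
\end{verbatim}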
Conditioned on this event, we analyze the regret terms individually (Appendix B.2.4). We show that, with probability at least $1 - \delta$, $R^{\mathrm{exp}}_{T_w}$ yields $\widetilde{\mathcal{O}}\big((n+d)^{n+d}T_w\big)$ regret due to the isotropic perturbations. $R^{\mathrm{RLS}}_T$ and $R^{\mathrm{mart}}_T$ are $\widetilde{\mathcal{O}}\big((n+d)^{n+d}\sqrt{T_r} + \mathrm{poly}(n,d)\sqrt{T - T_r}\big)$ with probability at least $1 - \delta$, due to standard arguments based on the event $E_T$. More importantly, conditioned on the event $E_T$, we prove that $R^{\mathrm{gap}}_T = \widetilde{\mathcal{O}}\big((n+d)^{n+d}\sqrt{T_r} + \mathrm{poly}(n,d)\sqrt{T - T_r}\big)$ with probability at least $1 - 2\delta$, and $R^{\mathrm{TS}}_T = \widetilde{\mathcal{O}}\big(\sigma T_w + \mathrm{poly}(n,d)\sqrt{T - T_w}\big)$ with probability at least $1 - 2\delta$; the analyses of these last two terms require several novel fundamental results.
To bound $R^{\mathrm{gap}}_T$, we extend the results in Abeille and Lazaric [7] to multidimensional stabilizable LQRs and incorporate the slow update rule and the early improved exploration. We show that while TSAC enjoys a well-controlled state with polynomial dimension dependency in the regret due to slow policy updates, it also maintains the desirable $\widetilde{\mathcal{O}}(\sqrt{T})$ regret of frequent updates with only a constant $\tau_0$ scaling. As discussed before, bounding $R^{\mathrm{TS}}_T$ requires selecting optimistic models with constant probability, which has been an open problem in the literature for multidimensional systems. In this study, we provide a solution to this problem and show that TS indeed selects optimistic model parameters with a constant probability for multidimensional LQRs. The precise statement of this result and its proof outline are given in Section 3.3.3. Leveraging this result, we derive the upper bound on $R^{\mathrm{TS}}_T$. Combining all these terms yields the regret upper bound of TSAC given in Theorem 3.3.
3.3.3 Proof Outline of Sampling Optimistic Models with Constant Probability