LEARNING AND CONTROL IN LINEAR QUADRATIC REGULATOR (LQR)
3.1 Motivation and Background
The learning and control problem in LQRs has been studied in an array of prior works [1, 7, 8, 56, 81, 85, 191, 242]. These works provide finite-time performance guarantees of adaptive control algorithms in terms of regret. In particular, they show that $\tilde{O}(\sqrt{T})$ regret after $T$ time steps is optimal in adaptive control of LQRs.
They utilize several different paradigms for algorithm design, such as Certainty Equivalence, Optimism, or Thompson Sampling; yet they suffer either from inherent algorithmic drawbacks or from limited applicability in practice. Table 3.1 gives an overall comparison of these works in terms of their setting, the requirement of stabilizing controllers, and computational complexities. In the following, we compare these works in more detail.
Certainty equivalent control: Certainty equivalent control (CEC) is one of the most straightforward paradigms for control design in adaptive control of dynamical systems. In CEC, an agent obtains a nominal estimate of the system, and executes the optimal control law for this estimated system (Figure 1.1). Even though Mania et al. [191] and Simchowitz and Foster [242] show that this simple approach attains order-optimal regret in LQRs, the proposed algorithms have several drawbacks.
First and foremost, CEC is sensitive to model mismatch and requires model estimation errors small enough that further exploration of the system dynamics is not needed. Since this level of refinement is challenging to obtain for an unknown system, these methods rely on access to an initial stabilizing controller to enable a long exploration phase. In practice, such a priori known controllers may not be available, which hinders the deployment of these algorithms.
Moreover, CEC-based approaches follow the given non-adaptive initial stabilizing policy for a long period of time with isotropic perturbations. Thus, they provide an order-optimal theoretical regret upper bound, but with an additional large constant regret term. In many applications, such as medical settings, such constant regret and non-adaptive controllers are not tolerable. Our methods StabL and TSAC aim to address these challenges and provide learning and control algorithms that can be deployed in practice. As we will show in Sections 3.2.3 and 3.3.4, they achieve significantly improved performance over prior baseline RL algorithms in various adaptive control tasks.
Optimism in the face of uncertainty (OFU) principle: As we have seen in Chapter 2, one of the most prominent methods to effectively balance exploration and exploitation is the optimism in the face of uncertainty (OFU) principle [156]. An agent that follows the OFU principle deploys the optimal policy of the model with the lowest optimal cost within the set of plausible models (Figure 1.1). This guarantees asymptotic convergence to the optimal policy for the LQR [30]. Using the OFU principle, the learning algorithms of [2, 86] attain order-optimal $O(\sqrt{T})$ regret after $T$ time steps, but their regret upper bounds suffer from an exponential dependence on the LQR model dimensions.
This is due to the fact that the OFU principle relies heavily on confidence-set constructions. An agent following the OFU principle mostly explores parts of the state space with the lowest expected cost and with higher uncertainty. When the agent does not have reliable model estimates, this may cause a lack of exploration in parts of the state space that are important for designing stabilizing policies. This problem becomes more evident in the early stages of agent-environment interactions due to the lack of reliable knowledge about the system. Note that this issue is unique to control problems and not as common in other RL settings, e.g., bandits and gameplay. With the early improved exploration strategy of StabL, we alleviate this problem, achieve fast stabilization, and thus attain an $O(\sqrt{T})$ regret upper bound with polynomial dimension dependency.
Thompson Sampling: Thompson Sampling (TS) is one of the oldest strategies to balance the exploration vs. exploitation trade-off [263]. In TS, the agent samples a model from a distribution computed based on prior control input and observation pairs, and then takes the optimal action for this sampled model and updates the
distribution based on its novel observation (Figure 1.1). Since it relies solely on sampling, this approach provides polynomial-time algorithms for adaptive control.
Therefore, it is a promising alternative to overcome the possible computational burden of optimism-based methods, since the latter solve a non-convex optimization problem to find the optimistic controllers, which is an NP-hard problem in general [9].
For this reason, [6, 7] propose adaptive control algorithms using TS. In particular, Abeille and Lazaric [7] provide the first TS-based adaptive control algorithm for LQRs that attains the optimal regret of $\tilde{O}(\sqrt{T})$. However, their result only holds for scalar stabilizable systems, since they were able to show that TS samples optimistic parameters with constant probability only in scalar systems. Further, they conjecture that this holds in multidimensional systems as well, so that TS-based adaptive control can provide optimal regret in multidimensional LQRs, and they provide a simple numerical example to support this claim. In the analysis of TSAC, we derive a new lower bound which in fact proves that TS samples optimistic parameters with constant probability for all stabilizable LQRs. Further, we design TSAC with the required algorithmic improvements to attain $\tilde{O}(\sqrt{T})$ regret in multidimensional stabilizable LQRs.
Finding a stabilizing controller: Similar to regret minimization, there has been growing interest in finite-time stabilization of linear dynamical systems [72, 82, 84]. Among these works, Faradonbeh et al. [82] is the closest to our study. However, there are significant differences in the methods and the scope of the results. In Faradonbeh et al. [82], random linear controllers are used solely to find a stabilizing set, without a control goal. This results in the state blowing up, presumably exponentially in time, leading to a regret that scales exponentially in time. The proposed method provides many insightful aspects of finding a stabilizing set in finite time, yet neither a cost analysis of this process nor an adaptive control policy is provided. Moreover, the stabilizing set in [82] relates to the minimum value that satisfies a specific condition on the roots of a polynomial, which results in a somewhat implicit sample complexity for constructing such a set. On the other hand, in this chapter, we provide a complete study of an autonomous learning and control algorithm for the online LQR problem. Among our results, we give an explicit formulation of the stabilizing set and a sample complexity that relates only to the minimal stabilizability information of the system.
Generalized LQR setting: Another line of research considers generalizations of the online LQR problem under partial observability [160–162, 191, 245] or adversarial disturbances [56, 108]. These works either assume a given stabilizing controller or open-loop stable system dynamics, except Chen and Hazan [56]. Independently and concurrently, the recent work by Chen and Hazan [56] designs an autonomous learning algorithm with regret guarantees similar to those of StabL and TSAC. However, the approaches and the settings have major differences. Chen and Hazan [56] consider the restrictive setting of controllable systems, yet with adversarial disturbances and general cost functions. They inject significantly large inputs, exponential in the system parameters, with a pure exploration intent to guarantee the recovery of the system parameters and stabilization. This negatively affects the practicality of the algorithm. On the other hand, in StabL and TSAC, we inject isotropic Gaussian perturbations to improve exploration in the stochastic (sub-Gaussian noise) stabilizable LQR while still aiming to control, i.e., there is no pure exploration phase.
This yields practical RL algorithms that attain state-of-the-art performance.
Notation
We denote the Euclidean norm of a vector $x$ as $\|x\|_2$. For a matrix $A \in \mathbb{R}^{n \times d}$, we denote $\rho(A)$ as its spectral radius, $\|A\|_F$ as its Frobenius norm, and $\|A\|$ as its spectral norm; $\mathrm{tr}(A)$ denotes its trace and $A^\top$ its transpose. For any positive definite matrix $V$, $\|A\|_V = \|V^{1/2} A\|_F$. For matrices $A, B \in \mathbb{R}^{n \times d}$, $A \bullet B = \mathrm{tr}(A B^\top)$ denotes their Frobenius inner product. The $j$-th singular value of a rank-$r$ matrix $A$ is $\sigma_j(A)$, where $\sigma_{\max}(A) := \sigma_1(A) \geq \ldots \geq \sigma_{\min}(A) := \sigma_r(A)$. $I$ represents the identity matrix of the appropriate dimension. $\mathcal{M}_n = \mathbb{R}^{n \times n}$ denotes the set of $n$-dimensional square matrices. $\mathcal{N}(\mu, \Sigma)$ denotes the normal distribution with mean $\mu$ and covariance $\Sigma$. $Q(\cdot)$ denotes the Gaussian $Q$-function. $O(\cdot)$ and $\Omega(\cdot)$ denote the standard asymptotic notation, and $f(n) = O(g(n))$ is equivalent to $g(n) = \Omega(f(n))$. $\tilde{O}(\cdot)$ denotes the order up to logarithmic terms.
Problem Setting
Consider a discrete-time linear time-invariant system,
$$x_{t+1} = A_* x_t + B_* u_t + w_t, \qquad (3.1)$$
where $x_t \in \mathbb{R}^n$ is the state of the system, $u_t \in \mathbb{R}^d$ is the control input, and $w_t \in \mathbb{R}^n$ is the process noise at time $t$. We consider systems with sub-Gaussian noise.
Assumption 3.1 (Sub-Gaussian Noise). The process noise $w_t$ is a martingale difference sequence with respect to the filtration $(\mathcal{F}_{t-1})$. Moreover, it is component-wise conditionally $\sigma_w^2$-sub-Gaussian and isotropic, such that for any $\lambda \in \mathbb{R}$ and each component $j$,
$$\mathbb{E}\big[\exp(\lambda w_{t,j}) \,\big|\, \mathcal{F}_{t-1}\big] \leq \exp\big(\lambda^2 \sigma_w^2 / 2\big), \quad \text{and} \quad \mathbb{E}\big[w_t w_t^\top \,\big|\, \mathcal{F}_{t-1}\big] = \bar{\sigma}_w^2 I \ \text{ for some } \bar{\sigma}_w^2 > 0.$$
Note that the results of this chapter only require the conditional covariance matrix $W = \mathbb{E}[w_t w_t^\top \,|\, \mathcal{F}_{t-1}]$ to be full rank. The isotropic noise assumption is chosen to ease the presentation, and similar results can be obtained with upper and lower bounds on $W$, i.e., $\sigma_{\mathrm{up}} > \lambda_{\max}(W) \geq \lambda_{\min}(W) > \sigma_{\mathrm{low}} > 0$.
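As a minimal illustration of the dynamics in (3.1) under this noise model, the following Python sketch simulates the system with isotropic Gaussian noise (which satisfies Assumption 3.1) under a fixed linear feedback. The matrices $A$, $B$ and the gain $K$ are illustrative placeholders, not quantities from the analysis.

```python
# Minimal simulation sketch of x_{t+1} = A x_t + B u_t + w_t (Eq. 3.1) with
# isotropic Gaussian noise, which satisfies Assumption 3.1. The matrices A, B
# and the feedback gain K are illustrative assumptions, not values from the text.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.1, 0.2],
              [0.0, 0.8]])          # hypothetical A_* (open-loop unstable: eigenvalue 1.1)
B = np.array([[0.0],
              [1.0]])               # hypothetical B_*
K = np.array([[-0.5, -0.4]])        # a hand-picked stabilizing gain, u_t = K x_t
sigma_w = 0.5                       # E[w_t w_t^T | F_{t-1}] = sigma_w^2 * I

T, n = 5_000, A.shape[0]
x = np.zeros(n)
noises = np.zeros((T, n))
for t in range(T):
    u = K @ x                                   # linear state feedback
    w = sigma_w * rng.standard_normal(n)        # isotropic (sub-)Gaussian noise
    x = A @ x + B @ u + w                       # system evolution (3.1)
    noises[t] = w

print("rho(A + B K) =", max(abs(np.linalg.eigvals(A + B @ K))))   # < 1: closed loop stable
print("empirical noise covariance:\n", noises.T @ noises / T)     # ~ sigma_w^2 * I
```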
At each time step $t$, the system is at state $x_t$. After observing $x_t$, the agent applies a control input $u_t$, and the system evolves to $x_{t+1}$ at time $t+1$. At each time step $t$, the agent pays a cost $c_t = x_t^\top Q x_t + u_t^\top R u_t$, where $Q \in \mathbb{R}^{n \times n}$ and $R \in \mathbb{R}^{d \times d}$ are positive definite matrices such that $\|Q\|, \|R\| < \bar{\alpha}$ and $\sigma_{\min}(Q), \sigma_{\min}(R) > \underline{\alpha}$. The problem is to design control inputs based on past observations in order to minimize the average expected cost
$$J_* = \lim_{T \to \infty} \min_{u = [u_1, \ldots, u_T]} \frac{1}{T}\, \mathbb{E}\Big[ \sum_{t=1}^{T} x_t^\top Q x_t + u_t^\top R u_t \Big]. \qquad (3.2)$$
This problem is the canonical example for the control of linear dynamical systems and is termed the linear quadratic regulator (LQR). The system (3.1) can be represented as $x_{t+1} = \Theta_*^\top z_t + w_t$, where $\Theta_*^\top = [A_* \; B_*]$ and $z_t = [x_t^\top \; u_t^\top]^\top$. Knowing $\Theta_*$, the optimal control policy is a linear state feedback control $u_t = K(\Theta_*) x_t$ with $K(\Theta_*) = -(R + B_*^\top P_* B_*)^{-1} B_*^\top P_* A_*$, where $P_*$ is the unique solution to the discrete-time algebraic Riccati equation (DARE) [28]:
$$P_* = A_*^\top P_* A_* + Q - A_*^\top P_* B_* (R + B_*^\top P_* B_*)^{-1} B_*^\top P_* A_*. \qquad (3.3)$$
The optimal cost for $\Theta_*$ is denoted as $J(\Theta_*) = J_* = \mathrm{Tr}(\bar{\sigma}_w^2 P_*)$. In this work, unlike the controllable LQR setting of the prior adaptive control algorithms without a stabilizing controller [2, 56], we study the online LQR problem in the general setting of stabilizable LQR.
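For concreteness, the optimal feedback gain $K(\Theta_*)$ and the optimal average cost $J_* = \mathrm{Tr}(\bar{\sigma}_w^2 P_*)$ can be computed numerically by solving the DARE (3.3). The sketch below does so with SciPy for an illustrative system (the numerical values of $A_*$, $B_*$, $Q$, $R$, and the noise scale are assumptions) and checks $J_*$ against a long simulated rollout.

```python
# Sketch: solve the DARE (3.3), form K(Theta_*) = -(R + B' P B)^{-1} B' P A, and
# compare J_* = Tr(sigma_w^2 P_*) with the simulated average cost under u_t = K x_t.
# All numerical values below are illustrative assumptions.
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.1, 0.2],
              [0.0, 0.8]])          # hypothetical A_*
B = np.array([[0.0],
              [1.0]])               # hypothetical B_*
Q, R = np.eye(2), np.eye(1)         # positive definite cost matrices
sigma_w = 0.5                       # isotropic noise scale

P = solve_discrete_are(A, B, Q, R)                      # P_* from (3.3)
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)      # optimal feedback gain K(Theta_*)
J_star = sigma_w**2 * np.trace(P)                       # optimal average cost

rng = np.random.default_rng(1)
T, x, total = 200_000, np.zeros(2), 0.0
for t in range(T):
    u = K @ x
    total += x @ Q @ x + u @ (R @ u)                    # stage cost c_t
    x = A @ x + B @ u + sigma_w * rng.standard_normal(2)

print("J_* = Tr(sigma_w^2 P_*) :", J_star)
print("simulated average cost  :", total / T)           # approaches J_* as T grows
```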
Definition 3.1 (Stabilizability vs. Controllability). The linear dynamical system $\Theta_*$ is stabilizable if there exists $K$ such that $\rho(A_* + B_* K) < 1$. On the other hand, the linear dynamical system $\Theta_*$ is controllable if the controllability matrix $[B_* \; A_* B_* \; A_*^2 B_* \; \ldots \; A_*^{n-1} B_*]$ has full row rank.
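The gap between the two notions can be seen on a small example. The following sketch, with assumed matrices, constructs a system whose controllability matrix is rank deficient, yet which admits a stabilizing feedback gain.

```python
# Sketch of a system that is stabilizable but NOT controllable (Definition 3.1).
# The matrices are illustrative assumptions.
import numpy as np

A = np.array([[0.5, 0.0],
              [0.0, 2.0]])      # first mode stable but unreachable, second mode unstable
B = np.array([[0.0],
              [1.0]])           # input only drives the second mode

ctrb = np.hstack([B, A @ B])    # controllability matrix [B, A B] for n = 2
print("rank of [B, AB]:", np.linalg.matrix_rank(ctrb))            # 1 < 2: not controllable

K = np.array([[0.0, -1.9]])     # feedback acting on the controllable, unstable mode
print("rho(A + B K)  :", max(abs(np.linalg.eigvals(A + B @ K))))  # 0.5 < 1: stabilizable
```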
Note that the stabilizability condition is the minimum requirement to define the optimal control problem. It is strictly weaker than controllability, i.e., all controllable systems are stabilizable but the converse is not true [28]. Similar to Cohen et al. [62], we quantify the stabilizability of $\Theta_*$ for the finite-time analysis.
Definition 3.2 ($(\kappa, \gamma)$-Stabilizability). The linear dynamical system $\Theta_*$ is $(\kappa, \gamma)$-stabilizable for $\kappa \geq 1$ and $0 < \gamma \leq 1$ if $\|K(\Theta_*)\| \leq \kappa$ and there exist $L$ and $H \succ 0$ such that $A_* + B_* K(\Theta_*) = H L H^{-1}$, with $\|L\| \leq 1 - \gamma$ and $\|H\| \|H^{-1}\| \leq \kappa$.
Note that this is merely a quantification of stabilizability. In other words, any stabilizable system is also $(\kappa, \gamma)$-stabilizable for some $\kappa$ and $\gamma$, and conversely, $(\kappa, \gamma)$-stabilizability implies stabilizability. In particular, for all stabilizable systems, by setting $1 - \gamma = \rho(A_* + B_* K(\Theta_*))$ and $\kappa$ to be the condition number of $P(\Theta_*)^{1/2}$, where $P(\Theta_*)$ is the positive definite matrix that satisfies the following Lyapunov equation:
$$(A_* + B_* K(\Theta_*))^\top P(\Theta_*) (A_* + B_* K(\Theta_*)) \preceq P(\Theta_*), \qquad (3.4)$$
one can show that $A_* + B_* K(\Theta_*) = H L H^{-1}$, where $H = P(\Theta_*)^{-1/2}$ and $L = P(\Theta_*)^{1/2} (A_* + B_* K(\Theta_*)) P(\Theta_*)^{-1/2}$, with $\|H\| \|H^{-1}\| \leq \kappa$ and $\|L\| \leq 1 - \gamma$ (see Lemma B.1 of Cohen et al. [61]).
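This construction can be carried out numerically. The sketch below, for an assumed system, solves a discrete Lyapunov equation for the closed loop to obtain a matrix satisfying (3.4) and reads off candidate values $1-\gamma = \rho(A_*+B_*K(\Theta_*))$ and $\kappa$ from the condition number of $P(\Theta_*)^{1/2}$ together with $\|K(\Theta_*)\|$.

```python
# Sketch of the (kappa, gamma)-stabilizability quantification described above,
# for an assumed system: 1 - gamma = rho(A + B K(Theta)) and kappa from the
# condition number of P(Theta)^{1/2} and ||K(Theta)||.
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

A = np.array([[1.1, 0.2],
              [0.0, 0.8]])
B = np.array([[0.0],
              [1.0]])
Q, R = np.eye(2), np.eye(1)

P_dare = solve_discrete_are(A, B, Q, R)
K = -np.linalg.solve(R + B.T @ P_dare @ B, B.T @ P_dare @ A)    # K(Theta)
A_cl = A + B @ K

# P_lyap satisfies A_cl' P_lyap A_cl - P_lyap = -I, hence the inequality (3.4).
P_lyap = solve_discrete_lyapunov(A_cl.T, np.eye(2))

gamma = 1.0 - max(abs(np.linalg.eigvals(A_cl)))                 # 1 - rho(A + B K)
kappa = max(np.sqrt(np.linalg.cond(P_lyap)),                    # cond(P^{1/2})
            np.linalg.norm(K, 2))                               # Definition 3.2 also needs ||K|| <= kappa
print(f"(kappa, gamma) = ({kappa:.3f}, {gamma:.3f})")
```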
Assumption 3.2 (Stabilizable Linear Dynamical System). The unknown parameter satisfies $\Theta_* \in \mathcal{S}$, where $\mathcal{S} = \{\Theta' = [A', B'] \,:\, \Theta' \text{ is } (\kappa, \gamma)\text{-stabilizable}, \ \|\Theta'\|_F \leq S\}$.

Notice that $\mathcal{S}$ denotes the set of all bounded systems that are $(\kappa, \gamma)$-stabilizable, of which $\Theta_*$ is an element, and membership in $\mathcal{S}$ can be easily verified. Moreover, the proposed algorithm in this work only requires upper bounds on the relevant control-theoretic quantities $\kappa$, $\gamma$, and $S$, which are also standard in prior works, e.g., [2, 62]. In practice, when there is a total lack of knowledge about the system, one can start with conservative upper bounds and adjust them based on the behavior of the system, e.g., the growth of the state. From $(\kappa, \gamma)$-stabilizability, we have that $\rho(A' + B' K(\Theta')) \leq 1 - \gamma$ and $\sup\{\|K(\Theta')\| \,:\, \Theta' \in \mathcal{S}\} \leq \kappa$. The following lemma shows that for any $(\kappa, \gamma)$-stabilizable system the solution of (3.3) is bounded.
Lemma 3.1 (Bounded DARE Solution). For any $\Theta$ that is $(\kappa, \gamma)$-stabilizable and has bounded regulatory cost matrices, i.e., $\|Q\|, \|R\| < \bar{\alpha}$, the solution of (3.3), $P$, is bounded as $\|P\| \leq D := \bar{\alpha} \gamma^{-1} \kappa^2 (1 + \kappa^2)$.
Proof. The solution of the DARE in (3.3) can be written as the infinite-horizon sum obtained by recursively applying the closed-loop Riccati recursion:
$$\|P_*\| = \Big\| \sum_{t=0}^{\infty} \big( (A_* + B_* K(\Theta_*))^t \big)^\top \big( Q + K(\Theta_*)^\top R\, K(\Theta_*) \big) (A_* + B_* K(\Theta_*))^t \Big\|$$
$$= \Big\| \sum_{t=0}^{\infty} \big( H L^t H^{-1} \big)^\top \big( Q + K(\Theta_*)^\top R\, K(\Theta_*) \big) H L^t H^{-1} \Big\|$$
$$\leq \bar{\alpha} \big(1 + \|K(\Theta_*)\|^2\big) \|H\|^2 \|H^{-1}\|^2 \sum_{t=0}^{\infty} \|L\|^{2t} \qquad (3.5)$$
$$\leq \bar{\alpha} \gamma^{-1} \kappa^2 (1 + \kappa^2), \qquad (3.6)$$
where (3.5) follows from the upper bound $\|Q\|, \|R\| \leq \bar{\alpha}$ and (3.6) follows from the definition of $(\kappa, \gamma)$-stabilizability. □
This lemma also shows that for the underlying stabilizable system, $J(\Theta_*) < \infty$.
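As a quick numerical sanity check of Lemma 3.1 (not part of the proof), one can compute $\|P\|$ and the bound $D = \bar{\alpha}\gamma^{-1}\kappa^2(1+\kappa^2)$ for an assumed system, certifying $(\kappa,\gamma)$-stabilizability directly through the decomposition $H L H^{-1}$ described after Definition 3.2.

```python
# Numerical sanity check of the Lemma 3.1 bound ||P|| <= alpha_bar * kappa^2 (1 + kappa^2) / gamma
# on an assumed system. Here (kappa, gamma) are certified directly via H = P_lyap^{-1/2},
# L = P_lyap^{1/2} A_cl P_lyap^{-1/2}, so Definition 3.2 holds and the bound applies.
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

A = np.array([[1.1, 0.2], [0.0, 0.8]])      # hypothetical Theta = [A, B]
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
alpha_bar = max(np.linalg.norm(Q, 2), np.linalg.norm(R, 2))     # ||Q||, ||R|| <= alpha_bar

P = solve_discrete_are(A, B, Q, R)                              # DARE solution of (3.3)
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
A_cl = A + B @ K

P_lyap = solve_discrete_lyapunov(A_cl.T, np.eye(2))             # satisfies (3.4)
w, V = np.linalg.eigh(P_lyap)
sqrtP = V @ np.diag(np.sqrt(w)) @ V.T                           # P_lyap^{1/2}
L = sqrtP @ A_cl @ np.linalg.inv(sqrtP)                         # A_cl = H L H^{-1}, H = P_lyap^{-1/2}

gamma = 1.0 - np.linalg.norm(L, 2)                              # ||L|| = 1 - gamma
kappa = max(np.sqrt(np.linalg.cond(P_lyap)),                    # ||H|| ||H^{-1}||
            np.linalg.norm(K, 2))                               # ||K(Theta)||
D = alpha_bar * kappa**2 * (1 + kappa**2) / gamma
print("||P|| =", np.linalg.norm(P, 2), "  D =", D)              # ||P|| <= D by Lemma 3.1
```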
Finite-Time Adaptive Control/Model-based RL Problem in LQRs
In our study, we consider the adaptive control setting where the model parameters $A_*$ and $B_*$, i.e., $\Theta_*$, are unknown. In this scenario, the learning agent needs to interact with the environment to learn these parameters and aims to minimize the cumulative cost $\sum_{t=1}^{T} c_t$. Note that the cost matrices $Q$ and $R$ are the designer's choice and are given. After $T$ time steps, we evaluate the regret of the learning agent as
$$R_T = \sum_{t=0}^{T} \big( c_t - J(\Theta_*) \big),$$
which is the difference between the performance of the agent and the expected performance of the optimal controller.
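A minimal sketch of this regret computation, for an assumed system and a hand-picked suboptimal stabilizing gain, is given below; a fixed suboptimal controller incurs regret that grows roughly linearly in $T$, whereas the adaptive algorithms developed in this chapter target $\tilde{O}(\sqrt{T})$.

```python
# Sketch of the regret R_T = sum_t (c_t - J(Theta_*)) for a fixed, suboptimal
# stabilizing controller on an assumed system. Illustrative values only.
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.1, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
sigma_w = 0.5

P = solve_discrete_are(A, B, Q, R)
J_star = sigma_w**2 * np.trace(P)          # optimal average cost J(Theta_*)

K_sub = np.array([[-0.5, -0.4]])           # hand-picked stabilizing but suboptimal gain
rng = np.random.default_rng(2)
T, x, regret = 50_000, np.zeros(2), 0.0
for t in range(T):
    u = K_sub @ x
    c_t = x @ Q @ x + u @ (R @ u)          # stage cost
    regret += c_t - J_star                 # accumulate c_t - J(Theta_*)
    x = A @ x + B @ u + sigma_w * rng.standard_normal(2)

print("regret after T steps:", regret)     # grows ~linearly in T for a fixed suboptimal K
```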