LEARNING AND CONTROL IN LINEAR QUADRATIC REGULATOR (LQR)
3.1 Motivation and Background
The learning and control problem in LQRs has been studied in an array of prior works [1, 7, 8, 56, 81, 85, 191, 242]. These works provide finite-time performance guarantees of adaptive control algorithms in terms of regret. In particular, they show that $\tilde{O}(\sqrt{T})$ regret after $T$ time steps is optimal in adaptive control of LQRs.
They utilize several different paradigms for algorithm design, such as Certainty Equivalence, Optimism, or Thompson Sampling; yet they suffer either from inherent algorithmic drawbacks or from limited applicability in practice. Table 3.1 gives an overall comparison of these works in terms of their setting, the requirement of stabilizing controllers, and computational complexities. In the following, we compare these works in more detail.
Certainty equivalent control: Certainty equivalent control (CEC) is one of the most straightforward paradigms for control design in adaptive control of dynamical systems. In CEC, an agent obtains a nominal estimate of the system, and executes the optimal control law for this estimated system (Figure 1.1). Even though Mania et al. [191] and Simchowitz and Foster [242] show that this simple approach attains order-optimal regret in LQRs, the proposed algorithms have several drawbacks.
First and foremost, CEC is sensitive to model mismatch and requires model estimation errors small enough that further exploration of the system dynamics is not needed. Since this level of refinement is challenging to obtain for an unknown system, these methods rely on access to an initial stabilizing controller to enable a long exploration phase. In practice, such a priori known controllers may not be available, which hinders the deployment of these algorithms.
Moreover, CEC-based approaches follow the given non-adaptive initial stabilizing policy for a long period of time with isotropic perturbations. Thus, they provide an order-optimal theoretical regret upper bound, but with an additional large constant regret term. In many applications, such as medical settings, such constant regret and non-adaptive controllers are not tolerable. Our methods StabL and TSAC aim to address these challenges and provide learning and control algorithms that can be deployed in practice. As we will show in Sections 3.2.3 and 3.3.4, they achieve significantly improved performance over prior baseline RL algorithms in various adaptive control tasks.
Optimism in the face of uncertainty (OFU) principle: As we have seen in Chapter 2, one of the most prominent methods to effectively balance exploration and exploitation is the optimism in the face of uncertainty (OFU) principle [156]. An agent that follows the OFU principle deploys the optimal policy of the model with the lowest optimal cost within the set of plausible models (Figure 1.1). This guarantees asymptotic convergence to the optimal policy for the LQR [30]. Using the OFU principle, the learning algorithms of [2, 86] attain order-optimal $O(\sqrt{T})$ regret after $T$ time steps, but their regret upper bounds suffer from an exponential dependence on the LQR model dimensions.
This is due to the fact that the OFU principle relies heavily on confidence-set constructions. An agent following the OFU principle mostly explores parts of the state space with the lowest expected cost and with higher uncertainty. When the agent does not have reliable model estimates, this may cause a lack of exploration in parts of the state space that are important for designing stabilizing policies. This problem becomes more evident in the early stages of agent-environment interactions due to the lack of reliable knowledge about the system. Note that this issue is unique to control problems and not as common in other RL settings, e.g., bandits and gameplay. With the early improved exploration strategy of StabL, we alleviate this problem, achieve fast stabilization, and thus attain an $O(\sqrt{T})$ regret upper bound with polynomial dimension dependency.
Thompson Sampling: Thompson Sampling (TS) is one of the oldest strategies to balance the exploration vs. exploitation trade-off [263]. In TS, the agent samples a model from a distribution computed based on prior control input and observation pairs, and then takes the optimal action for this sampled model and updates the
distribution based on its novel observation (Figure 1.1). Since it relies solely on sampling, this approach provides polynomial-time algorithms for adaptive control.
Therefore, it is a promising alternative to overcome the possible computational burden of optimism-based methods, since the latter solve a non-convex optimization problem to find the optimistic controllers, which is an NP-hard problem in general [9].
For this reason, [6, 7] propose adaptive control algorithms using TS. In particular, Abeille and Lazaric [7] provide the first TS-based adaptive control algorithm for LQRs that attains the optimal regret of $\tilde{O}(\sqrt{T})$. However, their result only holds for scalar stabilizable systems, since they were able to show that TS samples optimistic parameters with constant probability only in scalar systems. Further, they conjecture that this holds in multidimensional systems as well, so that TS-based adaptive control can provide optimal regret in multidimensional LQRs, and they provide a simple numerical example to support this claim. In the analysis of TSAC, we derive a new lower bound which in fact proves that TS samples optimistic parameters with constant probability for all stabilizable LQRs. Further, we design TSAC with the required algorithmic improvements to attain $\tilde{O}(\sqrt{T})$ regret in multidimensional stabilizable LQRs.
Finding a stabilizing controller: Similar to regret minimization, there has been growing interest in finite-time stabilization of linear dynamical systems [72, 82, 84]. Among these works, Faradonbeh et al. [82] is the closest to our study. However, there are significant differences in the methods and the scope of the results. In Faradonbeh et al. [82], random linear controllers are used solely to find a stabilizing set, without a control goal. This results in the state blowing up, presumably exponentially in time, leading to a regret that scales exponentially in time. The proposed method provides many insightful aspects of finding a stabilizing set in finite time, yet neither a cost analysis of this process nor an adaptive control policy is provided. Moreover, the stabilizing set in [82] relates to the minimum value that satisfies a specific condition on the roots of a polynomial, which results in a somewhat implicit sample complexity for constructing such a set. On the other hand, in this chapter, we provide a complete study of an autonomous learning and control algorithm for the online LQR problem. Among our results, we give an explicit formulation of the stabilizing set and a sample complexity that relates only to the minimal stabilizability information of the system.
Generalized LQR setting: Another line of research considers generalizations of the online LQR problem under partial observability [160–162, 191, 245] or adversarial disturbances [56, 108]. These works either assume a given stabilizing controller or open-loop stable system dynamics, except Chen and Hazan [56]. Independently and concurrently, the recent work by Chen and Hazan [56] designs an autonomous learning algorithm with regret guarantees similar to those of StabL and TSAC. However, the approaches and the settings have major differences. Chen and Hazan [56] consider the restrictive setting of controllable systems, yet with adversarial disturbances and general cost functions. They inject significantly large inputs, exponential in the system parameters, with a pure exploration intent to guarantee the recovery of the system parameters and stabilization. This negatively affects the practicality of the algorithm. On the other hand, in StabL and TSAC, we inject isotropic Gaussian perturbations to improve exploration in the stochastic (sub-Gaussian noise) stabilizable LQR while still aiming to control, i.e., there is no pure exploration phase.
This yields practical RL algorithms that attain state-of-the-art performance.
Notation
We denote the Euclidean norm of a vector $x$ as $\|x\|_2$. For a matrix $A \in \mathbb{R}^{n \times d}$, we denote $\rho(A)$ as its spectral radius, $\|A\|_F$ as its Frobenius norm, and $\|A\|$ as its spectral norm; $\mathrm{tr}(A)$ denotes its trace and $A^\top$ its transpose. For any positive definite matrix $V$, $\|A\|_V = \|V^{1/2} A\|_F$. For matrices $A, B \in \mathbb{R}^{n \times d}$, $A \bullet B = \mathrm{tr}(A B^\top)$ denotes their Frobenius inner product. The $j$-th singular value of a rank-$r$ matrix $A$ is $\sigma_j(A)$, where $\sigma_{\max}(A) := \sigma_1(A) \geq \ldots \geq \sigma_{\min}(A) := \sigma_r(A)$. $I$ represents the identity matrix of the appropriate dimension. $\mathcal{M}_n = \mathbb{R}^{n \times n}$ denotes the set of $n$-dimensional square matrices. $\mathcal{N}(\mu, \Sigma)$ denotes the normal distribution with mean $\mu$ and covariance $\Sigma$. $Q(\cdot)$ denotes the Gaussian $Q$-function. $O(\cdot)$ and $\Omega(\cdot)$ denote the standard asymptotic notation, and $f(n) = O(g(n))$ is equivalent to $g(n) = \Omega(f(n))$. $\tilde{O}(\cdot)$ denotes the order up to logarithmic terms.
Problem Setting
Consider a discrete-time linear time-invariant system,
$$x_{t+1} = A_* x_t + B_* u_t + w_t, \qquad (3.1)$$
where $x_t \in \mathbb{R}^n$ is the state of the system, $u_t \in \mathbb{R}^d$ is the control input, and $w_t \in \mathbb{R}^n$ is the process noise at time $t$. We consider systems with sub-Gaussian noise.
Assumption 3.1 (Sub-Gaussian Noise). The process noise $w_t$ is a martingale difference sequence with respect to the filtration $(\mathcal{F}_{t-1})$. Moreover, it is component-wise conditionally $\sigma_w^2$-sub-Gaussian and isotropic, such that for any $\lambda \in \mathbb{R}$ and each component $j$,
$$\mathbb{E}\big[\exp(\lambda w_{t,j}) \,\big|\, \mathcal{F}_{t-1}\big] \leq \exp\big(\lambda^2 \sigma_w^2 / 2\big), \quad \text{and} \quad \mathbb{E}\big[w_t w_t^\top \,\big|\, \mathcal{F}_{t-1}\big] = \bar{\sigma}_w^2 I \ \text{ for some } \bar{\sigma}_w^2 > 0.$$
Note that the results of this chapter only require the conditional covariance matrix $W = \mathbb{E}[w_t w_t^\top \,|\, \mathcal{F}_{t-1}]$ to be full rank. The isotropic noise assumption is chosen to ease the presentation, and similar results can be obtained with upper and lower bounds on $W$, i.e., $\sigma_{\mathrm{up}} > \lambda_{\max}(W) \geq \lambda_{\min}(W) > \sigma_{\mathrm{low}} > 0$.
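As a minimal illustration of the dynamics in (3.1) under this noise model, the following Python sketch simulates the system with isotropic Gaussian noise (which satisfies Assumption 3.1) under a fixed linear feedback. The matrices $A$, $B$ and the gain $K$ are illustrative placeholders, not quantities from the analysis.

```python
# Minimal simulation sketch of x_{t+1} = A x_t + B u_t + w_t (Eq. 3.1) with
# isotropic Gaussian noise, which satisfies Assumption 3.1. The matrices A, B
# and the feedback gain K are illustrative assumptions, not values from the text.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.1, 0.2],
              [0.0, 0.8]])          # hypothetical A_* (open-loop unstable: eigenvalue 1.1)
B = np.array([[0.0],
              [1.0]])               # hypothetical B_*
K = np.array([[-0.5, -0.4]])        # a hand-picked stabilizing gain, u_t = K x_t
sigma_w = 0.5                       # E[w_t w_t^T | F_{t-1}] = sigma_w^2 * I

T, n = 5_000, A.shape[0]
x = np.zeros(n)
noises = np.zeros((T, n))
for t in range(T):
    u = K @ x                                   # linear state feedback
    w = sigma_w * rng.standard_normal(n)        # isotropic (sub-)Gaussian noise
    x = A @ x + B @ u + w                       # system evolution (3.1)
    noises[t] = w

print("rho(A + B K) =", max(abs(np.linalg.eigvals(A + B @ K))))   # < 1: closed loop stable
print("empirical noise covariance:\n", noises.T @ noises / T)     # ~ sigma_w^2 * I
```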
At each time step $t$, the system is at state $x_t$. After observing $x_t$, the agent applies a control input $u_t$, and the system evolves to $x_{t+1}$ at time $t+1$. At each time step $t$, the agent pays a cost $c_t = x_t^\top Q x_t + u_t^\top R u_t$, where $Q \in \mathbb{R}^{n \times n}$ and $R \in \mathbb{R}^{d \times d}$ are positive definite matrices such that $\|Q\|, \|R\| < \bar{\alpha}$ and $\sigma_{\min}(Q), \sigma_{\min}(R) > \underline{\alpha}$. The problem is to design control inputs based on past observations in order to minimize the average expected cost
$$J_* = \lim_{T \to \infty} \min_{u = [u_1, \ldots, u_T]} \frac{1}{T}\, \mathbb{E}\Big[ \sum_{t=1}^{T} x_t^\top Q x_t + u_t^\top R u_t \Big]. \qquad (3.2)$$
This problem is the canonical example for the control of linear dynamical systems and is termed the linear quadratic regulator (LQR). The system (3.1) can be represented as $x_{t+1} = \Theta_*^\top z_t + w_t$, where $\Theta_*^\top = [A_* \; B_*]$ and $z_t = [x_t^\top \; u_t^\top]^\top$. Knowing $\Theta_*$, the optimal control policy is a linear state feedback control $u_t = K(\Theta_*) x_t$ with $K(\Theta_*) = -(R + B_*^\top P_* B_*)^{-1} B_*^\top P_* A_*$, where $P_*$ is the unique solution to the discrete-time algebraic Riccati equation (DARE) [28]:
$$P_* = A_*^\top P_* A_* + Q - A_*^\top P_* B_* (R + B_*^\top P_* B_*)^{-1} B_*^\top P_* A_*. \qquad (3.3)$$
The optimal cost for $\Theta_*$ is denoted as $J(\Theta_*) = J_* = \mathrm{Tr}(\bar{\sigma}_w^2 P_*)$. In this work, unlike the controllable LQR setting of the prior adaptive control algorithms without a stabilizing controller [2, 56], we study the online LQR problem in the general setting of stabilizable LQR.
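For concreteness, the optimal feedback gain $K(\Theta_*)$ and the optimal average cost $J_* = \mathrm{Tr}(\bar{\sigma}_w^2 P_*)$ can be computed numerically by solving the DARE (3.3). The sketch below does so with SciPy for an illustrative system (the numerical values of $A_*$, $B_*$, $Q$, $R$, and the noise scale are assumptions) and checks $J_*$ against a long simulated rollout.

```python
# Sketch: solve the DARE (3.3), form K(Theta_*) = -(R + B' P B)^{-1} B' P A, and
# compare J_* = Tr(sigma_w^2 P_*) with the simulated average cost under u_t = K x_t.
# All numerical values below are illustrative assumptions.
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.1, 0.2],
              [0.0, 0.8]])          # hypothetical A_*
B = np.array([[0.0],
              [1.0]])               # hypothetical B_*
Q, R = np.eye(2), np.eye(1)         # positive definite cost matrices
sigma_w = 0.5                       # isotropic noise scale

P = solve_discrete_are(A, B, Q, R)                      # P_* from (3.3)
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)      # optimal feedback gain K(Theta_*)
J_star = sigma_w**2 * np.trace(P)                       # optimal average cost

rng = np.random.default_rng(1)
T, x, total = 200_000, np.zeros(2), 0.0
for t in range(T):
    u = K @ x
    total += x @ Q @ x + u @ (R @ u)                    # stage cost c_t
    x = A @ x + B @ u + sigma_w * rng.standard_normal(2)

print("J_* = Tr(sigma_w^2 P_*) :", J_star)
print("simulated average cost  :", total / T)           # approaches J_* as T grows
```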
Definition 3.1 (Stabilizability vs. Controllability). The linear dynamical system $\Theta_*$ is stabilizable if there exists $K$ such that $\rho(A_* + B_* K) < 1$. On the other hand, the linear dynamical system $\Theta_*$ is controllable if the controllability matrix $[B_* \; A_* B_* \; A_*^2 B_* \; \ldots \; A_*^{n-1} B_*]$ has full row rank.
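The gap between the two notions can be seen on a small example. The following sketch, with assumed matrices, constructs a system whose controllability matrix is rank deficient, yet which admits a stabilizing feedback gain.

```python
# Sketch of a system that is stabilizable but NOT controllable (Definition 3.1).
# The matrices are illustrative assumptions.
import numpy as np

A = np.array([[0.5, 0.0],
              [0.0, 2.0]])      # first mode stable but unreachable, second mode unstable
B = np.array([[0.0],
              [1.0]])           # input only drives the second mode

ctrb = np.hstack([B, A @ B])    # controllability matrix [B, A B] for n = 2
print("rank of [B, AB]:", np.linalg.matrix_rank(ctrb))            # 1 < 2: not controllable

K = np.array([[0.0, -1.9]])     # feedback acting on the controllable, unstable mode
print("rho(A + B K)  :", max(abs(np.linalg.eigvals(A + B @ K))))  # 0.5 < 1: stabilizable
```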
Note that the stabilizability condition is the minimum requirement to define the optimal control problem. It is strictly weaker than controllability, i.e., all controllable systems are stabilizable but the converse is not true [28]. Similar to Cohen et al. [62], we quantify the stabilizability of $\Theta_*$ for the finite-time analysis.
Definition 3.2 ($(\kappa, \gamma)$-Stabilizability). The linear dynamical system $\Theta_*$ is $(\kappa, \gamma)$-stabilizable for $\kappa \geq 1$ and $0 < \gamma \leq 1$ if $\|K(\Theta_*)\| \leq \kappa$ and there exist $L$ and $H \succ 0$ such that $A_* + B_* K(\Theta_*) = H L H^{-1}$, with $\|L\| \leq 1 - \gamma$ and $\|H\| \|H^{-1}\| \leq \kappa$.
Note that this is merely a quantification of stabilizability. In other words, any stabilizable system is also $(\kappa, \gamma)$-stabilizable for some $\kappa$ and $\gamma$, and conversely, $(\kappa, \gamma)$-stabilizability implies stabilizability. In particular, for all stabilizable systems, by setting $1 - \gamma = \rho(A_* + B_* K(\Theta_*))$ and $\kappa$ to be the condition number of $P(\Theta_*)^{1/2}$, where $P(\Theta_*)$ is the positive definite matrix that satisfies the following Lyapunov equation:
$$(A_* + B_* K(\Theta_*))^\top P(\Theta_*) (A_* + B_* K(\Theta_*)) \preceq P(\Theta_*), \qquad (3.4)$$
one can show that $A_* + B_* K(\Theta_*) = H L H^{-1}$, where $H = P(\Theta_*)^{-1/2}$ and $L = P(\Theta_*)^{1/2} (A_* + B_* K(\Theta_*)) P(\Theta_*)^{-1/2}$, with $\|H\| \|H^{-1}\| \leq \kappa$ and $\|L\| \leq 1 - \gamma$ (see Lemma B.1 of Cohen et al. [61]).
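This construction can be carried out numerically. The sketch below, for an assumed system, solves a discrete Lyapunov equation for the closed loop to obtain a matrix satisfying (3.4) and reads off candidate values $1-\gamma = \rho(A_*+B_*K(\Theta_*))$ and $\kappa$ from the condition number of $P(\Theta_*)^{1/2}$ together with $\|K(\Theta_*)\|$.

```python
# Sketch of the (kappa, gamma)-stabilizability quantification described above,
# for an assumed system: 1 - gamma = rho(A + B K(Theta)) and kappa from the
# condition number of P(Theta)^{1/2} and ||K(Theta)||.
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

A = np.array([[1.1, 0.2],
              [0.0, 0.8]])
B = np.array([[0.0],
              [1.0]])
Q, R = np.eye(2), np.eye(1)

P_dare = solve_discrete_are(A, B, Q, R)
K = -np.linalg.solve(R + B.T @ P_dare @ B, B.T @ P_dare @ A)    # K(Theta)
A_cl = A + B @ K

# P_lyap satisfies A_cl' P_lyap A_cl - P_lyap = -I, hence the inequality (3.4).
P_lyap = solve_discrete_lyapunov(A_cl.T, np.eye(2))

gamma = 1.0 - max(abs(np.linalg.eigvals(A_cl)))                 # 1 - rho(A + B K)
kappa = max(np.sqrt(np.linalg.cond(P_lyap)),                    # cond(P^{1/2})
            np.linalg.norm(K, 2))                               # Definition 3.2 also needs ||K|| <= kappa
print(f"(kappa, gamma) = ({kappa:.3f}, {gamma:.3f})")
```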
Assumption 3.2 (Stabilizable Linear Dynamical System). The unknown parameter satisfies $\Theta_* \in \mathcal{S}$, where $\mathcal{S} = \{\Theta' = [A', B'] \,:\, \Theta' \text{ is } (\kappa, \gamma)\text{-stabilizable}, \ \|\Theta'\|_F \leq S\}$.

Notice that $\mathcal{S}$ denotes the set of all bounded systems that are $(\kappa, \gamma)$-stabilizable, of which $\Theta_*$ is an element, and membership in $\mathcal{S}$ can be easily verified. Moreover, the proposed algorithm in this work only requires upper bounds on the relevant control-theoretic quantities $\kappa$, $\gamma$, and $S$, which are also standard in prior works, e.g., [2, 62]. In practice, when there is a total lack of knowledge about the system, one can start with conservative upper bounds and adjust them based on the behavior of the system, e.g., the growth of the state. From $(\kappa, \gamma)$-stabilizability, we have that $\rho(A' + B' K(\Theta')) \leq 1 - \gamma$ and $\sup\{\|K(\Theta')\| \,:\, \Theta' \in \mathcal{S}\} \leq \kappa$. The following lemma shows that for any $(\kappa, \gamma)$-stabilizable system the solution of (3.3) is bounded.
Lemma 3.1 (Bounded DARE Solution). For any $\Theta$ that is $(\kappa, \gamma)$-stabilizable and has bounded regulatory cost matrices, i.e., $\|Q\|, \|R\| < \bar{\alpha}$, the solution of (3.3), $P$, is bounded as $\|P\| \leq D := \bar{\alpha} \gamma^{-1} \kappa^2 (1 + \kappa^2)$.
Proof. The solution of the DARE in (3.3) can be written as the infinite-horizon sum obtained by recursively applying the closed-loop Riccati recursion:
$$\|P_*\| = \Big\| \sum_{t=0}^{\infty} \big( (A_* + B_* K(\Theta_*))^t \big)^\top \big( Q + K(\Theta_*)^\top R\, K(\Theta_*) \big) (A_* + B_* K(\Theta_*))^t \Big\|$$
$$= \Big\| \sum_{t=0}^{\infty} \big( H L^t H^{-1} \big)^\top \big( Q + K(\Theta_*)^\top R\, K(\Theta_*) \big) H L^t H^{-1} \Big\|$$
$$\leq \bar{\alpha} \big(1 + \|K(\Theta_*)\|^2\big) \|H\|^2 \|H^{-1}\|^2 \sum_{t=0}^{\infty} \|L\|^{2t} \qquad (3.5)$$
$$\leq \bar{\alpha} \gamma^{-1} \kappa^2 (1 + \kappa^2), \qquad (3.6)$$
where (3.5) follows from the upper bound $\|Q\|, \|R\| \leq \bar{\alpha}$ and (3.6) follows from the definition of $(\kappa, \gamma)$-stabilizability. □
This lemma also shows that for the underlying stabilizable system, $J(\Theta_*) < \infty$.
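As a quick numerical sanity check of Lemma 3.1 (not part of the proof), one can compute $\|P\|$ and the bound $D = \bar{\alpha}\gamma^{-1}\kappa^2(1+\kappa^2)$ for an assumed system, certifying $(\kappa,\gamma)$-stabilizability directly through the decomposition $H L H^{-1}$ described after Definition 3.2.

```python
# Numerical sanity check of the Lemma 3.1 bound ||P|| <= alpha_bar * kappa^2 (1 + kappa^2) / gamma
# on an assumed system. Here (kappa, gamma) are certified directly via H = P_lyap^{-1/2},
# L = P_lyap^{1/2} A_cl P_lyap^{-1/2}, so Definition 3.2 holds and the bound applies.
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

A = np.array([[1.1, 0.2], [0.0, 0.8]])      # hypothetical Theta = [A, B]
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
alpha_bar = max(np.linalg.norm(Q, 2), np.linalg.norm(R, 2))     # ||Q||, ||R|| <= alpha_bar

P = solve_discrete_are(A, B, Q, R)                              # DARE solution of (3.3)
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
A_cl = A + B @ K

P_lyap = solve_discrete_lyapunov(A_cl.T, np.eye(2))             # satisfies (3.4)
w, V = np.linalg.eigh(P_lyap)
sqrtP = V @ np.diag(np.sqrt(w)) @ V.T                           # P_lyap^{1/2}
L = sqrtP @ A_cl @ np.linalg.inv(sqrtP)                         # A_cl = H L H^{-1}, H = P_lyap^{-1/2}

gamma = 1.0 - np.linalg.norm(L, 2)                              # ||L|| = 1 - gamma
kappa = max(np.sqrt(np.linalg.cond(P_lyap)),                    # ||H|| ||H^{-1}||
            np.linalg.norm(K, 2))                               # ||K(Theta)||
D = alpha_bar * kappa**2 * (1 + kappa**2) / gamma
print("||P|| =", np.linalg.norm(P, 2), "  D =", D)              # ||P|| <= D by Lemma 3.1
```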
Finite-Time Adaptive Control/Model-based RL Problem in LQRs
In our study, we consider the adaptive control setting where the model parameters $A_*$ and $B_*$, i.e., $\Theta_*$, are unknown. In this scenario, the learning agent needs to interact with the environment to learn these parameters and aims to minimize the cumulative cost $\sum_{t=1}^{T} c_t$. Note that the cost matrices $Q$ and $R$ are the designer's choice and are given. After $T$ time steps, we evaluate the regret of the learning agent as
$$R_T = \sum_{t=0}^{T} \big( c_t - J(\Theta_*) \big),$$
which is the difference between the performance of the agent and the expected performance of the optimal controller.
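A minimal sketch of this regret computation, for an assumed system and a hand-picked suboptimal stabilizing gain, is given below; a fixed suboptimal controller incurs regret that grows roughly linearly in $T$, whereas the adaptive algorithms developed in this chapter target $\tilde{O}(\sqrt{T})$.

```python
# Sketch of the regret R_T = sum_t (c_t - J(Theta_*)) for a fixed, suboptimal
# stabilizing controller on an assumed system. Illustrative values only.
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.1, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
sigma_w = 0.5

P = solve_discrete_are(A, B, Q, R)
J_star = sigma_w**2 * np.trace(P)          # optimal average cost J(Theta_*)

K_sub = np.array([[-0.5, -0.4]])           # hand-picked stabilizing but suboptimal gain
rng = np.random.default_rng(2)
T, x, regret = 50_000, np.zeros(2), 0.0
for t in range(T):
    u = K_sub @ x
    c_t = x @ Q @ x + u @ (R @ u)          # stage cost
    regret += c_t - J_star                 # accumulate c_t - J(Theta_*)
    x = A @ x + B @ u + sigma_w * rng.standard_normal(2)

print("regret after T steps:", regret)     # grows ~linearly in T for a fixed suboptimal K
```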