
LEARNING AND CONTROL IN LINEAR QUADRATIC REGULATOR (LQR)

3.1 Motivation and Background

The learning and control problem in LQRs has been studied in an array of prior works [1, 7, 8, 56, 81, 85, 191, 242]. These works provide finite-time performance guarantees of adaptive control algorithms in terms of regret. In particular, they show that $\tilde{O}(\sqrt{T})$ regret after $T$ time steps is optimal in adaptive control of LQRs.

They utilize several different paradigms for algorithm design, such as certainty equivalence, optimism, or Thompson sampling, yet they suffer either from inherent algorithmic drawbacks or from limited applicability in practice. Table 3.1 gives an overall comparison of these works in terms of their setting, their requirement of a stabilizing controller, and their computational complexity. In the following, we compare these works in more detail.

Certainty equivalent control: Certainty equivalent control (CEC) is one of the most straightforward paradigms for control design in adaptive control of dynamical systems. In CEC, an agent obtains a nominal estimate of the system, and executes the optimal control law for this estimated system (Figure 1.1). Even though Mania et al. [191] and Simchowitz and Foster [242] show that this simple approach attains order-optimal regret in LQRs, the proposed algorithms have several drawbacks.
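To make the paradigm concrete, the sketch below shows one way a certainty equivalent controller could be formed: fit a nominal model $[\hat{A}\;\hat{B}]$ by regularized least squares from observed transitions, and then apply the optimal LQR gain of the fitted model. This is only an illustration of the CEC idea under assumed data shapes; it is not the algorithm of [191, 242], which additionally relies on an initial stabilizing controller and an exploratory warm-up period.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def estimate_model(states, inputs, next_states, reg=1e-3):
    """Ridge-regression estimate of Theta = [A B] from observed transitions.

    states, next_states: (T, n) arrays; inputs: (T, d) array.
    """
    Z = np.hstack([states, inputs])                      # rows are z_t = [x_t; u_t]
    Theta = np.linalg.solve(Z.T @ Z + reg * np.eye(Z.shape[1]), Z.T @ next_states)
    n = states.shape[1]
    return Theta[:n].T, Theta[n:].T                      # (A_hat, B_hat)

def ce_gain(A_hat, B_hat, Q, R):
    """Certainty equivalent LQR gain: treat the estimate as the true system."""
    P = solve_discrete_are(A_hat, B_hat, Q, R)
    return -np.linalg.solve(R + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)

# A CEC agent would then simply play u_t = K @ x_t with K = ce_gain(A_hat, B_hat, Q, R).
```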

First and foremost, CEC is sensitive to model mismatch and requires model estimation errors so small that no further exploration of the system dynamics is needed. Since this level of refinement is challenging to attain for an unknown system, these methods rely on access to an initial stabilizing controller to enable a long period of exploration. In practice, such a priori known controllers may not be available, which hinders the deployment of these algorithms.

Moreover, CEC-based approaches follow the given non-adaptive initial stabilizing policy for a long period of time with isotropic perturbations. Thus, they provide an order-optimal theoretical regret upper bound, but with an additional large constant regret. However, in many applications, such as medical settings, such constant regret and non-adaptive controllers are not tolerable. Our methods StabL and TSAC aim to address these challenges and provide learning and control algorithms that can be deployed in practice. As we will show in Sections 3.2.3 and 3.3.4, they achieve significantly improved performance over prior baseline RL algorithms in various adaptive control tasks.

Optimism in the face of uncertainty (OFU) principle: As we have seen in Chapter 2, one of the most prominent methods to effectively balance exploration and exploitation is the optimism in the face of uncertainty (OFU) principle [156]. An agent that follows the OFU principle deploys the optimal policy of the model with the lowest optimal cost within the set of plausible models (Figure 1.1). This guarantees asymptotic convergence to the optimal policy for the LQR [30]. Using the OFU principle, the learning algorithms of [2, 86] attain order-optimal $O(\sqrt{T})$ regret after $T$ time steps, but their regret upper bounds suffer from an exponential dependence on the LQR model dimensions.

This is due to the fact that the OFU principle relies heavily on confidence-set constructions. An agent following the OFU principle mostly explores the parts of the state space with the lowest expected cost and with higher uncertainty. When the agent does not have reliable model estimates, this may cause a lack of exploration in certain parts of the state space that are important for designing stabilizing policies. This problem becomes more evident in the early stages of agent-environment interaction due to the lack of reliable knowledge about the system. Note that this issue is unique to control problems and not as common in other RL settings, e.g., bandits and gameplay. With the early improved exploration strategy of StabL, we alleviate this problem, achieve fast stabilization, and thus attain an $O(\sqrt{T})$ regret upper bound with polynomial dimension dependence.
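The model-selection step of the OFU principle can be illustrated with a toy sketch: among a finite set of plausible models, pick the one whose optimal average cost $J(\Theta) = \mathrm{Tr}(\bar{\sigma}_w^2 P(\Theta))$ is smallest and play its LQR gain. This is only a conceptual illustration with an assumed finite candidate set; the actual optimistic algorithms optimize over a continuous confidence ellipsoid, which is the non-convex step discussed later in this section.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def ofu_gain(candidates, Q, R, sigma_w=1.0):
    """Among a finite set of plausible models (A, B), pick the one with the smallest
    optimal average cost J(Theta) = sigma_w^2 * tr(P(Theta)) and return its gain."""
    best_J, best_K = np.inf, None
    for A, B in candidates:
        try:
            P = solve_discrete_are(A, B, Q, R)            # optimal cost-to-go for this model
        except (ValueError, np.linalg.LinAlgError):
            continue                                      # skip models without a stabilizing solution
        J = sigma_w**2 * np.trace(P)
        if J < best_J:
            best_J = J
            best_K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return best_K, best_J
```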

Thompson Sampling: Thompson Sampling (TS) is one of the oldest strategies to balance the exploration vs. exploitation trade-off [263]. In TS, the agent samples a model from a distribution computed based on prior control input and observation pairs, takes the optimal action for this sampled model, and then updates the distribution based on its new observation (Figure 1.1). Since it relies solely on sampling, this approach provides polynomial-time algorithms for adaptive control.

Therefore, it is a promising alternative to overcome the possible computational burden of optimism-based methods, since they solve a non-convex optimization problem to find the optimistic controllers, which is an NP-hard problem in general [9].
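As a rough illustration of a single TS step for the LQR, the sketch below samples a model from a Gaussian centered at the regularized least-squares estimate, with covariance shaped by the design matrix $V$, and returns the optimal gain of the sampled model. The names, the simple column-wise Gaussian posterior, and the absence of any rejection step are assumptions made only for illustration; TSAC and the algorithm of [7] add further safeguards, e.g., restricting samples to a confidence set of stabilizable models.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def ts_gain(Theta_hat, V, Q, R, sigma_w=1.0, rng=None):
    """One illustrative Thompson sampling step: perturb the least-squares estimate
    Theta_hat (shape (n+d, n)) with Gaussian noise shaped by the design matrix V,
    then return the optimal LQR gain of the sampled model."""
    rng = np.random.default_rng() if rng is None else rng
    n = Theta_hat.shape[1]
    V_inv = np.linalg.inv(V)
    # Each column of Theta gets an independent N(0, sigma_w^2 * V^{-1}) perturbation.
    noise = rng.multivariate_normal(np.zeros(V.shape[0]), V_inv, size=n).T
    Theta_s = Theta_hat + sigma_w * noise
    A_s, B_s = Theta_s[:n].T, Theta_s[n:].T
    P = solve_discrete_are(A_s, B_s, Q, R)    # may fail if the sampled model is not stabilizable
    K = -np.linalg.solve(R + B_s.T @ P @ B_s, B_s.T @ P @ A_s)
    return K, (A_s, B_s)
```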

For this reason, [6, 7] propose adaptive control algorithms using TS. In particular, Abeille and Lazaric [7] provide the first TS-based adaptive control algorithm for LQRs that attains the optimal regret of $\tilde{O}(\sqrt{T})$. However, their result only holds for scalar stabilizable systems, since they were able to show that TS samples optimistic parameters with constant probability only in scalar systems. They conjecture that this also holds in multidimensional systems, so that TS-based adaptive control can provide optimal regret in multidimensional LQRs, and they provide a simple numerical example to support this claim. In the analysis of TSAC, we derive a new lower bound which in fact proves that TS samples optimistic parameters with constant probability for all stabilizable LQRs. Further, we design TSAC with the required algorithmic improvements to attain $\tilde{O}(\sqrt{T})$ regret in multidimensional stabilizable LQRs.

Finding a stabilizing controller: Similar to regret minimization, there has been growing interest in the finite-time stabilization of linear dynamical systems [72, 82, 84]. Among these works, Faradonbeh et al. [82] is the closest to our study. However, there are significant differences in the methods and the scope of the results. In Faradonbeh et al. [82], random linear controllers are used solely for finding a stabilizing set, without a control goal. This results in an explosion of the state, presumably exponential in time, leading to a regret that scales exponentially in time. The proposed method provides many insightful aspects of finding a stabilizing set in finite time, yet neither a cost analysis of this process nor an adaptive control policy is provided. Moreover, the stabilizing set in [82] relates to the minimum value that satisfies a specific condition on the roots of a polynomial. This results in a somewhat implicit sample complexity for constructing such a set. On the other hand, in this chapter, we provide a complete study of an autonomous learning and control algorithm for the online LQR problem. Among our results, we give an explicit formulation of the stabilizing set and a sample complexity that relates only to minimal stabilizability information about the system.

Generalized LQR setting: Another line of research considers generalizations of the online LQR problem under partial observability [160–162, 191, 245] or adversarial disturbances [56, 108]. These works either assume a given stabilizing controller or open-loop stable system dynamics, except Chen and Hazan [56]. Independently and concurrently, the recent work by Chen and Hazan [56] designs an autonomous learning algorithm with regret guarantees similar to those of StabL and TSAC. However, the approaches and the settings have major differences. Chen and Hazan [56] consider the restrictive setting of controllable systems, yet with adversarial disturbances and general cost functions. They inject significantly large inputs, exponential in the system parameters, with a pure exploration intent to guarantee the recovery of the system parameters and stabilization. This negatively affects the practicality of the algorithm. On the other hand, in StabL and TSAC, we inject isotropic Gaussian perturbations to improve exploration in the stochastic (sub-Gaussian noise) stabilizable LQR while still aiming to control, i.e., with no pure exploration phase.

This yields practical RL algorithms that attain state-of-the-art performance.

Notation

We denote the Euclidean norm of a vector $x$ as $\|x\|_2$. For a matrix $A \in \mathbb{R}^{n \times d}$, we denote by $\rho(A)$ its spectral radius, by $\|A\|_F$ its Frobenius norm, and by $\|A\|$ its spectral norm; $\mathrm{tr}(A)$ denotes its trace and $A^\top$ its transpose. For any positive definite matrix $V$, $\|A\|_V = \|V^{1/2} A\|_F$. For matrices $A, B \in \mathbb{R}^{n \times d}$, $A \bullet B = \mathrm{tr}(A B^\top)$ denotes their Frobenius inner product. The $j$-th singular value of a rank-$n$ matrix $A$ is $\sigma_j(A)$, where $\sigma_{\max}(A) := \sigma_1(A) \ge \ldots \ge \sigma_{\min}(A) := \sigma_n(A)$. $I$ represents the identity matrix of the appropriate dimension. $\mathcal{M}_n = \mathbb{R}^{n \times n}$ denotes the set of $n$-dimensional square matrices. $\mathcal{N}(\mu, \Sigma)$ denotes the normal distribution with mean $\mu$ and covariance $\Sigma$. $Q(\cdot)$ denotes the Gaussian $Q$-function. $O(\cdot)$ and $o(\cdot)$ denote the standard asymptotic notation, and $f(T) = \omega(g(T))$ is equivalent to $g(T) = o(f(T))$. $\tilde{O}(\cdot)$ denotes the order up to logarithmic terms.

Problem Setting

Consider a discrete-time linear time-invariant system,

π‘₯𝑑+1 = π΄βˆ—π‘₯𝑑+π΅βˆ—π‘’π‘‘+𝑀𝑑, (3.1) whereπ‘₯𝑑 ∈R𝑛is the state of the system,𝑒𝑑 ∈R𝑑 is the control input,𝑀𝑑 ∈R𝑛is the process noise at time𝑑. We consider the systems with sub-Gaussian noise.

Assumption 3.1 (Sub-Gaussian Noise). The process noise $w_t$ is a martingale difference sequence with respect to the filtration $(\mathcal{F}_{t-1})$. Moreover, it is component-wise conditionally $\sigma_w^2$-sub-Gaussian and isotropic, such that for any $s \in \mathbb{R}$,

$$\mathbb{E}\left[\exp(s\, w_{t,j}) \mid \mathcal{F}_{t-1}\right] \le \exp\!\left(s^2 \sigma_w^2 / 2\right) \quad \text{and} \quad \mathbb{E}\left[w_t w_t^\top \mid \mathcal{F}_{t-1}\right] = \bar{\sigma}_w^2 I \ \text{ for some } \bar{\sigma}_w^2 > 0.$$

Note that the results of this chapter only require the conditional covariance matrix $W = \mathbb{E}[w_t w_t^\top \mid \mathcal{F}_{t-1}]$ to be full rank. The isotropic noise assumption is chosen to ease the presentation, and similar results can be obtained with upper and lower bounds on $W$, i.e., $W_{\mathrm{up}} > \sigma_{\max}(W) \ge \sigma_{\min}(W) > W_{\mathrm{low}} > 0$.

At each time step $t$, the system is at state $x_t$. After observing $x_t$, the agent applies a control input $u_t$, and the system evolves to $x_{t+1}$ at time $t+1$. At each time step $t$, the agent pays a cost $c_t = x_t^\top Q x_t + u_t^\top R u_t$, where $Q \in \mathbb{R}^{n \times n}$ and $R \in \mathbb{R}^{d \times d}$ are positive definite matrices such that $\|Q\|, \|R\| < \bar{\alpha}$ and $\sigma_{\min}(Q), \sigma_{\min}(R) > \underline{\alpha} > 0$. The problem is to design control inputs based on past observations in order to minimize the average expected cost

π½βˆ— = lim

π‘‡β†’βˆž min

𝑒=[𝑒1,...,𝑒𝑇]

1 𝑇E

h βˆ‘οΈπ‘‡

𝑑=1

π‘₯⊀

𝑑 𝑄 π‘₯𝑑 +π‘’βŠ€

𝑑 𝑅𝑒𝑑 i

, (3.2)

This problem is the canonical example for the control of linear dynamical systems and is termed the linear quadratic regulator (LQR). The system (3.1) can be represented as $x_{t+1} = \Theta_*^\top z_t + w_t$, where $\Theta_*^\top = [A_* \; B_*]$ and $z_t = [x_t^\top \; u_t^\top]^\top$. Knowing $\Theta_*$, the optimal control policy is a linear state-feedback control $u_t = K(\Theta_*) x_t$ with $K(\Theta_*) = -(R + B_*^\top P_* B_*)^{-1} B_*^\top P_* A_*$, where $P_*$ is the unique solution to the discrete-time algebraic Riccati equation (DARE) [28]:

π‘ƒβˆ—= π΄βˆ—βŠ€π‘ƒβˆ—π΄βˆ—+π‘„βˆ’ π΄βŠ€βˆ—π‘ƒβˆ—π΅βˆ—(𝑅+π΅βŠ€βˆ—π‘ƒβˆ—π΅βˆ—)βˆ’1π΅βŠ€βˆ—π‘ƒβˆ—π΄βˆ—. (3.3) The optimal cost for Ξ˜βˆ— is denoted as 𝐽(Ξ˜βˆ—) = π½βˆ— = Tr(𝜎¯2

π‘€π‘ƒβˆ—). In this work, unlike the controllable LQR setting of the prior adaptive control algorithms without a stabilizing controller [2, 56], we study the online LQR problem in the general setting ofstabilizableLQR.

Definition 3.1 (Stabilizability vs. Controllability). The linear dynamical system $\Theta_*$ is stabilizable if there exists $K$ such that $\rho(A_* + B_* K) < 1$. On the other hand, the linear dynamical system $\Theta_*$ is controllable if the controllability matrix $[B_* \;\; A_* B_* \;\; A_*^2 B_* \; \ldots \; A_*^{n-1} B_*]$ has full row rank.
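Both properties are straightforward to check numerically. The toy example below (an assumed system, chosen for illustration) is stabilizable but not controllable: its uncontrollable mode is already stable, so feedback acting on the controllable mode suffices.

```python
import numpy as np

def controllability_matrix(A, B):
    n = A.shape[0]
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])
    return np.hstack(blocks)

def is_controllable(A, B):
    return np.linalg.matrix_rank(controllability_matrix(A, B)) == A.shape[0]

def stabilizes(A, B, K):
    return max(abs(np.linalg.eigvals(A + B @ K))) < 1.0

# Stabilizable but not controllable: the second mode cannot be influenced by u,
# but it is already stable, so feedback on the first mode is enough.
A = np.array([[1.2, 0.0],
              [0.0, 0.5]])
B = np.array([[1.0],
              [0.0]])
K = np.array([[-0.7, 0.0]])
print(is_controllable(A, B))   # False: rank([B, AB]) = 1 < 2
print(stabilizes(A, B, K))     # True: rho(A + BK) = 0.5 < 1
```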

Note that the stabilizability condition is the minimum requirement to define the optimal control problem. It is strictly weaker than controllability, i.e., all controllable systems are stabilizable but the converse is not true [28]. Similar to Cohen et al. [62], we quantify the stabilizability of $\Theta_*$ for the finite-time analysis.

Definition 3.2 ($(\kappa, \gamma)$-Stabilizability). The linear dynamical system $\Theta_*$ is $(\kappa, \gamma)$-stabilizable (for $\kappa \ge 1$ and $0 < \gamma \le 1$) if $\|K(\Theta_*)\| \le \kappa$ and there exist $L$ and $H \succ 0$ such that $A_* + B_* K(\Theta_*) = H L H^{-1}$, with $\|L\| \le 1 - \gamma$ and $\|H\| \|H^{-1}\| \le \kappa$.

Note that this is merely a quantification of stabilizability. In other words, any stabilizable system is also $(\kappa, \gamma)$-stabilizable for some $\kappa$ and $\gamma$, and conversely, $(\kappa, \gamma)$-stabilizability implies stabilizability. In particular, for all stabilizable systems, by setting $1 - \gamma = \rho(A_* + B_* K(\Theta_*))$ and $\kappa$ to be the condition number of $P(\Theta_*)^{1/2}$, where $P(\Theta_*)$ is the positive definite matrix that satisfies the following Lyapunov equation:

$$(A_* + B_* K(\Theta_*))^\top P(\Theta_*) (A_* + B_* K(\Theta_*)) \preceq P(\Theta_*), \qquad (3.4)$$

one can show that $A_* + B_* K(\Theta_*) = H L H^{-1}$, where $H = P(\Theta_*)^{-1/2}$ and $L = P(\Theta_*)^{1/2}(A_* + B_* K(\Theta_*)) P(\Theta_*)^{-1/2}$, with $\|H\| \|H^{-1}\| \le \kappa$ and $\|L\| \le 1 - \gamma$ (see Lemma B.1 of Cohen et al. [61]).

Assumption 3.2 (Stabilizable Linear Dynamical System). The unknown parameter $\Theta_*$ belongs to the set $\mathcal{S} = \{\Theta' = [A', B'] \;:\; \Theta' \text{ is } (\kappa, \gamma)\text{-stabilizable}, \ \|\Theta'\|_F \le S\}$.

Notice that $\mathcal{S}$ denotes the set of all bounded systems that are $(\kappa, \gamma)$-stabilizable, of which $\Theta_*$ is an element, and membership in $\mathcal{S}$ can be easily verified. Moreover, the proposed algorithm in this work only requires upper bounds on the relevant control-theoretic quantities $\kappa$, $\gamma$, and $S$, which are also standard in prior works, e.g., [2, 62]. In practice, when there is a total lack of knowledge about the system, one can start with conservative upper bounds and adjust them based on the behavior of the system, e.g., the growth of the state. From $(\kappa, \gamma)$-stabilizability, we have that $\rho(A' + B' K(\Theta')) \le 1 - \gamma$ and $\sup\{\|K(\Theta')\| : \Theta' \in \mathcal{S}\} \le \kappa$. The following lemma shows that for any $(\kappa, \gamma)$-stabilizable system the solution of (3.3) is bounded.

Lemma 3.1 (Bounded DARE Solution). For any $\Theta$ that is $(\kappa, \gamma)$-stabilizable and has bounded regulatory cost matrices, i.e., $\|Q\|, \|R\| < \bar{\alpha}$, the solution $P$ of (3.3) is bounded as $\|P\| \le D := \bar{\alpha} \gamma^{-1} \kappa^2 (1 + \kappa^2)$.

Proof. Recursively applying the DARE in (3.3) shows that its solution can be written as an infinite sum, which gives

$$\|P_*\| = \left\|\sum_{t=0}^{\infty} \big((A_* + B_* K(\Theta_*))^t\big)^\top \big(Q + K(\Theta_*)^\top R\, K(\Theta_*)\big) (A_* + B_* K(\Theta_*))^t \right\|$$

$$= \left\|\sum_{t=0}^{\infty} \big(H L^t H^{-1}\big)^\top \big(Q + K(\Theta_*)^\top R\, K(\Theta_*)\big) H L^t H^{-1} \right\|$$

$$\le \bar{\alpha}\,(1 + \|K(\Theta_*)\|^2)\, \|H\|^2 \|H^{-1}\|^2 \sum_{t=0}^{\infty} \|L\|^{2t} \qquad (3.5)$$

$$\le \bar{\alpha}\, \gamma^{-1} \kappa^2 (1 + \kappa^2), \qquad (3.6)$$

where (3.5) follows from the upper bound $\|Q\|, \|R\| \le \bar{\alpha}$ and (3.6) follows from the definition of $(\kappa, \gamma)$-stabilizability. $\square$

This lemma also shows that for the underlying stabilizable system, $J(\Theta_*) < \infty$.
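A quick numerical sanity check of this bound is sketched below for the same illustrative system used earlier, taking $\bar{\alpha} = 1$ for $Q = R = I$ and a $(\kappa, \gamma)$ certificate from the Lyapunov construction of (3.4); all of these choices are assumptions of the sketch, not prescribed by the text.

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov, sqrtm

# Illustrative system (assumed); with Q = I and R = I we may take alpha_bar = 1.
A = np.array([[1.0, 0.5], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
alpha_bar = 1.0

P = solve_discrete_are(A, B, Q, R)
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
M = A + B @ K

# (kappa, gamma) certificate from the Lyapunov construction of (3.4).
P_lyap = solve_discrete_lyapunov(M.T, np.eye(2))       # M^T P_lyap M - P_lyap = -I
P_half = np.real(sqrtm(P_lyap))
kappa = max(np.linalg.norm(K, 2), np.linalg.cond(P_half))
gamma = 1.0 - np.linalg.norm(P_half @ M @ np.linalg.inv(P_half), 2)

D = alpha_bar * kappa**2 * (1 + kappa**2) / gamma      # bound of Lemma 3.1
print(np.linalg.norm(P, 2), D)                         # ||P_*|| should not exceed D
```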

Finite-Time Adaptive Control/Model-based RL Problem in LQRs

In our study, we consider the adaptive control setting where the model parameters $A_*$ and $B_*$, i.e., $\Theta_*$, are unknown. In this scenario, the learning agent needs to interact with the environment to learn these parameters and aims to minimize the cumulative cost $\sum_{t=1}^{T} c_t$. Note that the cost matrices $Q$ and $R$ are the designer's choice and are given. After $T$ time steps, we evaluate the regret of the learning agent as

𝑅𝑇 =βˆ‘οΈπ‘‡

𝑑=0(𝑐𝑑 βˆ’π½(Ξ˜βˆ—)),

which is the difference between the performance of the agent and the expected performance of the optimal controller.
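In simulation, this quantity is straightforward to track; the short sketch below computes it for a hypothetical cost trajectory (the numbers are placeholders, not results from any experiment).

```python
import numpy as np

def regret(costs, J_star):
    """Cumulative regret R_T = sum_t (c_t - J(Theta_*)) for an observed cost sequence."""
    return float(np.sum(np.asarray(costs, dtype=float) - J_star))

# Hypothetical numbers, only to show the bookkeeping.
rng = np.random.default_rng(0)
costs = 2.0 + 0.1 * rng.standard_normal(1000)
print(regret(costs, J_star=1.8))
```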
