LEARNING AND CONTROL IN PARTIALLY OBSERVABLE LINEAR DYNAMICAL SYSTEMS
5.4 Optimism-Based Adaptive Control
5.4.1 Adaptive Control via LqgOpt
In this section, we present LqgOpt and describe its constituent components. The outline of LqgOpt is given in Algorithm 11. The early stage of deploying LqgOpt involves a fixed warm-up period dedicated to pure exploration using Gaussian excitation. In particular, it excites the system with $u_t \sim \mathcal{N}(0, \sigma_u^2 I)$ for $1 \le t \le T_w$. LqgOpt requires this exploration period to estimate the model parameters reliably enough that the controller designed based on the parameter estimates and their confidence set stabilizes the real system. The duration $T_w$ of this period depends on how stabilizable the true parameters are and how accurate the model estimates need to be, i.e., the characterizations provided in Assumption 5.1. We will formally quantify these statements and the length of the warm-up period shortly.
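For concreteness, the following is a minimal sketch of the warm-up phase. The callback `system_step`, the input dimension `p`, and the excitation level `sigma_u` are placeholder assumptions standing in for the interaction interface with the true system; they are not part of LqgOpt's specification.

```python
import numpy as np

def warmup_exploration(system_step, T_w, p, sigma_u, rng=None):
    """Warm-up sketch: excite the system with i.i.d. Gaussian inputs
    u_t ~ N(0, sigma_u^2 I) for 1 <= t <= T_w and record the input/output data.
    `system_step` is a hypothetical callback u -> y wrapping the true system."""
    rng = np.random.default_rng() if rng is None else rng
    inputs, outputs = [], []
    for _ in range(T_w):
        u = sigma_u * rng.standard_normal(p)   # pure exploration input
        y = system_step(u)                     # observed (noisy) output y_t
        inputs.append(u)
        outputs.append(y)
    return np.asarray(inputs), np.asarray(outputs)
```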
After the warm-up period, LqgOpt utilizes the model parameter estimates and their confidence sets to design a controller corresponding to an optimistic model in the confidence sets, obtained by following the OFU principle. Due to the reliable estimation from the warm-up period, this controller and all subsequently designed controllers stabilize the underlying true unknown model. The agent deploys the prescribed controller on the real system for exploration and exploitation. The agent collects samples throughout its interaction with the environment and uses these samples for further improvement in model estimation, confidence interval construction, and the design of the controller for an optimistic model. This process operates in epochs of doubling length until the end of execution. In particular, within an epoch, the agent uses the most recent optimistic controller to control the underlying system $\Theta$ for twice as long as the duration of the previous control policy, i.e., each epoch $i$, for $i = 1, 2, \ldots$, is of length $2^{i-1} T_w$ time steps.
This technique is known as "the doubling trick" in reinforcement learning and online learning; it prevents frequent policy updates and balances the policy changes so that the overall regret of the algorithm is affected only by a constant factor. A minimal sketch of the resulting epoch schedule follows.
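The sketch below uses hypothetical variable names (`T` for the horizon, `T_w` for the warm-up length) and only illustrates the bookkeeping of the doubling epochs.

```python
def lqgopt_epoch_schedule(T, T_w):
    """Sketch of LqgOpt's doubling epochs: after a warm-up of T_w steps,
    epoch i (i = 1, 2, ...) lasts 2**(i-1) * T_w steps. Returns the
    (start, end) step indices of each epoch within a horizon of T steps."""
    epochs, start, i = [], T_w, 1
    while start < T:
        length = (2 ** (i - 1)) * T_w      # doubling epoch length
        end = min(start + length, T)
        epochs.append((start, end))        # the policy is held fixed on [start, end)
        start, i = end, i + 1
    return epochs

# e.g., lqgopt_epoch_schedule(T=100, T_w=10) -> [(10, 20), (20, 40), (40, 80), (80, 100)]
```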
System Identification
LqgOpt uses the novel system identification procedure described in Section 5.3, which allows both open-loop and closed-loop data collection to obtain consistent estimates of the dynamics. In particular, at the beginning of each epoch $i$, it solves the regularized least squares problem given in (5.21) to recover the input-to-output and output-to-output Markov parameters of $\Theta$ in predictor form using the entire history of data up to the current time step, $\mathcal{D}_i = \{y_t, u_t\}_{t=1}^{2^{i-1} T_w}$. The estimated Markov parameters $\widehat{\mathbf{G}}^{i}_{yu}$ are then used with SysId to obtain a balanced realization of the underlying model parameters, $\hat{A}_i, \hat{B}_i, \hat{C}_i, \hat{L}_i$, with corresponding confidence sets $\mathcal{C}_A(i), \mathcal{C}_B(i), \mathcal{C}_C(i), \mathcal{C}_L(i)$ as presented in (5.25). This confidence set $\mathcal{C}_i := \big(\mathcal{C}_A(i) \times \mathcal{C}_B(i) \times \mathcal{C}_C(i) \times \mathcal{C}_L(i)\big)$ contains the underlying parameters $\Theta = (A, B, C, L)$ up to a similarity transformation with high probability. LqgOpt uses this confidence set together with the set $\mathcal{S}$ defined in Assumption 5.1 to select the optimistic model among the plausible models. As LqgOpt collects more data, the confidence sets shrink at the rate given in Theorem 5.2, providing significantly refined estimates of the model parameters. LqgOpt adapts and updates its policy by deploying the OFU principle on the new confidence sets.
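As an illustration of the regression step only (the exact formulation is the one in (5.21)), the following hedged sketch regresses each output on the stacked history of the last $H$ outputs and inputs with ridge regularization, which is the generic shape of a predictor-form Markov parameter estimate. The function name, the regularization weight `lam`, and the data layout are placeholder assumptions.

```python
import numpy as np

def estimate_markov_parameters(Y, U, H, lam=1.0):
    """Regularized least-squares sketch: for t >= H, regress y_t on the covariate
    z_t = [y_{t-1}, ..., y_{t-H}, u_{t-1}, ..., u_{t-H}] to recover output-to-output
    and input-to-output Markov parameters in predictor form.

    Y: (T, m) outputs, U: (T, p) inputs, H: history length, lam: ridge weight.
    Returns G_hat of shape (m, H*(m+p))."""
    T, m = Y.shape
    p = U.shape[1]
    Z, targets = [], []
    for t in range(H, T):
        # most recent samples first, then flatten into one covariate vector
        hist = np.concatenate([Y[t - H:t][::-1].ravel(), U[t - H:t][::-1].ravel()])
        Z.append(hist)
        targets.append(Y[t])
    Z, targets = np.asarray(Z), np.asarray(targets)
    # Ridge solution: G_hat^T = (Z^T Z + lam I)^{-1} Z^T Y
    G_hat_T = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ targets)
    return G_hat_T.T
```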
Recall that in Section 5.3.2, we showed that using control inputs $u_t \sim \mathcal{N}(0, \sigma_u^2 I)$ guarantees the PE condition and consistent estimation. In particular, we defined $\mathbf{G}_{ol}$, which encodes the open-loop evolution of the disturbances in the system and represents the responses to these disturbances on the batch of observations and actions in the history, and we showed that $\mathbf{G}_{ol}$ is full row-rank, i.e., $\sigma_{\min}(\mathbf{G}_{ol}) > \sigma_o > 0$ for some known $\sigma_o$, yielding the PE condition. Thus, we have the guarantee that after the warm-up period of LqgOpt, the estimation error of the model parameters is $\tilde{\mathcal{O}}(1/\sqrt{T_w})$, due to Theorem 5.2.
Similarly, in order to obtain the PE condition and consistent system identification during adaptive control, which happens under a closed-loop controller, we define the truncated closed-loop noise evolution parameter $\mathbf{G}_{cl}$. When the controller is set to be the optimal policy for the underlying system in (5.9), i.e., the closed-loop system, $\mathbf{G}_{cl} \in \mathbb{R}^{H(m+p) \times 2H(m+p)}$ represents the translation of the truncated history of process and measurement noises onto the inputs, the $z$'s. The exact construction of $\mathbf{G}_{cl}$ is provided in detail in Equation (5.38) of the next section. Briefly, it is formed by shifting a block matrix $\bar{\mathbf{G}} \in \mathbb{R}^{(m+p) \times 2H(m+p)}$ by $m+p$ in each block row, where $\bar{\mathbf{G}}$ is constructed from $(m+p) \times (m+p)$ blocks; a toy sketch of this construction is given below. Assuming that the $H$ used in LqgOpt is large enough that $\bar{\mathbf{G}}$ is full row-rank for the given system, we will show that $\mathbf{G}_{cl}$ is also full row-rank. Thus, for the choice of $H$ in LqgOpt, $\sigma_{\min}(\mathbf{G}_{cl})$ is lower bounded by some positive value, i.e., $\sigma_{\min}(\mathbf{G}_{cl}) > \sigma_c > 0$, where LqgOpt only knows $\sigma_c$ and searches for an optimistic system whose closed-loop noise evolution parameter satisfies this lower bound. Note that we define $\mathbf{G}_{cl}$ based on the optimal closed-loop system, and we need to ensure that our model parameter estimates are close enough to the true ones so that the PE condition is also satisfied for the constructed controller. This analysis is provided in the next section, Lemma 5.5. With the guarantee of the PE condition in the closed-loop setting, LqgOpt is guaranteed to continuously refine the model parameter estimates; thus it improves the controllers and effectively balances the exploration-exploitation trade-off.
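To make the shifting construction concrete, the toy sketch below builds a matrix of the stated shape from a single block row by shifting it $m+p$ columns per block row and then checks its smallest singular value against a threshold. The block contents are random placeholders, and the zero-padding/truncation convention is an assumption; the actual quantities and boundary convention are those of Equation (5.38).

```python
import numpy as np

def shifted_block_stack(G_bar, H, block):
    """Stack H copies of the block row G_bar (shape (block, 2*H*block)), shifting
    each successive copy right by `block` columns, zero-padding on the left and
    truncating on the right so every row keeps width 2*H*block."""
    width = G_bar.shape[1]
    rows = []
    for i in range(H):
        shift = i * block
        row = np.zeros_like(G_bar)
        row[:, shift:] = G_bar[:, :width - shift]
        rows.append(row)
    return np.vstack(rows)

# PE-style check: sigma_min of the stacked matrix against a known lower bound.
m_plus_p, H, sigma_c = 3, 4, 1e-3                  # placeholder dimensions and bound
G_bar = np.random.randn(m_plus_p, 2 * H * m_plus_p)
G_cl = shifted_block_stack(G_bar, H, m_plus_p)     # shape (H*(m+p), 2*H*(m+p))
print(np.linalg.svd(G_cl, compute_uv=False).min() > sigma_c)
```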
Adaptive Control
After estimating the model parameters at the beginning of epoch $i$, LqgOpt uses these confidence sets along with the set $\mathcal{S}$ to implement the OFU principle. In particular, at time $t = 2^{i-1} T_w$, the algorithm chooses a system $\tilde{\Theta}_i = (\tilde{A}_i, \tilde{B}_i, \tilde{C}_i, \tilde{L}_i)$ from $\mathcal{C}_i \cap \mathcal{S}$ such that
$$ J(\tilde{\Theta}_i) \;\le\; \inf_{\Theta' \in \mathcal{C}_i \cap \mathcal{S}} J(\Theta') + 1/t. \qquad (5.34) $$
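The sketch below shows one way such an optimistic selection could be carried out over a finite set of candidate models, evaluating the average LQG cost through the standard discrete-time Riccati equations and the separation principle. The candidate set, cost matrices, and noise covariances `W`, `V` are placeholder assumptions (this section parametrizes models by the Kalman gain $L$ rather than by noise covariances), and the actual algorithm optimizes over the continuous set $\mathcal{C}_i \cap \mathcal{S}$.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqg_average_cost(A, B, C, Q, R, W, V):
    """Average cost J(Theta) of the optimal LQG controller for
    x_{t+1} = A x_t + B u_t + w_t, y_t = C x_t + v_t, cost y^T Q y + u^T R u,
    with Cov(w) = W, Cov(v) = V, via the separation principle."""
    Qx = C.T @ Q @ C                              # state cost induced by the output cost
    P = solve_discrete_are(A, B, Qx, R)           # control Riccati solution
    Sig = solve_discrete_are(A.T, C.T, W, V)      # steady-state prediction-error covariance
    E = C @ Sig @ C.T + V                         # innovation covariance
    L = Sig @ C.T @ np.linalg.inv(E)              # Kalman (predictor) gain
    N = A @ L @ E @ L.T @ A.T                     # noise covariance driving the state estimate
    return np.trace(Q @ V) + np.trace(Qx @ Sig) + np.trace(P @ N)

def select_optimistic_model(candidates, Q, R, W, V, t):
    """OFU step in the spirit of (5.34): among plausible models (A', B', C'),
    return one whose average cost is within 1/t of the smallest candidate cost."""
    costs = [lqg_average_cost(A, B, C, Q, R, W, V) for (A, B, C) in candidates]
    best = min(costs)
    for model, cost in zip(candidates, costs):
        if cost <= best + 1.0 / t:
            return model, cost
```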
LqgOpt then designs the optimal feedback policy $(\tilde{K}_t, \tilde{L}_t)$ for the chosen system $\tilde{\Theta}_t$ as shown in (5.9), i.e., it uses $\tilde{A}_t, \tilde{B}_t, \tilde{C}_t$, and $\tilde{L}_t$ for estimating the underlying state and deploys the feedback gain matrix $\tilde{K}_t$ to design the control inputs. This measurement feedback policy is executed until the end of the epoch, whose duration is twice that of the previous epoch. The following theorem gives the regret guarantee for LqgOpt.

Theorem 5.5 (Regret of LqgOpt with the closed-loop PE condition). Given an LQG control system $\Theta = (A, B, C)$ and regulating parameters $Q \succeq 0$ and $R \succ 0$, suppose Assumptions 5.1 and 5.2 hold such that the underlying system satisfies the PE condition with its optimal policy, i.e., $\sigma_{\min}(\mathbf{G}_{cl}) > \sigma_c > 0$. Fixing a horizon $T$, let
$$ H \ge \max\left\{ 2n+1,\; \frac{\log\big(c_H T \sqrt{T}/\sqrt{\lambda}\big)}{\log\big(1/(1-\gamma_3)\big)} \right\} \quad \text{and} \quad T_w = \mathrm{poly}\Big(H, \sigma_o, \sigma_c, \Phi_1, \Phi_2, \Phi_3, \tfrac{1}{1-\gamma_1}, \tfrac{1}{1-\gamma_2}, \tfrac{1}{1-\gamma_3}, n, m, p, d\Big). $$
Then, with high probability, the regret of LqgOpt with a warm-up duration of $T_w$ is $\mathrm{Regret}(T) = \tilde{\mathcal{O}}\big(\sqrt{T}\big)$.
The proof of this result will be presented in Section 5.4.5, with intermediate results given in Sections 5.4.2–5.4.4. Here $T_w$ is chosen to guarantee well-refined model estimates, the PE condition during the warm-up and adaptive control periods, the stability of the optimistic controllers, and the boundedness of the measurements and state estimates. The exact requirements on $T_w$ are given in the following sections with detailed expressions. Nevertheless, the warm-up duration is a fixed problem-dependent constant. This result shows that, in the challenging partially observable LQG control setting, LqgOpt achieves the same regret rate shown for LQR systems in Chapter 3. Moreover, this makes LqgOpt the first adaptive control algorithm to attain $\tilde{\mathcal{O}}\big(\sqrt{T}\big)$ regret for partially observable linear dynamical systems with convex cost. The following corollary is a direct extension of the result above and considers the case where the underlying optimal controller does not satisfy the PE condition. In this case, closed-loop system identification cannot provide reliable and consistent estimates, and LqgOpt relies solely on the warm-up period with i.i.d. Gaussian inputs, i.e., open-loop control. Therefore, throughout the adaptive control process, all the model parameter estimation errors scale as $\tilde{\mathcal{O}}(1/\sqrt{T_w})$.
Corollary 5.5.1 (Regret of LqgOpt without the closed-loop PE condition). For the system given in Theorem 5.5 with the same choices of $H$ and $T_w$, if the underlying system is not persistently excited under its optimal policy, LqgOpt incurs the following regret with high probability:
$$ \mathrm{Regret}(T) = \tilde{\mathcal{O}}\left(T_w + \frac{T - T_w}{\sqrt{T_w}}\right). $$
Therefore, the optimal regret upper bound in this setting is obtained with a warm-up duration of $T_w = \mathcal{O}(T^{2/3})$, which gives $\mathrm{Regret}(T) = \tilde{\mathcal{O}}\big(T^{2/3}\big)$ for LqgOpt.
This result shows that if the PE condition does not hold for the underlying system under its optimal controller, then LqgOpt requires a longer open-loop exploration, i.e., a longer warm-up, to compensate for the lack of estimation improvement during adaptive control.
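To see where the $T^{2/3}$ rate comes from, a short calculation under the corollary's bound (ignoring logarithmic factors and the subtracted $T_w$ in the numerator, which only affects constants): minimizing $T_w + T/\sqrt{T_w}$ over $T_w$ gives
$$ \frac{d}{dT_w}\left(T_w + \frac{T}{\sqrt{T_w}}\right) = 1 - \frac{T}{2}\, T_w^{-3/2} = 0 \;\Longrightarrow\; T_w = (T/2)^{2/3}, $$
at which point both terms are of order $T^{2/3}$, matching the stated $\tilde{\mathcal{O}}(T^{2/3})$ regret.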