5.4 Optimism-Based Adaptive Control

5.4.1 Adaptive Control via LqgOpt

In this section, we present LqgOpt and describe its constituent components. The outline of LqgOpt is given in Algorithm 11. The early stage of deploying LqgOpt involves a fixed warm-up period dedicated to pure exploration using Gaussian excitation: in particular, it excites the system with $u_t \sim \mathcal{N}(0, \sigma_u^2 I)$ for $1 \le t \le T_w$. LqgOpt requires this exploration period to estimate the model parameters reliably enough that the controller designed from the parameter estimates and their confidence set stabilizes the real system. The duration $T_w$ of this period depends on how stabilizable the true parameters are and how accurate the model estimates need to be, i.e., on the characterizations provided in Assumption 5.1. We formally quantify these statements and the length of the warm-up period shortly.
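To make the warm-up phase concrete, here is a minimal sketch of pure exploration with Gaussian excitation; it is not the thesis's implementation, and the `step` callback standing in for the real system as well as the names `sigma_u` and `T_w` are illustrative assumptions.

```python
import numpy as np

def warm_up(step, m, T_w, sigma_u, seed=0):
    """Pure-exploration warm-up: excite the system with u_t ~ N(0, sigma_u^2 I).

    `step(u)` is assumed to apply the input u to the real system and return the
    observation y_t; the logged (u_t, y_t) pairs feed the system identification step.
    """
    rng = np.random.default_rng(seed)
    inputs, outputs = [], []
    for _ in range(T_w):
        u_t = sigma_u * rng.standard_normal(m)  # i.i.d. Gaussian excitation
        y_t = step(u_t)                         # observation returned by the system
        inputs.append(u_t)
        outputs.append(y_t)
    return np.asarray(inputs), np.asarray(outputs)
```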

After the warm-up period, LqgOpt utilizes the model parameter estimates and their confidence sets to design a controller corresponding to an optimistic model within the confidence sets, obtained by following the OFU principle. Thanks to the reliable estimation from the warm-up period, this controller and all subsequently designed controllers stabilize the true, unknown underlying model. The agent deploys the prescribed controller on the real system for exploration and exploitation. It collects samples throughout its interaction with the environment and uses these samples to further improve the model estimates, construct confidence intervals, and design the controller for a new optimistic model. This process proceeds in epochs of doubling length until the end of execution: in each epoch, the agent uses the most recent optimistic controller to control the underlying system $\Theta$ for twice as long as the previous control policy, i.e., epoch $i$, for $i = 1, 2, \ldots$, has length $2^{i-1} T_w$ time steps.

This technique is known as the “doubling trick” in reinforcement learning and online learning; it prevents frequent policy updates and balances the policy changes so that the overall regret of the algorithm is affected only by a constant factor.
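The schedule itself is simple; a minimal sketch follows, under the assumption that the warm-up occupies the first $T_w$ steps of the horizon $T$.

```python
def epoch_lengths(T_w, T):
    """Epoch i = 1, 2, ... lasts 2**(i - 1) * T_w steps, truncated at the horizon T."""
    lengths, elapsed, i = [], T_w, 1  # the warm-up itself takes the first T_w steps
    while elapsed < T:
        length = min(2 ** (i - 1) * T_w, T - elapsed)
        lengths.append(length)
        elapsed += length
        i += 1
    return lengths

# Example: epoch_lengths(100, 1500) -> [100, 200, 400, 700].
# Only O(log T) policy updates occur, which is why the doubling trick changes the
# overall regret by at most a constant factor.
```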

System Identification

LqgOpt uses the novel system identification procedure described in Section 5.3, which allows both open-loop and closed-loop data collection to obtain consistent estimates of the dynamics. In particular, at the beginning of each epoch $i$, it solves the regularized least-squares problem given in (5.21) to recover the input-to-output and output-to-output Markov parameters of $\Theta$ in predictor form, using the entire history of data up to the current time step, $\mathcal{D}_i = \{y_t, u_t\}_{t=1}^{2^{i-1} T_w}$. The estimated Markov parameters $\hat{\mathcal{G}}^{i}_{yu}$ are then used with SysId to obtain a balanced realization of the underlying model parameters, $\hat{A}_i, \hat{B}_i, \hat{C}_i, \hat{L}_i$, with corresponding confidence sets $\mathcal{C}_A(i), \mathcal{C}_B(i), \mathcal{C}_C(i), \mathcal{C}_L(i)$ as presented in (5.25). This confidence set $\mathcal{C}_i := \mathcal{C}_A(i) \times \mathcal{C}_B(i) \times \mathcal{C}_C(i) \times \mathcal{C}_L(i)$ contains the underlying parameters $\Theta = (A, B, C, L)$, up to a similarity transformation, with high probability. LqgOpt uses this confidence set together with the set $\mathcal{S}$ defined in Assumption 5.1 to select the optimistic model among the plausible models. As LqgOpt collects more data, the confidence sets shrink at the rate given in Theorem 5.2, providing significantly refined estimates of the model parameters. LqgOpt adapts and updates its policy by deploying the OFU principle on the new confidence sets.
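A minimal sketch of the regularized least-squares step is given below, assuming the regressor for $y_t$ stacks the previous $H$ input-output pairs as in the predictor form; the plain ridge solver and variable names are illustrative, and the SysId balanced-realization and confidence-set construction of (5.25) are omitted.

```python
import numpy as np

def estimate_markov_parameters(u, y, H, lam=1.0):
    """Ridge-regression sketch of (5.21): regress y_t on the last H inputs/outputs.

    u: (T, m) inputs, y: (T, p) outputs. Returns G_hat of shape (p, H * (m + p)),
    whose p x (m + p) blocks estimate the Markov parameters in predictor form.
    """
    T, p = y.shape
    m = u.shape[1]
    Phi, Y = [], []
    for t in range(H, T):
        # regressor phi_t = [u_{t-1}, y_{t-1}, ..., u_{t-H}, y_{t-H}]
        phi_t = np.concatenate([np.r_[u[t - k], y[t - k]] for k in range(1, H + 1)])
        Phi.append(phi_t)
        Y.append(y[t])
    Phi, Y = np.asarray(Phi), np.asarray(Y)
    d = Phi.shape[1]
    # regularized least squares: minimize ||Y - Phi G^T||_F^2 + lam * ||G||_F^2
    G_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ Y).T
    return G_hat
```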

Recall that in Section 5.3.2, we showed that using control inputs $u_t \sim \mathcal{N}(0, \sigma_u^2 I)$ satisfies the PE condition and yields consistent estimation. In particular, we defined $\mathcal{G}_{ol}$, which encodes the open-loop evolution of the disturbances in the system and represents the responses to these disturbances in the batch of observation and action histories, and we showed that $\mathcal{G}_{ol}$ is full row-rank, i.e., $\sigma_{\min}(\mathcal{G}_{ol}) > \sigma_o > 0$ for some known $\sigma_o$, which guarantees the PE condition. Thus, after the warm-up period of LqgOpt, the estimation error of the model parameters is $\tilde{\mathcal{O}}(1/\sqrt{T_w})$, due to Theorem 5.2.

Similarly, in order to obtain the PE condition and consistent system identification during adaptive control, which operates with a closed-loop controller, we define the truncated closed-loop noise evolution parameter $\mathcal{G}_{cl}$. When the controller is set to the optimal policy for the underlying system in (5.9), i.e., the closed-loop system, $\mathcal{G}_{cl} \in \mathbb{R}^{H(m+p) \times 2H(n+m)}$ represents the translation of the truncated history of process and measurement noises onto the inputs, the $\phi$'s. The exact construction of $\mathcal{G}_{cl}$ is provided in detail in Equation (5.38) of the next section. Briefly, it is formed by shifting a block matrix $\bar{\mathcal{G}} \in \mathbb{R}^{(m+p) \times 2H(n+m)}$ by $m+n$ in each block row, where $\bar{\mathcal{G}}$ is constructed from $H$ matrices of size $(m+p) \times (n+m)$. Assuming that the $H$ used in LqgOpt is large enough that $\bar{\mathcal{G}}$ is full row-rank for the given system, we will show that $\mathcal{G}_{cl}$ is also full row-rank. Thus, for the choice of $H$ in LqgOpt, $\sigma_{\min}(\mathcal{G}_{cl})$ is lower bounded by some positive value, i.e., $\sigma_{\min}(\mathcal{G}_{cl}) > \sigma_c > 0$, where LqgOpt only knows $\sigma_c$ and searches for an optimistic system whose closed-loop noise evolution parameter satisfies this lower bound. Note that we define $\mathcal{G}_{cl}$ based on the optimal closed-loop system, so we need to ensure that our model parameter estimates are close enough to the true ones that the PE condition is also satisfied for the constructed controller; this analysis is provided in the next section, in Lemma 5.5. With the PE condition guaranteed in the closed-loop setting, LqgOpt continuously refines the model parameter estimates, thereby improving its controllers and effectively balancing the exploration-exploitation trade-off.
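Constructing $\mathcal{G}_{cl}$ itself follows (5.38); what LqgOpt ultimately needs from it is the singular-value lower bound, which reduces to a simple check. The sketch below is only illustrative of that check and assumes the matrix is available as a NumPy array.

```python
import numpy as np

def satisfies_pe(G, sigma_lower):
    """Persistence-of-excitation check: is sigma_min(G) > sigma_lower > 0?

    G is a stacked noise-evolution matrix such as G_ol or G_cl; a full row-rank G
    has a strictly positive smallest singular value.
    """
    sigma_min = np.linalg.svd(G, compute_uv=False).min()
    return sigma_min > sigma_lower
```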

Adaptive Control

After estimating the model parameters effectively at the beginning of epoch $i$, LqgOpt uses these confidence sets, along with the set $\mathcal{S}$, to implement the OFU principle. In particular, at time $t = 2^{i-1} T_w$, the algorithm chooses a system $\tilde{\Theta}_i = (\tilde{A}_i, \tilde{B}_i, \tilde{C}_i, \tilde{L}_i)$ from $\mathcal{C}_i \cap \mathcal{S}$ such that

$$
J(\tilde{\Theta}_i) \;\le\; \inf_{\Theta' \in \mathcal{C}_i \cap \mathcal{S}} J(\Theta') + 1/T. \tag{5.34}
$$
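A hedged sketch of this OFU step: in practice the infimum over $\mathcal{C}_i \cap \mathcal{S}$ would be approximated, e.g., by evaluating a finite set of candidate models drawn from the confidence set, which is the illustrative assumption made below.

```python
def choose_optimistic_model(candidates, J, T):
    """OFU selection (5.34): return a plausible model whose cost is within 1/T of
    the best cost found over the sampled candidates.

    `candidates` stands in for a finite approximation of the intersection of C_i and S,
    and `J` for the average LQG cost of a model; both are assumptions of this sketch.
    """
    costs = [J(theta) for theta in candidates]
    best = min(costs)
    for theta, cost in zip(candidates, costs):
        if cost <= best + 1.0 / T:  # any model inside the 1/T slack is admissible
            return theta
```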

LqgOpt then designs the optimal feedback policy $(\tilde{K}_t, \tilde{L}_t)$ for the chosen system $\tilde{\Theta}_t$ as shown in (5.9), i.e., it uses $\tilde{A}_t$, $\tilde{B}_t$, $\tilde{C}_t$, and $\tilde{L}_t$ to estimate the underlying state and deploys the feedback gain matrix $\tilde{K}_t$ to design the control inputs. This measurement-feedback policy is executed until the end of the epoch, whose duration is twice that of the previous epoch. The following theorem gives the regret guarantee for LqgOpt.

Theorem 5.5 (Regret of LqgOpt with the closed-loop PE condition). Given an LQG control system $\Theta = (A, B, C)$ and regulating parameters $Q \succeq 0$ and $R \succ 0$, suppose Assumptions 5.1 and 5.2 hold and that the underlying system satisfies the PE condition with its optimal policy, i.e., $\sigma_{\min}(\mathcal{G}_{cl}) > \sigma_c > 0$. Fixing a horizon $T$, let

$$
H \;\ge\; \max\left\{ 2n+1,\; \frac{\log\!\left(c_H T \sqrt{m}/\sqrt{\lambda}\right)}{\log\!\left(1/(1-\gamma_3)\right)} \right\}
\quad \text{and} \quad
T_w \;=\; \mathrm{poly}\!\left(H, \sigma_o, \sigma_c, \kappa_1, \kappa_2, \kappa_3, \tfrac{1}{1-\gamma_1}, \tfrac{1}{1-\gamma_2}, \tfrac{1}{1-\gamma_3}, \psi, m, n, p\right).
$$

Then, with high probability, the regret of LqgOpt with a warm-up duration of $T_w$ is $\mathrm{Regret}(T) = \tilde{\mathcal{O}}(\sqrt{T})$.
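For concreteness, here is a minimal sketch of the controller-design step performed at the start of each epoch, i.e., the policy $(\tilde{K}_t, \tilde{L}_t)$ of (5.9) for a chosen optimistic model: the feedback gain is computed from a discrete-time Riccati equation, while the filter gain is taken directly from the optimistic model. The SciPy solver and the sign convention $u_t = -\tilde{K}_t \hat{x}_t$ are assumptions of this sketch, not a restatement of (5.9).

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def design_optimistic_policy(A, B, C, L, Q, R):
    """Design the measurement-feedback policy for an optimistic model (A, B, C, L).

    Returns the feedback gain K and a `policy(x_hat, y)` closure producing the
    control input and the next state estimate in predictor form.
    """
    P = solve_discrete_are(A, B, Q, R)                        # control Riccati solution
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)         # LQR feedback gain

    def policy(x_hat, y):
        u = -K @ x_hat                                        # input from the state estimate
        x_hat_next = A @ x_hat + B @ u + L @ (y - C @ x_hat)  # state-estimate update
        return u, x_hat_next

    return K, policy
```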

The proof of this result is presented in Section 5.4.5, with intermediate results given in Sections 5.4.2–5.4.4. Here $T_w$ is chosen to guarantee well-refined model estimates, the PE condition during both the warm-up and adaptive-control periods, the stability of the optimistic controllers, and the boundedness of the measurements and state estimates. The exact requirements on $T_w$ are given in the following sections with detailed expressions; nevertheless, the warm-up duration is a fixed, problem-dependent constant. This result shows that LqgOpt achieves, in the challenging partially observable LQG control setting, the same regret rate as that shown for LQR systems in Chapter 3. Moreover, this makes LqgOpt the first adaptive control algorithm to attain $\tilde{\mathcal{O}}(\sqrt{T})$ regret for partially observable linear dynamical systems with convex cost.

The following corollary is a direct extension of the result above and considers the case where the underlying optimal controller does not satisfy the PE condition. In this case, closed-loop system identification cannot provide reliable and consistent estimates, and LqgOpt relies solely on the warm-up period of i.i.d. Gaussian inputs, i.e., open-loop control. Therefore, throughout the adaptive control process all model parameter estimation errors scale as $\tilde{\mathcal{O}}(1/\sqrt{T_w})$.

Corollary 5.5.1 (Regret of LqgOpt without the closed-loop PE condition). For the system given in Theorem 5.5, with the same choices of $H$ and $T_w$, if the underlying system is not persistently excited under its optimal policy, then LqgOpt incurs, with high probability, the regret

$$
\mathrm{Regret}(T) = \tilde{\mathcal{O}}\!\left(T_w + \frac{T - T_w}{\sqrt{T_w}}\right).
$$

Therefore, the optimal regret upper bound in this setting is obtained with a warm-up duration of $T_w = \mathcal{O}(T^{2/3})$, which yields $\mathrm{Regret}(T) = \tilde{\mathcal{O}}(T^{2/3})$ for LqgOpt.
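To see where the $T^{2/3}$ rate comes from, one can balance the two terms of the bound (constants and logarithmic factors suppressed): with the choice $T_w = T^{2/3}$,

$$
T_w + \frac{T - T_w}{\sqrt{T_w}} \;\le\; T^{2/3} + \frac{T}{T^{1/3}} \;=\; 2\,T^{2/3},
$$

and minimizing $f(T_w) = T_w + T/\sqrt{T_w}$ over $T_w$ indeed gives $T_w \propto T^{2/3}$: a shorter warm-up inflates the estimation-error term, while a longer one inflates the cost of open-loop exploration.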

This result shows that if the PE condition does not hold for the underlying system with the optimal controller, then LqgOpt requires a longer open-loop exploration, i.e., warm-up, to compensate for this lack of improvement during the adaptive control.
