LEARNING AND CONTROL IN PARTIALLY OBSERVABLE LINEAR DYNAMICAL SYSTEMS
5.4 Optimism-Based Adaptive Control
5.4.1 Adaptive Control via LqgOpt
In this section, we present LqgOpt and describe its constituent components. The outline of LqgOpt is given in Algorithm 11. The early stage of deploying LqgOpt involves a fixed warm-up period dedicated to pure exploration using Gaussian excitation. In particular, it excites the system with $u_t \sim \mathcal{N}(0, \sigma_u^2 I)$ for $1 \le t \le T_w$. LqgOpt requires this exploration period to estimate the model parameters reliably enough that the controller designed based on the parameter estimates and their confidence set stabilizes the real system. The duration $T_w$ of this period depends on how stabilizable the true parameters are and how accurate the model estimates need to be, i.e., the characterizations provided in Assumption 5.1. We will formally quantify these statements and the length of the warm-up period shortly.
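For concreteness, the following is a minimal sketch of the warm-up phase. The callback `system_step`, the input dimension `p`, and the excitation level `sigma_u` are placeholder assumptions standing in for the interaction interface with the true system; they are not part of LqgOpt's specification.

```python
import numpy as np

def warmup_exploration(system_step, T_w, p, sigma_u, rng=None):
    """Warm-up sketch: excite the system with i.i.d. Gaussian inputs
    u_t ~ N(0, sigma_u^2 I) for 1 <= t <= T_w and record the input/output data.
    `system_step` is a hypothetical callback u -> y wrapping the true system."""
    rng = np.random.default_rng() if rng is None else rng
    inputs, outputs = [], []
    for _ in range(T_w):
        u = sigma_u * rng.standard_normal(p)   # pure exploration input
        y = system_step(u)                     # observed (noisy) output y_t
        inputs.append(u)
        outputs.append(y)
    return np.asarray(inputs), np.asarray(outputs)
```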
After the warm-up period, LqgOpt utilizes the model parameter estimates and their confidence sets to design a controller corresponding to an optimistic model in the confidence sets, obtained by following the OFU principle. Due to the reliable estimation from the warm-up period, this controller and all subsequently designed controllers stabilize the underlying true unknown model. The agent deploys the prescribed controller on the real system for exploration and exploitation. The agent collects samples throughout its interaction with the environment and uses these samples for further improvement in model estimation, confidence interval construction, and the design of the controller for an optimistic model. This process operates in epochs of doubling length until the end of execution. In particular, within an epoch, the agent uses the most recent optimistic controller to control the underlying system $\Theta$ for twice as long as the duration of the previous control policy, i.e., each epoch $i$, for $i = 1, 2, \ldots$, is of length $2^{i-1} T_w$ time steps.
This technique is known as "the doubling trick" in reinforcement learning and online learning; it prevents frequent policy updates and balances the policy changes so that the overall regret of the algorithm is affected only by a constant factor. A minimal sketch of the resulting epoch schedule follows.
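The sketch below uses hypothetical variable names (`T` for the horizon, `T_w` for the warm-up length) and only illustrates the bookkeeping of the doubling epochs.

```python
def lqgopt_epoch_schedule(T, T_w):
    """Sketch of LqgOpt's doubling epochs: after a warm-up of T_w steps,
    epoch i (i = 1, 2, ...) lasts 2**(i-1) * T_w steps. Returns the
    (start, end) step indices of each epoch within a horizon of T steps."""
    epochs, start, i = [], T_w, 1
    while start < T:
        length = (2 ** (i - 1)) * T_w      # doubling epoch length
        end = min(start + length, T)
        epochs.append((start, end))        # the policy is held fixed on [start, end)
        start, i = end, i + 1
    return epochs

# e.g., lqgopt_epoch_schedule(T=100, T_w=10) -> [(10, 20), (20, 40), (40, 80), (80, 100)]
```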
System Identification
LqgOpt uses the novel system identification procedure described in Section 5.3, which allows both open-loop and closed-loop data collection to obtain consistent estimates of the dynamics. In particular, at the beginning of each epoch $i$, it solves the regularized least squares problem given in (5.21) to recover the input-to-output and output-to-output Markov parameters of $\Theta$ in predictor form using the entire history of data up to the current time step, $\mathcal{D}_i = \{y_t, u_t\}_{t=1}^{2^{i-1} T_w}$. The estimated Markov parameters $\widehat{\mathbf{G}}^{i}_{yu}$ are then used with SysId to obtain a balanced realization of the underlying model parameters, $\hat{A}_i, \hat{B}_i, \hat{C}_i, \hat{L}_i$, with corresponding confidence sets $\mathcal{C}_A(i), \mathcal{C}_B(i), \mathcal{C}_C(i), \mathcal{C}_L(i)$ as presented in (5.25). This confidence set $\mathcal{C}_i := \big(\mathcal{C}_A(i) \times \mathcal{C}_B(i) \times \mathcal{C}_C(i) \times \mathcal{C}_L(i)\big)$ contains the underlying parameters $\Theta = (A, B, C, L)$ up to a similarity transformation with high probability. LqgOpt uses this confidence set together with the set $\mathcal{S}$ defined in Assumption 5.1 to select the optimistic model among the plausible models. As LqgOpt collects more data, the confidence sets shrink at the rate given in Theorem 5.2, providing significantly refined estimates of the model parameters. LqgOpt adapts and updates its policy by deploying the OFU principle on the new confidence sets.
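As an illustration of the regression step only (the exact formulation is the one in (5.21)), the following hedged sketch regresses each output on the stacked history of the last $H$ outputs and inputs with ridge regularization, which is the generic shape of a predictor-form Markov parameter estimate. The function name, the regularization weight `lam`, and the data layout are placeholder assumptions.

```python
import numpy as np

def estimate_markov_parameters(Y, U, H, lam=1.0):
    """Regularized least-squares sketch: for t >= H, regress y_t on the covariate
    z_t = [y_{t-1}, ..., y_{t-H}, u_{t-1}, ..., u_{t-H}] to recover output-to-output
    and input-to-output Markov parameters in predictor form.

    Y: (T, m) outputs, U: (T, p) inputs, H: history length, lam: ridge weight.
    Returns G_hat of shape (m, H*(m+p))."""
    T, m = Y.shape
    p = U.shape[1]
    Z, targets = [], []
    for t in range(H, T):
        # most recent samples first, then flatten into one covariate vector
        hist = np.concatenate([Y[t - H:t][::-1].ravel(), U[t - H:t][::-1].ravel()])
        Z.append(hist)
        targets.append(Y[t])
    Z, targets = np.asarray(Z), np.asarray(targets)
    # Ridge solution: G_hat^T = (Z^T Z + lam I)^{-1} Z^T Y
    G_hat_T = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ targets)
    return G_hat_T.T
```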
Recall that in Section 5.3.2, we showed that using control inputs $u_t \sim \mathcal{N}(0, \sigma_u^2 I)$ guarantees the PE condition and consistent estimation. In particular, we defined $\mathbf{G}_{ol}$, which encodes the open-loop evolution of the disturbances in the system and represents the responses to these disturbances on the batch of observations and actions in the history, and we showed that $\mathbf{G}_{ol}$ is full row-rank, i.e., $\sigma_{\min}(\mathbf{G}_{ol}) > \sigma_o > 0$ for some known $\sigma_o$, yielding the PE condition. Thus, we have the guarantee that after the warm-up period of LqgOpt, the estimation error of the model parameters is $\tilde{\mathcal{O}}(1/\sqrt{T_w})$, due to Theorem 5.2.
Similarly, in order to obtain the PE condition and consistent system identification during adaptive control, which happens under a closed-loop controller, we define the truncated closed-loop noise evolution parameter $\mathbf{G}_{cl}$. When the controller is set to be the optimal policy for the underlying system in (5.9), i.e., the closed-loop system, $\mathbf{G}_{cl} \in \mathbb{R}^{H(m+p) \times 2H(m+p)}$ represents the translation of the truncated history of process and measurement noises onto the inputs, the $z$'s. The exact construction of $\mathbf{G}_{cl}$ is provided in detail in Equation (5.38) of the next section. Briefly, it is formed by shifting a block matrix $\bar{\mathbf{G}} \in \mathbb{R}^{(m+p) \times 2H(m+p)}$ by $m+p$ in each block row, where $\bar{\mathbf{G}}$ is constructed from $(m+p) \times (m+p)$ blocks; a toy sketch of this construction is given below. Assuming that the $H$ used in LqgOpt is large enough that $\bar{\mathbf{G}}$ is full row-rank for the given system, we will show that $\mathbf{G}_{cl}$ is also full row-rank. Thus, for the choice of $H$ in LqgOpt, $\sigma_{\min}(\mathbf{G}_{cl})$ is lower bounded by some positive value, i.e., $\sigma_{\min}(\mathbf{G}_{cl}) > \sigma_c > 0$, where LqgOpt only knows $\sigma_c$ and searches for an optimistic system whose closed-loop noise evolution parameter satisfies this lower bound. Note that we define $\mathbf{G}_{cl}$ based on the optimal closed-loop system, and we need to ensure that our model parameter estimates are close enough to the true ones so that the PE condition is also satisfied for the constructed controller. This analysis is provided in the next section, Lemma 5.5. With the guarantee of the PE condition in the closed-loop setting, LqgOpt is guaranteed to continuously refine the model parameter estimates; thus it improves the controllers and effectively balances the exploration-exploitation trade-off.
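To make the shifting construction concrete, the toy sketch below builds a matrix of the stated shape from a single block row by shifting it $m+p$ columns per block row and then checks its smallest singular value against a threshold. The block contents are random placeholders, and the zero-padding/truncation convention is an assumption; the actual quantities and boundary convention are those of Equation (5.38).

```python
import numpy as np

def shifted_block_stack(G_bar, H, block):
    """Stack H copies of the block row G_bar (shape (block, 2*H*block)), shifting
    each successive copy right by `block` columns, zero-padding on the left and
    truncating on the right so every row keeps width 2*H*block."""
    width = G_bar.shape[1]
    rows = []
    for i in range(H):
        shift = i * block
        row = np.zeros_like(G_bar)
        row[:, shift:] = G_bar[:, :width - shift]
        rows.append(row)
    return np.vstack(rows)

# PE-style check: sigma_min of the stacked matrix against a known lower bound.
m_plus_p, H, sigma_c = 3, 4, 1e-3                  # placeholder dimensions and bound
G_bar = np.random.randn(m_plus_p, 2 * H * m_plus_p)
G_cl = shifted_block_stack(G_bar, H, m_plus_p)     # shape (H*(m+p), 2*H*(m+p))
print(np.linalg.svd(G_cl, compute_uv=False).min() > sigma_c)
```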
Adaptive Control
After estimating the model parameters at the beginning of epoch $i$, LqgOpt uses these confidence sets along with the set $\mathcal{S}$ to implement the OFU principle. In particular, at time $t = 2^{i-1} T_w$, the algorithm chooses a system $\tilde{\Theta}_i = (\tilde{A}_i, \tilde{B}_i, \tilde{C}_i, \tilde{L}_i)$ from $\mathcal{C}_i \cap \mathcal{S}$ such that
$$ J(\tilde{\Theta}_i) \;\le\; \inf_{\Theta' \in \mathcal{C}_i \cap \mathcal{S}} J(\Theta') + 1/t. \qquad (5.34) $$
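The sketch below shows one way such an optimistic selection could be carried out over a finite set of candidate models, evaluating the average LQG cost through the standard discrete-time Riccati equations and the separation principle. The candidate set, cost matrices, and noise covariances `W`, `V` are placeholder assumptions (this section parametrizes models by the Kalman gain $L$ rather than by noise covariances), and the actual algorithm optimizes over the continuous set $\mathcal{C}_i \cap \mathcal{S}$.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqg_average_cost(A, B, C, Q, R, W, V):
    """Average cost J(Theta) of the optimal LQG controller for
    x_{t+1} = A x_t + B u_t + w_t, y_t = C x_t + v_t, cost y^T Q y + u^T R u,
    with Cov(w) = W, Cov(v) = V, via the separation principle."""
    Qx = C.T @ Q @ C                              # state cost induced by the output cost
    P = solve_discrete_are(A, B, Qx, R)           # control Riccati solution
    Sig = solve_discrete_are(A.T, C.T, W, V)      # steady-state prediction-error covariance
    E = C @ Sig @ C.T + V                         # innovation covariance
    L = Sig @ C.T @ np.linalg.inv(E)              # Kalman (predictor) gain
    N = A @ L @ E @ L.T @ A.T                     # noise covariance driving the state estimate
    return np.trace(Q @ V) + np.trace(Qx @ Sig) + np.trace(P @ N)

def select_optimistic_model(candidates, Q, R, W, V, t):
    """OFU step in the spirit of (5.34): among plausible models (A', B', C'),
    return one whose average cost is within 1/t of the smallest candidate cost."""
    costs = [lqg_average_cost(A, B, C, Q, R, W, V) for (A, B, C) in candidates]
    best = min(costs)
    for model, cost in zip(candidates, costs):
        if cost <= best + 1.0 / t:
            return model, cost
```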
LqgOpt then designs the optimal feedback policy $(\tilde{K}_t, \tilde{L}_t)$ for the chosen system $\tilde{\Theta}_t$ as shown in (5.9), i.e., it uses $\tilde{A}_t, \tilde{B}_t, \tilde{C}_t$, and $\tilde{L}_t$ for estimating the underlying state and deploys the feedback gain matrix $\tilde{K}_t$ to design the control inputs. This measurement feedback policy is executed until the end of the epoch, whose duration is twice that of the previous epoch. The following theorem gives the regret guarantee for LqgOpt.

Theorem 5.5 (Regret of LqgOpt with the closed-loop PE condition). Given an LQG control system $\Theta = (A, B, C)$ and regulating parameters $Q \succeq 0$ and $R \succ 0$, suppose Assumptions 5.1 and 5.2 hold such that the underlying system satisfies the PE condition with its optimal policy, i.e., $\sigma_{\min}(\mathbf{G}_{cl}) > \sigma_c > 0$. Fixing a horizon $T$, let
$$ H \ge \max\left\{ 2n+1,\; \frac{\log\big(c_H T \sqrt{T}/\sqrt{\lambda}\big)}{\log\big(1/(1-\gamma_3)\big)} \right\} \quad \text{and} \quad T_w = \mathrm{poly}\Big(H, \sigma_o, \sigma_c, \Phi_1, \Phi_2, \Phi_3, \tfrac{1}{1-\gamma_1}, \tfrac{1}{1-\gamma_2}, \tfrac{1}{1-\gamma_3}, n, m, p, d\Big). $$
Then, with high probability, the regret of LqgOpt with a warm-up duration of $T_w$ is $\mathrm{Regret}(T) = \tilde{\mathcal{O}}\big(\sqrt{T}\big)$.
The proof of this result will be presented in Section 5.4.5, with intermediate results given in Sections 5.4.2–5.4.4. Here $T_w$ is chosen to guarantee well-refined model estimates, the PE condition during the warm-up and adaptive control periods, the stability of the optimistic controllers, and the boundedness of the measurements and state estimates. The exact requirements on $T_w$ are given in the following sections with detailed expressions. Nevertheless, the warm-up duration is a fixed problem-dependent constant. This result shows that, in the challenging partially observable LQG control setting, LqgOpt achieves the same regret rate shown for LQR systems in Chapter 3. Moreover, this makes LqgOpt the first adaptive control algorithm to attain $\tilde{\mathcal{O}}\big(\sqrt{T}\big)$ regret for partially observable linear dynamical systems with convex cost. The following corollary is a direct extension of the result above and considers the case where the underlying optimal controller does not satisfy the PE condition. In this case, closed-loop system identification cannot provide reliable and consistent estimates, and LqgOpt relies solely on the warm-up period with i.i.d. Gaussian inputs, i.e., open-loop control. Therefore, throughout the adaptive control process, all the model parameter estimation errors scale as $\tilde{\mathcal{O}}(1/\sqrt{T_w})$.
Corollary 5.5.1 (Regret of LqgOpt without the closed-loop PE condition). For the system given in Theorem 5.5 with the same choices of $H$ and $T_w$, if the underlying system is not persistently excited under its optimal policy, LqgOpt incurs the following regret with high probability:
$$ \mathrm{Regret}(T) = \tilde{\mathcal{O}}\left(T_w + \frac{T - T_w}{\sqrt{T_w}}\right). $$
Therefore, the optimal regret upper bound in this setting is obtained with a warm-up duration of $T_w = \mathcal{O}(T^{2/3})$, which gives $\mathrm{Regret}(T) = \tilde{\mathcal{O}}\big(T^{2/3}\big)$ for LqgOpt.
This result shows that if the PE condition does not hold for the underlying system under its optimal controller, then LqgOpt requires a longer open-loop exploration, i.e., a longer warm-up, to compensate for the lack of estimation improvement during adaptive control.
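To see where the $T^{2/3}$ rate comes from, a short calculation under the corollary's bound (ignoring logarithmic factors and the subtracted $T_w$ in the numerator, which only affects constants): minimizing $T_w + T/\sqrt{T_w}$ over $T_w$ gives
$$ \frac{d}{dT_w}\left(T_w + \frac{T}{\sqrt{T_w}}\right) = 1 - \frac{T}{2}\, T_w^{-3/2} = 0 \;\Longrightarrow\; T_w = (T/2)^{2/3}, $$
at which point both terms are of order $T^{2/3}$, matching the stated $\tilde{\mathcal{O}}(T^{2/3})$ regret.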