
Learning and Control of Dynamical Systems

The friendly and intellectually stimulating atmosphere of our laboratory has truly been one of the most fulfilling experiences of my time at Caltech. He participated in the design of the project, the analysis of the problem and the writing of the manuscript.

LIST OF ILLUSTRATIONS

FALCON models the system dynamics as a linear map of the representation of a short (ℎ-length) history of action-measurement pairs in a concise Fourier basis learned during the warm-up phase. FALCON selects the Fourier basis vectors corresponding to the non-zero coefficients in the solution of the Lasso problem as the concise Fourier basis 𝜙(·) used throughout the epoch-based adaptive control phase for learning and controlling the system.
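As a rough illustration of this warm-up step, the sketch below selects a sparse set of Fourier features with a Lasso fit; the random-feature construction, names (select_fourier_basis, freqs), and threshold are illustrative assumptions, not FALCON's exact implementation.

    # Minimal sketch: fit next measurements as a sparse linear map of Fourier
    # features of h-length histories, then keep the active features as phi(.).
    import numpy as np
    from sklearn.linear_model import Lasso

    def select_fourier_basis(Z, Y, freqs, alpha=0.1):
        # Fourier features of each history vector: [cos(Z W'), sin(Z W')]
        F = np.hstack([np.cos(Z @ freqs.T), np.sin(Z @ freqs.T)])
        lasso = Lasso(alpha=alpha, fit_intercept=False).fit(F, Y)
        coef = np.atleast_2d(lasso.coef_)              # (dim_y, 2 * n_freqs)
        return np.any(np.abs(coef) > 1e-8, axis=0)     # mask of active features

    # toy usage: 200 histories of dimension 6 and 50 candidate frequencies
    rng = np.random.default_rng(0)
    Z = rng.standard_normal((200, 6))      # h-length action-measurement histories
    Y = rng.standard_normal((200, 2))      # next measurements
    freqs = rng.standard_normal((50, 6))   # candidate frequency vectors
    mask = select_fourier_basis(Z, Y, freqs)
    print("concise basis size:", mask.sum(), "of", 2 * len(freqs))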

INTRODUCTION

Research Questions Studied in This Thesis

“What is the amount of data required for learning and control algorithms to meet operational constraints, such as stabilizing the underlying dynamics?” This research question focuses on finite-time stabilization and safety aspects in the learning and control of dynamical systems.

Figure 1.1: Methodologies to balance the exploration vs. exploitation trade-off

Outline and Scope of the Thesis

In particular, we study the problem of adaptive control of the linear time-invariant systems given in (1.1) with the quadratic regulation cost 𝑐𝑡 = 𝑥𝑡⊤𝑄𝑥𝑡 + 𝑢𝑡⊤𝑅𝑢𝑡. This framework has been the main focus of finite-time guarantees in learning and control of dynamical systems due to its simplicity and its ability to capture the heart of the problem in terms of system identification and adaptive control synthesis.
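For reference, when (𝐴, 𝐵) are known this problem is solved by the discrete algebraic Riccati equation; the minimal sketch below (toy matrices, standard SciPy call, not from the thesis) computes the optimal gain that adaptive algorithms must compete against.

    import numpy as np
    from scipy.linalg import solve_discrete_are

    A = np.array([[1.0, 0.1], [0.0, 1.0]])
    B = np.array([[0.0], [0.1]])
    Q, R = np.eye(2), np.eye(1)

    P = solve_discrete_are(A, B, Q, R)                 # cost-to-go matrix
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # optimal policy u_t = -K x_t
    print("optimal gain K:", K)
    print("closed loop stable:", max(abs(np.linalg.eigvals(A - B @ K))) < 1)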

STOCHASTIC LINEAR BANDITS WITH PRACTICAL CONCERNS

Motivation and Background

Therefore, only the chosen action receives a supervised reward signal, while the other actions in the decision set remain unsupervised. Within the SLB framework, the majority of prior work addresses the general case, without considering possible hidden or low intrinsic-dimensional structure in the decision sets.

Stochastic Linear Bandits with Hidden Low-Rank Structure

  • Problem Formulation
  • Projected Stochastic Linear Bandits (PSLB)
  • Theoretical Analysis of PSLB
  • Experiments

Υ reflects the projection error of the final sample at the beginning of the algorithm. Recall that 𝛼 is the largest effective dimension of the true action and perturbation vectors.
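To make the projection idea concrete, the following sketch estimates the hidden subspace by an SVD of the observed actions and projects onto it; this is an illustrative stand-in for PSLB's subspace estimation step, with all names assumed.

    import numpy as np

    def estimate_projection(actions, m):
        # top-m left singular vectors of the stacked actions span an estimate
        # of the hidden low-rank subspace; project actions onto it before OFUL
        U, _, _ = np.linalg.svd(np.asarray(actions).T, full_matrices=False)
        return U[:, :m] @ U[:, :m].T          # rank-m projection matrix

    rng = np.random.default_rng(1)
    basis = rng.standard_normal((20, 3))      # true 3-dim subspace in R^20
    acts = [basis @ rng.standard_normal(3) + 0.05 * rng.standard_normal(20)
            for _ in range(100)]              # perturbed observed actions
    P = estimate_projection(acts, m=3)
    print("projection error on a sample:", np.linalg.norm(acts[0] - P @ acts[0]))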

Figure 2.1: (A) 2-D representation of the effect of increasing perturbation level in concealing the underlying subspace

Stochastic Linear Bandits with Unknown Safety Constraints

  • Problem Formulation
  • Safe-OFUL
  • Theoretical guarantees for Safe-OFUL
  • Linear Bandits with Nonlinear Constraints
  • Experiments

We consider the setting where the agent is subject to hard constraints, i.e., the agent must play actions belonging to 𝐷safe. The main innovation in the design of Safe-LinTS lies in this decoupling of the reward parameter search from the safety functions.
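A hedged sketch of this decoupling: certify safety pessimistically with the safety estimate's confidence width, then act optimistically for reward among the certified-safe actions. The symbols (mu_hat, gamma_hat, beta_r, beta_s, tau) are illustrative stand-ins for the thesis's quantities.

    import numpy as np

    def safe_optimistic_action(actions, mu_hat, V_r, gamma_hat, V_s, tau,
                               beta_r=1.0, beta_s=1.0):
        Vr_inv, Vs_inv = np.linalg.inv(V_r), np.linalg.inv(V_s)
        best, best_val = None, -np.inf
        for x in actions:
            # pessimistic certificate: worst plausible safety value must obey tau
            if gamma_hat @ x + beta_s * np.sqrt(x @ Vs_inv @ x) > tau:
                continue
            # optimistic (OFU) reward index over the certified-safe set
            ucb = mu_hat @ x + beta_r * np.sqrt(x @ Vr_inv @ x)
            if ucb > best_val:
                best, best_val = x, ucb
        return best

The design choice this illustrates is that two separate confidence ellipsoids are maintained, one for the reward parameter and one for the safety parameter, so their exploration can be balanced independently.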

Table 2.1: Comparison with prior works on safe stochastic linear bandits with Õ(√𝑇) regret

Conclusion and Future Directions

This result highlights the importance of a finely tuned exploration bonus under safety constraints for recovering the underlying reward parameter. The empirical studies validate the effectiveness of the proposed algorithms and highlight their applicability to practical decision-making scenarios. In our study of stochastic linear bandits with unknown safety constraints and local safety feedback, our primary contribution is to decouple safety exploration from reward exploration and to find the right balance that ensures efficient recovery of the safety features without over-exploring the system.

Our regret results are tight in terms of the horizon 𝑇, scaling as Õ(√𝑇), though the dependence on the remaining problem parameters may leave room for improvement.

LEARNING AND CONTROL IN LINEAR QUADRATIC REGULATOR (LQR)

Motivation and Background

Certainty Equivalent Control: Certainty equivalent control (CEC) is one of the simplest paradigms for control design in the adaptive control of dynamical systems. In CEC, an agent obtains a nominal estimate of the system and executes the optimal control law for this estimated system (Figure 1.1). An agent following the OFU principle instead deploys the optimal policy of the model with the lowest optimal cost within the set of plausible models (Figure 1.1).

In this work, unlike the controllable LQR setting of previous adaptive control algorithms without a stabilizing controller [2, 56], we study the online LQR problem in the more general setting of stabilizable LQR.

Optimism-Based Adaptive Control

  • StabL
  • Theoretical Analysis of StabL
  • Experiments

StabL efficiently excites and explores all dimensions of the system through this improved exploration strategy (Theorem 3.1). The duration of the adaptive control with enhanced exploration phase is chosen so that StabL quickly finds a stabilizing controller. Using the early-stage isotropic exploration, we derive a new lower bound on the smallest eigenvalue of the design matrix 𝑉𝑡 in the stabilizable LQR setting with sub-Gaussian noise.
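The following toy simulation illustrates the phenomenon behind this bound: injecting i.i.d. Gaussian excitation on top of the control input makes the smallest eigenvalue of 𝑉𝑡 grow with 𝑡. The dynamics, gain, and noise scales are illustrative assumptions, not the thesis's experimental setup.

    import numpy as np

    rng = np.random.default_rng(2)
    A = np.array([[1.0, 0.2], [0.0, 0.9]])
    B = np.array([[0.0], [1.0]])
    K = np.array([[0.5, 0.8]])          # a stabilizing gain for this toy system

    x = np.zeros(2)
    V = 1e-3 * np.eye(3)                # regularized design matrix over z = (x, u)
    for t in range(1, 501):
        u = -K @ x + rng.standard_normal(1)   # control input + isotropic excitation
        z = np.concatenate([x, u])
        V += np.outer(z, z)                   # V_t = sum_s z_s z_s'
        x = A @ x + B @ u + 0.1 * rng.standard_normal(2)
        if t % 100 == 0:
            print(t, np.linalg.eigvalsh(V)[0])  # smallest eigenvalue grows with t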

In particular, CEC-Fix drastically inflates the state due to the significantly unstable dynamics of the uncontrollable part of the system.

Figure 3.1: Evolution of the smallest eigenvalue of the design matrix for StabL and OFULQ in the Laplacian system

Thompson Sampling-Based Adaptive Control

  • TSAC
  • Theoretical Analysis of TSAC
  • Proof Outline of Sampling Optimistic Models with Constant Probability

In this section, we provide the precise statement that the probability of sampling an optimistic model is lower bounded by a constant.

Due to the lack of an a priori known stabilizing controller, TSAC focuses on rapidly learning a stabilizing controller in the early stages of the algorithm. The difficulty in analyzing the probability of sampling optimistic parameters lies in the challenging characterization of the optimistic set. In the proof, we show that the input 𝑢𝑡 = 𝐾(Θ̃𝑖)𝑥𝑡 + 𝜈𝑡 with 𝜈𝑡 ∼ N(0, 2𝜅²𝜎𝑤²𝐼) guarantees that the design matrix 𝑉𝑡 scales linearly over time.

In the second step, we reformulate the probability of sampling optimistic parameters in terms of the closed-loop system matrix 𝐴̃𝑐 ≔ Θ̃⊤𝐻∗ = 𝐴̃ + 𝐵̃𝐾(Θ∗) of the sampled system Θ̃ = (𝐴̃, 𝐵̃)⊤ under the policy 𝐾(Θ∗).

Figure 3.3: A visual representation of the sublevel manifold M∗. 𝑂 is the origin and 𝐴𝑐,∗ is the optimal closed-loop system matrix

Conclusion and Future Directions

Even though we provide some preliminary results in Chapter 5 that rely on the persistence of excitation, the general question of applying TS in the measurement-feedback setting is still open. Furthermore, obtaining a constant probability of sampling optimistic parameters for general LQRs requires TSAC to spend 𝑇𝑤 = 𝜔(√𝑇 log𝑇) time steps in enhanced exploration (Theorem 3.4), causing the regret to be dominated by this phase. This lengthy exploration is avoided in LQRs with a non-singular optimal closed-loop matrix, leading to regret that scales polynomially in the system dimensions (Theorem 3.3).

Whether this polynomial dimension dependence in regret can be achieved via TS in general LQRs remains an open problem.

LEARNING AND CONTROL IN LINEAR TIME-VARYING SYSTEMS

Related Work and Background

In the control of LTI systems, the linear quadratic regulator (LQR) has been studied in detail. In the classical setting where the underlying system is known, the optimal control law is given by a linear feedback controller obtained by solving the Riccati equations [28]. As in the LTI case, the optimal control of LTV systems, when the sequence of system parameters is known, can be obtained by solving the backward Riccati equations [28].
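For concreteness, the backward Riccati recursion referred to above is sketched below for a known LTV system; the matrices in the usage example are illustrative.

    import numpy as np

    def backward_riccati(As, Bs, Q, R, QT):
        # gains K_0..K_{T-1} for x_{t+1} = A_t x_t + B_t u_t, cost x'Qx + u'Ru
        P = QT
        Ks = [None] * len(As)
        for t in reversed(range(len(As))):
            A, B = As[t], Bs[t]
            Ks[t] = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # u_t = -K_t x_t
            P = Q + A.T @ P @ (A - B @ Ks[t])                      # Riccati recursion
        return Ks

    rng = np.random.default_rng(3)
    As = [np.eye(2) + 0.1 * rng.standard_normal((2, 2)) for _ in range(50)]
    Bs = [np.array([[0.0], [1.0]])] * 50
    Ks = backward_riccati(As, Bs, np.eye(2), np.eye(1), np.eye(2))
    print(Ks[0])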

However, in the online setting, where the sequence of systems is unknown, controller design is challenging.

Stable Online Control of Linear Time-Varying Systems

  • Problem Setting
  • Naive Approach
  • Main Result
  • Experiments
  • Conclusion

∥𝐴∥ denotes its spectral norm, 𝐴⊤ its transpose, Tr(𝐴) its trace, and 𝜌(𝐴) its spectral radius, i.e., the largest absolute value of its eigenvalues. The main idea of COCO-LQ is to enforce stability through a state covariance constraint embedded in the SDP framework. Finally, we test the performance of COCO-LQ for online control of linear time-varying systems derived from local linearizations of nonlinear systems.
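A hedged sketch of one way to encode such a stability-enforcing covariance constraint as an LMI inside an SDP, in the spirit of COCO-LQ: with S the steady-state covariance and Y = KS, the Schur complement below encodes S ⪰ (A+BK)S(A+BK)⊤ + W, which forces A+BK to be stable. This is a standard reformulation, not necessarily the exact program used in the thesis.

    import cvxpy as cp
    import numpy as np

    def covariance_constrained_lq(A, B, Q, R, W):
        n, m = B.shape
        S = cp.Variable((n, n), symmetric=True)   # steady-state covariance
        Y = cp.Variable((m, n))                   # Y = K S
        X = cp.Variable((m, m), symmetric=True)   # epigraph for the input cost
        # Schur complement of S >= (A S + B Y) S^{-1} (A S + B Y)' + W
        lyap = cp.bmat([[S - W, A @ S + B @ Y],
                        [(A @ S + B @ Y).T, S]])
        cost_lmi = cp.bmat([[X, Y], [Y.T, S]])    # X >= Y S^{-1} Y'
        prob = cp.Problem(cp.Minimize(cp.trace(Q @ S) + cp.trace(R @ X)),
                          [lyap >> 0, cost_lmi >> 0, S >> 1e-6 * np.eye(n)])
        prob.solve(solver=cp.SCS)
        return Y.value @ np.linalg.inv(S.value)   # recover the gain K = Y S^{-1}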

In addition, COCO-LQ's performance is very close to that of the offline optimal controller, while the system frequency deviates under the optimal 𝐻-horizon baseline control.

Figure 4.1: Performance comparison of COCO-LQ and naive LQ control on synthetic systems given in (a) and (b)

Stability and Identification of Random Asynchronous LTI Systems

In this section, we introduce random asynchronous LTI systems, a new class of LTI systems in which the state variables are updated asynchronously and at random times.

  • Problem Setting
  • Stability of Random Asynchronous LTI Systems
  • Randomized LTI Systems
  • System Identification for Randomized LTI Systems
  • Conclusion

Since random asynchronous LTI systems exhibit stochastic behavior, the stability of the state vector must be considered in a statistical sense. To numerically investigate the mean-square stability of the random asynchronous system, we study it from the Markov jump linear system perspective. Stability of the matrix A is necessary, but not sufficient, for stability of the system.
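A toy numerical check of this statement, under the illustrative assumption that each state coordinate updates independently with probability p per step: the system is mean-square stable iff the spectral radius of E[𝐴𝑘 ⊗ 𝐴𝑘] is below one, where ⊗ is the Kronecker product.

    import numpy as np
    from itertools import product

    def ms_stability_radius(A, p):
        n = A.shape[0]
        S = np.zeros((n * n, n * n))
        for mask in product([0, 1], repeat=n):   # which coordinates update
            D = np.diag(mask).astype(float)
            Ak = D @ A + (np.eye(n) - D)         # updated rows, frozen rest
            prob = np.prod([p if m else 1 - p for m in mask])
            S += prob * np.kron(Ak, Ak)          # E[A_k (x) A_k]
        return max(abs(np.linalg.eigvals(S)))

    A = np.array([[0.4, 0.9], [-0.9, 0.4]])      # rho(A) < 1, i.e., A is stable
    print([round(ms_stability_radius(A, p), 3) for p in (0.2, 0.5, 1.0)])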

The figure depicts the consistency and efficiency of the proposed system identification method for randomized LTI systems.

Figure 4.4: The spectral radius of Sℎ, which represents the mean-square stability of the random asynchronous systems with ℎ = 2 and (a) unstable A1 and (b) stable A2

LEARNING AND CONTROL IN PARTIALLY OBSERVABLE LINEAR DYNAMICAL SYSTEMS

Preliminaries

  • Notation
  • Partially Observable Linear Dynamical Systems
  • Control Problem Under Quadratic Costs
  • Regret

One of the most common representations is the innovations form of the system. (For simplicity, all system representations are given for the steady state of the system.) Using the relation between 𝑒𝑡 and 𝑦𝑡, we obtain the following characterization of the system Θ, known as the predictor form of the system.
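For concreteness, a standard statement of these two forms in common notation (a sketch; the thesis's equations may differ in conventions such as the steady-state Kalman gain 𝐿):

    % Innovations form: the filter state driven by the innovations e_t
    \hat{x}_{t+1} = A \hat{x}_t + B u_t + L e_t, \qquad y_t = C \hat{x}_t + e_t
    % Predictor form: substituting e_t = y_t - C \hat{x}_t
    \hat{x}_{t+1} = (A - LC)\hat{x}_t + B u_t + L y_t, \qquad y_t = C \hat{x}_t + e_t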

The first one is the canonical quadratic cost on the inputs and outputs of the LQG control system.

Open-Loop System Identification

It is worth repeating that the dimension of the latent state, 𝑛, is the order of the system for observable and controllable dynamics. Using the outputs of the Ho-Kalman algorithm, i.e., (𝐴̂, 𝐵̂, 𝐶̂), we can construct confidence sets centered around these outputs that contain a similarity transformation of the system parameters Θ = (𝐴, 𝐵, 𝐶) with high probability. The difference in presentation arises from providing a different characterization of the dependence of ∥N̂ − N∥ and from centering the confidence ball on the estimates rather than on the output of the Ho-Kalman algorithm with input 𝐺̂.
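A minimal sketch of the Ho-Kalman idea: stack the (estimated) Markov parameters 𝐺𝑘 ≈ 𝐶𝐴ᵏ𝐵 into a Hankel matrix, take a rank-𝑛 SVD, and read off a realization of (𝐴, 𝐵, 𝐶) up to similarity. The shapes and splits below are the textbook choices and may differ from the thesis's exact variant.

    import numpy as np

    def ho_kalman(G, n, d_y, d_u):
        # G[k] ~ C A^k B for k = 0, ..., 2T - 1 (estimated Markov parameters)
        T = len(G) // 2
        H = np.block([[G[i + j] for j in range(T + 1)] for i in range(T)])
        Hm = H[:, :d_u * T]              # first T block columns:  O_T @ Q_T
        Hp = H[:, d_u:]                  # shifted block columns:  O_T @ A @ Q_T
        U, s, Vt = np.linalg.svd(Hm, full_matrices=False)
        O = U[:, :n] * np.sqrt(s[:n])    # observability-like factor
        Q = np.sqrt(s[:n])[:, None] * Vt[:n]
        return (np.linalg.pinv(O) @ Hp @ np.linalg.pinv(Q),  # A_hat
                Q[:, :d_u],                                  # B_hat
                O[:d_y])                                     # C_hat

    # quick check on a toy observable and controllable system
    A = np.array([[0.9, 0.1], [0.0, 0.8]])
    B = np.array([[1.0], [1.0]]); C = np.eye(2)
    G = [C @ np.linalg.matrix_power(A, k) @ B for k in range(8)]
    A_hat, B_hat, C_hat = ho_kalman(G, n=2, d_y=2, d_u=1)
    print(np.linalg.eigvals(A_hat))   # eigenvalues match A's (0.9, 0.8)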

In the following section, we provide a closed-loop system identification algorithm that alleviates the correlations in the covariates and noise sequences by considering the predictor form of the system dynamics (5.7) instead of the state space form (5.1).

A Novel Closed-Loop System Identification Method

  • Proof of Theorem 5.2

Note that the noise 𝑒𝑡 and the covariates 𝜙 of the estimation problem in (5.21) are independent, so the dependencies present in previous least-squares methods (5.15) are relaxed. Note that 𝐿̂ is an estimate of the optimal Kalman gain 𝐿 given in (5.4) for LQG control systems; our new estimation method allows us to estimate it, which will also be useful for control design in the upcoming sections. In the following, we show that there exists a positive 𝜎𝑜 such that 𝜎𝑜 < 𝜎min(G𝑜𝑙), i.e., G𝑜𝑙 is full row rank.
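A hedged sketch of this regression idea: in the predictor form, stacking ℎ past input-output pairs into a covariate 𝜙𝑡 makes 𝑦𝑡 a linear function of 𝜙𝑡 plus the independent innovation 𝑒𝑡, so regularized least squares applies. The history length h and regularizer lam are illustrative names.

    import numpy as np

    def closed_loop_rls(us, ys, h, lam=1.0):
        Phi, Y = [], []
        for t in range(h, len(ys)):
            past = np.concatenate([np.concatenate([us[t - k], ys[t - k]])
                                   for k in range(1, h + 1)])
            Phi.append(past)
            Y.append(ys[t])
        Phi, Y = np.array(Phi), np.array(Y)
        # regularized least squares: (Phi'Phi + lam I) M' = Phi' Y
        M = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]),
                            Phi.T @ Y).T
        return M   # rows stack the predictor-form Markov parameter estimates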

This result verifies that the persistence of excitation (PE) condition holds in the open-loop control setting, showing that the estimation error guarantees of Theorem 5.2 hold for open-loop data collection.

Optimism-Based Adaptive Control

  • Adaptive Control via LqgOpt
  • PE Condition in the Closed-Loop Setting
  • Boundedness of the Output and State Estimation
  • Regret Decomposition
  • Regret Analysis and Proofs of Theorem 5.5 and Corollary 5.5.1
  • Extensions to the ARX systems

We will formally define these statements and the duration of the warm-up period shortly. In Section 5.4.5, we show that the warm-up phase regret scales linearly, as expected, due to the i.i.d. excitation inputs. Inaccuracies in the system parameter estimates affect both the optimal feedback gain synthesis and the underlying state estimation.

Next, we consider the case where the underlying system and its optimal controller do not satisfy the PE condition, i.e., Assumption 5.2, so that LqgOpt must rely solely on the warm-up period for exploration.

Thompson Sampling-Based Adaptive Control

  • TSPO
  • Algorithmic Guarantees

To do so, we first show that the regret of a fixed TS policy grows linearly over time with the estimation error in the model parameters. In the first step, TSPO uses the regularized least squares (RLS) subroutine of (5.21) to perform closed-loop model estimation from the collected input-output data. In the remaining period, TSPO uses the optimal control policy of the sampled model Θ̃𝑡, given as 𝑢𝑡 = −𝐾̃𝑡𝑥̂𝑡.
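A hedged sketch of the sampling step this describes: draw a model from a Gaussian centered at the RLS estimate with covariance shaped by the inverse design matrix. All names are illustrative stand-ins for TSPO's internal quantities, and a real implementation would also certify stabilizability of the sample before using its optimal gain.

    import numpy as np

    def sample_model(theta_hat, V, rng, scale=1.0):
        # Theta_tilde ~ N(theta_hat, scale * V^{-1}), applied column-wise
        L = np.linalg.cholesky(scale * np.linalg.inv(V))
        return theta_hat + L @ rng.standard_normal(theta_hat.shape)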

The second regret term in the LqgOpt analysis is non-positive due to the optimistic model selection.

