Learning-Augmented Control and Decision-Making

Academic year: 2023

We then study the problem of equipping a black-box control policy with model-based advice for nonlinear control on a single trajectory. Furthermore, under bounded nonlinearity, we show that the adaptive 𝜆-confident policy achieves a bounded competitive ratio when the black-box policy is near-optimal.

LIST OF ILLUSTRATIONS

In the latter case, there are no assumptions about the random IID selection of the mentions of Y (unlike the star/chain networks). The black vertical line indicates the best-fitting location in s2 for x1, and the tail x2 can be found accordingly despite the small difference between both tails. Figure 7.6.1: The performance for different numbers of clusters.

INTRODUCTION

  • Key Problems and Challenges
  • Related Work on Learning-Augmented Online Algorithms
  • Learning-Augmented Control
  • Large-Scale Learning-Augmented Decision-Making
  • Learning, Inference, and Data Analysis in Smart Grids
  • Dissertation Outline

Achieving sustainable development is one of the most important and challenging goals of this century. Note that the generality of the models increases from top to bottom, as we will illustrate in table 8.1.1 in chapter 8.

Table 1.1: List of learning-augmented algorithms.

Learning-Augmented Control

LINEAR QUADRATIC CONTROL WITH UNTRUSTED AI PREDICTIONS

Introduction

Our main result proves that the self-tuning policy maintains a competitive ratio that is always bounded, regardless of the prediction error (Theorem 2.4.1). For example, the regret analysis of the Follow the Optimal Steady State (FOSS) method in [39] contains a similar term.

Model

In Section 2.5 we show experimentally that the competitive ratio (with a fixed trust parameter 𝜆, defined in Section 2.3) grows linearly in the prediction error defined in (2.2). At the other extreme, a natural way to be robust is to ignore unreliable predictions entirely, that is, not to trust the predictions at all.
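
As a rough illustration of the two extremes, here is a minimal sketch of a fixed-trust interpolation between a robust (prediction-free) action and a prediction-aware action; the placeholder actions are assumptions, and the dissertation derives the exact 𝜆-confident action from the LQR structure rather than from this generic combination:

```python
import numpy as np

# Minimal sketch of a fixed trust parameter: interpolate between a robust
# (prediction-free) action and a prediction-aware action. The candidate actions
# below are placeholders, not the dissertation's closed-form LQR expressions.
def lambda_confident_action(u_robust, u_prediction, lam):
    """lam = 1: fully trust the predictions; lam = 0: ignore them entirely."""
    return lam * u_prediction + (1.0 - lam) * u_robust

u_rob = np.array([-0.20])     # e.g., certainty-equivalent action ignoring predicted disturbances
u_pred = np.array([-0.45])    # e.g., action that plans against the predicted disturbances
for lam in (0.0, 0.5, 1.0):
    print(lam, lambda_confident_action(u_rob, u_pred, lam))
```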

Figure 2.1: System model of linear quadratic control with untrusted predictions.

Consistent and Robust Control

In the next section, we discuss our proposed approach, which achieves both consistency and robustness. Furthermore, the term OPT in (2.6) is common in the robustness and consistency analysis of online algorithms.

Self-Tuning 𝜆-Confident Control

The goal of the self-tuning algorithm is to converge to the optimal confidence parameter 𝜆∗ for the problem instance. Let ALG(𝜆0, . . . , 𝜆𝑇−1) be the algorithmic cost with adaptively chosen confidence parameters 𝜆0, . . . , 𝜆𝑇−1. Specifically, the simulated competitive ratio of self-tuning control (Algorithm 3) is a nonlinear combination of the simulated competitive ratios of 𝜆-confident control with fixed confidence parameters and, as a function of the prediction error 𝜀, matches the implied upper bound on the competitive ratio, 1+𝑂(𝜀)/(Θ(1)+Θ(𝜀)).
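
One simple way to picture such an adaptive choice is a follow-the-leader style update over a grid of candidate confidence parameters; the sketch below is only a stand-in for Algorithm 3 (which uses its own closed-form update), and the hindsight cost model is fabricated for illustration:

```python
import numpy as np

def self_tuning_lambdas(hindsight_cost, T, grid=np.linspace(0.0, 1.0, 21)):
    """hindsight_cost(s, lam): cost the lam-confident action would have incurred at step s."""
    lambdas, cum = [], np.zeros_like(grid)
    for t in range(T):
        lambdas.append(grid[int(np.argmin(cum))] if t > 0 else 1.0)  # start by trusting predictions
        cum += np.array([hindsight_cost(t, lam) for lam in grid])    # account for step t in hindsight
    return lambdas

# toy stand-in: predictions are accurate early on, then degrade
hindsight_cost = lambda s, lam: (lam - (1.0 if s < 25 else 0.2)) ** 2
print(np.round(self_tuning_lambdas(hindsight_cost, T=50)[::10], 2))
```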

Robot Tracking

In Figure 2.2a, we observe that the tracking trajectory generated by the self-tuning scheme converges to the unknown trajectory (𝑦1, . . . , 𝑦𝑇) regardless of the level of prediction error. The right subfigure in Figure 2.3 shows the performance of self-tuning for the case when the prediction error is high.

Figure 2.2: Tracking trajectories and trust parameters (𝜆0, . . . , 𝜆𝑇−1) of the self-tuning control scheme

Adaptive Battery-Buffered EV Charging

In Figures 2.6a and 2.6b, we show the performance of self-tuning control and the influence of trust parameters for adaptive EV charging. Furthermore, for the self-tuning scheme in both Figure 2.3 and Figure 2.6a, we observe a competitive ratio of 1+𝑂(𝜀)/(Θ(1)+Θ(𝜀)), which matches the competitive ratio bound given in Theorem 2.4.1 in the order sense (in 𝜀). (a) Experiments with synthetic EV charging data.

Cart-Pole

Taking sin 𝜃 ≈ 𝜃 and cos 𝜃 ≈ 1 and ignoring higher-order terms, the dynamics of the Cart-Pole problem can be linearized about the upright equilibrium. We show the performance of self-tuning control (Algorithm 3) and the impact of confidence parameters for the Cart-Pole problem in Figure 2.8, together with the 𝜆-confident control scheme of Algorithm 2 for multiple fixed confidence parameters 𝜆.
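
For reference, a sketch of the resulting linearized dynamics for the standard frictionless cart-pole model with state (x, ẋ, 𝜃, 𝜃̇); the physical parameters and the forward-Euler discretization step are illustrative assumptions, not the dissertation's exact setup:

```python
import numpy as np

# Continuous-time linearization of the frictionless cart-pole about the upright
# equilibrium (sin(theta) ~ theta, cos(theta) ~ 1). Parameter values are assumed.
m_c, m_p, l, g, dt = 1.0, 0.1, 0.5, 9.8, 0.02

A_c = np.array([[0, 1, 0,                           0],
                [0, 0, -m_p * g / m_c,              0],
                [0, 0, 0,                           1],
                [0, 0, (m_c + m_p) * g / (m_c * l), 0]])
B_c = np.array([[0], [1 / m_c], [0], [-1 / (m_c * l)]])

# forward-Euler discretization: x_{t+1} = A x_t + B u_t
A = np.eye(4) + dt * A_c
B = dt * B_c
print(A.round(4), B.round(4), sep="\n")
```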

Figure 2.8: Impact of trust parameters and performance of self-tuning control for the Cart-Pole problem.

APPENDIX

The proof is completed by noting that when the prediction error is zero, ŵ𝑡 = 𝑤𝑡 for all 𝑡 = 0, . . . , 𝑇−1. For the battery-buffered EV charging case, we set 𝑒𝑡 = 𝑋, where 𝑋 ∼ 𝑁(0, 𝜎²) is a normal random variable with zero mean and 𝜎² is a variance that can be varied to generate varying prediction errors.
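
A minimal sketch of this prediction-error generation (variable and function names are placeholders, not the dissertation's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_predictions(w, sigma):
    """Perturb true disturbances w_t with IID errors e_t = X, X ~ N(0, sigma^2)."""
    return np.asarray(w, float) + rng.normal(0.0, sigma, size=len(w))

w_true = np.sin(np.linspace(0, 4 * np.pi, 50))        # a toy true disturbance sequence
for sigma in (0.0, 0.1, 0.5, 1.0):                    # vary sigma to vary the prediction error
    w_hat = noisy_predictions(w_true, sigma)
    eps = float(np.sum((w_hat - w_true) ** 2))        # one common aggregate error measure
    print(f"sigma={sigma:>4}: prediction error eps={eps:8.3f}")
```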

Table 2.E.1: Hyper-parameters used in robot tracking and EV charging.

NON-LINEAR CONTROL WITH BLACK-BOX AI POLICIES

Introduction

This negative result highlights the challenges associated with combining model-based and model-free approaches. Next, we present a general policy that adaptively combines a model-free black-box policy with model-based advice (Algorithm 4). This work adds to the recent literature that seeks to combine model-free and model-based policies for online control.

Background and Model

To solve the nonlinear control problem in (3.3), we utilize both model-free and model-based approaches. The model-free policy is treated as a "black box", the details of which are not the main focus of this work. The performance of the model-free policy is not guaranteed and it may make errors, characterized by the following definition, which compares 𝜋̂ to a clairvoyant optimal controller 𝜋∗.

  • Warmup: A Naive Convex Combination
  • Adaptive 𝜆-Confident Control
  • Practical Implementation and Applications: Learning Confidence Coefficients Online
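
As a concrete illustration of the warmup item above, here is a minimal sketch of a naive fixed convex combination of a black-box action and model-based advice; the linear advice, the stand-in black-box policy, and all numerical values are assumptions for illustration only:

```python
import numpy as np

# Warmup sketch: fixed convex combination of model-based advice and a black-box action.
# The black-box policy is treated as an opaque callable (e.g., a pre-trained RL agent).
def combined_action(x, blackbox_policy, K_advice, lam):
    """u_t = lam * (model-based advice) + (1 - lam) * (black-box action)."""
    u_model = -K_advice @ x              # model-based advice (assumed linear feedback)
    u_blackbox = blackbox_policy(x)      # opaque pre-trained policy
    return lam * u_model + (1.0 - lam) * u_blackbox

blackbox_policy = lambda x: np.tanh(np.array([-0.8, -1.1]) @ x)   # stand-in "black box"
K_advice = np.array([[1.2, 0.9]])
x_t = np.array([0.4, -0.2])
print(combined_action(x_t, blackbox_policy, K_advice, lam=0.5))
```

Theorem 3.3.1 below shows why such a fixed combination can fail, motivating the adaptive confidence coefficients.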

For any 𝜆 ∈ (0,1) and any linear controller 𝐾1 with 𝐴−𝐵𝐾1 ≠ 0, there exists a linear controller 𝐾2 that stabilizes the system such that their convex combination 𝜆𝐾2+(1−𝜆)𝐾1 is unstable, i.e., the spectral radius satisfies 𝜌(𝐴−𝐵(𝜆𝐾2+(1−𝜆)𝐾1)) > 1. The negative result in Theorem 3.3.1 emphasizes the importance of the adaptive nature of the confidence coefficients. We denote by COST𝑡(𝑥𝑡; 𝜂, 𝑤) the final cost at time 𝑡 given state 𝑥𝑡, with fixed operating disturbances 𝜂 ≔ (𝜂𝑡 : 𝑡 ≥ 0) and state disturbances 𝑤 ≔ (𝑤𝑡 : 𝑡 ≥ 0).
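
The instability phenomenon in Theorem 3.3.1 is easy to reproduce numerically. The sketch below uses a toy system and two hand-picked deadbeat gains (all matrices are illustrative assumptions, not the construction used in the proof); each gain is stabilizing on its own, yet fixed convex combinations over a wide middle range of 𝜆 are unstable:

```python
import numpy as np

def spectral_radius(M):
    return float(max(abs(np.linalg.eigvals(M))))

# Toy example: two individually stabilizing gains whose convex combination
# destabilizes the closed loop A - B (lam*K2 + (1-lam)*K1).
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.eye(2)

K1 = np.array([[1.0, -2.0],   # A - B K1 = [[0,3],[0,0]] : deadbeat, rho = 0
               [0.0,  1.0]])
K2 = np.array([[ 1.0, 1.0],   # A - B K2 = [[0,0],[3,0]] : deadbeat, rho = 0
               [-3.0, 1.0]])

for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    K = lam * K2 + (1.0 - lam) * K1
    rho = spectral_radius(A - B @ K)
    print(f"lambda={lam:4.2f}  rho={rho:5.2f}  ({'unstable' if rho > 1 else 'stable'})")
```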

Figure 3.5.1: Competitiveness and stability of the adaptive policy. Top (competitiveness): costs of pre-trained RL agents, an LQR and the adaptive policy when the initial pole angle 𝜃 (unit: radians) varies

Learning-Augmented Decision-Making

LEARNING-BASED PREDICTIVE CONTROL: FORMULATION

Introduction

The challenge and importance of designing flexibility feedback signals has given rise to a rich literature. However, the rapidly changing environment and the uncertainties of the DERs require real-time flexibility feedback. We introduce a model of the real-time closed-loop control system formed by a system operator and an aggregator.

Related Literature

Within this model we define the "optimal" real-time flexibility response as the solution to an optimization problem that maximizes the entropy of the response vector. Flexibility in this work was defined as an estimate of the proportion of loads that work. Assessing and increasing overall flexibility is often considered independent of operational objectives.

Problem Formulation

Knowing the flexibility feedback 𝑝𝑡 from the aggregator, the operator dispatches an action 𝑢𝑡, chosen from the action set U, to the aggregator at each time 𝑡 ∈ [𝑇]. The system operator's goal is to deliver actions 𝑢𝑡 ∈ U at times 𝑡 ∈ [𝑇] to the aggregator so as to minimize the cumulative system cost 𝐶𝑇(𝑢1, . . . , 𝑢𝑇) ≔ Σ𝑡∈[𝑇] 𝑐𝑡(𝑢𝑡). The aggregator is not required to know the system operator's objective in (4.2a); it only receives the action 𝑢𝑡 from the operator at each time 𝑡 ∈ [𝑇].

Figure 4.3.1: System model: A feedback control approach for solving an online version of (4.2)

Definitions of Real-Time Aggregate Flexibility: Maximum Entropy Feedback

The intuition behind our proposal is to use the conditional probability 𝑝𝑡(𝑢𝑡|𝑢<𝑡) to measure the aggregator's resulting future flexibility, as seen by the system operator. The sum of the conditional entropies of the 𝑝𝑡 is therefore a measure of how informative the overall feedback is. As a result, the size |S(𝑢≤𝑡)| of the set of subsequent feasible trajectories is a measure of future flexibility conditioned on 𝑢≤𝑡.
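
This counting intuition can be made concrete on a toy feasibility set: assign each next action a probability proportional to the number of feasible completions it leaves. The action set, horizon, and demand constraint below are assumptions for illustration; the dissertation's MEF is defined for general feasibility sets:

```python
from itertools import product

# Toy feasibility set (assumed): actions u_t in {0, 1, 2}, horizon T, and a trajectory
# (u_1, ..., u_T) is feasible iff its entries sum to DEMAND (e.g., total energy to deliver).
ACTIONS, T, DEMAND = (0, 1, 2), 4, 5

def num_feasible_continuations(prefix):
    """|S(u_<=t)|: number of feasible completions of the given action prefix."""
    remaining, need = T - len(prefix), DEMAND - sum(prefix)
    return sum(1 for tail in product(ACTIONS, repeat=remaining) if sum(tail) == need)

def feedback(prefix):
    """p_t(u | u_<t): probability proportional to the feasible continuations after playing u."""
    counts = {u: num_feasible_continuations(prefix + (u,)) for u in ACTIONS}
    total = sum(counts.values())
    return {u: c / total for u, c in counts.items()} if total else counts

print(feedback(()))       # feedback before any action is taken
print(feedback((2, 2)))   # feedback after actions u_1 = 2, u_2 = 2
```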

Figure 4.4.1: Feasible trajectories of power signals and the computed maximum entropy feedback in Example 3.

Approximating Maximum Entropy Feedback via Reinforcement Learning

For real-world applications, computing the maximum entropy feedback (MEF) exactly can be computationally expensive, which motivates approximating it with reinforcement learning.

The reward function is independent of the cost functions, which are synthetic costs in the training phase. In the EV charging case described in Section 4.3, this means that the EV batteries are fully charged. In the offline learning process presented in this section, we use a single function to generate feedback.

Figure 4.5.1: Learning and testing architecture for learning aggregator functions.

Penalized Predictive Control

The MEF, as a feedback function, contains only limited information about the dynamical system on the local controller's side. In the next section, we use an EV charging example to verify the effectiveness of the proposed method. We also assume that the measure of the set of feasible actions, 𝜇(S(𝑢≤𝑡)), is differentiable and strictly logarithmically concave with respect to the sequence of actions 𝑢≤𝑡 = (𝑢1, . . . , 𝑢𝑡), which is common in practice, e.g., it holds for inventory-type constraints on the total action Σ𝑡∈[𝑇] 𝑢𝑡.
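
To illustrate the penalty idea, here is a minimal sketch of a PPC-style action rule on a discretized action set: the controller trades off its own cost against the logarithm of the flexibility feedback. The cost function, the feedback shape, and the weight 𝛽 are placeholders rather than the dissertation's exact formulation:

```python
import numpy as np

def ppc_action(actions, cost_fn, p_t, beta=1.0, eps=1e-12):
    """Pick u_t minimizing c_t(u) - beta * log p_t(u): low cost while preserving flexibility."""
    penalized = [cost_fn(u) - beta * np.log(p_t(u) + eps) for u in actions]
    return actions[int(np.argmin(penalized))]

actions = np.linspace(0.0, 2.0, 21)                  # discretized action set U (assumed)
cost_fn = lambda u: (u - 1.7) ** 2                   # operator's per-step cost (assumed)
p_t = lambda u: np.exp(-((u - 0.8) ** 2) / 0.2)      # unnormalized flexibility feedback (assumed)

print("chosen u_t:", ppc_action(actions, cost_fn, p_t, beta=0.5))
```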

Application

For the experimental results presented in this section, the control state space is X = ℝ+^(2×𝑊), where 𝑊 is the total number of charging stations, and the state vector for each charging station 𝑗 is (𝑒𝑡, [𝑑(𝑗) − 𝑡]+), i.e., the remaining energy to be charged and the time remaining until departure. In particular, the proposed scheme achieves a lower cost compared to the commonly used MPC scheme. The offline optimal charging rates are also provided. Denoting by 𝑐′ and 𝑓′ the corresponding derivatives of a given cost function 𝑐 and the function 𝑓 defined in (4.19), we have.
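
A small sketch of how such a per-station state could be assembled (the station data below are made up):

```python
import numpy as np

# Per-station EV charging state (e_t, [d(j) - t]_+): remaining energy to deliver
# and time left before departure; W stations stacked into a 2 x W array.
def station_state(t, remaining_energy, departure_time):
    return np.array([remaining_energy, max(departure_time - t, 0)])

stations = [(3.2, 12), (1.0, 5), (7.5, 20)]          # (remaining_energy, departure_time), assumed
x_t = np.column_stack([station_state(t=4, remaining_energy=e, departure_time=d)
                       for e, d in stations])
print(x_t)   # shape (2, W)
```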

Figure 4.7.1: Trade-offs of cost and charging performance. The dashed curve in the left figure corresponds to offline optimal cost

LEARNING-BASED PREDICTIVE CONTROL: REGRET ANALYSIS

Introduction

To briefly mention some of the results, in [105], with bandit feedback, the dynamic regret is bounded sublinearly in 𝑇 (in terms of 𝑇 and Δ(𝑇)) and the constraint violations are bounded by 𝑂(𝑇^(3/4)Δ(𝑇)^(1/4)), where Δ(𝑇) is the "drift" of the sequence of optimal actions. Indeed, this holds for the case when the offline constraints on the actions u = (𝑢1, . . . , 𝑢𝑇) couple the whole sequence through a sum over 𝑡 ∈ [𝑇].

Figure 5.2.1: Closed-loop interaction between a central controller and a local controller.

Information Aggregation

At each time 𝑡 ∈ [𝑇], the local controller is allowed to send to the central controller a feedback (density) function whose domain is the action space U, denoted by 𝑝𝑡(·) : U → [0,1]. Furthermore, in certain applications the action space is discrete, e.g., the EV charging application described in Section 4.3, so 𝑝𝑡 becomes a probability vector whose length is much smaller than that of the state vector. To be more precise, at every time 𝑡 ∈ [𝑇], the local controller sends the central controller a (conditional) density function 𝑝𝑡(·|u<𝑡) : U → [0,1], conditioned on the previous actions u<𝑡.

Penalized Predictive Control via Predicted MEF

This leads to computational problems; however, learning techniques can be used to generate MEFs, as described in Section 4.5 of Chapter 4. The problem is reformulated using the two-controller model described in Section 5.2 and generalized to account for the feasibility prediction 𝑝∗.

Framework: Closed-Loop Control Between Local and Central Controllers

Given the PPC scheme described in Section 5.4, we can now formally present our online control framework for a remote central controller and a local controller (defined in Section 5.2).

Results

Furthermore, to emphasize structures that are not causally invariant, we return to the construction of the constraints used in the proof of the lower bound in Theorem 5.5.1. We are now ready to present our main result, which bounds the dynamic regret as a decreasing function of the forecast window size under the assumption that the security sets are causally invariant. The sequence of actions u≤𝑡 that maximizes the size of the set of admissible actions is a vector of all zeros.

Figure 5.A.1: Graphical illustration of (5.10) in Definition 5.5.1.

Learning, Inference, and Data Analysis in Smart Grids

LEARNING POWER SYSTEM PARAMETERS FROM LINEAR MEASUREMENTS

  • Introduction
  • Model and Definitions
  • Fundamental Trade-offs
  • Gaussian IID Measurements

First, the analysis of such an optimization often requires stronger assumptions, e.g., that the nonzeros are not concentrated in any single column (or row) of Y(𝐺), as in [161]. Therefore, for the particular examples presented in Lemma 22 and Lemma 23, the chosen threshold and counting parameters are both "tight" (up to multiplicative factors), in the sense that the corresponding sample complexity matches (up to multiplicative factors) the lower bounds derived from Theorem 6.3.1. Considering the original linear system in (6.1), the equations above imply that A + Ã = (B + B̃)Y + Z = BY + B̃Y + Z, which leads to Ã = B̃Y + Z, where we have made use of (6.14) and extracted the perturbation matrices Ã and B̃.
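
As a point of reference only, here is a naive baseline for recovering a sparse Y from measurements of the form A = BY + Z, using column-wise least squares plus entrywise thresholding; it is not the three-stage scheme of Algorithm 10, and the dimensions, noise level, and threshold are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 40                                     # graph size and number of measurements

Y_true = np.zeros((n, n))
support = rng.choice(n * n, size=10, replace=False)
Y_true.flat[support] = rng.uniform(0.5, 1.5, size=10) * rng.choice([-1.0, 1.0], size=10)

B = rng.normal(size=(m, n))                      # IID Gaussian measurement matrix
Z = 0.01 * rng.normal(size=(m, n))               # measurement noise
A = B @ Y_true + Z                               # linear measurements A = B Y + Z

Y_ls, *_ = np.linalg.lstsq(B, A, rcond=None)     # column-wise least squares
tau = 0.05                                       # threshold parameter (assumed)
Y_hat = np.where(np.abs(Y_ls) > tau, Y_ls, 0.0)  # keep only significant entries

print("support recovered:", np.array_equal(Y_hat != 0, Y_true != 0))
print("relative error:", float(np.linalg.norm(Y_hat - Y_true) / np.linalg.norm(Y_true)))
```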

Figure 6.3.1: The recovery of a graph matrix Y using the three-stage scheme in Algorithm 10
