4.6 Penalized Predictive Control

Figure 4.5.2: Average rewards (defined in (4.11)) in the training stage with a tuning parameter β = 6 × 10³. The shaded region indicates the variance.

The joint distribution induced by the MEF is the uniform distribution over all feasible trajectories:

π‘ž(𝑒) :=





ο£²



ο£³

1/πœ‡(S) if𝑒 ∈S

0 otherwise

. (4.12)

Using this observation, the constraints (4.2b)-(4.2d) in the offline optimization can be rewritten as a penalty term in the objective (4.2a). We present a useful lemma that both motivates our online control algorithm and underpins the optimality analysis later in this section.

Lemma 18. The offline optimization (4.2a)-(4.2d) is equivalent to the following unconstrained minimization for any β > 0:

$$
\inf_{u \in \mathcal{U}^T} \; \sum_{t=1}^{T} \Big( c_t(u_t) - \beta \log p^*_t(u_t \mid u_{<t}) \Big).
$$

The proof of Lemma 18 can be found in Appendix 4.E. The lemma draws a clear connection between the MEF and the offline optimum, which we exploit next in the design of an online system operator.
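To make Lemma 18 concrete, the following is a minimal numerical sketch on a hypothetical two-step instance; the costs, the coupling constraint u_1 + u_2 ≥ 1, the grid resolution, and all identifiers are our own illustrative choices rather than part of the formulation above. Since the MEF consists of the conditionals of the uniform distribution q in (4.12), the accumulated penalty −β Σ_t log p*_t(u_t | u_<t) is constant on S and +∞ outside it, so the penalized and constrained problems share the same minimizer for any β > 0.

```python
import numpy as np

# Hypothetical two-step instance: U = [0, 1], feasible set S = {u : u1 + u2 >= 1},
# strictly convex costs c_t(u_t) = price_t * u_t**2.
prices = np.array([1.0, 3.0])

grid = np.linspace(1e-3, 1.0, 400)
U1, U2 = np.meshgrid(grid, grid, indexing="ij")
feasible = U1 + U2 >= 1.0
total_cost = prices[0] * U1**2 + prices[1] * U2**2

# MEF for this instance (conditionals of the uniform distribution on S):
#   p*_1(u1)      = 2 * u1              (proportional to the feasible slice length)
#   p*_2(u2 | u1) = 1 / u1 on [1 - u1, 1], and 0 otherwise.
log_p1 = np.log(2.0 * U1)
log_p2 = np.where(feasible, -np.log(U1), -np.inf)

# Constrained offline problem: minimize the total cost over S.
constrained = np.where(feasible, total_cost, np.inf)
i, j = np.unravel_index(np.argmin(constrained), constrained.shape)

# Unconstrained penalized problem from Lemma 18, for an arbitrary beta > 0.
beta = 0.7
penalized = total_cost - beta * (log_p1 + log_p2)
k, l = np.unravel_index(np.argmin(penalized), penalized.shape)

print("constrained argmin:", (grid[i], grid[j]))
print("penalized argmin:  ", (grid[k], grid[l]))   # identical up to grid resolution
```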

Algorithm: Penalized Predictive Control via MEF

Our proposed design, termed penalized predictive control (PPC), combines model predictive control (MPC) (cf. [100]), which is a competitive policy for online optimization with predictions, with the idea of using the MEF as a penalty term. This design connects the MEF with the well-known MPC scheme. The MEF, as a feedback function, contains only limited information about the dynamical system on the local controller's side; as explained in Section 4.4, it contains only the feasibility information of the current and future time slots. The PPC scheme is therefore itself a novel contribution, since it shows that, even if only feasibility information is available, it is still possible to incorporate this limited information into MPC as a penalty term.

We present PPC in Algorithm 6, where we use the following notation. Let β_t > 0 be a tuning parameter in predictive control that trades off flexibility in the future against minimization of the current system cost at each time t ∈ [T]. The next corollary follows; its proof is in Appendix 4.C.

Corollary 4.6.1 (Feasibility of PPC). When p_t = p*_t for all t ∈ [T], i.e., each p_t is the MEF defined in Definition 4.4.1, the sequence of actions u = (u_1, . . . , u_T) generated by PPC in (4.13) always satisfies u ∈ S, for any sequence of tuning parameters (β_1, . . . , β_T).
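As a hedged reading of why the corollary holds for every tuning sequence (our own remark; the formal proof is in Appendix 4.C): the MEF consists of the conditionals of the uniform distribution (4.12), so any action u_t that leaves no feasible completion receives zero conditional probability, and hence for every β_t > 0

$$
p_t^*(u_t \mid u_{<t}) = 0 \;\Longrightarrow\; c_t(u_t) - \beta_t \log p_t^*(u_t \mid u_{<t}) = +\infty,
$$

so the minimizer in (4.13) can only be an action that keeps the remaining problem feasible.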

Algorithm 6: Penalized Predictive Control (PPC).

Data: Sequentially arriving cost functions and MEF
Result: Actions u = (u_1, . . . , u_T)
for t = 1, . . . , T do
    Choose an action u_t by minimizing
        u_t = φ_t(p_t) := arg inf_{u_t ∈ U} ( c_t(u_t) − β_t log p_t(u_t | u_<t) )    (4.13)
end
Return u;
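Once the conditional MEF p_t(· | u_<t) is available as a density that the central controller can evaluate, the per-step minimization (4.13) is an ordinary unconstrained problem. The sketch below is a minimal illustration for a scalar action space, assuming such a density is provided as a callable; the function names, the bounded scalar solver, and the toy cost/MEF in the usage line are our own assumptions, not part of Algorithm 6.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def ppc_step(cost_t, mef_t, beta_t, u_bounds=(0.0, 1.0)):
    """One step of penalized predictive control, cf. Eq. (4.13).

    cost_t : callable u -> c_t(u), the cost function revealed at time t.
    mef_t  : callable u -> p_t(u | u_<t), the conditional MEF received from
             the local controller (already conditioned on past actions).
    beta_t : tuning parameter trading off current cost against future flexibility.
    """
    def penalized(u):
        p = mef_t(u)
        if p <= 0.0:                    # infeasible action: infinite penalty
            return np.inf
        return cost_t(u) - beta_t * np.log(p)

    res = minimize_scalar(penalized, bounds=u_bounds, method="bounded")
    return res.x

# Hypothetical usage: a quadratic cost and an MEF that rewards keeping flexibility.
u_t = ppc_step(cost_t=lambda u: 2.0 * u**2,
               mef_t=lambda u: 2.0 * u,     # stands in for p_t(u | u_<t)
               beta_t=0.5)
print(f"chosen action u_t = {u_t:.3f}")
```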

Framework: Closed-Loop Control between Local and Central Controllers

Given the PPC scheme described above, we can now formally present our online control framework for the distant central controller and the local controller (defined in Section 4.3). Recall that an overview of the closed-loop control framework has been given in Algorithm 5, where φ denotes an operator function and ψ is an aggregator function. To the best of our knowledge, this chapter is the first to consider such a closed-loop control framework with limited information communicated in real time between two geographically separate controllers seeking to solve an online control problem. We present the framework below.

At each time t ∈ [T], the local controller first efficiently generates an estimated MEF p_t ∈ P using an aggregator function ψ_t trained by a reinforcement learning algorithm.

After receiving the current MEF p_t and cost function c_t (and the future w MEFs and costs if predictions are available), the central controller uses the PPC scheme in Algorithm 6 to generate an action u_t ∈ U and sends it back to the local controller. The local controller then updates its state x_t ∈ X to a new state x_{t+1} based on the system dynamics in (3.3) and repeats this procedure. In the next section, we use an EV charging example to verify the efficacy of the proposed method.
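To make the message flow concrete, here is a self-contained toy sketch of the closed loop under simplifying assumptions: the learned aggregator ψ_t is replaced by a hand-crafted conditional density over the still-feasible actions, the dynamics are a scalar budget update, and all class and variable names are ours, used only for illustration.

```python
import numpy as np

class LocalController:
    """Holds the private state and constraints; produces the MEF p_t."""
    def __init__(self, budget=1.0):
        self.x = budget                          # toy state: remaining budget

    def aggregate(self):
        # Stand-in for the learned aggregator psi_t: a uniform conditional
        # density over the actions that keep the remaining problem feasible.
        remaining = max(self.x, 1e-9)
        return lambda u: 1.0 / remaining if 0.0 <= u <= remaining else 0.0

    def apply(self, u_t):
        self.x -= u_t                            # toy dynamics: x_{t+1} = x_t - u_t

class CentralController:
    """Runs the PPC step (4.13) on the received MEF and cost function."""
    def step(self, cost_t, p_t, beta_t, u_grid=np.linspace(0.0, 1.0, 1001)):
        # A grid search keeps the sketch robust to the hard zero of p_t
        # outside the feasible slice (infinite penalty there).
        values = [cost_t(u) - beta_t * np.log(p_t(u)) if p_t(u) > 0 else np.inf
                  for u in u_grid]
        return float(u_grid[int(np.argmin(values))])

local, central = LocalController(budget=1.0), CentralController()
costs = [lambda u, a=a: a * (u - 0.4) ** 2 for a in (1.0, 2.0, 3.0)]  # toy costs

for t, cost_t in enumerate(costs, start=1):
    p_t = local.aggregate()                      # local -> central: MEF p_t
    u_t = central.step(cost_t, p_t, beta_t=0.1)  # central solves (4.13)
    local.apply(u_t)                             # central -> local: action u_t
    print(f"t={t}: u_t={u_t:.3f}, remaining budget={local.x:.3f}")
```

Note that the central controller never sees the budget directly; infeasible actions are excluded purely because the received density vanishes there, which is exactly the mechanism behind Corollary 4.6.1.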

Optimality Analysis

To end our discussion of PPC, we focus on optimality. For ease of analysis, we assume that the action space U is the set of real numbers ℝ; however, as noted in Remark 1, our system and the definition of MEF can also be made consistent with a discrete action space.

To understand the optimality of PPC, we focus on standard regularity assumptions for the cost functions and the time-varying constraints. We assume the cost functions are strictly convex and differentiable, which is common in practice. Further, let μ(·) denote the Lebesgue measure. Note that the set of subsequent feasible action trajectories S(u_{≤t}) is Borel-measurable for all t ∈ [T], as implied by the proof of Corollary 16. We also assume that the measure of the set of feasible actions μ(S(u_{≤t})) is differentiable and strictly logarithmically concave with respect to the subsequence of actions u_{≤t} = (u_1, . . . , u_t) for all t ∈ [T]. This is also common in practice; e.g., it holds in the case of inventory constraints Σ_{t=1}^{T} ‖u_t‖² ≤ B with a budget B > 0 (a short worked computation follows Assumption 4). Finally, recall the definition of the set of subsequent feasible action trajectories:

$$
\mathcal{S}(u_{\le t}) := \Big\{\, v_{>t} \in \mathcal{U}^{T-t} \;:\; v \text{ satisfies (4.2b)--(4.2d)},\; v_{\le t} = u_{\le t} \,\Big\}.
$$

Putting the above together, we can state our assumption formally as follows.

Assumption 4. The cost functions c_t(u) : ℝ → ℝ₊ are differentiable and strictly convex. The mappings μ(S(u_{≤t})) : ℝ^t → ℝ₊ are differentiable and strictly logarithmically concave.
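As a quick sanity check of Assumption 4 for the inventory constraint mentioned above (our own worked computation, written for scalar actions), the set S(u_{≤t}) is a ball in ℝ^{T−t} whose squared radius is the remaining budget, so

$$
\mu\big(\mathcal{S}(u_{\le t})\big) = V_{T-t}\,\Big(B - \sum_{s=1}^{t} u_s^2\Big)^{\frac{T-t}{2}},
\qquad
\log \mu\big(\mathcal{S}(u_{\le t})\big) = \log V_{T-t} + \frac{T-t}{2}\,\log\Big(B - \sum_{s=1}^{t} u_s^2\Big),
$$

where V_{T−t} is the volume of the unit ball in ℝ^{T−t}. On the region where the remaining budget is positive, the right-hand side composes the strictly concave map u_{≤t} ↦ B − Σ_s u_s² with the increasing concave logarithm and is therefore strictly concave, as required.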

Given this regularity of the cost functions and time-varying constraints, we can prove the optimality of PPC.

Theorem 4.6.1 (Existence of optimal actions). Let U = ℝ. Under Assumptions 3 and 4, there exists a sequence b_1, . . . , b_T such that implementing (4.13) with β_t = b_t and p_t = p*_t at each time t ∈ [T] ensures u = (u*_1, . . . , u*_T), i.e., the generated actions are optimal.

Crucially, Theorem 4.6.1 shows that there exists a sequence of "good" tuning parameters such that the PPC scheme generates the optimal actions under reasonable assumptions. However, note that the assumption U = ℝ is fundamental.

When the action space U is discrete, or U is a high-dimensional space, it is in general impossible to generate the optimal actions because, for a fixed t, the differential equations in the proof of Theorem 4.6.1 (see Appendix 4.F) do not have the same solution for all β_t > 0. A detailed regret analysis is therefore necessary in such cases, which is a challenging task left for future work.
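A heuristic way to read this limitation (our own sketch, assuming an interior offline optimum and a nonzero derivative of the log-MEF; the complete argument is in Appendix 4.F): for U = ℝ, the first-order condition of the per-step problem (4.13) at the offline optimal action u*_t reads

$$
c_t'(u_t^*) \;=\; \beta_t \,\frac{\partial}{\partial u_t}\,\log p_t^*\big(u_t \mid u_{<t}^*\big)\Big|_{u_t = u_t^*},
$$

a single scalar equation that a suitable scalar β_t = b_t can satisfy, and under Assumption 4 the per-step objective is strictly convex, so the resulting stationary point is the global minimizer. When u_t is vector-valued or U is discrete, the analogous condition becomes a system of equations, or no stationarity condition exists at all, and one scalar β_t generically cannot satisfy it; this is why only a regret-type guarantee can be hoped for in those settings.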
