4.6 Penalized Predictive Control
Figure 4.5.2: Average rewards (defined in (4.11)) in the training stage with a tuning parameter $\beta = 6 \times 10^{3}$. The shaded region measures the variance.
all feasible trajectories:
\[
\pi(u) :=
\begin{cases}
1/\mu(\mathcal{S}) & \text{if } u \in \mathcal{S}, \\
0 & \text{otherwise.}
\end{cases}
\tag{4.12}
\]
Using this observation, the constraints (4.2b)-(4.2d) in the offline optimization can be rewritten as a penalty term in the objective (4.2a). We present a useful lemma that both motivates our online control algorithm and underpins the optimality analysis later in Section 4.6.
Lemma 18. The offline optimization (4.2a)-(4.2d) is equivalent to the following unconstrained minimization for any $\beta > 0$:
\[
\inf_{u \in \mathcal{U}^{T}} \ \sum_{t=1}^{T} \Big( c_t(u_t) - \beta \log p^{*}_{t}(u_t \mid u_{<t}) \Big).
\]
The proof of Lemma 18 can be found in Appendix 4.E. The lemma draws a clear connection between the MEF and the offline optimum, which we exploit in the design of an online system operator in the next section.
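Since the proof is deferred to the appendix, the following small Python sketch illustrates the statement numerically on a toy finite-horizon problem: it builds the conditionals of the uniform distribution (4.12) by counting feasible completions and checks that, for several values of $\beta$, the unconstrained penalized minimizer coincides with the constrained offline optimum. The horizon, budget, and cost parameters are hypothetical choices of ours, and the finite action set is a simplification of the continuous setting used in this chapter.

```python
import itertools
import math

# Illustrative check of Lemma 18 on a tiny finite problem (the chapter works with a
# continuous action space; all names and numbers here are ours, for illustration only).
T = 4
U = [0, 1]                      # toy binary action set
B = 2                           # budget: at most B units over the horizon
weights = [1.0, 0.5, 2.0, 1.0]  # toy tracking costs c_t(u) = w_t * (u - target_t)^2
targets = [1, 1, 1, 0]

def cost(t, u):
    return weights[t] * (u - targets[t]) ** 2

def feasible(traj):
    """Trajectory-level constraint playing the role of (4.2b)-(4.2d)."""
    return sum(traj) <= B

def completions(prefix):
    """Number of feasible full trajectories extending a prefix."""
    rest = T - len(prefix)
    return sum(feasible(prefix + tail) for tail in itertools.product(U, repeat=rest))

def mef(u_t, prefix):
    """p*_t(u_t | u_<t): conditional of the uniform distribution (4.12) over feasible trajectories."""
    denom = completions(prefix)
    return completions(prefix + (u_t,)) / denom if denom else 0.0

def total_cost(traj):
    return sum(cost(t, u) for t, u in enumerate(traj))

def penalized(traj, beta):
    """Objective of Lemma 18: cost minus beta * log MEF (infinite when infeasible)."""
    val = 0.0
    for t, u in enumerate(traj):
        p = mef(u, traj[:t])
        if p == 0.0:
            return math.inf
        val += cost(t, u) - beta * math.log(p)
    return val

trajs = list(itertools.product(U, repeat=T))
offline_opt = min((tr for tr in trajs if feasible(tr)), key=total_cost)
for beta in (0.1, 1.0, 10.0):
    assert min(trajs, key=lambda tr: penalized(tr, beta)) == offline_opt
print("offline optimum:", offline_opt)   # (1, 0, 1, 0)
```

The check works because, for feasible trajectories, the log-conditionals telescope to a constant $-\log|\mathcal{S}|$, while infeasible trajectories incur an infinite penalty.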
Algorithm: Penalized Predictive Control via MEF
Our proposed design, termed penalized predictive control (PPC), combines model predictive control (MPC) (cf. [100]), a competitive policy for online optimization with predictions, with the idea of using the MEF as a penalty term. This design connects the MEF to the well-known MPC scheme. The MEF, as a feedback function, contains only limited information about the dynamical system on the local controller's side (specifically, the feasibility information for the current and future time slots, as explained in Section 4.4). The PPC scheme is therefore itself a novel contribution, since it shows that even if only feasibility information is available, this limited information can still be incorporated into MPC as a penalty term.
We present PPC in Algorithm 6, where we use the following notation. Let $\beta_t > 0$ be a tuning parameter in predictive control that trades off flexibility in the future against minimization of the current system cost at each time $t \in [T]$. The following corollary holds, whose proof is given in Appendix 4.C.
Corollary 4.6.1 (Feasibility of PPC). When $p_t = p^{*}_t$ for all $t \in [T]$, where $p^{*}_t$ is the MEF defined in Definition 4.4.1, the sequence of actions $u = (u_1, \ldots, u_T)$ generated by PPC in (4.13) always satisfies $u \in \mathcal{S}$, for any sequence of tuning parameters $(\beta_1, \ldots, \beta_T)$.
Algorithm 6: Penalized Predictive Control (PPC).
Data: sequentially arriving cost functions and the MEF
Result: actions $u = (u_1, \ldots, u_T)$
for $t = 1, \ldots, T$ do
    Choose an action $u_t$ by minimizing
    \[
    u_t = \psi_t(p_t) := \arg\inf_{u_t \in \mathcal{U}} \big( c_t(u_t) - \beta_t \log p_t(u_t \mid u_{<t}) \big); \tag{4.13}
    \]
end
Return $u$;
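To make the per-step computation concrete, the sketch below solves one instance of (4.13) for a scalar action, using a quadratic cost and a hypothetical log-MEF shaped by a remaining budget. The function names, the numbers, and the use of SciPy's bounded scalar minimizer are illustrative choices of ours rather than part of the chapter's algorithm.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def ppc_step(cost_fn, log_mef_fn, beta_t, u_bounds):
    """One step of penalized predictive control, cf. (4.13).

    cost_fn:     u -> c_t(u), the current (convex) cost
    log_mef_fn:  u -> log p_t(u | u_<t), log of the received MEF
    beta_t:      tuning parameter trading off cost against future flexibility
    u_bounds:    interval of admissible scalar actions
    """
    def objective(u):
        log_p = log_mef_fn(u)
        if not np.isfinite(log_p):
            return np.inf                  # infeasible actions (zero MEF) get an infinite penalty
        return cost_fn(u) - beta_t * log_p

    res = minimize_scalar(objective, bounds=u_bounds, method="bounded")
    return res.x

# Hypothetical single step: quadratic cost, and a log-MEF that shrinks as the action
# approaches the remaining budget (1.0 total, of which 0.4 is already used).
remaining = 1.0 - 0.4
cost_fn = lambda u: (u - 0.9) ** 2                              # prefers u close to 0.9
log_mef = lambda u: np.log(np.maximum(remaining - u, 1e-300))   # clamped to stay finite
u_t = ppc_step(cost_fn, log_mef, beta_t=0.05, u_bounds=(0.0, 1.0))
print(f"chosen action u_t = {u_t:.3f}")    # about 0.53, strictly inside the remaining budget of 0.6
```

The penalty pulls the action away from the myopically best value (0.9) so that some budget, and hence feasibility, is preserved for future steps.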
Framework: Closed-Loop Control between Local and Central Controllers
Given the PPC scheme described above, we can now formally present our online control framework for the distant central controller and the local controller (defined in Section 4.3). Recall that an overview of the closed-loop control framework is given in Algorithm 5, where $\psi$ denotes an operator function and $\phi$ an aggregator function. To the best of our knowledge, this chapter is the first to consider such a closed-loop control framework with limited information communicated in real time between two geographically separate controllers seeking to solve an online control problem. We present the framework below.
At each time $t \in [T]$, the local controller first efficiently generates an estimated MEF $p_t \in \mathcal{P}$ using an aggregator function $\phi_t$ trained by a reinforcement learning algorithm.
After receiving the current MEF $p_t$ and cost function $c_t$ (and the next $w$ MEFs and cost functions if predictions are available), the central controller uses the PPC scheme in Algorithm 6 to generate an action $u_t \in \mathcal{U}$ and sends it back to the local controller. The local controller then updates its state $x_t \in \mathcal{X}$ to a new state $x_{t+1}$ according to the system dynamics in (3.3), and the procedure repeats. In the next section, we use an EV charging example to verify the efficacy of the proposed method.
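To make the message flow concrete, the following Python skeleton sketches this closed loop under simplifying assumptions of ours: a scalar action, a toy integrator state, and a stand-in aggregator that returns a hand-coded log-MEF instead of one trained by reinforcement learning. Class and method names are hypothetical; only the division of labor (local controller sends the MEF, central controller runs the PPC step and returns an action) follows the framework above.

```python
import numpy as np

class LocalController:
    """Holds the private state and constraints, and aggregates them into an MEF."""
    def __init__(self, budget):
        self.x = 0.0                      # toy state: total energy delivered so far
        self.budget = budget

    def aggregate(self):
        """Stand-in for the learned aggregator phi_t: return a log-MEF callable.
        Here the MEF simply encodes how much of the budget remains."""
        remaining = self.budget - self.x
        return lambda u: np.log(np.maximum(remaining - u, 1e-300))

    def apply(self, u_t):
        """Toy integrator dynamics standing in for the system dynamics (3.3)."""
        self.x += u_t
        return self.x

class CentralController:
    """Runs the PPC step (4.13) on the received cost function and MEF."""
    def __init__(self, beta):
        self.beta = beta

    def act(self, cost_fn, log_mef):
        # Robust grid search over the admissible interval; this keeps the sketch
        # dependency-light and avoids assuming unimodality of the penalized objective.
        grid = np.linspace(0.0, 1.0, 2001)
        vals = [cost_fn(u) - self.beta * log_mef(u) for u in grid]
        return float(grid[int(np.argmin(vals))])

# One episode of the closed loop with hypothetical tracking costs.
local = LocalController(budget=1.5)
central = CentralController(beta=0.05)
for t, target in enumerate([0.9, 0.7, 0.8]):
    log_mef = local.aggregate()                               # local -> central: MEF
    u_t = central.act(lambda u: (u - target) ** 2, log_mef)   # central -> local: action
    x_next = local.apply(u_t)                                 # local state update
    print(f"t={t}: u_t={u_t:.3f}, state={x_next:.3f}")        # total stays within the budget
```

Note that only the log-MEF and the action cross the controller boundary; the budget and the state remain private to the local controller, mirroring the limited-information setting above.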
Optimality Analysis
To end our discussion of PPC, we focus on optimality. For ease of analysis, we assume that the action space $\mathcal{U}$ is the set of real numbers $\mathbb{R}$; however, as noted in Remark 1, our system and the definition of the MEF can also accommodate a discrete action space.
To understand the optimality of PPC, we impose standard regularity assumptions on the cost functions and the time-varying constraints. We assume the cost functions are strictly convex and differentiable, which is common in practice. Further, let $\mu(\cdot)$ denote the Lebesgue measure. Note that the set of subsequent feasible action trajectories $\mathcal{S}(u_{\le t})$ is Borel-measurable for all $t \in [T]$, as implied by the proof of Corollary 16. We also assume that the measure of the set of feasible actions $\mu(\mathcal{S}(u_{\le t}))$ is differentiable and strictly logarithmically concave with respect to the subsequence of actions $u_{\le t} = (u_1, \ldots, u_t)$ for all $t \in [T]$. This is also common in practice; for example, it holds for inventory constraints $\sum_{t=1}^{T} \|u_t\|^{2} \le B$ with a budget $B > 0$. Finally, recall the definition of the set of subsequent feasible action trajectories:
\[
\mathcal{S}(u_{\le t}) := \Big\{ v_{>t} \in \mathcal{U}^{T-t} : v \text{ satisfies (4.2b)-(4.2d)},\ v_{\le t} = u_{\le t} \Big\}.
\]
Putting the above together, we can state our assumption formally as follows.
Assumption 4. The cost functions $c_t(u) : \mathbb{R} \to \mathbb{R}_{+}$ are differentiable and strictly convex. The mappings $\mu(\mathcal{S}(u_{\le t})) : \mathbb{R}^{t} \to \mathbb{R}_{+}$ are differentiable and strictly logarithmically concave.
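As a quick illustration of the second part of Assumption 4 (ours, not part of the chapter's formal development), consider the budget example above with scalar actions, $\sum_{s=1}^{T} u_s^2 \le B$. For $t < T$, the set of feasible continuations is a Euclidean ball,
\[
\mathcal{S}(u_{\le t}) = \Big\{ v_{>t} \in \mathbb{R}^{T-t} : \sum_{s > t} v_s^2 \le B - \sum_{s \le t} u_s^2 \Big\},
\qquad
\mu\big(\mathcal{S}(u_{\le t})\big) = C_{T-t}\Big(B - \sum_{s \le t} u_s^2\Big)^{\frac{T-t}{2}},
\]
where $C_d$ denotes the volume of the $d$-dimensional unit ball. Hence
\[
\log \mu\big(\mathcal{S}(u_{\le t})\big) = \log C_{T-t} + \frac{T-t}{2} \log\Big(B - \sum_{s \le t} u_s^2\Big),
\]
which, on the interior $\{\sum_{s \le t} u_s^2 < B\}$, is differentiable and strictly concave: $u_{\le t} \mapsto B - \sum_{s \le t} u_s^2$ is strictly concave and positive, and $\log(\cdot)$ is strictly increasing and concave.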
Given this regularity of the cost functions and the time-varying constraints, we can prove the optimality of PPC.
Theorem 4.6.1 (Existence of optimal actions). Let $\mathcal{U} = \mathbb{R}$. Under Assumptions 3 and 4, there exists a sequence $\eta_1, \ldots, \eta_T$ such that implementing (4.13) with $\beta_t = \eta_t$ and $p_t = p^{*}_t$ at each time $t \in [T]$ ensures $u = (u^{*}_1, \ldots, u^{*}_T)$, i.e., the generated actions are optimal.
Crucially, Theorem 4.6.1 shows that there exists a sequence of "good" tuning parameters for which the PPC scheme generates optimal actions under reasonable assumptions. However, note that the assumption $\mathcal{U} = \mathbb{R}$ is fundamental. When the action space $\mathcal{U}$ is discrete or high-dimensional, it is in general impossible to generate the optimal actions because, for fixed $t$, the differential equations in the proof of Theorem 4.6.1 (see Appendix 4.F) do not have the same solution for all $\beta_t > 0$. Therefore, a detailed regret analysis is necessary in such cases, which is a challenging task left for future work.
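To give some intuition for where such tuning parameters come from, and for the differential equations referenced above, here is a heuristic first-order sketch (ours; the precise argument is the one in Appendix 4.F). With $p_t = p^{*}_t$ and $\mathcal{U} = \mathbb{R}$, the objective in (4.13) is strictly convex in $u_t$ under Assumption 4, since $u_t \mapsto -\log p^{*}_t(u_t \mid u_{<t})$ inherits convexity from the logarithmic concavity of $\mu(\mathcal{S}(u_{\le t}))$. Its minimizer is therefore characterized by the stationarity condition
\[
c_t'(u_t) \;=\; \beta_t \, \frac{\partial}{\partial u_t} \log p^{*}_t(u_t \mid u_{<t}).
\]
If the offline-optimal action $u^{*}_t$ satisfies this condition for some choice $\beta_t = \eta_t$, for instance
\[
\eta_t = c_t'(u^{*}_t) \Big/ \frac{\partial}{\partial u_t} \log p^{*}_t(u_t \mid u_{<t}) \Big|_{u_t = u^{*}_t}
\]
whenever the derivative in the denominator is nonzero, then PPC with that tuning parameter recovers $u^{*}_t$. This also suggests why a single scalar $\beta_t$ generally cannot satisfy several such equations simultaneously when $u_t$ is multi-dimensional, as discussed above.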