In the above, constraint (4.4c) ensures that the aggregator's charging decisions track the signal $u_t$ at each time $t \in [T]$, constraint (4.4d) guarantees that EV $i$'s energy demand is satisfied, and the remaining constraints say that the aggregator cannot charge an EV before its arrival, after its departure, or at a rate that exceeds its limit. Inequalities (4.4a)–(4.4e) above are examples of the constraints in (4.1). Together, for this EV charging application, (4.3a)–(4.3b) and (4.4a)–(4.4e) exemplify the dynamic system in (3.3).
The system operator's objective is to minimize the cumulative cost $C_T(u) := \sum_{t=1}^{T} c_t u_t$, where $u = (u_1, \ldots, u_T)$ are the substation power levels, as outlined in Section 4.3. The cost $c_t$ depends on multiple factors, such as electricity prices and injections from installed rooftop solar panels. Overall, the EV charging problem is formulated below as a specific instance of the generic optimization (4.2a)–(4.2d):
$$\min_{u_1, \ldots, u_T} \; \sum_{t=1}^{T} c_t u_t \qquad (4.5a)$$
$$\text{subject to (4.3a)–(4.3b) and (4.4a)–(4.4e).} \qquad (4.5b)$$
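To make the structure of (4.5) concrete, the following is a minimal sketch of the offline problem for a single EV, written as a linear program. All instance data (prices, arrival and departure slots, energy demand, rate limit) are illustrative placeholders rather than values from the text, and with one EV the substation power coincides with that EV's charging power.

```python
# A minimal sketch of the offline EV-charging problem (4.5) for a single EV.
# Instance data are illustrative placeholders; the full problem couples many
# EVs to the substation signal through constraints (4.3a)-(4.3b), (4.4a)-(4.4e).
import numpy as np
from scipy.optimize import linprog

T = 5
c = np.array([3.0, 1.0, 2.0, 5.0, 4.0])  # per-slot costs c_t (placeholders)
arrival, departure = 1, 3                # EV is present in slots 1..3 (0-indexed)
demand = 2.0                             # total energy demand (kWh)
rate = 1.0                               # per-slot rate limit (kW)

# No charging before arrival or after departure, and never above the rate limit.
bounds = [(0.0, rate) if arrival <= t <= departure else (0.0, 0.0)
          for t in range(T)]

# The EV's energy demand must be met exactly (the analogue of (4.4d)).
res = linprog(c, A_eq=np.ones((1, T)), b_eq=np.array([demand]), bounds=bounds)
print(res.x, res.fun)  # optimal profile [0, 1, 1, 0, 0] with cost 3.0
```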
4.4 Definitions of Real-Time Aggregate Flexibility: Maximum Entropy Feedback
In the following, we present a design of the flexibility feedback $p_t$, first proposed in our previous work [24] for discrete $\mathcal{U}$ and [25] for continuous $\mathcal{U}$. It quantifies the future flexibility that will be enabled by an operator action $u_t$. The feedback $p_t$ is therefore a surrogate for the aggregator constraints (4.2b) that guides the operator's decision. Let $u := (u_1, \ldots, u_T)$. Specifically, define the set of all feasible action trajectories for the aggregator as:
$$\mathcal{S} := \left\{ u \in \mathcal{U}^T : u \text{ satisfies (4.2b)–(4.2d)} \right\}.$$
The following property of the set $\mathcal{S}$ is useful; its proof can be found in Appendix 4.A.
Lemma 16. The set of feasible action trajectories $\mathcal{S}$ is Borel measurable.
Existing definitions of aggregate flexibility focus on approximating $\mathcal{S}$, e.g., by finding a convex approximation (see Section 4.2 for more details). Our problem formulation needs a real-time approximation of this set $\mathcal{S}$, i.e., a decomposition of $\mathcal{S}$ along the time axis $t = 1, \ldots, T$. Throughout, we assume that $\mathcal{S}$ is non-empty. Next, we define the space of flexibility feedback $p_t$. Formally, we let $\mathcal{P}$ denote a set of density functions $p_t(\cdot): \mathcal{U} \to [0,1]$ that map an action to a value in $[0,1]$ and satisfy
$$\int_{u \in \mathcal{U}} p(u) \, \mathrm{d}u = 1.$$
Fix $x_t$ at time $t \in [T]$. The aggregator function $p_t(\cdot): \mathcal{X} \to \mathcal{P}$ at each time $t$ outputs

$$p_t(x_t) = p_t(\cdot \mid u_{<t}) \qquad (4.6)$$

such that $p_t(\cdot \mid u_{<t}): \mathcal{U} \to [0,1]$ is a conditional density function in $\mathcal{P}$. We refer to $p_t$ as the flexibility feedback sent at time $t \in [T]$ from the aggregator to the system operator. In this sense, (4.6) does not specify one specific aggregator function $p_t$ but a class of possible functions $p_t$. Every function in this class is causal in that it depends only on information available to the aggregator at time $t$. In contrast to most aggregate flexibility notions in the literature [32, 33, 35, 36, 76–79], the flexibility feedback here is designed specifically for an online feedback control setting.
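Read programmatically, (4.6) says the aggregator implements a causal map from its information at time $t$ to a density over $\mathcal{U}$. The following is a minimal typed sketch of this interface; the names are hypothetical and a finite action set stands in for $\mathcal{U}$.

```python
# A minimal sketch of the function class in (4.6): any causal map from the
# aggregator's information at time t (here, just the history u_{<t}) to a
# conditional density over a finite action set U. All names are illustrative.
from typing import Callable, Dict, Tuple

Action = int
History = Tuple[Action, ...]      # the operator's past signals u_{<t}
Density = Dict[Action, float]     # p_t(.|u_{<t}); values sum to one
AggregatorFunction = Callable[[History], Density]

def uniform_feedback(u_prev: History, U=(0, 1)) -> Density:
    """A trivial member of the class: ignore the history and report the
    uniform density on U. The MEF of Definition 4.4.1 below is a particular,
    generally non-uniform, member of the same class."""
    return {u: 1.0 / len(U) for u in U}
```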
Maximum Entropy Feedback
The intuition behind our proposal is to use the conditional probability $p_t(u_t \mid u_{<t})$ to measure the future flexibility of the aggregator that results if the system operator chooses $u_t$ as the signal at time $t$, given the action trajectory up to time $t-1$. The sum of the conditional entropies of the $p_t$ is thus a measure of how informative the overall feedback is. This suggests choosing conditional distributions $p_t$ that maximize the conditional entropy. Consider the optimization problem:
$$\rho := \max_{p_1, \ldots, p_T} \; \sum_{t=1}^{T} H(p_t \mid U_{<t}) \quad \text{subject to} \quad U \in \mathcal{S} \qquad (4.7a)$$

where the variables are conditional density functions:

$$p_t := p_t(\cdot \mid \cdot) := \mathbb{P}_{U_t \mid U_{<t}}(\cdot \mid \cdot), \quad t \in [T], \qquad (4.7b)$$

$U \in \mathcal{U}^T$ is a random variable distributed according to the joint distribution $\prod_{t=1}^{T} p_t$, and $H(p_t \mid U_{<t})$ is the differential conditional entropy of $p_t$, defined as:

$$H(p_t \mid U_{<t}) := -\int_{u_{\le t} \in \mathcal{U}^t} \left( \prod_{\ell=1}^{t} p_\ell(u_\ell \mid u_{<\ell}) \right) \log p_t(u_t \mid u_{<t}) \, \mathrm{d}u_{\le t}. \qquad (4.7c)$$

By definition, a quantity conditioned on "$u_{<1}$" means an unconditional quantity, so in the above, $H(p_1 \mid U_{<1}) := H(p_1) := H(U_1)$.
The chain rule shows that $\sum_{t=1}^{T} H(p_t \mid U_{<t}) = H(U)$. Hence (4.7) can be interpreted as maximizing the entropy $H(U)$ of a random trajectory $U$ sampled according to the joint distribution $\prod_{t=1}^{T} p_t$, conditioned on $U$ satisfying $U \in \mathcal{S}$, where the maximization is over the collection of conditional distributions $(p_1, \ldots, p_T)$.
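To see the identity concretely, take $T = 2$: expanding the joint density and splitting the logarithm gives

$$H(U) = -\int_{\mathcal{U}^2} p_1(u_1)\, p_2(u_2 \mid u_1) \log\big( p_1(u_1)\, p_2(u_2 \mid u_1) \big) \, \mathrm{d}u_{\le 2} = H(p_1) + H(p_2 \mid U_{<2}),$$

and iterating the same expansion telescopes to $\sum_{t=1}^{T} H(p_t \mid U_{<t}) = H(U)$ for general $T$.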
Definition 4.4.1 (Maximum entropy feedback). The flexibility feedback $p^*_t = p^*_t(u_{<t})$ for $t \in [T]$ is called the maximum entropy feedback (MEF) if $(p^*_1, \ldots, p^*_T)$ is the unique optimal solution of (4.7).
Remark 3. Even though the optimization problem (4.7) involves variables $p_t$ for the entire time horizon $[T]$, the individual variables $p_t$ in (4.7c) are conditional probabilities that depend only on information available to the aggregator at time $t$. Therefore the maximum entropy feedback $p^*_t$ in Definition 4.4.1 is indeed causal and belongs to the class of functions $p_t$ defined in (4.6). The existence of $p^*_t$ is guaranteed by Lemma 17 below, which also implies that $p^*_t$ is unique.
We demonstrate Definition 4.4.1 using a toy example.
Example 3 (Maximum entropy feedback $p^*$). Consider the following instance of the EV charging example in Section 4.3. Suppose the number of charging time slots is $T = 3$ and there is one customer, whose private vector is $(1, 3, 1, 1)$ and whose possible energy levels are $0$ (kWh) and $1$ (kWh), i.e., $\mathcal{U} \equiv \{0, 1\}$. Since there is only one EV, the scheduling algorithm (disaggregation policy) assigns all power to this single EV. For this particular choice of private vector and scheduling algorithm, the set of feasible trajectories is $\mathcal{S} = \{(0,0,1), (0,1,0), (1,0,0)\}$, shown in Figure 4.4.1 with the corresponding optimal conditional distributions given by (4.7).
Figure 4.4.1: Feasible trajectories of power signals and the computed maximum entropy feedback in Example 3.
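The conditional distributions in Figure 4.4.1 can be reproduced by direct enumeration. The brief sketch below counts the feasible continuations of each prefix, which is the counting-measure analogue of the ratio formula (4.8) established in Lemma 17 below, and checks that the induced trajectory distribution is uniform on $\mathcal{S}$ with total entropy $\log |\mathcal{S}| = \log 3$.

```python
# Enumerate the MEF for Example 3, where S = {(0,0,1), (0,1,0), (1,0,0)}.
# With a finite action set, p*_t(u | u_<t) = |S(u_<t, u)| / |S(u_<t)|.
import math

S = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}
U, T = (0, 1), 3

def count(prefix):
    """Number of feasible trajectories in S that extend the given prefix."""
    return sum(1 for s in S if s[:len(prefix)] == prefix)

def mef(prefix, u):
    """p*_t(u | u_<t) for t = len(prefix) + 1; zero on infeasible prefixes."""
    denom = count(prefix)
    return count(prefix + (u,)) / denom if denom > 0 else 0.0

print(mef((), 0), mef((), 1))      # p*_1: 2/3 and 1/3
print(mef((0,), 0), mef((0,), 1))  # p*_2(.|u_1=0): 1/2 and 1/2
print(mef((1,), 0))                # p*_2(0|u_1=1) = 1

# The induced joint distribution is uniform on S, so the summed conditional
# entropy attains the system capacity log|S| = log 3 (cf. Lemma 17).
joint = {s: math.prod(mef(s[:t], s[t]) for t in range(T)) for s in S}
assert all(abs(pr - 1 / len(S)) < 1e-12 for pr in joint.values())
print(math.log(len(S)))
```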
Properties of $p^*_t$
We now show that the proposed maximum entropy feedback $p^*_t$ has several desirable properties. We start by computing $p^*_t$ explicitly. Given any action trajectory $u_{\le t}$, define the set of subsequent feasible trajectories as:

$$\mathcal{S}(u_{\le t}) := \left\{ v_{>t} \in \mathcal{U}^{T-t} : v \text{ satisfies (4.2b)–(4.2d)}, \; v_{\le t} = u_{\le t} \right\}.$$
Consequently, the size $|\mathcal{S}(u_{\le t})|$ of the set of subsequent feasible trajectories is a measure of the future flexibility, conditioned on $u_{\le t}$. Our first result justifies calling $p^*_t$ the optimal flexibility feedback: $p^*_t$ is a measure of the future flexibility that will be enabled by the operator's action $u_t$, and it attains a measure of the system's capacity for flexibility. By definition, $p^*_1(u_1 \mid u_{<1}) := p^*_1(u_1)$.
Lemma 17. Let $\mu(\cdot)$ denote the Lebesgue measure. The MEF, as the optimal solution of the maximization in (4.7a)–(4.7c), is given by

$$p^*_t(u \mid u_{<t}) \equiv \frac{\mu(\mathcal{S}((u_{<t}, u)))}{\mu(\mathcal{S}(u_{<t}))}, \quad \forall (u_{<t}, u) \in \mathcal{U}^t. \qquad (4.8)$$

Moreover, the optimal value of (4.7a)–(4.7c) is equal to $\log \mu(\mathcal{S})$.
Remark 4. When the denominator $\mu(\mathcal{S}(u_{<t}))$ is zero, the numerator $\mu(\mathcal{S}((u_{<t}, u)))$ must also be zero. In this case, we set $p^*_t(u \mid u_{<t}) = 0$, and this does not affect the optimality of (4.7).
The proof can be found in Appendix 4.B. The volume $\mu(\mathcal{S})$ is a measure of the flexibility inherent in the aggregator. We hence call $\log \mu(\mathcal{S})$ the system capacity. Lemma 17 then says that the optimal value of (4.7) is the system capacity, $\rho = \log \mu(\mathcal{S})$. Moreover, the maximum entropy feedback $(p^*_1, \ldots, p^*_T)$ is the unique collection of conditional distributions that attains the system capacity in (4.7). This is intuitive, since the entropy of a random trajectory $x$ in $\mathcal{S}$ is maximized by the uniform distribution $p^*$ in (4.12) induced by the conditional distributions $(p^*_1, \ldots, p^*_T)$. Lemma 17 implies the following important properties of the maximum entropy feedback.
Corollary 4.4.1 (Feasibility and flexibility). Let $p^*_t = p^*_t(\cdot \mid u_{<t})$ be the maximum entropy feedback at each time $t \in [T]$.

1. For any action trajectory $u = (u_1, \ldots, u_T)$, if $p^*_t(u_t \mid u_{<t}) > 0$ for all $t \in [T]$, then $u \in \mathcal{S}$.

2. For all $u_t, u'_t \in \mathcal{U}$ at each time $t \in [T]$, if $p^*_t(u_t \mid u_{<t}) \ge p^*_t(u'_t \mid u_{<t})$, then $\mu(\mathcal{S}((u_{<t}, u_t))) \ge \mu(\mathcal{S}((u_{<t}, u'_t)))$.
The proof is provided in Appendix 4.C. We elaborate on the implications of Corollary 4.4.1 for our online feedback-based solution approach.
Remark 5 (Feasibility and flexibility). Corollary 4.4.1 says that the proposed optimal flexibility feedback $p^*_t$ provides the right information for the system operator to choose its action $u_t$ at time $t$.

1. (Feasibility) Specifically, the first statement of the corollary says that if the operator always chooses an action $u_t$ with positive conditional probability $p^*_t(u_t \mid u_{<t}) > 0$ at each time $t$, then the resulting action trajectory is guaranteed to be feasible, $u \in \mathcal{S}$; i.e., the system remains feasible at every time $t \in [T]$ along the way.

2. (Flexibility) Moreover, according to the second statement of the corollary, if the system operator chooses an action $u_t$ with a larger $p^*_t(u_t \mid u_{<t})$ value at time $t$, then the system will be more flexible going forward than if it had chosen another signal $u'_t$ with a smaller $p^*_t(u'_t \mid u_{<t})$ value, in the sense that there are more feasible trajectories in $\mathcal{S}((u_{<t}, u_t))$ going forward.
As noted in Remark 2, despite characterizations that involve the whole action trajectory $u$, such as $u \in \mathcal{S}$, these are online properties. This guarantees the feasibility of the online closed-loop control system depicted in Figure 4.3.1 and confirms the suitability of $p^*_t$ for online applications.
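Both statements of Corollary 4.4.1 can be checked mechanically on the toy instance of Example 3: assembling a trajectory by always choosing an action with positive (here, maximal) MEF mass never leaves the feasible set. A brief self-contained sketch:

```python
# Closed-loop check of Corollary 4.4.1 / Remark 5 on the Example 3 instance.
S = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}
U, T = (0, 1), 3

def count(prefix):
    return sum(1 for s in S if s[:len(prefix)] == prefix)

def mef(prefix, u):
    denom = count(prefix)
    return count(prefix + (u,)) / denom if denom > 0 else 0.0

prefix = ()
for t in range(T):
    # Any action with mef > 0 preserves feasibility (statement 1); picking a
    # larger mef value leaves more feasible continuations (statement 2).
    prefix += (max((a for a in U if mef(prefix, a) > 0),
                   key=lambda a: mef(prefix, a)),)

print(prefix, prefix in S)  # (0, 0, 1) True: feasible at every step
```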
4.5 Approximating Maximum Entropy Feedback via Reinforcement Learning