
In the above, constraint (4.4c) ensures that the aggregator decision $s_t$ tracks the signal $u_t$ at each time $t \in [T]$, constraint (4.4d) guarantees that EV $j$'s energy demand is satisfied, and the remaining constraints say that the aggregator cannot charge an EV before its arrival, after its departure, or at a rate that exceeds its limit. Inequalities (4.4a)-(4.4e) above are examples of the constraints in (4.1). Together, for this EV charging application, (4.3a)-(4.3b) and (4.4a)-(4.4e) exemplify the dynamic system in (3.3).

The system operator's objective is to minimize the cumulative cost $C_T(u) := \sum_{t=1}^{T} c_t u_t$, where $u = (u_1, \ldots, u_T)$ are the substation power levels, as outlined in Section 4.3. The cost $c_t$ depends on multiple factors such as electricity prices and injections from installed rooftop solar panels. Overall, the EV charging problem is formulated below as a specific example of the generic optimization (4.2a)-(4.2d):

$$\min_{u_1, \ldots, u_T} \; \sum_{t=1}^{T} c_t u_t \tag{4.5a}$$

$$\text{subject to (4.3a)-(4.3b) and (4.4a)-(4.4e).} \tag{4.5b}$$
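To make the structure of (4.5) concrete, the following is a minimal sketch of the offline problem using cvxpy. All data (horizon, costs, arrival and departure windows, rate limits, energy demands) are hypothetical placeholders, and the constraints are written only in the simplified form described above (tracking, demand satisfaction, availability windows, rate limits), not the exact form of (4.3a)-(4.4e).

```python
import cvxpy as cp
import numpy as np

# Hypothetical toy data: 4 time slots, 2 EVs.
T, J = 4, 2
c = np.array([0.3, 0.2, 0.5, 0.4])    # per-slot energy costs c_t
arrival = [0, 1]                       # EV j is available from arrival[j] ...
departure = [2, 3]                     # ... through departure[j] (inclusive)
demand = [2.0, 1.5]                    # energy demand of each EV
rate_max = [1.5, 1.0]                  # per-slot charging rate limit of each EV

u = cp.Variable(T)                     # substation power levels u_t
s = cp.Variable((T, J), nonneg=True)   # charging rate of EV j at time t

constraints = [u == cp.sum(s, axis=1)]              # aggregate charging tracks the signal
for j in range(J):
    constraints += [cp.sum(s[:, j]) == demand[j],   # energy demand is met
                    s[:, j] <= rate_max[j]]         # per-slot rate limit
    constraints += [s[t, j] == 0 for t in range(T)
                    if t < arrival[j] or t > departure[j]]  # outside availability window

prob = cp.Problem(cp.Minimize(c @ u), constraints)
prob.solve()
print("optimal cost:", prob.value)
print("substation power levels:", u.value)
```

Solving (4.5) offline like this requires the aggregator's private information for the whole horizon; the flexibility feedback defined next is meant to replace that information in the online setting.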

4.4 Definitions of Real-Time Aggregate Flexibility: Maximum Entropy Feedback

In the following, we present a design of the flexibility feedback $p_t$, which was first proposed in our previous work [24] for discrete $\mathcal{U}$ and [25] for continuous $\mathcal{U}$. It quantifies the future flexibility that will be enabled by an operator action $u_t$. The feedback $p_t$ therefore serves as a surrogate for the aggregator constraints (4.2b) to guide the operator's decisions. Let $u := (u_1, \ldots, u_T)$. Specifically, define the set of all feasible action trajectories for the aggregator as:

$$\mathcal{S} := \left\{\, u \in \mathcal{U}^T : u \text{ satisfies (4.2b)-(4.2d)} \,\right\}.$$

The following property of the set $\mathcal{S}$ is useful; its proof can be found in Appendix 4.A.

Lemma 16. The set of feasible action trajectories $\mathcal{S}$ is Borel measurable.

Existing aggregate flexibility definitions focus on approximating $\mathcal{S}$, for example by finding a convex approximation (see Section 4.2 for more details). Our problem formulation instead needs a real-time approximation of this set $\mathcal{S}$, i.e., a decomposition of $\mathcal{S}$ along the time axis $t = 1, \ldots, T$. Throughout, we assume that $\mathcal{S}$ is non-empty. Next, we define the space of flexibility feedback $p_t$. Formally, we let $\mathcal{P}$ denote the set of density functions $p(\cdot) : \mathcal{U} \to [0,1]$ that map an action to a value in $[0,1]$ and satisfy

$$\int_{u \in \mathcal{U}} p(u) \, \mathrm{d}u = 1.$$

Fixπ‘₯𝑑 at time𝑑 ∈ [𝑇]. The aggregator functionπœ“π‘‘(Β·):Xβ†’Pat each time𝑑 outputs:

πœ“π‘‘(π‘₯𝑑) = 𝑝𝑑(Β·|𝑒<𝑑) (4.6)

such that $p_t(\cdot \mid u_{<t}) : \mathcal{U} \to [0,1]$ is a conditional density function in $\mathcal{P}$. We refer to $p_t$ as the flexibility feedback sent at time $t \in [T]$ from the aggregator to the system operator. In this sense, (4.6) does not specify a specific aggregator function $\psi_t$, but a class of possible functions $\psi_t$. Every function in this collection is causal in that it depends only on information available to the aggregator at time $t$. In contrast to most aggregate flexibility notions in the literature [32, 33, 35, 36, 76–79], the flexibility feedback here is specifically designed for an online feedback control setting.
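For a finite action set $\mathcal{U}$, an element of $\mathcal{P}$ can be represented as a normalized weighting over $\mathcal{U}$, and an aggregator function $\psi_t$ as a map from whatever the aggregator has observed so far to such a weighting. The sketch below only illustrates this interface; the uniform weighting it returns is a placeholder, not the maximum entropy feedback defined next.

```python
from typing import Dict, Sequence

Action = float
Density = Dict[Action, float]  # discrete analogue of an element of P: values sum to 1


def normalize(weights: Dict[Action, float]) -> Density:
    """Scale nonnegative weights so they sum to one (a valid discrete density)."""
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def psi_t(u_history: Sequence[Action], action_set: Sequence[Action]) -> Density:
    """A causal aggregator function: it may use only information available at time t,
    here represented by the past signals u_{<t}. The uniform output is a placeholder."""
    return normalize({a: 1.0 for a in action_set})
```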

Maximum Entropy Feedback

The intuition behind our proposal is to use the conditional probability $p_t(u_t \mid u_{<t})$ to measure the resulting future flexibility of the aggregator if the system operator chooses $u_t$ as the signal at time $t$, given the action trajectory up to time $t-1$. The sum of the conditional entropies of the $p_t$ is thus a measure of how informative the overall feedback is. This suggests choosing conditional distributions $p_t$ that maximize this conditional entropy. Consider the optimization problem:

$$\digamma := \max_{p_1, \ldots, p_T} \; \sum_{t=1}^{T} \mathrm{H}(U_t \mid U_{<t}) \quad \text{subject to } U \in \mathcal{S}, \tag{4.7a}$$

where the variables are conditional density functions

$$p_t := p_t(\cdot \mid \cdot) := \mathbb{P}_{U_t \mid U_{<t}}(\cdot \mid \cdot), \quad t \in [T], \tag{4.7b}$$

$U \in \mathcal{U}^T$ is a random variable distributed according to the joint distribution $\prod_{t=1}^{T} p_t$, and $\mathrm{H}(U_t \mid U_{<t})$ is the differential conditional entropy of $p_t$, defined as:

H(π‘ˆπ‘‘|π‘ˆ<𝑑) :=

∫

π‘’β‰€π‘‘βˆˆU𝑑

βˆ’

𝑑

Γ–

β„“=1

𝑝ℓ(𝑒ℓ|𝑒<β„“)

log𝑝𝑑(𝑒𝑑|𝑒<𝑑)d𝑒≀𝑑. (4.7c) By definition, a quantity conditioned on β€œπ‘’<1” means an unconditional quantity, so in the above,H(π‘ˆ1|π‘ˆ<1) :=H(π‘ˆ1) :=H(𝑝1).

The chain rule shows that $\sum_{t=1}^{T} \mathrm{H}(U_t \mid U_{<t}) = \mathrm{H}(U)$. Hence (4.7) can be interpreted as maximizing the entropy $\mathrm{H}(U)$ of a random trajectory $U$ sampled according to the joint distribution $\prod_{t=1}^{T} p_t$, conditioned on $U$ satisfying $U \in \mathcal{S}$, where the maximization is over the collection of conditional distributions $(p_1, \ldots, p_T)$.

Definition 4.4.1 (Maximum entropy feedback). The flexibility feedback $p^\star_t = \psi^\star_t(u_{<t})$ for $t \in [T]$ is called the maximum entropy feedback (MEF) if $(p^\star_1, \ldots, p^\star_T)$ is the unique optimal solution of (4.7).

Remark 3. Even though the optimization problem (4.7) involves variables $p_t$ for the entire time horizon $[T]$, the individual variables $p_t$ in (4.7c) are conditional probabilities that depend only on information available to the aggregator at time $t$. Therefore the maximum entropy feedback $p^\star_t$ in Definition 4.4.1 is indeed causal and belongs to the class of feedback functions defined in (4.6). The existence of $p^\star_t$ is guaranteed by Lemma 17 below, which also implies that $p^\star_t$ is unique.

We demonstrate Definition 4.4.1 using a toy example.

Example 3 (Maximum entropy feedback $p^\star$). Consider the following instance of the EV charging example in Section 4.3. Suppose the number of charging time slots is $T = 3$ and there is one customer, whose private vector is $(1, 3, 1, 1)$ and whose possible energy levels are $0$ kWh and $1$ kWh, i.e., $\mathcal{U} \equiv \{0, 1\}$. Since there is only one EV, the scheduling algorithm (disaggregation policy) assigns all power to this single EV. For this particular choice of $x$ and scheduling algorithm, the set of feasible trajectories is $\mathcal{S} = \{(0,0,1), (0,1,0), (1,0,0)\}$, shown in Figure 4.4.1 with the corresponding optimal conditional distributions given by (4.7).

Figure 4.4.1: Feasible trajectories of power signals and the computed maximum entropy feedback in Example 3.
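As a sanity check of Example 3, the MEF over a finite action set can be computed by counting subsequent feasible trajectories, i.e., the counting-measure analogue of the volume ratio in Lemma 17 below: $p^\star_t(u \mid u_{<t}) = |\mathcal{S}((u_{<t}, u))| / |\mathcal{S}(u_{<t})|$. A minimal sketch for the toy instance:

```python
from math import log

# Feasible trajectories of Example 3 (one EV, T = 3, unit energy demand).
S = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}
U = (0, 1)
T = 3

def num_continuations(prefix):
    """Number of trajectories in S that start with the given prefix u_{<=t}."""
    return sum(1 for traj in S if traj[:len(prefix)] == prefix)

def mef(prefix, u):
    """Discrete MEF: fraction of feasible continuations after appending action u."""
    denom = num_continuations(prefix)
    return num_continuations(prefix + (u,)) / denom if denom else 0.0

# Conditional distributions along feasible histories:
print({u: mef((), u) for u in U})     # {0: 0.667, 1: 0.333}
print({u: mef((0,), u) for u in U})   # {0: 0.5, 1: 0.5}

# The induced joint distribution is uniform on S, so the total entropy
# equals the system capacity log|S| = log 3 (cf. Lemma 17).
entropy = 0.0
for traj in S:
    prob = 1.0
    for t in range(T):
        prob *= mef(traj[:t], traj[t])
    entropy += -prob * log(prob)
print(entropy, log(3))                # both approximately 1.0986
```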

Properties of $p^\star_t$

We now show that the proposed maximum entropy feedback $p^\star_t$ has several desirable properties. We start by computing $p^\star_t$ explicitly. Given any action trajectory $u_{\le t}$, define the set of subsequent feasible trajectories as:

$$\mathcal{S}(u_{\le t}) := \left\{\, v_{>t} \in \mathcal{U}^{T-t} : v \text{ satisfies (4.2b)-(4.2d)},\ v_{\le t} = u_{\le t} \,\right\}.$$

As a corollary, the size $|\mathcal{S}(u_{\le t})|$ of the set of subsequent feasible trajectories is a measure of future flexibility, conditioned on $u_{\le t}$. Our first result justifies calling $p^\star_t$ the optimal flexibility feedback: $p^\star_t$ is a measure of the future flexibility that will be enabled by the operator's action $u_t$, and it attains a measure of the system's capacity for flexibility. By definition, $p^\star_1(u_1 \mid u_{<1}) := p^\star_1(u_1)$.

Lemma 17. Let $\mu(\cdot)$ denote the Lebesgue measure. The MEF, as the optimal solution of the maximization in (4.7a)-(4.7c), is given by

$$p^\star_t(u \mid u_{<t}) \equiv \frac{\mu\big(\mathcal{S}((u_{<t}, u))\big)}{\mu\big(\mathcal{S}(u_{<t})\big)}, \quad \forall\, (u_{<t}, u) \in \mathcal{U}^t. \tag{4.8}$$

Moreover, the optimal value of (4.7a)-(4.7c) is equal to $\log \mu(\mathcal{S})$.

Remark 4. When the denominator $\mu(\mathcal{S}(u_{<t}))$ is zero, the numerator $\mu(\mathcal{S}((u_{<t}, u)))$ must also be zero. In this case, we set $p^\star_t(u \mid u_{<t}) = 0$, and this does not affect the optimality of (4.7a)-(4.7c).

The proof can be found in Appendix 4.B. The volume $\mu(\mathcal{S})$ is a measure of the flexibility inherent in the aggregator. We will hence call $\log \mu(\mathcal{S})$ the system capacity. Lemma 17 then says that the optimal value of (4.7) is the system capacity, $\digamma = \log \mu(\mathcal{S})$. Moreover, the maximum entropy feedback $(p^\star_1, \ldots, p^\star_T)$ is the unique collection of conditional distributions that attains the system capacity in (4.7). This is intuitive since the entropy of a random trajectory in $\mathcal{S}$ is maximized by the uniform distribution $q^\star$ in (4.12) induced by the conditional distributions $(p^\star_1, \ldots, p^\star_T)$. Lemma 17 implies the following important properties of the maximum entropy feedback.
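For a continuous action set, the volumes in (4.8) are rarely available in closed form, but for low-dimensional toy instances they can be estimated by simple Monte Carlo. The sketch below uses a made-up feasible set $\mathcal{S} = \{u \in [0,1]^T : B_{\mathrm{lo}} \le \sum_t u_t \le B_{\mathrm{hi}}\}$ as a stand-in for (4.2b)-(4.2d), estimates $\mu(\mathcal{S}((u_{<t}, u)))$ by sampling completions uniformly, and normalizes over a grid of candidate actions to approximate $p^\star_t(\cdot \mid u_{<t})$.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 4
B_LO, B_HI = 1.5, 2.5   # hypothetical aggregate bounds standing in for (4.2b)-(4.2d)

def volume_of_continuations(prefix, n_samples=20_000):
    """Monte Carlo estimate of mu(S(prefix)): the fraction of uniformly sampled
    completions in [0, 1]^(T - len(prefix)) that give a feasible trajectory."""
    k = T - len(prefix)
    tails = rng.uniform(0.0, 1.0, size=(n_samples, k))
    sums = sum(prefix) + tails.sum(axis=1)
    return np.mean((B_LO <= sums) & (sums <= B_HI))

def mef_density(prefix, grid):
    """Approximate p*_t(u | prefix) on a grid: proportional to mu(S((prefix, u)))
    by Lemma 17 / (4.8), normalized so it integrates to one over [0, 1]."""
    vals = np.array([volume_of_continuations(list(prefix) + [u]) for u in grid])
    return vals / (vals.sum() * (grid[1] - grid[0]))

grid = np.linspace(0.0, 1.0, 21)
p1 = mef_density([], grid)   # MEF at t = 1 (empty history)
print(grid[np.argmax(p1)])   # actions near 0.5 preserve the most future flexibility
```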

Corollary 4.4.1 (Feasibility and flexibility). Let $p^\star_t = p^\star_t(\cdot \mid u_{<t})$ be the maximum entropy feedback at each time $t \in [T]$.

1. For any action trajectory $u = (u_1, \ldots, u_T)$, if $p^\star_t(u_t \mid u_{<t}) > 0$ for all $t \in [T]$, then $u \in \mathcal{S}$.

2. For all $u_t, u'_t \in \mathcal{U}$ at each time $t \in [T]$, if $p^\star_t(u_t \mid u_{<t}) \ge p^\star_t(u'_t \mid u_{<t})$, then $\mu(\mathcal{S}((u_{<t}, u_t))) \ge \mu(\mathcal{S}((u_{<t}, u'_t)))$.

The proof is provided in Appendix 4.C. We elaborate on the implications of Corollary 4.4.1 for our online feedback-based solution approach.

Remark 5 (Feasibility and flexibility). Corollary 4.4.1 says that the proposed optimal flexibility feedback $p^\star_t$ provides the right information for the system operator to choose its action $u_t$ at time $t$.

1. (Feasibility) Specifically, the first statement of the corollary says that if the operator always chooses an action $u_t$ with positive conditional probability $p^\star_t(u_t \mid u_{<t}) > 0$ at each time $t$, then the resulting action trajectory is guaranteed to be feasible, $u \in \mathcal{S}$, i.e., the system will remain feasible at every time $t \in [T]$ along the way.

2. (Flexibility) Moreover, according to the second statement of the corollary, if the system operator chooses an action $u_t$ with a larger $p^\star_t(u_t \mid u_{<t})$ value at time $t$, then the system will be more flexible going forward than if it had chosen another signal $u'_t$ with a smaller $p^\star_t(u'_t \mid u_{<t})$ value, in the sense that there are more feasible trajectories in $\mathcal{S}((u_{<t}, u_t))$ going forward.

As noted in Remark 2, despite characterizations that involve the whole action trajectory $u$, such as $u \in \mathcal{S}$, these are online properties. This guarantees the feasibility of the online closed-loop control system depicted in Figure 4.3.1, and confirms the suitability of $p^\star_t$ for online applications.
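The two properties in Remark 5 suggest a simple closed-loop protocol: at each time $t$ the aggregator reports $p^\star_t$, and the operator restricts itself to actions with $p^\star_t(u_t \mid u_{<t}) > 0$, trading off its cost against the reported flexibility among those actions. The sketch below replays Example 3 with a hypothetical greedy operator that simply picks the cheapest action with positive feedback; the cost vector is made up for illustration.

```python
# Closed-loop replay of Example 3: S = {(0,0,1), (0,1,0), (1,0,0)}, U = {0, 1}.
S = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}
U = (0, 1)
T = 3
c = [0.2, 0.5, 0.1]   # hypothetical per-slot costs c_t

def mef(prefix, u):
    """Discrete MEF p*_t(u | u_{<t}), computed by counting as in Example 3."""
    cont = [traj for traj in S if traj[:len(prefix)] == prefix]
    if not cont:
        return 0.0
    return sum(1 for traj in cont if traj[len(prefix)] == u) / len(cont)

u_hist, total_cost = (), 0.0
for t in range(T):
    p_star = {u: mef(u_hist, u) for u in U}        # aggregator's feedback at time t
    feasible = [u for u in U if p_star[u] > 0]     # Corollary 4.4.1(1): keep feasibility
    u_t = min(feasible, key=lambda u: c[t] * u)    # greedy operator: cheapest feasible action
    u_hist += (u_t,)
    total_cost += c[t] * u_t

print(u_hist in S, u_hist, total_cost)   # True (0, 0, 1) 0.1
```

Any decision rule that keeps $p^\star_t(u_t \mid u_{<t}) > 0$ terminates with a feasible trajectory; the greedy cost rule here is only one illustrative choice.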

4.5 Approximating Maximum Entropy Feedback via Reinforcement Learning
