4.3 Problem Formulation
prove its properties in Section 4.4. An RL-based approach for estimating the MEF is provided in Section 4.5. Combining the MEF with MPC, we propose an algorithm, termed the PPC, in Section 4.6. Numerical results are given in Section 4.7.
accomplished by time $T$. At each time $t$, based on its current state $x_t$, the aggregator needs to send flexibility feedback $p_t$, a probability density function chosen from a collection of feedback signals $\mathcal{P}$, to the system operator; the feedback describes the flexibility of the aggregator for accepting different actions $u_t$. We formally define $p_t$ and $\mathcal{P}$ in Section 4.4. Designing $p_t$ is one of the central problems considered in this chapter (see Section 4.4 for more details). Below we state the aggregator's goal in the real-time control system.
Aggregator's Objective. The goal of the aggregator is two-fold: (1) maintain the feasibility of the system, i.e., guarantee that $x_t \in \mathcal{X}_t$ and $u_t \in \mathcal{U}_t$ for all $t \in [T]$; (2) generate flexibility feedback $p_t$ and send it to the operator at each time $t \in [T]$.
Remark 1. We assume that the action space $\mathcal{U}$ is a continuous set in $\mathbb{R}$ only for simplicity of presentation. The results and definitions in this chapter can be extended to the discrete setting by changing the integrals to summations and replacing the differential entropy functions by discrete entropy functions; see, e.g., the definition of maximum entropy feedback (Definition 4.4.1) and Lemma 17. In practical systems, e.g., an electric system consisting of an EV aggregator and an operator, $\mathcal{U}$ often represents the set of power levels, and when the gap between power levels is small, $\mathcal{U}$ can be modeled as a continuous set.
Figure 4.3.1: System model: A feedback control approach for solving an online version of (4.2). The operator implements a control algorithm and the aggregator uses reinforcement learning to generate real-time aggregate flexibility feedback.
System Operator
A system operator is a central controller that operates the power grid. Knowing the flexibility feedback $p_t$ from the aggregator, the operator sends an action $u_t$, chosen from $\mathcal{U}$, to the aggregator at each time $t \in [T]$. Each action is associated with a cost function $c_t(\cdot) : \mathcal{U} \to \mathbb{R}_+$; e.g., a higher aggregate EV charging rate increases the load on the electricity grid and hence the operating cost. The system's objective is stated as follows.
Operator's Objective. The goal of the system operator is to provide an action $u_t \in \mathcal{U}$ at each time $t \in [T]$ to the aggregator so as to minimize the cumulative system cost given by $C_T(u_1, \ldots, u_T) := \sum_{t=1}^{T} c_t(u_t)$.
Real-Time Operator-Aggregator Coordination
Overall, considering the aggregator's and operator's objectives, the goal of the closed-loop system is to solve the following problem in real time, by coordinating the operator and aggregator via $\{p_t : t \in [T]\}$ and $\{u_t : t \in [T]\}$:
\begin{align}
\min_{u_1, \ldots, u_T} \quad & C_T(u_1, \ldots, u_T) \tag{4.2a} \\
\text{subject to} \quad & \forall t = 1, \ldots, T: \nonumber \\
& x_{t+1} = f_t(x_t, u_t), \tag{4.2b} \\
& x_t \in \mathcal{X}_t, \tag{4.2c} \\
& u_t \in \mathcal{U}_t, \tag{4.2d}
\end{align}
i.e., the operator aims to minimize its cost $C_T$ in (4.2a) while the load aggregator needs to fulfill its obligations in the form of constraints (4.2b)-(4.2d). This is an offline problem that involves global information at all times $t \in [T]$.
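To make the structure of (4.2) concrete, the following is a minimal sketch of the offline problem as a convex program in Python with cvxpy. It assumes, purely for illustration, scalar linear dynamics $x_{t+1} = A x_t + B u_t$, interval constraint sets $\mathcal{X}_t$ and $\mathcal{U}_t$, and linear costs $c_t(u_t) = \mathrm{price}_t\, u_t$; none of these modeling choices are imposed by the general formulation, and all variable names below are hypothetical.

import cvxpy as cp
import numpy as np

T = 24                      # horizon length (illustrative)
price = np.ones(T)          # stand-in for the operator's cost coefficients in c_t
A, B = 1.0, 1.0             # assumed scalar linear dynamics x_{t+1} = A*x_t + B*u_t
x_min, x_max = 0.0, 10.0    # assumed state constraint set X_t = [x_min, x_max]
u_min, u_max = 0.0, 2.0     # assumed action constraint set U_t = [u_min, u_max]
x1 = 0.0                    # initial state

x = cp.Variable(T + 1)      # states x_1, ..., x_{T+1}
u = cp.Variable(T)          # actions u_1, ..., u_T

constraints = [x[0] == x1]
for t in range(T):
    constraints += [x[t + 1] == A * x[t] + B * u[t],  # (4.2b) with an assumed linear f_t
                    x[t] >= x_min, x[t] <= x_max,     # (4.2c)
                    u[t] >= u_min, u[t] <= u_max]     # (4.2d)

cp.Problem(cp.Minimize(price @ u), constraints).solve()  # (4.2a) with assumed linear c_t

Posing such a program requires one party to hold all the data at once: the cost coefficients belong to the operator, while the dynamics and constraint sets belong to the aggregator. This is exactly the situation that the closed-loop coordination described below avoids.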
Remark 2. For simplicity, we describe our model in an offline setting where the cost and the constraints in the optimization problem (4.2) are expressed in terms of the entire trajectories of states and actions. The goal of the closed-loop control system is, however, to solve an online optimization via operator-aggregator coordination.
The challenges are: (i) the aggregator and operator need to solve the online version of (4.2) jointly, and (ii) the cost function $C_T$ is private to the operator while the constraints (4.2b)-(4.2d) are private to the aggregator. It is impractical for the aggregator to communicate the constraints to the operator because of privacy concerns or computational effort. Moreover, in an online setting, even the aggregator does not know the constraints that involve future information, e.g., future EV arrivals in an EV charging station. Formally, at each time $t \in [T]$, we assume that the operator and aggregator have access to the following information, respectively:
1. The operator knows the costs $(c_1, \ldots, c_t)$ and the feedback $(p_1, \ldots, p_t)$, but not the future costs $(c_{t+1}, \ldots, c_T)$ and feedback $(p_{t+1}, \ldots, p_T)$.
2. The aggregator knows the state transition functions $(f_1, \ldots, f_T)$, the initial state $x_1$, and the past actions $(u_1, \ldots, u_t)$.
System's Goal. Overall, the goal of the aggregator-operator system is to jointly solve the online version of (4.2a)-(4.2d), whose partial information is known to the aggregator and the operator, respectively.
The Necessity of Combining Learning and Control
With the assumptions above, on the one hand, the aggregator cannot solve the problem independently because it does not have the cost information from the operator (since the costs are often sensitive and of interest only to the operator), and even if it did, it may not have enough computational power to solve an optimization to obtain an action. On the other hand, the operator has to receive flexibility information from the aggregator in order to act. Well-known methods in pure learning or pure control cannot be applied to this problem directly. From a learning perspective, the aggregator cannot simply use reinforcement learning and transmit the parameters of a learned Q-function or an actor-critic model to the operator, because the aggregator does not know the costs.
From a control perspective, although model predictive control (MPC) is widely used for EV charging scheduling in practical charging systems [1, 96], it requires precise state information of electric vehicle supply equipment (EVSE). Thus, to solve the induced MPC problem, the system operator or aggregator needs to solve an online optimization at each time step that involves hundreds or even thousands of variables.
This is not just a complex problem; the state information of the controllable units is also potentially sensitive. This combination makes controlling sub-systems using precise information impractical for a future smart grid [24, 25]. In this work, we explore a solution where the system operator and the aggregator jointly solve an online version of (4.2) in a closed loop, as illustrated in Figure 4.3.1.
The real-time operator-aggregator coordination illustrated in Figure 4.3.1 combines learning and control approaches. It does not require the aggregator to know the system operator's objective in (4.2a), but only the action $u_t$ at each time $t \in [T]$ from the operator. In addition, it does not require the system operator to know the aggregator constraints in (4.2b), but only a feedback signal $p_t$ (to be designed) from the aggregator. After receiving flexibility feedback $p_t$, which could be generated by machine learning algorithms, the system operator outputs an action $u_t$ using a causal operator function $\psi_t(\cdot) : \mathcal{P} \to \mathcal{U}$. Knowing the state $x_t$, the aggregator generates its feedback $p_t$ using a causal aggregator function $\phi_t(\cdot) : \mathcal{X} \to \mathcal{P}$, where $\mathcal{P}$ denotes the domain of flexibility feedback that will be formally defined in Section 4.4.
By an "online feedback" solution, we mean that these functions $(\phi_t, \psi_t)$ use only information available locally at time $t \in [T]$.
Algorithm 5: Closed-Loop Feedback Control Framework (for a system operator (central controller) and an aggregator (local controller)).

for $t \in [T]$ do
    Operator (Central Controller):
        Generate an action using the PPC: $u_t = \psi_t(p_t)$
        Update the cumulative cost: $C_t = C_{t-1} + c_t(u_t)$
    Aggregator (Local Controller):
        Update the system state: $x_{t+1} = f_t(x_t, u_t)$
        Compute the estimated MEF: $p_{t+1} = \phi_t(x_{t+1})$
end
Return the total cost $C_T$;
In summary, the closed-loop control system in our model proceeds as follows. At each time $t$, the aggregator learns or computes a length-$|\mathcal{U}|$ vector $p_t$ based on the previously received action trajectory $u_{<t} = (u_1, \ldots, u_{t-1})$, and sends it to the system operator.¹ The system operator then computes a (possibly random) action $u_t = \psi_t(p_t)$ based on the flexibility feedback $p_t$ and sends it to the aggregator. The operator chooses its signal $u_t$ in order to solve the time-$t$ problem in an online version of (4.2), so the function $\psi_t$ denotes the mapping from the flexibility feedback $p_t$ to an optimal solution of the time-$t$ problem. See Section 4.6 for examples. The aggregator then computes the next feedback $p_{t+1}$ and the cycle repeats; see Algorithm 5. The goal of this chapter is to provide concrete constructions of an aggregator function $\phi$ (as an MEF generator; see Section 4.4) and an operator function $\psi$ (via the PPC scheme; see Section 4.6).
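To complement Algorithm 5, the following is a minimal Python sketch of this coordination loop, assuming a discretized action set. The functions aggregator_feedback and operator_action play the roles of $\phi_t$ and $\psi_t$, but the uniform feedback and the greedy rule used here are illustrative placeholders only; the actual constructions are the MEF generator of Section 4.4 and the PPC scheme of Section 4.6.

import numpy as np

U = np.linspace(0.0, 2.0, 11)          # assumed discretized action set U

def aggregator_feedback(x):
    # Placeholder for the aggregator function phi_t : X -> P.
    # Returns a probability vector over U; the real construction is the MEF (Section 4.4).
    return np.ones(len(U)) / len(U)

def operator_action(p, cost):
    # Placeholder for the operator function psi_t : P -> U.
    # The real construction is the PPC (Section 4.6); here we greedily trade off
    # the log-feedback against the instantaneous cost, purely for illustration.
    return U[np.argmax(np.log(p + 1e-12) - cost(U))]

def dynamics(x, u):
    # Placeholder for the state transition f_t.
    return x + u

T, x, C = 24, 0.0, 0.0
p = aggregator_feedback(x)             # initial feedback p_1
for t in range(T):
    cost = lambda v: 1.0 * v           # stand-in for the operator's private cost c_t
    u = operator_action(p, cost)       # operator: u_t = psi_t(p_t)
    C += cost(u)                       # cumulative cost C_t = C_{t-1} + c_t(u_t)
    x = dynamics(x, u)                 # aggregator: x_{t+1} = f_t(x_t, u_t)
    p = aggregator_feedback(x)         # aggregator: p_{t+1} = phi_t(x_{t+1})
print("total cost C_T =", C)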
In the sequel, we demonstrate our system model using an EV charging application, as an example of the problem stated in (4.2).
¹We will omit $u_{<t}$ in the notation when it is not essential to our discussion and simplify the probability vector as $p_t$. Note that in (4.7c) we slightly abuse the notation and use $p_t$ to denote a conditional distribution. This is only for computational purposes, and the information sent from the aggregator to the operator at time $t \in [T]$ is still a length-$|\mathcal{U}|$ probability vector, conditioned on a fixed $u_{<t}$.
An EV Charging Example
Consider an aggregator that is an EV charging facility with $n$ accepted users. Each user $j$ has a private vector $(a(j), d(j), e(j), r(j)) \in \mathbb{R}^4$, where $a(j)$ denotes its arrival (connecting) time; $d(j)$ denotes its departure (disconnecting) time, normalized according to the time indices in $[T]$; $e(j)$ denotes the total energy to be delivered; and $r(j)$ is its peak charging rate. Fixing a set of $n$ users with their private vectors $(a(j), d(j), e(j), r(j))$, the aggregator state $x_t$ at time $t \in [T]$ is a collection of length-2 vectors $(e_t(j), d_t(j) : a(j) \le t \le d(j))$ for each EV that has arrived and has not departed by time $t$. Here $e_t(j)$ is the remaining energy demand of user $j$ at time $t$ and $d_t(j)$ is the remaining charging time. The decision $s_t(j)$ is the energy delivered to user $j$ at time $t$, determined by a scheduling policy $\pi_t$ such as the well-known earliest-deadline-first, least-laxity-first, etc. Let $s_t := (s_t(1), \ldots, s_t(n))$; we have $s_t = \pi_t(u_t)$, where $u_t$ in this example is the aggregate substation power level, chosen from a small discrete set $\mathcal{U}$. The aggregator decision $s_t(j) \in \mathbb{R}_+$ at each time $t$ updates the state, in particular $e_t(j)$, such that
\begin{align}
e_t(j) &= e_{t-1}(j) - s_t(j), \tag{4.3a} \\
d_t(j) &= d_{t-1}(j) - \Delta, \tag{4.3b}
\end{align}
where $\Delta$ denotes the time unit and we assume that there is no energy loss. The laws (4.3a)-(4.3b) are examples of the generic transition functions $f_1, \ldots, f_T$ in (4.1).
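A small Python sketch of this per-step behavior is given below. It assumes a dict-based state holding $(e_t(j), d_t(j))$ for the connected EVs, and uses a least-laxity-first ordering as one example of the scheduling policy $\pi_t$; the function names step and schedule, and the sample numbers, are purely illustrative.

# State: for each connected EV j, remaining energy e_t(j) and remaining charging time d_t(j).
state = {1: {"e": 10.0, "d": 8.0}, 2: {"e": 4.0, "d": 3.0}}   # illustrative values
peak = {1: 3.0, 2: 2.0}                                        # peak charging rates r(j)
delta = 1.0                                                    # time unit Delta

def schedule(u_t, state, peak):
    # An example scheduling policy pi_t: split the aggregate level u_t among connected EVs
    # in least-laxity-first order, never exceeding a peak rate r(j) or a remaining demand.
    s, remaining = {j: 0.0 for j in state}, u_t
    for j in sorted(state, key=lambda j: state[j]["d"] - state[j]["e"] / peak[j]):
        s[j] = min(peak[j], state[j]["e"], remaining)
        remaining -= s[j]
    return s                       # when u_t is trackable, the allocations sum to u_t (4.4c)

def step(u_t, state, peak, delta):
    # Apply the transitions (4.3a)-(4.3b) for one time step.
    s = schedule(u_t, state, peak)
    for j in state:
        state[j]["e"] -= s[j]      # (4.3a): e_t(j) = e_{t-1}(j) - s_t(j)
        state[j]["d"] -= delta     # (4.3b): d_t(j) = d_{t-1}(j) - Delta
    return s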
Suppose, in the context of demand response, the system operator (a local utility company or a building management system) sends a signal $u_t$ that is the aggregate energy that can be allocated to EV charging. The aggregator makes charging decisions $s_t(j)$ to track the signal $u_t$ received from the system operator, as long as they meet the energy demands of all users before their deadlines. Then the constraints in (4.2b)-(4.2d) become the following constraints on the charging decisions $s_t$, as a function of $u_t$:
\begin{align}
s_t(j) &= 0, && t < a(j), \; j = 1, \ldots, n, \tag{4.4a} \\
s_t(j) &= 0, && t > d(j), \; j = 1, \ldots, n, \tag{4.4b} \\
\sum_{j=1}^{n} s_t(j) &= u_t, && t = 1, \ldots, T, \tag{4.4c} \\
\sum_{t=1}^{T} s_t(j) &= e(j), && j = 1, \ldots, n, \tag{4.4d} \\
0 \le s_t(j) &\le r(j), && t = 1, \ldots, T. \tag{4.4e}
\end{align}
In the above, constraint (4.4c) ensures that the aggregator decision $s_t$ tracks the signal $u_t$ at each time $t \in [T]$, constraint (4.4d) guarantees that EV $j$'s energy demand is satisfied, and the remaining constraints say that the aggregator cannot charge an EV before its arrival, after its departure, or at a rate that exceeds its peak limit. The constraints (4.4a)-(4.4e) above are examples of the constraints in (4.1). Together, for this EV charging application, (4.3a)-(4.3b) and (4.4a)-(4.4e) exemplify the dynamic system in (3.3).
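A direct single-step implication of (4.4a)-(4.4e) is that a signal $u_t$ can only be tracked if it does not exceed the total power the currently connected EVs can absorb, i.e., $0 \le u_t \le \sum_{j} \min\{r(j), e_{t-1}(j)\}$. A one-line check of this necessary condition, reusing the illustrative state representation from the sketch above, might read:

def trackable(u_t, state, peak):
    # Necessary (not sufficient) condition for (4.4c) at a single step:
    # u_t must fit within the connected EVs' peak rates and remaining demands.
    # It does not check whether future deadlines (4.4d) remain satisfiable.
    return 0.0 <= u_t <= sum(min(peak[j], state[j]["e"]) for j in state)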
The system operator's objective is to minimize the cumulative cost $C_T(u) := \sum_{t=1}^{T} c_t u_t$, where $u = (u_1, \ldots, u_T)$ are the substation power levels, as outlined in Section 4.3. The cost $c_t$ depends on multiple factors such as the electricity prices and injections from an installed rooftop solar panel. Overall, the EV charging problem is formulated below as a specific example of the generic optimization (4.2a)-(4.2d):
\begin{align}
\min_{u_1, \ldots, u_T} \quad & \sum_{t=1}^{T} c_t u_t \tag{4.5a} \\
\text{subject to} \quad & \text{(4.3a)-(4.3b) and (4.4a)-(4.4e)}. \tag{4.5b}
\end{align}
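Since (4.3) and (4.4) are linear in the decisions, the offline benchmark (4.5) can be written as a linear program. Below is a compact cvxpy sketch with made-up arrival, departure, energy, and peak-rate data, and with $u_t$ treated as a continuous variable for simplicity (the chapter's setting takes $u_t$ from a small discrete set $\mathcal{U}$); all names and numbers are illustrative.

import cvxpy as cp
import numpy as np

T, n = 12, 2
c = np.ones(T)                                        # illustrative cost coefficients c_t
a = np.array([0, 3]); d = np.array([8, 11])           # arrival / departure time indices
e = np.array([6.0, 4.0]); r = np.array([2.0, 1.5])    # energy demands and peak rates

s = cp.Variable((T, n))                               # charging decisions s_t(j)
u = cp.Variable(T)                                    # substation power levels u_t

cons = [s >= 0]
for j in range(n):
    if a[j] > 0:
        cons += [s[:a[j], j] == 0]                    # (4.4a): no charging before arrival
    if d[j] + 1 < T:
        cons += [s[d[j] + 1:, j] == 0]                # (4.4b): no charging after departure
    cons += [cp.sum(s[:, j]) == e[j],                 # (4.4d): energy demand met
             s[:, j] <= r[j]]                         # (4.4e): peak rate limit
cons += [cp.sum(s, axis=1) == u]                      # (4.4c): decisions track u_t

cp.Problem(cp.Minimize(c @ u), cons).solve()          # (4.5a)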
4.4 Definitions of Real-Time Aggregate Flexibility: Maximum Entropy Feed-