The necessity of advanced control strategies that can adapt to building non-stationarities can be demonstrated by observing the performance of existing traditional and modern control techniques in building HVAC systems. In all of these cases, the controllers need access to a sufficiently accurate model of the building that can simulate its non-stationary behavior.
Proposed Approach
In the field of robotics, [7] suggested an iterative loop in which data-driven models are relearned periodically and controllers are trained on them. In such a loop, however, the scarcity of new data about the non-stationary behavior becomes a problem when relearning the models.
Challenges in Data Driven Adaptive Control
Wear in system components: We use the term safety to describe the critical boundary requirements of the system. During the exploration phase, the controller's exploratory actions can stress the system's actuators and cause severe wear [29, 30].
Contributions of the Thesis
For the building, we are concerned about the wear and tear that may occur due to frequent adjustments and changes to the operational settings of the HVAC equipment in the buildings (see Figure 1.1, third column, for PPO). This helps reduce the computational complexity of tuning the large number of hyperparameters associated with the approximation.
Organization of the Thesis
We then provide the results of benchmarking experiments comparing the performance of our relearning approach with existing state-of-the-art approaches to supervisory control. Finally, we review the state of the art in continual reinforcement learning, from which we borrow certain ideas to advance the application of supervisory control in buildings.
Modeling of Building Systems for Energy Management
- Physics Based Modeling of HVACs
- Gray Box Modeling of HVACs
- Data-Driven Modeling of HVACs
- Statistical Regression Models
- Data Mining Algorithms
- Summary
Data-driven models apply statistical methods to data collected from the system to build operational models of system behavior. They have been commonly used to create high-level data-driven models of air-conditioning systems [51, 52], predict room temperature and humidity days ahead [53], model heat pump behavior [54], and estimate compressor capacity and power consumption [55].
Supervisory Control Strategies for Buildings
Traditional Control Strategies
Due to its limited expressiveness it has seen little use, with applications only in [75] to control heating and cooling coils in small buildings. Such controllers also lack accuracy, energy-efficient performance, and the ability to control nonlinear, time-varying processes with time delays and complex systems with uncertain information.
Advanced Control Strategies: Model Predictive Control and Reinforcement Learning
- Reinforcement Learning based approaches using Physics Based Models
So far, most work applying RL-based supervisory control has been demonstrated on physics-based models of the building. There have been very few applications in the literature where data-driven models are used to train reinforcement learning agents.
Summary
Beyond these drawbacks, we make one further observation about the components of the reward function used in these papers. Most of the reviewed literature shares a common set of metrics in the reward function.
Continual Reinforcement Learning in Non-Stationary Environments
Classical Approaches in Continual Learning
Other classical approaches to the continual reinforcement learning problem involve detecting changes in the environment's behavior and refining the target policy after changes are detected [118, 121]. Here, the missed rewards are obtained by comparing a stationary policy with the best time-dependent policy for each context.
Continual Learning in a DRL setting
Also, [132] proposed a continual learning framework in which a history of policies is preserved and policy optimization is constrained by keeping the new policy as close as possible to a discounted weighted sum of the older policies. They show that this achieves stability and prevents catastrophic forgetting compared to the baseline clipped PPO implementation.
Overall Summary
This means that, within each epoch, the agent can learn in a stationary MDP environment [134]. Therefore, we model the building as a Lipschitz-continuous non-stationary Markov decision process (LC-NSMDP) that can be learned through model-free RL interactions during the interval δ between mode changes.
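The formal definition is not reproduced in this summary; one common way to state the Lipschitz-continuity requirement of such an NSMDP is sketched below, under the assumption that drift in the transition and reward functions is bounded by the elapsed time, with a Wasserstein metric $W_1$ and Lipschitz constants $L_p$, $L_r$ (these symbols are assumptions, not taken from the thesis).

```latex
% Hedged sketch of an LC-NSMDP condition; W_1, L_p and L_r are assumptions.
\begin{align}
  W_1\big(p_t(\cdot \mid s,a),\, p_{t'}(\cdot \mid s,a)\big) &\le L_p\,|t - t'|,\\
  \big|r_t(s,a) - r_{t'}(s,a)\big| &\le L_r\,|t - t'| \qquad \forall\, s, a, t, t'.
\end{align}
```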
Outer Loop
Performance Monitor Module
We assume that whenever the controller's performance deteriorates, this is reflected in the reward signal by an overall negative trend. Since the reward is a simple scalar time-series signal, we estimate the fit of the negative trend using the technique described in [135].
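The specific technique of [135] is not detailed here; the sketch below illustrates the idea with a minimal stand-in that fits an ordinary least-squares slope over a sliding window of recent rewards. The window length and slope threshold are illustrative values, not the thesis's tuned settings.

```python
import numpy as np

def negative_trend(rewards, window=288, slope_threshold=-1e-3):
    """Detect a sustained negative trend in the scalar reward signal.

    Fits an ordinary least-squares line to the most recent `window`
    samples and flags deterioration when the slope falls below the
    threshold. `window` and `slope_threshold` are illustrative values.
    """
    if len(rewards) < window:
        return False
    recent = np.asarray(rewards[-window:], dtype=float)
    t = np.arange(window)
    slope, _intercept = np.polyfit(t, recent, deg=1)
    return slope < slope_threshold

# The outer loop would call this on the logged reward series and trigger
# the inner relearning loop when it returns True.
```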
Inner Loop
- Data-driven Modeling of the Dynamic System
- Experience Buffer
- Supervisory Controller: Deep Reinforcement Learning Agent
- Exogenous Variable Predictors
Because we use data-driven models, we must provide training data that contains the models' input variables. Measurements from real environments are prone to sensor noise, so the data-driven models that make up the dynamic system model exhibit deviations due to these inaccuracies.
Summary
Since we relearn offline on a data-driven dynamic system model, we can speed up training (2.4 ms/sample) compared to learning online, where the sampling rate is much slower (5 min/sample). To make the approach more robust to varying amounts of observation noise and the resulting errors in the data-driven models, we run multiple simulations in parallel during the relearning phase and bootstrap-aggregate these experiences during the controller updates.
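The following is a minimal sketch of this idea, assuming hypothetical `model.reset()`, `model.step(action)` and `policy(state)` interfaces rather than the thesis's actual implementation: several model instances, each fit on a different bootstrap resample of the data, are rolled out and their experiences are pooled and resampled with replacement for the controller update.

```python
import random

def rollout(model, policy, horizon=288):
    """Collect one simulated trajectory from a (noisy) data-driven model.

    `model.reset()`, `model.step(action)` and `policy(state)` are assumed
    interfaces, not the thesis's actual API.
    """
    state = model.reset()
    experiences = []
    for _ in range(horizon):
        action = policy(state)
        next_state, reward = model.step(action)
        experiences.append((state, action, reward, next_state))
        state = next_state
    return experiences

def relearn_batch(models, policy, batch_size=1024):
    """Roll out several model instances and bootstrap-aggregate the pooled
    experiences into one training batch for the controller update."""
    pooled = []
    for model in models:
        pooled.extend(rollout(model, policy))
    # Sample with replacement so no single model's biases dominate the batch.
    return [random.choice(pooled) for _ in range(batch_size)]
```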
Limitations
Therefore, we need to systematically study the effects of these hyperparameters on our approach to understand how well it generalizes under conditions with varying severity of non-stationarity. This requires us to 1) formalize the search process for the best set of hyperparameters and 2) study the sensitivity of the approach (in terms of its performance metrics) to variations of the individual hyperparameters.
Hyperparameter Optimization
Problem Formulation
Approach
- Creation of the Bayes Net
- Decomposition of Hyperparameter Space
- Separation of Hyperparameters: Connected Components and D-separation
- Bayesian Optimization for Hyperparameter Tuning
Evaluating (conditioning on) the local metrics blocks the effect of the local hyperparameters on the global metrics (Rule 1). Overall, applying the rules to the ensemble of models and their performance metrics lets us optimize the hyperparameters of most models independently.
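To make this concrete, a minimal sketch is shown below on a hypothetical Bayes-net fragment; the node names (lstm_units, ppo_clip, L_energy_model_error, G_controller_return) are illustrative, not the thesis's actual nodes, and the ancestor-based grouping is a simplification of full d-separation for this simple chain structure.

```python
import networkx as nx

# Hypothetical Bayes-net fragment: hyperparameters -> local metric -> global metric.
g = nx.DiGraph([
    ("lstm_units",    "L_energy_model_error"),
    ("lstm_lookback", "L_energy_model_error"),
    ("L_energy_model_error", "G_controller_return"),
    ("ppo_clip",  "G_controller_return"),
    ("ppo_gamma", "G_controller_return"),
])

# Hyperparameters whose only path to the global metric passes through a
# local metric can be tuned against that local metric alone (Rule 1).
local_group = sorted(nx.ancestors(g, "L_energy_model_error"))
print(local_group)   # ['lstm_lookback', 'lstm_units']

# The remaining hyperparameters are tuned against the global metric after
# the local models are fixed.
global_group = sorted(
    n for n in nx.ancestors(g, "G_controller_return")
    if not n.startswith("L_") and n not in local_group
)
print(global_group)  # ['ppo_clip', 'ppo_gamma']
```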
Hyperparameter Sensitivity Study
Using the surrogate probability model and the selection function, we choose the next hyperparameter configuration to evaluate. After this process is completed, we obtain a locally optimal estimate of the hyperparameters that maximize the performance metric under consideration and use these values when evaluating our approach in the benchmark experiments or for deployment.
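A hedged sketch of one such optimization is given below using scikit-optimize's `gp_minimize`; the search space, the synthetic objective, and the expected-improvement acquisition are assumptions standing in for the thesis's actual surrogate model, local metric, and selection function.

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

# Hypothetical search space for one local model; names and ranges are illustrative.
space = [
    Integer(16, 256, name="lstm_units"),
    Integer(4, 48, name="lookback_steps"),
    Real(1e-4, 1e-2, prior="log-uniform", name="learning_rate"),
]

def objective(params):
    lstm_units, lookback, lr = params
    # In the real pipeline this would train the data-driven model and return its
    # validation error (the local metric); a synthetic stand-in is used here.
    return (lstm_units - 96) ** 2 / 1e4 + (lookback - 24) ** 2 / 1e2 + abs(lr - 1e-3)

result = gp_minimize(
    objective,
    space,
    acq_func="EI",   # expected improvement as the selection function
    n_calls=30,      # number of surrogate-guided evaluations
    random_state=0,
)
print(result.x, result.fun)  # locally optimal hyperparameters and the metric value
```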
Summary
Limitations
Then we outline the implementation of the solution architecture based on Chapter 4, where we go into the details of (1) the data-driven models for the transition functions and the experience buffer, (2) the supervisory controller implemented as a DRL agent, (3) the components of the exogenous variable predictors, and (4) the performance monitor module. Next, we provide the tuned values for the hyperparameters and the optimal architectures for the data-driven components of the solution architecture.
System Description
The air is then delivered to individual zones of the building by the forced draft created by the VFD fan. The goal of the controller installed with our approach is to provide energy efficiency and maintain comfort, safety, and robustness when the building exhibits non-stationary behavior.
Problem Formulation
Our goal is to develop a supervisory controller for the AHU discharge setpoint that ensures energy efficiency, comfort, and reduced VAV damper actuation during building non-stationarity due to changing 1) weather, 2) zone temperature requirements, and 3) thermal load (occupancy). Accordingly, we conduct four sets of experiments: in the first three we evaluate the performance of our approach separately under each type of building non-stationarity, and in the fourth the non-stationarities are combined.
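To illustrate how these three objectives can be traded off, a minimal reward sketch is shown below; the weights and exact penalty terms are assumptions, not the thesis's tuned reward function.

```python
def reward(energy_kwh, zone_temps, zone_setpoints, damper_moves,
           w_energy=1.0, w_comfort=1.0, w_actuation=0.1):
    """Illustrative reward for the supervisory controller.

    Penalizes energy use, deviation of zone temperatures from their
    setpoints, and VAV damper actuation. The weights are assumptions,
    not the thesis's tuned values.
    """
    comfort_penalty = sum(abs(t - sp) for t, sp in zip(zone_temps, zone_setpoints))
    return -(w_energy * energy_kwh
             + w_comfort * comfort_penalty
             + w_actuation * damper_moves)
```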
Implementation of the Solution
- Dynamic System Model
- Experience Buffer
- Supervisory Controller
- Exogenous Variable Predictors
- Performance Monitor Module
The transition model $p(s' \mid s_t, a_t)$ should give the values of the following set of observations ($\bar{o}$) at the next time step: total energy consumption ($E_{tot}$), zone temperatures ($T_z$, $z = 1,\dots,5$), and VAV damper percentages ($v_{\%,z}$, $z = 1,\dots,5$). For the testbed, we also needed to predict the ambient dry-bulb temperature ($o_{at}$) and relative humidity ($o_{rh}$) values.
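A minimal sketch of such a transition model is shown below as a small PyTorch network; the layer sizes and structure are assumptions for illustration, and the thesis's actual tuned architecture is the one reported in its implementation tables.

```python
import torch
import torch.nn as nn

class TransitionModel(nn.Module):
    """Illustrative data-driven transition model.

    Maps the current observation, the supervisory action (discharge setpoint),
    and the exogenous inputs (o_at, o_rh) to the next observation vector
    [E_tot, T_1..T_5, v_1..v_5]. Sizes and layers are assumptions.
    """
    def __init__(self, obs_dim=11, act_dim=1, exo_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + exo_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, obs, act, exo):
        return self.net(torch.cat([obs, act, exo], dim=-1))

# Training would minimize the one-step prediction error (the local metric)
# on data drawn from the experience buffer, e.g. with nn.MSELoss().
```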
Hyperparameter Optimization
- Bayes Net
- Identification of global and local hyperparameters
- Separation of Hyperparameters
- Two-Step Hyperparameter Optimization
- Global and Local Hyperparameter Choices
However, they are independent of the other variables as they are not connected to the rest of the nodes. Finally, the individual model hyperparameters do not affect the global metrics due to the blocking effect of the local metrics $L_{model,error}$.
Individual Component Architecture and Performance Evaluation
- Dynamic System Model
- Experience Buffer
- Supervisory Controller
- Exogenous Variable Predictors
- Performance Monitor Module
Based on this, the network architecture is formulated as an encoder-decoder network, as shown in Table 6.10. Finally, the chosen values of the input sequence length and the output horizon under the various conditions of non-stationarity are shown in Table 6.12.
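The sketch below illustrates the encoder-decoder idea for the exogenous predictors; the hidden size, horizon, and two-feature output ($o_{at}$, $o_{rh}$) are assumptions, with the actual tuned architecture being the one in Table 6.10.

```python
import torch
import torch.nn as nn

class Seq2SeqForecaster(nn.Module):
    """Illustrative LSTM encoder-decoder for the exogenous predictors.

    Encodes a history of weather observations and decodes a forecast over
    `horizon` steps. Hidden sizes and the two-feature output (o_at, o_rh)
    are assumptions, not the tuned architecture from Table 6.10.
    """
    def __init__(self, n_features=2, hidden=32, horizon=12):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.decoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, history):
        # history: (batch, lookback, n_features)
        _, (h, c) = self.encoder(history)
        step = history[:, -1:, :]          # seed the decoder with the last observation
        outputs = []
        for _ in range(self.horizon):
            out, (h, c) = self.decoder(step, (h, c))
            step = self.head(out)          # (batch, 1, n_features)
            outputs.append(step)
        return torch.cat(outputs, dim=1)   # (batch, horizon, n_features)
```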
Sensitivity Analysis of global hyperparameters
For the performance monitoring module, we adjusted the two window lengths to detect negative trends both over small intervals, for relatively fast non-stationarity, and over large intervals, for slow non-stationarity.
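For the sensitivity analysis itself, the sketch below shows how Sobol indices of the global hyperparameters could be computed with SALib; the hyperparameter names, bounds, and the synthetic evaluation function are placeholders, not the four global hyperparameters or experiments from the thesis.

```python
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol

# Hypothetical global hyperparameters and ranges; placeholders for the four studied here.
problem = {
    "num_vars": 4,
    "names": ["gamma", "episode_length", "relearn_window", "monitor_window"],
    "bounds": [[0.9, 0.999], [24, 288], [1, 14], [1, 14]],
}

def evaluate_approach(x):
    # Stand-in for one full experiment returning a performance metric;
    # the real evaluation would run the relearning approach end to end.
    gamma, ep_len, relearn_w, monitor_w = x
    return gamma * 10 - abs(relearn_w - monitor_w) + np.log(ep_len)

X = saltelli.sample(problem, 256)                 # Saltelli sampling of the space
Y = np.array([evaluate_approach(x) for x in X])
Si = sobol.analyze(problem, Y)
print(Si["S1"])   # first-order indices
print(Si["ST"])   # total-order indices
```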
Benchmark Experiments
Evaluation Metrics
Results
The prediction horizon considered for MPPI is similar to ours under each case of non-stationarity to ensure a fair comparison. Here, the performance of PPO and DDPG was worse than even rule-based control under non-stationarity.
Summary
Each of the individual components must perform optimally for the whole approach to work well. Finally, we provide the results of the benchmarking experiments, where we compare our approach with a previously deployed rule-based controller that uses a reset schedule [11].
System Description
Problem Formulation
Implementation of the Solution
- Dynamic System Model
- Experience Buffer
- Supervisory Controller
- Exogenous Variable Predictors
- Performance Monitor Module
The tuned model architectures for the actor-critic networks, the episode length ($l$), and the discount factor ($\gamma$) are summarized in Appendix B2. As with the benchmark experiments, we had to predict the dry-bulb temperature ($o_{at}$) and relative humidity ($o_{rh}$) values for the real building.
Hyperparameter Optimization
Further, we wanted the best performance for the actual deployment, and PPO, being a recent actor-critic algorithm, was a natural choice for deployment purposes. For the exogenous variables defined by the zone setpoint schedules, the models were based on simple rules that look up the actual schedules in the real building and use them as the forecast over the required horizon $N$.
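The rule-based lookup can be pictured with the small sketch below, assuming the schedule is available as an indexable sequence; this is an illustration of the idea, not the deployed implementation.

```python
def schedule_forecast(schedule, t, horizon):
    """Rule-based 'predictor' for scheduled exogenous variables.

    Looks up the building's published zone setpoint schedule and returns
    the next `horizon` values starting at time index `t`. `schedule` is
    assumed to be an indexable sequence covering the forecast window.
    """
    return [schedule[(t + k) % len(schedule)] for k in range(horizon)]

# Example: a daily schedule sampled every 5 minutes (288 entries) can be
# queried for the next N steps without any learned model.
```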
Individual Component Architecture and Performance Evaluation
Dynamic System Model
Supervisory Controller
Benchmark Experiments
Our approach learned to exploit this by virtue of the reward function, which emphasizes keeping the setpoint as close as possible to the zone temperature requirements. Therefore, the zone deviations from setpoints were on average lower, and the zone VRF fans had to switch less often compared to rule-based control, as shown in Figures 7.3 and 7.4.
Summary
According to the building manager, these figures can, in the long term, contribute to significant energy savings and less degradation of the components associated with zone-level actuation. Based on this, we found the optimal architecture for the various components of the approach.
Motivation and Summary of our Approach
In this work, we proposed a condition-based deep-learning modeling and reinforcement learning control approach for performance optimization in buildings modeled as non-stationary systems. The proposed approach augments vanilla deep reinforcement learning (DRL) with an outer performance monitoring loop, which activates an inner relearning schedule whenever the controller's performance on the current system deteriorates under conditions of non-stationarity.
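How the outer and inner loops fit together can be summarized with the minimal sketch below; `env`, `controller`, `monitor`, and `relearn` are assumed interfaces standing in for the thesis's components rather than its actual API.

```python
def run_relearning_controller(env, controller, monitor, relearn, horizon):
    """Illustrative outer/inner loop of the proposed approach.

    `env`, `controller`, `monitor.update/deteriorated`, and `relearn`
    are assumed interfaces standing in for the thesis's components.
    """
    state = env.reset()
    for t in range(horizon):
        action = controller.act(state)
        state, reward = env.step(action)
        monitor.update(reward)
        if monitor.deteriorated():            # outer loop: negative reward trend
            controller = relearn(controller)  # inner loop: offline relearning on
            monitor.reset()                   # the refreshed data-driven models
    return controller
```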
Contributions
Limitations
Because non-stationarities indicate a new mode of operation for the system, existing data, even if recent, can cause errors in state estimation, making the policy suboptimal. Finally, since our approach has many components, it is susceptible to performance degradation if any one of them is not performing optimally.
Future Work
Overview of the solution using an inner and outer loop schema
Performance Monitor Module
Dynamic System Models
Training of the Supervisory Controller implemented as an RL agent
Exogenous Variable Prediction Module
Proposed architecture for building energy control
Schematic of Hyperparameter Optimization for our Approach based on Bayesian Optimization
Schematic of the Five Zone Testbed. Source: [3]
Bayes Net for the Relearning Approach applied to the 5 Zone Testbed
Two-Step Hyperparameter Optimization for the testbed
Sobol sensitivity indices of the four global hyperparameters, including total-order indices
Performance of the Rule-Based, PPO, DDPG, MPPI, and Relearning approaches deployed on the five-zone testbed
The time in hours and the number of training steps needed for adjusting to the building non-stationarities
Simplified schematic of the HVAC system under Study
Similarity of weather during the performance comparison of the Relearning Approach and Rule-Based Control
Comparison of hourly zone temperature deviation between relearning control and rule-based control
Comparison of hourly VRF fan On/Off switching between relearning control and rule-based control