The necessity of advanced control strategies that can adapt to building non-stationarities can be demonstrated by observing the performance of existing traditional and modern control techniques in building HVAC systems. In all of these cases, the controllers need access to a sufficiently accurate model of the building that can simulate its non-stationary behavior.
Proposed Approach
In the field of robotics, [7] suggested an iterative loop in which data-driven models are relearned periodically and controllers are trained on them. In such a loop, however, the scarcity of new data about the non-stationary behavior becomes a problem when relearning the models.
Challenges in Data Driven Adaptive Control
Wear in system components: We use the term safety to describe the critical boundary requirements of the system. During the exploration phase, the controller's exploratory actions can stress the system's actuators and cause severe wear [29, 30].
Contributions of the Thesis
For the building, we are concerned about the wear and tear that may occur due to frequent adjustments and changes to the operational settings of the HVAC equipment in the buildings (see Figure 1.1, third column, for PPO). This helps reduce the computational complexity of tuning the large number of hyperparameters associated with the approximation.
Organization of the Thesis
We then provide the results of benchmarking experiments comparing the performance of our relearning approach with existing state-of-the-art approaches to supervisory control. Finally, we review the state of the art in continual reinforcement learning, from which we borrow certain ideas to advance the application of supervisory control in buildings.
Modeling of Building Systems for Energy Management
- Physics Based Modeling of HVACs
- Gray Box Modeling of HVACs
- Data-Driven Modeling of HVACs
- Statistical Regression Models
- Data Mining Algorithms
- Summary
Data-driven models apply statistical methods to data collected from the system to build operational models of system behavior. They have been commonly used to create high-level data-driven models of air-conditioning systems [51, 52], predict room temperature and humidity days ahead [53], model heat pump behavior [54], and estimate compressor capacity and power consumption [55].
Supervisory Control Strategies for Buildings
Traditional Control Strategies
Due to its limited expressiveness it has seen little use, with applications only in [75] to control heating and cooling coils in small buildings. Such controllers also lack accuracy, energy-efficient performance, and the ability to control nonlinear, time-varying processes with time delays and complex systems with uncertain information.
Advanced Control Strategies: Model Predictive Control and Reinforcement Learning
- Reinforcement Learning based approaches using Physics Based Models
So far, most work applying RL-based supervisory control has been demonstrated on physics-based models of the building. There have been very few applications in the literature where data-driven models are used to train reinforcement learning agents.
Summary
Beyond these drawbacks, we make one further observation about the components of the reward function used in these papers. Most of the reviewed literature shares a common set of metrics in the reward function.
Continual Reinforcement Learning in Non-Stationary Environments
Classical Approaches in Continual Learning
Other classical approaches to the continual reinforcement learning problem involve detecting changes in the environment's behavior and refining the target policy after changes are detected [118, 121]. Here, the missed rewards are obtained by comparing a stationary policy with the best time-dependent policy for each context.
Continual Learning in a DRL setting
Also, [132] proposed a continual learning framework in which a history of policies is preserved and policy optimization is constrained by keeping the new policy as close as possible to a discounted weighted sum of the older policies. They show that this achieves stability and prevents catastrophic forgetting compared to the baseline clipped PPO implementation.
Overall Summary
This means that, within each epoch, the agent can learn in a stationary MDP environment [134]. Therefore, we model the building as a Lipschitz-continuous non-stationary Markov decision process (LC-NSMDP) that can be learned through model-free RL interactions during the interval δ between mode changes.
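The formal definition is not reproduced in this summary; one common way to state the Lipschitz-continuity requirement of such an NSMDP is sketched below, under the assumption that drift in the transition and reward functions is bounded by the elapsed time, with a Wasserstein metric $W_1$ and Lipschitz constants $L_p$, $L_r$ (these symbols are assumptions, not taken from the thesis).

```latex
% Hedged sketch of an LC-NSMDP condition; W_1, L_p and L_r are assumptions.
\begin{align}
  W_1\big(p_t(\cdot \mid s,a),\, p_{t'}(\cdot \mid s,a)\big) &\le L_p\,|t - t'|,\\
  \big|r_t(s,a) - r_{t'}(s,a)\big| &\le L_r\,|t - t'| \qquad \forall\, s, a, t, t'.
\end{align}
```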
Outer Loop
Performance Monitor Module
We assume that whenever the controller's performance deteriorates, this is reflected in the reward signal by an overall negative trend. Since the reward is a simple scalar time-series signal, we estimate the fit of the negative trend using the technique described in [135].
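The specific technique of [135] is not detailed here; the sketch below illustrates the idea with a minimal stand-in that fits an ordinary least-squares slope over a sliding window of recent rewards. The window length and slope threshold are illustrative values, not the thesis's tuned settings.

```python
import numpy as np

def negative_trend(rewards, window=288, slope_threshold=-1e-3):
    """Detect a sustained negative trend in the scalar reward signal.

    Fits an ordinary least-squares line to the most recent `window`
    samples and flags deterioration when the slope falls below the
    threshold. `window` and `slope_threshold` are illustrative values.
    """
    if len(rewards) < window:
        return False
    recent = np.asarray(rewards[-window:], dtype=float)
    t = np.arange(window)
    slope, _intercept = np.polyfit(t, recent, deg=1)
    return slope < slope_threshold

# The outer loop would call this on the logged reward series and trigger
# the inner relearning loop when it returns True.
```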
Inner Loop
- Data-driven Modeling of the Dynamic System
- Experience Buffer
- Supervisory Controller: Deep Reinforcement Learning Agent
- Exogenous Variable Predictors
Because we use data-driven models, we must provide training data that contains the models' input variables. Measurements from real environments are prone to sensor noise, so the data-driven models that make up the dynamic system model exhibit deviations due to these inaccuracies.
Summary
Since we relearn offline on a data-driven dynamic system model, we can speed up training (2.4 ms/sample) compared to learning online, where the sampling rate is much slower (5 min/sample). To make the approach more robust to varying amounts of observation noise and the resulting errors in the data-driven models, we run multiple simulations in parallel during the relearning phase and bootstrap-aggregate these experiences during the controller updates.
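The following is a minimal sketch of this idea, assuming hypothetical `model.reset()`, `model.step(action)` and `policy(state)` interfaces rather than the thesis's actual implementation: several model instances, each fit on a different bootstrap resample of the data, are rolled out and their experiences are pooled and resampled with replacement for the controller update.

```python
import random

def rollout(model, policy, horizon=288):
    """Collect one simulated trajectory from a (noisy) data-driven model.

    `model.reset()`, `model.step(action)` and `policy(state)` are assumed
    interfaces, not the thesis's actual API.
    """
    state = model.reset()
    experiences = []
    for _ in range(horizon):
        action = policy(state)
        next_state, reward = model.step(action)
        experiences.append((state, action, reward, next_state))
        state = next_state
    return experiences

def relearn_batch(models, policy, batch_size=1024):
    """Roll out several model instances and bootstrap-aggregate the pooled
    experiences into one training batch for the controller update."""
    pooled = []
    for model in models:
        pooled.extend(rollout(model, policy))
    # Sample with replacement so no single model's biases dominate the batch.
    return [random.choice(pooled) for _ in range(batch_size)]
```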
Limitations
Therefore, we need to systematically study the effects of these hyperparameters on our approach to understand how well it generalizes under conditions with varying severity of non-stationarity. This requires us to 1) formalize the search process for the best set of hyperparameters and 2) study the sensitivity of the approach (in terms of its performance metrics) to variations of the individual hyperparameters.
Hyperparameter Optimization
Problem Formulation
Approach
- Creation of the Bayes Net
- Decomposition of Hyperparameter Space
- Separation of Hyperparameters: Connected Components and D-separation
- Bayesian Optimization for Hyperparameter Tuning
Evaluating (conditioning on) the local metrics blocks the effect of the local hyperparameters on the global metrics (Rule 1). Overall, applying the rules to the ensemble of models and their performance metrics lets us optimize the hyperparameters of most models independently.
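To make this concrete, a minimal sketch is shown below on a hypothetical Bayes-net fragment; the node names (lstm_units, ppo_clip, L_energy_model_error, G_controller_return) are illustrative, not the thesis's actual nodes, and the ancestor-based grouping is a simplification of full d-separation for this simple chain structure.

```python
import networkx as nx

# Hypothetical Bayes-net fragment: hyperparameters -> local metric -> global metric.
g = nx.DiGraph([
    ("lstm_units",    "L_energy_model_error"),
    ("lstm_lookback", "L_energy_model_error"),
    ("L_energy_model_error", "G_controller_return"),
    ("ppo_clip",  "G_controller_return"),
    ("ppo_gamma", "G_controller_return"),
])

# Hyperparameters whose only path to the global metric passes through a
# local metric can be tuned against that local metric alone (Rule 1).
local_group = sorted(nx.ancestors(g, "L_energy_model_error"))
print(local_group)   # ['lstm_lookback', 'lstm_units']

# The remaining hyperparameters are tuned against the global metric after
# the local models are fixed.
global_group = sorted(
    n for n in nx.ancestors(g, "G_controller_return")
    if not n.startswith("L_") and n not in local_group
)
print(global_group)  # ['ppo_clip', 'ppo_gamma']
```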
Hyperparameter Sensitivity Study
Using the surrogate probability model and the selection function, we choose the next hyperparameter configuration to evaluate. After this process is completed, we obtain a locally optimal estimate of the hyperparameters that maximize the performance metric under consideration and use these values when evaluating our approach in the benchmark experiments or for deployment.
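A hedged sketch of one such optimization is given below using scikit-optimize's `gp_minimize`; the search space, the synthetic objective, and the expected-improvement acquisition are assumptions standing in for the thesis's actual surrogate model, local metric, and selection function.

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

# Hypothetical search space for one local model; names and ranges are illustrative.
space = [
    Integer(16, 256, name="lstm_units"),
    Integer(4, 48, name="lookback_steps"),
    Real(1e-4, 1e-2, prior="log-uniform", name="learning_rate"),
]

def objective(params):
    lstm_units, lookback, lr = params
    # In the real pipeline this would train the data-driven model and return its
    # validation error (the local metric); a synthetic stand-in is used here.
    return (lstm_units - 96) ** 2 / 1e4 + (lookback - 24) ** 2 / 1e2 + abs(lr - 1e-3)

result = gp_minimize(
    objective,
    space,
    acq_func="EI",   # expected improvement as the selection function
    n_calls=30,      # number of surrogate-guided evaluations
    random_state=0,
)
print(result.x, result.fun)  # locally optimal hyperparameters and the metric value
```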
Summary
Limitations
Then we outline the implementation of the solution architecture based on Chapter 4, where we go into the details of (1) the data-driven models for the transition functions and the experience buffer, (2) the supervisory controller implemented as a DRL agent, (3) the components of the exogenous variable predictors, and (4) the performance monitor module. Next, we provide the tuned values for the hyperparameters and the optimal architectures for the data-driven components of the solution architecture.
System Description
The air is then delivered to individual zones of the building by the forced draft created by the VFD fan. The goal of the controller installed with our approach is to provide energy efficiency and maintain comfort, safety, and robustness when the building exhibits non-stationary behavior.
Problem Formulation
Our goal is to develop a supervisory controller for the AHU discharge setpoint that ensures energy efficiency, comfort, and reduced VAV damper actuation during building non-stationarity due to changing 1) weather, 2) zone temperature requirements, and 3) thermal load (occupancy). Accordingly, we conduct four sets of experiments: in the first three we evaluate the performance of our approach separately under each type of building non-stationarity, and in the fourth the non-stationarities are combined.
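To illustrate how these three objectives can be traded off, a minimal reward sketch is shown below; the weights and exact penalty terms are assumptions, not the thesis's tuned reward function.

```python
def reward(energy_kwh, zone_temps, zone_setpoints, damper_moves,
           w_energy=1.0, w_comfort=1.0, w_actuation=0.1):
    """Illustrative reward for the supervisory controller.

    Penalizes energy use, deviation of zone temperatures from their
    setpoints, and VAV damper actuation. The weights are assumptions,
    not the thesis's tuned values.
    """
    comfort_penalty = sum(abs(t - sp) for t, sp in zip(zone_temps, zone_setpoints))
    return -(w_energy * energy_kwh
             + w_comfort * comfort_penalty
             + w_actuation * damper_moves)
```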
Implementation of the Solution
- Dynamic System Model
- Experience Buffer
- Supervisory Controller
- Exogenous Variable Predictors
- Performance Monitor Module
The transition model $p(s' \mid s_t, a_t)$ should give the values of the following set of observations ($\bar{o}$) at the next time step: total energy consumption ($E_{tot}$), zone temperatures ($T_z$, $z = 1,\dots,5$), and VAV damper percentages ($v_{\%,z}$, $z = 1,\dots,5$). For the testbed, we also needed to predict the ambient dry-bulb temperature ($o_{at}$) and relative humidity ($o_{rh}$) values.
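A minimal sketch of such a transition model is shown below as a small PyTorch network; the layer sizes and structure are assumptions for illustration, and the thesis's actual tuned architecture is the one reported in its implementation tables.

```python
import torch
import torch.nn as nn

class TransitionModel(nn.Module):
    """Illustrative data-driven transition model.

    Maps the current observation, the supervisory action (discharge setpoint),
    and the exogenous inputs (o_at, o_rh) to the next observation vector
    [E_tot, T_1..T_5, v_1..v_5]. Sizes and layers are assumptions.
    """
    def __init__(self, obs_dim=11, act_dim=1, exo_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + exo_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, obs, act, exo):
        return self.net(torch.cat([obs, act, exo], dim=-1))

# Training would minimize the one-step prediction error (the local metric)
# on data drawn from the experience buffer, e.g. with nn.MSELoss().
```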
Hyperparameter Optimization
- Bayes Net
- Identification of global and local hyperparameters
- Separation of Hyperparameters
- Two-Step Hyperparameter Optimization
- Global and Local Hyperparameter Choices
However, they are independent of the other variables as they are not connected to the rest of the nodes. Finally, the individual model hyperparameters do not affect the global metrics due to the blocking effect of the local metrics $L_{model,error}$.
Individual Component Architecture and Performance Evaluation
- Dynamic System Model
- Experience Buffer
- Supervisory Controller
- Exogenous Variable Predictors
- Performance Monitor Module
Based on this, the network architecture is formulated as an encoder-decoder network, as shown in Table 6.10. Finally, the chosen values of the input sequence length and the output horizon under the various conditions of non-stationarity are shown in Table 6.12.
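The sketch below illustrates the encoder-decoder idea for the exogenous predictors; the hidden size, horizon, and two-feature output ($o_{at}$, $o_{rh}$) are assumptions, with the actual tuned architecture being the one in Table 6.10.

```python
import torch
import torch.nn as nn

class Seq2SeqForecaster(nn.Module):
    """Illustrative LSTM encoder-decoder for the exogenous predictors.

    Encodes a history of weather observations and decodes a forecast over
    `horizon` steps. Hidden sizes and the two-feature output (o_at, o_rh)
    are assumptions, not the tuned architecture from Table 6.10.
    """
    def __init__(self, n_features=2, hidden=32, horizon=12):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.decoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, history):
        # history: (batch, lookback, n_features)
        _, (h, c) = self.encoder(history)
        step = history[:, -1:, :]          # seed the decoder with the last observation
        outputs = []
        for _ in range(self.horizon):
            out, (h, c) = self.decoder(step, (h, c))
            step = self.head(out)          # (batch, 1, n_features)
            outputs.append(step)
        return torch.cat(outputs, dim=1)   # (batch, horizon, n_features)
```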
Sensitivity Analysis of global hyperparameters
For the performance monitoring module, we adjusted the two window lengths to detect negative trends both over small intervals, for relatively fast non-stationarity, and over large intervals, for slow non-stationarity.
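For the sensitivity analysis itself, the sketch below shows how Sobol indices of the global hyperparameters could be computed with SALib; the hyperparameter names, bounds, and the synthetic evaluation function are placeholders, not the four global hyperparameters or experiments from the thesis.

```python
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol

# Hypothetical global hyperparameters and ranges; placeholders for the four studied here.
problem = {
    "num_vars": 4,
    "names": ["gamma", "episode_length", "relearn_window", "monitor_window"],
    "bounds": [[0.9, 0.999], [24, 288], [1, 14], [1, 14]],
}

def evaluate_approach(x):
    # Stand-in for one full experiment returning a performance metric;
    # the real evaluation would run the relearning approach end to end.
    gamma, ep_len, relearn_w, monitor_w = x
    return gamma * 10 - abs(relearn_w - monitor_w) + np.log(ep_len)

X = saltelli.sample(problem, 256)                 # Saltelli sampling of the space
Y = np.array([evaluate_approach(x) for x in X])
Si = sobol.analyze(problem, Y)
print(Si["S1"])   # first-order indices
print(Si["ST"])   # total-order indices
```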
Benchmark Experiments
Evaluation Metrics
Results
The prediction horizon considered for MPPI is similar to ours under each case of non-stationarity to ensure a fair comparison. Here, the performance of PPO and DDPG was worse than even rule-based control under non-stationarity.
Summary
Each of the individual components must perform optimally for the whole approach to work well. Finally, we provide the results of the benchmarking experiments, where we compare our approach with a previously deployed rule-based controller that uses a reset schedule [11].
System Description
Problem Formulation
Implementation of the Solution
- Dynamic System Model
- Experience Buffer
- Supervisory Controller
- Exogenous Variable Predictors
- Performance Monitor Module
The tuned model architectures for the actor-critic networks, the episode length ($l$), and the discount factor ($\gamma$) are summarized in Appendix B2. As with the benchmark experiments, we had to predict the dry-bulb temperature ($o_{at}$) and relative humidity ($o_{rh}$) values for the real building.
Hyperparameter Optimization
Further, we wanted the best performance for the actual deployment, and PPO, being a recent actor-critic algorithm, was a natural choice for deployment purposes. For the exogenous variables defined by the zone setpoint schedules, the models were based on simple rules that look up the actual schedules in the real building and use them as the forecast over the required horizon $N$.
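The rule-based lookup can be pictured with the small sketch below, assuming the schedule is available as an indexable sequence; this is an illustration of the idea, not the deployed implementation.

```python
def schedule_forecast(schedule, t, horizon):
    """Rule-based 'predictor' for scheduled exogenous variables.

    Looks up the building's published zone setpoint schedule and returns
    the next `horizon` values starting at time index `t`. `schedule` is
    assumed to be an indexable sequence covering the forecast window.
    """
    return [schedule[(t + k) % len(schedule)] for k in range(horizon)]

# Example: a daily schedule sampled every 5 minutes (288 entries) can be
# queried for the next N steps without any learned model.
```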
Individual Component Architecture and Performance Evaluation
Dynamic System Model
Supervisory Controller
Benchmark Experiments
Our approach learned to exploit this by virtue of the reward function, which emphasizes keeping the setpoint as close as possible to the zone temperature requirements. Therefore, the zone deviations from setpoints were on average lower, and the zone VRF fans had to switch less often compared to rule-based control, as shown in Figures 7.3 and 7.4.
Summary
According to the building manager, these figures can, in the long term, contribute to significant energy savings and less degradation of the components associated with zone-level actuation. Based on this, we found the optimal architecture for the various components of the approach.
Motivation and Summary of our Approach
In this work, we proposed a condition-based deep-learning modeling and reinforcement learning control approach for performance optimization in buildings modeled as non-stationary systems. The proposed approach augments vanilla deep reinforcement learning (DRL) with an outer performance monitoring loop, which activates an inner relearning schedule whenever the controller's performance on the current system deteriorates under conditions of non-stationarity.
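How the outer and inner loops fit together can be summarized with the minimal sketch below; `env`, `controller`, `monitor`, and `relearn` are assumed interfaces standing in for the thesis's components rather than its actual API.

```python
def run_relearning_controller(env, controller, monitor, relearn, horizon):
    """Illustrative outer/inner loop of the proposed approach.

    `env`, `controller`, `monitor.update/deteriorated`, and `relearn`
    are assumed interfaces standing in for the thesis's components.
    """
    state = env.reset()
    for t in range(horizon):
        action = controller.act(state)
        state, reward = env.step(action)
        monitor.update(reward)
        if monitor.deteriorated():            # outer loop: negative reward trend
            controller = relearn(controller)  # inner loop: offline relearning on
            monitor.reset()                   # the refreshed data-driven models
    return controller
```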
Contributions
Limitations
Because non-stationarities indicate a new mode of operation for the system, existing data, even if recent, can cause errors in state estimation, making the policy suboptimal. Finally, since our approach has many components, it is susceptible to performance degradation if any one of them is not performing optimally.
Future Work
Overview of the solution using an inner and outer loop schema
Performance Monitor Module
Dynamic System Models
Training of the Supervisory Controller implemented as an RL agent
Exogenous Variable Prediction Module
Proposed architecture for building energy control
Schematic of Hyperparameter Optimization for our Approach based on Bayesian Optimization
Schematic of the Five Zone Testbed. Source: [3]
Bayes Net for the Relearning Approach applied to the 5 Zone Testbed
Two-Step Hyperparameter Optimization for the testbed
Sobol sensitivity indices of the four global hyperparameters, including total-order indices
Performance of the Rule-Based, PPO, DDPG, MPPI, and Relearning approaches deployed on the five-zone testbed
The time in hours and the number of training steps needed for adjusting to the building non-stationarities
Simplified schematic of the HVAC system under Study
Similarity of weather during the performance comparison of the Relearning Approach and Rule-Based Control
Comparison of hourly zone temperature deviation between relearning control and rule-based control
Comparison of hourly VRF fan On/Off switching between relearning control and rule-based control