Online reinforcement learning of controller parameters adaptation law
Khalid Alhazmi1 and S. Mani Sarathy1
Abstract— Real-time control of highly nonlinear systems is a challenging task in many industrial processes. Here, we propose a learning-based adaptation law for adapting the controller parameters of nonlinear systems. The method applies model-free reinforcement learning to learn an effective parameter adaptation law while maintaining safe system operation by including a safety layer. The efficacy of the proposed algorithm is demonstrated by controlling thermoacoustic combustion instability, which is a critical issue in developing high-efficiency, low-emission gas turbine technologies. We show that the learning-based mechanism is able to attenuate combustion instabilities in a time-variant system in the presence of process noise. The proposed algorithm outperforms other model-free and model-based adaptation methods, such as extremum seeking controllers and self-tuning regulators, respectively.
Keywords: Nonlinear systems, adaptive control, reinforcement learning, combustion instabilities
I. INTRODUCTION
Many industrial processes and energy systems are operated at sub-optimal conditions. The highly nonlinear nature of certain phenomena (e.g., chemical reactions coupled to fluid mechanics) makes process control and optimization challenging. Modeling the dynamics in these systems is often inaccurate and incomplete, resulting in process disturbances.
In addition, it is common for the system dynamics and control objectives to change with time. Thus, there is a need for a class of controllers that can learn online to control unknown systems using data measured in real-time [1].
One approach is to estimate the system parameters through system identification and then derive a control law, an approach known as indirect adaptive control. However, such an approach requires recomputing controls from estimated system models at each step, which is inherently complex.
An alternative approach, known as direct adaptive control, in which controller parameters are estimated directly, is more attractive [2].
Of particular interest to us are nonlinear problems with two characteristics: (1) a time-varying reference trajectory or set point, and (2) lack of sufficiently accurate reduced-order models, which are required for the design of well-known direct adaptive controllers, such as H-infinity and self-tuning regulators [3], [4]. An instance of a problem with these features is the case of finding the optimum fuel-air ratio in gas turbine combustion systems susceptible to thermoacoustic instabilities; proper control is critical for ensuring stable, efficient, and low-emission performance.
1King Abdullah University of Science and Technology (KAUST), Clean Combustion Research Center (CCRC), Physical Science and Engineering Division (PSE), Thuwal 23955-6900, Saudi Arabia {khalid.alhazmi, mani.sarathy}@kaust.edu.sa
Such a task is challenging due to time variations in fuel (or air) temperature and quality, for instance. While combustor models exist, no reduced-order model is sufficiently general, as each applies only to a narrow window of operating conditions [5].
A successful adaptive control method for the above-mentioned problems is extremum seeking control (ESC) [6], [7].
ESC aims to find and maintain the optimum operating point by perturbing the plant and using the response to estimate the gradient of an objective function. ESC can be used as a controller or to tune the parameters of a working controller.
Despite its success, ESC has several limitations. First, it requires continuous perturbation of the system. Second, careful selection of several ESC tuning parameters is required to achieve satisfactory performance. Such parameters include, but are not limited to, the frequency and amplitude of high- and low-pass filters, the frequency and amplitude of the perturbation signal, an integrator gain, and a good initial guess. While ESC is theoretically model-free, a model is needed in practice to select proper tuning parameters.
An alternative method to ESC is reinforcement learning (RL), which can be seen as direct adaptive control, as Sutton et al. point out [2]. RL has been extensively used in continuous control tasks, but most applications involve using RL as a controller [8]–[11]. In this work, we propose using RL as an adaptation mechanism for general working controllers. A safety layer that filters out unsafe actions by the RL policy is added to ensure that system constraints are never violated during learning. We consider the problem of a time-varying dynamic system controlled by an adjustable controller; because the system is time-varying, the controller parameters must be dynamically adjusted to achieve the desired performance.
To demonstrate the utility and effectiveness of the proposed method, we apply it to the control of thermoacoustic combustion instabilities [12], [13]. An aspect of great significance in gas turbine combustors is dynamic flame stability [14]. The susceptibility of flames to become unstable due to the coupling of the unsteady heat release and acoustic waves inside the combustor has been one of the principal challenges in developing modern high-efficiency, low-emission gas turbine combustors in recent decades. Combustion instabilities are generally considered one of the highest risk items in new engine development [15]. Introducing fuel variability with novel zero-carbon fuels (e.g., ammonia, hydrogen) amplifies the uncertainty in combustor–acoustic interactions; therefore, flame dynamic stability is a vital issue in developing carbon-free gas turbine technologies.
Reinforcement learning has been used before for tuning the parameters of specific working controllers, such as PID and model predictive controllers (MPC) [16], [17]. We show here that this does not have to be the case: we propose a framework that applies to a broad class of adjustable controllers while also considering the safety implications of online system exploration by RL. Approaches that rely on recorded input and output data also exist [18]. Further, prior publications either do not consider the safety of the RL actions or encode safety only as a penalty in the reward function, which means that applying RL for parameter tuning must be done on a case-by-case basis.
On the other hand, our algorithm can be applied regardless of the controller with only weak assumptions. Finally, we apply our framework to the problem of attenuating thermoacoustic flame instabilities. To our knowledge, RL has not been applied to the active control of flame instabilities.
We show that our proposed framework performs as well as or better than model-free and model-based methods, such as extremum seeking controllers, self-tuning regulators, and H-infinity robust controllers.
II. A SAFE REINFORCEMENT LEARNING SCHEME FOR PARAMETER ADAPTATION
A. Problem statement
We consider a general nonlinear model
$x_{t+1} = f(x_t, u_t, d_t)$  (1)
where $x \in \mathbb{R}^n$ is the state, $u \in \mathbb{R}^m$ is the input, $d \in \mathbb{R}^{d}$ is an unknown disturbance, and $f : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ and $h : \mathbb{R}^n \to \mathbb{R}$ are smooth. Given a smooth control law
$u_t = \alpha(x_t; \theta_t)$  (2)
parametrized by a parameter $\theta \in \mathbb{R}^{n_\theta}$, the closed-loop system
$x_{t+1} = f(x_t, \alpha(x_t; \theta_t))$  (3)
has equilibria corresponding to each $\theta_t$. The problem is to select $\theta_t$ that optimizes the control objective. We assume that a set of controller parameters exists under which the closed-loop system is stable, and its performance is optimal or near-optimal for a performance measure. The main objective of this work is to learn a controller parameter adaptation law (policy), $\pi$, such that
$\theta_{t+1} = \pi(x_t)$  (4)
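For illustration, the closed loop defined by Eqs. (1)–(4) can be sketched as follows. This is a minimal sketch assuming NumPy; the dynamics, control law, and adaptation policy shown are illustrative placeholders, not the systems or policies considered in this paper.

```python
# Minimal sketch of the closed loop in Eqs. (1)-(4); all functions are
# illustrative placeholders.
import numpy as np

def f(x, u, d):
    """Unknown plant dynamics x_{t+1} = f(x_t, u_t, d_t) (Eq. 1); toy example."""
    return 0.9 * x + 0.1 * u + d

def alpha(x, theta):
    """Adjustable control law u_t = alpha(x_t; theta_t) (Eq. 2); here a simple gain."""
    return -theta[0] * x

def pi(x):
    """Parameter adaptation law theta_{t+1} = pi(x_t) (Eq. 4); to be learned by RL."""
    return np.array([1.0 + 0.5 * abs(x)])   # placeholder adaptation rule

x, theta = 1.0, np.array([1.0])
for t in range(50):
    d = 0.01 * np.random.randn()    # unknown disturbance d_t
    u = alpha(x, theta)             # compute control input (Eq. 2)
    x = f(x, u, d)                  # advance the closed-loop system (Eq. 3)
    theta = pi(x)                   # adapt the controller parameters (Eq. 4)
```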
B. The RL problem
The proposed controller parameter adaptation scheme consists of two parts: the reinforcement learning policy that maps the state of the plant to the controller parameters and a safety layer that ensures that the controller parameters updated by the RL policy do not lead to instabilities. The scheme is summarized in Figure 1.
Fig. 1: A safe reinforcement learning scheme for adjusting controller parameters
We start by formulating the reinforcement learning problem as a constrained Markov decision process (CMDP). While a Markov decision process (MDP) introduces a single utility (reward or cost function) consisting of different objectives to be optimized, a CMDP considers a situation where one type of cost is to be optimized while keeping the other types of costs below some given bounds. A CMDP is defined by a tuple:
$\mathcal{M} = \langle x, u, f, r, C, \gamma \rangle$  (5)
where $x$, $u$, and $f$ are as defined previously, $r : x \times u \to \mathbb{R}$ is a reward function, and $C = \{c_i : x \times u \to \mathbb{R} \mid i \in [K]\}$ is a set of immediate-constraint functions, where $[K]$ is defined as $\{1, \ldots, K\}$, representing each constraint, and $\gamma \in (0,1)$ is a discount factor. We also define $\bar{C} = \{\bar{c}_i : x \to \mathbb{R} \mid i \in [K]\}$ as the immediate-constraint values per state.
The objective is to find a parameterized policy $\pi_\phi$ that maps the system output to the controller parameters, $\theta$, by solving the following policy optimization problem
$\max_{\pi_\phi} \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r(x_t, \theta_t) \right]$  (6a)
s.t. $\bar{c}_i(x_t) \leq C_i \quad \forall i \in [K]$  (6b)
where, at each state, all constrained states $\bar{c}_i(\cdot)$ are upper bounded by a constant $C_i$, and $\pi_\phi$ is a policy represented by a neural network and parameterized by $\phi$, the weights and biases of the neural network.
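As a small illustration of the objective in Eq. (6), the following sketch evaluates the discounted return (6a) and checks the per-state constraints (6b) along a single rollout; the rewards, constraint values, and bound here are synthetic placeholders.

```python
# Sketch of evaluating the CMDP objective (6a) and constraint (6b) on one
# rollout; all numbers are synthetic stand-ins for data from the plant.
import numpy as np

gamma = 0.99
rewards = np.random.randn(200)                            # r(x_t, theta_t) along a rollout
constraint_values = 0.01 * np.abs(np.random.randn(200))   # c_bar_i(x_t) along the rollout
C_i = 0.04                                                # constraint bound (e.g., max pressure)

discounted_return = np.sum(gamma ** np.arange(len(rewards)) * rewards)   # objective (6a)
feasible = np.all(constraint_values <= C_i)                              # constraint (6b)
print(f"return = {discounted_return:.2f}, trajectory feasible: {feasible}")
```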
C. Construction of the constraint function
The optimization problem detailed in Equation 6 presents a significant challenge, as the RL agent requires exploration to acquire a satisfactory policy. A penalty in the reward function, $r$, for being in an unsafe state is, on its own, inadequate to guarantee safety, since the RL agent must have sufficient experience of such states to recognize them. Hence, incorporating prior knowledge of the dynamics before training is an essential requirement when implementing model-free reinforcement learning algorithms.
Following the work of Dalal et al. [19], a constraint function $c_i(x_t, \theta_t)$ is first linearized as follows
$\bar{c}_i(x_{t+1}) := c_i(x_t, \theta_t) \approx \bar{c}_i(x_t) + g(x_t; w_i)^T \theta_t$  (7)
where $g(x_t; w_i)$ is a neural network parametrized by $w_i$. Training the neural network is carried out by solving the following problem
$\arg\min_{w_i} \sum_{(x_t, \theta_t, x_{t+1}) \in D} \left( \bar{c}_i(x_{t+1}) - \left( \bar{c}_i(x_t) + g(x_t; w_i)^T \theta_t \right) \right)^2$  (8)
The data needed for training $g(x_t; w_i)$ is a set of tuples $D = \{(x_t, \theta_t, x_{t+1})\}$ and can be obtained from simulation or experiment. The function $g(x_t; w_i)$ represents the sensitivity of changes in the controlled states to the controls, using knowledge of the dynamics learned from data.
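A minimal sketch of the training problem in Eq. (8) is shown below, assuming PyTorch, a synthetic data set $D$, and an illustrative choice of the immediate-constraint map $\bar{c}_i$ (the pressure magnitude itself); the dimensions and network sizes are assumptions made for illustration.

```python
# Sketch of fitting the constraint-sensitivity network g(x; w_i) via Eq. (8).
import torch
import torch.nn as nn

n_x, n_theta = 1, 2                      # state and parameter dimensions (illustrative)

g = nn.Sequential(                        # g(x; w_i): maps the state to a sensitivity vector
    nn.Linear(n_x, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, n_theta),
)
opt = torch.optim.Adam(g.parameters(), lr=1e-3)

def c_bar(x):
    # Immediate-constraint value per state; here, simply the pressure magnitude.
    return x[:, :1]

# D = {(x_t, theta_t, x_{t+1})}: synthetic stand-in for simulation/experiment data.
x_t = torch.randn(1024, n_x)
theta = torch.randn(1024, n_theta)
x_tp1 = torch.randn(1024, n_x)

for epoch in range(200):
    pred = c_bar(x_t) + (g(x_t) * theta).sum(dim=1, keepdim=True)  # c_bar(x_t) + g(x_t)^T theta_t
    loss = ((c_bar(x_tp1) - pred) ** 2).mean()                     # least-squares loss of Eq. (8)
    opt.zero_grad(); loss.backward(); opt.step()
```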
The action taken by the RL policy is denoted by $\pi_\phi(x_t)$. The idea of the safety layer is to solve the following problem at each state
$\theta_t^* = \arg\min_{\theta_t} \; \tfrac{1}{2} \|\theta_t - \pi_\phi(x_t)\|^2$  (9a)
s.t. $\bar{c}_i(x_t) + g(x_t; w_i)^T \theta_t < C_i \quad \forall i \in [K]$  (9b)
where the constraint linearization determined earlier has been substituted. This safety layer aims to output a controller parameter, $\theta_t^*$, that is as close as possible to $\pi_\phi(x_t)$, the original parameter determined by the RL algorithm. The optimization problem has a quadratic objective and linear constraints, for which the global closed-form solution is readily obtainable and can be found in [19].
D. Solution algorithm
The RL algorithm that is selected to solve the problem described by Equation 6 is known as soft actor-critic (SAC) [20]. The SAC algorithm integrates three key elements:
an actor-critic architecture with separate policy and value function networks, an off-policy formulation that enables the reuse of previously collected data for efficiency, and entropy maximization to encourage stability and exploration.
Haarnoja et al. [20] found SAC to be more stable and scalable than other common RL algorithms, such as deep deterministic policy gradient (DDPG) [8]. SAC modifies the objective of the CMDP to include an entropy term, so the optimal policy is defined as follows
$\pi_\phi^* = \arg\max_{\pi_\phi} \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \left( r(x_t, \pi_\phi(x_t)) + \beta \mathcal{H}(\pi_\phi(\cdot \mid x_t)) \right) \right]$  (10)
where $\beta$ is a temperature parameter that determines the relative importance of the entropy term versus the reward, and the entropy of the policy, $\mathcal{H}(\pi_\phi(\cdot \mid x_t))$, is given by
$\mathcal{H}(\pi_\phi(\cdot \mid x_t)) = \mathbb{E}[-\log \pi_\phi(\theta_t \mid x_t)]$  (11)
The design of the reward function, $r(x_t, \theta_t)$, can be freely chosen depending on the control objective. Algorithm 1 provides an overview of the proposed scheme. In this algorithm, $t_{final}$ is a preset simulation time.
Remark 1: Unlike conventional adaptive controllers that require tuning for different dynamic systems, an RL algorithm, such as SAC, can learn an adaptation mechanism for different systems without modifying the hyperparameters.
Algorithm 1: Overall algorithm
1 Identify a suitable controller for the system of interest
2 Select critical tuning parameters, $\theta_t$
3 Collect a set of data, $D = \{(x_t, \theta_t, x_{t+1})\}$, for the safety filter and learn a constraint function, $c_i(x_t, \theta_t)$
4 for episode = 1, end do
5     Reset training environment with randomly generated system parameters
6     for t = 0, $t_{final}$ do
7         Observe the plant output, $x_t$
8         RL policy selects a set of tuning parameters, $\pi_\phi(x_t)$ (Eqn. 10)
9         Safety layer determines if $\pi_\phi(x_t)$ is safe and outputs $\theta_t$ (Eqn. 9)
10        Apply updated parameters to the controller
11        A reward is calculated and fed back to the RL agent (Eqn. 16)
12    end
13 end
14 Terminate training when the desired performance is reached
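A compact skeleton of Algorithm 1 is sketched below; the plant, policy, safety filter, and reward are stand-in stubs for the thermoacoustic simulator, the trained SAC actor, the learned safety layer, and Eq. (16), respectively, and all numerical values are assumptions made for illustration.

```python
# Runnable skeleton of Algorithm 1 with stub components (not the paper's models).
import numpy as np

def reset_env():
    """Step 5: reset with a randomly generated system parameter (here, tau_f)."""
    tau_f = np.random.uniform(0.8e-4, 1e-3)
    return 0.01, tau_f                       # initial pressure magnitude, system delay

def plant_step(x, theta, tau_f):
    """Stub plant: pressure magnitude decays faster for better-tuned parameters."""
    return x * (0.95 - 0.05 * np.tanh(theta[0])) + 1e-4 * np.random.randn()

def rl_policy(x):
    """Step 8: stub for pi_phi(x_t); a trained SAC actor would be queried here."""
    return np.array([np.random.uniform(0.0, 10.0), np.random.uniform(2.0e-4, 2.5e-4)])

def safety_filter(theta, x):
    """Step 9: stub safety layer; here a simple clip to a known-safe parameter box."""
    return np.clip(theta, [0.0, 2.0e-4], [10.0, 2.5e-4])

def reward(x, theta):
    """Step 11: stub for Eq. (16)."""
    return (10.0 if abs(x) < 2.5e-4 else -5.0) - 200.0 * abs(x) - 0.1 * theta[0]

T_FINAL = 100
for episode in range(10):                        # step 4: episode loop
    x, tau_f = reset_env()                       # step 5: randomized reset
    for t in range(T_FINAL):                     # step 6: time loop
        theta = safety_filter(rl_policy(x), x)   # steps 7-9: observe, propose, filter
        x = plant_step(x, theta, tau_f)          # step 10: apply parameters, advance plant
        r = reward(x, theta)                     # step 11: reward fed back to the agent
```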
III. APPLICATION: CONTROL OF THERMOACOUSTIC INSTABILITIES
The study employs Algorithm 1 to mitigate thermoacoustic instabilities, utilizing a phase shift controller as the selected method of control. This section provides an explanation of the nonlinear model employed in the study. The phase shift controller and the associated methods are then described.
The results of the study are presented and a comparison with alternative approaches is conducted. Table I clarifies the constituent components of the CMDP problem, which is formulated as a Reinforcement Learning problem.
TABLE I: CMDP Definition
Component              Definition
State (observations)   Magnitude of pressure oscillations
Action (control)       Controller parameters (gain and time delay)
Transition function    Unknown (data obtained from simulation or experiment)
Reward                 Defined by Equation 16
Constraint             Maximum allowed pressure in combustor
A. Modeling thermoacoustic instabilities
Combustion instabilities arise due to the coupling of unsteady heat release and acoustic waves produced during combustion. Detailed dynamic modeling that captures all of the interactions in such a complex system is challenging.
However, a combined numerical and analytical approach for modeling a laminar conical premixed flame that captures the essential combustion dynamics was selected to demonstrate our control scheme.
An empirical flame model that relates the instantaneous heat release rate, $\dot{q}(t)$, to linear perturbations in $v_1(t)$, the flow velocity at the burner inlet, is written in Equation 12
$\dfrac{dq}{dt} + q(t) = \dfrac{\bar{q}\, \mathcal{L}\, v_1(t - \tau_f)}{\bar{v}_1}$  (12)
where $\bar{q}$ is the mean heat release, $\bar{v}_1$ is the mean flow, $\tau_f$ is the flame time delay, and $\mathcal{L}$ is a nonlinear function that describes the saturation of the heat release rate [21]. The form of the saturation function, $\mathcal{L}$, is proposed by Li and Morgans [22] as
$\mathcal{L} = \dfrac{1}{\hat{v}/\bar{v}} \int_0^{\hat{v}/\bar{v}} \dfrac{1}{1 + (\xi + \alpha)^{\beta}}\, d\xi$  (13)
where $\alpha$ and $\beta$ are two coefficients that determine the shape of the nonlinear model, $\xi$ is a dummy variable, and the circumflex denotes the signal amplitude. Qualitatively, the relation between $\hat{\dot{q}}$ and $\hat{v}$ is linear for weak velocity perturbations. For stronger velocity perturbations, on the other hand, $\mathcal{L}$ decreases, and the heat release rate begins to saturate. The time delay is described by a nonlinear model as follows:
$\tau_f = \tau_{f0} + \tau_{fN}(1 - \mathcal{L})$  (14)
where $\tau_{f0}$ is the time delay when $\hat{v} = 0$, and $\tau_{fN}$ is a time delay that describes the change of $\tau_f$ as $\mathcal{L}$ changes.
To relate the upstream and downstream acoustic waves to q(t), the equations of continuity of mass, momentum, and energy across the flame zone are combined with the ideal gas law. Performing a series of substitutions and treating the acoustic waves as linear, we can determine the time evolution of the thermoacoustic waves by numerical integration of the resulting expressions. For more details about the model, the reader is referred to [21] and [22]. The numerical values of the model parameters are reported in Table 1 of reference [22].
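For illustration, the flame submodel of Eqs. (12)–(14) alone can be integrated numerically as sketched below; the acoustic wave relations of [21], [22] are omitted, and all parameter values (alpha, beta, tau_f0, tau_fN, the mean quantities, and the forcing) are illustrative assumptions rather than the values of Table 1 in [22].

```python
# Sketch: forward-Euler integration of the flame submodel, Eqs. (12)-(14) only.
import numpy as np

alpha, beta = 0.3, 3.0          # shape coefficients of the saturation function (illustrative)
tau_f0, tau_fN = 3e-3, 2e-3     # base and amplitude-dependent flame delays [s] (illustrative)
q_bar, v_bar = 1.0, 1.0         # mean heat release and mean inlet velocity (illustrative)
dt, T = 1e-5, 0.05              # time step and simulation length [s]

def saturation_L(v_hat):
    """Eq. (13): average of 1/(1 + (xi + alpha)^beta) over xi in [0, v_hat/v_bar]."""
    r = max(v_hat / v_bar, 1e-9)
    xi = np.linspace(0.0, r, 200)
    return np.trapz(1.0 / (1.0 + (xi + alpha) ** beta), xi) / r

n = int(T / dt)
t = np.arange(n) * dt
v1 = 0.3 * np.sin(2 * np.pi * 250.0 * t)     # prescribed inlet-velocity perturbation
q = np.zeros(n)
v_amp = 0.0
for k in range(1, n):
    v_amp = max(v_amp, abs(v1[k]))                 # simple running estimate of v_hat
    L = saturation_L(v_amp)
    tau_f = tau_f0 + tau_fN * (1.0 - L)            # Eq. (14): amplitude-dependent delay
    k_del = max(0, k - int(round(tau_f / dt)))     # index of the delayed velocity sample
    # Eq. (12), forward-Euler step: dq/dt = -q + q_bar * L * v1(t - tau_f) / v_bar
    q[k] = q[k - 1] + dt * (-q[k - 1] + q_bar * L * v1[k_del] / v_bar)
```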
Remark 2: Time delay (τ) as a controller parameter is different from model time delay (τf) in Equation 12.
B. Phase shift control
Phase shift control (Equation 15) is the most prevalent approach in active combustion control [23]. In this feedback system, a pressure transducer monitors the unsteady flow, and the signal is then phase-shifted (time-delayed), amplified, and then used to actuate a loudspeaker or fuel injectors to attenuate combustion instabilities.
$u(t) = p_2(t)\, K e^{i\omega\tau}$  (15)
where $p_2$ is the pressure downstream of the flame source, $K$ is the gain, $i$ is the imaginary unit, $\omega$ is the angular frequency, and $\tau$ is the time delay. The controller parameters can be obtained analytically or empirically for a single operating condition. The challenge, however, is that large industrial-scale combustors operate over a wide range of conditions. This challenge has motivated the development of adaptive controllers [24], [25].
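In the time domain, the phase-shift law of Eq. (15) amounts to a gain applied to a delayed copy of the measured pressure. A minimal sketch of such a controller, with illustrative gain, delay, and sample time, is given below.

```python
# Sketch of a time-domain phase-shift (gain plus delay) controller for Eq. (15).
import numpy as np
from collections import deque

class PhaseShiftController:
    def __init__(self, K, tau, dt):
        self.K = K                                              # controller gain
        self.buf = deque([0.0] * max(1, int(round(tau / dt))))  # delay line of tau/dt samples

    def update(self, K, tau, dt):
        """Apply new parameters theta = [K, tau] from the adaptation policy (resets the delay line)."""
        self.__init__(K, tau, dt)

    def step(self, p2):
        """Return u(t) = K * p2(t - tau) for one new pressure sample."""
        self.buf.append(p2)
        return self.K * self.buf.popleft()

ctrl = PhaseShiftController(K=5.0, tau=2.2e-4, dt=1e-5)
u = [ctrl.step(p) for p in np.sin(2 * np.pi * 250.0 * np.arange(0.0, 0.01, 1e-5))]
```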
C. Learning system setup
The dynamic system is reset with a randomly generated system parameter ($\tau_f$ in Eqn. 12) at the beginning of each episode to simulate varying dynamics. In this work, $\tau_f$ is randomly selected from $[0.8 \times 10^{-4}, 1 \times 10^{-3}]$. The input to the policy neural network is the magnitude of pressure oscillations ($x_t = p_2(t)$), while the outputs are the controller parameters ($\theta_t = [K, \tau]$). The neural network architecture consists of two hidden layers for each of the actor and critic networks; one layer has 400 units, and the other has 300 units. Table II presents the SAC algorithm hyperparameters.
TABLE II: Hyperparameters selected to train the RL policy with the soft actor-critic (SAC) algorithm
Parameter                      Value
Optimizer                      Adam
Sample time                    0.01
Learning rate                  1e-3
Discount factor                0.99
Replay buffer size             1e6
Target smoothing coefficient   1e-3
Minibatch size                 32
Target update interval         1
Gradient steps                 1
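A sketch of an SAC actor with the architecture described above (two hidden layers of 400 and 300 units, a one-dimensional observation, and a two-dimensional action) is given below, assuming PyTorch; the tanh squashing and the output rescaling convention are assumptions for illustration, not details reported here.

```python
# Sketch of an SAC actor network matching the stated 400/300 hidden-layer sizes.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim=1, act_dim=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
        )
        self.mu = nn.Linear(300, act_dim)        # mean of the Gaussian policy
        self.log_std = nn.Linear(300, act_dim)   # state-dependent log standard deviation

    def forward(self, obs):
        h = self.body(obs)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-20, 2)
        dist = torch.distributions.Normal(mu, log_std.exp())
        raw = dist.rsample()                     # reparameterized sample
        return torch.tanh(raw)                   # squashed to [-1, 1]; rescale to [K, tau] ranges outside

actor = Actor()
theta_normalized = actor(torch.tensor([[0.01]]))  # observation: pressure-oscillation magnitude
```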
The reward function described in Equation 16 is designed to minimize the amplitude of the pressure fluctuations with minimum controller gain (to minimize actuation effort). Here, the reward is increased by ten if the magnitude of pressure oscillations at time $t$ is less than a predefined value, $\delta$.
$r(x_t, \theta_t) = r_1 - r_2 - 200\, x_t - 0.1\, \theta_t$  (16)
where $r_1$ and $r_2$ are defined as follows, and $\delta$ is set to $2.5 \times 10^{-4}$.
$r_1 = \begin{cases} 10 & \text{if } x_t < \delta \\ 0 & \text{if } x_t > \delta \end{cases}$  and  $r_2 = \begin{cases} 5 & \text{if } x_t > \delta \\ 0 & \text{if } x_t < \delta \end{cases}$  (17)
The input to the constraint (safety layer) neural network is the magnitude of pressure oscillations and the controller parameters at time $t$, while the output is the magnitude of the pressure oscillations at time $t+1$. This neural network consists of three hidden layers with 100 units each and a ReLU non-linearity. The safety bounds for the magnitude of pressure oscillations are $[-0.04, 0.04]$. That is, the safety layer shall not allow a choice of controller parameters that leads to pressure oscillations of a magnitude outside these bounds during the learning process.
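The reward of Eqs. (16)–(17) can be written directly as in the sketch below; applying the $-0.1\,\theta_t$ term to the controller gain $K$ (the first entry of $\theta_t$) is our interpretation of the penalty on actuation effort.

```python
# Sketch of the reward in Eqs. (16)-(17); penalizing the gain K is an interpretation.
def reward(x_t, theta_t, delta=2.5e-4):
    r1 = 10.0 if x_t < delta else 0.0    # bonus for small pressure oscillations
    r2 = 5.0 if x_t > delta else 0.0     # penalty for large pressure oscillations
    return r1 - r2 - 200.0 * x_t - 0.1 * theta_t[0]   # Eq. (16)

r = reward(x_t=1e-4, theta_t=[5.0, 2.2e-4])
```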
D. Results and discussion
1) Policy learning: Figure 2 shows the learning curve of the RL agent with SAC. The average reward at each episode increases as training progresses. We also observe that introducing a safety layer does not negatively impact the learning performance of the RL agent, which is consistent with the findings of Dalal et al. [19]. Training is stopped when no further increase in the reward is observed, to avoid overfitting.
Fig. 2: Learning curve for training a controller parameter adjusting policy with reinforcement learning, shown with and without the safety layer
2) Instability attenuation performance: Initially, we assess the capability of the RL-based adaptation mechanism in preserving thermoacoustic stability in a time-varying system. The system time delay ($\tau_f$) initially starts at $1 \times 10^{-3}$ and then abruptly changes to $1 \times 10^{-4}$ at $t = 1.5$ sec. Although a phase-shift controller that is optimized for the initial state of the system is able to stabilize the flame, it fails to do so during the transition to a different state, as depicted in Figure 3. In contrast, our proposed approach demonstrates its efficacy in maintaining flame stability.
Fig. 3: Instability attenuation performance of the RL-based adaptation mechanism. Top: the pressure magnitude ($p_2$) when the adaptive scheme is turned on and when it is not. Middle: controller parameters ($K$ and $\tau$) for both scenarios. Bottom: system parameter change.
3) Robustness to noise: The robustness of the RL-based controller in the presence of noise has been tested by adding white Gaussian noise to the measured output. The noise has zero mean, a variance of $1 \times 10^{-5}$, and a sample time of $1 \times 10^{-5}$. Figure 4 shows that the controller can stabilize the system despite the noise in the measured amplitude of pressure oscillations.
Fig. 4: The robustness of the proposed scheme to measurement noise. Top: the pressure magnitude ($p_2$) when the adaptive scheme is turned on and when it is not. Bottom: controller parameters ($K$ and $\tau$) for both scenarios.

E. Comparison to other controllers

The RL-based scheme presented in this work is compared to three other established controllers: first, a model-based robust controller designed using the H-$\infty$ loop-shaping Glover-McFarlane method and the $\nu$-gap metric, as described in [26]; second, a model-based self-tuning regulator (STR), as described in [27]; and finally, a model-free extremum seeking controller (ESC), as in [5].
Algorithm 1 is implemented to learn a policy that minimizes the pressure oscillations at varying system time delays, and the four control strategies are then evaluated over this range. The flame model's time delay depends on operating conditions, such as the equivalence ratio and the preheat temperature. Figure 5 shows that an effective controller parameter adaptation policy was learned, wherein the pressure oscillations are attenuated at various time delays. The other adaptive controllers are not able to achieve the same performance across this wide range of time delays. The three reference controllers are tuned for a system time delay of $1 \times 10^{-3}$.
Fig. 5: Comparing the performance of different controllers (RL, ESC, STR, H-infinity) in attenuating pressure oscillations at varying system time delay ($\tau_f$).
IV. CONCLUSION
We developed a learning-based adaptation mechanism for controller parameters that considers the significance of maintaining the system’s safety during the learning process.
The application of the algorithm to the problem of controlling combustion instabilities demonstrates the efficacy of the method and the practical impact it can have on operating novel gas turbines under a wide range of conditions. This learning-based scheme performs as well as or better than other adaptive controllers. Finally, we note that while we apply a soft actor-critic reinforcement learning algorithm and a constraint function based on a neural network, other RL algorithms and types of safety constraints can be implemented under the same framework. Since the learning efficiency and safety of RL algorithms are areas of accelerating progress, this will allow for broader application of this framework and better performance.
REFERENCES
[1] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal Control. John Wiley & Sons, 2012.
[2] R. S. Sutton, A. G. Barto, and R. J. Williams, "Reinforcement learning is direct adaptive optimal control," IEEE Control Systems Magazine, vol. 12, no. 2, pp. 19–22, 1992.
[3] K. J. Åström, U. Borisson, L. Ljung, and B. Wittenmark, "Theory and applications of self-tuning regulators," Automatica, vol. 13, no. 5, pp. 457–476, 1977.
[4] K. Glover and J. C. Doyle, "State-space formulae for all stabilizing controllers that satisfy an H-infinity-norm bound and relations to risk sensitivity," Systems & Control Letters, vol. 11, no. 3, pp. 167–172, 1988.
[5] J. Moeck, M. Bothien, C. Paschereit, G. Gelbert, and R. King, "Two-parameter extremum seeking for control of thermoacoustic instabilities and characterization of linear growth," in 45th AIAA Aerospace Sciences Meeting and Exhibit, 2007, p. 1416.
[6] M. Krstic and H.-H. Wang, "Stability of extremum seeking feedback for general nonlinear dynamic systems," Automatica, vol. 36, no. 4, pp. 595–601, 2000.
[7] K. B. Ariyur and M. Krstic, Real-Time Optimization by Extremum-Seeking Control. John Wiley & Sons, 2003.
[8] T. P. Lillicrap, J. J. Hunt, A. Pritzel, et al., "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[9] V. Mnih, K. Kavukcuoglu, D. Silver, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[10] B. Recht, "A tour of reinforcement learning: The view from continuous control," Annual Review of Control, Robotics, and Autonomous Systems, vol. 2, pp. 253–279, 2019.
[11] M. G. Bellemare, S. Candido, P. S. Castro, et al., "Autonomous navigation of stratospheric balloons using reinforcement learning," Nature, vol. 588, no. 7836, pp. 77–82, 2020.
[12] A. P. Dowling and A. S. Morgans, "Feedback control of combustion oscillations," Annu. Rev. Fluid Mech., vol. 37, pp. 151–182, 2005.
[13] K. McManus, T. Poinsot, and S. M. Candel, "A review of active control of combustion instabilities," Progress in Energy and Combustion Science, vol. 19, no. 1, pp. 1–29, 1993.
[14] T. C. Lieuwen and V. Yang, Combustion Instabilities in Gas Turbine Engines: Operational Experience, Fundamental Mechanisms, and Modeling. American Institute of Aeronautics and Astronautics, 2005.
[15] T. Lieuwen, M. Chang, and A. Amato, "Stationary gas turbine combustion: Technology needs and policy considerations," Combustion and Flame, vol. 160, no. 8, pp. 1311–1314, 2013.
[16] E. Bøhn, S. Gros, S. Moe, and T. A. Johansen, "Reinforcement learning of the prediction horizon in model predictive control," IFAC-PapersOnLine, vol. 54, no. 6, pp. 314–320, 2021.
[17] M. Sedighizadeh and A. Rezazadeh, "Adaptive PID controller based on reinforcement learning for wind turbine control," in Proceedings of World Academy of Science, Engineering and Technology, vol. 27, 2008, pp. 257–262.
[18] R. Sun, M. L. Greene, D. M. Le, Z. I. Bell, G. Chowdhary, and W. E. Dixon, "Lyapunov-based real-time and iterative adjustment of deep neural networks," IEEE Control Systems Letters, vol. 6, pp. 193–198, 2021.
[19] G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa, "Safe exploration in continuous action spaces," arXiv preprint arXiv:1801.08757, 2018.
[20] T. Haarnoja, A. Zhou, K. Hartikainen, et al., "Soft actor-critic algorithms and applications," arXiv preprint arXiv:1812.05905, 2018.
[21] A. P. Dowling, "Nonlinear self-excited oscillations of a ducted flame," Journal of Fluid Mechanics, vol. 346, pp. 271–290, 1997.
[22] J. Li and A. S. Morgans, "Time domain simulations of nonlinear thermoacoustic behaviour in a simple combustor using a wave-based approach," Journal of Sound and Vibration, vol. 346, pp. 345–360, 2015.
[23] A. M. Annaswamy and A. F. Ghoniem, "Active control of combustion instability: Theory and practice," IEEE Control Systems Magazine, vol. 22, no. 6, pp. 37–54, 2002.
[24] A. Banaszuk, K. B. Ariyur, M. Krstić, and C. A. Jacobson, "An adaptive algorithm for control of combustion instability," Automatica, vol. 40, no. 11, pp. 1965–1972, 2004.
[25] S. Evesque and A. Dowling, "Adaptive control of combustion oscillations," in 4th AIAA/CEAS Aeroacoustics Conference, 1998, p. 2351.
[26] J. Li and A. S. Morgans, "Feedback control of combustion instabilities from within limit cycle oscillations using H-infinity loop-shaping and the ν-gap metric," Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 472, no. 2191, p. 20150821, 2016.
[27] S. Evesque, A. P. Dowling, and A. M. Annaswamy, "Self-tuning regulators for combustion oscillations," Proceedings of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, vol. 459, no. 2035, pp. 1709–1749, 2003.