Online reinforcement learning of controller parameters adaptation law
Khalid Alhazmi1 and S. Mani Sarathy1
Abstract— Real-time control of highly nonlinear systems is a challenging task in many industrial processes. Here, we propose a learning-based adaptation law for adapting the controller parameters of nonlinear systems. The method applies model-free reinforcement learning to learn an effective parameter adaptation law while maintaining safe system operation by including a safety layer. The efficacy of the proposed algorithm is demonstrated by controlling thermoacoustic combustion instability, which is a critical issue in developing high-efficiency, low-emission gas turbine technologies. We show that the learning-based mechanism is able to attenuate combustion instabilities in a time-variant system in the presence of process noise. The proposed algorithm outperforms other model-free and model-based adaptation methods, such as extremum seeking controllers and self-tuning regulators, respectively.
Keywords: Nonlinear systems, adaptive control, reinforcement learning, combustion instabilities
I. INTRODUCTION
Many industrial processes and energy systems are operated at sub-optimal conditions. The highly nonlinear nature of certain phenomena (e.g., chemical reactions coupled to fluid mechanics) makes process control and optimization challenging. Modeling the dynamics in these systems is often inaccurate and incomplete, resulting in process disturbances.
In addition, it is common for the system dynamics and control objectives to change with time. Thus, there is a need for a class of controllers that can learn online to control unknown systems using data measured in real-time [1].
One approach is to estimate the system parameters through system identification and then derive a control law, an approach known as indirect adaptive control. However, such an approach requires recomputing controls from estimated system models at each step, which is inherently complex.
An alternative approach, known as direct adaptive control, in which controller parameters are estimated directly, is more attractive [2].
Of particular interest to us are nonlinear problems with two characteristics: (1) a time-varying reference trajectory or set point, and (2) lack of sufficiently accurate reduced-order models, which are required for the design of well-known direct adaptive controllers, such as H-infinity and self-tuning regulators [3], [4]. An instance of a problem with these features is the case of finding the optimum fuel-air ratio in gas turbine combustion systems susceptible to thermoacoustic instabilities; proper control is critical for ensuring stable, efficient, and low-emission performance.
1King Abdullah University of Science and Technology (KAUST), Clean Combustion Research Center (CCRC), Physical Science and Engineering Division (PSE), Thuwal 23955-6900, Saudi Arabia {khalid.alhazmi, mani.sarathy}@kaust.edu.sa
Such a task is challenging due to time variations in fuel (or air) temperature and quality, for instance. While combustor models exist, no reduced-order model is sufficiently general, as each applies only to a narrow window of operating conditions [5].
A successful adaptive control method for the above-mentioned problems is extremum seeking control (ESC) [6], [7].
ESC aims to find and maintain the optimum operating point by perturbing the plant and using the response to estimate the gradient of an objective function. ESC can be used as a controller or to tune the parameters of a working controller.
Despite its success, ESC has several limitations. First, it requires continuous perturbation of the system. Second, careful selection of several ESC tuning parameters is required to achieve satisfactory performance. Such parameters include, but are not limited to, the frequency and amplitude of high- and low-pass filters, the frequency and amplitude of the perturbation signal, an integrator gain, and a good initial guess. While ESC is theoretically model-free, a model is needed in practice to select proper tuning parameters.
An alternative method to ESC is reinforcement learning (RL), which can be seen as direct adaptive control, as Sutton et al. point out [2]. RL has been extensively used in continuous control tasks, but most applications involve using RL as a controller [8]–[11]. In this work, we propose using RL as an adaptation mechanism for general working controllers. A safety layer that filters out unsafe actions by the RL policy is added to ensure that system constraints are never violated during learning. We consider the problem of a time-varying dynamic system controlled by an adjustable controller; because the system is time-varying, the controller parameters must be dynamically adjusted to achieve the desired performance.
To demonstrate the utility and effectiveness of the proposed method, we apply it to the control of thermoacoustic combustion instabilities [12], [13]. An aspect of great significance in gas turbine combustors is dynamic flame stability [14]. The susceptibility of flames to become unstable due to the coupling of the unsteady heat release and acoustic waves inside the combustor has been one of the principal challenges in developing modern high-efficiency, low-emission gas turbine combustors in recent decades. Combustion instabilities are generally considered one of the highest risk items in new engine development [15]. Introducing fuel variability with novel zero-carbon fuels (e.g., ammonia, hydrogen) amplifies the uncertainty in combustor–acoustic interactions; therefore, flame dynamic stability is a vital issue in developing carbon-free gas turbine technologies.
Reinforcement learning has been used before for tuning the parameters of specific working controllers, such as PID and model predictive controllers (MPC) [16], [17]. We show here that this does not have to be the case: we propose a framework that applies to a broad class of adjustable controllers while also considering the safety implications of online system exploration by RL. Approaches that rely on recorded input and output data also exist [18]. Further, prior publications either do not consider the safety of the RL actions or encode safety only as a penalty in the reward function, which means that applying RL for parameter tuning must be done on a case-by-case basis.
On the other hand, our algorithm can be applied regardless of the controller with only weak assumptions. Finally, we apply our framework to the problem of attenuating thermoacoustic flame instabilities. To our knowledge, RL has not been applied to the active control of flame instabilities.
We show that our proposed framework performs as well as or better than model-free and model-based methods, such as extremum seeking controllers, self-tuning regulators, and H-infinity robust controllers.
II. A SAFE REINFORCEMENT LEARNING SCHEME FOR PARAMETER ADAPTATION
A. Problem statement
We consider a general nonlinear model
$x_{t+1} = f(x_t, u_t, d_t)$  (1)
where $x \in \mathbb{R}^n$ is the state, $u \in \mathbb{R}^m$ is the input, $d \in \mathbb{R}^{d}$ is an unknown disturbance, and $f : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ and $h : \mathbb{R}^n \to \mathbb{R}$ are smooth. Given a smooth control law
$u_t = \alpha(x_t; \theta_t)$  (2)
parametrized by a parameter $\theta \in \mathbb{R}^{n_\theta}$, the closed-loop system
$x_{t+1} = f(x_t, \alpha(x_t; \theta_t))$  (3)
has equilibria corresponding to each $\theta_t$. The problem is to select $\theta_t$ that optimizes the control objective. We assume that a set of controller parameters exists under which the closed-loop system is stable, and its performance is optimal or near-optimal for a performance measure. The main objective of this work is to learn a controller parameter adaptation law (policy), $\pi$, such that
$\theta_{t+1} = \pi(x_t)$  (4)
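For illustration, the closed loop defined by Eqs. (1)–(4) can be sketched as follows. This is a minimal sketch assuming NumPy; the dynamics, control law, and adaptation policy shown are illustrative placeholders, not the systems or policies considered in this paper.

```python
# Minimal sketch of the closed loop in Eqs. (1)-(4); all functions are
# illustrative placeholders.
import numpy as np

def f(x, u, d):
    """Unknown plant dynamics x_{t+1} = f(x_t, u_t, d_t) (Eq. 1); toy example."""
    return 0.9 * x + 0.1 * u + d

def alpha(x, theta):
    """Adjustable control law u_t = alpha(x_t; theta_t) (Eq. 2); here a simple gain."""
    return -theta[0] * x

def pi(x):
    """Parameter adaptation law theta_{t+1} = pi(x_t) (Eq. 4); to be learned by RL."""
    return np.array([1.0 + 0.5 * abs(x)])   # placeholder adaptation rule

x, theta = 1.0, np.array([1.0])
for t in range(50):
    d = 0.01 * np.random.randn()    # unknown disturbance d_t
    u = alpha(x, theta)             # compute control input (Eq. 2)
    x = f(x, u, d)                  # advance the closed-loop system (Eq. 3)
    theta = pi(x)                   # adapt the controller parameters (Eq. 4)
```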
B. The RL problem
The proposed controller parameter adaptation scheme consists of two parts: the reinforcement learning policy that maps the state of the plant to the controller parameters and a safety layer that ensures that the controller parameters updated by the RL policy do not lead to instabilities. The scheme is summarized in Figure 1.
Fig. 1: A safe reinforcement learning scheme for adjusting controller parameters
We start by formulating the reinforcement learning problem as a constrained Markov decision process (CMDP). While a Markov decision process (MDP) introduces a single utility (reward or cost function) consisting of different objectives to be optimized, a CMDP considers a situation where one type of cost is to be optimized while keeping the other types of costs below some given bounds. A CMDP is defined by a tuple:
$\mathcal{M} = \langle x, u, f, r, C, \gamma \rangle$  (5)
where $x$, $u$, and $f$ are as defined previously, $r : x \times u \to \mathbb{R}$ is a reward function, and $C = \{c_i : x \times u \to \mathbb{R} \mid i \in [K]\}$ is a set of immediate-constraint functions, where $[K]$ is defined as $\{1, \ldots, K\}$, representing each constraint, and $\gamma \in (0,1)$ is a discount factor. We also define $\bar{C} = \{\bar{c}_i : x \to \mathbb{R} \mid i \in [K]\}$ as the immediate-constraint values per state.
The objective is to find a parameterized policy $\pi_\phi$ that maps the system output to the controller parameters, $\theta$, by solving the following policy optimization problem
$\max_{\pi_\phi} \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r(x_t, \theta_t) \right]$  (6a)
s.t. $\bar{c}_i(x_t) \leq C_i \quad \forall i \in [K]$  (6b)
where, at each state, all constrained states $\bar{c}_i(\cdot)$ are upper bounded by a constant $C_i$, and $\pi_\phi$ is a policy represented by a neural network and parameterized by $\phi$, the weights and biases of the neural network.
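As a small illustration of the objective in Eq. (6), the following sketch evaluates the discounted return (6a) and checks the per-state constraints (6b) along a single rollout; the rewards, constraint values, and bound here are synthetic placeholders.

```python
# Sketch of evaluating the CMDP objective (6a) and constraint (6b) on one
# rollout; all numbers are synthetic stand-ins for data from the plant.
import numpy as np

gamma = 0.99
rewards = np.random.randn(200)                            # r(x_t, theta_t) along a rollout
constraint_values = 0.01 * np.abs(np.random.randn(200))   # c_bar_i(x_t) along the rollout
C_i = 0.04                                                # constraint bound (e.g., max pressure)

discounted_return = np.sum(gamma ** np.arange(len(rewards)) * rewards)   # objective (6a)
feasible = np.all(constraint_values <= C_i)                              # constraint (6b)
print(f"return = {discounted_return:.2f}, trajectory feasible: {feasible}")
```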
C. Construction of the constraint function
The optimization problem detailed in Equation 6 presents a significant challenge, as the RL agent requires exploration to acquire a satisfactory policy. A penalty in the reward function, $r$, for being in an unsafe state is, on its own, inadequate to guarantee safety, since the RL agent must have sufficient experience of such states to recognize them. Hence, incorporating prior knowledge of the dynamics before training is an essential requirement when implementing model-free reinforcement learning algorithms.
Following the work of Dalal et al. [19], a constraint function $c_i(x_t, \theta_t)$ is first linearized as follows
$\bar{c}_i(x_{t+1}) := c_i(x_t, \theta_t) \approx \bar{c}_i(x_t) + g(x_t; w_i)^T \theta_t$  (7)
where $g(x_t; w_i)$ is a neural network parametrized by $w_i$. Training the neural network is carried out by solving the following problem
$\arg\min_{w_i} \sum_{(x_t, \theta_t, x_{t+1}) \in D} \left( \bar{c}_i(x_{t+1}) - \left( \bar{c}_i(x_t) + g(x_t; w_i)^T \theta_t \right) \right)^2$  (8)
The data needed for training $g(x_t; w_i)$ is a set of tuples $D = \{(x_t, \theta_t, x_{t+1})\}$ and can be obtained from simulation or experiment. The function $g(x_t; w_i)$ represents the sensitivity of changes in the controlled states to the controls, using knowledge of the dynamics learned from data.
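A minimal sketch of the training problem in Eq. (8) is shown below, assuming PyTorch, a synthetic data set $D$, and an illustrative choice of the immediate-constraint map $\bar{c}_i$ (the pressure magnitude itself); the dimensions and network sizes are assumptions made for illustration.

```python
# Sketch of fitting the constraint-sensitivity network g(x; w_i) via Eq. (8).
import torch
import torch.nn as nn

n_x, n_theta = 1, 2                      # state and parameter dimensions (illustrative)

g = nn.Sequential(                        # g(x; w_i): maps the state to a sensitivity vector
    nn.Linear(n_x, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, n_theta),
)
opt = torch.optim.Adam(g.parameters(), lr=1e-3)

def c_bar(x):
    # Immediate-constraint value per state; here, simply the pressure magnitude.
    return x[:, :1]

# D = {(x_t, theta_t, x_{t+1})}: synthetic stand-in for simulation/experiment data.
x_t = torch.randn(1024, n_x)
theta = torch.randn(1024, n_theta)
x_tp1 = torch.randn(1024, n_x)

for epoch in range(200):
    pred = c_bar(x_t) + (g(x_t) * theta).sum(dim=1, keepdim=True)  # c_bar(x_t) + g(x_t)^T theta_t
    loss = ((c_bar(x_tp1) - pred) ** 2).mean()                     # least-squares loss of Eq. (8)
    opt.zero_grad(); loss.backward(); opt.step()
```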
The action taken by the RL policy is denoted by $\pi_\phi(x_t)$. The idea of the safety layer is to solve the following problem at each state
$\theta_t^* = \arg\min_{\theta_t} \; \tfrac{1}{2} \|\theta_t - \pi_\phi(x_t)\|^2$  (9a)
s.t. $\bar{c}_i(x_t) + g(x_t; w_i)^T \theta_t < C_i \quad \forall i \in [K]$  (9b)
where the constraint linearization determined earlier has been substituted. This safety layer aims to output a controller parameter, $\theta_t^*$, that is as close as possible to $\pi_\phi(x_t)$, the original parameter determined by the RL algorithm. The optimization problem has a quadratic objective and linear constraints, for which the global closed-form solution is readily obtainable and can be found in [19].
D. Solution algorithm
The RL algorithm that is selected to solve the problem described by Equation 6 is known as soft actor-critic (SAC) [20]. The SAC algorithm integrates three key elements:
an actor-critic architecture with separate policy and value function networks, an off-policy formulation that enables the reuse of previously collected data for efficiency, and entropy maximization to encourage stability and exploration.
Haarnoja et al. [20] found SAC to be more stable and scalable than other common RL algorithms, such as deep deterministic policy gradient (DDPG) [8]. SAC modifies the objective of the CMDP to include an entropy term, so the optimal policy is defined as follows
$\pi_\phi^* = \arg\max_{\pi_\phi} \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \left( r(x_t, \pi_\phi(x_t)) + \beta \mathcal{H}(\pi_\phi(\cdot \mid x_t)) \right) \right]$  (10)
where $\beta$ is a temperature parameter that determines the relative importance of the entropy term versus the reward, and the entropy of the policy, $\mathcal{H}(\pi_\phi(\cdot \mid x_t))$, is given by
$\mathcal{H}(\pi_\phi(\cdot \mid x_t)) = \mathbb{E}[-\log \pi_\phi(\theta_t \mid x_t)]$  (11)
The design of the reward function, $r(x_t, \theta_t)$, can be freely chosen depending on the control objective. Algorithm 1 provides an overview of the proposed scheme. In this algorithm, $t_{final}$ is a preset simulation time.
Remark 1: Unlike conventional adaptive controllers that require tuning for different dynamic systems, an RL algorithm, such as SAC, can learn an adaptation mechanism for different systems without modifying the hyperparameters.
Algorithm 1: Overall algorithm
1 Identify a suitable controller for the system of interest
2 Select critical tuning parameters, $\theta_t$
3 Collect a set of data, $D = \{(x_t, \theta_t, x_{t+1})\}$, for the safety filter and learn a constraint function, $c_i(x_t, \theta_t)$
4 for episode = 1, end do
5     Reset training environment with randomly generated system parameters
6     for t = 0, $t_{final}$ do
7         Observe the plant output, $x_t$
8         RL policy selects a set of tuning parameters, $\pi_\phi(x_t)$ (Eqn. 10)
9         Safety layer determines if $\pi_\phi(x_t)$ is safe and outputs $\theta_t$ (Eqn. 9)
10        Apply updated parameters to the controller
11        A reward is calculated and fed back to the RL agent (Eqn. 16)
12    end
13 end
14 Terminate training when the desired performance is reached
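A compact skeleton of Algorithm 1 is sketched below; the plant, policy, safety filter, and reward are stand-in stubs for the thermoacoustic simulator, the trained SAC actor, the learned safety layer, and Eq. (16), respectively, and all numerical values are assumptions made for illustration.

```python
# Runnable skeleton of Algorithm 1 with stub components (not the paper's models).
import numpy as np

def reset_env():
    """Step 5: reset with a randomly generated system parameter (here, tau_f)."""
    tau_f = np.random.uniform(0.8e-4, 1e-3)
    return 0.01, tau_f                       # initial pressure magnitude, system delay

def plant_step(x, theta, tau_f):
    """Stub plant: pressure magnitude decays faster for better-tuned parameters."""
    return x * (0.95 - 0.05 * np.tanh(theta[0])) + 1e-4 * np.random.randn()

def rl_policy(x):
    """Step 8: stub for pi_phi(x_t); a trained SAC actor would be queried here."""
    return np.array([np.random.uniform(0.0, 10.0), np.random.uniform(2.0e-4, 2.5e-4)])

def safety_filter(theta, x):
    """Step 9: stub safety layer; here a simple clip to a known-safe parameter box."""
    return np.clip(theta, [0.0, 2.0e-4], [10.0, 2.5e-4])

def reward(x, theta):
    """Step 11: stub for Eq. (16)."""
    return (10.0 if abs(x) < 2.5e-4 else -5.0) - 200.0 * abs(x) - 0.1 * theta[0]

T_FINAL = 100
for episode in range(10):                        # step 4: episode loop
    x, tau_f = reset_env()                       # step 5: randomized reset
    for t in range(T_FINAL):                     # step 6: time loop
        theta = safety_filter(rl_policy(x), x)   # steps 7-9: observe, propose, filter
        x = plant_step(x, theta, tau_f)          # step 10: apply parameters, advance plant
        r = reward(x, theta)                     # step 11: reward fed back to the agent
```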
III. APPLICATION: CONTROL OF THERMOACOUSTIC INSTABILITIES
The study employs Algorithm 1 to mitigate thermoacoustic instabilities, utilizing a phase shift controller as the selected method of control. This section provides an explanation of the nonlinear model employed in the study. The phase shift controller and the associated methods are then described.
The results of the study are presented and a comparison with alternative approaches is conducted. Table I clarifies the constituent components of the CMDP problem, which is formulated as a Reinforcement Learning problem.
TABLE I: CMDP Definition
Component              Definition
State (observations)   Magnitude of pressure oscillations
Action (control)       Controller parameters (gain and time delay)
Transition function    Unknown (data obtained from simulation or experiment)
Reward                 Defined by Equation 16
Constraint             Maximum allowed pressure in combustor
A. Modeling thermoacoustic instabilities
Combustion instabilities arise due to the coupling of unsteady heat release and acoustic waves produced during combustion. Detailed dynamic modeling that captures all of the interactions in such a complex system is challenging.
However, a combined numerical and analytical approach for modeling a laminar conical premixed flame that captures the essential combustion dynamics was selected to demonstrate our control scheme.
An empirical flame model that relates the instantaneous heat release rate, $\dot{q}(t)$, to linear perturbations in $v_1(t)$, the flow velocity at the burner inlet, is written in Equation 12
$\dfrac{dq}{dt} + q(t) = \dfrac{\bar{q}\, \mathcal{L}\, v_1(t - \tau_f)}{\bar{v}_1}$  (12)
where $\bar{q}$ is the mean heat release, $\bar{v}_1$ is the mean flow, $\tau_f$ is the flame time delay, and $\mathcal{L}$ is a nonlinear function that describes the saturation of the heat release rate [21]. The form of the saturation function, $\mathcal{L}$, is proposed by Li and Morgans [22] as
$\mathcal{L} = \dfrac{1}{\hat{v}/\bar{v}} \int_0^{\hat{v}/\bar{v}} \dfrac{1}{1 + (\xi + \alpha)^{\beta}}\, d\xi$  (13)
where $\alpha$ and $\beta$ are two coefficients that determine the shape of the nonlinear model, $\xi$ is a dummy variable, and the circumflex denotes the signal amplitude. Qualitatively, the relation between $\hat{\dot{q}}$ and $\hat{v}$ is linear for weak velocity perturbations. For stronger velocity perturbations, on the other hand, $\mathcal{L}$ decreases, and the heat release rate begins to saturate. The time delay is described by a nonlinear model as follows:
$\tau_f = \tau_{f0} + \tau_{fN}(1 - \mathcal{L})$  (14)
where $\tau_{f0}$ is the time delay when $\hat{v} = 0$, and $\tau_{fN}$ is a time delay that describes the change of $\tau_f$ as $\mathcal{L}$ changes.
To relate the upstream and downstream acoustic waves to q(t), the equations of continuity of mass, momentum, and energy across the flame zone are combined with the ideal gas law. Performing a series of substitutions and treating the acoustic waves as linear, we can determine the time evolution of the thermoacoustic waves by numerical integration of the resulting expressions. For more details about the model, the reader is referred to [21] and [22]. The numerical values of the model parameters are reported in Table 1 of reference [22].
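For illustration, the flame submodel of Eqs. (12)–(14) alone can be integrated numerically as sketched below; the acoustic wave relations of [21], [22] are omitted, and all parameter values (alpha, beta, tau_f0, tau_fN, the mean quantities, and the forcing) are illustrative assumptions rather than the values of Table 1 in [22].

```python
# Sketch: forward-Euler integration of the flame submodel, Eqs. (12)-(14) only.
import numpy as np

alpha, beta = 0.3, 3.0          # shape coefficients of the saturation function (illustrative)
tau_f0, tau_fN = 3e-3, 2e-3     # base and amplitude-dependent flame delays [s] (illustrative)
q_bar, v_bar = 1.0, 1.0         # mean heat release and mean inlet velocity (illustrative)
dt, T = 1e-5, 0.05              # time step and simulation length [s]

def saturation_L(v_hat):
    """Eq. (13): average of 1/(1 + (xi + alpha)^beta) over xi in [0, v_hat/v_bar]."""
    r = max(v_hat / v_bar, 1e-9)
    xi = np.linspace(0.0, r, 200)
    return np.trapz(1.0 / (1.0 + (xi + alpha) ** beta), xi) / r

n = int(T / dt)
t = np.arange(n) * dt
v1 = 0.3 * np.sin(2 * np.pi * 250.0 * t)     # prescribed inlet-velocity perturbation
q = np.zeros(n)
v_amp = 0.0
for k in range(1, n):
    v_amp = max(v_amp, abs(v1[k]))                 # simple running estimate of v_hat
    L = saturation_L(v_amp)
    tau_f = tau_f0 + tau_fN * (1.0 - L)            # Eq. (14): amplitude-dependent delay
    k_del = max(0, k - int(round(tau_f / dt)))     # index of the delayed velocity sample
    # Eq. (12), forward-Euler step: dq/dt = -q + q_bar * L * v1(t - tau_f) / v_bar
    q[k] = q[k - 1] + dt * (-q[k - 1] + q_bar * L * v1[k_del] / v_bar)
```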
Remark 2: Time delay (τ) as a controller parameter is different from model time delay (τf) in Equation 12.
B. Phase shift control
Phase shift control (Equation 15) is the most prevalent approach in active combustion control [23]. In this feedback system, a pressure transducer monitors the unsteady flow, and the signal is then phase-shifted (time-delayed), amplified, and then used to actuate a loudspeaker or fuel injectors to attenuate combustion instabilities.
$u(t) = p_2(t)\, K e^{i\omega\tau}$  (15)
where $p_2$ is the pressure downstream of the flame source, $K$ is the gain, $i$ is the imaginary unit, $\omega$ is the angular frequency, and $\tau$ is the time delay. The controller parameters can be obtained analytically or empirically for a single operating condition. The challenge, however, is that large industrial-scale combustors operate over a wide range of conditions. This challenge has motivated the development of adaptive controllers [24], [25].
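In the time domain, the phase-shift law of Eq. (15) amounts to a gain applied to a delayed copy of the measured pressure. A minimal sketch of such a controller, with illustrative gain, delay, and sample time, is given below.

```python
# Sketch of a time-domain phase-shift (gain plus delay) controller for Eq. (15).
import numpy as np
from collections import deque

class PhaseShiftController:
    def __init__(self, K, tau, dt):
        self.K = K                                              # controller gain
        self.buf = deque([0.0] * max(1, int(round(tau / dt))))  # delay line of tau/dt samples

    def update(self, K, tau, dt):
        """Apply new parameters theta = [K, tau] from the adaptation policy (resets the delay line)."""
        self.__init__(K, tau, dt)

    def step(self, p2):
        """Return u(t) = K * p2(t - tau) for one new pressure sample."""
        self.buf.append(p2)
        return self.K * self.buf.popleft()

ctrl = PhaseShiftController(K=5.0, tau=2.2e-4, dt=1e-5)
u = [ctrl.step(p) for p in np.sin(2 * np.pi * 250.0 * np.arange(0.0, 0.01, 1e-5))]
```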
C. Learning system setup
The dynamic system is reset with a randomly generated system parameter ($\tau_f$ in Eqn. 12) at the beginning of each episode to simulate varying dynamics. In this work, $\tau_f$ is randomly selected from $[0.8 \times 10^{-4}, 1 \times 10^{-3}]$. The input to the policy neural network is the magnitude of pressure oscillations ($x_t = p_2(t)$), while the outputs are the controller parameters ($\theta_t = [K, \tau]$). The neural network architecture consists of two hidden layers for each of the actor and critic networks; one layer has 400 units, and the other has 300 units. Table II presents the SAC algorithm hyperparameters.
TABLE II: Hyperparameters selected to train the RL policy with the soft actor-critic (SAC) algorithm
Parameter                      Value
Optimizer                      Adam
Sample time                    0.01
Learning rate                  1e-3
Discount factor                0.99
Replay buffer size             1e6
Target smoothing coefficient   1e-3
Minibatch size                 32
Target update interval         1
Gradient steps                 1
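A sketch of an SAC actor with the architecture described above (two hidden layers of 400 and 300 units, a one-dimensional observation, and a two-dimensional action) is given below, assuming PyTorch; the tanh squashing and the output rescaling convention are assumptions for illustration, not details reported here.

```python
# Sketch of an SAC actor network matching the stated 400/300 hidden-layer sizes.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim=1, act_dim=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
        )
        self.mu = nn.Linear(300, act_dim)        # mean of the Gaussian policy
        self.log_std = nn.Linear(300, act_dim)   # state-dependent log standard deviation

    def forward(self, obs):
        h = self.body(obs)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-20, 2)
        dist = torch.distributions.Normal(mu, log_std.exp())
        raw = dist.rsample()                     # reparameterized sample
        return torch.tanh(raw)                   # squashed to [-1, 1]; rescale to [K, tau] ranges outside

actor = Actor()
theta_normalized = actor(torch.tensor([[0.01]]))  # observation: pressure-oscillation magnitude
```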
The reward function described in Equation 16 is designed to minimize the amplitude of the pressure fluctuations with minimum controller gain (to minimize actuation effort). Here, the reward is increased by ten if the magnitude of pressure oscillations at time $t$ is less than a predefined value, $\delta$.
$r(x_t, \theta_t) = r_1 - r_2 - 200\, x_t - 0.1\, \theta_t$  (16)
where $r_1$ and $r_2$ are defined as follows, and $\delta$ is set to $2.5 \times 10^{-4}$.
$r_1 = \begin{cases} 10 & \text{if } x_t < \delta \\ 0 & \text{if } x_t > \delta \end{cases}$  and  $r_2 = \begin{cases} 5 & \text{if } x_t > \delta \\ 0 & \text{if } x_t < \delta \end{cases}$  (17)
The input to the constraint (safety layer) neural network is the magnitude of pressure oscillations and the controller parameters at time $t$, while the output is the magnitude of the pressure oscillations at time $t+1$. This neural network consists of three hidden layers with 100 units each and a ReLU non-linearity. The safety bounds for the magnitude of pressure oscillations are $[-0.04, 0.04]$. That is, the safety layer shall not allow a choice of controller parameters that leads to pressure oscillations of a magnitude outside these bounds during the learning process.
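The reward of Eqs. (16)–(17) can be written directly as in the sketch below; applying the $-0.1\,\theta_t$ term to the controller gain $K$ (the first entry of $\theta_t$) is our interpretation of the penalty on actuation effort.

```python
# Sketch of the reward in Eqs. (16)-(17); penalizing the gain K is an interpretation.
def reward(x_t, theta_t, delta=2.5e-4):
    r1 = 10.0 if x_t < delta else 0.0    # bonus for small pressure oscillations
    r2 = 5.0 if x_t > delta else 0.0     # penalty for large pressure oscillations
    return r1 - r2 - 200.0 * x_t - 0.1 * theta_t[0]   # Eq. (16)

r = reward(x_t=1e-4, theta_t=[5.0, 2.2e-4])
```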
D. Results and discussion
1) Policy learning: Figure 2 shows the learning curve of the RL agent with SAC. The average reward at each episode increases as training progresses. We also observe that introducing a safety layer does not negatively impact the learning performance of the RL agent, which is consistent with the findings of Dalal et al. [19]. Training is stopped when no further increase in the reward is observed, to avoid overfitting.
Fig. 2: Learning curve for training a controller parameter adjusting policy with reinforcement learning, shown with and without the safety layer
2) Instability attenuation performance: Initially, we assess the capability of the RL-based adaptation mechanism in preserving thermoacoustic stability in a time-varying system. The system time delay ($\tau_f$) initially starts at $1 \times 10^{-3}$ and then abruptly changes to $1 \times 10^{-4}$ at $t = 1.5$ sec. Although a phase-shift controller that is optimized for the initial state of the system is able to stabilize the flame, it fails to do so during the transition to a different state, as depicted in Figure 3. In contrast, our proposed approach demonstrates its efficacy in maintaining flame stability.
Fig. 3: Instability attenuation performance of the RL-based adaptation mechanism. Top: the pressure magnitude ($p_2$) when the adaptive scheme is turned on and when it is not. Middle: controller parameters ($K$ and $\tau$) for both scenarios. Bottom: system parameter change.
3) Robustness to noise: The robustness of the RL-based controller in the presence of noise has been tested by adding white Gaussian noise to the measured output. The noise has zero mean, a variance of $1 \times 10^{-5}$, and a sample time of $1 \times 10^{-5}$. Figure 4 shows that the controller can stabilize the system despite the noise in the measured amplitude of pressure oscillations.
Fig. 4: The robustness of the proposed scheme to measurement noise. Top: the pressure magnitude ($p_2$) when the adaptive scheme is turned on and when it is not. Bottom: controller parameters ($K$ and $\tau$) for both scenarios.

E. Comparison to other controllers

The RL-based scheme presented in this work is compared to three other established controllers: first, a model-based robust controller designed using the H-$\infty$ loop-shaping Glover-McFarlane method and the $\nu$-gap metric, as described in [26]; second, a model-based self-tuning regulator (STR), as described in [27]; and finally, a model-free extremum seeking controller (ESC), as in [5].
Algorithm 1 is implemented to learn a policy that minimizes the pressure oscillations at varying system time delays, and the four control strategies are then evaluated over this range. The flame model's time delay depends on operating conditions, such as the equivalence ratio and the preheat temperature. Figure 5 shows that an effective controller parameter adaptation policy was learned, wherein the pressure oscillations are attenuated at various time delays. The other adaptive controllers are not able to achieve the same performance across this wide range of time delays. The three reference controllers are tuned for a system time delay of $1 \times 10^{-3}$.
Fig. 5: Comparing the performance of different controllers (RL, ESC, STR, H-infinity) in attenuating pressure oscillations at varying system time delay ($\tau_f$).
IV. CONCLUSION
We developed a learning-based adaptation mechanism for controller parameters that considers the significance of maintaining the system’s safety during the learning process.
The application of the algorithm to the problem of controlling combustion instabilities demonstrates the efficacy of the method and the practical impact it can have on operating novel gas turbines under a wide range of conditions. This learning-based scheme performs as well as or better than other adaptive controllers. Finally, we note that while we apply a soft actor-critic reinforcement learning algorithm and a constraint function based on a neural network, other RL algorithms and types of safety constraints can be implemented under the same framework. Since the learning efficiency and safety of RL algorithms are areas of accelerating progress, this will allow for broader application of this framework and better performance.
REFERENCES
[1] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal Control. John Wiley & Sons, 2012.
[2] R. S. Sutton, A. G. Barto, and R. J. Williams, "Reinforcement learning is direct adaptive optimal control," IEEE Control Systems Magazine, vol. 12, no. 2, pp. 19–22, 1992.
[3] K. J. Åström, U. Borisson, L. Ljung, and B. Wittenmark, "Theory and applications of self-tuning regulators," Automatica, vol. 13, no. 5, pp. 457–476, 1977.
[4] K. Glover and J. C. Doyle, "State-space formulae for all stabilizing controllers that satisfy an H-infinity-norm bound and relations to risk sensitivity," Systems & Control Letters, vol. 11, no. 3, pp. 167–172, 1988.
[5] J. Moeck, M. Bothien, C. Paschereit, G. Gelbert, and R. King, "Two-parameter extremum seeking for control of thermoacoustic instabilities and characterization of linear growth," in 45th AIAA Aerospace Sciences Meeting and Exhibit, 2007, p. 1416.
[6] M. Krstic and H.-H. Wang, "Stability of extremum seeking feedback for general nonlinear dynamic systems," Automatica, vol. 36, no. 4, pp. 595–601, 2000.
[7] K. B. Ariyur and M. Krstic, Real-Time Optimization by Extremum-Seeking Control. John Wiley & Sons, 2003.
[8] T. P. Lillicrap, J. J. Hunt, A. Pritzel, et al., "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[9] V. Mnih, K. Kavukcuoglu, D. Silver, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[10] B. Recht, "A tour of reinforcement learning: The view from continuous control," Annual Review of Control, Robotics, and Autonomous Systems, vol. 2, pp. 253–279, 2019.
[11] M. G. Bellemare, S. Candido, P. S. Castro, et al., "Autonomous navigation of stratospheric balloons using reinforcement learning," Nature, vol. 588, no. 7836, pp. 77–82, 2020.
[12] A. P. Dowling and A. S. Morgans, "Feedback control of combustion oscillations," Annu. Rev. Fluid Mech., vol. 37, pp. 151–182, 2005.
[13] K. McManus, T. Poinsot, and S. M. Candel, "A review of active control of combustion instabilities," Progress in Energy and Combustion Science, vol. 19, no. 1, pp. 1–29, 1993.
[14] T. C. Lieuwen and V. Yang, Combustion Instabilities in Gas Turbine Engines: Operational Experience, Fundamental Mechanisms, and Modeling. American Institute of Aeronautics and Astronautics, 2005.
[15] T. Lieuwen, M. Chang, and A. Amato, "Stationary gas turbine combustion: Technology needs and policy considerations," Combustion and Flame, vol. 160, no. 8, pp. 1311–1314, 2013.
[16] E. Bøhn, S. Gros, S. Moe, and T. A. Johansen, "Reinforcement learning of the prediction horizon in model predictive control," IFAC-PapersOnLine, vol. 54, no. 6, pp. 314–320, 2021.
[17] M. Sedighizadeh and A. Rezazadeh, "Adaptive PID controller based on reinforcement learning for wind turbine control," in Proceedings of World Academy of Science, Engineering and Technology, vol. 27, 2008, pp. 257–262.
[18] R. Sun, M. L. Greene, D. M. Le, Z. I. Bell, G. Chowdhary, and W. E. Dixon, "Lyapunov-based real-time and iterative adjustment of deep neural networks," IEEE Control Systems Letters, vol. 6, pp. 193–198, 2021.
[19] G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa, "Safe exploration in continuous action spaces," arXiv preprint arXiv:1801.08757, 2018.
[20] T. Haarnoja, A. Zhou, K. Hartikainen, et al., "Soft actor-critic algorithms and applications," arXiv preprint arXiv:1812.05905, 2018.
[21] A. P. Dowling, "Nonlinear self-excited oscillations of a ducted flame," Journal of Fluid Mechanics, vol. 346, pp. 271–290, 1997.
[22] J. Li and A. S. Morgans, "Time domain simulations of nonlinear thermoacoustic behaviour in a simple combustor using a wave-based approach," Journal of Sound and Vibration, vol. 346, pp. 345–360, 2015.
[23] A. M. Annaswamy and A. F. Ghoniem, "Active control of combustion instability: Theory and practice," IEEE Control Systems Magazine, vol. 22, no. 6, pp. 37–54, 2002.
[24] A. Banaszuk, K. B. Ariyur, M. Krstić, and C. A. Jacobson, "An adaptive algorithm for control of combustion instability," Automatica, vol. 40, no. 11, pp. 1965–1972, 2004.
[25] S. Evesque and A. Dowling, "Adaptive control of combustion oscillations," in 4th AIAA/CEAS Aeroacoustics Conference, 1998, p. 2351.
[26] J. Li and A. S. Morgans, "Feedback control of combustion instabilities from within limit cycle oscillations using H-infinity loop-shaping and the ν-gap metric," Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 472, no. 2191, p. 20150821, 2016.
[27] S. Evesque, A. P. Dowling, and A. M. Annaswamy, "Self-tuning regulators for combustion oscillations," Proceedings of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, vol. 459, no. 2035, pp. 1709–1749, 2003.