Autorotation of an Unmanned Helicopter by a Reinforcement Learning Algorithm
D.Jin Lee*, Hyochoong Bang† and Kwangyul Baek‡
Korea Advanced Institute of Science and Technology, Daejeon, 305-701, Republic of Korea
The autorotation maneuver requires time-critical control inputs and involves highly nonlinear dynamics. Reinforcement learning is a feasible approach that can be applied to this problem. Q-learning is selected as the reinforcement learning algorithm, and a radial basis function (RBF) network is applied as a function approximation technique to handle the large state space of the autorotation problem. The weights of the RBF network are updated by a back-propagation technique using a direct gradient method. The proposed reinforcement learning algorithm is evaluated by simulations based on a point-mass model of a modified OH-58A helicopter.
Nomenclature
T = thrust
D = drag
W = weight of the helicopter
V = velocity of the helicopter
u = horizontal velocity
w = sink rate
θ = angle which V makes with the horizon
α = angle between thrust vector and vertical axis
TPP = rotor tip path plane
h = altitude
Ω = angular rate of the rotor blades
I. Introduction
Single-engine manned helicopters can face undesired engine failure during their mission operation. In this case, pilots typically perform an autorotation maneuver, which places great demands on pilot skill in descent gliding and safe landing. Similarly, unmanned helicopters can encounter engine failure during their autonomous operation.
Because of cost factors, the unmanned helicopters under development are typically single-engined and use model-scale engines. Model-scale engines are liable to fail more frequently than the full-scale engines equipped in manned helicopters. For this reason, it is very important to be able to perform the autorotation maneuver autonomously: it allows the user to save not only the unmanned helicopter platform but also the valuable sensors and data onboard.
However, control of a helicopter in autorotation is a particularly challenging problem that requires time-critical control inputs and involves highly nonlinear dynamics. If the appropriate reaction timing is missed or the available energy is used improperly, a safe landing cannot be guaranteed. If the engine fails in hover very near the ground, the collective input cannot be lowered and the angular rate of the rotor will decay immediately. Under these conditions there is no steady-state descent phase, and the control input must rely on judicious timing to extract the last available rotor energy to arrest the sink rate as ground impact approaches.1 Figure 1 shows a picture of autorotation training of a manned helicopter.
A reinforcement learning technique can be applied to control problems of complicated systems such as the autorotation maneuver. It is known that the reinforcement learning method is based on the way animals think, learn, and act.9 Contrary to supervised learning, the controller can adapt itself through the suggested reward function in a natural way. Even though we do not know how to control autorotation, the controller is able to learn
* Graduate Student, Department of Aerospace Engineering, [email protected]
† Associate Professor, Department of Aerospace Engineering, [email protected], Senior Member AIAA.
‡
how to perform the autorotation maneuver appropriately after trials. Reinforcement learning received intensive attention in 1983 through a paper by A. G. Barto, R. S. Sutton, and C. W. Anderson,7 and Q-learning was proposed by Watkins10 in 1989. Q-learning was later extended to Q(λ) and uses the action-value function to improve performance.8 In the autorotation problem, the state and action spaces are continuous and large. It is impossible to store the knowledge of the system in every individual state-action pair. In general this kind of problem is characterized by the curse of dimensionality. A way to overcome such a problem is the use of a function approximation technique. By using function approximation in the state space, it is feasible to deal with state generalization and state evaluation.8 Also, a large number of actions have to be evaluated in order to come up with a good action policy.
Numerical integration can be suggested to handle this problem.
In Section II, the dynamic equations of a helicopter after engine failure are derived using a simplified point-mass model. In Section III, the reinforcement learning technique and the function approximation method are introduced; the function approximation technique is used to handle the continuous state space. In Section IV, simulations are constructed to evaluate the suggested reinforcement learning algorithm, Q-learning using an RBF network.
II. Autorotation Maneuver of a Helicopter
Typically, the autorotation maneuver of a helicopter under engine failure can be divided into three phases.1 First, the entry phase consists of arresting the angular motion of the helicopter and the main-rotor rpm decay. Second, during the steady-state descent phase, air flows upward through the rotor disk. Potential energy of the helicopter is traded for kinetic energy in order to attain the desired steady-state descent airspeed below the maximum sink rate.
In the final phase of the autorotation maneuver, the pilot has to reduce airspeed and sink rate just before touchdown. Both of these can be achieved by moving the cyclic control stick rearward. The rearward-oriented rotor disk causes a larger volume of air to flow through it, resulting in an increase of the total lifting force. Finally, the collective is raised to convert the stored rotor energy into lift for a soft landing.
A. Coordinate System
In this study, we only consider the states from engine failure until the helicopter touches the ground with acceptably small forward and vertical velocity. Before the engine stops, the helicopter is assumed to be in hover or level flight with rotor angular rate Ω0, initial forward speed u0, initial sink rate w0, and initial height h0.
The horizontal position x is measured from the point of engine failure, and the altitude h is measured from the ground. Figure 2 shows the coordinate system used. At the point where the engine failure occurs, therefore, x = 0 and h = h0.
The control inputs are chosen as the rotor thrust coefficient $C_T$ and the angle $\alpha$ that the thrust vector makes with the vertical axis.
$C_{T_z}$ and $C_{T_x}$ are expressed with $C_T$ and $\alpha$ as the vertical and horizontal components of $C_T$:

$$C_{T_z} = C_T\cos\alpha, \qquad C_{T_x} = C_T\sin\alpha \tag{1}$$
Collective pitch control required to obtain the thrust may then be obtained from blade element theory as in Ref. 4.
Figure 1. Bell-206B autorotation training
$$\theta_{75} = \frac{\dfrac{6C_T}{\sigma a}\left(1 + \dfrac{3}{2}\mu^2\right) + \dfrac{3}{2}\lambda\left(1 - \dfrac{1}{2}\mu^2\right)}{1 - \mu^2 + \dfrac{9}{4}\mu^4} \tag{2}$$
where $\theta_{75}$ represents the rotor collective pitch angle at 75 percent span, while $\sigma$ and $a$ denote the rotor solidity ratio and the rotor blade two-dimensional lift curve slope, respectively. Furthermore, $\mu$ and $\lambda$ are the advance and inflow ratios defined in the rotor tip path plane. The advance ratio $\mu$ and inflow ratio $\lambda$ are defined as follows:
$$\mu = \frac{u\cos\alpha + w\sin\alpha}{\Omega R}, \qquad \lambda = \frac{u\sin\alpha - w\cos\alpha + \nu}{\Omega R} \tag{3}$$
Here, $\Omega$ is the rotor angular velocity and $\nu$ is the induced velocity of the rotor disk. Note that $\lambda$ is defined positive in the positive direction of $\nu$, whereas $\mu$ is defined positive in the negative direction of x.
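For illustration, the relations of Eqs. (2) and (3) can be transcribed directly into code. The following Python sketch assumes the reconstruction of Eq. (2) given above; all function and variable names are illustrative, and consistent units (ft, sec, rad) are assumed.

```python
import math

def advance_and_inflow_ratio(u, w, alpha, omega, r_rotor, nu):
    """Advance ratio mu and inflow ratio lambda of Eq. (3),
    defined in the rotor tip path plane."""
    mu = (u * math.cos(alpha) + w * math.sin(alpha)) / (omega * r_rotor)
    lam = (u * math.sin(alpha) - w * math.cos(alpha) + nu) / (omega * r_rotor)
    return mu, lam

def collective_pitch_75(c_t, mu, lam, sigma, a):
    """Rotor collective pitch at 75% span, Eq. (2) as reconstructed above.
    sigma is the rotor solidity and a the blade lift curve slope (1/rad)."""
    num = (6.0 * c_t / (sigma * a)) * (1.0 + 1.5 * mu ** 2) \
          + 1.5 * lam * (1.0 - 0.5 * mu ** 2)
    den = 1.0 - mu ** 2 + 2.25 * mu ** 4
    return num / den
```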
B. Dynamic Equations
The vertical and horizontal force balance equations, illustrated in Fig. 3, are

$$\begin{aligned} m\dot{w} &= mg - T\cos\alpha - D\sin\theta \\ m\dot{u} &= T\sin\alpha - D\cos\theta \end{aligned} \tag{4}$$
The parasite drag D is defined in terms of an equivalent flat plate area $f_e$ as

$$D = \frac{1}{2}\rho V^2 f_e = \frac{1}{2}\rho\left(u^2 + w^2\right) f_e \tag{5}$$
The angle $\theta$ in Eq. (4) can be expressed in terms of w and u by the following relationships:

$$\sin\theta = \frac{w}{\sqrt{u^2 + w^2}}, \qquad \cos\theta = \frac{u}{\sqrt{u^2 + w^2}} \tag{6}$$
Therefore, the components of the parasite drag are derived as
Figure 2. Coordinate system used
Figure 3. Force balance diagram.
$$D\sin\theta = \frac{1}{2}\rho f_e\, w\sqrt{u^2 + w^2}, \qquad D\cos\theta = \frac{1}{2}\rho f_e\, u\sqrt{u^2 + w^2} \tag{7}$$
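As a worked illustration of Eqs. (4)-(7), the point-mass accelerations can be evaluated as follows. This sketch assumes the sign conventions of the reconstruction above (w positive downward as a sink rate); the function name and argument list are illustrative.

```python
import math

def point_mass_derivatives(u, w, thrust, alpha, m, rho, f_e, g=32.174):
    """Time derivatives of horizontal velocity u and sink rate w from the
    force balance of Eq. (4), with the parasite drag of Eqs. (5)-(7).
    Units: ft, sec, slug; g in ft/sec^2."""
    v = math.hypot(u, w)                      # flight speed, Eq. (5)
    d_sin = 0.5 * rho * f_e * w * v           # D*sin(theta), Eq. (7)
    d_cos = 0.5 * rho * f_e * u * v           # D*cos(theta), Eq. (7)
    w_dot = g - (thrust / m) * math.cos(alpha) - d_sin / m   # vertical, Eq. (4)
    u_dot = (thrust / m) * math.sin(alpha) - d_cos / m       # horizontal, Eq. (4)
    return u_dot, w_dot
```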
C. Energy Model
The helicopter has the ability to store energy in its main rotor, and torque is delivered through the main rotor shaft. The torque balance equation can be expressed simply as
$$I_R\,\dot{\Omega} = -\left[\rho\left(\pi R^2\right)\left(\Omega R\right)^2 R\, C_Q\right] \tag{8}$$
where $I_R$ represents the moment of inertia of the rotor system and $C_Q$ is the torque coefficient. The torque coefficient $C_Q$ can be replaced with the power coefficient $C_P$. The energy balance equation of the rotor system is
$$I_R\,\Omega\dot{\Omega} = P_S - P_R = P_S - \left(P_i + P_{pro} + P_{para}\right) \tag{9}$$
where $P_S$ is the power supplied by the engine and $P_R$ is the power required at the main rotor shaft to generate lift and propulsive thrust and to overcome blade profile drag. The propulsive power can be derived from momentum theory, and the profile drag of the rotor blades can be obtained from blade element theory.
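In code, the rotor-speed dynamics implied by Eq. (9) reduce to a single line; after engine failure the shaft power $P_S$ is zero, so the rotor decelerates whenever the required power $P_R$ is positive. This is a minimal sketch with illustrative names.

```python
def rotor_speed_derivative(omega, p_shaft, p_required, i_r):
    """Rotor angular acceleration from the energy balance of Eq. (9):
    I_R * Omega * dOmega/dt = P_S - P_R."""
    return (p_shaft - p_required) / (i_r * omega)
```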
D. Momentum Theory
In the momentum theory approximation, we assume that the rotor affects only the air flowing through the rotor disc. The air acquires the induced velocity $\nu$ as it passes through the rotor disc. The rotor generates thrust as follows:
$$\mathbf{T} = -2\rho\pi R^2\left|\mathbf{V} - \boldsymbol{\nu}\right|\boldsymbol{\nu} \tag{10}$$
The total velocity imparted to the air flowing through the rotor disc is $2\nu$. The momentum power $P_M$, which is the sum of the induced and propulsive power, is simply expressed by the scalar product of the thrust and the resultant velocity of the flow through the rotor disk:
$$P_M = \mathbf{T}\cdot\left(\mathbf{V} - \boldsymbol{\nu}\right) \tag{11}$$
The first term in Eq. (11) represents the propulsive power $P_P = \mathbf{T}\cdot\mathbf{V}$. This is the power required to accelerate or to climb against parasite drag. This term is negative in the autorotation maneuver; the rotor acts like a windmill, absorbing energy from the air.
The second term indicates the induced power $P_i = -\mathbf{T}\cdot\boldsymbol{\nu}$ required to produce thrust. It is always positive since the induced velocity vector is always oriented in the opposite direction to the thrust vector. In reality there are inefficiency factors affecting the induced power, such as tip loss and non-uniform inflow distribution. The tip loss factor can be neglected since it is approximately 0.97.
A non-uniform inflow distribution increases the induced power by a factor $K_{ind}$, and the actual induced power becomes

$$P_i = -K_{ind}\,\mathbf{T}\cdot\boldsymbol{\nu} \tag{12}$$

where $K_{ind}$ is the ratio of non-uniform inflow to uniform inflow induced power. For a triangular downwash distribution, $K_{ind}$ is given by $0.4\left(\sqrt{2}\right)^3$, or approximately 1.13 (Ref. 1).
E. Modeling of the Induced Velocity
The induced velocity $\nu$ is approximated, following Ref. 2, as

$$\tau\dot{\nu} + \nu = K_{ind}\, f_I\, f_G\, \nu_h \tag{13}$$
The ground effect factor $f_G$ is taken to be unity in this study. The time constant $\tau$ is approximated, following Ref. 5, as
$$\tau = \frac{0.21}{\lambda\,\Omega_n} \tag{14}$$
where $\Omega_n$ is the nominal rotor angular speed, on the order of 40, and $\lambda$ is the inflow ratio, on the order of 0.04. As a result, the value of $\tau$ in Eq. (14) is below about 0.15, and the lag can therefore be neglected.
Furthermore, we introduce $\nu_h$ as the reference induced velocity in hover, defined by

$$\nu_h^2 = \frac{T_h}{2\rho\pi R^2} = \left(\Omega R\right)^2\frac{C_T}{2} \tag{15}$$
The induced velocity parameter fI is defined as the ratio of the actual induced velocity to the reference induced velocity νh.
$$f_I = \begin{cases} \dfrac{1.0}{\sqrt{x_2^2 + \left(x_1 + f_I\right)^2}} & \text{if } \left(2x_1 + 3\right)^2 + x_2^2 \geq 1.0 \\[1.5ex] x_1\left(0.373\,x_1^2 + 0.598\,x_2^2 - 1.991\right) & \text{otherwise} \end{cases} \tag{16}$$
where the parameters x1 and x2 are derived as
$$x_1 = \frac{u\sin\alpha - w\cos\alpha}{\nu_h} = \frac{u\sin\alpha - w\cos\alpha}{\Omega R\sqrt{C_T/2}} \tag{17}$$
$$x_2 = \frac{u\cos\alpha + w\sin\alpha}{\nu_h} = \frac{u\cos\alpha + w\sin\alpha}{\Omega R\sqrt{C_T/2}} \tag{18}$$
The first expression for $f_I$ is the result from momentum theory, and the second expression is an empirical approximation for the vortex-ring state. The region of roughness in the vortex-ring state is defined approximately by $(2x_1 + 3)^2 + x_2^2 < 1.0$ (Ref. 3); outside this region the momentum theory expression is used.
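Because the momentum-theory expression in Eq. (16) contains $f_I$ on both sides, it has to be solved iteratively in practice. The following sketch uses simple fixed-point iteration, which is an assumption of this illustration rather than a method prescribed by Refs. 2-3; the names are illustrative.

```python
import math

def induced_velocity_parameter(x1, x2, tol=1e-8, max_iter=100):
    """Induced velocity parameter f_I of Eq. (16), with x1 and x2 as defined
    in Eqs. (17)-(18)."""
    if (2.0 * x1 + 3.0) ** 2 + x2 ** 2 >= 1.0:
        f_i = 1.0                                  # hover value as initial guess
        for _ in range(max_iter):                  # implicit momentum-theory branch
            f_new = 1.0 / math.sqrt(x2 ** 2 + (x1 + f_i) ** 2)
            if abs(f_new - f_i) < tol:
                break
            f_i = f_new
        return f_i
    # empirical approximation inside the vortex-ring state
    return x1 * (0.373 * x1 ** 2 + 0.598 * x2 ** 2 - 1.991)
```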
III. Reinforcement Learning
There are many difficult problems, such as nonlinear flight control, that we are not able to handle easily. A computer can instead learn how to solve these difficult problems through trial and error, seeking practical solutions. Traditionally, there have been two approaches for creating useful machine intelligence. The first, supervised learning, generates an output based on the input and the desired output provided. The second, unsupervised learning, has no feedback with respect to the output generated by the learning system. However, there are many situations where we do not know the correct answers that supervised learning requires. For that reason, there has been much interest in a different approach, reinforcement learning, in which the computer is simply given a goal to achieve and then learns how to achieve that goal by trial and error.
There are three fundamental classes of methods for solving the reinforcement learning problem: dynamic programming, Monte Carlo methods, and temporal-difference (TD) learning. All of these methods solve the full version of the problem, including delayed rewards. Each of the three classes of methods has its strengths and weaknesses, and the methods also differ in their efficiency and speed of convergence.9 In this paper, we adopted the Q-learning algorithm for dealing with the autorotation problem.
A. Q-learning
TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment's dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome.9
The Q-learning algorithm proposed by Watkins10 in 1989 is a kind of TD learning. It was later extended from one-step learning to Q(λ), and it uses the action-value function to improve performance. Its simplest form, one-step Q-learning, is
$$Q\left(s_t, a_t\right) \leftarrow Q\left(s_t, a_t\right) + \alpha\left[r_{t+1} + \gamma\,\max_a Q\left(s_{t+1}, a\right) - Q\left(s_t, a_t\right)\right] \tag{19}$$
We used Eq. (19) to update the action-value function in this study. Evaluation of all possible actions is computationally very expensive in the case of large and continuous action spaces, and the autorotation problem includes such spaces. Thus, the Q-learning algorithm has to be extended to deal with continuous action spaces without requiring too much computational power. For this reason, we consider function approximation techniques.
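For reference, the tabular form of the one-step update in Eq. (19) and an ε-greedy action selection can be sketched as below. The paper replaces the table with an RBF network (Sections III.B-C); the learning rate and discount factor shown here are the values used later in Section IV, and all names are illustrative.

```python
import random

def q_learning_step(q_table, state, action, reward, next_state, actions,
                    alpha=0.2, gamma=0.8):
    """One-step Q-learning update of Eq. (19); q_table maps
    (state, action) pairs to values."""
    best_next = max(q_table.get((next_state, a), 0.0) for a in actions)
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + alpha * (reward + gamma * best_next - old)

def epsilon_greedy(q_table, state, actions, epsilon=0.1):
    """Non-greedy (exploratory) actions are taken with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```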
B. Function Approximation
To overcome the curse of dimensionality, a function approximation technique can be used. This technique requires significantly less data storage. Two types of generalization are typically used: 1) quantization of the state and action space and 2) function approximation. As the state and action space grow, function approximation becomes more favorable: quantization of the state and action space suffers from the curse of dimensionality, while a function approximator can still express the value function satisfactorily. There are several function approximation techniques: fuzzy logic, the multilayer perceptron (MLP), the cerebellar model articulation controller (CMAC), and radial basis functions (RBF), among others.
The function approximators have to satisfy several criteria to be applied in reinforcement learning. The amount of data to be stored has to be limited without significant concessions on the level of approximation of the original function, and the relevant characteristics of the system have to be captured.8
When reinforcement learning is executed online, the function approximator has to calculate its output without excessive delay. Because of the learning characteristics of reinforcement learning, the function approximators have to be able to learn and adapt themselves, and the learned data should be remembered as well as possible. Because the reinforcement signal is an externally provided signal, the function approximator is updated with a supervised method instead of an unsupervised one. As is well known, supervised techniques require an input-output pair, whereas unsupervised techniques only require an input.8
Another criterion, which essentially holds for all techniques, comes from the values of the states. When there is no information about the system, the limits of the value function are unknown. The discount factor and the reinforcement signal determine the upper and lower boundaries of the expected reward, but over the learning process the value function might exceed these boundaries. Therefore it can be risky to normalize the value function, which means that a function approximator that does not normalize its output is preferred.
Figure 4. Schematic of reinforcement learning controller
C. Radial Basis Function (RBF)
Continuous-valued tiles can be used instead of binary tiles by employing radial basis functions; Gaussian functions replace the binary tiling:
$$RBF(x) = w\,e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{20}$$
where w denotes the weight, similar to the weights of the CMAC, and $\mu$ and $\sigma$ indicate the center of the RBF and the effective width, respectively. $\mu$ and $\sigma$ are similar to the mean and standard deviation of a normal distribution.
RBFs are smooth and differentiable, resulting in more precise approximations. However, they are computationally more complex and require more manual tuning. They also suffer from drop-off at the edges of the state space if the ratio of the distance between the centers to the width is too large. In such cases, the normalized RBF (NRBF) can be used to overcome the problem of drop-offs:
$$NRBF(x) = \frac{\sum_i w_i\, e^{-(x-\mu_i)^2/(2\sigma_i^2)}}{\sum_j e^{-(x-\mu_j)^2/(2\sigma_j^2)}} \tag{21}$$
In the MLP the designer has to determine the number of layers, the number of neurons per layer, and the initial weights. In the CMAC the a priori knowledge is limited to the number of tilings and the shape of the tiles, although weight initialization may be desired. For the RBF this argument no longer holds.8
If there is no RBF within $\sigma$ of a new sample, a new RBF centered at that sample is created. The weight of the new RBF is initialized with a value of 0.
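A minimal sketch of such a normalized RBF approximator, including the center-allocation rule just described, is given below. The use of a single shared width vector, the normalized-distance test, and all names are assumptions of this illustration.

```python
import numpy as np

class NRBFNetwork:
    """Normalized RBF approximator of Eq. (21).  A new RBF with weight 0 is
    created whenever no existing center lies within sigma of a new sample."""

    def __init__(self, sigma):
        self.sigma = np.asarray(sigma, dtype=float)  # effective width per dimension
        self.centers = []                            # center vectors mu_i
        self.weights = []                            # weights w_i

    def _activations(self, x):
        x = np.asarray(x, dtype=float)
        return np.array([np.exp(-np.sum((x - mu) ** 2 / (2.0 * self.sigma ** 2)))
                         for mu in self.centers])

    def value(self, x):
        if not self.centers:
            return 0.0
        phi = self._activations(x)
        return float(np.dot(self.weights, phi) / (phi.sum() + 1e-12))  # Eq. (21)

    def maybe_add_center(self, x):
        x = np.asarray(x, dtype=float)
        if not self.centers or all(np.linalg.norm((x - mu) / self.sigma) > 1.0
                                   for mu in self.centers):
            self.centers.append(x.copy())
            self.weights.append(0.0)                 # new weight initialized to 0
```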
D. Training of the Function Approximator
Generally, in supervised learning, the error between a target function and the output function is minimized by the training process. Through training, the parameters are adjusted by calculating the gradient of the error with respect to the parameters.
The back-propagation method can be used to train function approximators such as the CMAC, NRBF, and MLP. This algorithm finds the optimal parameters or weights using a Taylor series expansion approximation of the gradient of the cost function. The gradient provides information about the direction in which the weights should be adjusted in order to minimize the error, and its magnitude gives some idea of the distance of the current weight from the optimal weight.
The back-propagation formula for training the function approximators, called direct value iteration, is derived as follows:
$$w_{t+1} = w_t - \alpha\,\frac{\partial E_t}{\partial \varepsilon_t}\frac{\partial \varepsilon_t}{\partial V_t(s_t)}\frac{\partial V_t(s_t)}{\partial w_t} \tag{22}$$
where $E_t$ is the cost function defined as $0.5\varepsilon_t^2$, $\alpha$ denotes the learning rate, and $\varepsilon_t$ represents the error between the target and the current output. The value function $V_t(s_t)$ is also included in the update equation, Eq. (22).
There are several approaches to update the parameters of the function approximators. In this research we adopted the direct gradient method discussed by Werbos, which is commonly used.8 Werbos suggested that the partial derivative of the TD error with respect to the current state will bring the correct solution if it converges.13 Note that $r_{t+1} + \gamma V_t(s_{t+1})$ is the desired output of the system, while $V_t(s_t)$ is the current actual output. However, during updating, the parameters of the value function and the desired output change simultaneously, which might produce oscillation and even divergence of the parameters. This approach is known as the direct gradient using the current state:
$$\Delta w_t = -\alpha\left[r_{t+1} + \gamma V_t\left(s_{t+1}\right) - V_t\left(s_t\right)\right]\frac{\partial\left[-V_t\left(s_t\right)\right]}{\partial w_t} \tag{23}$$
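After absorbing the inner minus sign, Eq. (23) reduces to adjusting the weights along the gradient of the current value estimate, scaled by the TD error. A minimal sketch using the NRBFNetwork class from Section III.C is shown below; the terminal-state handling and all names are assumptions of this illustration.

```python
def direct_gradient_update(net, state, reward, next_state,
                           alpha=0.2, gamma=0.8, terminal=False):
    """Direct gradient update of Eq. (23): the target r_{t+1} + gamma*V(s_{t+1})
    is treated as a constant and only V(s_t) is differentiated."""
    target = reward if terminal else reward + gamma * net.value(next_state)
    td_error = target - net.value(state)
    phi = net._activations(state)
    grad = phi / (phi.sum() + 1e-12)   # dV/dw_i for the normalized RBF, Eq. (21)
    for i in range(len(net.weights)):
        net.weights[i] += alpha * td_error * grad[i]
```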
IV. Simulation
A. The Helicopter Model
The helicopter model used in this study for the autorotative landing is a point-mass model of the OH-58A. We take the necessary parameters of the OH-58A from Ref. 1, as listed in Table 1. These data are used to construct the dynamic model of Eqs. (2)-(18).
B. The Setup for Q-learning Algorithm using RBFs
The Q-learning algorithm using RBFs was suggested as the reinforcement learning controller. Several learning parameters and boundaries have to be determined in order to apply the suggested learning algorithm to a given problem.
Two types of simulations were carried out: 1) autorotation initially in hover and 2) autorotation initially in forward flight. In the first simulation, started from hover, we consider only the states in the vertical direction, and the goal is to make the sink rate less than 1 ft/sec when the helicopter touches down on the ground. The second derivative of the thrust coefficient is selected as the control input; this approach reduces the size of the action space.
In the second simulation, the helicopter initially performs level flight with a certain forward velocity. The state of forward velocity is considered in this simulation, and the goal is to satisfy the condition that the sink rate and the forward velocity are simultaneously below 1 ft/sec at the ground. The control inputs are the second derivative of the thrust coefficient and the second derivative of the angle of the thrust vector with respect to the vertical axis.
Consequently, we should adjust the learning parameters and the boundaries of the state variables for the better performance of the suggested learning algorithm.
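The chosen actions are second derivatives of the controls, so they are integrated twice before entering the dynamics. The sketch below assumes a simple Euler step and clipping to the control boundaries of Table 2; both the integration scheme and the names are assumptions of this illustration.

```python
def integrate_controls(c_t, c_t_dot, alpha, alpha_dot,
                       c_t_ddot, alpha_ddot, dt,
                       c_t_max=0.015, alpha_max_deg=40.0):
    """Double integration of the action (second derivatives of the thrust
    coefficient and of the rotor plane angle) into the actual controls."""
    c_t_dot += c_t_ddot * dt
    c_t = min(max(c_t + c_t_dot * dt, -c_t_max), c_t_max)
    alpha_dot += alpha_ddot * dt
    alpha = min(max(alpha + alpha_dot * dt, -alpha_max_deg), alpha_max_deg)
    return c_t, c_t_dot, alpha, alpha_dot
```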
1. The Boundaries of the State and the Action Space
The boundaries of the state space and the action space affect the performance of the suggested algorithm. If we set the boundaries too loosely, the resolution of the RBFs decreases, while boundaries that are too tight may exclude from the state space the very states needed to find a solution. All states and actions are normalized within these boundaries. Table 2 shows the adjusted boundaries of the state and action space.
Table 1. Parameters of OH-58A
Parameter                                        Value
f_e, equivalent flat plate area, ft²             24.0
ρ, density of air, slug/ft³                      0.002378
σ, rotor solidity                                0.048
R, radius of the main rotor, ft                  17.63
M, weight of helicopter, slug                    93.16
K_ind, induced velocity correction factor        1.13
a, slope of lift curve                           0.5
Ω, nominal rotor speed, rpm                      354.0
I_R, rotational inertia per blade, slug-ft²      672.0
δ_e, profile drag coefficient (NACA 0012)        0.0087
Table 3. Simulation condition: initially in hovering
Initial State        Value
h, ft                100.0
w, ft/sec            0.0
Ω, rad/sec           37.2
C_T                  0.0005
dC_T/dt              0.0
2. Design of Reinforcement Signal
To make it easy for the agent to find the goal, we should design the reinforcement function carefully. Especially in the autorotation problem, we cannot tell the agent how to perform autorotation; the agent receives a reward only after touching the ground or exceeding the boundaries. For that reason learning is quite slow and the agent has to explore a large portion of the state space. Intuitively, climbing does not occur during autorotation, so we can give a penalty when the agent climbs. This prevents the agent from exploring the state space unnecessarily.
The designed reinforcement signal is shown in Eq.(24).
$$r\left(s_t, a_t\right) = \begin{cases} -(w + u) & \text{if } h < 1\ \text{ft},\ w > 1\ \text{ft/sec},\ u > 1\ \text{ft/sec} \\ -h & \text{if } w < 0\ \text{ft/sec}\ \text{or}\ w > 40\ \text{ft/sec}\ \text{or}\ u < 0\ \text{ft/sec} \\ 1 & \text{if } h < 1\ \text{ft},\ w < 1\ \text{ft/sec},\ u < 1\ \text{ft/sec} \end{cases} \tag{24}$$
3. Learning Parameter Configuration
The effective width should be adjusted according to the problem at hand. Since all components of the state variable are considered to be critical in the autorotation problem, we set the effective width of the state variables to the same value, 0.02, and a value of 0.1 was taken for the effective width of the input variables. A small learning rate promotes convergence toward the correct solution during learning, and although RBFs are rather robust with respect to the learning rate, we adopted a small one. The discount factor generally found in the literature has a value of 0.8 or 0.9.9 We took the learning rate and discount factor as 0.2 and 0.8, respectively.
Table 2. The boundaries of the state and action space
Variable                                   Min        Max
h, ft                                      -100       100
w, ft/sec                                  -40        40
u, ft/sec                                  -30        30
Ω, rad/sec                                 -37        37
C_T                                        -0.015     0.015
α, deg                                     -40        40
dC_T/dt                                    -0.015     0.015
dα/dt, deg/sec                             -40        40
d²C_T/dt² = {-0.3, 0.0, 1.0}               -0.0012    0.004
d²α/dt² = {-1.0, 0.0, 0.8}, deg/sec²       -9         7.2
C. Simulation Results
1. Initially in Hovering
The initial condition of the simulation initially in hovering is listed in Table 3.
Over the course of training, about 19,000 RBFs were generated and the agent attained about a 40% success rate. Non-greedy actions are applied during training to find better solutions, and the number of non-greedy actions is decreased as the trial epoch increases.
The value function is updated by the Q-learning algorithm and adjusted after each transition from the current state to the next state. In the autorotation problem the reinforcement signal is received at h = 0 ft, w ≤ 0 ft/sec, or w > 40 ft/sec. For that reason, the value function is updated backward from these states to the other states. The value function along a sequence of states that does not achieve a safe landing is adjusted to be smaller and smaller; on the other hand, the value function along the sequence is increased after a safe landing is accomplished.
The updates of the value function over the trials are shown in Fig. 5.
The control solutions after 10,000 trials are shown in Figs. 8-13. In Fig. 9, the sink rate satisfies the safe-landing criterion, w ≤ 1 ft/sec for h ≤ 1 ft. Although the second derivative of the thrust coefficient has only three possible actions, the thrust coefficient that affects the helicopter dynamics is smooth, and the thrust coefficient profile during autorotation increases to reduce the sink rate. As a result, autorotation for a safe landing initially in hover was accomplished by the suggested reinforcement learning algorithm.
Figure 5. Update of the value function from epoch # 0-990 : a) Epoch #0, b) Epoch #90, c) Epoch #180, d) Epoch #270, e) Epoch #360, f) Epoch #450, g) Epoch #540, h) Epoch #630, i) Epoch #720, j) Epoch #810, k) Epoch #900, l) Epoch #990
Figure 6. The generated RBFs
Figure 7. Number of success
Figure 8. Altitude response after learning
Figure 9. Sink rate response after learning
Figure 10. Rotor angular rate response after learning
Figure 11. Thrust coefficient response after learning
2. Initially in Forward Flight
The initial condition of the simulation initially in cruise is listed in Table 4. While autorotation initially in hover is affected by the thrust coefficient alone, autorotation with forward velocity is influenced by the combination of the thrust coefficient and the angle of the rotor plane with respect to the vertical axis. As the helicopter flies faster in the forward direction, it can maintain altitude with a smaller collective pitch angle than in hover. This means that properly increasing the forward velocity after engine failure helps the helicopter retain its internal energy.
Figure 12. Derivative of thrust coefficient response after learning
Figure 13. Second derivative of thrust coefficient response after learning
Table 4. Simulation condition: initially in cruise
Initial State        Value
h, ft                100.0
w, ft/sec            0.0
u, ft/sec            15
Ω, rad/sec           37.2
C_T                  0.0005
α, deg               3.0
dC_T/dt              0.0
dα/dt, deg/sec       0.0
Figure 14. Altitude response after learning
Figure 15. Sink rate response after learning
In this simulation the number of states becomes eight. Also, the number of possible actions is nine, three times larger than before. It takes a relatively long time for the agent to find a solution in this large state and action space.
Autorotation in forward flight was achieved by the suggested control method. Figs. 14-21 show that the safe-landing criteria with forward speed were satisfied: w ≤ 1 ft/sec and u ≤ 1 ft/sec at h ≤ 1 ft.
Figure 16. Forward velocity response after learning
Figure 17. Rotor angular rate response after learning
Figure 18. Thrust coefficient response after learning
Figure 19. Rotor plane angle response after learning
Figure 20. Second derivative of thrust coefficient
Figure 21. Second derivative of rotor plane angle
The thrust coefficient increases exponentially at first and saturates at the end at (C_T)max, a value of 0.015. The rotor plane angle with respect to the vertical axis decreases continuously in this solution. The results presented above represent just one of the possible solutions; other solutions can be found through further training.
V. Conclusion
The dynamic equations of a helicopter in autorotation were derived from the force balance, an energy model, momentum theory, and an induced velocity model. Through these formulations, it was verified that autorotation of the helicopter is quite a complicated problem in practical scenarios. A reinforcement learning technique was suggested to deal with this complex system, and the Q-learning algorithm using RBFs was selected. The RBF network plays the role of the function approximator in order to overcome the curse of dimensionality, and the back-propagation technique was applied to update the parameters of the RBF network. Two types of simulation were constructed to verify the performance of the suggested controller. The first simulation, starting from hover, showed that a safe autorotative landing was accomplished in vertical motion. The second simulation, starting from forward flight, showed that the helicopter can be controlled even when it has forward speed. Consequently, a controller based on reinforcement learning will be applicable to autorotation, saving the vehicle even in the case of engine failure.
References
1Lee, A. Y., "Optimal Landing of a Helicopter in Autorotation," Ph.D. Dissertation, Stanford University, 1985.
2Johnson, W., "Helicopter Optimal Descent and Landing after Power Loss," NASA TM-73,244, 1977.
3Washizu, K., Azuma, A., Koo, J., and Oka, T., "Experiments on a Model Helicopter Rotor Operating in the Vortex Ring State," Journal of Aircraft, Vol. 3, No. 3, 1966.
4Gessow, A., and Myers, G. C., Jr., Aerodynamics of the Helicopter, Frederick Ungar Publishing Co., New York, 1952.
5Decker, W. A., et al., "Model Development and the Use of Simulator for Investigating Autorotation," FAA Conference on Helicopter Simulation, Atlanta, Georgia, Apr. 1984.
6Aponso, B. L., Bachelder, E. N., and Lee, D., "Automated Autorotation for Unmanned Rotorcraft Recovery," AHS International Specialists' Meeting, 2005.
7Barto, A. G., Sutton, R. S., and Anderson, C. W., "Neuronlike Adaptive Elements that Can Solve Difficult Learning Control Problems," IEEE Transactions on Systems, Man, and Cybernetics, Vol. 13, No. 5, September-October 1983.
8Engel, J. M., "Reinforcement Learning Applied to UAV Helicopter Control," M.S. Thesis, Faculty of Aerospace Engineering, TU Delft, 2005.
9Sutton, R. S., and Barto, A. G., Reinforcement Learning: An Introduction, The MIT Press, 1998.
10Watkins, C. J. C. H., "Learning from Delayed Rewards," Ph.D. Dissertation, Cambridge University, 1989.
11Si, J., and Wang, Y. T., "On-line Learning by Association and Reinforcement," IEEE Transactions on Neural Networks, Vol. 12, No. 2, 2001.
12Baird, L. C., "Residual Algorithms: Reinforcement Learning with Function Approximation," Proceedings of the Twelfth International Conference on Machine Learning, pp. 30-37, 1995.
13Werbos, P. J., "Consistency of HDP Applied to a Simple Reinforcement Learning Problem," Neural Networks, pp. 179-189, 1990.