
3D Autonomous Navigation of UAVs: An Energy-Efficient and Collision-Free Deep Reinforcement Learning Approach

Item Type Conference Paper

Authors Wang, Yubin; Biswas, Karnika; Zhang, Liwen; Ghazzai, Hakim; Massoud, Yehia Mahmoud

Citation Wang, Y., Biswas, K., Zhang, L., Ghazzai, H., & Massoud, Y. (2022). 3D Autonomous Navigation of UAVs: An Energy-Efficient and Collision-Free Deep Reinforcement Learning Approach. 2022 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS). https://doi.org/10.1109/apccas55924.2022.10090255

Eprint version Post-print

DOI 10.1109/apccas55924.2022.10090255

Publisher IEEE

Rights This is an accepted manuscript version of a paper before final publisher editing and formatting. Archived with thanks to IEEE.

Download date 2024-01-09 22:04:21

Link to Item http://hdl.handle.net/10754/691079


3D Autonomous Navigation of UAVs: An Energy-Efficient and Collision-Free Deep Reinforcement Learning Approach

Yubin Wang1, Karnika Biswas1, Liwen Zhang2, Hakim Ghazzai1 and Yehia Massoud1

1Innovative Technologies Laboratories, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia. Email: {yubin.wang, karnika.biswas, hakim.ghazzai, yehia.massoud}@kaust.edu.sa

2College of Information Science and Engineering, Northeastern University, Shenyang, 110004, China Email: [email protected]

Abstract—Energy consumption optimization is crucial for the navigation of Unmanned Aerial Vehicles (UAVs), as they operate solely on battery power and have limited access to charging stations. In this paper, a novel deep reinforcement learning-based architecture is proposed for planning energy-efficient and collision-free paths for a quadrotor UAV. The proposed method uses a unique combination of remaining flight distance and local knowledge of energy expenditure to compute an optimized route.

An information graph is used to map the environment in three dimensions, and obstacles inside a pre-determined neighbourhood of the UAV are removed to obtain a local, collision-free reachable space. An attention-based neural network forms the key element of the proposed reinforcement learning mechanism, which trains the UAV to autonomously generate the optimized route using only partial knowledge of the environment; the resulting trajectories are then followed by the UAV under a trajectory tracking controller.

Index Terms—Deep reinforcement learning, unmanned aerial vehicles, motion planning, autonomous navigation, energy efficiency.

I. INTRODUCTION

In the last few years, unmanned aerial vehicles (UAVs) have garnered immense interest among researchers, businesses and the common masses due to a variety of applications supported by unmanned short-haul navigation. It is quickly developing into a mature technology in several application areas, such as target search, package delivery, surveillance and rescue.

Improvement of the energy efficiency of UAVs is an open area of research that demands special attention [1], [2]. Optimized expenditure of battery power is a necessity, since the maximum flight duration under ideal operational conditions usually lasts between 5 and 7 minutes for a centimeter-scale UAV and up to only 30 minutes for larger heavy-lift aerial vehicles [3].

Real-time routing and navigation control are the most energy-consuming tasks in most applications, closely followed by data acquisition and communication. However, a majority of researchers either assume unlimited energy capacities [4] or do not consider the energy aspect at all when designing UAV motion planners, with the exception of a very few studies such as [3]. The mainstream approaches to developing energy-aware motion planners recommend algorithm-based optimization [5], computing energy-efficient reference paths by optimizing suitable payoff functions of motor angular velocities and accelerations [6].

Fig. 1. Representation of the world (W) and body (B) coordinate systems, and the forces and moments generated by the propellers of the UAV about the body coordinate axes [3].

Deep Reinforcement Learning (DRL) is a highly popular method for UAV motion control and performance enhancement due to its reactive approach and real-time compatibility [7]–[10]. However, to the best of the authors' knowledge, the literature on generating minimum-energy paths for UAVs using DRL is scarce. To address this gap, we propose a novel 3D navigation planner capable of computing energy-efficient and collision-free paths for a UAV that has only partial information about its environment. We observe that an attention-based graph neural network performs very well in this setting. Once the desired paths are generated, the UAV is driven along the desired optimal trajectories using a tracking controller.

The paper is organized as follows: Section II gives a brief description of the dynamic model and controller design for the UAV. In Section III, the path planning task is formulated as a reinforcement learning problem.

Steps describing the implementation of the proposed DRL and the training of the network are discussed in Section IV.

Performance of the proposed navigation approach is evaluated through simulation studies in Section V, before concluding remarks in Section VI.

II. UAV MODEL AND CONTROL

A. Dynamic Model

In this paper, the UAV is represented by a quadrotor system schematically illustrated in Fig. 1. The position vector $r$ of the center of mass of the UAV (in the world coordinate frame) is related to the UAV dynamics according to equations (1) and (2) [11]:

$$m\ddot{r} = \begin{bmatrix} 0 \\ 0 \\ -mg \end{bmatrix} + R \begin{bmatrix} 0 \\ 0 \\ F_1 + F_2 + F_3 + F_4 \end{bmatrix}, \tag{1}$$

$$I \begin{bmatrix} \dot{p} \\ \dot{q} \\ \dot{r} \end{bmatrix} = \begin{bmatrix} L(F_2 - F_4) \\ L(F_3 - F_1) \\ M_1 - M_2 + M_3 - M_4 \end{bmatrix} - \begin{bmatrix} p \\ q \\ r \end{bmatrix} \times I \begin{bmatrix} p \\ q \\ r \end{bmatrix}. \tag{2}$$

Here, $m$ is the UAV mass, and $R$ is the rotation matrix transforming the body coordinate frame $B$ to the world coordinate frame $W$ using the $Z$-$X$-$Y$ Euler angle convention. $I$ is the moment of inertia with respect to the center of mass, and $L$ represents the distance between the center of mass and the axis of rotation of the rotors. $p, q, r$ are the components of the angular velocity $\omega_{BW}$ of the UAV in the body coordinate frame $B$.

The control inputs contain two parts: the thrust, given by $u_1 = F_1 + F_2 + F_3 + F_4$, and the moments, represented by the 3-tuple in (3):

$$u_2 = \begin{bmatrix} L(F_2 - F_4) \\ L(F_3 - F_1) \\ M_1 - M_2 + M_3 - M_4 \end{bmatrix}. \tag{3}$$

The state vector of the UAV is defined as:

$$\mathbf{x} = [x, y, z, \phi, \theta, \psi, \dot{x}, \dot{y}, \dot{z}, p, q, r]^T, \tag{4}$$

where $r = [x, y, z]^T$ and $\psi$ are the position and the yaw angle, respectively.

B. Energy Consumption Model

Considering only the high-speed flight mode, the total power required, $P_T$, is calculated according to equation (5) [12]:

$$P_T = \frac{C_D}{C_L} W v + \frac{W^2}{D b^2 v}, \tag{5}$$

where $C_D$ and $C_L$ are the aerodynamic drag and lift coefficients respectively, $D$ is the density of the air, $b$ is the width of the UAV, $W$ is the total weight of the UAV, and $v$ is the speed of the UAV relative to the wind.
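A minimal sketch of the power model in (5); the numeric values in the usage line are placeholders, not parameters reported in the paper.

```python
def total_power(W, v, C_D, C_L, D, b):
    """High-speed flight power P_T from eq. (5).

    W : total weight of the UAV (N)
    v : speed of the UAV relative to the wind (m/s)
    C_D, C_L : aerodynamic drag and lift coefficients
    D : air density (kg/m^3)
    b : width of the UAV (m)
    """
    return (C_D / C_L) * W * v + W ** 2 / (D * b ** 2 * v)

# Example with placeholder values (assumed, for illustration only).
print(total_power(W=20.0, v=10.0, C_D=0.05, C_L=0.4, D=1.225, b=0.5))
```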

C. 3D Trajectory Tracking Control

The controller design has been inspired by [11], [13]; it drives the UAV to follow a reference trajectory $\sigma_T(t) = [r_T(t)^T, \psi_T(t)]^T$. The errors on position and velocity are denoted as:

$$e_p = r - r_T, \qquad e_v = \dot{r} - \dot{r}_T. \tag{6}$$

The desired force vector for the controller is computed as:

$$F_{des} = -K_p e_p - K_v e_v + m g z_W + m \ddot{r}_T, \tag{7}$$

where $K_p$ and $K_v$ are positive definite gain matrices, and $z_W$ is the direction of the vertical axis in the world frame.

The first control input is obtained by projecting the desired force vector onto the actual body frame $z$ axis, assuming $\|F_{des}\| \neq 0$:

$$u_{1,des} = F_{des} \cdot z_B. \tag{8}$$

The other three control inputs correspond to rotation errors. The desired $z_B$ direction is taken along the desired thrust vector:

$$z_{B,des} = \frac{F_{des}}{\|F_{des}\|}. \tag{9}$$

Given $e_3 = [0, 0, 1]^T$, the desired rotation ${}^W R_B$, denoted by $R_{des}$, is obtained from:

$$R_{des} e_3 = z_{B,des}. \tag{10}$$

The error on orientation is determined as:

$$e_R = \frac{1}{2} \left( R_{des}^T \, {}^W R_B - {}^W R_B^T \, R_{des} \right)^{\vee}, \tag{11}$$

where the vee map $(\cdot)^{\vee}$ maps elements of $\mathfrak{so}(3)$ to $\mathbb{R}^3$. The angular velocity error is obtained as the difference between the actual and desired angular velocities in body frame coordinates:

$$e_\omega = {}^B[\omega_{BW}] - {}^B[\omega_{BW,T}]. \tag{12}$$

The desired moments, i.e., the three remaining inputs, are then:

$$[u_2, u_3, u_4]^T = -K_R e_R - K_\omega e_\omega, \tag{13}$$

where $K_R$ and $K_\omega$ are diagonal gain matrices used for roll, pitch, and yaw angle tracking control. The desired control input $u = [u_1, u_2, u_3, u_4]^T$ is applied to the quadrotor to produce the corresponding rotor speeds.
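The control law in (6)-(13) can be condensed into the sketch below; the gain values are placeholders, and the construction of $R_{des}$ assumes a zero yaw reference for simplicity (the paper does not spell out this step).

```python
import numpy as np

def vee(S):
    """Map a skew-symmetric matrix in so(3) to a vector in R^3."""
    return np.array([S[2, 1], S[0, 2], S[1, 0]])

def tracking_control(r, r_dot, R, omega, r_T, r_T_dot, r_T_ddot, omega_T, m,
                     g=9.81,
                     Kp=np.diag([4.0, 4.0, 6.0]), Kv=np.diag([2.5, 2.5, 3.0]),
                     KR=np.diag([8.0, 8.0, 2.0]), Kw=np.diag([0.5, 0.5, 0.2])):
    """Compute u1 (thrust) and [u2, u3, u4] (moments) from eqs. (6)-(13)."""
    e_p = r - r_T                                            # eq. (6)
    e_v = r_dot - r_T_dot
    z_W = np.array([0.0, 0.0, 1.0])
    F_des = -Kp @ e_p - Kv @ e_v + m * g * z_W + m * r_T_ddot  # eq. (7)

    z_B = R[:, 2]
    u1 = F_des @ z_B                                         # eq. (8)

    z_B_des = F_des / np.linalg.norm(F_des)                  # eq. (9)
    # Desired rotation consistent with eq. (10); zero yaw reference assumed.
    x_C = np.array([1.0, 0.0, 0.0])
    y_B_des = np.cross(z_B_des, x_C)
    y_B_des /= np.linalg.norm(y_B_des)
    x_B_des = np.cross(y_B_des, z_B_des)
    R_des = np.column_stack([x_B_des, y_B_des, z_B_des])

    e_R = 0.5 * vee(R_des.T @ R - R.T @ R_des)               # eq. (11)
    e_w = omega - omega_T                                    # eq. (12)
    u234 = -KR @ e_R - Kw @ e_w                              # eq. (13)
    return u1, u234
```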

III. PATH PLANNING AS A REINFORCEMENT LEARNING PROBLEM

A. Overview of Navigation Methodology

The trajectory generation is a sequential process, demonstrated in Fig. 2. The observation received from the environment is transmitted to the RL block, which conducts path planning on a locally generated graph and generates waypoints toward the destination using attention-based neural networks. Sections of the trajectory between two consecutive waypoints are generated online using the Minimum Snap trajectory generator [13]. After that, the trajectory tracking controller generates control inputs for following the reference trajectories.

B. Graph Generation and Partial Observation

In this paper, we have used the Probabilistic Roadmap (PRM) mechanism [14] to generate a uniformly randomized 3D graph G = (V, E) spanning a local environment of predefined dimensions, where V is the set of nodes and E denotes the edges of the graph.

For collision-free navigation, the nodes within the obstacles and edges overlapping with the obstacle position and extent (size) are discarded, thereby making the remaining portion of the local graph a safe reachable space.


Fig. 2. Overview of the hierarchical motion planning for UAV navigation.

Each node $v_i \in V$ on the graph is connected to its top $k$ nearest neighbor nodes, where the size of the neighbourhood determines the visibility of the UAV. Thus, a finite, nearly ball-shaped 3D field-of-view (FOV) is considered, and the UAV is assumed to be able to access only local information about nodes inside this limited field-of-view.
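A minimal sketch of this local graph construction, assuming spherical obstacles and a brute-force k-nearest-neighbor connection; these implementation details are assumptions, since the paper does not specify them.

```python
import numpy as np

def build_local_graph(n_nodes=200, k=20, obstacles=(), world=1.0, seed=0):
    """Sample a random PRM-style 3D graph and prune nodes/edges that hit obstacles.

    obstacles : iterable of (center (3,), radius) spheres (assumed obstacle model).
    Returns (nodes, edges), where edges is a set of undirected index pairs.
    """
    rng = np.random.default_rng(seed)
    nodes = rng.uniform(0.0, world, size=(n_nodes, 3))

    def inside_obstacle(p):
        return any(np.linalg.norm(p - c) <= rad for c, rad in obstacles)

    def segment_hits_obstacle(p, q, samples=10):
        return any(inside_obstacle(x) for x in np.linspace(p, q, samples))

    # Discard nodes inside obstacles to keep a safe reachable space.
    nodes = nodes[[i for i, p in enumerate(nodes) if not inside_obstacle(p)]]

    # Connect each node to its k nearest neighbors, skipping colliding edges.
    edges = set()
    for i, p in enumerate(nodes):
        dists = np.linalg.norm(nodes - p, axis=1)
        for j in np.argsort(dists)[1:k + 1]:
            if not segment_hits_obstacle(p, nodes[j]):
                edges.add((min(i, int(j)), max(i, int(j))))
    return nodes, edges

nodes, edges = build_local_graph(obstacles=[(np.array([0.5, 0.5, 0.5]), 0.15)])
```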

The UAV observation $s_t = \{G, v_c, \psi_{s,c}\}$, comprising the graph state and the current state, is time-variant at each decision step $t$. The graph state combines the graph $G = (V, E)$ with node information (the distance to the destination). The current state is given by $\{v_c, \psi_{s,c}\}$, where $v_c$ is the UAV's current position and $\psi_{s,c} = (\psi_s, \psi_1, \ldots, v_c)$ is the executed path.

C. Action

The trained neural network provides a learned policy $p(\psi_t)$ based on the partial observation at each decision step $t$ to decide the next node among the neighbor nodes. The policy, parameterized by the set of weights $\theta$, is as follows:

$$p_\theta(\psi_t = v_i, (v_c, v_i) \in E \mid s_t), \tag{14}$$

where $E$ is the edge set, $s_t$ is the observation, and $v_i$ and $v_c$ are nodes on the graph. A greedy policy that selects the node with maximum probability is used to determine the next node $\psi_t$ to reach.

D. Reward

The reward is designed to favor a path that yields the least energy consumption and the fastest reduction of the flight distance between the UAV and the destination. The following dense reward is implemented:

$$r_d = \alpha \cdot \frac{D_{des}(t-1) - D_{des}(t)}{D_{des}(t)} - \beta \cdot \frac{E_c(t)}{E_{Total}}, \tag{15}$$

where $\alpha$ and $\beta$ are weighting factors set to 0.5 and 1 respectively, $D_{des}(t)$ represents the current distance to the destination at step $t$, $E_c(t)$ is the battery energy consumed from step $t-1$ to step $t$, and $E_{Total}$ is the total on-board battery energy.

A sparse reward at the final decision step acts as a correction term to avoid the deviation between the training objective and the navigation objective introduced by the dense reward normalization:

$$r_s = -\alpha \cdot D_{des}(T) + \beta \cdot \frac{E_r(T)}{E_{Total}}, \tag{16}$$

where both weighting factors are set to 1, and $E_r(T)$ represents the UAV's remaining battery energy at the final step $T$.
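A small sketch of the reward terms in (15)-(16); the argument names and the placeholder numbers in the usage line are assumptions for illustration.

```python
def dense_reward(d_prev, d_curr, energy_step, energy_total, alpha=0.5, beta=1.0):
    """Dense reward r_d from eq. (15): distance progress minus normalized energy use."""
    return alpha * (d_prev - d_curr) / d_curr - beta * energy_step / energy_total

def sparse_reward(d_final, energy_remaining, energy_total, alpha=1.0, beta=1.0):
    """Sparse terminal reward r_s from eq. (16), applied at the final step T."""
    return -alpha * d_final + beta * energy_remaining / energy_total

# Example: the UAV moved 0.1 closer to the goal and spent 15 J out of 2000 J.
print(dense_reward(d_prev=0.6, d_curr=0.5, energy_step=15.0, energy_total=2000.0))
print(sparse_reward(d_final=0.0, energy_remaining=1700.0, energy_total=2000.0))
```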

IV. DEEP REINFORCEMENT LEARNING IMPLEMENTATION

A. Neural Network Structure

The attention-based encoder-decoder neural network is built inspired by [15] and [16], where the encoder is utilized to model the observed environment through learning the dependencies of nodes in the graph state, and the decoder is introduced to output the policy guiding which node to reach, based on the features extracted by the encoder.

The Transformer attention layer [17], which updates the query source, is the major building block of our model. The updated feature is then passed through a feed-forward sublayer containing two linear layers and a ReLU activation.

The node inputs $V$ are embedded into $d$-dimensional node features by the encoder, and Laplacian positional embeddings [18] are added to the embedded node features to guarantee that the Transformer attention layer still works for the incomplete graph state caused by the fact that each node is connected only to its neighbor nodes.
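The Laplacian positional embedding of [18] can be computed from the graph's normalized Laplacian as sketched below; the embedding dimension and the dense eigendecomposition are simplifications chosen for illustration, not details from the paper.

```python
import numpy as np

def laplacian_positional_embedding(n_nodes, edges, dim=8):
    """Return the `dim` smallest non-trivial Laplacian eigenvectors as positional features.

    edges : iterable of undirected node-index pairs (i, j).
    """
    A = np.zeros((n_nodes, n_nodes))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    deg = A.sum(axis=1)
    # Symmetrically normalized Laplacian L = I - D^{-1/2} A D^{-1/2}.
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    L = np.eye(n_nodes) - (d_inv_sqrt[:, None] * A) * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)          # eigenvalues sorted ascending
    # Skip the trivial constant eigenvector and keep the next `dim` ones.
    return eigvecs[:, 1:dim + 1]
```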

Assuming that the UAV is capable of memorizing the visited nodes on the graph, a binary node mask vector $M$ is maintained. Each entry of $M$ is initialized to 0 and set to 1 when the corresponding node is visited.

The role of the decoder is to derive a policy guiding the UAV to select the next node to reach, based on the embedded node features, the current state, and the mask $M$. In the decoder, the current feature is merged with the destination feature to compute an enhanced current feature, and neighbor features are selected from the enhanced node features according to the current position, the edges, and the mask. After feeding the enhanced current feature and the neighbor features to an attention layer, the output is passed to a linear layer to generate the state value $V(s_t)$, and to the final attention layer together with the neighbor features.

The attention weights of the final attention layer are directly taken as the UAV path-planning policy. If a node has already been visited, its similarity $u_i$ is clipped as:

$$u_i = \begin{cases} C \cdot \tanh\!\left(\dfrac{q^T k_i}{\sqrt{d}}\right) & \text{if } M_i \neq 1, \\[4pt] -\infty & \text{otherwise.} \end{cases} \tag{17}$$


Fig. 3. The principle of safe navigation via graph clipping: (a) waypoints within the obstacles; (b) path penetrating the obstacles; (c) safe navigation without collision.

The distribution $p$ over the next node to reach is finally obtained using a softmax:

$$p_i = p_\theta(\psi_t = v_i \mid s_t) = \frac{e^{u_i}}{\sum_{j=1}^{n} e^{u_j}}. \tag{18}$$

For more details about attention-based neural networks, we refer the reader to [15]–[17].

B. Training

The decision model is trained using the PPO algorithm [19]. With $r_t(\theta) = \dfrac{p_\theta(\psi_t \mid s_t)}{p_{\theta_{old}}(\psi_t \mid s_t)}$ denoting the probability ratio, the policy loss is defined as:

$$L_p(\theta) = \mathbb{E}\left[\min\left(r_t(\theta) A_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) A_t\right)\right], \tag{19}$$

where the hyperparameter $\epsilon$ is set to 0.2 and $A_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ represents the advantage function (the discount factor $\gamma = 1$ in the implementation). A standard $\ell_2$ loss is used as the value loss:

$$L_v(\theta) = \mathbb{E}\left[\left(V(s_t) - \sum_{i=0}^{k} \gamma^i r_{t+i}\right)^2\right]. \tag{20}$$

The total loss function reads:

$$L(\theta) = \alpha \cdot L_p(\theta) + \beta \cdot L_v(\theta), \tag{21}$$

where $\alpha$ and $\beta$ are set to 1 and 0.5 respectively.
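A compact numpy sketch of the clipped policy term (19), the value loss (20), and the combined objective (21); in practice these quantities would be computed inside an autodiff framework, and the batch values below are placeholders.

```python
import numpy as np

def ppo_losses(logp_new, logp_old, advantages, values, returns,
               eps=0.2, alpha=1.0, beta=0.5):
    """Clipped PPO policy term (19), l2 value loss (20), and total objective (21)."""
    ratio = np.exp(logp_new - logp_old)                     # r_t(theta)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    L_p = np.minimum(ratio * advantages, clipped * advantages).mean()   # eq. (19)
    L_v = ((values - returns) ** 2).mean()                              # eq. (20)
    return alpha * L_p + beta * L_v, L_p, L_v                           # eq. (21)

total, L_p, L_v = ppo_losses(
    logp_new=np.array([-1.1, -0.7]), logp_old=np.array([-1.0, -0.9]),
    advantages=np.array([0.3, -0.2]), values=np.array([0.5, 0.1]),
    returns=np.array([0.6, 0.0]))
```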

At each training episode, 200 graph nodes are generated in a world of size $[0,1]^3$, the number of neighbor nodes $k$ is set to 20, and the total on-board battery energy is fixed at 2000 units (J). The maximum episode length is set to 256 steps, and an episode is terminated upon reaching the destination or running out of battery energy. With a batch size of 512 and a learning rate of $1 \times 10^{-4}$, decayed every 32 steps by a factor of 0.96, the Adam optimizer is used to run 8 iterations of the PPO algorithm in each training episode.

V. EXPERIMENTS

This section describes the performance evaluation of the proposed technique from the following aspects:

1) 3D Collision Avoidance: In order to show how the safe reachable space is constructed, two instances of collision are illustrated.

1) Waypoint within the obstacles: if a waypoint determined by the DRL coincides with an obstacle, a collision takes place, as shown in Fig. 3(a).

2) Path penetrating the obstacles: although the waypoints are all outside of the obstacles, a collision still occurs in the scenario shown in Fig. 3(b) if the path passes through the extent of an obstacle.

After the deletion of unreachable nodes and edges of the graph, the safe navigation scenario is illustrated in Fig. 3(c).

2) Energy Efficiency: The proposed RL approach can automatically learn a policy that minimizes the flight distance while optimizing the energy consumption.

For comparison, a reward ascribed only to the decrement of the distance to the destination has been studied in Fig. 4(a).

Fig. 4(b) represents the DRL outcome when both distance and energy are considered. Unlike the distance-only approach, the proposed combined reward generates a trajectory that offers flexible movement along all three dimensions due to the energy minimization factor, and it converges faster.

This is further corroborated by the energy consumption plot illustrated in Fig. 4(c), where we have run the simulations several times and recorded the distribution of the energy consumed. The red areas indicate the distribution of energy consumption with the energy- and distance-aware navigation, while the blue areas indicate the distance-only approach. The difference in the spread of the area plots indicates the energy-saving feature of the proposed design. For instance, with the proposed design the UAV reached the destination faster and saved approximately 200 J of battery energy in a test run.

This also benefits the training procedure of the neural networks. During training, monitored with Weights and Biases [20], it is observed that removing the energy reward makes the reward design sparser, making the learning hard to converge. Navigation using the distance-only strategy has been found to suffer from slow convergence. The success rates of convergence for the two strategies are reported in Table I.


Fig. 4. Comparison of navigation with the baseline and the energy-efficient strategy: (a) the path generated by the strategy considering only distance; (b) the path generated by our energy-efficient strategy; (c) the energy consumption (used energy) per step during flight for the energy-and-distance-aware strategy and the distance-only strategy.

TABLE I
THE NUMBER OF TRAINING EPISODES AND SUCCESS RATE OF DESTINATION REACHING FOR THE TWO STRATEGIES

strategy                    | only distance | energy and distance
number of training episodes | 1300          | 2200
success rate                | 99.0%         | 90.5%

VI. CONCLUSIONS

In this paper, an energy-efficient and collision-free 3D navigation planner for UAVs has been proposed using deep reinforcement learning. The RL agent generates a sequence of waypoints using an energy-optimized strategy and only partial information limited to a predefined range of the environment.

Attention-based graph neural networks have been constructed to solve the sequential decision-making problem that identifies the waypoints of the trajectory. Simulation results show the energy efficiency and convergence benefits of the proposed design.

Future directions of research may include coordinated motion of the UAV with other moving agents.

REFERENCES

[1] A. Alsharoa, H. Ghazzai, A. Kadri, and A. E. Kamal, "Spatial and temporal management of cellular hetnets with multiple solar powered drones," IEEE Transactions on Mobile Computing, vol. 19, no. 4, pp. 954–968, 2020.

[2] A. Bahabry, X. Wan, H. Ghazzai, H. Menouar, G. Vesonder, and Y. Massoud, "Low-altitude navigation for multi-rotor drones in urban areas," IEEE Access, vol. 7, pp. 87716–87731, 2019.

[3] N. Kreciglowa, K. Karydis, and V. Kumar, "Energy efficiency of trajectory generation methods for stop-and-go aerial robot navigation," in 2017 International Conference on Unmanned Aircraft Systems (ICUAS), pp. 656–662, 2017.

[4] K. Dorling, J. Heinrichs, G. G. Messier, and S. Magierowski, "Vehicle routing problems for drone delivery," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 47, no. 1, pp. 70–85, 2017.

[5] K. Karydis and V. Kumar, "Energetics in robotic flight at small scales," Interface Focus, vol. 7, no. 1, p. 20160088, 2017.

[6] F. Morbidi, R. Cano, and D. Lara, "Minimum-energy path generation for a quadrotor UAV," in 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1492–1498, 2016.

[7] O. Bouhamed, H. Ghazzai, H. Besbes, and Y. Massoud, "A generic spatiotemporal scheduling for autonomous UAVs: A reinforcement learning-based approach," IEEE Open Journal of Vehicular Technology, vol. 1, pp. 93–106, 2020.

[8] O. Bouhamed, H. Ghazzai, H. Besbes, and Y. Massoud, "Autonomous UAV navigation: A DDPG-based deep reinforcement learning approach," in 2020 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5, 2020.

[9] N. Imanberdiyev, C. Fu, E. Kayacan, and I.-M. Chen, "Autonomous navigation of UAV by using real-time model-based reinforcement learning," in 2016 14th International Conference on Control, Automation, Robotics and Vision (ICARCV), pp. 1–6, 2016.

[10] B. Zhang, Z. Mao, W. Liu, and J. Liu, "Geometric reinforcement learning for path planning of UAVs," Journal of Intelligent & Robotic Systems, vol. 77, no. 2, pp. 391–409, 2015.

[11] N. Michael, D. Mellinger, Q. Lindsey, and V. Kumar, "The GRASP multiple micro-UAV testbed," IEEE Robotics & Automation Magazine, vol. 17, no. 3, pp. 56–65, 2010.

[12] A. Thibbotuwawa, P. Nielsen, B. Zbigniew, and G. Bocewicz, "Energy consumption in unmanned aerial vehicles: A review of energy consumption models and their relation to the UAV routing," in International Conference on Information Systems Architecture and Technology, pp. 173–184, Springer, 2018.

[13] D. Mellinger and V. Kumar, "Minimum snap trajectory generation and control for quadrotors," in 2011 IEEE International Conference on Robotics and Automation, pp. 2520–2525, 2011.

[14] R. Geraerts and M. H. Overmars, "A comparative study of probabilistic roadmap planners," in Algorithmic Foundations of Robotics V, pp. 43–57, Springer, 2004.

[15] W. Kool, H. Van Hoof, and M. Welling, "Attention, learn to solve routing problems!," arXiv preprint arXiv:1803.08475, 2018.

[16] Y. Cao, Z. Sun, and G. Sartoretti, "DAN: Decentralized attention-based neural network to solve the minmax multiple traveling salesman problem," arXiv preprint arXiv:2109.04205, 2021.

[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.

[18] V. P. Dwivedi and X. Bresson, "A generalization of transformer networks to graphs," arXiv preprint arXiv:2012.09699, 2020.

[19] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.

[20] L. Biewald, "Experiment tracking with Weights and Biases," 2020. Software available from wandb.com.
