Deep Reinforcement Learning Based Beamforming Codebook Design for RIS-aided mmWave Systems
Asmaa Abdallah⋆, Abdulkadir Celik⋆, Mohammad M. Mansour⋆⋆, Ahmed M. Eltawil⋆
⋆⋆Department of Elect. and Computer Engineering, American University of Beirut, Beirut 1107 2020, Lebanon.
⋆Computer, Electrical, and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, KSA.
Abstract—Reconfigurable intelligent surfaces (RISs) are envisioned to play a pivotal role in future wireless systems, with the capability of enhancing propagation environments by intelligently reflecting signals toward target receivers. However, optimally tuning the phase shifters at the RIS is a challenging task due to the passive nature of the reflective elements and the high complexity of acquiring channel state information (CSI). Conventionally, wireless systems rely on pre-defined reflection beamforming codebooks for both initial access and data transmission. However, these pre-defined codebooks are commonly not adaptive to the environment. Moreover, identifying the best beam typically requires an exhaustive search that leads to high beam training overhead. To address these issues, this paper develops a multi-agent deep reinforcement learning (MA-DRL) framework that learns how to jointly optimize the active beamforming at the BS and the RIS reflection beam codebook relying only on received power measurements. To accelerate learning convergence and reduce the search space, the proposed model divides the RIS into multiple partitions and associates beam patterns with the surrounding environment at low computational complexity. Simulation results show that the proposed learning framework learns optimized active BS beamforming and RIS reflection codebooks. For instance, the proposed MA-DRL approach with only 6 beams outperforms a 256-beam discrete Fourier transform (DFT) codebook while reducing the beam training overhead by 97%.
I. INTRODUCTION
Reconfigurable intelligent surfaces (RISs) have emerged as a key enabling technology for next-generation wireless systems to improve bandwidth and energy efficiency at low power and hardware cost [1]. The RIS is equipped with many low-cost passive reconfigurable elements that intelligently control the reflection of signals using adjustable phase shifts [1].
Exploiting the full potential gains of RIS relies on finding the optimal RIS configuration that yields the best phase shifts to reflect the incident signals toward the target receivers.
Conventionally, pre-designed reflection beam codebooks, such as discrete Fourier transform (DFT) based codebooks [2]–[4], are used to scan all possible directions, or rely on channel state information to form phased versions of DFT codebooks. However, in large RIS setups, these codebooks pose several challenges: (i) their design typically relies on channel knowledge, which is difficult to acquire due to the passive nature of the RIS [5], [6]; and (ii) the pre-defined codebooks normally require significant beam training overhead and are not adaptive to the environment.
The authors gratefully acknowledge financial support for this work from Ericsson AB and KAUST.
Recent works have investigated optimization-based schemes for beamforming design in RIS-assisted wireless communication systems [7]–[11]. In [7], the weighted sum-rate problem is decoupled via the Lagrangian dual transform; the transmit beamforming is then optimized by fractional programming, and the passive beamforming at the RIS is optimized by three efficient algorithms with closed-form expressions. In [8], a combination of symbol-level precoding and IRS techniques is studied for a multiuser system. In [9], an alternating optimization scheme is studied to find adequate RIS phase shifts. Moreover, the optimization-based beamforming designs in [7]–[11] rely on explicit channel knowledge and do not address designing the BS beamforming and the RIS reflection beam codebook when the channel state information (CSI) is unknown.
Furthermore, for ease of optimization, the reflecting elements are normally assumed to have continuous phases [9], [10], [12]. In [13], a deep reinforcement learning (DRL) approach is presented to design the RIS codebook; however, the authors considered a single-antenna base station (BS) and did not address the active beamforming problem at the BS.
In this paper, we develop a low-complexity yet efficient multi-agent DRL (MA-DRL) approach for jointly designing the active BS beamforming and the RIS reflection beam codebook. The proposed solution accounts for practical hardware limitations, such as the quantized phase shifter constraints at the BS and RIS, requires no explicit channel knowledge, and relies only on received power measurements, which relaxes the synchronization requirements and the channel estimation overhead. Furthermore, our solution serves as a general framework for both active and passive beamforming problems and includes a novel cascaded learning-and-combining framework that greatly reduces the convergence time. Simulation results confirm the capability of the proposed scheme to learn beam patterns that adapt to user channels. In addition, the learned small-sized codebooks outperform DFT and oversampled DFT codebooks while significantly reducing the beam training overhead.
II. SYSTEM MODEL
We consider a RIS-aided multiple-input single-output (MISO) millimeter wave (mmWave) system consisting of a BS equipped with an $M_{\mathrm{BS}} = M_h \times M_v$ uniform planar array (UPA), where $M_h$ and $M_v$ denote the sizes along the horizontal and vertical dimensions, respectively, and a set of $U$ single-antenna UEs, which are served through an $N_{\mathrm{RIS}} = N_h \times N_v$ UPA-RIS since the line-of-sight (LoS) links to the BS are blocked [c.f. Fig. 1]. The RIS→BS channel is given by the Saleh-Valenzuela geometric channel model [5] as follows
$$\mathbf{G} = \sum_{l_g=1}^{L_G} \alpha_{l_g} \mathbf{b}\left(\vartheta^r_{l_g}, \psi^r_{l_g}\right) \mathbf{a}^T\left(\vartheta^t_{l_g}, \psi^t_{l_g}\right) \in \mathbb{C}^{M_{\mathrm{BS}} \times N_{\mathrm{RIS}}}, \quad (1)$$

where $L_G$ denotes the number of paths; $\mathbf{a}(\cdot,\cdot) \in \mathbb{C}^{N_{\mathrm{RIS}} \times 1}$ and $\mathbf{b}(\cdot,\cdot) \in \mathbb{C}^{M_{\mathrm{BS}} \times 1}$ represent the normalized array steering vectors associated with the RIS and the BS, respectively; $\alpha_{l_g}$ is the complex gain over the $l_g$th path; $\vartheta^r_{l_g}/\psi^r_{l_g}$ is the azimuth/elevation angle-of-arrival (AoA) at the BS for the $l_g$th path; and $\vartheta^t_{l_g}/\psi^t_{l_g}$ is the azimuth/elevation angle-of-departure (AoD) at the RIS for the $l_g$th path. For an $N_h \times N_v$ UPA-RIS, $\mathbf{a}(\vartheta, \psi)$ is given by
$$\mathbf{a}(\vartheta, \psi) = \frac{1}{\sqrt{N_{\mathrm{RIS}}}} \left[ e^{-j 2\pi \vartheta n_h / \lambda} \right] \otimes \left[ e^{-j 2\pi \psi n_v / \lambda} \right], \quad (2)$$

where $n_h = [0, \cdots, N_h - 1]$, $n_v = [0, \cdots, N_v - 1]$, $\lambda$ is the carrier wavelength, $d = \lambda/2$ is the antenna spacing, $\vartheta = d \sin(\varrho)\cos(\upsilon)$, $\psi = d \sin(\upsilon)$, and $\varrho/\upsilon$ represents the azimuth/elevation angles. For an $M_h \times M_v$ UPA-BS, $\mathbf{b}(\vartheta, \psi)$ is similarly given by
$$\mathbf{b}(\vartheta, \psi) = \frac{1}{\sqrt{M_{\mathrm{BS}}}} \left[ e^{-j 2\pi \vartheta m_h / \lambda} \right] \otimes \left[ e^{-j 2\pi \psi m_v / \lambda} \right], \quad (3)$$

where $m_h = [0, \cdots, M_h - 1]$ and $m_v = [0, \cdots, M_v - 1]$.
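To make (2)-(3) concrete, the following minimal NumPy sketch builds the UPA steering vector as the Kronecker product of horizontal and vertical phase ramps. The function name and the convention of folding $d/\lambda$ into the effective angles are our own illustrative assumptions, not code from the paper.

```python
import numpy as np

def upa_steering(theta, psi, n_h, n_v):
    """UPA steering vector per Eqs. (2)/(3): Kronecker product of a horizontal
    and a vertical phase ramp, normalized by sqrt(N). Here theta and psi are
    the effective spatial angles, already scaled by d/lambda."""
    ramp_h = np.exp(-1j * 2 * np.pi * theta * np.arange(n_h))  # horizontal ramp
    ramp_v = np.exp(-1j * 2 * np.pi * psi * np.arange(n_v))    # vertical ramp
    return np.kron(ramp_h, ramp_v) / np.sqrt(n_h * n_v)

# Example: 16x16 RIS steering vector toward (azimuth, elevation) = (30°, 10°)
d_over_lambda = 0.5
rho, ups = np.deg2rad(30), np.deg2rad(10)
theta = d_over_lambda * np.sin(rho) * np.cos(ups)
psi = d_over_lambda * np.sin(ups)
a = upa_steering(theta, psi, n_h=16, n_v=16)  # shape (256,)
```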
Likewise, the UE$_u$→RIS channel is expressed as

$$\mathbf{h}^u_r = \sum_{l_u=1}^{L_u} \alpha_{l_u} \mathbf{a}\left(\vartheta^r_{l_u}, \psi^r_{l_u}\right) \in \mathbb{C}^{N_{\mathrm{RIS}} \times 1}, \quad \forall u \in \{1, 2, \cdots, U\}, \quad (4)$$

where $L_u$ is the number of paths; $\alpha_{l_u}$ is the complex gain caused by the path loss for the $l_u$th path; and $\vartheta^r_{l_u}/\psi^r_{l_u}$ is the azimuth/elevation AoA of the $l_u$th path at the RIS. Since the BS and RIS are typically deployed at fixed locations, $\mathbf{G}$ is considered a quasi-static channel with a much longer coherence time than $\mathbf{h}^u_r$ [1]. Following from (1) and (4), the cascaded channel $\mathbf{H}_u \in \mathbb{C}^{M_{\mathrm{BS}} \times N_{\mathrm{RIS}}}$ can be expressed as
$$\mathbf{H}_u \triangleq \mathbf{G}\,\mathrm{diag}\left(\mathbf{h}^u_r\right) = \sum_{l_g=1}^{L_G} \sum_{l_u=1}^{L_u} \alpha_{l_g} \alpha_{l_u} \mathbf{b}\left(\vartheta^r_{l_g}, \psi^r_{l_g}\right) \mathbf{a}^T\left(\vartheta^t_{l_g} + \vartheta^r_{l_u}, \psi^t_{l_g} + \psi^r_{l_u}\right), \quad (5)$$

where $\{(\vartheta^r_{l_g}, \psi^r_{l_g})\}_{l_g=1}^{L_G}$ are independent of the user index $u$ since the RIS-BS channel is common for all users. Hence, the BS beamforms toward the RIS in the same direction for all users.
To mitigate the hardware cost and power consumption of mixed-signal components [14], we adopt analog-only beamforming, where the BS has a single RF chain along with a network of $q$-bit quantized phase shifters. The beamformer/combiner designed for the BS is given by

$$\mathbf{w} = \frac{1}{\sqrt{M_{\mathrm{BS}}}} \left[ e^{j\varphi_1}, \ldots, e^{j\varphi_m}, \ldots, e^{j\varphi_{M_{\mathrm{BS}}}} \right]^T \in \mathbb{C}^{M_{\mathrm{BS}} \times 1}, \quad (6)$$

where $\varphi_m$ is the phase of the $m$th antenna element, selected from $\Theta_{\mathrm{BS}}$, a subset of $2^q$ possible quantized discrete values drawn uniformly from $(-\pi, \pi]$ at the BS. To reduce the complexity of precoding optimization, the RIS also adopts reflection beam codebooks tailored for $B$ user clusters.
Fig. 1: Illustration of the RIS-aided mmWave MISO system: a BS with $M_{\mathrm{BS}} = M_h \times M_v$ antennas and precoder $\mathbf{w}$ serves user $u$ through an RIS with $N_{\mathrm{RIS}} = N_h \times N_v$ elements, whose controller hosts $B$ DRL agents driven by the feedback reward $(+1,-1)$.
Denoting the interaction codebook that contains $B$ reflection vectors by $\mathcal{B}$, the $b$th, $1 \le b \le B$, beam vector is given by

$$\boldsymbol{\Phi}_b = \frac{1}{\sqrt{N_{\mathrm{RIS}}}} \left[ e^{j\theta^b_1}, \ldots, e^{j\theta^b_n}, \ldots, e^{j\theta^b_{N_{\mathrm{RIS}}}} \right]^T \in \mathbb{C}^{N_{\mathrm{RIS}} \times 1}, \quad (7)$$

where each $\theta^b_n$ is the phase of the $n$th RIS element, selected from $\Theta_{\mathrm{RIS}}$, a subset of $2^q$ possible quantized discrete values drawn uniformly from $(-\pi, \pi]$ at the RIS. The phases of signals impinging on the RIS are reconfigured through the micro-controller connected to the RIS. Therefore, a symbol $s_u \in \mathbb{C}$ sent by UE$_u$ traverses the RIS and is received at the BS as¹
$$y_u = \mathbf{w}^H \mathbf{G}\,\mathrm{diag}\left(\mathbf{h}^u_r\right) \boldsymbol{\Phi}_b s_u + \mathbf{w}^H \mathbf{n} \quad (8)$$
$$= \mathbf{w}^H \mathbf{H}_u \boldsymbol{\Phi}_b s_u + \mathbf{w}^H \mathbf{n}, \quad (9)$$

where $\mathbf{n} \sim \mathcal{CN}\left(\mathbf{0}, \sigma^2 \mathbf{I}_{M_{\mathrm{BS}}}\right)$ is the $M_{\mathrm{BS}} \times 1$ additive white Gaussian noise with variance $\sigma^2$.
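As a sanity check on (6)-(9), the short NumPy sketch below draws quantized phases, builds $\mathbf{w}$ and $\boldsymbol{\Phi}_b$, and evaluates the resulting beamforming gain and received SNR. The random stand-in for $\mathbf{H}_u$ and all variable names are illustrative assumptions, not the simulation setup of Section V.

```python
import numpy as np

rng = np.random.default_rng(0)
M_BS, N_RIS, q = 32, 256, 4
Theta = -np.pi + 2 * np.pi * (np.arange(2**q) + 1) / 2**q  # 2^q levels in (-pi, pi]

# Quantized-phase beamformer w (Eq. (6)) and RIS reflection vector Phi_b (Eq. (7))
w = np.exp(1j * rng.choice(Theta, M_BS)) / np.sqrt(M_BS)
phi_b = np.exp(1j * rng.choice(Theta, N_RIS)) / np.sqrt(N_RIS)

# Stand-in cascaded channel H_u = G diag(h_r^u); a random draw here for illustration
H_u = (rng.standard_normal((M_BS, N_RIS))
       + 1j * rng.standard_normal((M_BS, N_RIS))) / np.sqrt(2)

P_s, sigma2 = 1.0, 0.1
eta = np.abs(w.conj() @ H_u @ phi_b) ** 2            # composite beamforming gain
snr = P_s * eta / (sigma2 * np.linalg.norm(w) ** 2)  # with ||w||^2 = 1
print(f"eta = {eta:.4f}, SNR = {10 * np.log10(snr):.2f} dB")
```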
III. PROBLEM FORMULATION
In this section, we provide a formal problem definition that jointly optimizes the beamformer/combiner $\mathbf{w}$ at the BS and the RIS reflection beam codebook $\mathcal{B}$ to maximize the SNR averaged over the entire set of users. Following from the received signal in (9), the SNR of UE$_u$ can be written as

$$\mathrm{SNR}_u = \frac{P_s |\mathbf{w}^H \mathbf{H}_u \boldsymbol{\Phi}_b|^2}{\sigma^2 \|\mathbf{w}\|^2} = \rho\, \eta^u_b, \quad (10)$$

where $\rho = \frac{P_s}{\sigma^2}$, $\|\mathbf{w}\|^2 = 1$, and the composite beamforming gain is given by
$$\eta^u_b = \left| \alpha_{l_g} \alpha_{l_u} \mathbf{w}^H \mathbf{b}\left(\vartheta^r_{l_g}, \psi^r_{l_g}\right) \mathbf{a}^T\left(\vartheta^t_{l_g} + \vartheta^r_{l_u}, \psi^t_{l_g} + \psi^r_{l_u}\right) \boldsymbol{\Phi}_b \right|^2$$
$$= \frac{1}{M_{\mathrm{BS}} N_{\mathrm{RIS}}} \left| \alpha_{l_g} \alpha_{l_u} \sum_{m=1}^{M_{\mathrm{BS}}} \sum_{n=1}^{N_{\mathrm{RIS}}} e^{j\varphi_m} e^{j\theta^b_n} e^{-j\phi^u_n} e^{-j\omega^{\mathrm{BS}}_m} \right|^2, \quad (11)$$

where we define $\phi^u_n = 2\pi\left(n_1\left(\vartheta^t_{l_g} + \vartheta^r_{l_u}\right) + n_2\left(\psi^t_{l_g} + \psi^r_{l_u}\right)\right)$, $\omega^{\mathrm{BS}}_m = 2\pi\left(m_1 \vartheta^r_{l_g} + m_2 \psi^r_{l_g}\right)$, and

$$n_1 = \begin{cases} \mathrm{mod}(n, N_h) - 1, & \text{if } \mathrm{mod}(n, N_h) \neq 0, \\ N_h - 1, & \text{otherwise.} \end{cases} \quad (12)$$
¹We assume $s_u$ satisfies the average power constraint $\mathbb{E}\left[|s_u|^2\right] = P_s$.
Similarly, $n_2$, $m_1$, and $m_2$ can be obtained by replacing $n_1/N_h$ in (12) with $n_2/N_v$, $m_1/M_h$, and $m_2/M_v$, respectively.
If the BS beamforming/combining vector $\mathbf{w}$ is used and $\boldsymbol{\Phi}_b$ is selected from a codebook $\mathcal{B}$ with cardinality $|\mathcal{B}| = B$, then the maximum achievable SNR of UE$_u$ is obtained by the exhaustive search over $\mathcal{B}$ as

$$\mathrm{SNR}^\star_u = \rho \max_{\mathbf{w}, \boldsymbol{\Phi}_b \in \mathcal{B}} \eta^u_b = \rho \max_{\mathbf{w}, \boldsymbol{\Phi}_b \in \mathcal{B}} |\mathbf{w}^H \mathbf{H}_u \boldsymbol{\Phi}_b|^2. \quad (13)$$

The objective of this paper is to design the BS beamforming/combining vector $\mathbf{w}$ and the RIS beam codebook $\mathcal{B}$ to maximize the SNR given by (13) averaged over the set of users served by the BS. Accordingly, the joint beamforming and reflection codebook design problem that maximizes the average user SNR can be formulated as
$$\mathbf{w}^\star, \mathcal{B}^\star = \arg\max_{\mathbf{w}, \mathcal{B}} \frac{1}{|\mathcal{H}_r|} \sum_{\mathbf{h}^u_r \in \mathcal{H}_r} \max_{\boldsymbol{\Phi}_b \in \mathcal{B}} \eta^u_b \quad (14)$$
$$\text{s.t.} \quad \varphi_m \in \Theta_{\mathrm{BS}}, \quad \forall m \in \{1, \ldots, M_{\mathrm{BS}}\}, \quad (15)$$
$$|\mathcal{B}| = B, \; \boldsymbol{\Phi}_b \in \mathcal{B}, \; \theta^b_n \in \Theta_{\mathrm{RIS}}, \quad \forall n \in \{1, \ldots, N_{\mathrm{RIS}}\}, \quad (16)$$

where $\mathcal{H}_r$ represents the set of channel vectors from the RIS to all users. When the constraints are ignored, the maximum is attained only when the phases of the BS beamformers and the RIS configuration satisfy
$$-\omega^{\mathrm{BS}}_m + \varphi_m - \phi^u_n + \theta^b_n = c_u, \quad \forall n, m, u, \; \exists \boldsymbol{\Phi}_b \in \mathcal{B}, \quad (17)$$

where $c_u$ is an arbitrary constant phase value. However, satisfying (17) is impractical due to the following limitations: 1) discretizing continuous-valued phases causes quantization errors, and 2) the acquisition of accurate channel state information (CSI) is challenging due to the channel matrix size. Moreover, given its non-convex and combinatorial nature, the optimal solution of (14) requires an exhaustive search over a theoretically finite but practically infinite feasible set. For instance, for an RIS equipped with 256 elements and 3-bit phase quantization, each element takes one of $2^3 = 8$ phase values, so there exist $8^{256} \approx 1.5 \times 10^{231}$ candidate reflection vectors. Also, considering the passive and low-cost nature of the RIS, it is important to develop a simple yet effective solution that exploits the overall channel gain $\eta^u_b$ as a figure of merit without requiring any CSI, which is explained next.
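For reference, a brute-force evaluation of (13) over a given small codebook, tractable only for small $B$ and entirely infeasible over the full feasible set counted above, could look like the following sketch. The randomly drawn channels and candidate sets are placeholders of our own.

```python
import numpy as np

def best_snr(H_u, W, Phis, rho):
    """Exhaustive search per Eq. (13): maximize |w^H H_u Phi_b|^2 over
    candidate beamformers W (columns) and reflection vectors Phis (columns)."""
    gains = np.abs(W.conj().T @ H_u @ Phis) ** 2   # gains for all (w, Phi_b) pairs
    i, j = np.unravel_index(np.argmax(gains), gains.shape)
    return rho * gains[i, j], i, j                 # SNR*, best beam indices

# Example: 8 candidate beamformers, 6 reflection vectors, random placeholder channel
rng = np.random.default_rng(3)
H_u = rng.standard_normal((32, 256)) + 1j * rng.standard_normal((32, 256))
W = np.exp(1j * rng.uniform(-np.pi, np.pi, (32, 8))) / np.sqrt(32)
Phis = np.exp(1j * rng.uniform(-np.pi, np.pi, (256, 6))) / np.sqrt(256)
snr_star, i, j = best_snr(H_u, W, Phis, rho=10.0)
```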
IV. MULTI-AGENT DEEP REINFORCEMENT LEARNING BASED JOINT BEAMFORMING AND CODEBOOK DESIGN
This section develops a solution to the joint beamforming and codebook design problem by leveraging the powerful exploration capability of MA-DRL to find a near-optimal solution over the huge search space mentioned above. MA-DRL differs from its single-agent counterpart in that agents cooperate and act jointly to achieve a common ultimate reward [15]. MA-DRL is especially suitable for complex problems that can be decomposed into sub-problems, each of which is handled by a single DRL agent. In this manner, we decompose the joint master problem into two sub-problems: in the former, a single agent DRL$_0$ obtains the BS combiner (i.e., active beamforming) vector $\mathbf{w}$ for a given RIS reflection vector. While the learned $\mathbf{w}$ is common to all users/clusters since the RIS-BS channel is shared, user groups observe distinct channel characteristics to/from the RIS. Therefore, the latter sub-problem groups users with similar UE-RIS channels into $B$ clusters and exploits $B$ DRL agents such that DRL$_b$, $1 \le b \le B$, is responsible for designing the reflection (i.e., passive beamforming) codeword of the $b$th cluster, $\boldsymbol{\Phi}_b \in \mathcal{B}$, as explained in detail next.
A. Operation Modes
The proposed approach operates in two modes: a multi-agent learning mode and a deployment mode.
1) Multi-agent Learning Mode: This mode is executed first, where the DRL agents are trained to learn the BS beamformer/combiner and the RIS reflection beam codebook from users with established links, with minimal impact on the wireless system performance. It is intended to run in the background and gather information over a relatively long period of time. In this mode, it is assumed that the RIS and base station employ some classical codebook, with sporadic usage of the learned beams. Moreover, since the positions of the BS and RIS are fixed for a long period of time, the BS-dedicated DRL agent needs less frequent training updates.
2) Deployment Mode: Only when the beamformer and the codebook are learned does the network switch to the deployment mode, where the learned beamformer and RIS codebook replace the classical ones. Users with similar channels will likely be assigned the same RIS reflection vector during this mode. However, these users are assumed to be scheduled on different time or frequency resources to avoid possible interference between them. For example, the same RIS reflection beam vector can serve multiple users on different sub-bands or in different time slots.
B. DRL Based Beam Pattern Design
Previous works generally rely on deep deterministic policy gradient (DDPG) based DRL agents for beamforming optimization problems [12], [14]. Nonetheless, DDPG does not consistently deliver the best performance since 1) it is often sensitive to hyperparameter tuning, and 2) the learned Q-function tends to dramatically overestimate Q-values, eventually breaking the policy [16]. To mitigate these adverse effects, twin delayed DDPG (TD3) introduces three critical enhancements [16]: 1) clipped double-Q learning lets TD3 learn two different Q-functions (i.e., critic networks) and use the smaller of the two Q-values to update the loss functions; 2) TD3 updates the target networks (i.e., the policy) less frequently than the critic networks (Q-values), which damps the volatility that arises in DDPG from how a policy update changes the target; and 3) with target policy smoothing, TD3 adds exploration noise to the target action when updating the policy, making it less likely to exploit actions with spuriously high Q-value estimates.
Therefore, we employ TD3-DRL to solve our optimization problem and adopt the Wolpertinger architecture [17] to efficiently explore an optimal policy in a massive discrete quantized action space. Before delving into the building blocks of the TD3 architecture, let us first introduce the components of the DRL agents common to all sub-problems:
• $\mathbf{s}(t)$ denotes the state vector of the $t$th learning epoch and consists of the phases defined in (6) and (7) for active and passive beamforming at the BS and RIS, respectively. For instance, the states for the BS and the RIS reflection beamformer design are defined as $\mathbf{s}(t) = [\varphi_1, \varphi_2, \cdots, \varphi_{M_{\mathrm{BS}}}]^T$ and $\mathbf{s}(t) = [\theta_1, \theta_2, \cdots, \theta_{N_{\mathrm{RIS}}}]^T$, respectively. The BS beamformer $\mathbf{w}(t)$ and the RIS reflection beam $\boldsymbol{\Phi}_b(t)$ at time instant $t$ can then be constructed from the current states using (6) and (7), respectively.
• $\mathbf{a}(t)$ denotes the action vector of the $t$th learning epoch and specifies element-wise changes to the phases in $\mathbf{s}(t)$. The phase changes are selected from $\Theta_{\mathrm{BS}}$ and $\Theta_{\mathrm{RIS}}$ for active and passive beamforming at the BS and RIS, respectively. The action also determines the state of the next epoch, i.e., $\mathbf{a}(t) = \mathbf{s}(t+1)$.
• $r(t) \in \{-1, +1\}$ denotes the bi-level reward determined by the beamforming gain $\eta^u_b(t)$ under the current states, i.e., $\mathbf{w}(t)$ and $\boldsymbol{\Phi}_b(t)$, $\forall b$. That is, $r(t) = +1$ if $\eta^u_b(t) > \eta^u_b(t-1)$, and $r(t) = -1$ otherwise.
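To make this state/action/reward interface concrete, here is a minimal sketch of one learning epoch for a single agent, assuming a hypothetical `gain(phases)` oracle that returns the measured beamforming gain $\eta^u_b$ for the current quantized phases; the class and function names are ours, not the paper's.

```python
import numpy as np

class BeamEnv:
    """One BS (or RIS) phase vector treated as the DRL state; the reward is the
    bi-level +/-1 signal based on successive gain measurements (no CSI needed)."""

    def __init__(self, n_elements, q, gain):
        self.levels = -np.pi + 2 * np.pi * (np.arange(2**q) + 1) / 2**q  # Theta set
        self.state = np.random.default_rng(0).choice(self.levels, n_elements)
        self.gain = gain                      # hypothetical receive-power oracle
        self.prev_gain = self.gain(self.state)

    def step(self, action):
        """Apply element-wise phase changes; the action becomes the next state."""
        self.state = action                   # a(t) = s(t+1)
        g = self.gain(self.state)
        reward = 1.0 if g > self.prev_gain else -1.0
        self.prev_gain = g
        return self.state, reward
```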
As shown in Fig. 2, TD3 comprises three deep neural networks (DNNs): a single actor network and two critic networks. The actor network takes the state as input and outputs a continuous proto-action. Since the proto-actions do not necessarily comply with the available phase quantization levels, the quantizer maps them to the corresponding phase shifts belonging to $\Theta_{\mathrm{BS}}$ and $\Theta_{\mathrm{RIS}}$. After that, the state and action are passed together to the critic networks. The actor and critic networks have duplicates, called the target actor and target critic networks, that provide computational stability. Unlike the actor and critic networks, they are not trained, but are nevertheless used to determine the targets. The parameters of the target actor and target critic networks are updated from those of the actor and critic networks after a predetermined number of training iterations. Since the critic networks can overestimate the true Q-value, TD3 selects the minimum of the two estimates coming from the two target critic networks to limit the bias on the Q-value estimates [16]. Moreover, the actor network is updated using the deterministic policy gradient, and the parameters of the critic networks are updated based on the mean squared error loss [16], [18].
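The paper does not spell out the training loop beyond this description; as an illustration, the following PyTorch sketch implements the three TD3 mechanisms named above (clipped double-Q, delayed policy and target updates, target policy smoothing) for a generic continuous proto-action. The quantizer and Wolpertinger lookup are omitted, and all network sizes and hyperparameters are our own assumptions.

```python
import copy
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class TD3:
    def __init__(self, s_dim, a_dim, gamma=0.99, tau=0.005,
                 policy_noise=0.2, noise_clip=0.5, policy_delay=2):
        self.actor = nn.Sequential(mlp(s_dim, a_dim), nn.Tanh())  # proto-action in [-1, 1]
        self.critic1, self.critic2 = mlp(s_dim + a_dim, 1), mlp(s_dim + a_dim, 1)
        self.t_actor = copy.deepcopy(self.actor)                  # frozen target copies
        self.t_critic1 = copy.deepcopy(self.critic1)
        self.t_critic2 = copy.deepcopy(self.critic2)
        self.a_opt = torch.optim.Adam(self.actor.parameters(), lr=1e-3)
        self.c_opt = torch.optim.Adam(list(self.critic1.parameters()) +
                                      list(self.critic2.parameters()), lr=1e-3)
        self.gamma, self.tau, self.policy_delay = gamma, tau, policy_delay
        self.policy_noise, self.noise_clip = policy_noise, noise_clip
        self.step = 0

    def update(self, s, a, r, s2):
        # Target policy smoothing: perturb the target action with clipped noise
        with torch.no_grad():
            noise = (torch.randn_like(a) * self.policy_noise)\
                        .clamp(-self.noise_clip, self.noise_clip)
            a2 = (self.t_actor(s2) + noise).clamp(-1, 1)
            sa2 = torch.cat([s2, a2], dim=1)
            # Clipped double-Q: take the smaller of the two target critic values
            q_target = r + self.gamma * torch.min(self.t_critic1(sa2),
                                                  self.t_critic2(sa2))
        sa = torch.cat([s, a], dim=1)
        c_loss = nn.functional.mse_loss(self.critic1(sa), q_target) \
               + nn.functional.mse_loss(self.critic2(sa), q_target)
        self.c_opt.zero_grad(); c_loss.backward(); self.c_opt.step()

        self.step += 1
        if self.step % self.policy_delay == 0:   # delayed policy and target updates
            a_loss = -self.critic1(torch.cat([s, self.actor(s)], dim=1)).mean()
            self.a_opt.zero_grad(); a_loss.backward(); self.a_opt.step()
            for net, tgt in [(self.actor, self.t_actor),
                             (self.critic1, self.t_critic1),
                             (self.critic2, self.t_critic2)]:
                for p, tp in zip(net.parameters(), tgt.parameters()):
                    tp.data.mul_(1 - self.tau).add_(self.tau * p.data)  # Polyak averaging
```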
It is worth noting that the proposed TD3-DRL framework is based solely on SNR measurements and relies neither on CSI nor on user locations. Hence, it is not constrained by the channel coherence time, since the DRL agents can adjust their phase decisions solely based on the UEs' feedback in the downlink or the received signal strength in the uplink.
Without loss of generality, the summation terms in (11) can be written as

$$\sum_{m=1}^{M_{\mathrm{BS}}} \sum_{n=1}^{N_{\mathrm{RIS}}} e^{j\varphi_m} e^{j\theta^b_n} e^{-j\phi^u_n} e^{-j\omega^{\mathrm{BS}}_m} = \underbrace{\sum_{m=1}^{M_{\mathrm{BS}}} e^{j\varphi_m} e^{-j\omega^{\mathrm{BS}}_m}}_{\text{For BS learning}} \; \underbrace{\sum_{n=1}^{N_{\mathrm{RIS}}} e^{j\theta^b_n} e^{-j\phi^u_n}}_{\text{For RIS learning}}, \quad (18)$$
where the dedicated BS agent DRL$_0$ is tasked with learning the phases $\varphi_m, \forall m$, that align with $\omega^{\mathrm{BS}}_m, \forall m$ (the left factor of (18)), and each RIS agent DRL$_b, \forall b$, is tasked with learning the phases $\theta^b_n, \forall (n, b)$, that align with $\phi^u_n, \forall (n, u)$ (the right factor of (18)). The RIS reflection codebook learning procedure is explained in the following subsections.
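The factorization in (18) is what lets the BS and RIS agents learn independently; the short NumPy check below verifies numerically that the double sum indeed splits into a product of two single sums (all phase values randomly drawn for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.uniform(-np.pi, np.pi, 32)    # phi_m - omega_m^BS terms (BS side)
b = rng.uniform(-np.pi, np.pi, 256)   # theta_n^b - phi_n^u terms (RIS side)

double_sum = np.sum(np.exp(1j * a)[:, None] * np.exp(1j * b)[None, :])
product = np.sum(np.exp(1j * a)) * np.sum(np.exp(1j * b))
assert np.allclose(double_sum, product)  # the separability behind Eq. (18)
```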
Fig. 2: Beam pattern design framework with the TD3 DRL agent: the actor network maps the state (phase vector) to a proto-action that the quantizer converts into the chosen phase vector; two critic networks with target critics compute the minimum Q-value via MSE and policy losses, while the reward $(+1,-1)$ is obtained by adjusting the beam phases to the new BS/RIS state and measuring the receive combining gain through the RF chain.
C. Learning BS Combiner/Beamformer (w) Design
We first fix a random RIS reflection beam $\boldsymbol{\Phi}_b$ for all users and optimize the BS combining/beamforming pattern $\mathbf{w}$ using the DRL$_0$ agent. As mentioned in Section II, the BS beam pattern $\mathbf{w}$ is common among all users/clusters since the BS-RIS channel $\mathbf{G}$ is shared due to the fixed BS and RIS locations. For the DRL$_0$ agent, the state is given by $\mathbf{s}(t) = [\varphi_1, \ldots, \varphi_m, \ldots, \varphi_{M_{\mathrm{BS}}}]^T \in \Theta_{\mathrm{BS}}$.
D. Learning RIS Reflection Codebook (Φb) Design
To reduce the complexity and the required codebook storage at the RIS, we leverage the fact that some users share similar channels to/from the RIS. Therefore, instead of learning an individual reflection codeword for each user, we exploit $B$ independent DRL agents to learn the RIS reflection patterns of $B$ user clusters, the collection of which forms the RIS reflection codebook.
1) K-Means User Clustering: Since the proposed framework does not depend on explicit CSI, which is not readily available, we leverage a K-means classifier exploiting a set of RIS sensing beams, as explained in [14], that are randomly sampled from the feasible set of (16). First, we use the obtained BS beamforming vector $\mathbf{w}$, and then apply randomly sampled RIS sensing beams (or reflection vectors). The purpose of these sensing beams is to gather sensing information in the form of receive combining gains, which is used to cluster the users, developing a rough sense of their distribution in the environment. The main difference from [14] is that the sensing beams are reflected through the RIS instead of being sent from the BS directly to the users. The BS listens to the RSSI feedback reported by the users during the beam training stage and accumulates the received power vectors. Once enough beam training power vectors are accumulated, a K-means classifier can be trained to group the users, as in the sketch below. It is worth noting that a newly deployed RIS might rely on a random reflection codebook or a pre-defined codebook to serve the users in the meantime [13], [14].
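A minimal sketch of this clustering step using scikit-learn, assuming each user is represented by its vector of received powers over the random RIS sensing beams; the variable names and the number of sensing beams are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
num_users, num_sensing_beams, B = 1000, 16, 8

# Rows: users; columns: receive combining gains measured under each random
# RIS sensing beam (accumulated from RSSI feedback during beam training).
power_matrix = rng.random((num_users, num_sensing_beams))

# Group users with similar UE-RIS channel signatures into B clusters;
# each cluster b is then served by one learned reflection vector Phi_b.
kmeans = KMeans(n_clusters=B, n_init=10, random_state=0).fit(power_matrix)
cluster_of_user = kmeans.labels_  # DRL agent b designs the beam for cluster b
```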
2) RIS Partitioning and Cascaded Learning: Upon user clustering, the RIS reflection codewords of the clusters are independently learned by $B$ DRL agents. Nonetheless, the large number of RIS elements still renders the task of learning even a single reflection vector highly complex and time-consuming. Therefore, we partition the RIS array into multiple sub-arrays and develop a cascaded DRL learning approach to lower the computational complexity. The cascaded approach proceeds with the following steps: 1) learning the RIS reflection of a small RIS sub-array, and 2) extending the learned reflection sub-array to the full-sized array and refining the learning to obtain the entire reflection vector of the whole RIS surface, as described in detail below.
For the sake of a better explanation, let us consider a single-user case and drop the indices for both the reflection vectors in $\mathcal{B}$ and the $U$ users. Without loss of generality, we assume that the whole array is equally divided into $N_p$ sub-arrays (partitions) of $N_s$ elements each, such that $N_s \times N_p = N_{\mathrm{RIS}}$ and $N_s = c \times N_h$, where $c > 0$ is an integer. Then, the right summation term in (18) can be written as
$$\sum_{n=1}^{N_{\mathrm{RIS}}} e^{j\theta_n} e^{-j\phi^{\mathrm{RIS}}_n} \overset{(a)}{=} \sum_{p=1}^{N_p} \sum_{n_s=1}^{N_s} e^{j\left(\theta_{(p-1)N_s+1} + \theta_{n_s}\right)} e^{j\left(n'_1 \vartheta + \left(n'_2 + \frac{N_s}{N_h}(p-1)\right)\psi\right)}$$
$$\overset{(b)}{=} \underbrace{\sum_{p=1}^{N_p} e^{j\theta_{(p-1)N_s+1}} e^{j\frac{N_s}{N_h}(p-1)\psi}}_{\text{Partition-Combining}} \; \underbrace{\sum_{n_s=1}^{N_s} e^{j\theta_{n_s}} e^{j\left(n'_1 \vartheta + n'_2 \psi\right)}}_{\text{RIS sub-array}}, \quad (19)$$
where $\vartheta = -2\pi(\vartheta^t_{l_g} + \vartheta^r_{l_u})$, $\psi = -2\pi(\psi^t_{l_g} + \psi^r_{l_u})$, $n'_1 = \mathrm{mod}(n_s, N_h)$, and $n'_2 = \mathrm{mod}\left(n_s, \frac{N_s}{N_h}\right)$. In the right factor of (19)(b), the phases $\theta_{n_s}$ and $(n'_1 \vartheta + n'_2 \psi)$ are independent of the partition index $p$. Hence, we only need to align the phases $\theta_{n_s}$ with $(n'_1 \vartheta + n'_2 \psi)$, $\forall n_s$, for a single sub-array, and then accommodate the coherent extension of the single sub-array phases. Thereafter, we apply the partition-combining phase shift to align the phases $\theta_{(p-1)N_s+1}$ with $\frac{N_s}{N_h}(p-1)\psi$, as shown in the left factor of (19)(b), to form the full-dimensional reflection beam vector that finally represents the full array. Moreover, the effective RIS reflection phase at the $n$th element can be expressed as
$$\tilde{\theta}_n = \theta'_p + \theta_{n_s}, \quad (20)$$

where $\theta'_p = \theta_{(p-1)N_s+1} \in \Theta_{\mathrm{RIS}}$ is the $p$th partition phase shift and $\theta_{n_s}$ is the phase of the $n_s$th element of the first sub-array, with $n_s \in \{1, \ldots, N_s\}$, $p \in \{1, \ldots, N_p\}$, and $n \in \{1, \ldots, N_{\mathrm{RIS}}\}$. Furthermore, we can express the full-dimensional $N_{\mathrm{RIS}} \times 1$ RIS phase shift vector $\tilde{\boldsymbol{\theta}}$ as follows
$$\tilde{\boldsymbol{\theta}} = [\tilde{\theta}_1, \cdots, \tilde{\theta}_n, \cdots, \tilde{\theta}_{N_{\mathrm{RIS}}}]^T = [\theta'_1 + \theta_1, \theta'_1 + \theta_2, \theta'_1 + \theta_3, \cdots, \theta'_1 + \theta_{N_s}, \cdots, \theta'_p + \theta_1, \theta'_p + \theta_2, \cdots, \theta'_p + \theta_{N_s}, \cdots, \theta'_{N_p} + \theta_1, \cdots, \theta'_{N_p} + \theta_{N_s}]^T. \quad (21)$$
Following from (20), we use cascaded DRL agents, defined as DRL$^1_b$ and DRL$^2_b$, $\forall b$, that consecutively learn the phases in two stages. In the first stage, each DRL$^1_b$, $\forall b$, agent learns the phases of the first sub-array, with state $\mathbf{s}(t) = [\theta_1, \theta_2, \ldots, \theta_{N_s}]^T \in \Theta_{\mathrm{RIS}}$, while keeping the rest of the elements OFF. In the second stage, all $N_{\mathrm{RIS}}$ RIS elements are activated, and each DRL$^2_b$, $\forall b$, agent learns the partitions' phase shifts, with state $\mathbf{s}(t) = [\theta'_1, \theta'_2, \ldots, \theta'_{N_p}]^T \in \Theta_{\mathrm{RIS}}$; the phases of the full-dimensional array are then defined per (20). It is worth pointing out that there is no need to repeat the learning for the other $N_p - 1$ sub-arrays: the cascaded DRL agents first learn the phases of the first sub-array and then learn the partitions' phase shifts to form the full-dimensional RIS array. Consequently, the maximum number of phases that need to be learned simultaneously throughout the cascaded DRL process is only $\max(N_s, N_p)$. As the number of phases is decreased from $N_{\mathrm{RIS}}$ to $\max(N_s, N_p)$, the size of the search space is significantly decreased, which helps the algorithm converge faster. It is worth noting that, since both phases $\theta'_p$ and $\theta_{n_s}$ in (20) are selected from $\Theta_{\mathrm{RIS}}$, the effective phase $\tilde{\theta}_n$ still satisfies the discrete phase shifter constraint. A short sketch of the full-vector construction in (20)-(21) follows.
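The construction in (20)-(21) simply tiles the learned sub-array phases across partitions and adds a per-partition combining offset; the NumPy sketch below (with our own notation) makes this explicit.

```python
import numpy as np

def full_ris_phases(theta_sub, theta_part):
    """Build the full N_RIS phase vector per Eqs. (20)-(21): element n in
    partition p gets theta'_p + theta_{n_s}."""
    n_p, n_s = len(theta_part), len(theta_sub)
    # Repeat the sub-array pattern across partitions (tile), then add each
    # partition's combining phase shift theta'_p (repeat); wrap to (-pi, pi]
    # separately if the quantized-phase constraint must be enforced explicitly.
    return np.tile(theta_sub, n_p) + np.repeat(theta_part, n_s)  # length N_RIS

# Example: N_s = 32 sub-array phases and N_p = 8 partition shifts -> 256 phases
theta_sub = np.zeros(32)
theta_part = np.linspace(0, np.pi, 8)
theta_full = full_ris_phases(theta_sub, theta_part)
assert theta_full.shape == (256,)
```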
V. SIMULATION RESULTS
In this section, we evaluate the performance of the proposed MA-DRL based learning approach. In our simulations, we consider the outdoor scenario 'O1 60' from the DeepMIMO dataset [19]. For the generation of the channels $\mathbf{G}$, $\mathbf{h}^u_r$, and $\mathbf{H}_u$, $\forall u \in \{1, 2, \cdots, U\}$, we adopt the following DeepMIMO parameters: 1) scenario name: 'O1 60'; 2) the RIS is located at the position of BS 3; 3) active users: rows 1201 to 1400; 4) number of RIS elements in (x, y, z): (1, 16, 16) (i.e., $N_{\mathrm{RIS}} = 256$); 5) number of multipaths: 5; 6) carrier frequency: 60 GHz. We further select 80 out of the 181 users in each row, yielding a total of 16,000 users. The BS is at row 850 and column 90, with the number of BS antennas in (x, y, z): (1, 8, 4) (i.e., $M_{\mathrm{BS}} = 32$). We set the phase quantization to $q = 4$ bits. Moreover, for the cascaded RIS DRL process, we divide the 256-element RIS into $N_p = 8$ sub-arrays, each having $N_s = 32$ elements.
The critic and actor networks are fully connected DNNs with identical structures, comprising one input layer, one output layer, and two hidden layers. The input dimension of the critic networks equals the cardinality of the state set together with the action set, and their output is the Q-value function. The input and output dimensions of the actor network equal the cardinality of the action set. The dimensions of the hidden layers are larger than the input and output dimensions. The DNNs use the Adam optimizer with a learning rate of $10^{-3}$. It is worth noting that the generated dataset is used to reproduce a wireless system; the proposed framework does not have access to the dataset and blindly learns the target beams.
For comparison purposes, we consider the following benchmarks: 1) the upper-bound beamforming based on the singular value decomposition (SVD) of perfectly known channels with unquantized phase shifters [20]; 2) the DFT-based codebook scanning directions with $M_{\mathrm{BS}}$ candidate beams at the BS and $N_{\mathrm{RIS}}$ candidate reflection beams at the RIS [2]; and 3) the oversampled DFT-based codebook [21] with an oversampling factor of 4, i.e., $4M_{\mathrm{BS}}$ candidate beams at the BS and $4N_{\mathrm{RIS}}$ candidate reflection beams at the RIS.
Fig. 3: Beamforming gain of the cascaded learning process for B = 16 versus iterations: top, the MA-DRL RIS sub-array (first stage); bottom, MA-DRL partition combining (second stage); both compared against the classical DFT codebook, the oversampled DFT codebook (×4), and SVD with perfect CSI.
Fig. 4: Spectral efficiency (bps/Hz) versus number of beams/clusters, comparing the multi-agent DRL solution with the classical DFT codebook, the oversampled DFT codebook (×4), and the SVD solution with perfect CSI.
Fig. 3 investigates the beamforming gain of the cascaded learning process versus the number of iterations. The top subplot shows the first stage of the DRL learning process for an RIS sub-array of $N_s = 32$, whereas the bottom subplot shows the second, partition-combining stage for the full-dimensional RIS array. Fig. 3 shows that the first-stage RIS sub-array DRL method of the proposed cascaded learning process achieves a higher beamforming gain than the best beam in the classical beamsteering codebook with only 4000 iterations. More interestingly, the first-stage RIS sub-array DRL method converges in fewer than $2 \times 10^4$ iterations. Furthermore, the second-stage partition-combining DRL method converges in fewer than 800 iterations. Hence, the partition-combining method for the large RIS accelerates the learning convergence while shrinking the search space. Fig. 4 shows the spectral efficiency achieved by the proposed approach versus the codebook size.
As shown in Fig. 4, the proposed MA-DRL approach outperforms the classical DFT codebook with only a single BS beamformer and a 6-beam RIS codebook, requiring just 6 beam training slots. The classical DFT codebook in [2] devises a multi-beam training method to save beam training overhead that needs $\frac{N_{\mathrm{RIS}}}{N_p}\left(1 + \frac{\log_2(N_p)}{2}\right) = 80$ beam training slots. Compared to [2], our learned 6-beam RIS codebook design thus requires only 7.5% of the beam training overhead. Moreover, the designed 64-beam RIS codebook matches the oversampled DFT codebook, which needs 128 beam training slots for the BS beamformer and 1024 beam training slots for the RIS reflection beams; yet, the proposed 64-beam RIS codebook requires only 6% of the beam training overhead.
VI. CONCLUSION
In this paper, a multi-agent DRL based learning framework has been developed for designing the active beamforming and passive reflection beam codebooks for RIS-assisted mmWave systems. The developed solution incorporates a cascaded learning framework that accelerates convergence, reduces the search space, and decreases computational complexity. Simulation results demonstrate the effectiveness of the proposed approach in learning BS beamforming and RIS reflection beam codebooks that adapt to user distributions and channel characteristics. Additionally, the results show a considerable reduction in beam training overhead compared to DFT codebooks.
REFERENCES
[1] B. Zheng et al., "A survey on channel estimation and practical passive beamforming design for intelligent reflecting surface aided wireless communications," IEEE Commun. Surveys Tuts., pp. 1–1, 2022.
[2] C. You, B. Zheng, and R. Zhang, "Fast beam training for IRS-assisted multiuser communications," IEEE Wireless Commun. Lett., vol. 9, no. 11, pp. 1845–1849, Nov. 2020.
[3] S. Mabrouki, I. Dayoub, Q. Li, and M. Berbineau, "Codebook designs for millimeter-wave communication systems in both low- and high-mobility: Achievements and challenges," IEEE Access, vol. 10, pp. 25786–25810, 2022.
[4] Y. Wang, N. J. Myers, N. González-Prelcic, and R. W. Heath, "Site-specific online compressive beam codebook learning in mmWave vehicular communication," IEEE Trans. Wireless Commun., vol. 20, no. 5, pp. 3122–3136, 2021.
[5] A. Abdallah, A. Celik, M. M. Mansour, and A. M. Eltawil, "Deep learning-based channel estimation for wideband RIS-aided mmWave MIMO system with beam squint," in Proc. IEEE Int. Conf. Commun. (ICC), Seoul, South Korea, 2022, pp. 1269–1275.
[6] A. Abdallah, A. Celik, M. M. Mansour, and A. M. Eltawil, "RIS-aided mmWave MIMO channel estimation using deep learning and compressive sensing," IEEE Trans. Wireless Commun., pp. 1–1, 2022.
[7] H. Guo et al., "Weighted sum-rate maximization for intelligent reflecting surface enhanced wireless networks," in Proc. IEEE Global Commun. Conf. (GLOBECOM), 2019, pp. 1–6.
[8] R. Liu, M. Li, Q. Liu, and A. L. Swindlehurst, "Joint symbol-level precoding and reflecting designs for IRS-enhanced MU-MISO systems," IEEE Trans. Wireless Commun., vol. 20, no. 2, pp. 798–811, Feb. 2021.
[9] H. Ur Rehman et al., "Joint active and passive beamforming design for IRS-assisted multi-user MIMO systems: A VAMP-based approach," IEEE Trans. Commun., vol. 69, no. 10, pp. 6734–6749, Oct. 2021.
[10] Y. Zhu et al., "Deep reinforcement learning based joint active and passive beamforming design for RIS-assisted MISO systems," arXiv preprint arXiv:2202.11702, 2022.
[11] W. Wang et al., "Joint beam training and positioning for intelligent reflecting surfaces assisted millimeter wave communications," IEEE Trans. Wireless Commun., vol. 20, no. 10, pp. 6282–6297, Oct. 2021.
[12] C. Huang, R. Mo, and C. Yuen, "Reconfigurable intelligent surface assisted multiuser MISO systems exploiting deep reinforcement learning," IEEE J. Sel. Areas Commun., vol. 38, no. 8, pp. 1839–1850, 2020.
[13] Y. Zhang, M. Alrabeiah, and A. Alkhateeb, "Learning reflection beamforming codebooks for arbitrary RIS and non-stationary channels," arXiv preprint arXiv:2109.14909, 2021.
[14] Y. Zhang and A. Alkhateeb, "Reinforcement learning of beam codebooks in millimeter wave and terahertz MIMO systems," IEEE Trans. Commun., 2022.
[15] A. Wong, T. Bäck, A. V. Kononova, and A. Plaat, "Multiagent deep reinforcement learning: Challenges and directions towards human-like approaches," arXiv preprint arXiv:2106.15691, 2021.
[16] S. Fujimoto, H. van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," in Proc. Int. Conf. Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 80. PMLR, Jul. 2018, pp. 1587–1596.
[17] G. Dulac-Arnold et al., "Deep reinforcement learning in large discrete action spaces," arXiv preprint arXiv:1512.07679, 2015.
[18] A. Abdallah, A. Celik, M. M. Mansour, and A. M. Eltawil, "Multi-agent deep reinforcement learning for beam codebook design in RIS-aided systems." [Online]. Available: http://hdl.handle.net/10754/685254
[19] A. Alkhateeb, "DeepMIMO: A generic deep learning dataset for millimeter wave and massive MIMO applications," in Proc. Inf. Theory and Appl. Workshop (ITA), San Diego, CA, Feb. 2019, pp. 1–8.
[20] R. W. Heath, N. González-Prelcic et al., "An overview of signal processing techniques for millimeter wave MIMO systems," IEEE J. Sel. Topics Signal Process., vol. 10, no. 3, pp. 436–453, Apr. 2016.
[21] "Massive MIMO for new radio," Samsung white paper, Dec. 2020. [Online]. Available: https://www.samsung.com/global/business/networks/insights/white-papers/1208-massive-mimo-for-new-radio/