Deep Reinforcement Learning Based Beamforming Codebook Design for RIS-aided mmWave Systems
Asmaa Abdallah⋆, Abdulkadir Celik⋆, Mohammad M. Mansour⋆⋆, Ahmed M. Eltawil⋆
⋆⋆Department of Elect. and Computer Engineering, American University of Beirut, Beirut 1107 2020, Lebanon.
⋆Computer, Electrical, and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, KSA.
Abstract—Reconfigurable intelligent surfaces (RISs) are envisioned to play a pivotal role in future wireless systems, with the capability of enhancing propagation environments by intelligently reflecting signals toward target receivers. However, optimally tuning the phase shifters at the RIS is a challenging task due to the passive nature of the reflective elements and the high complexity of acquiring channel state information (CSI). Conventionally, wireless systems rely on pre-defined reflection beamforming codebooks for both initial access and data transmission. However, these pre-defined codebooks are commonly not adaptive to the environment. Moreover, identifying the best beam typically requires an exhaustive search that leads to high beam training overhead. To address these issues, this paper develops a multi-agent deep reinforcement learning (MA-DRL) framework that learns how to jointly optimize the active beamforming at the BS and the RIS reflection beam codebook relying only on received power measurements. To accelerate learning convergence and reduce the search space, the proposed model divides the RIS into multiple partitions and associates beam patterns with the surrounding environment at low computational complexity. Simulation results show that the proposed learning framework learns optimized active BS beamforming and RIS reflection codebooks. For instance, the proposed MA-DRL approach with only 6 beams outperforms a 256-beam discrete Fourier transform (DFT) codebook while reducing the beam training overhead by 97%.
I. INTRODUCTION
Reconfigurable intelligent surfaces (RISs) have emerged as a key enabling technology for next-generation wireless systems to improve bandwidth and energy efficiency at low power and hardware cost [1]. The RIS is equipped with many low-cost passive reconfigurable elements that intelligently control the reflection of signals using adjustable phase shifts [1].
Exploiting the full potential gains of RIS relies on finding the optimal RIS configuration that yields the best phase shifts to reflect the incident signals toward the target receivers.
Conventionally, pre-designed reflection beam codebooks, such as discrete Fourier transform (DFT) based codebooks [2]–[4], are used to scan all possible directions, or rely on channel state information to form phased versions of DFT codebooks. However, in large RIS setups, these codebooks pose several challenges: (i) their design typically relies on channel knowledge, which is difficult to acquire due to the passive nature of the RIS [5], [6]; and (ii) the pre-defined codebooks normally require significant beam training overhead and are not adaptive to the environment.
The authors gratefully acknowledge financial support for this work from Ericsson AB and KAUST.
Recent works have investigated optimization-based schemes for beamforming design in RIS-assisted wireless communication systems [7]–[11]. In [7], the weighted sum-rate problem is decoupled via the Lagrangian dual transform; the transmit beamforming is then optimized by fractional programming, and the passive beamforming at the RIS is optimized by three efficient algorithms with closed-form expressions. In [8], a combination of symbol-level precoding and IRS techniques is studied for a multiuser system. In [9], an alternating optimization scheme is studied to find adequate RIS phase shifts. Moreover, the optimization-based beamforming designs in [7]–[11] rely on explicit channel knowledge and do not address designing the BS beamforming and the RIS reflection beam codebook when the channel state information (CSI) is unknown.
Furthermore, for ease of optimization, the reflecting elements are normally assumed to have continuous phases [9], [10], [12]. In [13], a deep reinforcement learning (DRL) approach is presented to design the RIS codebook; however, the authors considered a single-antenna base station (BS) and did not address the active beamforming problem at the BS.
In this paper, we develop a low-complexity yet efficient multi-agent DRL (MA-DRL) approach for jointly designing the active BS beamforming and the RIS reflection beam codebook. The proposed solution accounts for practical hardware limitations, such as the quantized phase shifter constraints at the BS and RIS, requires no explicit channel knowledge, and relies only on received power measurements, which relaxes the synchronization requirements and the channel estimation overhead. Furthermore, our solution serves as a general framework for both active and passive beamforming problems and includes a novel cascaded learning-and-combining framework that greatly reduces the convergence time. Simulation results confirm the capability of the proposed scheme to learn beam patterns that adapt to user channels. In addition, the learned small-sized codebooks outperform DFT and oversampled DFT codebooks while significantly reducing the beam training overhead.
II. SYSTEM MODEL
We consider a RIS-aided multiple-input single-output (MISO) millimeter wave (mmWave) system consisting of a BS equipped with an $M_{\mathrm{BS}} = M_h \times M_v$ uniform planar array (UPA), where $M_h$ and $M_v$ denote the sizes along the horizontal and vertical dimensions, respectively, and a set of $U$ single-antenna UEs, which are served through an $N_{\mathrm{RIS}} = N_h \times N_v$ UPA-RIS since the line-of-sight (LoS) links to the BS are blocked [c.f. Fig. 1]. The RIS→BS channel is given by the Saleh-Valenzuela geometric channel model [5] as follows
$$\mathbf{G} = \sum_{l_g=1}^{L_G} \alpha_{l_g} \mathbf{b}\left(\vartheta^r_{l_g}, \psi^r_{l_g}\right) \mathbf{a}^T\left(\vartheta^t_{l_g}, \psi^t_{l_g}\right) \in \mathbb{C}^{M_{\mathrm{BS}} \times N_{\mathrm{RIS}}}, \quad (1)$$

where $L_G$ denotes the number of paths; $\mathbf{a}(\cdot,\cdot) \in \mathbb{C}^{N_{\mathrm{RIS}} \times 1}$ and $\mathbf{b}(\cdot,\cdot) \in \mathbb{C}^{M_{\mathrm{BS}} \times 1}$ represent the normalized array steering vectors associated with the RIS and the BS, respectively; $\alpha_{l_g}$ is the complex gain over the $l_g$th path; $\vartheta^r_{l_g}/\psi^r_{l_g}$ is the azimuth/elevation angle-of-arrival (AoA) at the BS for the $l_g$th path; and $\vartheta^t_{l_g}/\psi^t_{l_g}$ is the azimuth/elevation angle-of-departure (AoD) at the RIS for the $l_g$th path. For an $N_h \times N_v$ UPA-RIS, $\mathbf{a}(\vartheta, \psi)$ is given by
$$\mathbf{a}(\vartheta, \psi) = \frac{1}{\sqrt{N_{\mathrm{RIS}}}} \left[ e^{-j 2\pi \vartheta n_h / \lambda} \right] \otimes \left[ e^{-j 2\pi \psi n_v / \lambda} \right], \quad (2)$$

where $n_h = [0, \cdots, N_h - 1]$, $n_v = [0, \cdots, N_v - 1]$, $\lambda$ is the carrier wavelength, $d = \lambda/2$ is the antenna spacing, $\vartheta = d \sin(\varrho)\cos(\upsilon)$, $\psi = d \sin(\upsilon)$, and $\varrho/\upsilon$ represents the azimuth/elevation angles. For an $M_h \times M_v$ UPA-BS, $\mathbf{b}(\vartheta, \psi)$ is similarly given by
$$\mathbf{b}(\vartheta, \psi) = \frac{1}{\sqrt{M_{\mathrm{BS}}}} \left[ e^{-j 2\pi \vartheta m_h / \lambda} \right] \otimes \left[ e^{-j 2\pi \psi m_v / \lambda} \right], \quad (3)$$

where $m_h = [0, \cdots, M_h - 1]$ and $m_v = [0, \cdots, M_v - 1]$.
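To make (2)-(3) concrete, the following minimal NumPy sketch builds the UPA steering vector as the Kronecker product of horizontal and vertical phase ramps. The function name and the convention of folding $d/\lambda$ into the effective angles are our own illustrative assumptions, not code from the paper.

```python
import numpy as np

def upa_steering(theta, psi, n_h, n_v):
    """UPA steering vector per Eqs. (2)/(3): Kronecker product of a horizontal
    and a vertical phase ramp, normalized by sqrt(N). Here theta and psi are
    the effective spatial angles, already scaled by d/lambda."""
    ramp_h = np.exp(-1j * 2 * np.pi * theta * np.arange(n_h))  # horizontal ramp
    ramp_v = np.exp(-1j * 2 * np.pi * psi * np.arange(n_v))    # vertical ramp
    return np.kron(ramp_h, ramp_v) / np.sqrt(n_h * n_v)

# Example: 16x16 RIS steering vector toward (azimuth, elevation) = (30°, 10°)
d_over_lambda = 0.5
rho, ups = np.deg2rad(30), np.deg2rad(10)
theta = d_over_lambda * np.sin(rho) * np.cos(ups)
psi = d_over_lambda * np.sin(ups)
a = upa_steering(theta, psi, n_h=16, n_v=16)  # shape (256,)
```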
Likewise, the UE$_u$→RIS channel is expressed as

$$\mathbf{h}^u_r = \sum_{l_u=1}^{L_u} \alpha_{l_u} \mathbf{a}\left(\vartheta^r_{l_u}, \psi^r_{l_u}\right) \in \mathbb{C}^{N_{\mathrm{RIS}} \times 1}, \quad \forall u \in \{1, 2, \cdots, U\}, \quad (4)$$

where $L_u$ is the number of paths; $\alpha_{l_u}$ is the complex gain caused by the path loss for the $l_u$th path; and $\vartheta^r_{l_u}/\psi^r_{l_u}$ is the azimuth/elevation AoA of the $l_u$th path at the RIS. Since the BS and RIS are typically deployed at fixed locations, $\mathbf{G}$ is considered a quasi-static channel with a much longer coherence time than $\mathbf{h}^u_r$ [1]. Following from (1) and (4), the cascaded channel $\mathbf{H}_u \in \mathbb{C}^{M_{\mathrm{BS}} \times N_{\mathrm{RIS}}}$ can be expressed as
$$\mathbf{H}_u \triangleq \mathbf{G}\,\mathrm{diag}\left(\mathbf{h}^u_r\right) = \sum_{l_g=1}^{L_G} \sum_{l_u=1}^{L_u} \alpha_{l_g} \alpha_{l_u} \mathbf{b}\left(\vartheta^r_{l_g}, \psi^r_{l_g}\right) \mathbf{a}^T\left(\vartheta^t_{l_g} + \vartheta^r_{l_u}, \psi^t_{l_g} + \psi^r_{l_u}\right), \quad (5)$$

where $\{(\vartheta^r_{l_g}, \psi^r_{l_g})\}_{l_g=1}^{L_G}$ are independent of the user index $u$ since the RIS-BS channel is common for all users. Hence, the BS beamforms toward the RIS in the same direction for all users.
To mitigate the hardware cost and power consumption of mixed-signal components [14], we adopt analog-only beamforming, where the BS has a single RF chain along with a network of $q$-bit quantized phase shifters. The beamformer/combiner designed for the BS is given by

$$\mathbf{w} = \frac{1}{\sqrt{M_{\mathrm{BS}}}} \left[ e^{j\varphi_1}, \ldots, e^{j\varphi_m}, \ldots, e^{j\varphi_{M_{\mathrm{BS}}}} \right]^T \in \mathbb{C}^{M_{\mathrm{BS}} \times 1}, \quad (6)$$

where $\varphi_m$ is the phase of the $m$th antenna element, selected from $\Theta_{\mathrm{BS}}$, a subset of $2^q$ possible quantized discrete values drawn uniformly from $(-\pi, \pi]$ at the BS. To reduce the complexity of precoding optimization, the RIS also adopts reflection beam codebooks tailored for $B$ user clusters.
Fig. 1: Illustration of the RIS-aided mmWave MISO system: a BS with $M_{\mathrm{BS}} = M_h \times M_v$ antennas and precoder $\mathbf{w}$ serves user $u$ through an RIS with $N_{\mathrm{RIS}} = N_h \times N_v$ elements, whose controller hosts $B$ DRL agents driven by the feedback reward $(+1,-1)$.
Denoting the interaction codebook that contains $B$ reflection vectors by $\mathcal{B}$, the $b$th, $1 \le b \le B$, beam vector is given by

$$\boldsymbol{\Phi}_b = \frac{1}{\sqrt{N_{\mathrm{RIS}}}} \left[ e^{j\theta^b_1}, \ldots, e^{j\theta^b_n}, \ldots, e^{j\theta^b_{N_{\mathrm{RIS}}}} \right]^T \in \mathbb{C}^{N_{\mathrm{RIS}} \times 1}, \quad (7)$$

where each $\theta^b_n$ is the phase of the $n$th RIS element, selected from $\Theta_{\mathrm{RIS}}$, a subset of $2^q$ possible quantized discrete values drawn uniformly from $(-\pi, \pi]$ at the RIS. The phases of signals impinging on the RIS are reconfigured through the micro-controller connected to the RIS. Therefore, a symbol $s_u \in \mathbb{C}$ sent by UE$_u$ traverses the RIS and is received at the BS as¹
$$y_u = \mathbf{w}^H \mathbf{G}\,\mathrm{diag}\left(\mathbf{h}^u_r\right) \boldsymbol{\Phi}_b s_u + \mathbf{w}^H \mathbf{n} \quad (8)$$
$$= \mathbf{w}^H \mathbf{H}_u \boldsymbol{\Phi}_b s_u + \mathbf{w}^H \mathbf{n}, \quad (9)$$

where $\mathbf{n} \sim \mathcal{CN}\left(\mathbf{0}, \sigma^2 \mathbf{I}_{M_{\mathrm{BS}}}\right)$ is the $M_{\mathrm{BS}} \times 1$ additive white Gaussian noise with variance $\sigma^2$.
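As a sanity check on (6)-(9), the short NumPy sketch below draws quantized phases, builds $\mathbf{w}$ and $\boldsymbol{\Phi}_b$, and evaluates the resulting beamforming gain and received SNR. The random stand-in for $\mathbf{H}_u$ and all variable names are illustrative assumptions, not the simulation setup of Section V.

```python
import numpy as np

rng = np.random.default_rng(0)
M_BS, N_RIS, q = 32, 256, 4
Theta = -np.pi + 2 * np.pi * (np.arange(2**q) + 1) / 2**q  # 2^q levels in (-pi, pi]

# Quantized-phase beamformer w (Eq. (6)) and RIS reflection vector Phi_b (Eq. (7))
w = np.exp(1j * rng.choice(Theta, M_BS)) / np.sqrt(M_BS)
phi_b = np.exp(1j * rng.choice(Theta, N_RIS)) / np.sqrt(N_RIS)

# Stand-in cascaded channel H_u = G diag(h_r^u); a random draw here for illustration
H_u = (rng.standard_normal((M_BS, N_RIS))
       + 1j * rng.standard_normal((M_BS, N_RIS))) / np.sqrt(2)

P_s, sigma2 = 1.0, 0.1
eta = np.abs(w.conj() @ H_u @ phi_b) ** 2            # composite beamforming gain
snr = P_s * eta / (sigma2 * np.linalg.norm(w) ** 2)  # with ||w||^2 = 1
print(f"eta = {eta:.4f}, SNR = {10 * np.log10(snr):.2f} dB")
```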
III. PROBLEM FORMULATION
In this section, we provide a formal problem definition that jointly optimizes the beamformer/combiner $\mathbf{w}$ at the BS and the RIS reflection beam codebook $\mathcal{B}$ to maximize the SNR averaged over the entire set of users. Following from the received signal in (9), the SNR of UE$_u$ can be written as

$$\mathrm{SNR}_u = \frac{P_s |\mathbf{w}^H \mathbf{H}_u \boldsymbol{\Phi}_b|^2}{\sigma^2 \|\mathbf{w}\|^2} = \rho\, \eta^u_b, \quad (10)$$

where $\rho = \frac{P_s}{\sigma^2}$, $\|\mathbf{w}\|^2 = 1$, and the composite beamforming gain is given by
$$\eta^u_b = \left| \alpha_{l_g} \alpha_{l_u} \mathbf{w}^H \mathbf{b}\left(\vartheta^r_{l_g}, \psi^r_{l_g}\right) \mathbf{a}^T\left(\vartheta^t_{l_g} + \vartheta^r_{l_u}, \psi^t_{l_g} + \psi^r_{l_u}\right) \boldsymbol{\Phi}_b \right|^2$$
$$= \frac{1}{M_{\mathrm{BS}} N_{\mathrm{RIS}}} \left| \alpha_{l_g} \alpha_{l_u} \sum_{m=1}^{M_{\mathrm{BS}}} \sum_{n=1}^{N_{\mathrm{RIS}}} e^{j\varphi_m} e^{j\theta^b_n} e^{-j\phi^u_n} e^{-j\omega^{\mathrm{BS}}_m} \right|^2, \quad (11)$$

where we define $\phi^u_n = 2\pi\left(n_1\left(\vartheta^t_{l_g} + \vartheta^r_{l_u}\right) + n_2\left(\psi^t_{l_g} + \psi^r_{l_u}\right)\right)$, $\omega^{\mathrm{BS}}_m = 2\pi\left(m_1 \vartheta^r_{l_g} + m_2 \psi^r_{l_g}\right)$, and

$$n_1 = \begin{cases} \mathrm{mod}(n, N_h) - 1, & \text{if } \mathrm{mod}(n, N_h) \neq 0, \\ N_h - 1, & \text{otherwise.} \end{cases} \quad (12)$$
¹We assume $s_u$ satisfies the average power constraint $\mathbb{E}\left[|s_u|^2\right] = P_s$.
Similarly, $n_2$, $m_1$, and $m_2$ can be obtained by replacing $n_1/N_h$ in (12) with $n_2/N_v$, $m_1/M_h$, and $m_2/M_v$, respectively.
If the BS beamforming/combining vector $\mathbf{w}$ is used and $\boldsymbol{\Phi}_b$ is selected from a codebook $\mathcal{B}$ with cardinality $|\mathcal{B}| = B$, then the maximum achievable SNR of UE$_u$ is obtained by the exhaustive search over $\mathcal{B}$ as

$$\mathrm{SNR}^\star_u = \rho \max_{\mathbf{w}, \boldsymbol{\Phi}_b \in \mathcal{B}} \eta^u_b = \rho \max_{\mathbf{w}, \boldsymbol{\Phi}_b \in \mathcal{B}} |\mathbf{w}^H \mathbf{H}_u \boldsymbol{\Phi}_b|^2. \quad (13)$$

The objective of this paper is to design the BS beamforming/combining vector $\mathbf{w}$ and the RIS beam codebook $\mathcal{B}$ to maximize the SNR given by (13) averaged over the set of users served by the BS. Accordingly, the joint beamforming and reflection codebook design problem that maximizes the average user SNR can be formulated as
$$\mathbf{w}^\star, \mathcal{B}^\star = \arg\max_{\mathbf{w}, \mathcal{B}} \frac{1}{|\mathcal{H}_r|} \sum_{\mathbf{h}^u_r \in \mathcal{H}_r} \max_{\boldsymbol{\Phi}_b \in \mathcal{B}} \eta^u_b \quad (14)$$
$$\text{s.t.} \quad \varphi_m \in \Theta_{\mathrm{BS}}, \quad \forall m \in \{1, \ldots, M_{\mathrm{BS}}\}, \quad (15)$$
$$|\mathcal{B}| = B, \; \boldsymbol{\Phi}_b \in \mathcal{B}, \; \theta^b_n \in \Theta_{\mathrm{RIS}}, \quad \forall n \in \{1, \ldots, N_{\mathrm{RIS}}\}, \quad (16)$$

where $\mathcal{H}_r$ represents the set of channel vectors from the RIS to all users. When the constraints are ignored, the maximum is attained only when the phases of the BS beamformers and the RIS configuration satisfy
$$-\omega^{\mathrm{BS}}_m + \varphi_m - \phi^u_n + \theta^b_n = c_u, \quad \forall n, m, u, \; \exists \boldsymbol{\Phi}_b \in \mathcal{B}, \quad (17)$$

where $c_u$ is an arbitrary constant phase value. However, satisfying (17) is impractical due to the following limitations: 1) discretizing continuous-valued phases causes quantization errors, and 2) the acquisition of accurate channel state information (CSI) is challenging due to the channel matrix size. Moreover, given its non-convex and combinatorial nature, the optimal solution of (14) requires an exhaustive search over a theoretically finite but practically infinite feasible set. For instance, for an RIS equipped with 256 elements and 3-bit phase quantization, each element takes one of $2^3 = 8$ phase values, so there exist $8^{256} \approx 1.5 \times 10^{231}$ candidate reflection vectors. Also, considering the passive and low-cost nature of the RIS, it is important to develop a simple yet effective solution that exploits the overall channel gain $\eta^u_b$ as a figure of merit without requiring any CSI, which is explained next.
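For reference, a brute-force evaluation of (13) over a given small codebook, tractable only for small $B$ and entirely infeasible over the full feasible set counted above, could look like the following sketch. The randomly drawn channels and candidate sets are placeholders of our own.

```python
import numpy as np

def best_snr(H_u, W, Phis, rho):
    """Exhaustive search per Eq. (13): maximize |w^H H_u Phi_b|^2 over
    candidate beamformers W (columns) and reflection vectors Phis (columns)."""
    gains = np.abs(W.conj().T @ H_u @ Phis) ** 2   # gains for all (w, Phi_b) pairs
    i, j = np.unravel_index(np.argmax(gains), gains.shape)
    return rho * gains[i, j], i, j                 # SNR*, best beam indices

# Example: 8 candidate beamformers, 6 reflection vectors, random placeholder channel
rng = np.random.default_rng(3)
H_u = rng.standard_normal((32, 256)) + 1j * rng.standard_normal((32, 256))
W = np.exp(1j * rng.uniform(-np.pi, np.pi, (32, 8))) / np.sqrt(32)
Phis = np.exp(1j * rng.uniform(-np.pi, np.pi, (256, 6))) / np.sqrt(256)
snr_star, i, j = best_snr(H_u, W, Phis, rho=10.0)
```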
IV. MULTI-AGENT DEEP REINFORCEMENT LEARNING BASED JOINT BEAMFORMING AND CODEBOOK DESIGN
This section develops a solution to the joint beamforming and codebook design problem by leveraging the powerful exploration capability of MA-DRL to find a near-optimal solution over the huge search space mentioned above. MA-DRL differs from its single-agent counterpart in that agents cooperate and act jointly to achieve a common ultimate reward [15]. MA-DRL is especially suitable for complex problems that can be decomposed into sub-problems, each of which is handled by a single DRL agent. In this manner, we decompose the joint master problem into two sub-problems: in the former, a single agent DRL$_0$ obtains the BS combiner (i.e., active beamforming) vector $\mathbf{w}$ for a given RIS reflection vector. While the learned $\mathbf{w}$ is common to all users/clusters since the RIS-BS channel is shared, user groups observe distinct channel characteristics to/from the RIS. Therefore, the latter sub-problem groups users with similar UE-RIS channels into $B$ clusters and exploits $B$ DRL agents such that DRL$_b$, $1 \le b \le B$, is responsible for designing the reflection (i.e., passive beamforming) codeword of the $b$th cluster, $\boldsymbol{\Phi}_b \in \mathcal{B}$, as explained in detail next.
A. Operation Modes
The proposed approach operates in two modes: a multi-agent learning mode and a deployment mode.
1) Multi-agent Learning Mode: This mode is executed first, where the DRL agents are trained to learn the BS beamformer/combiner and the RIS reflection beam codebook from users with established links, with minimal impact on the wireless system performance. It is intended to run in the background and gather information over a relatively long period of time. In this mode, it is assumed that the RIS and base station employ some classical codebook, with sporadic usage of the learned beams. Moreover, since the positions of the BS and RIS are fixed for a long period of time, the BS-dedicated DRL agent needs less frequent training updates.
2) Deployment Mode: Only when the beamformer and the codebook are learned does the network switch to the deployment mode, where the learned beamformer and RIS codebook replace the classical ones. Users with similar channels will likely be assigned the same RIS reflection vector during this mode. However, these users are assumed to be scheduled on different time or frequency resources to avoid possible interference between them. For example, the same RIS reflection beam vector can serve multiple users on different sub-bands or in different time slots.
B. DRL Based Beam Pattern Design
Previous works generally rely on deep deterministic policy gradient (DDPG) based DRL agents for beamforming optimization problems [12], [14]. Nonetheless, DDPG does not consistently deliver the best performance since 1) it is often sensitive to hyperparameter tuning, and 2) the learned Q-function tends to dramatically overestimate Q-values, eventually breaking the policy [16]. To mitigate these adverse effects, twin delayed DDPG (TD3) introduces three critical enhancements [16]: 1) clipped double-Q learning lets TD3 learn two different Q-functions (i.e., critic networks) and use the smaller of the two Q-values to update the loss functions; 2) TD3 updates the target networks (i.e., the policy) less frequently than the critic networks (Q-values), which damps the volatility that arises in DDPG from how a policy update changes the target; and 3) with target policy smoothing, TD3 adds exploration noise to the target action when updating the policy, making it less likely to exploit actions with spuriously high Q-value estimates.
Therefore, we employ TD3-DRL to solve our optimization problem and adopt the Wolpertinger architecture [17] to efficiently explore an optimal policy in a massive discrete quantized action space. Before delving into the building blocks of the TD3 architecture, let us first introduce the components of the DRL agents common to all sub-problems:
• $\mathbf{s}(t)$ denotes the state vector of the $t$th learning epoch and consists of the phases defined in (6) and (7) for active and passive beamforming at the BS and RIS, respectively. For instance, the states for the BS and the RIS reflection beamformer design are defined as $\mathbf{s}(t) = [\varphi_1, \varphi_2, \cdots, \varphi_{M_{\mathrm{BS}}}]^T$ and $\mathbf{s}(t) = [\theta_1, \theta_2, \cdots, \theta_{N_{\mathrm{RIS}}}]^T$, respectively. The BS beamformer $\mathbf{w}(t)$ and the RIS reflection beam $\boldsymbol{\Phi}_b(t)$ at time instant $t$ can then be constructed from the current states using (6) and (7), respectively.
• $\mathbf{a}(t)$ denotes the action vector of the $t$th learning epoch and specifies element-wise changes to the phases in $\mathbf{s}(t)$. The phase changes are selected from $\Theta_{\mathrm{BS}}$ and $\Theta_{\mathrm{RIS}}$ for active and passive beamforming at the BS and RIS, respectively. The action also determines the state of the next epoch, i.e., $\mathbf{a}(t) = \mathbf{s}(t+1)$.
• $r(t) \in \{-1, +1\}$ denotes the bi-level reward determined by the beamforming gain $\eta^u_b(t)$ under the current states, i.e., $\mathbf{w}(t)$ and $\boldsymbol{\Phi}_b(t)$, $\forall b$. That is, $r(t) = +1$ if $\eta^u_b(t) > \eta^u_b(t-1)$, and $r(t) = -1$ otherwise.
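To make this state/action/reward interface concrete, here is a minimal sketch of one learning epoch for a single agent, assuming a hypothetical `gain(phases)` oracle that returns the measured beamforming gain $\eta^u_b$ for the current quantized phases; the class and function names are ours, not the paper's.

```python
import numpy as np

class BeamEnv:
    """One BS (or RIS) phase vector treated as the DRL state; the reward is the
    bi-level +/-1 signal based on successive gain measurements (no CSI needed)."""

    def __init__(self, n_elements, q, gain):
        self.levels = -np.pi + 2 * np.pi * (np.arange(2**q) + 1) / 2**q  # Theta set
        self.state = np.random.default_rng(0).choice(self.levels, n_elements)
        self.gain = gain                      # hypothetical receive-power oracle
        self.prev_gain = self.gain(self.state)

    def step(self, action):
        """Apply element-wise phase changes; the action becomes the next state."""
        self.state = action                   # a(t) = s(t+1)
        g = self.gain(self.state)
        reward = 1.0 if g > self.prev_gain else -1.0
        self.prev_gain = g
        return self.state, reward
```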
As shown in Fig. 2, TD3 comprises three deep neural networks (DNNs): a single actor network and two critic networks. The actor network takes the state as input and outputs a continuous proto-action. Since the proto-actions do not necessarily comply with the available phase quantization levels, the quantizer maps them to the corresponding phase shifts belonging to $\Theta_{\mathrm{BS}}$ and $\Theta_{\mathrm{RIS}}$. After that, the state and action are passed together to the critic networks. The actor and critic networks have duplicates, called the target actor and target critic networks, that provide computational stability. Unlike the actor and critic networks, they are not trained, but are nevertheless used to determine the targets. The parameters of the target actor and target critic networks are updated from those of the actor and critic networks after a predetermined number of training iterations. Since the critic networks can overestimate the true Q-value, TD3 selects the minimum of the two estimates coming from the two target critic networks to limit the bias on the Q-value estimates [16]. Moreover, the actor network is updated using the deterministic policy gradient, and the parameters of the critic networks are updated based on the mean squared error loss [16], [18].
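The paper does not spell out the training loop beyond this description; as an illustration, the following PyTorch sketch implements the three TD3 mechanisms named above (clipped double-Q, delayed policy and target updates, target policy smoothing) for a generic continuous proto-action. The quantizer and Wolpertinger lookup are omitted, and all network sizes and hyperparameters are our own assumptions.

```python
import copy
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class TD3:
    def __init__(self, s_dim, a_dim, gamma=0.99, tau=0.005,
                 policy_noise=0.2, noise_clip=0.5, policy_delay=2):
        self.actor = nn.Sequential(mlp(s_dim, a_dim), nn.Tanh())  # proto-action in [-1, 1]
        self.critic1, self.critic2 = mlp(s_dim + a_dim, 1), mlp(s_dim + a_dim, 1)
        self.t_actor = copy.deepcopy(self.actor)                  # frozen target copies
        self.t_critic1 = copy.deepcopy(self.critic1)
        self.t_critic2 = copy.deepcopy(self.critic2)
        self.a_opt = torch.optim.Adam(self.actor.parameters(), lr=1e-3)
        self.c_opt = torch.optim.Adam(list(self.critic1.parameters()) +
                                      list(self.critic2.parameters()), lr=1e-3)
        self.gamma, self.tau, self.policy_delay = gamma, tau, policy_delay
        self.policy_noise, self.noise_clip = policy_noise, noise_clip
        self.step = 0

    def update(self, s, a, r, s2):
        # Target policy smoothing: perturb the target action with clipped noise
        with torch.no_grad():
            noise = (torch.randn_like(a) * self.policy_noise)\
                        .clamp(-self.noise_clip, self.noise_clip)
            a2 = (self.t_actor(s2) + noise).clamp(-1, 1)
            sa2 = torch.cat([s2, a2], dim=1)
            # Clipped double-Q: take the smaller of the two target critic values
            q_target = r + self.gamma * torch.min(self.t_critic1(sa2),
                                                  self.t_critic2(sa2))
        sa = torch.cat([s, a], dim=1)
        c_loss = nn.functional.mse_loss(self.critic1(sa), q_target) \
               + nn.functional.mse_loss(self.critic2(sa), q_target)
        self.c_opt.zero_grad(); c_loss.backward(); self.c_opt.step()

        self.step += 1
        if self.step % self.policy_delay == 0:   # delayed policy and target updates
            a_loss = -self.critic1(torch.cat([s, self.actor(s)], dim=1)).mean()
            self.a_opt.zero_grad(); a_loss.backward(); self.a_opt.step()
            for net, tgt in [(self.actor, self.t_actor),
                             (self.critic1, self.t_critic1),
                             (self.critic2, self.t_critic2)]:
                for p, tp in zip(net.parameters(), tgt.parameters()):
                    tp.data.mul_(1 - self.tau).add_(self.tau * p.data)  # Polyak averaging
```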
It is worth noting that the proposed TD3-DRL framework is based solely on SNR measurements and relies neither on CSI nor on user locations. Hence, it is not constrained by the channel coherence time, since the DRL agents can adjust their phase decisions solely based on the UEs' feedback in the downlink or the received signal strength in the uplink.
Without loss of generality, the summation terms in (11) can be written as

$$\sum_{m=1}^{M_{\mathrm{BS}}} \sum_{n=1}^{N_{\mathrm{RIS}}} e^{j\varphi_m} e^{j\theta^b_n} e^{-j\phi^u_n} e^{-j\omega^{\mathrm{BS}}_m} = \underbrace{\sum_{m=1}^{M_{\mathrm{BS}}} e^{j\varphi_m} e^{-j\omega^{\mathrm{BS}}_m}}_{\text{For BS learning}} \; \underbrace{\sum_{n=1}^{N_{\mathrm{RIS}}} e^{j\theta^b_n} e^{-j\phi^u_n}}_{\text{For RIS learning}}, \quad (18)$$
where the dedicated BS agent DRL$_0$ is tasked with learning the phases $\varphi_m, \forall m$, that align with $\omega^{\mathrm{BS}}_m, \forall m$ (the left factor of (18)), and each RIS agent DRL$_b, \forall b$, is tasked with learning the phases $\theta^b_n, \forall (n, b)$, that align with $\phi^u_n, \forall (n, u)$ (the right factor of (18)). The RIS reflection codebook learning procedure is explained in the following subsections.
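The factorization in (18) is what lets the BS and RIS agents learn independently; the short NumPy check below verifies numerically that the double sum indeed splits into a product of two single sums (all phase values randomly drawn for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.uniform(-np.pi, np.pi, 32)    # phi_m - omega_m^BS terms (BS side)
b = rng.uniform(-np.pi, np.pi, 256)   # theta_n^b - phi_n^u terms (RIS side)

double_sum = np.sum(np.exp(1j * a)[:, None] * np.exp(1j * b)[None, :])
product = np.sum(np.exp(1j * a)) * np.sum(np.exp(1j * b))
assert np.allclose(double_sum, product)  # the separability behind Eq. (18)
```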
Fig. 2: Beam pattern design framework with the TD3 DRL agent: the actor network maps the state (phase vector) to a proto-action that the quantizer converts into the chosen phase vector; two critic networks with target critics compute the minimum Q-value via MSE and policy losses, while the reward $(+1,-1)$ is obtained by adjusting the beam phases to the new BS/RIS state and measuring the receive combining gain through the RF chain.
C. Learning BS Combiner/Beamformer (w) Design
We first fix a random RIS reflection beam $\boldsymbol{\Phi}_b$ for all users and optimize the BS combining/beamforming pattern $\mathbf{w}$ using the DRL$_0$ agent. As mentioned in Section II, the BS beam pattern $\mathbf{w}$ is common among all users/clusters since the BS-RIS channel $\mathbf{G}$ is shared due to the fixed BS and RIS locations. For the DRL$_0$ agent, the state is given by $\mathbf{s}(t) = [\varphi_1, \ldots, \varphi_m, \ldots, \varphi_{M_{\mathrm{BS}}}]^T \in \Theta_{\mathrm{BS}}$.
D. Learning RIS Reflection Codebook (Φb) Design
To reduce the complexity and the required codebook storage at the RIS, we leverage the fact that some users share similar channels to/from the RIS. Therefore, instead of learning an individual reflection codeword for each user, we exploit $B$ independent DRL agents to learn the RIS reflection patterns of $B$ user clusters, the collection of which forms the RIS reflection codebook.
1) K-Means User Clustering: Since the proposed framework does not depend on explicit CSI, which is not readily available, we leverage a K-means classifier exploiting a set of RIS sensing beams, as explained in [14], that are randomly sampled from the feasible set of (16). First, we use the obtained BS beamforming vector $\mathbf{w}$, and then apply randomly sampled RIS sensing beams (or reflection vectors). The purpose of these sensing beams is to gather sensing information in the form of receive combining gains, which is used to cluster the users, developing a rough sense of their distribution in the environment. The main difference from [14] is that the sensing beams are reflected through the RIS instead of being sent from the BS directly to the users. The BS listens to the RSSI feedback reported by the users during the beam training stage and accumulates the received power vectors. Once enough beam training power vectors are accumulated, a K-means classifier can be trained to group the users, as in the sketch below. It is worth noting that a newly deployed RIS might rely on a random reflection codebook or a pre-defined codebook to serve the users in the meantime [13], [14].
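A minimal sketch of this clustering step using scikit-learn, assuming each user is represented by its vector of received powers over the random RIS sensing beams; the variable names and the number of sensing beams are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
num_users, num_sensing_beams, B = 1000, 16, 8

# Rows: users; columns: receive combining gains measured under each random
# RIS sensing beam (accumulated from RSSI feedback during beam training).
power_matrix = rng.random((num_users, num_sensing_beams))

# Group users with similar UE-RIS channel signatures into B clusters;
# each cluster b is then served by one learned reflection vector Phi_b.
kmeans = KMeans(n_clusters=B, n_init=10, random_state=0).fit(power_matrix)
cluster_of_user = kmeans.labels_  # DRL agent b designs the beam for cluster b
```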
2) RIS Partitioning and Cascaded Learning: Upon user clustering, the RIS reflection codewords of the clusters are independently learned by $B$ DRL agents. Nonetheless, the large number of RIS elements still renders the task of learning even a single reflection vector highly complex and time-consuming. Therefore, we partition the RIS array into multiple sub-arrays and develop a cascaded DRL learning approach to lower the computational complexity. The cascaded approach proceeds with the following steps: 1) learning the RIS reflection of a small RIS sub-array, and 2) extending the learned reflection sub-array to the full-sized array and refining the learning to obtain the entire reflection vector of the whole RIS surface, as described in detail below.
For the sake of a better explanation, let us consider a single-user case and drop the indices for both the reflection vectors in $\mathcal{B}$ and the $U$ users. Without loss of generality, we assume that the whole array is equally divided into $N_p$ sub-arrays (partitions) of $N_s$ elements each, such that $N_s \times N_p = N_{\mathrm{RIS}}$ and $N_s = c \times N_h$, where $c > 0$ is an integer. Then, the right summation term in (18) can be written as
$$\sum_{n=1}^{N_{\mathrm{RIS}}} e^{j\theta_n} e^{-j\phi^{\mathrm{RIS}}_n} \overset{(a)}{=} \sum_{p=1}^{N_p} \sum_{n_s=1}^{N_s} e^{j\left(\theta_{(p-1)N_s+1} + \theta_{n_s}\right)} e^{j\left(n'_1 \vartheta + \left(n'_2 + \frac{N_s}{N_h}(p-1)\right)\psi\right)}$$
$$\overset{(b)}{=} \underbrace{\sum_{p=1}^{N_p} e^{j\theta_{(p-1)N_s+1}} e^{j\frac{N_s}{N_h}(p-1)\psi}}_{\text{Partition-Combining}} \; \underbrace{\sum_{n_s=1}^{N_s} e^{j\theta_{n_s}} e^{j\left(n'_1 \vartheta + n'_2 \psi\right)}}_{\text{RIS sub-array}}, \quad (19)$$
where $\vartheta = -2\pi(\vartheta^t_{l_g} + \vartheta^r_{l_u})$, $\psi = -2\pi(\psi^t_{l_g} + \psi^r_{l_u})$, $n'_1 = \mathrm{mod}(n_s, N_h)$, and $n'_2 = \mathrm{mod}\left(n_s, \frac{N_s}{N_h}\right)$. In the right factor of (19)(b), the phases $\theta_{n_s}$ and $(n'_1 \vartheta + n'_2 \psi)$ are independent of the partition index $p$. Hence, we only need to align the phases $\theta_{n_s}$ with $(n'_1 \vartheta + n'_2 \psi)$, $\forall n_s$, for a single sub-array, and then accommodate the coherent extension of the single sub-array phases. Thereafter, we apply the partition-combining phase shift to align the phases $\theta_{(p-1)N_s+1}$ with $\frac{N_s}{N_h}(p-1)\psi$, as shown in the left factor of (19)(b), to form the full-dimensional reflection beam vector that finally represents the full array. Moreover, the effective RIS reflection phase at the $n$th element can be expressed as
$$\tilde{\theta}_n = \theta'_p + \theta_{n_s}, \quad (20)$$

where $\theta'_p = \theta_{(p-1)N_s+1} \in \Theta_{\mathrm{RIS}}$ is the $p$th partition phase shift and $\theta_{n_s}$ is the phase of the $n_s$th element of the first sub-array, with $n_s \in \{1, \ldots, N_s\}$, $p \in \{1, \ldots, N_p\}$, and $n \in \{1, \ldots, N_{\mathrm{RIS}}\}$. Furthermore, we can express the full-dimensional $N_{\mathrm{RIS}} \times 1$ RIS phase shift vector $\tilde{\boldsymbol{\theta}}$ as follows
$$\tilde{\boldsymbol{\theta}} = [\tilde{\theta}_1, \cdots, \tilde{\theta}_n, \cdots, \tilde{\theta}_{N_{\mathrm{RIS}}}]^T = [\theta'_1 + \theta_1, \theta'_1 + \theta_2, \theta'_1 + \theta_3, \cdots, \theta'_1 + \theta_{N_s}, \cdots, \theta'_p + \theta_1, \theta'_p + \theta_2, \cdots, \theta'_p + \theta_{N_s}, \cdots, \theta'_{N_p} + \theta_1, \cdots, \theta'_{N_p} + \theta_{N_s}]^T. \quad (21)$$
Following from (20), we use cascaded DRL agents, defined as DRL$^1_b$ and DRL$^2_b$, $\forall b$, that consecutively learn the phases in two stages. In the first stage, each DRL$^1_b$, $\forall b$, agent learns the phases of the first sub-array, with state $\mathbf{s}(t) = [\theta_1, \theta_2, \ldots, \theta_{N_s}]^T \in \Theta_{\mathrm{RIS}}$, while keeping the rest of the elements OFF. In the second stage, all $N_{\mathrm{RIS}}$ RIS elements are activated, and each DRL$^2_b$, $\forall b$, agent learns the partitions' phase shifts, with state $\mathbf{s}(t) = [\theta'_1, \theta'_2, \ldots, \theta'_{N_p}]^T \in \Theta_{\mathrm{RIS}}$; the phases of the full-dimensional array are then defined per (20). It is worth pointing out that there is no need to repeat the learning for the other $N_p - 1$ sub-arrays: the cascaded DRL agents first learn the phases of the first sub-array and then learn the partitions' phase shifts to form the full-dimensional RIS array. Consequently, the maximum number of phases that need to be learned simultaneously throughout the cascaded DRL process is only $\max(N_s, N_p)$. As the number of phases is decreased from $N_{\mathrm{RIS}}$ to $\max(N_s, N_p)$, the size of the search space is significantly decreased, which helps the algorithm converge faster. It is worth noting that, since both phases $\theta'_p$ and $\theta_{n_s}$ in (20) are selected from $\Theta_{\mathrm{RIS}}$, the effective phase $\tilde{\theta}_n$ still satisfies the discrete phase shifter constraint. A short sketch of the full-vector construction in (20)-(21) follows.
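The construction in (20)-(21) simply tiles the learned sub-array phases across partitions and adds a per-partition combining offset; the NumPy sketch below (with our own notation) makes this explicit.

```python
import numpy as np

def full_ris_phases(theta_sub, theta_part):
    """Build the full N_RIS phase vector per Eqs. (20)-(21): element n in
    partition p gets theta'_p + theta_{n_s}."""
    n_p, n_s = len(theta_part), len(theta_sub)
    # Repeat the sub-array pattern across partitions (tile), then add each
    # partition's combining phase shift theta'_p (repeat); wrap to (-pi, pi]
    # separately if the quantized-phase constraint must be enforced explicitly.
    return np.tile(theta_sub, n_p) + np.repeat(theta_part, n_s)  # length N_RIS

# Example: N_s = 32 sub-array phases and N_p = 8 partition shifts -> 256 phases
theta_sub = np.zeros(32)
theta_part = np.linspace(0, np.pi, 8)
theta_full = full_ris_phases(theta_sub, theta_part)
assert theta_full.shape == (256,)
```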
V. SIMULATION RESULTS
In this section, we evaluate the performance of the proposed MA-DRL based learning approach. In our simulations, we consider the outdoor scenario 'O1 60' from the DeepMIMO dataset [19]. For the generation of the channels $\mathbf{G}$, $\mathbf{h}^u_r$, and $\mathbf{H}_u$, $\forall u \in \{1, 2, \cdots, U\}$, we adopt the following DeepMIMO parameters: 1) scenario name: 'O1 60'; 2) the RIS is located at the position of BS 3; 3) active users: rows 1201 to 1400; 4) number of RIS elements in (x, y, z): (1, 16, 16) (i.e., $N_{\mathrm{RIS}} = 256$); 5) number of multipaths: 5; 6) carrier frequency: 60 GHz. We further select 80 out of the 181 users in each row, yielding a total of 16,000 users. The BS is at row 850 and column 90, with the number of BS antennas in (x, y, z): (1, 8, 4) (i.e., $M_{\mathrm{BS}} = 32$). We set the phase quantization to $q = 4$ bits. Moreover, for the cascaded RIS DRL process, we divide the 256-element RIS into $N_p = 8$ sub-arrays, each having $N_s = 32$ elements.
The critic and actor networks are fully connected DNNs with identical structures, comprising one input layer, one output layer, and two hidden layers. The input dimension of the critic networks equals the cardinality of the state set together with the action set, and their output is the Q-value function. The input and output dimensions of the actor network equal the cardinality of the action set. The dimensions of the hidden layers are larger than the input and output dimensions. The DNNs use the Adam optimizer with a learning rate of $10^{-3}$. It is worth noting that the generated dataset is used to reproduce a wireless system; the proposed framework does not have access to the dataset and blindly learns the target beams.
For comparison purposes, we consider the following benchmarks: 1) the upper-bound beamforming based on the singular value decomposition (SVD) of perfectly known channels with unquantized phase shifters [20]; 2) the DFT-based codebook scanning directions with $M_{\mathrm{BS}}$ candidate beams at the BS and $N_{\mathrm{RIS}}$ candidate reflection beams at the RIS [2]; and 3) the oversampled DFT-based codebook [21] with an oversampling factor of 4, i.e., $4M_{\mathrm{BS}}$ candidate beams at the BS and $4N_{\mathrm{RIS}}$ candidate reflection beams at the RIS.
Fig. 3: Beamforming gain of the cascaded learning process for B = 16 versus iterations: top, the MA-DRL RIS sub-array (first stage); bottom, MA-DRL partition combining (second stage); both compared against the classical DFT codebook, the oversampled DFT codebook (×4), and SVD with perfect CSI.
Fig. 4: Spectral efficiency (bps/Hz) versus number of beams/clusters, comparing the multi-agent DRL solution with the classical DFT codebook, the oversampled DFT codebook (×4), and the SVD solution with perfect CSI.
Fig. 3 investigates the beamforming gain of the cascaded learning process versus the number of iterations. The top subplot shows the first stage of the DRL learning process for an RIS sub-array of $N_s = 32$, whereas the bottom subplot shows the second, partition-combining stage for the full-dimensional RIS array. Fig. 3 shows that the first-stage RIS sub-array DRL method of the proposed cascaded learning process achieves a higher beamforming gain than the best beam in the classical beamsteering codebook with only 4000 iterations. More interestingly, the first-stage RIS sub-array DRL method converges in fewer than $2 \times 10^4$ iterations. Furthermore, the second-stage partition-combining DRL method converges in fewer than 800 iterations. Hence, the partition-combining method for the large RIS accelerates the learning convergence while shrinking the search space. Fig. 4 shows the spectral efficiency achieved by the proposed approach versus the codebook size.
As shown in Fig. 4, the proposed MA-DRL approach outperforms the classical DFT codebook with only a single BS beamformer and a 6-beam RIS codebook, requiring just 6 beam training slots. The classical DFT codebook in [2] devises a multi-beam training method to save beam training overhead that needs $\frac{N_{\mathrm{RIS}}}{N_p}\left(1 + \frac{\log_2(N_p)}{2}\right) = 80$ beam training slots. Compared to [2], our learned 6-beam RIS codebook design thus requires only 7.5% of the beam training overhead. Moreover, the designed 64-beam RIS codebook matches the oversampled DFT codebook, which needs 128 beam training slots for the BS beamformer and 1024 beam training slots for the RIS reflection beams; yet, the proposed 64-beam RIS codebook requires only 6% of the beam training overhead.
VI. CONCLUSION
In this paper, a multi-agent DRL based learning framework has been developed for designing the active beamforming and passive reflection beam codebooks for RIS-assisted mmWave systems. The developed solution incorporates a cascaded learning framework that accelerates convergence, reduces the search space, and decreases computational complexity. Simulation results demonstrate the effectiveness of the proposed approach in learning BS beamforming and RIS reflection beam codebooks that adapt to user distributions and channel characteristics. Additionally, the results show a considerable reduction in beam training overhead compared to DFT codebooks.
REFERENCES
[1] B. Zheng et al., "A survey on channel estimation and practical passive beamforming design for intelligent reflecting surface aided wireless communications," IEEE Commun. Surveys Tuts., pp. 1–1, 2022.
[2] C. You, B. Zheng, and R. Zhang, "Fast beam training for IRS-assisted multiuser communications," IEEE Wireless Commun. Lett., vol. 9, no. 11, pp. 1845–1849, Nov. 2020.
[3] S. Mabrouki, I. Dayoub, Q. Li, and M. Berbineau, "Codebook designs for millimeter-wave communication systems in both low- and high-mobility: Achievements and challenges," IEEE Access, vol. 10, pp. 25786–25810, 2022.
[4] Y. Wang, N. J. Myers, N. González-Prelcic, and R. W. Heath, "Site-specific online compressive beam codebook learning in mmWave vehicular communication," IEEE Trans. Wireless Commun., vol. 20, no. 5, pp. 3122–3136, 2021.
[5] A. Abdallah, A. Celik, M. M. Mansour, and A. M. Eltawil, "Deep learning-based channel estimation for wideband RIS-aided mmWave MIMO system with beam squint," in Proc. IEEE Int. Conf. Commun. (ICC), Seoul, South Korea, 2022, pp. 1269–1275.
[6] A. Abdallah, A. Celik, M. M. Mansour, and A. M. Eltawil, "RIS-aided mmWave MIMO channel estimation using deep learning and compressive sensing," IEEE Trans. Wireless Commun., pp. 1–1, 2022.
[7] H. Guo et al., "Weighted sum-rate maximization for intelligent reflecting surface enhanced wireless networks," in Proc. IEEE Global Commun. Conf. (GLOBECOM), 2019, pp. 1–6.
[8] R. Liu, M. Li, Q. Liu, and A. L. Swindlehurst, "Joint symbol-level precoding and reflecting designs for IRS-enhanced MU-MISO systems," IEEE Trans. Wireless Commun., vol. 20, no. 2, pp. 798–811, Feb. 2021.
[9] H. Ur Rehman et al., "Joint active and passive beamforming design for IRS-assisted multi-user MIMO systems: A VAMP-based approach," IEEE Trans. Commun., vol. 69, no. 10, pp. 6734–6749, Oct. 2021.
[10] Y. Zhu et al., "Deep reinforcement learning based joint active and passive beamforming design for RIS-assisted MISO systems," arXiv preprint arXiv:2202.11702, 2022.
[11] W. Wang et al., "Joint beam training and positioning for intelligent reflecting surfaces assisted millimeter wave communications," IEEE Trans. Wireless Commun., vol. 20, no. 10, pp. 6282–6297, Oct. 2021.
[12] C. Huang, R. Mo, and C. Yuen, "Reconfigurable intelligent surface assisted multiuser MISO systems exploiting deep reinforcement learning," IEEE J. Sel. Areas Commun., vol. 38, no. 8, pp. 1839–1850, 2020.
[13] Y. Zhang, M. Alrabeiah, and A. Alkhateeb, "Learning reflection beamforming codebooks for arbitrary RIS and non-stationary channels," arXiv preprint arXiv:2109.14909, 2021.
[14] Y. Zhang and A. Alkhateeb, "Reinforcement learning of beam codebooks in millimeter wave and terahertz MIMO systems," IEEE Trans. Commun., 2022.
[15] A. Wong, T. Bäck, A. V. Kononova, and A. Plaat, "Multiagent deep reinforcement learning: Challenges and directions towards human-like approaches," arXiv preprint arXiv:2106.15691, 2021.
[16] S. Fujimoto, H. van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," in Proc. Int. Conf. Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 80. PMLR, Jul. 2018, pp. 1587–1596.
[17] G. Dulac-Arnold et al., "Deep reinforcement learning in large discrete action spaces," arXiv preprint arXiv:1512.07679, 2015.
[18] A. Abdallah, A. Celik, M. M. Mansour, and A. M. Eltawil, "Multi-agent deep reinforcement learning for beam codebook design in RIS-aided systems." [Online]. Available: http://hdl.handle.net/10754/685254
[19] A. Alkhateeb, "DeepMIMO: A generic deep learning dataset for millimeter wave and massive MIMO applications," in Proc. Inf. Theory and Appl. Workshop (ITA), San Diego, CA, Feb. 2019, pp. 1–8.
[20] R. W. Heath, N. González-Prelcic et al., "An overview of signal processing techniques for millimeter wave MIMO systems," IEEE J. Sel. Topics Signal Process., vol. 10, no. 3, pp. 436–453, Apr. 2016.
[21] "Massive MIMO for new radio," Samsung white paper, Dec. 2020. [Online]. Available: https://www.samsung.com/global/business/networks/insights/white-papers/1208-massive-mimo-for-new-radio/