Convergence Rates of Average-Reward Multi-agent Reinforcement Learning via Randomized Linear Programming


Item Type Conference Paper

Authors Koppel, Alec;Singh Bedi, Amrit;Ganguly, Bhargav;Aggarwal, Vaneet

Citation Koppel, A., Singh Bedi, A., Ganguly, B., & Aggarwal, V. (2022). Convergence Rates of Average-Reward Multi-agent Reinforcement Learning via Randomized Linear Programming. 2022 IEEE 61st Conference on Decision and Control (CDC). https://doi.org/10.1109/cdc51059.2022.9992556

Eprint version Pre-print

DOI 10.1109/CDC51059.2022.9992556

Publisher IEEE

Rights This is an accepted manuscript version of a paper before final publisher editing and formatting. Archived with thanks to IEEE. This file is an open access version redistributed from: http://arxiv.org/pdf/2110.12929

Download date 2023-12-03 20:05:37

Link to Item http://hdl.handle.net/10754/689851


Convergence Rates of Average-Reward Multi-agent

Reinforcement Learning via Randomized Linear Programming

Alec Koppel^∗†, Amrit Singh Bedi^∗^$, Bhargav Ganguly^‡, and Vaneet Aggarwal^‡

Abstract— In tabular multi-agent reinforcement learning with average-cost criterion, a team of agents sequentially interacts with the environment and observes local incentives.

We focus on the case in which the global reward is a sum of local rewards, the joint policy factorizes into agents’ marginals, and the state is fully observable. To date, few global optimality guarantees exist even for this simple setting, as most results yield convergence to stationarity for parameterized policies in large/possibly continuous spaces. To solidify the foundations of MARL, we build upon linear programming (LP) reformulations, for which stochastic primal-dual methods yield a model-free approach to achieve optimal sample complexity in the centralized case. We develop multi-agent extensions, whereby agents solve their local saddle point problems and then perform local weighted averaging. We establish that the sample complexity to obtain near-globally optimal solutions matches tight dependencies on the cardinality of the state and action spaces, and exhibits classical scalings with respect to the network in accordance with multi-agent optimization.

Experiments corroborate these results in practice.

I. INTRODUCTION

In multi-agent reinforcement learning (MARL), a collection of agents repeatedly interact with their environment and are exposed to localized incentives. This framework has gained traction in recent years through successful application to autonomous vehicular networks [1], games [2], and various settings in econometrics [3], [4]. At the core of MARL is a Markov Decision Process (MDP) [5], which determines the interplay between agents, states, actions, and rewards.

We focus on the standard objective, whereby the goal of the network agents is to discern policies so as to maximize the long-term accumulation of instantaneous rewards, which may be written as a node-separable sum of all localized rewards [6].

Defining the team reward in this way implies that agents seek to cooperate towards a common goal, which may be contrasted with competitive or mixed settings [7]. Due to the surge of interest in MARL, disparate possible technical settings have been considered, which span how one defines MDP transition dynamics, the observability of agents’ trajectories, the availability of computational resources at a centralized location, and the protocol by which agents

∗Equal contributions.

A. Koppel is with the Supply Chain Optimization Technologies at Amazon, Seattle (work completed while at the U.S. Army Research Laboratory in Adelphi, MD 20783).

A. S. Bedi is with the Institute of Systems Research, University of Maryland, College Park, Maryland, USA.

B. Ganguly is with Purdue University, USA. V. Aggarwal is with Purdue University, USA and King Abdullah University of Science and Technology, Saudi Arabia. {bganguly,vaneet}@purdue.edu

exchange information. We consider the case that agents have global knowledge of the state and action (in contrast to partial observability [8], [9], which necessitates pooling information as in centralized training decentralized execution (CTDE) [10], [11], [12], [13]). Further, we hypothesize that the team’s joint policy factorizes into the product of marginals, which is referred to as joint action learners (JAL) [14], [15].

Our focus is on decentralized training of JAL, which means agents’ rewards and policy parameters are locally held and private. Numerous recent works on MARL operate in this setting, as in multi-agent extensions of temporal difference (TD) learning [16], [17], Q-learning [18], value iteration [19], [20], and actor-critic [21], [22]. In these works, agents may communicate according to the connectivity of a possibly time-varying graph, which is intimately connected to multi-agent optimization.

In the aforementioned references (limited to discounted objectives), convergence guarantees are mostly asymptotic, apply only to MARL sub-problems such as policy evaluation (estimating the value function assuming a fixed policy [27], [28]), or, due to the non-convexity induced by policy parameterization, cannot avoid spurious¹ policies [26], [29]; see [30] for further details.

For these reasons, we focus on the LP reformulation of RL in the average reward setting [31], [32], [33], for which the stochastic primal-dual method achieves optimal sample complexity in the centralized tabular case [24].² Our goal is to understand in which settings the $\tilde{O}\big(\tau^{2} t_{mix}^{2} |S||A| \epsilon^{-2} \log(1/\delta)\big)$ complexity achieved for finding an $\epsilon$-optimal solution with probability $1-\delta$ in the centralized case [24, Theorem 4] may be translatable to the multi-agent setting when agents may only exchange local information with their one-hop neighbors. In particular, when agents combine localized stochastic primal-dual methods with weighted averaging to diffuse information across the network [6], [34], we seek to determine whether the optimal sample complexity of the LP approach generalizes to average-reward tabular MARL.

Our contributions are to:

(i) propose a novel multi-agent variant of the dual LP formulation of RL, where agents’ decisions are defined by estimates of an average state-action occupancy measure and value vector, and consensus constraints are imposed on agents’ localized estimates (Sec. II).

¹Spurious here should be interpreted in the sense of stationary points that are far from global optimality.

²In particular, we use $O(1)$ to denote an absolute constant, and $\tilde{O}(\cdot)$ to hide polylog factors in $|S|$, $|A|$, and $\epsilon$, where $|S|$ and $|A|$ are the respective cardinalities of the state and action spaces, and $\epsilon$ is some pre-defined optimization error.


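| References | Rewards | Setting | Sample Complexity |
|---|---|---|---|
| [23] | Discounted | Centralized | $\tilde{O}\big(n \lvert S\rvert \lvert A\rvert / (1-\gamma)^{2}\big)$ |
| [24], [25] | Average | Centralized, Parallel | $\Omega\big(\tau^{2} t_{mix}^{2} \mathcal{E}^{0} \lvert S\rvert \lvert A\rvert \epsilon^{-2} \log(1/\delta)\big)$ |
| [23] | Average | Parallel | $\tilde{O}\big(n \lvert S\rvert \lvert A\rvert / (1-\gamma)^{2}\big)$ |
| [26] | Average | Decentralized | - |
| This work | Average | Decentralized | $\Omega\big(\tau^{2} t_{mix}^{2} \sqrt{n}\, \mathcal{E}^{0} \lvert S\rvert \lvert A\rvert D(\Gamma,\rho)\, \epsilon^{-2} \log(1/\delta)\big)$ |

TABLE I. MARL for the average-reward case. The proposed scheme is the first decentralized algorithm in the average reward setting with PAC sample complexity. Here, $\epsilon$ is the accuracy parameter, $\delta$ denotes the high probability parameter, $n$ is the number of agents, $D(\Gamma, \rho) := \frac{1+\Gamma}{1-\rho}$ where $\Gamma = (1-\eta/(4n^{2}))^{-2}$, $\rho = (1-\eta/(4n^{2}))^{1/B}$, $B$ is the network strong connectivity parameter, and $\eta$ is a lower bound on the entries of the mixing matrix.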

(ii) owing to node-separability of the Lagrangian relaxation of the resulting optimization problem, we derive a decentralized model-free training mechanism based on a stochastic variant of the primal-dual method that employs the Kullback-Leibler (KL) divergence as its proximal term in the space of occupancy measures (Sec. III), together with local weighted averaging.

(iii) establish that the number of samples required to attain near-globally optimal solutions matches tight dependencies on the cardinality of the state and action spaces [24], and exhibits classical scalings with the size of the team in prior theory [6].

(iv) demonstrate the experimental merits of this approach in cooperative navigation problems.

Additional Context. Local averaging as a strategy for information mixing in multi-agent optimization is outperformed by schemes based upon Lagrange multiplier exchange, e.g., primal-dual method [35], alternating direction method of multipliers (ADMM) [36], and dual reformulations [37]. In this work, however, we opt for a primal-only approach to enforcing consensus for simplicity and its compatibility with Perron-Frobenius theory [38].

We further focus on the case where the communications network is a structural component of the problem setting, as in [21], [22]. However, a separate but related body of work estimates the communications architecture when agents’ behavior is fixed, using graph neural networks [39], [40], [41] or statistical tests for correlation between agents’ local utilities [29], [42].

To the best of our knowledge, none of the aforementioned works deal with the average reward setting in MARL, with the exception of [26], which provides only asymptotic analysis. By contrast, the probably approximately correct (PAC) sample complexity results given here are unique to the MARL average-reward setting, and may be seen as a multi-agent generalization of [24]. Critical to this generalization is a novel Lyapunov function that jointly tracks the convergence of the primal-dual iterates and the consensus error. We note that PAC results have been developed for average-reward MARL in [25]. However, it operates under a setting where policy and reward information are globally shared among agents at each step, which in the optimization literature is known as parallel [43], not decentralized, as the updates cannot be executed with local and neighboring information only. For results most similar to this work in the MARL setting, please see Table I.

II. PROBLEM FORMULATION

We consider MARL problems among agents who share a globally observable state, but take actions and observe rewards which are distinctly local. In this context, agents seek to coordinate in order to maximize the team’s cumulative return of rewards, which is a sum over all locally observed rewards. More specifically, we consider a time-varying network $G^t = (V, E^t, W^t)$ of $n$ agents $N = \{1, 2, \dots, n\}$, where agent $i \in N$ may communicate with its neighbors $j$ if they share an edge $(i, j) \in E^t$, and no others, at a given time $t$. The weight matrix $W^t := [w_{ij}^t] \in \mathbb{R}^{n \times n}$, where $w_{ij}^t \ge 0$ and $w_{ij}^t = w_{ji}^t$ for all $i, j, t$, assigns weights to each edge $(i, j)$.

One canonical example of $w_{ij}^t$ is the relative degree between agents $i$ and $j$ at time $t$: $w_{ij}^t = d_i^t/(d_i^t + d_j^t)$, with $d_i^t$ the degree, i.e., the number of nodes that are one-hop neighbors of agent $i$.

With the network structure clarified, we now detail how the states, actions, and rewards interconnect. Precisely, at each time, agent $i \in V$ observes the current system state $s \in S$ and synchronously takes an action $a_i \in A^i$, which is concatenated as the joint action $a = (a_1, \dots, a_n) \in A^1 \times \cdots \times A^n$. The state space $S$ and the constituent action spaces $A^i$ are discrete finite sets with respectively $|S|$ and $|A|$ elements.

Trajectories are Markovian, that is, upon execution of the joint action $a$, the state transitions to next state $s'$ with probability $p_{s,s'}(a) := P(s' \mid s, a)$. That the joint action $a$ is observed by all agents after execution is needed to ensure full observability, i.e., that the MARL problem can be defined by an MDP. After the joint action $a$ is executed in state $s$, each agent $i$ receives a reward $r^i(s, a) \in [0, 1]$, known only to agent $i$. The system reward $r(s, a)$ is defined as the aggregation of local rewards, $r(s, a) := \frac{1}{n}\sum_{i=1}^{n} r^i(s, a)$. The goal of the cooperative agents here is the maximization of the global cumulative return defined as

$$\max_{\pi} J^{\pi}(s) := \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[ \sum_{t=0}^{T-1} r(s_t, a_t) \,\Big|\, s_0 = s \right], \qquad (1)$$

where $\pi$ denotes the joint policy of all agents, that is, a probability distribution over the joint action space given the system state, $\pi : S \times A \to [0, 1]$. The joint policy prescribes the probability that a joint action $a = (a_1, \dots, a_n)$ is taken by the collection of agents when in system state $s$, which we assume factors into marginals of each individual agent’s policy: $\pi(a|s) = \prod_{i=1}^{n} \pi_i(a_i|s)$. That is, the local policies are statistically independent, and are denoted $\pi_i(a_i|s)$, which defines the probability of agent $i$ taking action $a_i$ when in state $s$. Moreover, the expectation in (1) is over the product measure associated with the state transition dynamics and the policy, known as the ergodic state occupancy measure. We remark here that a similar policy factorization


has been considered widely in the existing literature [17], [16], [18], [20], [19], [21], [22] which is important for the value function factorization. Moreover, this does not mean that the occupancy measure factorizes into local occupancy measures.

Our specific goal in this work is the design of policy optimization schemes to solve (1) such that each agent, upon the basis of its local action selections and rewards, together with information exchange amongst neighbors, and in possession of global state-action information, learns local policy parameters that result in the overall team attaining the optimal value (1). We place specific emphasis upon the non-asymptotic convergence of such schemes and their scaling with respect to the parameters of the network $G$. Moreover, we consider the model-free setting, i.e., the dynamics of the environment (the transition probabilities and transitional rewards) are unknown to the agents, but a simulation oracle is available to the team to generate state-action-reward tuples $(s, a, r)$. We require that the transition dynamics for a fixed policy define an irreducible Markov chain: for each state pair $(s, s')$ and any policy $\pi$, there exists $t$ such that the probability that the system transitions from state $s$ to state $s'$ under policy $\pi$ in $t$ time-steps is non-zero. This condition is sufficient to ensure that the limit in (1) exists and that $J^{\pi}(s) =: \lambda^{\pi}$ for all states $s$. Equivalently, the average cost is independent of the initial state of the system. Further, the optimal policy is time-invariant.

Towards transforming (1) into a workable form for deriving iterative model-free updates, we note that an optimal policy satisfies the average-cost Bellman equation [44] written as

$$\lambda^{\pi} + v_s = \max_{a \in A} \left\{ \sum_{s'} p_{s,s'}(a)\, r(s, a) + \sum_{s'} p_{s,s'}(a)\, v_{s'} \right\}, \qquad (2)$$

for all $s \in S$. Denote solutions to Bellman’s equation by pairs $(\lambda^*, v^*)$, where the scalar $\lambda^* = \max_{\pi} J^{\pi}(s)$ in (1) is unique and equal to the optimal average cost. The value vector $v \in \mathbb{R}^{|S|}$ (which aggregates scalars $v_s$ for each $s \in S$) is called a differential reward function and is unique up to a constant. Uniqueness is imposed by $(v^*)^T \xi^* = 0$, where $\xi^*$ is the stationary distribution under the optimal policy $\pi^*$, i.e., $P^{\pi^*} \xi^* = \xi^*$. Note that each policy $\pi$ is associated with a transition probability $P^{\pi}_{s,s'} := \sum_{a} \pi(a|s)\, p_{s,s'}(a)$ for all $s, s' \in S$, and a stationary state distribution $\xi^{\pi}$, which is a probability distribution that remains unchanged in the Markov chain as time progresses, i.e., $P^{\pi} \xi^{\pi} = \xi^{\pi}$. The differential reward function characterizes the transient effect of the initial state under a policy $\pi$.
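To make the policy-induced chain concrete, the following minimal sketch builds $P^{\pi}$ from per-action transition matrices and approximates the stationary distribution and the average reward by power iteration. The two-state, two-action MDP, the function names, and the row-vector convention (so that the fixed point is read as $\xi = (P^{\pi})^T \xi$) are illustrative assumptions, not taken from the paper.

```python
def induced_chain(P, pi):
    """Return P^pi with entries sum_a pi(a|s) * p_{s,s'}(a).

    P[a][s][s2] is the transition probability under action a,
    pi[s][a] is the probability of action a in state s.
    """
    n_a, n_s = len(P), len(pi)
    return [[sum(pi[s][a] * P[a][s][s2] for a in range(n_a))
             for s2 in range(n_s)] for s in range(n_s)]


def stationary_distribution(P_pi, iters=500):
    """Power iteration xi <- xi P^pi, started from the uniform distribution."""
    n_s = len(P_pi)
    xi = [1.0 / n_s] * n_s
    for _ in range(iters):
        xi = [sum(xi[s] * P_pi[s][s2] for s in range(n_s)) for s2 in range(n_s)]
    return xi


def average_reward(xi, pi, r):
    """lambda^pi = sum_s xi_s sum_a pi(a|s) r(s, a)."""
    return sum(xi[s] * pi[s][a] * r[s][a]
               for s in range(len(pi)) for a in range(len(pi[s])))


# Hypothetical 2-state, 2-action MDP (values chosen only for illustration).
P = [[[0.9, 0.1], [0.2, 0.8]],   # transitions under action 0
     [[0.5, 0.5], [0.6, 0.4]]]   # transitions under action 1
pi = [[0.5, 0.5], [0.5, 0.5]]    # uniform policy
r = [[1.0, 0.0], [0.0, 1.0]]     # rewards in [0, 1]

P_pi = induced_chain(P, pi)
xi = stationary_distribution(P_pi)
lam = average_reward(xi, pi, r)
```

Because the toy chain is irreducible and aperiodic, the iteration converges regardless of the starting distribution, which mirrors the statement that the average reward does not depend on the initial state.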

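Continue then by noting that the optimal joint policy $\pi^*$ may be formulated as the following LP [33]:

$$\max_{\mu \in \mathbb{R}^{|S| \times |A|}} \sum_{a \in A} \mu(a)^T r(a) \quad \text{s.t.} \quad \begin{cases} \sum_{a \in A} (I - P_a^T)\, \mu(a) = 0, & \forall s, \\ \sum_{s \in S, a \in A} \mu(s, a) = 1, \\ \mu(s, a) \ge 0, & \forall a, s, \end{cases} \qquad (3)$$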

where $I$ is an identity matrix of the appropriate size and $P_a \in \mathbb{R}^{|S| \times |S|}$ is the matrix whose $(s, s')$-th entry equals $p_{s,s'}(a)$. Moreover, $\mu(a) \in \mathbb{R}^{|S|}$ denotes the unnormalized occupancy measure over the state space $S$ for each action $a \in A$, whose stacking over the action space $A$ is denoted as $\mu \in \mathbb{R}^{|S| \times |A|}$. For every feasible point $\mu = (\mu(a))_{a \in A}$ of the above linear program, $\xi^{\pi} = (\xi^{\pi}_s)_{s \in S}$ is the stationary state distribution, where $\xi^{\pi}_s = \sum_{a} \mu(s, a)$, and $\sum_{s,a} \mu(s, a)\, r(s, a)$ corresponds to the average reward $\lambda^{\pi}$ of policy $\pi$. Through normalization, one may recover the associated policy $\pi$ for any feasible $\mu$ as $\xi^{\pi}_s = \sum_{a \in A} \mu(s, a)$ and $\pi(a|s) = \mu(s, a) / \sum_{a \in A} \mu(s, a)$, and from the definition of $\xi^{\pi}_s$ and $\pi(a|s)$, it holds that $\mu(s, a) = \xi^{\pi}_s \pi(a|s)$. Then, an optimal joint policy $\pi^*$ can be constructed by normalizing the occupancy measures associated with the solution to the above linear program (see [5] and references therein for details):

$$\pi^*(a|s) = \frac{\mu^*(s, a)}{\sum_{a} \mu^*(s, a)}. \qquad (4)$$

By substituting the definition of the global reward $r(s, a)$ in terms of the local rewards $r^i(s, a)$ into (3), we obtain a multi-agent optimization problem with the global variables $\mu(s, a)$ corresponding to the joint policy $\pi$:

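$$\max_{\mu \in \mathbb{R}^{|S| \times |A|}} \sum_{i=1}^{n} \sum_{a \in A} \mu(a)^T r^i(a) \quad \text{subject to:} \quad \begin{cases} \sum_{a} (I - P_a^T)\, \mu(a) = 0, & \forall s \in S, \\ \sum_{s,a} \mu(s, a) = 1, \\ \mu(s, a) \ge 0, & \forall s \in S,\ a \in A. \end{cases} \qquad (5)$$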

To solve (5), agents must cooperate in their policy search.

With each agent only exercising control over their localized policy, the globally optimal joint policy π^∗(a|s) may be obtained via (4). With the setting clarified, we next shift to developing a decentralized model-free algorithm to solve (1) upon the basis of Lagrangian relaxation.
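The relationship between a policy, its occupancy measure, and the LP constraints can be checked numerically. The sketch below is an illustration under assumed toy data (a hypothetical two-state, two-action MDP and helper names of our choosing): it forms $\mu(s,a) = \xi^{\pi}_s \pi(a|s)$, evaluates the flow-conservation residual $\sum_a (I - P_a^T)\mu(a)$, and recovers the policy by the normalization in (4).

```python
def occupancy_from_policy(P, pi, iters=500):
    """Build mu(s, a) = xi_s * pi(a|s), with xi found by power iteration."""
    n_a, n_s = len(P), len(pi)
    P_pi = [[sum(pi[s][a] * P[a][s][s2] for a in range(n_a))
             for s2 in range(n_s)] for s in range(n_s)]
    xi = [1.0 / n_s] * n_s
    for _ in range(iters):
        xi = [sum(xi[s] * P_pi[s][s2] for s in range(n_s)) for s2 in range(n_s)]
    return [[xi[s] * pi[s][a] for a in range(n_a)] for s in range(n_s)]


def flow_residual(P, mu):
    """Entry s' of sum_a (I - P_a^T) mu(a): outflow minus inflow at s'."""
    n_a, n_s = len(P), len(mu)
    return [sum(mu[s2][a] for a in range(n_a))
            - sum(P[a][s][s2] * mu[s][a] for s in range(n_s) for a in range(n_a))
            for s2 in range(n_s)]


def policy_from_occupancy(mu):
    """Normalize each row of mu over actions, as in (4)."""
    return [[m / sum(row) for m in row] for row in mu]


# Hypothetical MDP and policy, for illustration only.
P = [[[0.9, 0.1], [0.2, 0.8]],   # transitions under action 0
     [[0.5, 0.5], [0.6, 0.4]]]   # transitions under action 1
pi = [[0.25, 0.75], [0.5, 0.5]]

mu = occupancy_from_policy(P, pi)
residual = flow_residual(P, mu)
pi_recovered = policy_from_occupancy(mu)
```

For this policy the residual vanishes up to floating-point error and the recovered policy coincides with the one used to build $\mu$, which is the sense in which feasible points of the LP correspond to stationary policies.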

III. RANDOMIZED PRIMAL-DUAL METHOD

In this section, we reformulate the multi-agent LP of (5) as a saddle point problem by considering its Lagrangian relaxation. In particular, we formulate the following saddle point problem

$$\min_{v \in V} \max_{\mu \in U} L(\mu, v) := \sum_{i=1}^{n} \sum_{a \in A} \mu(a)^T \left[ \tfrac{1}{n}(P_a - I)\, v + r^i(a) \right]. \qquad (6)$$

Note that we have computed the transpose of the constraint to simplify the expression. Under Assumptions 3 and 4 introduced in Sec. IV, we may establish that the primal-dual optimal pair $(v^*, \mu^*)$ of (6) belongs to the following restricted sets for the value $V \subset \mathbb{R}^{|S|}$ and occupancy measures $U \subset \mathbb{R}^{|S| \times |A|}$, defined as

$$V = \left\{ v \in \mathbb{R}^{|S|} : \|v\|_{\infty} \le 2 t_{mix} \right\}, \qquad U = \left\{ \mu = (\mu_a)_{a \in A} : e^T \mu = 1,\ \mu \ge 0,\ \sum_{a \in A} \mu(a) \ge \frac{1}{\sqrt{\tau}\,|S|}\, e \right\}, \qquad (7)$$

Algorithm 1: Randomized Multi-agent Primal-dual (RMAPD) Algorithm

Input: $\epsilon > 0$, $S$, $A$, $t_{mix}$, $\tau$
Set $v^i = 0 \in \mathbb{R}^{|S|}$, $\pi_i = \frac{1}{|A^i|} e \in \mathbb{R}^{|A^i|}$, $\forall s \in S$, $i \in N$
Set $T = (\tau t_{mix})^{2} |S||A|$, $M = 4 t_{mix} + 1$
Set $\beta = \frac{1}{t_{mix}} \sqrt{\frac{\log(|S||A|)}{2|S||A|T}}$, $\alpha = |S|\, t_{mix} \sqrt{\frac{\log(|S||A|)}{2|A|T}}$
for iteration $t = 0, 1, 2, \dots$ do
    for agent $i = 1, 2, \dots, N$ do
        Observe the system state $s$
        Execute action $a_i \sim \pi_i(\cdot|s)$
        Observe local reward $r^i_{s,s'}(a)$
        Send $(\mu_i^t, v_i^t)$ to $j \in n_i$, receive $(\mu_j^t, v_j^t)$
        Compute local weighted averages [cf. (10)]: $\tilde{\mu}_i^t = \sum_{j=1}^{n} w_{ij}^t \mu_j^t$, $\tilde{v}_i^t = \sum_{j=1}^{n} w_{ij}^t v_j^t$
        Conduct entropic ascent w.r.t. $\mu_i^{t+1}$ in (13): $\mu_i^{t+1/2}(s, a) = \frac{\tilde{\mu}_i^t(s, a) \exp(\Delta_i^{t+1}(s, a))}{\sum_{s'} \sum_{a'} \tilde{\mu}_i^t(s', a') \exp(\Delta_i^{t+1}(s', a'))}$, $\mu_i^{t+1} = \arg\min_{\mu_i \in U} D_{KL}(\mu_i \,\|\, \mu_i^{t+1/2})$, with dual gradient $\Delta_i^{t+1}(s, a)$ in (11)
        Update value vector for agent $i$ via (14) as $v_i^{t+1} = \Pi_V[\tilde{v}_i^t + d_i^{t+1}]$ with $d_i^{t+1}$ in (12)

where $t_{mix}$ is the mixing time of the Markov chain, which characterizes how fast the Markov decision process reaches its stationary distribution from any state under any policy (Assumption 4), and $\tau$ is a constant greater than one which characterizes how much the stationary distribution varies as the policy varies (Assumption 3). The definitions of these feasible sets are borne out of the analysis, and mirror [24].

Next, we note that the Lagrangian of (6) is node-separable. Specifically, by defining the local Lagrangian for agent $i \in V$ as

$$L_i(\mu, v) := \sum_{a} \mu(a)^T \left[ \tfrac{1}{n}(P_a - I)\, v + r^i(a) \right], \qquad (8)$$

where $v$ is a column vector with $v_s$ as its $s$-th component, the Lagrangian of the multi-agent problem $L(\mu, v)$ may be decomposed into a sum over local Lagrangians $L_i(\mu, v)$ as $L(\mu, v) = \sum_{i=1}^{n} L_i(\mu, v)$, which permits us to simplify the saddle point problem as

$$\min_{v \in V} \max_{\mu \in U} L(\mu, v) = \sum_{i=1}^{n} L_i(\mu, v). \qquad (9)$$

The min-max problem in (9) is convex in $v$ and concave in $\mu$. We note that the variables $v$ and $\mu$ are common among all the agents in the network, and we are interested in solving the problem in a distributed manner. The expression in (9) suggests employing a solution methodology based on a decentralized stochastic primal-dual method, which is the focus of the following subsection.

A. Stochastic Primal-Dual Method

We propose applying the stochastic primal-dual method to solve (5), which, owing to the node-separability of the Lagrangian, yields a decentralized scheme for policy optimization. In particular, in order to solve the saddle point problem, agents would need to evaluate the global reward, to which they lack access. Instead, agents only observe local rewards. To address this issue, we allow each agent to track a distinctly localized estimate $v_i \in \mathbb{R}^{|S|}$ of the value, which is substituted in place of the global value vector $v$ in (6), and similarly a localized occupancy measure $\mu_i^t$, so that the local pair $(\mu_i^t, v_i^t)$ is used in lieu of the global primal-dual pair $(\mu^t, v^t)$. Then, agent $i$ cooperates with other agents through a weighted averaging of its primal and dual variables, i.e., a convex combination $\tilde{\mu}_i^t$ (resp. $\tilde{v}_i^t$) of its own estimate $\mu_i^t$ (resp. $v_i^t$) with the estimates received from its neighbors $j \in n_i$ at time $t$:

$$\tilde{\mu}_i^t = \sum_{j=1}^{n} w_{ij}^t \mu_j^t, \qquad \tilde{v}_i^t = \sum_{j=1}^{n} w_{ij}^t v_j^t. \qquad (10)$$

Since the algorithm introduces a localized copy $\tilde{\mu}_i$ of the global $\mu(s, a)$, we could recover a marginalized version (say $\hat{\mu}_i(s, a_i)$ of dimension $|S| \times |A^i|$) of the global occupancy measure as $\hat{\mu}_i(s, a_i) = \sum_{a_{-i} \in A_{-i}} \tilde{\mu}_i(s, a)$, where $A_{-i}$ denotes the collection of actions of all other agents except $i$, i.e., for any agent $i$, we have the decomposition $a = (a_i, a_{-i}) \in A$. One may obtain the local policy through normalization, $\pi_i(a_i|s) = \hat{\mu}_i(s, a_i) / \sum_{a_i} \hat{\mu}_i(s, a_i)$.
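The marginalization and normalization just described can be sketched as follows. The dictionary layout (keys `(s, joint_action)` with joint actions stored as tuples) and the function name `local_policy` are assumptions made for this illustration only.

```python
from collections import defaultdict


def local_policy(mu, agent):
    """Marginalize mu(s, a) over the other agents' actions, then normalize.

    mu maps (s, a) to a non-negative weight, where a is a tuple of
    per-agent actions; agent is the index of the agent of interest.
    Returns a dict mapping (s, a_i) to pi_i(a_i | s).
    """
    marginal = defaultdict(float)
    for (s, a), weight in mu.items():
        marginal[(s, a[agent])] += weight        # sum over a_{-i}

    state_mass = defaultdict(float)
    for (s, _), weight in marginal.items():
        state_mass[s] += weight                  # sum over a_i

    return {(s, a_i): weight / state_mass[s]
            for (s, a_i), weight in marginal.items()}


# Hypothetical occupancy measure: one state, two agents, two actions each.
mu = {(0, (0, 0)): 0.1, (0, (0, 1)): 0.2,
      (0, (1, 0)): 0.3, (0, (1, 1)): 0.4}

pi_0 = local_policy(mu, agent=0)
pi_1 = local_policy(mu, agent=1)
```

Note that the product of the recovered marginals need not reproduce the joint measure: here `pi_0[(0, 0)] * pi_1[(0, 0)]` is 0.12 rather than 0.1, consistent with the remark that the occupancy measure itself does not factorize.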

After performing the consensus step in (10), each agent takes a gradient descent (respectively, ascent) step to minimize (respectively, maximize) the local Lagrangian function $L_i$, followed by a projection onto the constraint set $V$ (respectively, $U$). However, since the transition dynamics model is unavailable to agent $i$ (in the form of the transition matrix $P_a$), it cannot evaluate the constraint in (3). This precludes the evaluation of primal and dual gradients of the Lagrangian, which necessitates stochastic approximations of these quantities, which we present jointly with respective step-size parameters $\beta$ and $\alpha$ as

$$\hat{\nabla}_{\mu_i} L_i = \Delta_i^{t+1} = \beta\, \frac{v_i^t(s') - v_i^t(s) + r_i^t(s, s', a) - M}{\tilde{\mu}_i^t(s, a)}\, e_{s,a}, \quad \text{with probability } \tilde{\mu}_i^t(s, a), \qquad (11)$$

$$\hat{\nabla}_{v_i} L_i = d_i^{t+1} = \alpha\, \frac{\mu_i^t(s, a)}{\tilde{\mu}_i^t(s, a)}\, (e_s - e_{s'}), \quad \text{with probability } \tilde{\mu}_i^t(s, a)\, p_{s,s'}(a), \qquad (12)$$

where $M := 4 t_{mix} + 1$ is a “shift parameter” which ensures sufficient decrease of a certain martingale process defined in terms of the KL divergence that arises in the analysis (to be made precise later), and the superscript $t$ denotes the value of the variable at time $t$. Moreover, $t_{mix}$ is the mixing time of the Markov chain induced by a fixed policy (Assumption 4).

Here $e_{s,a}$ is the indicator variable which is 1 for $(s, a)$ and zero otherwise. Further, $e_s$ denotes the standard basis vector with 1 in slot $s$ and zero otherwise. Note that we adopt the convention that, at time-step $t$, the variables with superscript $t$ are known, and the superscript $t+1$ indicates an update direction in terms of random variables realized at time $t$. An additional point of note is that the gradient with respect to the value vector is $-\alpha(e_{s'} - e_s)$, which we swap to cancel out the negative. Then, using these update directions, the stochastic primal-dual method is such that at every $t \ge 0$, each agent $i$ generates new estimates $\mu_i^{t+1}$, $v_i^{t+1}$ as

$$\mu_i^{t+1} = \arg\min_{\mu_i \in U} D_{KL}(\mu_i \,\|\, \mu_i^{t+1/2}), \quad \text{where } \mu_i^{t+1/2}(s, a) = \frac{\tilde{\mu}_i^t(s, a) \exp(\Delta_i^{t+1}(s, a))}{\sum_{s'} \sum_{a'} \tilde{\mu}_i^t(s', a') \exp(\Delta_i^{t+1}(s', a'))}, \qquad (13)$$

$$v_i^{t+1} = \Pi_V[\tilde{v}_i^t + d_i^{t+1}], \qquad (14)$$

where $\Pi_V$ is a Euclidean projection onto the set $V$, and $d_i^{t+1}$ is given in (12). Moreover, $\Delta_i^{t+1}$ is the gradient of the local Lagrangian with respect to $\mu_i$ in (11). Note that the update on $\mu$ is mirror ascent with a Kullback-Leibler (KL) divergence over the unnormalized probability simplex centered at $\tilde{\mu}_i^t$, whereas the gradient step on the value vector $v$ is a simple projected gradient descent centered at $\tilde{v}_i^t$. The descent step on $v$ is written in terms of an addition due to the cancellation of a negative, as mentioned after (12). We assume algorithm initialization as $\mu_i = 0$ and $v_i = 0$ for all $i \in N$. The overall MARL policy optimization scheme based upon randomized primal-dual solutions to the LP formulation is summarized as Algorithm 1.
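A single local update in the spirit of (11)-(14) can be sketched as below. This is a simplified illustration, not the authors' implementation: the reward is taken as $r(s,a)$ rather than $r(s,s',a)$, the KL projection onto $U$ is omitted (only the renormalization is performed), the projection onto $V$ is implemented as entrywise clipping to $[-2t_{mix}, 2t_{mix}]$, and the toy inputs and the name `primal_dual_step` are hypothetical.

```python
import math
import random


def primal_dual_step(mu_avg, mu_loc, v_avg, v_loc, P, r, beta, alpha, t_mix, rng):
    """One sampled update for a single agent.

    mu_avg, v_avg: neighbor-averaged iterates; mu_loc, v_loc: local iterates.
    P[a][s][s2] are transition probabilities, r[s][a] is the local reward.
    """
    n_s, n_a = len(mu_avg), len(mu_avg[0])
    M = 4 * t_mix + 1                              # shift parameter

    # Sample (s, a) from the averaged occupancy measure, then s' from p_{s,.}(a).
    pairs = [(s, a) for s in range(n_s) for a in range(n_a)]
    weights = [mu_avg[s][a] for s, a in pairs]
    s, a = rng.choices(pairs, weights=weights)[0]
    s_next = rng.choices(range(n_s), weights=P[a][s])[0]

    # Dual direction: only the sampled entry (s, a) is non-zero.
    delta = beta * (v_loc[s_next] - v_loc[s] + r[s][a] - M) / mu_avg[s][a]

    # Exponentiated (entropic) update followed by renormalization.
    mu_half = [row[:] for row in mu_avg]
    mu_half[s][a] *= math.exp(delta)
    total = sum(sum(row) for row in mu_half)
    mu_new = [[m / total for m in row] for row in mu_half]

    # Value update along alpha * (mu_loc / mu_avg) * (e_s - e_{s'}), then clipping.
    step = alpha * mu_loc[s][a] / mu_avg[s][a]
    v_new = list(v_avg)
    v_new[s] += step
    v_new[s_next] -= step
    bound = 2 * t_mix
    v_new = [max(-bound, min(bound, x)) for x in v_new]
    return mu_new, v_new


# Hypothetical inputs for a 2-state, 2-action problem.
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.6, 0.4]]]
r = [[1.0, 0.0], [0.0, 1.0]]
mu0 = [[0.25, 0.25], [0.25, 0.25]]
v0 = [0.0, 0.0]
rng = random.Random(0)
mu1, v1 = primal_dual_step(mu0, mu0, v0, v0, P, r,
                           beta=0.05, alpha=0.1, t_mix=2, rng=rng)
```

Since the numerator in the dual direction is non-positive whenever $v \in V$ and $r \le 1$, the sampled entry of the occupancy measure shrinks and the remaining mass is redistributed by the normalization.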

IV. CONVERGENCE ANALYSIS

In this section, we establish the non-asymptotic convergence of the proposed algorithm in the sense that agents’ local primal-dual variables (a) achieve consensus and (b) converge to the primal-dual optimal pair of their local Lagrangians (8). As a consequence, upon the basis of local observations and information exchange with neighbors, agents are able to solve (5), and hence (1). We divide the analysis of the algorithm into two steps. First, we establish that all local estimates achieve consensus. Second, we show that the consensus vectors are in fact a primal-dual optimal pair. To establish these results, we next state the conditions required on the graph $G^t$.

Assumption 1: [Strong Connectivity] There exists a positive integer $B$ such that the graph $(N, \cup_{l=0}^{B-1} E^{t+l})$ is strongly connected for any $t \ge 0$, i.e., every node is reachable from another in at most $B$ time-steps.

Assumption 2: For all $i \in V$ and $t \ge 0$: (a) there exists a scalar $\eta \in (0, 1)$ such that $w_{ij}^t \ge \eta$ when $j \in N_i^t$, and $w_{ij}^t = 0$ otherwise; (b) $\sum_{j=1}^{n} w_{ij}^t = \sum_{i=1}^{n} w_{ij}^t = 1$; that is, the mixing matrix $W^t$ is doubly stochastic.

Assumption 1 ensures that after a union of $B$ time-slots, the network is connected, which ensures that information propagates across the network. Assumption 2 ensures that an agent sufficiently balances the weighting of its own information with that of other agents. It further ensures that the mixing matrices have a Perron-Frobenius eigenvalue associated with an eigenvector whose entries are all 1, i.e., the existence of a vector satisfying consensus. We also make two assumptions on the MDP stationary distribution and mixing time of the chain:
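As an illustration of how a doubly stochastic mixing matrix drives local values to the network average, the sketch below uses Metropolis weights on a fixed three-node path graph. Metropolis weights are a standard choice we substitute here as an assumption; the paper does not prescribe this particular construction, and the graph and values are hypothetical.

```python
def metropolis_weights(adj):
    """Symmetric, doubly stochastic weights from a 0/1 adjacency matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and adj[i][j]:
                W[i][j] = 1.0 / (1 + max(deg[i], deg[j]))
        W[i][i] = 1.0 - sum(W[i])      # self-weight absorbs the remainder
    return W


def mix(W, x):
    """One round of weighted averaging: x_i <- sum_j w_ij x_j."""
    n = len(x)
    return [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]


# Path graph 0 - 1 - 2 with hypothetical local values.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
W = metropolis_weights(adj)

x = [0.0, 3.0, 6.0]
for _ in range(100):
    x = mix(W, x)

row_sums = [sum(row) for row in W]
col_sums = [sum(W[i][j] for i in range(3)) for j in range(3)]
```

Double stochasticity preserves the network average at every round, so the common limit is the mean of the initial values (3.0 in this toy case), while the lower bound $\eta$ on the non-zero weights governs how quickly the disagreement contracts.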

Assumption 3: [Ergodic Decision Process] The Markov decision process is $\tau$-stationary in the sense that it is ergodic under any stationary policy $\pi$ and there exists $\tau > 1$ such that $\frac{1}{\sqrt{\tau}\,|S|}\, e \le \xi^{\pi} \le \frac{\sqrt{\tau}}{|S|}\, e$, where $e$ is a vector of all 1’s.

Assumption 4: [Fast-Mixing Markov Chains] The Markov decision process is $t_{mix}$-mixing in the sense that $t_{mix} \ge \max_{\pi} \min \left\{ t \ge 1 : \|(P^{\pi})^t(s, \cdot) - \xi^{\pi}\|_{TV} \le \frac{1}{4},\ \forall s \in S \right\}$, where $\|\cdot\|_{TV}$ is the total variation norm.

The factor $\tau$ characterizes the variability of the stationary distribution with respect to the policy. $t_{mix}$ defines how fast the MDP reaches its stationary distribution from any state under any policy $\pi$. Assumptions 1-4 are standard conditions in the analysis of policy search methods for the average reward criterion [24] in single-agent settings.
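For a single fixed policy, the quantity inside the maximum in Assumption 4 can be computed directly for a small chain. The sketch below is illustrative only: the symmetric two-state chain is a hypothetical example, and the full assumption additionally takes the maximum over all policies, which is not attempted here.

```python
def tv_distance(p, q):
    """Total variation distance between two distributions on a finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mixing_time(P_pi, xi, tol=0.25, t_max=10_000):
    """Smallest t >= 1 with max_s TV((P^pi)^t(s, .), xi) <= tol."""
    Pt = [row[:] for row in P_pi]
    for t in range(1, t_max + 1):
        if max(tv_distance(row, xi) for row in Pt) <= tol:
            return t
        Pt = mat_mul(Pt, P_pi)
    raise ValueError("chain did not mix within t_max steps")


# Hypothetical slowly mixing chain with uniform stationary distribution.
P_pi = [[0.95, 0.05],
        [0.05, 0.95]]
xi = [0.5, 0.5]
t_mix = mixing_time(P_pi, xi)
```

For this chain the distance after $t$ steps equals $0.5 \cdot 0.9^{t}$, which first drops below $1/4$ at $t = 7$; stickier chains yield larger $t_{mix}$ and therefore a larger shift parameter $M = 4t_{mix} + 1$.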

A. Primal-dual Optimality

To show that the consensus vector coincides with a primal-dual optimal pair, we show that, at each iteration of the algorithm, the local iterates get closer to the local primal-dual optimal pair in expectation. It turns out that to do so, we must first show that agents’ estimates reach approximate consensus, and then construct a Lyapunov function with respect to quantities defined in terms of network averages. We proceed to do so next.

Achieving consensus. We establish that the local iterates converge to the global mean at a specified rate in terms of the lower bound on the mixing weights, the diameter of the network, and the strong connectivity parameter. The consensus error must be characterized for both the value vector and occupancy measure estimates, which motivates the following network-aggregated averages at time $t$:

$$\bar{\mu}^t = \frac{1}{n} \sum_{i=1}^{n} \mu_i^t, \qquad \bar{v}^t = \frac{1}{n} \sum_{i=1}^{n} v_i^t. \qquad (15)$$

It also turns out to be convenient to define the auxiliary sequence

$$q_i^t := \Pi_V[\tilde{v}_i^t + d_i^{t+1}] - \tilde{v}_i^t, \qquad (16)$$

where $q_i^t$ represents the error between the weight-averaged iterates $\tilde{\mu}_i^t$ (resp. $\tilde{v}_i^t$) and their previous update following projection/composition with a proximal operator. We note that a similar technique for minimization problems is considered in [45], but here we are considering a different minimax setting, which necessitates analyzing the consensus error in both the primal and dual variables. See Lemma 3 in the appendix for the analysis of the consensus error.

Lyapunov Function Construction. Next we define a decrement process that tracks the evolution of the averaged primal and dual iterates to the primal-dual optimal pair, which eventuates in our ability to formalize the overall convergence rate of Algorithm 1. Consider the Lyapunov function $\mathcal{E}^t$ and duality gap quantifier $D^t$, defined as

$$\mathcal{E}^t := \frac{1}{n} \sum_{i=1}^{n} D_{KL}(\mu^* \,\|\, \mu_i^t) + \frac{1}{2|S|\, t_{mix}^{2}} \|\bar{v}^t - v^*\|^{2}, \qquad D^t := \lambda^* + \sum_{a \in A} \bar{\mu}^t(a)^T \left[ (I - P_a)\, v^* + r_a \right]. \qquad (17)$$

The first term of $\mathcal{E}^t$ quantifies the sum of KL divergences between the optimal $\mu^*$ and the local occupancy measures $\mu_i^t$, and the second term quantifies the sub-optimality of the average sequence $\bar{v}^t$. In $D^t$, we track the constraint violation of (5).
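The Lyapunov function can be evaluated numerically once the occupancy measures are flattened into vectors. The sketch below is a minimal illustration under assumptions of ours: the norm on the value error is taken to be Euclidean, all measures are strictly positive where $\mu^*$ is, and the numbers are hypothetical.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) for flattened distributions; terms with p = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def lyapunov(mu_star, mu_locals, v_bar, v_star, n_states, t_mix):
    """Average KL term plus scaled squared value error, following (17)."""
    kl_term = sum(kl_divergence(mu_star, mu_i) for mu_i in mu_locals) / len(mu_locals)
    sq_err = sum((a - b) ** 2 for a, b in zip(v_bar, v_star))
    return kl_term + sq_err / (2 * n_states * t_mix ** 2)


# Hypothetical values: a single agent, two flattened (s, a) entries, two states.
mu_star = [0.5, 0.5]
E_far = lyapunov(mu_star, [[0.25, 0.75]], v_bar=[1.0, 0.0], v_star=[0.0, 0.0],
                 n_states=2, t_mix=1)
E_opt = lyapunov(mu_star, [mu_star], v_bar=[0.0, 0.0], v_star=[0.0, 0.0],
                 n_states=2, t_mix=1)
```

The function vanishes exactly when every local occupancy measure equals $\mu^*$ and the averaged value vector equals $v^*$, which is why its decrease along the iterates certifies progress towards the primal-dual optimum.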

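Lemma 1: With Lyapunov function $\mathcal{E}^t$ and duality gap quantifier $D^t$ in (17), the iterates of Algorithm 1 exhibit approximate stochastic descent:

$$\mathbb{E}[\mathcal{E}^{t+1} \mid \mathcal{F}^t] \le \mathcal{E}^t - \beta D^t + \beta^{2}\, O\big(|S||A|\, \tilde{t}_{mix}^{2}\big) + \frac{\beta}{n} \sum_{i=1}^{n} \sum_{a \in A} \left[ (\bar{v}^t - v_i^t)^T (I - P_a)^T (\tilde{\mu}_i^t(a) - \mu^*(a)) \right] + \frac{\beta}{n} \sum_{i=1}^{n} \sum_{a \in A} \left[ (\tilde{\mu}_i^t(a) - \bar{\mu}^t(a))^T \left[ (P_a - I)\, v^* + r_a \right] \right], \qquad (18)$$

where $\beta$ denotes the constant step size.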

See Appendix I in [46] for proof. This result is a generalization of [24, Proposition 9] to multi-agent settings. For $n = 1$, the last two terms are null, which recovers the proposition in the aforementioned reference. In generalizing it to the multi-agent setting, we note that existing analyses of multi-agent stochastic optimization methods based on consensus protocols rely on finite variance conditions [47]. However, evaluating the dual gradient at the consensus variable $\tilde{\mu}_i^t(a)$, which is required for unbiasedness of the gradient evaluation in (11), may introduce unbounded noise into the stochastic gradient estimates. An additional complication is the joint treatment of consensus error in primal and dual variables owing to the structure of the minimax objective (6), which is not treated in any of the earlier works [36], [6], [18], [47], [45], [27].

Next, we present the main result of this subsection which upper bounds the duality gap as follows.

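Theorem 1: For the time-averaged sequence of occupancy measures $\hat{\mu} = \frac{1}{T} \sum_{t=0}^{T-1} \frac{1}{n} \sum_{i=1}^{n} \mu_i^t$, after $T$ iterations of Algorithm 1, with the step size selection $\beta = \tilde{O}\left( \sqrt{\frac{\mathcal{E}^{0}}{\sqrt{n}\, |S||A|\, \tilde{t}_{mix}^{2}\, D(\Gamma, \rho)\, T}} \right)$, it holds that

$$\mathbb{E}\left[ \sum_{a \in A} \left[ v^* - P_a v^* + r_a \right]^T \hat{\mu}(a) \right] + \lambda^* \le \tilde{O}\left( \tilde{t}_{mix} \sqrt{\frac{\sqrt{n}\, \mathcal{E}^{0}\, |S||A|\, D(\Gamma, \rho)}{T}} \right), \qquad (19)$$

where $\tilde{t}_{mix} = 1 + 4 t_{mix}$ and $D(\Gamma, \rho) := \frac{1+\Gamma}{1-\rho}$ such that $\Gamma = (1 - \eta/(4n^{2}))^{-2}$ and $\rho = (1 - \eta/(4n^{2}))^{1/B}$, and $B$ is the network strong connectivity parameter.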

See Appendix VII in [46] for proof. Observe that the duality gap characterization differs from that of typical saddle point problems [48]: the left-hand side of (19) evaluates the constraint at the averaged dual variable $\hat{\mu}$ and at the optimal primal variable $(v^*, \lambda^*)$, without an additional term for the dual sub-optimality. This is a special structural consequence of the LP setting that breaks down for general nonlinear objectives or constraints. This upper bound characterizes the number of times the complementary slackness condition is violated on average, which facilitates deriving the sample complexity required to achieve an $\epsilon$-optimal policy with high probability. Next we shift focus to this result.

B. From Duality Gap to Average Reward

We first derive a convergence-in-probability result for the proposed algorithm in Lemma 2.

Lemma 2: Suppose Algorithm 1 is run for $T = \Omega\left( \tau^{2} \tilde{t}_{mix}^{2} \sqrt{n}\, \mathcal{E}^{0}\, |S||A|\, D(\Gamma, \rho) / \epsilon^{2} \right)$ iterations. Then it outputs a policy $\hat{\pi} = \frac{1}{T} \sum_{t=1}^{T} \pi^t$ such that $\lambda^* - \epsilon \le \lambda^{\hat{\pi}}$ with probability 2/3; that is, we output an $\epsilon$-optimal policy with probability 2/3.

See Appendix VIII in [46] for the proof. Lemma 2 establishes that Algorithm 1 converges to an $\epsilon$-optimal policy with probability 2/3. To boost the success probability to near 1, we run Algorithm 1 multiple times and select the best outcome. This procedure is formalized in Algorithm 2. With this meta-strategy, one can achieve a sample complexity comparable to that of Lemma 2, but with high probability.

Algorithm 2: Meta-Randomized Multi-agent Primal-dual (M-RMAPD) Algorithm

1 Input: $\epsilon > 0$, $S$, $A$, $t_{mix}$, $\tau$

2 Run Algorithm 1 $K$ times with precision $\epsilon/3$ and denote the outputs as $\pi^{(1)}, \cdots, \pi^{(K)}$.

3 For each output policy $\pi^{(k)}$, conduct approximate value evaluation for $L = \tilde{O}\left(\frac{t_{mix}}{\epsilon^2}\log\frac{4K}{\delta}\right)$ time steps and obtain $Y^{(k)}$, an approximate value evaluation with precision level $\epsilon/3$ and prob. $\frac{\delta}{2K}$.

4 Output $\tilde{\pi} = \pi^{(k^*)}$ such that $k^* = \arg\max_k Y^{(k)}$.
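The control flow of Algorithm 2 can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: `meta_select`, `run_base`, and `evaluate` are hypothetical names, with `run_base` standing in for one run of Algorithm 1 and `evaluate` for the approximate value evaluation of step 3. The toy usage represents a policy by a made-up average reward.

```python
from typing import Callable, List, Tuple, TypeVar

Policy = TypeVar("Policy")


def meta_select(
    run_base: Callable[[int], Policy],
    evaluate: Callable[[Policy], float],
    K: int,
) -> Tuple[Policy, float]:
    """Run the base learner K times and keep the policy with the best estimate."""
    if K < 1:
        raise ValueError("K must be at least 1")
    # Step 2: K independent runs of the base algorithm.
    policies: List[Policy] = [run_base(k) for k in range(K)]
    # Step 3: approximate value evaluation Y^(k) of every candidate.
    estimates = [evaluate(pi) for pi in policies]
    # Step 4: k* = argmax_k Y^(k); ties go to the earliest run.
    k_star = max(range(K), key=lambda k: estimates[k])
    return policies[k_star], estimates[k_star]


# Toy usage: each "policy" is just its (made-up) average reward.
toy_runs = [0.40, 0.95, 0.70]
best_policy, best_estimate = meta_select(
    run_base=lambda k: toy_runs[k],
    evaluate=lambda pi: pi,  # pretend the evaluation is exact
    K=len(toy_runs),
)
print(best_policy, best_estimate)
```

Keeping the learner and the evaluator as injected callables mirrors the separation between steps 2 and 3: the selection step only needs the estimates $Y^{(k)}$, not the internals of Algorithm 1.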

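Theorem 2: Under Assumptions 1-4, if we run Algorithm 2 for $K = \log_{1/3}\left(\frac{\delta}{2}\right)$, then we output an approximate policy $\tilde{\pi}$ such that $\lambda_{\tilde{\pi}} \ge \lambda^* - \epsilon$ with probability at least $1 - \delta$. Hence, the total number of samples required is given by

$$T = \Omega\left(\frac{\tau^2\,\tilde{t}_{mix}^2\,\sqrt{n}\,E^0\,|S||A|\,D(\Gamma,\rho)}{\epsilon^2}\cdot\log\frac{1}{\delta}\right).$$

We define $D(\Gamma,\rho) := \frac{1+\Gamma}{1-\rho}$, such that $\Gamma = \left(1 - \frac{\eta}{4n^2}\right)^{-2}$ and $\rho = \left(1 - \frac{\eta}{4n^2}\right)^{1/B}$, where $B$ is the network strong connectivity parameter.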
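The choice $K = \log_{1/3}(\delta/2)$ can be checked numerically. The sketch below rests on two assumptions that are not spelled out in the text: the $K$ runs are independent and each succeeds with probability 2/3 (Lemma 2), and $\delta/(2K)$ in step 3 of Algorithm 2 is the failure probability of a single value evaluation. The union-bound accounting is only a plausible outline, not the proof in [46], and the function names are hypothetical.

```python
import math


def num_runs(delta: float) -> int:
    """K = log_{1/3}(delta / 2), rounded up to an integer."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return math.ceil(math.log(delta / 2.0) / math.log(1.0 / 3.0))


def failure_budget(delta: float) -> float:
    """Upper bound on the overall failure probability under the stated assumptions."""
    K = num_runs(delta)
    all_runs_fail = (1.0 / 3.0) ** K                 # no run is eps/3-optimal
    some_evaluation_fails = K * (delta / (2.0 * K))  # union bound over K evaluations
    return all_runs_fail + some_evaluation_fails


for delta in (0.1, 0.01):
    print(delta, num_runs(delta), failure_budget(delta))
```

Because $K$ grows only logarithmically in $1/\delta$, repeating Algorithm 1 multiplies the sample count of Lemma 2 by a $\log(1/\delta)$ factor, which matches the form of the bound in Theorem 2.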

See Appendix IX in [46] for the proof. To the best of our knowledge, the result in Theorem 1 is the first to characterize the sample complexity of MARL schemes that achieve global optimality with high probability. We emphasize that the bound exhibits explicit dependence on the mixing time $t_{mix}$ and the network parameters, together with tight dependence on the cardinalities of the state and action spaces. This work is a first step towards PAC guarantees for multi-agent RL in average-reward settings. We concede that there is an exponential blow-up in the dimensions with respect to the number of agents; controlling it, either via state-action features or via a k-hop neighborhood truncation, is a valid direction for future work.

(a) Grid world. (b) $M = 3$, $n = 2$. (c) $M = 2$, $n = 3$.

Fig. 1. We compare Algorithm 1 with its centralized counterpart (centralized stochastic primal-dual, or C-SPD), and with independent approximate value iteration (I-AVI). Fig. 1(a) shows the grid world environment used in the experiments. Fig. 1(b) compares the average reward of all the mentioned algorithms. It shows that RMAPD learns an optimal policy equivalent to that of the centralized technique. We note that I-AVI fails to learn the optimal policy because the agents do not cooperate with each other. Fig. 1(c) shows a similar result for $M = 2$, $n = 3$.

V. EXPERIMENTS

In this section, we evaluate the practical merit of the proposed algorithm for MARL. Since we are interested in cooperative multi-agent RL, we consider an experimental setting in which aggregating the policies learnt by individual agents, via consensus or centralized training, is important. Specifically, we consider a cooperative navigation problem in an $M \times M$ grid world environment, shown in Fig. 1(a). Each agent is equipped with the action set $A^i = \{\uparrow, \rightarrow, \downarrow, \leftarrow\}$ and observes the $M \times M$ grid as its local state space $S^i$. In this environment, each agent $i$ receives a reward $r_i$, and the common goal is to reach a state with maximum average reward across the agents. For instance, in the grid world depicted in Fig. 1(a) for $M = 3$ and $n = 2$, agent 1 and agent 2 receive rewards of 8 and 5 in the top-left cell, and rewards of 5 and 10 in the lower-right cell, respectively, when they reach that cell simultaneously, and zero otherwise. This setting ensures that cooperative behavior results in a higher average reward than noncooperative behavior.
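The reward structure of the $M = 3$, $n = 2$ example can be written down directly. In the sketch below, the function names, the (row, column) coordinate convention, and the deterministic moves that leave an agent in place at a wall are assumptions made for illustration; only the action set and the reward values come from the description above.

```python
from typing import Dict, Tuple

Cell = Tuple[int, int]  # (row, col); (0, 0) is the top-left cell (assumed convention)

# Action set A^i = {up, right, down, left} as (row, col) offsets.
ACTIONS: Dict[str, Cell] = {
    "up": (-1, 0),
    "right": (0, 1),
    "down": (1, 0),
    "left": (0, -1),
}


def move(cell: Cell, action: str, M: int = 3) -> Cell:
    """Deterministic move; an agent that would leave the grid stays put (assumption)."""
    dr, dc = ACTIONS[action]
    r, c = cell[0] + dr, cell[1] + dc
    return (r, c) if 0 <= r < M and 0 <= c < M else cell


def local_rewards(cell1: Cell, cell2: Cell, M: int = 3) -> Tuple[float, float]:
    """Rewards (r_1, r_2); nonzero only when both agents occupy a goal cell together."""
    if cell1 == cell2 == (0, 0):
        return (8.0, 5.0)
    if cell1 == cell2 == (M - 1, M - 1):
        return (5.0, 10.0)
    return (0.0, 0.0)


def team_average(cell1: Cell, cell2: Cell, M: int = 3) -> float:
    """Average of the two local rewards, the quantity the team tries to maximize."""
    r1, r2 = local_rewards(cell1, cell2, M)
    return (r1 + r2) / 2.0


print(team_average((0, 0), (0, 0)))  # 6.5
print(team_average((2, 2), (2, 2)))  # 7.5
```

Agent 1 alone prefers the top-left cell (8 versus 5), whereas the team average is larger in the lower-right cell (7.5 versus 6.5), which is the tension that makes cooperation pay off in this example.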

We solve the grid world navigation problem using the proposed RMAPD algorithm and present the average cumulative reward returns in Fig. 1. We compare the performance of the proposed decentralized algorithm with a centralized LP solver and with a variant of approximate value iteration in which each agent operates independently of all others to maximize its local average reward. The plot in Fig. 1(b) shows that the iterates of the proposed algorithm converge to the centralized optimal solution and perform significantly better than the independent learning scheme for a grid of size $M = 3$ with $n = 2$ agents, and similarly for $M = 2$, $n = 3$ in Fig. 1(c).

In the experiments, we perform 20 independent runs of all the algorithms for $10^7$ timesteps and plot the average rewards. We observe that RMAPD converges sooner than centralized training, because centralized training must explore a larger state space per agent before converging than its decentralized counterpart. Further, the importance of the consensus mechanism of the proposed algorithm is highlighted by the much lower average reward achieved by independent approximate value iteration.
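One way to produce such curves is sketched below. It assumes that the plotted quantity is the running time-average of the team reward, averaged pointwise over the independent runs; the text does not specify this, and the function names and the tiny reward traces are made up for illustration.

```python
from typing import List


def running_average(rewards: List[float]) -> List[float]:
    """Time-averaged reward (1/t) * sum_{s<=t} r_s for t = 1, ..., len(rewards)."""
    out: List[float] = []
    total = 0.0
    for t, r in enumerate(rewards, start=1):
        total += r
        out.append(total / t)
    return out


def mean_curve(runs: List[List[float]]) -> List[float]:
    """Pointwise mean of the running-average curves of several equal-length runs."""
    curves = [running_average(run) for run in runs]
    return [sum(vals) / len(vals) for vals in zip(*curves)]


# Two short made-up reward traces standing in for the 20 runs of 10^7 steps.
toy_runs = [[0.0, 0.0, 8.0, 8.0], [0.0, 4.0, 4.0, 8.0]]
print(mean_curve(toy_runs))
```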

VI. CONCLUSIONS

In this work, we considered a multi-agent reinforcement learning problem with the average-reward criterion. To date, this problem has been studied in the literature only in single-agent settings, for which randomized LP solvers achieve optimal PAC bounds in terms of the cardinality of the state and action spaces. We generalized such approaches to multi-agent settings by combining randomized LP solvers with consensus averaging, and established PAC bounds for the resulting method. Interestingly, through the use of a novel Lyapunov function in the convergence analysis, the dependence on the state and action space cardinality matches that of the centralized counterpart, with an additional dependence on the way information propagates across the multi-agent network.

As a future direction, we will develop variants of this framework that can operate with parameterized occupancy measures and differential value vectors, such that the scaling is with respect to the parameterization rather than the state and action spaces.

REFERENCES

[1] P. Wang, C.-Y. Chan, and A. de La Fortelle, “A reinforcement learning based approach for automated lane change maneuvers,” in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 1379–1384.

[2] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev et al., “Grandmaster level in StarCraft II using multi-agent reinforcement learning,” Nature, vol. 575, no. 7782, pp. 350–354, 2019.

[3] G. Tesauro and J. O. Kephart, “Pricing in agent economies using multi-agent Q-learning,” Autonomous Agents and Multi-Agent Systems, vol. 5, no. 3, pp. 289–304, 2002.

[4] J. Lussange, I. Lazarevich, S. Bourgeois-Gironde, S. Palminteri, and B. Gutkin, “Modelling stock markets by multi-agent reinforcement learning,” Computational Economics, vol. 57, no. 1, pp. 113–147, 2021.

[5] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.

[6] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.


[7] T. Başar and G. J. Olsder, Dynamic noncooperative game theory. SIAM, 1998.

[8] A. Mahajan and M. Mannan, “Decentralized stochastic control,” Annals of Operations Research, vol. 241, no. 1-2, pp. 109–126, 2016.

[9] V. Krishnamurthy, Partially observed Markov decision processes. Cambridge University Press, 2016.

[10] J. Foerster, I. A. Assael, N. De Freitas, and S. Whiteson, “Learning to communicate with deep multi-agent reinforcement learning,” in NeurIPS, vol. 29, pp. 2137–2145, 2016.

[11] J. Leibo, V. Zambaldi, M. Lanctot, J. Marecki, and T. Graepel, “Multi-agent reinforcement learning in sequential social dilemmas,” in AAMAS, vol. 16. ACM, 2017, pp. 464–473.

[12] J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. Torr, P. Kohli, and S. Whiteson, “Stabilising experience replay for deep multi-agent reinforcement learning,” in Proceedings of the 34th ICML, vol. 70, 2017, pp. 1146–1155.

[13] T. Rashid, M. Samvelyan, C. Schroeder, G. Farquhar, J. Foerster, and S. Whiteson, “QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning,” in ICML, 2018, pp. 4295–4304.

[14] C. Claus and C. Boutilier, “The dynamics of reinforcement learning in cooperative multiagent systems,” 1998.

[15] D. Lee, N. He, P. Kamalaruban, and V. Cevher, “Optimization for reinforcement learning: From a single agent to cooperative agents,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 123–135, 2020.

[16] D. Lee, H. Yoon, V. Cichella, and N. Hovakimyan, “Stochastic primal-dual algorithm for distributed gradient temporal difference learning,” arXiv preprint arXiv:1805.07918, 2018.

[17] T. Doan, S. Maguluri, and J. Romberg, “Finite-time analysis of distributed TD(0) with linear function approximation on multi-agent reinforcement learning,” in ICML, 2019, pp. 1626–1635.

[18] S. Kar, J. M. Moura, and H. V. Poor, “QD-learning: A collaborative distributed strategy for multi-agent reinforcement learning through consensus + innovations,” IEEE Transactions on Signal Processing, vol. 61, no. 7, pp. 1848–1862, 2013.

[19] H.-T. Wai, Z. Yang, Z. Wang, and M. Hong, “Multi-agent reinforcement learning via double averaging primal-dual optimization,” in NeurIPS, 2018, pp. 9649–9660.

[20] C. Qu, S. Mannor, H. Xu, Y. Qi, L. Song, and J. Xiong, “Value propagation for decentralized networked deep multi-agent reinforcement learning,” in NeurIPS, 2019, pp. 1184–1193.

[21] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” Neural Information Processing Systems (NIPS), 2017.

[22] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Başar, “Fully decentralized multi-agent reinforcement learning with networked agents,” in ICML, 2018, pp. 5872–5881.

[23] O. Raveh and R. Meir, “PAC guarantees for cooperative multi-agent reinforcement learning with restricted communication,” arXiv preprint arXiv:1905.09951, 2019.

[24] M. Wang, “Randomized linear programming solves the Markov decision problem in nearly linear (sometimes sublinear) time,” Mathematics of Operations Research, vol. 45, no. 2, pp. 517–546, 2020.

[25] Y. Xu, Z. Deng, M. Wang, W. Xu, A. M.-C. So, and S. Cui, “Voting-based multiagent reinforcement learning for intelligent IoT,” IEEE Internet of Things Journal, vol. 8, no. 4, pp. 2681–2693, 2020.

[26] G. Qu, Y. Lin, A. Wierman, and N. Li, “Scalable multi-agent reinforcement learning for networked systems with average reward,” arXiv preprint arXiv:2006.06626, 2020.

[27] X. Sha, J. Zhang, K. You, K. Zhang, and T. Başar, “Fully asynchronous policy evaluation in distributed reinforcement learning over networks,” arXiv preprint arXiv:2003.00433, 2020.

[28] P. Heredia and S. Mou, “Finite-sample analysis of multi-agent policy evaluation with kernelized gradient temporal difference,” in 2020 59th IEEE Conference on Decision and Control (CDC). IEEE, 2020, pp. 5647–5652.

[29] G. Qu, Y. Lin, A. Wierman, and N. Li, “Scalable multi-agent reinforcement learning for networked systems with average reward,” arXiv preprint arXiv:2006.06626, 2020.

[30] K. Zhang, A. Koppel, H. Zhu, and T. Başar, “Global convergence of policy gradient methods to (almost) locally optimal policies,” SIAM Journal on Control and Optimization, vol. 58, no. 6, pp. 3586–3612, 2020.

[31] L. C. M. Kallenberg, Linear Programming and Finite Markovian Control Problems. CWI Mathematisch Centrum, 1983.

[32] ——, “Survey of linear programming for standard and nonstandard Markovian control problems. Part I: Theory,” Zeitschrift für Operations Research, vol. 40, no. 1, pp. 1–42, 1994.

[33] D. P. De Farias and B. Van Roy, “The linear programming approach to approximate dynamic programming,” Operations Research, vol. 51, no. 6, pp. 850–865, 2003.

[34] J. Chen and A. H. Sayed, “Diffusion adaptation strategies for distributed optimization and learning over networks,” IEEE Transactions on Signal Processing, vol. 60, no. 8, pp. 4289–4305, 2012.

[35] A. Koppel, F. Y. Jakubiec, and A. Ribeiro, “A saddle point algorithm for networked online convex optimization,” IEEE Transactions on Signal Processing, vol. 63, no. 19, pp. 5149–5164, 2015.

[36] S. Boyd, N. Parikh, and E. Chu, Distributed optimization and statistical learning via the alternating direction method of multipliers. Now Publishers Inc, 2011.

[37] H. Terelius, U. Topcu, and R. M. Murray, “Decentralized multi-agent optimization via dual decomposition,” IFAC Proceedings Volumes, vol. 44, no. 1, pp. 11245–11251, 2011.

[38] F. R. K. Chung, Spectral graph theory. American Mathematical Soc., 1997, no. 92.

[39] T. Eccles, Y. Bachrach, G. Lever, A. Lazaridou, and T. Graepel, “Biases for emergent communication in multi-agent reinforcement learning,” in NeurIPS, 2019, pp. 13111–13121.

[40] S. Ahilan and P. Dayan, “Correcting experience replay for multi-agent communication,” arXiv preprint arXiv:2010.01192, 2020.

[41] Y. Bachrach, R. Everett, E. Hughes, A. Lazaridou, J. Z. Leibo, M. Lanctot, M. Johanson, W. M. Czarnecki, and T. Graepel, “Negotiating team formation using deep reinforcement learning,” Artificial Intelligence, vol. 288, p. 103356, 2020.

[42] Y. Lin, G. Qu, L. Huang, and A. Wierman, “Distributed reinforcement learning in multi-agent networked systems,” arXiv preprint arXiv:2006.06555, 2020.

[43] D. Bertsekas and J. Tsitsiklis, Parallel and distributed computation: numerical methods. Athena Scientific, 2015.

[44] D. P. Bertsekas, Dynamic programming and optimal control. Athena Scientific, Belmont, MA, 1995, vol. 1, no. 2.

[45] S. Chen, A. Garcia, and S. Shahrampour, “On distributed non-convex optimization: Projected subgradient method for weakly convex problems in networks,” IEEE Transactions on Automatic Control, 2021.

[46] “Technical report for “convergence rates of average-reward multi-agent reinforcement learning via randomized linear programming”.” [Online]. Available: https://tinyurl.com/2p9cv2jy

[47] J. Li, G. Li, Z. Wu, and C. Wu, “Stochastic mirror descent method for distributed multi-agent optimization,” Optimization Letters, vol. 12, no. 6, pp. 1179–1197, 2018.

[48] A. Nedić and A. Ozdaglar, “Subgradient methods for saddle-point problems,” Journal of Optimization Theory and Applications, vol. 142, no. 1, pp. 205–228, 2009.

[49] M. Wang, Randomized linear programming solves the Markov decision problem in nearly linear (sometimes sublinear) time, 2020, vol. 45, no. 2.