
Learning with Delayed Payoffs in Population Games using Kullback-Leibler Divergence Regularization

Shinkyu Park and Naomi Ehrich Leonard

Abstract—We study a multi-agent decision problem in large population games. Agents across multiple populations select strategies for repeated interactions with one another. At each stage of the interactions, agents use their decision-making model to revise their strategy selections based on payoffs determined by an underlying game. Their goal is to learn the strategies of the Nash equilibrium of the game. However, when games are subject to time delays, conventional decision-making models from the population game literature result in oscillation in the strategy revision process or convergence to an equilibrium other than the Nash equilibrium. To address this problem, we propose the Kullback-Leibler Divergence Regularized Learning (KLD-RL) model and an algorithm to iteratively update the model's regularization parameter. Using passivity-based convergence analysis techniques, we show that the KLD-RL model achieves convergence to the Nash equilibrium, without oscillation, for a class of population games that are subject to time delays. We demonstrate our main results numerically on a two-population congestion game and a two-population zero-sum game.

Index Terms—Multi-agent systems, decision making, evolutionary dynamics, nonlinear systems, game theory

I. INTRODUCTION

Consider a large number of agents engaged in repeated strategic interactions. Each agent selects a strategy for the interactions but can repeatedly revise its strategy according to payoffs determined by an underlying payoff mechanism.

To learn and adopt the best strategy without knowing the structure of the payoff mechanism, the agent needs to revise its strategy selection at each stage of the interactions based on the instantaneous payoffs it receives.

This multi-agent decision problem is relevant in control engineering applications where the goal is to design decision-making models for multiple agents to learn effective strategy selections in a self-organized way. Applications include computation offloading over communication networks [1], user association in cellular networks [2], multi-robot task allocation [3], demand response in smart grids [4], [5], water distribution systems [6], [7], building temperature control [8], wireless networks [9], electric vehicle charging [10], distributed control systems [11], and distributed optimization [12].

To formalize the problem, we adopt the population game framework [13, Chapter 2]. In this framework, a payoff function defines how the payoffs are determined based on the agents' strategy profile, which is the distribution of their strategy selections over a finite number of available strategies. An evolutionary dynamic model describes how individual agents revise their strategies to increase the payoffs they receive.

Park's work was supported by funding from King Abdullah University of Science and Technology (KAUST). Leonard's work was supported in part by ONR grant N00014-19-1-2556 and ARO grant W911NF-18-1-0325.

Park is with the Electrical and Computer Engineering Program, Computer, Electrical, and Mathematical Science and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia. [email protected]

Leonard is with the Department of Mechanical and Aerospace Engineering, Princeton University, Princeton, NJ 08544, USA. [email protected]

A key research theme is establishing convergence of the strategy profile to the Nash equilibrium, where no agent is better off by unilaterally revising its strategy selection.¹

Unlike in existing studies, we investigate scenarios where the payoff mechanism is subject to time delays. This models, for example, propagation of traffic congestion on roads in congestion games [16], delay in communication between an electric power utility and demand response agents in demand response games [4], and limitations of agents in sensing link status in network games [17]. When agents revise their strategy selections based on delayed payoffs, the strategy profile does not converge to the Nash equilibrium. In fact, prior work in the game theory literature [18]–[33] suggests that when multi-agent games are subject to time delays, the strategy profile oscillates under many existing decision-making models.

As a main contribution of this paper, we propose a new class of decision-making models called Kullback-Leibler Divergence Regularized Learning (KLD-RL). The main idea behind the new model is to regularize the agents' decision making using the Kullback-Leibler divergence. Such regularization makes the agent strategy revision insensitive to time delays in the payoff mechanism. This prevents the strategy profile from oscillating, and through successive updates of the model's regularization parameter, it ensures that the agents improve their strategy selections. As a consequence, when the agents revise strategy selections based on the proposed model, their strategy profile is guaranteed to converge to the Nash equilibrium in a certain class of population games.

The logit dynamics model [34], [35] is known to converge to an equilibrium state in a large class of population games, including games subject to time delays [36], [37]. However, as discussed in [35], the equilibrium state of the logit dynamics model is a perturbed version of the Nash equilibrium. This forces the agents to select sub-optimal strategies, for instance, in potential games [38], [39] with concave payoff potentials, where the Nash equilibrium is the socially optimal strategy profile. Such a significant limitation in existing models motivates our investigation of a new decision-making model.

1In this work, we consider that the Nash equilibrium represents a desired distribution of the agents’ strategy selections, e.g., the distribution of route selections minimizing road congestion in congestion games (Example 1) or minimizing opponents’ maximum gain in zero-sum games (Example 2), and investigate convergence of the agents’ strategy profile to the Nash equilibrium.

However, we note that such an equilibrium is not always a desideratum and can result in the worst outcome, for instance, in social dilemmas as illustrated by the prisoner's dilemma [14]. We refer to [15] and references therein for other studies on decision model design in social dilemmas.



TABLE I: LIST OF BASIC NOTATION

$\mathbb{R}^n$: set of $n$-dimensional real vectors.
$\mathbb{R}^n_+$: set of $n$-dimensional nonnegative real vectors.
$X^k, X$: state spaces of population $k$ and the society.
$TX^k, TX$: tangent spaces of $X^k$ and $X$.
$\mathrm{int}(X^k), \mathrm{int}(X)$: interiors of $X^k$ and $X$, defined, respectively, as $\mathrm{int}(X^k) = \{x^k \in X^k \mid x_i^k > 0,\ 1 \le i \le n^k\}$ and $\mathrm{int}(X) = \mathrm{int}(X^1) \times \cdots \times \mathrm{int}(X^M)$.
$\mathcal{F}, D\mathcal{F}$: payoff function of a population game and its differential map.
$NE(\mathcal{F})$: Nash equilibrium set of a population game $\mathcal{F}$, defined as in (1).
$PNE_{\eta,\theta}(\mathcal{F})$: perturbed Nash equilibrium set of a population game $\mathcal{F}$, defined as in (26).
$\mathcal{D}(x\|y)$: Kullback-Leibler divergence, defined as $\sum_{i=1}^n x_i \ln(x_i/y_i)$ for (element-wise) nonnegative vectors $x = (x_1,\cdots,x_n)$ and $y = (y_1,\cdots,y_n)$.

Below we summarize the main contributions of this paper.

We propose a parameterized class of KLD-RL models that generalize the existing logit dynamics model. We explain how the new model implements the idea of regularization in multi-agent decision making, and provide an algorithm that iteratively updates the model's regularization parameter.

Leveraging stability results from recent works on higher-order learning in large population games [36], [37], we discuss, under the KLD-RL model, the convergence of the strategy profile to the Nash equilibrium in an important class of population games, widely known as contractive population games [40].

We present numerical simulations using multi-population games to demonstrate how the new model ensures the convergence to the Nash equilibrium, despite time delays in the games. Using simulation outcomes, we illustrate how our main convergence results are different from those of the existing logit model and highlight the importance of the proposed model in applications.

The paper is organized as follows. In Section II, we explain the multi-agent decision problem addressed in this paper. In Section III, we provide a comparative review of related works.

In Section IV, we introduce the KLD-RL model and explain how to iteratively update the model’s regularization parameter.

We present our main theorem that establishes the convergence of the strategy profile determined by the model to the Nash equilibrium in a certain class of contractive population games.

In Section V, we present simulation results that demonstrate the effectiveness of the proposed model in learning and converging to the Nash equilibrium. We discuss interesting extensions of our findings in Section VI and conclude the paper with a summary and future directions in Section VII.

II. PROBLEM DESCRIPTION

Consider a society consisting of $M$ populations of decision-making agents.² We denote by $\{1,\cdots,M\}$ the populations constituting the society and by $\{1,\cdots,n^k\}$ the set of strategies available to agents in each population $k$. Let $x^k(t) = (x_1^k(t),\cdots,x_{n^k}^k(t))$ be an $n^k$-dimensional nonnegative real-valued vector where each entry $x_i^k(t)$ denotes the portion of population $k$ adopting strategy $i$ at time instant $t$. We refer to $x^k(t)$ as the state of population $k$ and the constant $m^k = \sum_{i=1}^{n^k} x_i^k(t),\ \forall t \ge 0$, as the mass of the population. Also, by aggregating the states of all populations, we define the social state $x(t) = (x^1(t),\cdots,x^M(t))$, which describes the strategy profiles across all $M$ populations at time $t$. Let $n$ be the total number of strategies available in the society, i.e., $n = \sum_{k=1}^M n^k$. We denote the space of viable population states as $X^k = \{x^k \in \mathbb{R}_+^{n^k} \mid \sum_{i=1}^{n^k} x_i^k = m^k\}$. Accordingly, we define $X = X^1 \times \cdots \times X^M$ as the space of viable social states. For concise presentation, without loss of generality, we assume that $m^k = 1,\ \forall k \in \{1,\cdots,M\}$.

²We adopt materials on population games and relevant preliminaries from [13, Chapter 2].

In what follows, we review relevant definitions from the population games literature. Table I summarizes the basic notation used throughout the paper. For all variables and parameters adopted in this paper, the superscript is used to indicate the population with which they are associated.

A. Population Games and Time Delays in Payoff Mechanisms

1) Population games: We denote the payoffs assigned to each population $k$ at time instant $t$ by an $n^k$-dimensional real-valued vector $p^k(t) = (p_1^k(t),\cdots,p_{n^k}^k(t)) \in \mathbb{R}^{n^k}$. Each $p_i^k(t)$ represents the payoff given to the agents in population $k$ selecting strategy $i$. We denote by $p(t) = (p^1(t),\cdots,p^M(t)) \in \mathbb{R}^n$ the payoffs assigned to the agents in the entire society.

According to the conventional definition, a population game is associated with a payoff function $\mathcal{F} = (\mathcal{F}^1,\cdots,\mathcal{F}^M)$ with $\mathcal{F}^k : X \to \mathbb{R}^{n^k}$, which assigns a payoff vector to each population $k$ as $p^k(t) = \mathcal{F}^k(x(t))$, where $x(t) \in X$ is the social state at time $t$. We adopt the following definition of the Nash equilibrium of $\mathcal{F}$.

Definition 1 (Nash Equilibrium): An element $z^{NE}$ in $X$ is called the Nash equilibrium of the population game $\mathcal{F}$ if it satisfies the following condition:

$(z^{NE} - z)^T \mathcal{F}(z^{NE}) \ge 0, \quad \forall z \in X.$  (1)

Population games can have multiple Nash equilibria. We denote by $NE(\mathcal{F})$ the set of all Nash equilibria of $\mathcal{F}$.

Below, we provide examples of population games and identify their unique Nash equilibrium. The examples will be used in Section V to illustrate our main results.

Example 1: Consider a congestion game with two populations ($M = 2$), where each population is assigned a fixed origin and destination. To reach their respective destinations, agents in each population use one of three available routes, as depicted in Fig. 1. We consider the game scenario where every agent needs to repeatedly travel from its origin to its destination, e.g., to commute to work every workday. Each strategy in the game is defined as an agent taking one of the available routes. Its associated payoff reflects the level of congestion along the selected route, which depends on the number of agents from possibly both populations using the route.

Fig. 1. Two-Population Congestion Game: Agents in each population $k$ traverse from origin $O_k$ to destination $D_k$ using one of the following routes: $O_1 \to A \to D_1$ (Route 1), $O_1 \to A \to B \to D_1$ (Route 2), and $O_1 \to B \to D_1$ (Route 3) for population 1; $O_2 \to A \to D_2$ (Route 1), $O_2 \to B \to A \to D_2$ (Route 2), and $O_2 \to B \to D_2$ (Route 3) for population 2. We assume that, with the same number of agents on the links, the diagonal links ($O_1B$, $O_2A$, $AD_1$, $BD_2$) are 50% more congested than the horizontal links (e.g., the roads represented by the diagonal links are narrower), whereas the vertical link $AB$ is 50% less congested than the horizontal links (e.g., the road associated with the vertical link is wider). The different weights on the links reflect this assumption.

To formalize this, we adopt the payoff function $\mathcal{F} = (\mathcal{F}^1, \mathcal{F}^2)$ defined as

$\mathcal{F}^1(x^1,x^2) = -\begin{pmatrix} 2.5x_1^1 + x_2^1 \\ x_1^1 + 2.5x_2^1 + x_3^1 + 0.5x_2^2 \\ x_2^1 + 2.5x_3^1 \end{pmatrix}$  (2a)

$\mathcal{F}^2(x^1,x^2) = -\begin{pmatrix} 2.5x_1^2 + x_2^2 \\ 0.5x_2^1 + x_1^2 + 2.5x_2^2 + x_3^2 \\ x_2^2 + 2.5x_3^2 \end{pmatrix}.$  (2b)

We note that (2) has the unique Nash equilibrium $x^{NE} = (4/9, 1/9, 4/9, 4/9, 1/9, 4/9)$, at which the average congestion level across all six routes is minimized. ■
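As a quick numerical sanity check (ours, not part of the paper), one can verify the stated equilibrium by confirming that, at $x^{NE}$, all strategies within each population earn the same payoff, which is the defining property of an interior Nash equilibrium. The helper name below is hypothetical.

```python
# Sketch (ours): at an interior Nash equilibrium of (2), payoffs are equal
# across strategies within each population.
import numpy as np

def payoff_congestion(x):
    # x = (x_1^1, x_2^1, x_3^1, x_1^2, x_2^2, x_3^2)
    x11, x12, x13, x21, x22, x23 = x
    F1 = -np.array([2.5 * x11 + x12,
                    x11 + 2.5 * x12 + x13 + 0.5 * x22,
                    x12 + 2.5 * x13])
    F2 = -np.array([2.5 * x21 + x22,
                    0.5 * x12 + x21 + 2.5 * x22 + x23,
                    x22 + 2.5 * x23])
    return np.concatenate([F1, F2])

x_ne = np.array([4, 1, 4, 4, 1, 4]) / 9.0
p = payoff_congestion(x_ne)
assert np.allclose(p[:3], p[0]) and np.allclose(p[3:], p[3])  # all entries equal -11/9
```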

Example 2: Consider a two-population zero-sum game whose payoff function $\mathcal{F} = (\mathcal{F}^1, \mathcal{F}^2)$ is derived from a biased Rock-Paper-Scissors (RPS) game [41] as follows:

$\mathcal{F}^1(x^1,x^2) = \begin{pmatrix} -0.5x_2^2 + x_3^2 \\ 0.5x_1^2 - 0.1x_3^2 \\ -x_1^2 + 0.1x_2^2 \end{pmatrix}$  (3a)

$\mathcal{F}^2(x^1,x^2) = \begin{pmatrix} -0.5x_2^1 + x_3^1 \\ 0.5x_1^1 - 0.1x_3^1 \\ -x_1^1 + 0.1x_2^1 \end{pmatrix}.$  (3b)

The study of zero-sum games has important implications for security-related applications. For example, attacker-defender (zero-sum) game formulations [42], [43] can be used to predict an attacker's strategy at the Nash equilibrium and to design the best defense strategy.

Agents in each population $k = 1,2$ can select one of three strategies: rock ($x_1^k$), paper ($x_2^k$), or scissors ($x_3^k$). The payoffs $\mathcal{F}^k(x^1,x^2)$ assigned to each population $k$ reflect the chances of winning the game against its opponent population, as illustrated in Fig. 2. The agents are engaged in a multi-round game in which, based on the payoffs received in a previous round, they revise their strategy selections at each round of the game. Note that the game has the unique Nash equilibrium $x^{NE} = (1/16, 10/16, 5/16, 1/16, 10/16, 5/16)$, at which each population minimizes the opponent population's maximum gain (or, equivalently, maximizes its own worst-case (minimum) gain). ■

Fig. 2. Two-Population Zero-Sum Game: Agents in the defender population (population 1) select defending strategies ($DS_1, DS_2, DS_3$) to play against those in the attacker population (population 2), who adopt attacking strategies ($AS_1, AS_2, AS_3$). The positive (negative) weight on the blue (red dotted) arrow between $DS_i$ and $AS_j$ denotes the reward (loss) associated with population 1 when the defenders and attackers adopt $DS_i$ and $AS_j$, respectively. The payoff $\mathcal{F}_i^1(x^1,x^2)$ associated with $DS_i$ of population 1 is the sum of the rewards and losses when $x^2$ is the state of population 2, whereas the payoff $\mathcal{F}_j^2(x^1,x^2)$ associated with $AS_j$ of population 2 is the sum of the rewards and losses when $x^1$ is the state of population 1.
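A similar check (ours) for the zero-sum game (3): at the stated $x^{NE}$, the payoffs within each population are equalized, and in this game they all equal zero.

```python
# Sketch (ours): verify the interior Nash equilibrium of the zero-sum game (3).
import numpy as np

# F^1(x^1, x^2) = A x^2 and F^2(x^1, x^2) = A x^1, with A antisymmetric.
A = np.array([[ 0.0, -0.5,  1.0],
              [ 0.5,  0.0, -0.1],
              [-1.0,  0.1,  0.0]])

x_ne = np.array([1, 10, 5, 1, 10, 5]) / 16.0
p1, p2 = A @ x_ne[3:], A @ x_ne[:3]
assert np.allclose(p1, 0.0) and np.allclose(p2, 0.0)
```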

We make the following assumption on the payoff function $\mathcal{F}$.

Assumption 1: The differential map $D\mathcal{F}$ of $\mathcal{F}$ exists and is continuous on $X$, and both $\mathcal{F}$ and $D\mathcal{F}$ are bounded: there are constants $B_{\mathcal{F}}$ and $B_{D\mathcal{F}}$ satisfying $B_{\mathcal{F}} = \max_{z \in X} \|\mathcal{F}(z)\|_2$ and $B_{D\mathcal{F}} = \max_{z \in X} \|D\mathcal{F}(z)\|_2$, respectively.

Note that any affine payoff function $\mathcal{F}(x) = Fx + b$ with $F \in \mathbb{R}^{n \times n}$, $b \in \mathbb{R}^n$, e.g., (2) and (3), satisfies Assumption 1. Contractive population games are defined as follows.³

Definition 2 (Contractive Population Game [40]): A population game $\mathcal{F}$ is called contractive if it holds that

$(w - z)^T(\mathcal{F}(w) - \mathcal{F}(z)) \le 0, \quad \forall w, z \in X.$  (4)

In particular, if the equality holds if and only if $w = z$, then $\mathcal{F}$ is called strictly contractive.

If a population game $\mathcal{F}$ is contractive, then its Nash equilibrium set $NE(\mathcal{F})$ is convex; moreover, if $\mathcal{F}$ is strictly contractive, then it has a unique Nash equilibrium [40, Theorem 13.9]. For the affine payoff function $\mathcal{F}(x) = Fx + b$, the requirement (4) is equivalent to negative semi-definiteness of $F$ on the tangent space $TX$ of $X$. Both games in Examples 1 and 2 are contractive. We call $\mathcal{F}$ a contractive potential game if there is a concave potential function $f$ satisfying $\mathcal{F} = \nabla f$, in which case the Nash equilibrium attains the maximum of $f$. The congestion game (2) is a contractive potential game, and its Nash equilibrium attains the minimum average congestion.

³Contractive games were previously referred to as stable games [44]. We adopt the latest naming convention.
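Since both example games are affine with $b = 0$, condition (4) reduces to a matrix test on the tangent space. The following sketch (ours; the matrix layouts are our reconstruction from (2) and (3), and the function names are hypothetical) checks negative semi-definiteness of the linear part on $TX$ numerically.

```python
# Sketch (ours): for an affine game F(x) = Fx + b, contractivity (4) holds iff
# z^T F z <= 0 for all z in TX; we test the symmetric part of P F P, where P
# projects each population's coordinates onto its zero-sum subspace.
import numpy as np

def tangent_projector(pop_sizes):
    n = sum(pop_sizes)
    P = np.zeros((n, n))
    offset = 0
    for nk in pop_sizes:
        P[offset:offset + nk, offset:offset + nk] = np.eye(nk) - np.ones((nk, nk)) / nk
        offset += nk
    return P

def is_contractive_affine(F, pop_sizes, tol=1e-9):
    P = tangent_projector(pop_sizes)
    M = P @ F @ P
    return bool(np.all(np.linalg.eigvalsh((M + M.T) / 2) <= tol))

# Linear parts of (2) and (3), coordinates ordered (x_1^1, x_2^1, x_3^1, x_1^2, x_2^2, x_3^2).
F_congestion = -np.array([
    [2.5, 1.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 2.5, 1.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, 2.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.5, 1.0, 0.0],
    [0.0, 0.5, 0.0, 1.0, 2.5, 1.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 2.5],
])
A = np.array([[0.0, -0.5, 1.0], [0.5, 0.0, -0.1], [-1.0, 0.1, 0.0]])
F_rps = np.block([[np.zeros((3, 3)), A], [A, np.zeros((3, 3))]])

print(is_contractive_affine(F_congestion, [3, 3]))  # expected: True
print(is_contractive_affine(F_rps, [3, 3]))         # expected: True
```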

2) Time delays in payoff mechanisms: Unlike in the standard formulation explained in Section II-A1, in this work the payoff vector at each time instant depends not only on the current but also on past social states, to capture time delays in the payoff mechanism underlying a population game. Adopting [45, Definition 1.1.3], we denote such dependency by

$p(t) = (\mathcal{G}x)(t),$  (5)


where $\mathcal{G}$ is a causal mapping. We require that when $x(t)$ converges, i.e., $\lim_{t\to\infty} x(t) = \bar{x}$, so does $p(t)$, i.e., $\lim_{t\to\infty} p(t) = \mathcal{F}(\bar{x})$, where $\mathcal{F}$ is the payoff function of an underlying population game. Following the same naming convention as in [37], we refer to (5) as the payoff dynamics model (PDM). However, unlike the original definition of the PDM, which is described as a finite-dimensional dynamical system, (5) expands the existing PDM definition to include a certain type of infinite-dimensional system, such as (6) given below. In what follows, we provide two cases of population games with time delays that can be represented using (5).

a) Payoff function with a time delay: Consider that the payoff vector $p(t)$ at time $t$ depends on the past social state $x(t-d)$ at time $t-d$:

$p(t) = \mathcal{F}(x(t-d)), \quad t \ge d,$  (6)

where the positive constant $d$ denotes a time delay and it holds that $p(t) = \mathcal{F}(x(0))$ for $0 \le t < d$.⁴ By the continuity of $\mathcal{F}$, when the social state $x(t)$ converges to $\bar{x}$, so does the payoff vector $p(t)$ to $\mathcal{F}(\bar{x})$. Note that (6) can be regarded as an infinite-dimensional dynamical system model.

We assume that $d$ is unknown to the agents, but they have an (estimated) upper bound $B_d$ on $d$. For instance, in the congestion game (Example 1), each agent's gathering of information about the congestion level of available routes is subject to an unknown time delay, but the agent can make a good estimate of an upper bound on the time delay.
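A minimal sketch (ours, with hypothetical class and method names) of how the delayed payoff mechanism (6) can be simulated: the social state trajectory is recorded, and the payoff is evaluated at the stored state nearest to $t - d$ (a zero-order-hold approximation), falling back to $x(0)$ while $t < d$.

```python
# Sketch (ours): the delayed-payoff PDM (6), p(t) = F(x(t - d)) for t >= d.
import numpy as np

class DelayedPayoffPDM:
    def __init__(self, payoff_fn, delay):
        self.payoff_fn = payoff_fn   # F: social state -> payoff vector
        self.delay = delay           # time delay d (unknown to the agents)
        self.times, self.states = [], []

    def payoff(self, t, x_t):
        """Record x(t) and return p(t) = F(x(t - d)), or F(x(0)) while t < d."""
        self.times.append(t)
        self.states.append(np.asarray(x_t, dtype=float))
        if t < self.delay:
            x_delayed = self.states[0]
        else:
            # nearest stored sample at or before t - d
            idx = np.searchsorted(self.times, t - self.delay, side="right") - 1
            x_delayed = self.states[max(idx, 0)]
        return self.payoff_fn(x_delayed)
```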

b) Smoothing payoff dynamics model: We adopt similar arguments as in [36, Section V] to derive the smoothing PDM [37]. Suppose that the opportunity for strategy revision of each agent in population $k$ occurs at each jump time of an independent and identically distributed Poisson process with parameter $1/N^k$, where $N^k$ is the number of agents in the population. At each strategy revision time, the agent receives a payoff associated with its revised strategy.

Let $t$ and $t+h$ be two consecutive strategy revision times. Note that, by the definitions of the Poisson process and strategy revision time, $h$ goes to zero as $N^k$ tends to infinity. Given that the agent revises to strategy $j$ at time $t+h$ and receives a payoff $\mathcal{F}_j^k(x(t+h))$, the population updates its payoff estimates for all available strategies as follows:⁵

$P_i^k(t+h) = \begin{cases} P_i^k(t) + \dfrac{h\lambda\,\big(\mathcal{F}_i^k(x(t+h)) - P_i^k(t)\big)}{\mathbb{P}(\text{agent selects strategy } i)} & \text{if } i = j \\ P_i^k(t) & \text{otherwise.} \end{cases}$  (7)

The variable $P_i^k(t)$ is the estimate of $\mathcal{F}_i^k(x(t))$ and the parameter $\lambda$ is the estimation gain. In expectation, (7) satisfies

$\mathbb{E}\left[\dfrac{P_i^k(t+h) - P_i^k(t)}{h}\right] = -\lambda\,\mathbb{E}\big[P_i^k(t) - \mathcal{F}_i^k(x(t+h))\big].$  (8)

For a large number of agents, i.e., as $N^k$ tends to infinity, we can approximate (8) with the following ordinary differential equation:

$\dot{p}_i^k(t) = -\lambda\big(p_i^k(t) - \mathcal{F}_i^k(x(t))\big).$  (9)

The variable $p_i^k(t)$ can be viewed as an approximation of $\mathbb{E}(P_i^k(t))$. We refer to (9) as the smoothing PDM. Note that (9) can be interpreted as a low-pass filter applied to the signal $\mathcal{F}_i^k(x(t)),\ t \ge 0$, and the filtering causes a time delay in computing the payoff estimates. Consequently, the filter output $p_i^k(t)$ lags behind the input $\mathcal{F}_i^k(x(t))$.

⁴Eq. (6) can be extended to a payoff function with multiple time delays, as we explain in Section VI-A. For concise presentation, we proceed with the payoff function with a single time delay $d$.

⁵The estimation of the payoffs $(\mathcal{F}_1^k(x(t)),\cdots,\mathcal{F}_{n^k}^k(x(t)))$ is required because the population receives the payoff associated with only one of the strategies that its agent selects at each revision time $t$. The denominator $\mathbb{P}(\text{agent selects strategy } i)$ of the second term can be computed using the agent's decision-making model (11), which will be explained in Section II-B.
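In code, the smoothing PDM (9) is just a first-order low-pass filter of the instantaneous payoff; a one-line Euler step (ours, assuming a user-supplied payoff function) is enough for simulation purposes.

```python
# Sketch (ours): one forward-Euler step of the smoothing PDM (9),
# p_dot = -lambda * (p - F(x)).
import numpy as np

def smoothing_pdm_step(p, x, payoff_fn, lam, dt):
    p = np.asarray(p, dtype=float)
    return p + dt * (-lam * (p - payoff_fn(x)))
```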

We make the following assumption on (5).

Assumption 2: The PDM (5) satisfies the technical conditions stated below.

1) Given that $\mathcal{F}$ is the payoff function of an underlying population game, (5) satisfies
$\lim_{t\to\infty} \|\dot{x}(t)\|_2 = 0 \;\Longrightarrow\; \lim_{t\to\infty} \|p(t) - \mathcal{F}(x(t))\|_2 = 0,$
where $p(t)$ is the payoff vector determined by (5) given a social state trajectory $x(t),\ t \ge 0$.

2) Given a social state trajectory $x(t),\ t \ge 0$, (5) computes a unique payoff vector trajectory $p(t),\ t \ge 0$. In other words, for any pair of social state trajectories $x(t),\ t \ge 0$ and $y(t),\ t \ge 0$, it holds that
$x(t) = y(t),\ \forall t \ge 0 \;\Longrightarrow\; (\mathcal{G}x)(t) = (\mathcal{G}y)(t),\ \forall t \ge 0.$  (10)

3) If the social state trajectory $x(t),\ t \ge 0$ is differentiable, so is the resulting payoff vector trajectory $p(t),\ t \ge 0$. Given that $\dot{x}(t)$ is bounded, both $p(t)$ and $\dot{p}(t)$ are bounded, i.e., there exist $B_p$ and $B_{\dot{p}}$ satisfying $\|p(t)\|_2 \le B_p$ and $\|\dot{p}(t)\|_2 \le B_{\dot{p}}$ for all $t \ge 0$, respectively.

Note that both the payoff function with a time delay (6) and the smoothing PDM (9) satisfy Assumption 2.⁶ In light of Assumption 2-1, as originally suggested in [46], we can view the PDM (5) as a dynamic modification of the conventional population game model.

B. Strategy Revision and Evolutionary Dynamics Model

By the same strategy revision process described in Section II-A2b, suppose an agent in population $k$ revises its strategy selection at each jump time of a Poisson process with parameter $1/N^k$, in which the strategy revision depends on the payoff vector $p^k(t)$ and the population state $x^k(t)$ at the jump time $t$. We adopt the evolutionary dynamics framework [13, Part II], in which the following ordinary differential equation describes the change of the population state $x^k(t)$ when the number of agents $N^k$ in the population tends to infinity: for $i$ in $\{1,\cdots,n^k\}$ and $k$ in $\{1,\cdots,M\}$,

$\dot{x}_i^k(t) = \sum_{j=1}^{n^k} x_j^k(t)\,T_{ji}^k(x^k(t), p^k(t)) - x_i^k(t)\sum_{j=1}^{n^k} T_{ij}^k(x^k(t), p^k(t)),$  (11)

where the payoff vector $p^k(t)$ is determined by the PDM (5). The strategy revision protocol $T_{ji}^k(z^k, r^k)$ defines the probability that each agent in population $k$ switches its strategy from $j$ to $i$, where $z^k \in X^k$ and $r^k \in \mathbb{R}^{n^k}$.⁷ As in [37], we refer to (11) as the evolutionary dynamics model (EDM).

⁶In particular, Assumptions 2-1 and 2-3 for (6) can be verified by the mean value theorem and Assumption 1. We can also validate that (9) satisfies Assumptions 2-1 and 2-3 using the same arguments as in the proof of [36, Proposition 6].

[Figure: two simplex plots with vertices $e_1$, $e_2$, $e_3$; panel (a) $\eta = 0.1$, panel (b) $\eta = 4.5$.]
Fig. 3. State trajectories of population 1 under the logit protocol (12) with $\eta = 0.1, 4.5$ in the congestion game (2). The payoff vector is determined by (6) subject to a unit time delay ($d = 1$). The red circle in both (a) and (b) marks the Nash equilibrium, and the red X mark in (b) denotes the unique limit point of all the trajectories.
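The mean dynamic (11) can be sketched generically in code (ours; the protocol interface is an assumption): given any revision protocol that returns a matrix of switch rates, the state derivative is the inflow to each strategy minus the outflow from it.

```python
# Sketch (ours): right-hand side of the EDM (11) for one population.
import numpy as np

def edm_rhs(xk, pk, protocol):
    """protocol(xk, pk) must return an (n_k x n_k) matrix T with T[j, i] the
    switch rate from strategy j to strategy i."""
    T = protocol(xk, pk)
    inflow = T.T @ xk               # sum_j x_j^k T_{ji}^k
    outflow = xk * T.sum(axis=1)    # x_i^k sum_j T_{ij}^k
    return inflow - outflow
```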

Among existing strategy revision protocols, the most relevant to our study is the logit protocol, defined as

$T_i^{\mathrm{Logit}}(r^k) = \dfrac{\exp(\eta^{-1} r_i^k)}{\sum_{l=1}^{n^k} \exp(\eta^{-1} r_l^k)},$  (12)

where $\eta$ is a positive constant and $r^k = (r_1^k,\cdots,r_{n^k}^k) \in \mathbb{R}^{n^k}$ is the value of population $k$'s payoff vector. Agents adopting the logit protocol, i.e., $T_{ji}^k(z^k, r^k) = T_i^{\mathrm{Logit}}(r^k)$, revise their strategy choices based only on payoffs, and the probability of switching to strategy $i$ is independent of the current strategy $j$.

As discussed in [35], the logit protocol is regarded as a perturbed version of the best response protocol, where the level of perturbation is quantified by the constant $\eta$. In particular, $T^{\mathrm{Logit}} = (T_1^{\mathrm{Logit}},\cdots,T_{n^k}^{\mathrm{Logit}})$ can be expressed as

$T^{\mathrm{Logit}}(r^k) = \arg\max_{z^k \in \mathrm{int}(X^k)} \big((z^k)^T r^k - \eta h(z^k)\big),$  (13)

where $h(z^k) = \sum_{i=1}^{n^k} z_i^k \ln z_i^k$ is the negative of the entropy of $z^k$. The term $-\eta h(z^k)$ can be viewed as a regularization in the maximization (13) that incentivizes population $k$ to maintain diversity, quantified as $-h(z^k)$, in its strategy selection. Note that, as illustrated in [47, Section V-B], by tuning the parameter $\eta$, the EDM (11) defined by the logit protocol ensures convergence of the social state in a larger class of population games compared to other protocols.

When the payoffs are subject to time delays as in (6) or (9), under existing EDMs the social state tends to oscillate or to converge to equilibrium points that are different from the Nash equilibrium. To illustrate this, using the logit protocol (12) with two different values of $\eta$, Figs. 3 and 4 depict social state trajectories derived in the congestion game (2), where the payoff function is subject to a unit time delay (6), and in the zero-sum game (3), where the payoff vector is defined by the smoothing PDM (9), respectively. We observe that when $\eta$ is small, the resulting social state trajectories oscillate, whereas with a sufficiently large $\eta$, the trajectories converge to an equilibrium point located away from the Nash equilibrium of the respective games.

⁷The reference [13, Chapter 5] summarizes well-known protocols developed in the game theory literature. We also refer the interested reader to [13, Chapter 10] and [37, Section IV] for the derivation of (11) using strategy revision protocols.

[Figure: two simplex plots with vertices $e_1$, $e_2$, $e_3$; panel (a) $\eta = 0.1$, panel (b) $\eta = 0.6$.]
Fig. 4. State trajectories of population 1 under the logit protocol (12) with $\eta = 0.1, 0.6$ in the zero-sum game (3). The payoff vector is determined by (9) with $\lambda = 1$. The red circle in both (a) and (b) marks the Nash equilibrium, and the red X mark in (b) denotes the unique limit point of all the trajectories.

To overcome the limitations of existing protocols, we propose a new strategy revision protocol and analyze its convergence properties to rigorously show that the new protocol allows all agents to asymptotically attain the Nash equilibrium even with time delays in payoff mechanisms. We emphasize that such a convergence result for the new model is a distinct contribution.

As we illustrated in Figs. 3 and 4, the same convergence cannot be attained with existing models. We formally state the main problem as follows.

Problem 1: Design a strategy revision protocol $T_{ji}^k$ and find conditions on the PDM (5) and EDM (11) under which the social state $x(t)$ converges to the Nash equilibrium set $NE(\mathcal{F})$:

$\lim_{t\to\infty} \inf_{z \in NE(\mathcal{F})} \|x(t) - z\|_2 = 0.$

III. LITERATURE REVIEW

We survey some of the relevant publications in the multi-agent games literature, discuss the effect of time delays in payoff mechanisms, and review works that explore similar ideas of adopting regularization in modeling multi-agent decision making. We then explain how our contributions are distinct.

Multi-agent decision problems formalized by the population games and evolutionary dynamics framework have been a major research theme in control systems and neighboring research communities due to their importance in a wide range of applications [3], [4], [6], [8]–[10], [12]. As has been well documented in [13], [48], the framework has a long and concrete history of research endeavors in developing Lyapunov stability-based techniques to analyze the long-term behavior of decision-making models.

The authors of [46] present pioneering work in generalizing the conventional population games formalism to include dynamic payoff mechanisms, an earlier form of the PDM, and in exploring the use of passivity from dynamical system theory [49] for stability analysis. In subsequent works, such as [50], [51], the PDM formalism has been adopted to model time delays in payoff mechanisms, as in our work, and also to design payoff mechanisms that incentivize decision-making agents to learn and attain the generalized Nash equilibrium, i.e., the Nash equilibrium satisfying given constraints on the agent decision making.

Passivity-based stability analysis presented in [46] unifies notable stability results in the game theory literature, e.g., [44]. Further studies have led to more concrete characterization of stability and development of passivity-based tools for convergence analysis in population games. The tutorial article [37] and its supplementary material [36] detail such formalization and technical discussions on δ-passivity in population games and a wide class of EDMs. The authors of [52] discuss a more general framework – dissipativity tools – for the convergence analysis and explain its importance in analyzing road congestion with mixed autonomy.⁸ We also refer the interested reader to [53], [54] for different applications of passivity/dissipativity theory in finite games.

There is a substantial body of literature that investigates the effect of time delays in multi-agent games [18]–[33].

These references, as we also illustrated in Figs. 3(a) and 4(a), explain that such time delays result in oscillation of state trajectories. In particular, the references [18], [23]–[25], [28], [29], [31] discuss stability of the replicator dynamics in population games defined by affine payoff functions that are subject to time delays. Notably, [25], [26], [30] adopt Hopf bifurcation analysis to rigorously show that oscillation of state trajectories emerges as the time delay increases. Stability and bifurcation analysis on other types of EDMs, such as the best response dynamics and imitation dynamics, in population games with time delays are investigated in [19], [24], [26], [27], [32], [33]. Whereas these works regard the time delays as deterministic parameters in their stability analysis, others [21], [22], [32] study stability of the Nash equilibrium when the time delays are defined by random variables with exponential, uniform, or discrete distributions.

Regularization in designing agent decision-making models has also been explored in multi-agent games. References [53]–[58] adopt regularization to design reinforcement learning models and discuss convergence to the Nash equilibrium in finite games. [55] presents an earlier work on so-called exponentially discounted reinforcement learning – later further developed in [54], [58] – and discusses how the regularization improves the convergence of the learning dynamics. The authors of [56] provide extensive discussions on reinforcement learning models in finite games, including the exponentially discounted reinforcement learning model. They rigorously explain convergence properties of regularization-based reinforcement learning models, and also investigate a wide range of control costs to specify the regularization.

[57] explores the use of regularization in population game settings and proposes Riemannian game dynamics, where the regularization is defined by a Riemannian metric such as the Euclidean norm. The authors explain how the Riemannian game dynamics generalize some existing models, such as the replicator dynamics and projection dynamics, and present stability results for their model.

⁸Although adopting the dissipativity tool of [52] would lead to more general discussions on convergence analysis, for conciseness, we adopt the passivity-based approaches [37], [46], as these are sufficient to establish our main results.

More recent works [37], [53], [54] explain the benefit of the regularization using passivity theory, rigorously showing that the regularization in agent decision-making models enhances the models' passivity measure. When such decision-making models are interconnected with PDMs, the excess of passivity in the former compensates for a shortage of passivity in the latter. Consequently, the feedback interconnection results in the convergence of the social state to an equilibrium state.

Unlike existing studies on the effect of time delays in population games, which focus on identifying technical conditions under which oscillation of state trajectories emerges, we propose a new model that guarantees the convergence of the social state to the Nash equilibrium in a class of contractive population games that are subject to time delays. Although the ideas of regularization and passivity-based analysis have been reported in the multi-agent games literature, all previous results only establish convergence to the "perturbed" Nash equilibrium. This paper substantially extends our earlier work [50] by generalizing convergence results for the new model to multi-population scenarios and to a more general class of PDMs, such as the smoothing PDM.

IV. LEARNING NASH EQUILIBRIUM WITH DELAYED PAYOFFS

Given $z^k = (z_1^k,\cdots,z_{n^k}^k)$ and $\theta^k = (\theta_1^k,\cdots,\theta_{n^k}^k)$, both belonging to the interior of the population state space $\mathrm{int}(X^k)$, we define the Kullback-Leibler divergence (KLD) as

$\mathcal{D}(z^k\|\theta^k) = \sum_{i=1}^{n^k} z_i^k \ln\dfrac{z_i^k}{\theta_i^k}.$  (14)

We compute the gradient of (14) in $X^k$ (with respect to the first argument $z^k$) as

$\nabla\mathcal{D}(z^k\|\theta^k) = \left(\ln\dfrac{z_1^k}{\theta_1^k}\ \cdots\ \ln\dfrac{z_{n^k}^k}{\theta_{n^k}^k}\right)^T.$  (15)

Note that (14) is a convex function of $z^k$. For notational convenience, we use $\mathcal{D}(z\|\theta) = \sum_{k=1}^M \mathcal{D}(z^k\|\theta^k)$ and $\nabla\mathcal{D}(z\|\theta) = (\nabla\mathcal{D}(z^1\|\theta^1),\cdots,\nabla\mathcal{D}(z^M\|\theta^M))$.
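A direct transcription (ours) of (14) and (15) for one population; the society-level quantities are obtained by summing and stacking over $k$, as in the notational convention above.

```python
# Sketch (ours): KL divergence (14) and its gradient (15) on int(X^k).
import numpy as np

def kl_divergence(zk, thetak):
    zk, thetak = np.asarray(zk, float), np.asarray(thetak, float)
    return float(np.sum(zk * np.log(zk / thetak)))

def kl_gradient(zk, thetak):
    return np.log(np.asarray(zk, float) / np.asarray(thetak, float))
```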

For given $\theta^k \in \mathrm{int}(X^k)$, using (14), we define the KLD Regularized Learning (KLD-RL) protocol $T^{\mathrm{KLD\text{-}RL}}(\theta^k, r^k) = (T_1^{\mathrm{KLD\text{-}RL}}(\theta^k, r^k),\cdots,T_{n^k}^{\mathrm{KLD\text{-}RL}}(\theta^k, r^k))$ that maximizes a regularized average payoff:

$T^{\mathrm{KLD\text{-}RL}}(\theta^k, r^k) = \arg\max_{z^k \in \mathrm{int}(X^k)} \big((z^k)^T r^k - \eta\mathcal{D}(z^k\|\theta^k)\big),$  (16)

where $r^k \in \mathbb{R}^{n^k}$ is the value of population $k$'s payoff vector and $\eta > 0$ is a weight on the regularization. Under the protocol, the agents revise their strategies to maximize the objective in (16), which combines the average payoff $(z^k)^T r^k$ and the regularization $\mathcal{D}(z^k\|\theta^k)$ weighted by $\eta$.

By a similar argument as in [34], we can find a unique solution to (16) as

$T_i^{\mathrm{KLD\text{-}RL}}(\theta^k, r^k) = \dfrac{\theta_i^k \exp(\eta^{-1} r_i^k)}{\sum_{l=1}^{n^k} \theta_l^k \exp(\eta^{-1} r_l^k)}.$  (17)


Fig. 5. A feedback interconnection illustrating the payoff dynamics model (PDM) and the KLD-RL model along with an algorithm for updating the regularization parameter $\theta$.

One key aspect of the KLD-RL protocol (17) is in using $\theta^k$ as a tuning parameter. As a special case, when we assign $\theta^k = x^k$, (17) becomes the imitative logit protocol [13, Example 5.4.7].

In Section IV-B, we propose an algorithm to compute an appropriate value of θk that guarantees the convergence of the social state to the Nash equilibrium set.

To further discuss the effect of θk on the agents’ strategy revision, let us consider the following two special cases.

Case I: If $r_1^k = \cdots = r_{n^k}^k$, then $T_i^{\mathrm{KLD\text{-}RL}}(\theta^k, r^k) = \theta_i^k$.

Case II: If $\theta_1^k = \cdots = \theta_{n^k}^k$, then $T_i^{\mathrm{KLD\text{-}RL}}(\theta^k, r^k) = \dfrac{\exp(\eta^{-1} r_i^k)}{\sum_{l=1}^{n^k} \exp(\eta^{-1} r_l^k)}.$

From Case I, we observe that $\theta^k$ serves as a bias in the agents' strategy revision. When the payoffs are identical across all strategies, the agents tend to select a strategy with a higher value of $\theta_i^k$. When there is no bias (Case II), (17) is equivalent to the logit protocol (12).

To study the asymptotic behavior of the EDM (11) defined by the KLD-RL protocol (17), we express the state equation of the closed-loop model consisting of (5) and (11) as follows: for $i$ in $\{1,\cdots,n^k\}$ and $k$ in $\{1,\cdots,M\}$,

$\dot{x}_i^k(t) = \dfrac{\theta_i^k \exp(\eta^{-1} p_i^k(t))}{\sum_{l=1}^{n^k} \theta_l^k \exp(\eta^{-1} p_l^k(t))} - x_i^k(t)$  (18a)

$p(t) = (\mathcal{G}x)(t).$  (18b)

We assume that, given an initial condition $(x(0), p(0)) \in X \times \mathbb{R}^n$, there is a unique solution to (18). Since $p(t)$ is bounded by Assumption 2-3, the social state $x(t)$ belongs to $\mathrm{int}(X)$ for all $t \ge 0$. Fig. 5 illustrates our framework consisting of (18) and a parameter update algorithm for $\theta = (\theta^1,\cdots,\theta^M)$.
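For illustration, a forward-Euler simulation of the closed loop (18) with the delayed-payoff PDM (6) can be assembled from the sketches above (ours; `DelayedPayoffPDM` and `kld_rl_protocol` are the hypothetical helpers defined earlier, and `payoff_fn` is any payoff function such as (2)).

```python
# Sketch (ours): Euler integration of the closed-loop model (18) with PDM (6).
import numpy as np

def simulate_kld_rl(payoff_fn, pop_sizes, x0, theta, eta, delay, dt=0.01, T=50.0):
    pdm = DelayedPayoffPDM(payoff_fn, delay)
    x = np.asarray(x0, dtype=float).copy()
    theta = np.asarray(theta, dtype=float)
    for step in range(int(T / dt)):
        t = step * dt
        p = pdm.payoff(t, x)
        offset = 0
        x_new = x.copy()
        for nk in pop_sizes:
            sl = slice(offset, offset + nk)
            choice = kld_rl_protocol(theta[sl], p[sl], eta)
            x_new[sl] = x[sl] + dt * (choice - x[sl])   # Euler step of (18a)
            offset += nk
        x = x_new
    return x
```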

In Section IV-A, we present preliminary convergence analysis for (18) with the parameter θ fixed. Then, in Section IV-B, we propose a parameter update algorithm that ensures the convergence of the social state x(t) derived by (18) to the Nash equilibrium set.

A. Preliminary Convergence Analysis

Our analysis hinges on the passivity technique developed in [36], [37], [46]. We begin by reviewing two notions of passivity – weak δ-antipassivity and δ-passivity – adopted for (5) and (11), respectively. We then establish stability of the closed-loop model (18).

Definition 3 (Weak δ-Antipassivity with Deficit ν [37]): The PDM (5) is weakly δ-antipassive with deficit $\nu$ if there is a positive and bounded function $\alpha_{x,p} : \mathbb{R}_+ \to \mathbb{R}_+$ for which

$\alpha_{x,p}(t_0) \ge \int_{t_0}^{t} \big(\dot{x}^T(\tau)\dot{p}(\tau) - \hat{\nu}\,\dot{x}^T(\tau)\dot{x}(\tau)\big)\,d\tau, \quad \forall t \ge t_0 \ge 0$  (19)

holds for every social state trajectory $x(t),\ t \ge 0$ and for every nonnegative constant $\hat{\nu} > \nu$, where the payoff vector trajectory $p(t),\ t \ge 0$ is determined by (5) and given $x(t),\ t \ge 0$. The function $\alpha_{x,p}$ satisfies $\lim_{t\to\infty} \alpha_{x,p}(t) = 0$ whenever $\lim_{t\to\infty} \|\dot{x}(t)\|_2 = 0$.⁹

The constant $\nu$ is a measure of the passivity deficit in (5). Viewing (5) as a dynamical system, according to [49], the function $\alpha_{x,p}$ is an estimate of the stored energy of (5).

In the following lemmas, we establish weak δ-antipassivity of the payoff function with a time delay (6) and of the smoothing PDM (9). The proofs of the lemmas are provided in Appendix A.

Lemma 1: The payoff function with a time delay (6) is weakly δ-antipassive with positive deficit $\nu = B_{D\mathcal{F}}$, where $B_{D\mathcal{F}}$ is the upper bound on $D\mathcal{F}$ as defined in Assumption 1, and $\alpha_{x,p}(t_0)$ in (19) is defined as

$\alpha_{x,p}(t_0) = \dfrac{B_{D\mathcal{F}}}{2} \int_{t_0 - d}^{t_0} \|\dot{x}(\tau)\|_2^2\,d\tau.$  (20)

By (20), it holds that $\lim_{t\to\infty} \|\dot{x}(t)\|_2 = 0$ implies $\lim_{t\to\infty} \alpha_{x,p}(t) = 0$.

Lemma 2: Consider the smoothing PDM (9) where its underlying payoff function is contractive and defined by an affine mapping $\mathcal{F}(x) = Fx + b$. The smoothing PDM is weakly δ-antipassive with positive deficit $\nu$ given by

$\nu = \dfrac{1}{4}\|F - F^T\|_2$

and $\alpha_{x,p}(t_0)$ in (19) is defined as

$\alpha_{x,p}(t_0) = \sqrt{n}\,\|p(t_0) - Fx(t_0) - b\|_2,$  (21)

where $n = \sum_{k=1}^M n^k$.

Since (9) satisfies Assumption 2-1, $\lim_{t\to\infty} \|\dot{x}(t)\|_2 = 0$ implies $\lim_{t\to\infty} \alpha_{x,p}(t) = 0$. If $F$ is symmetric, then $\nu = 0$ and the smoothing PDM (9) becomes weakly δ-antipassive without any deficit ($\nu = 0$), which coincides with [36, Proposition 7-i)]. Hence, Lemma 2 extends [36, Proposition 7-i)] to the case where $F$ is non-symmetric.

⁹We use the subscript to indicate the dependency of the function $\alpha_{x,p}$ on both of the trajectories $x(t),\ t \ge 0$ and $p(t),\ t \ge 0$. We note that the requirement on $\alpha_{x,p}$, which we impose to establish our stability results, is not part of the original definition of weak δ-antipassivity given in [37].


Definition 4 (δ-Passivity with Surplus η [37]): The EDM (11) is δ-passive with surplus $\eta$ if there is a continuously differentiable function $\mathcal{S} : X \times \mathbb{R}^n \to \mathbb{R}_+$ for which

$\mathcal{S}(x(t), p(t)) - \mathcal{S}(x(t_0), p(t_0)) \le \int_{t_0}^{t} \big(\dot{x}^T(\tau)\dot{p}(\tau) - \hat{\eta}\,\dot{x}^T(\tau)\dot{x}(\tau)\big)\,d\tau, \quad \forall t \ge t_0 \ge 0$  (22)

holds for every payoff vector trajectory $p(t),\ t \ge 0$ and for every nonnegative constant $\hat{\eta} < \eta$, where the social state trajectory $x(t),\ t \ge 0$ is determined by (11) and given $p(t),\ t \ge 0$. We refer to $\mathcal{S}$ as the δ-storage function. We call $\mathcal{S}$ informative if the function satisfies

$\mathcal{S}(z, r) = 0 \iff \mathcal{V}(z, r) = 0 \iff \nabla_z^T \mathcal{S}(z, r)\,\mathcal{V}(z, r) = 0,$

where $\mathcal{V} = (\mathcal{V}^1,\cdots,\mathcal{V}^M)$ denotes the vector field of (11) defining $\dot{x}^k(t) = \mathcal{V}^k(x^k(t), p^k(t))$.

The constant η is a measure of passivity surplus in (11).

Compared to weak δ-antipassivity defined for (5), Definition 4 states a stronger notion of passivity, since it requires the existence of the δ-storage function $\mathcal{S}$.

In the following lemma, we establish δ-passivity of the KLD-RL EDM (18a). The proof of the lemma is provided in Appendix B.

Lemma 3: Given a fixed weight $\eta > 0$ and regularization parameter $\theta \in \mathrm{int}(X)$, the KLD-RL EDM (18a) is δ-passive with surplus $\eta$ and has an informative δ-storage function $\mathcal{S}_\theta : X \times \mathbb{R}^n \to \mathbb{R}_+$ expressed as¹⁰

$\mathcal{S}_\theta(z, r) = \max_{\bar{z} \in \mathrm{int}(X)} \big(\bar{z}^T r - \eta\mathcal{D}(\bar{z}\|\theta)\big) - \big(z^T r - \eta\mathcal{D}(z\|\theta)\big),$  (23)

where $z = (z^1,\cdots,z^M)$, $r = (r^1,\cdots,r^M)$, and $\theta = (\theta^1,\cdots,\theta^M)$.

¹⁰We use the subscript to specify the dependency of $\mathcal{S}_\theta$ on $\theta$. Also, as we discussed in Section IV, there is a unique solution, given as in (17), for the maximization in (23).

By Assumption 2, (16), and [35, Lemma A.1], the stationary point $(\bar{x}, \bar{p})$ of (18) satisfies

$\max_{z \in X}\,(z - \bar{x})^T\big(\mathcal{F}(\bar{x}) - \eta\nabla\mathcal{D}(\bar{x}\|\theta)\big) = 0$  (24a)

$\bar{p} = \mathcal{F}(\bar{x}).$  (24b)

Following a similar argument as in [35], $\bar{x}$ is the Nash equilibrium of the virtual payoff $\tilde{\mathcal{F}}_{\eta,\theta}$ defined as

$\tilde{\mathcal{F}}_{\eta,\theta}(z) = \mathcal{F}(z) - \eta\nabla\mathcal{D}(z\|\theta).$  (25)

The state $\bar{x}$ is often referred to as the perturbed Nash equilibrium of $\mathcal{F}$. Let $PNE_{\eta,\theta}(\mathcal{F})$ be the set of all perturbed Nash equilibria of $\mathcal{F}$, formally defined as follows.

Definition 5: Given $\eta > 0$ and $\theta \in \mathrm{int}(X)$, define the set $PNE_{\eta,\theta}(\mathcal{F})$ of the perturbed Nash equilibria of $\mathcal{F}$ as

$PNE_{\eta,\theta}(\mathcal{F}) = \big\{\bar{z} \in X \mid (\bar{z} - z)^T \tilde{\mathcal{F}}_{\eta,\theta}(\bar{z}) \ge 0,\ \forall z \in X\big\},$  (26)

where $\tilde{\mathcal{F}}_{\eta,\theta}$ is the virtual payoff (25) associated with $\mathcal{F}$.

Proposition 1: Consider the closed-loop model (18) consisting of the KLD-RL EDM (18a) and the PDM (18b). Suppose that (18b) is weakly δ-antipassive with deficit $\nu$ and the weight $\eta$ of (18a) satisfies $\eta > \nu$. Then the social state $x(t)$ of (18) converges to $PNE_{\eta,\theta}(\mathcal{F})$:

$\lim_{t\to\infty} \inf_{z \in PNE_{\eta,\theta}(\mathcal{F})} \|x(t) - z\|_2 = 0.$  (27)

Proof: The proof follows from [36, Lemma 1], provided that $\eta$ is greater than $\nu$. The original statement of [36, Lemma 1] was established for closed-loop models that can be expressed as a finite-dimensional dynamical system. The model (18) may be infinite-dimensional, for instance, when (18b) is the payoff function with a time delay (6). However, given that (18) is well-defined with a unique solution, under Assumption 2, the technical arguments used in the proof of [36, Lemma 1] can be applied to infinite-dimensional models including (6). ■

Proposition 1 implies that if the surplus of passivity in (18a) exceeds the lack of passivity in (18b), the social state, derived by the closed-loop model (18), converges to the perturbed Nash equilibrium set (26). As a consequence, using Lemmas 1 and 2, and Proposition 1, we can establish convergence to the perturbed Nash equilibrium set for the payoff function with a time delay (6) and smoothing PDM (9).

Corollary 1: Suppose the PDM (18b) of the closed-loop model (18) is defined by (6). The social state trajectory $x(t),\ t \ge 0$, determined by (18) converges to $PNE_{\eta,\theta}(\mathcal{F})$ if it holds that $\eta > B_{D\mathcal{F}}$.

Corollary 2: Suppose the PDM (18b) of the closed-loop model (18) is defined by (9) with an affine payoff function $\mathcal{F}(x) = Fx + b$. The social state trajectory $x(t),\ t \ge 0$, determined by (18) converges to $PNE_{\eta,\theta}(\mathcal{F})$ if it holds that $\eta > \frac{1}{4}\|F - F^T\|_2$.

B. Iterative KLD Regularization and Convergence Guarantee

Suppose the population game $\mathcal{F}$ underlying the PDM (5) has a Nash equilibrium $x^{NE}$ belonging to $\mathrm{int}(X)$. If $\theta$ coincides with $x^{NE}$, then $PNE_{\eta,\theta}(\mathcal{F}) = \{x^{NE}\}$ and, under the conditions of Proposition 1, the social state $x(t)$ converges to $x^{NE}$. Therefore, to achieve convergence to the Nash equilibrium set, the key requirement is to attain $\theta = x^{NE}$. In this section, we discuss the design of a parameter update algorithm that specifies how the agents update $\theta$ to asymptotically attain the Nash equilibrium.

Let the social state $x(t)$ evolve according to (18) and the regularization parameter $\theta$ be iteratively updated at each time instant of a discrete-time sequence $\{t_l\}_{l=1}^{\infty}$, determined by the procedure described in Algorithm 1 below, as $\theta = x(t_l) \in \mathrm{int}(X)$, i.e., $\theta$ is reset to the current social state $x(t_l)$ at each time $t_l$. Let $\{\theta_l\}_{l=1}^{\infty}$ be the resulting sequence of parameter updates. Suppose the following two conditions hold:

$\max_{z \in X}\,(z - \theta_{l+1})^T\big(\mathcal{F}(\theta_{l+1}) - \eta\nabla\mathcal{D}(\theta_{l+1}\|\theta_l)\big) \le \dfrac{\eta}{2}\mathcal{D}(\theta_{l+1}\|\theta_l)$  (28a)

$\lim_{l\to\infty} \mathcal{D}(\theta_{l+1}\|\theta_l) = 0 \;\Longrightarrow\; \lim_{l\to\infty} \alpha_{x,p}(t_l) = 0 \ \text{and}\ \lim_{l\to\infty} \|p(t_l) - \mathcal{F}(\theta_l)\|_2 = 0,$  (28b)


where $\alpha_{x,p}$ is the function defined as in (19) for the PDM (18b). According to (24a) and (26), condition (28a) means that $\theta$ is updated to a new value, i.e., $\theta_{l+1} = x(t_{l+1})$, when the state $x(t_{l+1})$ is sufficiently close to $PNE_{\eta,\theta_l}(\mathcal{F})$. Condition (28b) implies that as the sequence $\{\theta_l\}_{l=1}^{\infty}$ converges, the estimated stored energy in the PDM (18b) dissipates and the payoff vector $p(t_l)$ converges to $\mathcal{F}(\theta_l)$.

The following lemma states the convergence of the social state $x(t)$ to the Nash equilibrium set $NE(\mathcal{F})$ if the sequences $\{t_l\}_{l=1}^{\infty}$ and $\{\theta_l\}_{l=1}^{\infty}$ satisfy (28). The proof of the lemma is given in Appendix C.

Lemma 4: Consider that the social state $x(t)$ and payoff vector $p(t)$ evolve according to the closed-loop model (18), the PDM (18b) is weakly δ-antipassive with deficit $\nu$, and the weight $\eta$ of the KLD-RL EDM (18a) satisfies $\eta > \nu$. Suppose the parameter $\theta$ of (18a) is iteratively updated according to $\theta_l = x(t_l),\ l \in \mathbb{N}$, such that (28) holds, and one of the following two conditions holds.¹¹

(C1) $\mathcal{F}$ is contractive and has a Nash equilibrium in $\mathrm{int}(X)$.
(C2) $\mathcal{F}$ is strictly contractive.

Then, the state $x(t)$ converges to the Nash equilibrium set:

$\lim_{t\to\infty} \inf_{z \in NE(\mathcal{F})} \|x(t) - z\|_2 = 0.$  (29)

Lemma 4 suggests that if the parameter $\theta$ is updated in such a way that the resulting sequence of parameter updates satisfies (28), convergence to the Nash equilibrium set is guaranteed. Unlike in Proposition 1, the underlying population game $\mathcal{F}$ needs to be contractive to establish the convergence.

To evaluate (28a) for the parameter update, since the agents may not have access to the quantity $\mathcal{F}(x(t))$, they need to estimate it using the payoff vector $p(t)$. Suppose the estimation error $\|p(t) - \mathcal{F}(x(t))\|_2$ is bounded by a function $\beta_{x,p}(t)$, i.e.,

$\|p(t) - \mathcal{F}(x(t))\|_2 \le \beta_{x,p}(t),$  (30)

for which $\lim_{t\to\infty} \beta_{x,p}(t) = 0$ holds whenever the trajectory $x(t),\ t \ge 0$ satisfies $\lim_{t\to\infty} \|\dot{x}(t)\|_2 = 0$. Note that, according to Assumption 2-1, such a function $\beta_{x,p}$ always exists. Thus, we derive the following relation:

$\max_{z \in X}\,(z - \theta_{l+1})^T\big(\mathcal{F}(\theta_{l+1}) - \eta\nabla\mathcal{D}(\theta_{l+1}\|\theta_l)\big)$
$\le \max_{z \in X}\,(z - \theta_{l+1})^T\big(p(t_{l+1}) - \eta\nabla\mathcal{D}(\theta_{l+1}\|\theta_l)\big) + \max_{z \in X}\,(z - \theta_{l+1})^T\big(\mathcal{F}(\theta_{l+1}) - p(t_{l+1})\big)$
$\le \max_{z \in X}\,(z - \theta_{l+1})^T\big(p(t_{l+1}) - \eta\nabla\mathcal{D}(\theta_{l+1}\|\theta_l)\big) + \sqrt{2M}\,\|p(t_{l+1}) - \mathcal{F}(\theta_{l+1})\|_2.$  (31)

Using (31), we can verify that (28) holds if the parameter is updated as $\theta_{l+1} = x(t_{l+1})$ at each $t_{l+1}$ satisfying

$\max_{z \in X}\,(z - \theta_{l+1})^T\big(p(t_{l+1}) - \eta\nabla\mathcal{D}(\theta_{l+1}\|\theta_l)\big) + \alpha_{x,p}(t_{l+1}) + \sqrt{2M}\,\beta_{x,p}(t_{l+1}) \le \dfrac{\eta}{2}\mathcal{D}(\theta_{l+1}\|\theta_l).$  (32)

¹¹Condition (C1) implies that the Nash equilibria of a contractive game $\mathcal{F}$ can be located anywhere in $X$ as long as at least one of them belongs to $\mathrm{int}(X)$.

In what follows, we discuss whether such a time instant $t_{l+1}$ always exists. Suppose the parameter $\theta$ is fixed to $\theta_l = x(t_l)$. According to Proposition 1 and the definition of the KLD-RL EDM (18a), the social state $x(t)$ converges to the perturbed Nash equilibrium set $PNE_{\eta,\theta_l}(\mathcal{F})$, which implies

$\lim_{t\to\infty} \|\dot{x}(t)\|_2 = 0$  (33a)

$\lim_{t\to\infty} \max_{z \in X}\,(z - x(t))^T\big(p(t) - \eta\nabla\mathcal{D}(x(t)\|\theta_l)\big) = 0.$  (33b)

Recall that $\lim_{t\to\infty} \|\dot{x}(t)\|_2 = 0$ implies $\lim_{t\to\infty} \alpha_{x,p}(t) = 0$ and $\lim_{t\to\infty} \beta_{x,p}(t) = 0$. Hence, by (33), the following term vanishes as $t$ tends to infinity:

$\max_{z \in X}\,(z - x(t))^T\big(p(t) - \eta\nabla\mathcal{D}(x(t)\|\theta_l)\big) + \alpha_{x,p}(t) + \sqrt{2M}\,\beta_{x,p}(t).$  (34)

Consequently, either we can find $t_{l+1}$ satisfying (32) or the state $x(t)$ converges to $\theta_l$, i.e., $\lim_{t\to\infty} \mathcal{D}(x(t)\|\theta_l) = 0$. However, by Proposition 1 and Definition 5, the latter case implies that $\theta_l$ needs to be the Nash equilibrium and, hence, the limit point of $x(t)$. In conclusion, in both cases, resorting to Lemma 4, the convergence of the social state $x(t)$ to the Nash equilibrium set is guaranteed. In what follows, we only consider the case where the parameter update rule (32) yields an infinite sequence.

Inspired by (32), we propose an algorithm to realize such a parameter update for the cases where the PDM (18b) is the payoff function with a time delay (6) or the smoothing PDM (9).

Algorithm 1: Suppose initial values of the parameter $\theta \in \mathrm{int}(X)$ and a time instant variable $t_0$ are given. Update $\theta$ and $t_0$ as $\theta = x(t_1)$ and $t_0 = t_1$, respectively, if the following condition holds at any time instant $t_1 > t_0$.

1) Payoff function with a time delay (6):

$\max_{z \in X}\,(z - x(t_1))^T\big(p(t_1) - \eta\nabla\mathcal{D}(x(t_1)\|\theta)\big) + \sqrt{2M}\,B_{D\mathcal{F}} B_d \max_{\tau \in [t_1 - B_d,\, t_1]} \|\dot{x}(\tau)\|_2 \le \dfrac{\eta}{2}\mathcal{D}(x(t_1)\|\theta)$  (35)

2) Smoothing PDM (9):

$\max_{z \in X}\,(z - x(t_1))^T\big(p(t_1) - \eta\nabla\mathcal{D}(x(t_1)\|\theta)\big) + \sqrt{2M}\Big(\big(\|p(\gamma t_1)\|_2 + B_{\mathcal{F}}\big)\exp(-\lambda(1-\gamma)t_1) + B_{D\mathcal{F}} \int_{\gamma t_1}^{t_1} \exp(-\lambda(t_1 - \tau))\,\|\dot{x}(\tau)\|_2\,d\tau\Big) \le \dfrac{\eta}{2}\mathcal{D}(x(t_1)\|\theta)$  (36)

where $x(t)$ is the social state at time instant $t$, $\gamma$ is any fixed real number in $(0,1)$, and $\dot{x}(\tau)$ can be computed using (18a).
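For concreteness, a sketch (ours) of the update test for case 1), i.e., condition (35). The linear maximization over $X$ is done per population at the simplex vertices, the KL terms use (14)-(15), and `max_xdot_norm` stands for the term $\max_{\tau \in [t_1 - B_d, t_1]} \|\dot{x}(\tau)\|_2$, which the caller is assumed to track along the trajectory; all names are hypothetical.

```python
# Sketch (ours): check condition (35) of Algorithm 1 for the delayed-payoff PDM (6).
import numpy as np

def max_linear_over_X(v, x, pop_sizes):
    """max_{z in X} (z - x)^T v, attained at a vertex of each population's simplex."""
    total, offset = 0.0, 0
    for nk in pop_sizes:
        sl = slice(offset, offset + nk)
        total += float(np.max(v[sl]) - x[sl] @ v[sl])
        offset += nk
    return total

def algorithm1_condition_35(x_t1, p_t1, theta, eta, pop_sizes, B_DF, B_d, max_xdot_norm):
    """Return True if theta may be reset to x(t1)."""
    x_t1, p_t1, theta = (np.asarray(a, dtype=float) for a in (x_t1, p_t1, theta))
    M = len(pop_sizes)
    grad = np.log(x_t1 / theta)                       # gradient (15) at x(t1)
    lhs = max_linear_over_X(p_t1 - eta * grad, x_t1, pop_sizes)
    lhs += np.sqrt(2 * M) * B_DF * B_d * max_xdot_norm
    rhs = 0.5 * eta * float(np.sum(x_t1 * np.log(x_t1 / theta)))  # (eta/2) D(x(t1)||theta)
    return lhs <= rhs
```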

To realize Algorithm 1, the agents only need to know the upper bounds $B_d, B_{D\mathcal{F}}$ on $d, D\mathcal{F}$ for (6), or the bounds $B_{\mathcal{F}}, B_{D\mathcal{F}}$ on $\mathcal{F}, D\mathcal{F}$ for (9). The motivation behind adopting the algorithm is analogous to iterative regularization techniques that have been frequently used in optimization and
