• Tidak ada hasil yang ditemukan

Competitive gradient descent

Dalam dokumen Inference, Computation, and Games (Halaman 158-164)

Chapter VII: Competitive Gradient Descent

7.2 Competitive gradient descent

… Gaussian mixture model. Even in this simple example, trying five different (constant) stepsizes under RMSProp, the existing methods diverge. The typical solution would be to decay the learning rate. However, even with a constant learning rate, CGD succeeds in approximating the main features of the target distribution for all of these stepsize choices. In fact, throughout our experiments we never saw CGD diverge.

In order to measure the convergence speed more quantitatively, we next consider a nonconvex matrix estimation problem, measuring computational complexity in terms of the number of gradient computations performed. We observe that all methods show improved speed of convergence for larger stepsizes, with CGD roughly matching the convergence speed of optimistic gradient descent [61] at the same stepsize. However, as we increase the stepsize, the other methods quickly start diverging, whereas CGD continues to improve, thus attaining significantly better convergence rates (more than two times as fast as the other methods in the noiseless case, with the ratio increasing for larger and more difficult problems). For small stepsizes or games with weak interactions, on the other hand, CGD automatically invests less computational time per update, thus gracefully transitioning to a cheap correction of GDA at minimal computational overhead. We furthermore present an application to constrained optimization problems arising in model-based reinforcement learning and refer to [198] for recent applications of CGD to multi-agent reinforcement learning.

How to linearize a game. To motivate this algorithm, we remind ourselves that gradient descent with stepsize $\eta$ applied to the function $f : \mathbb{R}^m \to \mathbb{R}$ can be written as

π‘₯π‘˜+1=arg min

π‘₯∈Rπ‘š

(π‘₯> βˆ’π‘₯>

π‘˜)βˆ‡π‘₯𝑓(π‘₯π‘˜) + 1 2πœ‚

kπ‘₯βˆ’π‘₯π‘˜k2. (7.2) This can be interpreted as a (single) player solving a local linear approximation of the (minimization) game, subject to a quadratic penalty that expresses her limited confidence in the global accuracy of the model. The natural generalization of this idea to the competitive case should then be given by the two players solving a local approximation of the true game, both subject to a quadratic penalty that expresses their limited confidence in the accuracy of the local approximation.

To implement this idea, we need to generalize the linear approximation in the single-agent setting to the competitive setting: How to linearize a game?

Linear or Multilinear. GDA answers the above question by choosing a linear approximation of $f, g : \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}$. This seemingly natural choice has the flaw that linear functions cannot express any interaction between the two players and are thus unable to capture the competitive nature of the underlying problem.

From this point of view, it is not surprising that the convergent modifications of GDA are, implicitly or explicitly, based on higher-order approximations (see also [149]). An equally valid generalization of the linear approximation in the single-player setting is to use a bilinear approximation in the two-player setting. Since the bilinear approximation is the lowest-order approximation that can capture some interaction between the two players, we argue that the natural generalization of gradient descent to competitive optimization is not GDA, but rather the update rule $(x_{k+1}, y_{k+1}) = (x_k, y_k) + (x, y)$, where $(x, y)$ is a Nash equilibrium of the game

\[
\begin{aligned}
&\min_{x \in \mathbb{R}^m} \; x^\top \nabla_x f + x^\top D^2_{xy} f\, y + y^\top \nabla_y f + \frac{1}{2\eta}\, x^\top x \\
&\min_{y \in \mathbb{R}^n} \; y^\top \nabla_y g + y^\top D^2_{yx} g\, x + x^\top \nabla_x g + \frac{1}{2\eta}\, y^\top y.
\end{aligned} \tag{7.3}
\]

Indeed, the (unique) Nash equilibrium of (7.3) can be computed in closed form.

Theorem 23. Among all (possibly randomized) strategies with finite first moment, the only Nash equilibrium of the game (7.3) is given by

π‘₯ =βˆ’πœ‚

Idβˆ’πœ‚2𝐷2

π‘₯ 𝑦𝑓 𝐷2

𝑦π‘₯𝑔 βˆ’1

βˆ‡π‘₯𝑓 βˆ’πœ‚ 𝐷2

π‘₯ π‘¦π‘“βˆ‡π‘¦π‘”

(7.4) 𝑦 =βˆ’πœ‚

Idβˆ’πœ‚2𝐷2

𝑦π‘₯𝑔 𝐷2

π‘₯ 𝑦𝑓 βˆ’1

βˆ‡π‘¦π‘”βˆ’πœ‚ 𝐷2

𝑦π‘₯π‘”βˆ‡π‘₯𝑓

, (7.5)

given that the matrix inverses in the above expression exist.2

Proof. Let $X, Y$ be the randomized strategies of the two agents. By subtracting and adding $\mathbb{E}[X]^\top \mathbb{E}[X]/(2\eta)$, $\mathbb{E}[Y]^\top \mathbb{E}[Y]/(2\eta)$, and taking expectations, we can rewrite the game as

\[
\min_{\mathbb{E}[X] \in \mathbb{R}^m} \; \mathbb{E}[X]^\top \nabla_x f + \mathbb{E}[X]^\top D^2_{xy} f\, \mathbb{E}[Y] + \mathbb{E}[Y]^\top \nabla_y f + \frac{1}{2\eta}\, \mathbb{E}[X]^\top \mathbb{E}[X] + \frac{1}{2\eta}\, \operatorname{Var}[X] \tag{7.6}
\]
\[
\min_{\mathbb{E}[Y] \in \mathbb{R}^n} \; \mathbb{E}[Y]^\top \nabla_y g + \mathbb{E}[Y]^\top D^2_{yx} g\, \mathbb{E}[X] + \mathbb{E}[X]^\top \nabla_x g + \frac{1}{2\eta}\, \mathbb{E}[Y]^\top \mathbb{E}[Y] + \frac{1}{2\eta}\, \operatorname{Var}[Y]. \tag{7.7}
\]

Thus, the objective value of either player can always be improved by decreasing the variance while keeping the expectation the same, meaning that the optimal value will always (and only) be achieved by a deterministic strategy. We can then replace $\mathbb{E}[X], \mathbb{E}[Y]$ with $x, y$, set the derivative of the first expression with respect to $x$ and of the second expression with respect to $y$ to zero, and solve the resulting system of two equations,

\[
\nabla_x f + D^2_{xy} f\, y + \frac{1}{\eta}\, x = 0, \qquad \nabla_y g + D^2_{yx} g\, x + \frac{1}{\eta}\, y = 0,
\]

for the Nash equilibrium $(x, y)$. □

According to Theorem 23, the game (7.3) has exactly one optimal pair of strategies, which is deterministic. Thus, we can use these strategies as an update rule, generalizing the idea of local optimality from the single- to the multi-agent setting and obtaining Algorithm 16.

What I think that they think that I think ... that they do. Another game-theoretic interpretation of CGD follows from the observation that its update rule can be written as

Ξ”π‘₯ Δ𝑦

!

=βˆ’ Id πœ‚ 𝐷2

π‘₯ 𝑦𝑓 πœ‚ 𝐷2

𝑦π‘₯𝑔 Id

!βˆ’1

βˆ‡π‘₯𝑓

βˆ‡π‘¦π‘”

!

. (7.8)

Applying the expansion $\lambda_{\max}(A) < 1 \Rightarrow (\mathrm{Id} - A)^{-1} = \lim_{N \to \infty} \sum_{k=0}^{N} A^k$ to the above equation, we observe that the first partial sum ($N = 0$) corresponds to the optimal strategy if the other player's strategy stays constant (GDA). The second partial sum ($N = 1$) corresponds to the optimal strategy if the other player thinks that the other player's strategy stays constant (LCGD, see Figure 7.3). The third partial sum ($N = 2$) corresponds to the optimal strategy if the other player thinks that the other player thinks that the other player's strategy stays constant, and so forth, until the Nash equilibrium is recovered in the limit (see Figure 7.2).


Figure 7.2: What I think that they think that I think... The partial sums of a Neumann-series representation of Equation (7.8) represent different orders of opponent-awareness, recovering the Nash equilibrium in the limit.

enough πœ‚, we could use the above series expansion to solve for (Ξ”π‘₯ ,Δ𝑦), which is known as Richardson iteration and would recover high order lookahead [86].

However, expressing it as a matrix inverse will allow us to use optimal Krylov subspace methods to obtain far more accurate solutions with fewer gradient evaluations.
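As a sketch of how this can look in practice (our illustration; `Av` and `ATv` stand for Hessian-vector products that automatic differentiation provides at the cost of a few gradient evaluations): for a zero-sum game ($g = -f$), the $x$-block of Equation (7.8) reduces to the symmetric positive definite Schur complement $\mathrm{Id} + \eta^2\, D^2_{xy} f\, (D^2_{xy} f)^\top$, so the conjugate gradient method applies and needs only matrix-vector products:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def cgd_dx_zero_sum(eta, grad_x_f, grad_y_f, Av, ATv):
    """Delta x for a zero-sum game (g = -f) using only matrix-vector products.

    Av(v)  evaluates D^2_{xy} f @ v  (a Hessian-vector product);
    ATv(v) evaluates D^2_{xy} f.T @ v. The matrix itself is never formed.
    """
    m = grad_x_f.shape[0]
    # Schur complement of the system (7.8) for g = -f: Id + eta^2 A A^T,
    # symmetric positive definite, so conjugate gradients is applicable.
    schur = LinearOperator((m, m), matvec=lambda v: v + eta**2 * Av(ATv(v)))
    # Right-hand side of Equation (7.4), using nabla_y g = -nabla_y f.
    rhs = grad_x_f + eta * Av(grad_y_f)
    sol, info = cg(schur, rhs)
    assert info == 0, "CG did not converge"
    return -eta * sol
```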

Rigorous results on convergence and local stability. We now present some basic convergence results for CGD, the proofs of which can be found in the appendix.

Our results are restricted to the case of a zero-sum game ($f = -g$), but we expect that they can be extended to games that are dominated by competition. To simplify notation, we define

\[
\bar{D} := \big(\mathrm{Id} + \eta^2 D^2_{xy} f\, D^2_{yx} f\big)^{-1} \eta^2\, D^2_{xy} f\, D^2_{yx} f, \qquad
\tilde{D} := \big(\mathrm{Id} + \eta^2 D^2_{yx} f\, D^2_{xy} f\big)^{-1} \eta^2\, D^2_{yx} f\, D^2_{xy} f. \tag{7.9}
\]

We furthermore define the spectral function $\phi(\lambda) := 2\lambda - |\lambda|$.

Theorem 24. If $f$ is two times differentiable with $L$-Lipschitz continuous Hessian and the diagonal blocks of its Hessian are bounded as $\eta \|D^2_{xx} f\|, \eta \|D^2_{yy} f\| \leq 1/18$, we have

\[
\begin{aligned}
&\|\nabla_x f(x + x_k, y + y_k)\|^2 + \|\nabla_y f(x + x_k, y + y_k)\|^2 - \|\nabla_x f\|^2 - \|\nabla_y f\|^2 \\
&\quad \leq -\nabla_x f^\top \Big( 2\eta\, h_{\pm}\big(D^2_{xx} f\big) + \tfrac{1}{3} \bar{D} - 32\, \eta^2 L \big(\|\nabla_x f\| + \|\nabla_y f\|\big) - 768\, \eta^4 L^2 \Big)\, \nabla_x f \\
&\qquad - \nabla_y f^\top \Big( 2\eta\, h_{\pm}\big({-D^2_{yy} f}\big) + \tfrac{1}{3} \tilde{D} - 32\, \eta^2 L \big(\|\nabla_x f\| + \|\nabla_y f\|\big) - 768\, \eta^4 L^2 \Big)\, \nabla_y f.
\end{aligned}
\]

Under suitable assumptions on the curvature of $f$, Theorem 24 implies results on the convergence of CGD.

Corollary 1. Under the assumptions of Theorem 24, if for $\alpha > 0$

\[
\begin{aligned}
2\eta\, h_{\pm}\big(D^2_{xx} f\big) + \tfrac{1}{3} \bar{D} - 32\, \eta^2 L \big(\|\nabla_x f(x_0, y_0)\| + \|\nabla_y f(x_0, y_0)\|\big) - 768\, \eta^4 L^2 &\succeq \alpha\, \mathrm{Id} \\
2\eta\, h_{\pm}\big({-D^2_{yy} f}\big) + \tfrac{1}{3} \tilde{D} - 32\, \eta^2 L \big(\|\nabla_x f(x_0, y_0)\| + \|\nabla_y f(x_0, y_0)\|\big) - 768\, \eta^4 L^2 &\succeq \alpha\, \mathrm{Id}
\end{aligned}
\]

for all $(x, y) \in \mathbb{R}^{m+n}$, then CGD started in $(x_0, y_0)$ converges at exponential rate with exponent $\alpha$ to a critical point.

Furthermore, we can deduce the following local stability result.

Theorem 25. Let $(x^*, y^*)$ be a critical point ($(\nabla_x f, \nabla_y f) = (0, 0)$), and assume furthermore that

\[
\lambda_{\min} := \min\left\{ \lambda_{\min}\Big( 2\eta\, h_{\pm}\big(D^2_{xx} f\big) + \tfrac{1}{3} \bar{D} - 768\, \eta^4 L^2 \Big),\; \lambda_{\min}\Big( 2\eta\, h_{\pm}\big({-D^2_{yy} f}\big) + \tfrac{1}{3} \tilde{D} - 768\, \eta^4 L^2 \Big) \right\} > 0
\]

and $f \in C^2(\mathbb{R}^{m+n})$ with Lipschitz continuous Hessian. Then there exists a neighborhood $\mathcal{U}$ of $(x^*, y^*)$, such that CGD started in $(x_1, y_1) \in \mathcal{U}$ converges to a point in $\mathcal{U}$ at an exponential rate that depends only on $\lambda_{\min}$.

The results on local stability for existing modifications of GDA, including those of [61, 172, 173] (see also [154]), all require the stepsize to be chosen inversely proportional to an upper bound on $\sigma_{\max}(D^2_{xy} f)$, and indeed we will see in our experiments that the existing methods are prone to divergence under strong interactions between the two players (large $\sigma_{\max}(D^2_{xy} f)$). In contrast to these results, our convergence results only improve as the interaction between the players becomes stronger.

Why not use $D^2_{xx} f$ and $D^2_{yy} g$? The use of a bilinear approximation that contains some, but not all, second-order terms is unusual and raises the question of why we do not include the diagonal blocks of the Hessian in Equation (7.8), resulting in the damped and regularized Newton's method

Ξ”π‘₯ Δ𝑦

!

=βˆ’ Id+πœ‚ 𝐷2

π‘₯ π‘₯𝑓 πœ‚ 𝐷2

π‘₯ 𝑦𝑓 πœ‚ 𝐷2

𝑦π‘₯𝑔 Id+πœ‚ 𝐷2

𝑦 𝑦𝑔

!βˆ’1

βˆ‡π‘₯𝑓

βˆ‡π‘¦π‘”

!

. (7.10)
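To see the sense in which (7.10) is a damped and regularized Newton's method, consider the single-player case with no interaction terms, where it reduces to

\[
\Delta x = -\eta\, \big(\mathrm{Id} + \eta\, D^2_{xx} f\big)^{-1} \nabla_x f,
\]

which interpolates between a gradient descent step (as $\eta \to 0$) and a full Newton step (as $\eta \to \infty$).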

For the following reasons, we believe that the bilinear approximation is preferable from both a practical and a conceptual point of view.

• Conditioning of matrix inverse: One advantage of competitive gradient descent is that in many cases, including all zero-sum games, the condition number of the matrix inverse in Algorithm 16 is bounded above by $1 + \eta^2 \|D^2_{xy} f\|^2$. If we include the diagonal blocks of the Hessian in a non-convex-concave problem, the matrix can even be singular as soon as $\eta \|D^2_{xx} f\| \geq 1$ or $\eta \|D^2_{yy} g\| \geq 1$ (see the numerical sketch after this list).

• Irrational updates: We can only expect the update rule (7.10) to correspond to a local Nash equilibrium if the problem is convex-concave or $\eta \|D^2_{xx} f\|, \eta \|D^2_{yy} g\| < 1$. If these conditions are violated, it can instead correspond to the players playing their worst, as opposed to their best, strategy based on the quadratic approximation, leading to behavior that contradicts the game interpretation of the problem.

• Lack of regularity: For the inclusion of the diagonal blocks of the Hessian to be helpful at all, we need to make additional assumptions on the regularity of $f$, for example by bounding the Lipschitz constants of $D^2_{xx} f$ and $D^2_{yy} g$. Otherwise, their value at a given point can be totally uninformative about the global structure of the loss functions (consider as an example the minimization of $x \mapsto x^2 + \epsilon^{3/2} \sin(x/\epsilon)$ for $\epsilon \ll 1$). Many problems in competitive optimization, including GANs, have the form $f(x, y) = \Phi(\mathcal{G}(x), \mathcal{D}(y))$, $g(x, y) = \Theta(\mathcal{G}(x), \mathcal{D}(y))$, where $\Phi, \Theta$ are smooth and simple, but $\mathcal{G}$ and $\mathcal{D}$ might only have first-order regularity. In this setting, the bilinear approximation has the advantage of fully exploiting the first-order information of $\mathcal{G}$ and $\mathcal{D}$, without assuming them to have higher-order regularity. This is because the bilinear approximations of $f$ and $g$ then contain only the first derivatives of $\mathcal{G}$ and $\mathcal{D}$, while the quadratic approximation contains the second derivatives $D^2_{xx} \mathcal{G}$ and $D^2_{yy} \mathcal{D}$ and therefore needs stronger regularity assumptions on $\mathcal{G}$ and $\mathcal{D}$ to be effective.

• No spurious symmetry: One reason to favor full Taylor approximations of a given order in single-player optimization is that they are invariant under changes of the coordinate system. For competitive optimization, a change of coordinates of $(x, y) \in \mathbb{R}^{m+n}$ can correspond, for instance, to taking a decision variable of one player and giving it to the other player. This changes the underlying game significantly, and thus we do not want our approximation to be invariant under this transformation. Instead, we want our local approximation to be invariant only to coordinate changes of $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$ in separation, that is, to block-diagonal coordinate changes on $\mathbb{R}^{m+n}$. Mixed-order approximations (bilinear, biquadratic, etc.) have exactly this invariance property and thus are the natural approximation for two-player games.
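A small numerical illustration of the conditioning point above (ours, with made-up matrices; not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
m, eta = 5, 1.0
A = rng.standard_normal((m, m))              # stand-in for D^2_{xy} f
Dxx = np.diag([0.5, -1.0, 2.0, -0.3, 1.5])   # indefinite stand-in for D^2_{xx} f
I = np.eye(m)

# CGD matrix of Equation (7.8) for a zero-sum game (D^2_{yx} g = -A.T):
M_cgd = np.block([[I, eta * A], [-eta * A.T, I]])
bound = 1 + eta**2 * np.linalg.norm(A, 2)**2
print(np.linalg.cond(M_cgd), "<=", bound)    # the bound always holds

# Newton-type matrix of Equation (7.10) with g = -f (and, purely for the sake
# of a small example, D^2_{yy} g = -D^2_{xx} f): the diagonal block I + eta * Dxx
# is singular once eta * lambda_min(Dxx) = -1, and conditioning degrades accordingly.
M_newton = np.block([[I + eta * Dxx, eta * A], [-eta * A.T, I - eta * Dxx]])
print(np.linalg.cond(M_newton))
```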

While we are convinced that the right notion of first-order competitive optimization is given by quadratically regularized bilinear approximations, we believe that the right notion of second-order competitive optimization is given by cubically regularized biquadratic approximations, in the spirit of [182].
