Chapter VII: Competitive Gradient Descent
7.2 Competitive gradient descent
Gaussian mixture model. Even in this simple example, the existing methods diverge for all five (constant) stepsizes we tried under RMSProp. The typical solution would be to decay the learning rate. However, even with a constant learning rate, CGD succeeds with all of these stepsize choices in approximating the main features of the target distribution. In fact, throughout our experiments we never saw CGD diverge.
In order to measure the convergence speed more quantitatively, we next consider a nonconvex matrix estimation problem, measuring computational complexity in terms of the number of gradient computations performed. We observe that all methods show improved speed of convergence for larger stepsizes, with CGD roughly matching the convergence speed of optimistic gradient descent [61] at the same stepsize. However, as we increase the stepsize, the other methods quickly start diverging, whereas CGD continues to improve, and is thus able to attain significantly better convergence rates (more than two times as fast as the other methods in the noiseless case, with the ratio increasing for larger and more difficult problems). For small stepsizes or games with weak interactions, on the other hand, CGD automatically invests less computational time per update, thus gracefully transitioning to a cheap correction of GDA at minimal computational overhead. We furthermore present an application to constrained optimization problems arising in model-based reinforcement learning and refer to [198] for recent applications of CGD to multi-agent reinforcement learning.
How to linearize a game. To motivate this algorithm, we remind ourselves that gradient descent with stepsize $\eta$ applied to the function $f\colon \mathbb{R}^m \longrightarrow \mathbb{R}$ can be written as
\[
x_{k+1} = \operatorname*{arg\,min}_{x \in \mathbb{R}^m} \; \left(x^{\top} - x_k^{\top}\right) \nabla_x f(x_k) + \frac{1}{2\eta}\, \lVert x - x_k \rVert^2. \tag{7.2}
\]
This can be interpreted as a (single) player solving a local linear approximation of the (minimization) game, subject to a quadratic penalty that expresses her limited confidence in the global accuracy of the model. The natural generalization of this idea to the competitive case should then be given by the two players solving a local approximation of the true game, both subject to a quadratic penalty that expresses their limited confidence in the accuracy of the local approximation.
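For reference, solving the quadratic problem in (7.2) in closed form (setting the derivative with respect to $x$ to zero) recovers the familiar explicit update, confirming that (7.2) is just a reformulation of ordinary gradient descent:
\[
\nabla_x f(x_k) + \frac{1}{\eta}\left(x_{k+1} - x_k\right) = 0
\quad\Longleftrightarrow\quad
x_{k+1} = x_k - \eta\, \nabla_x f(x_k).
\]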
To implement this idea, we need to generalize the linear approximation in the single-agent setting to the competitive setting: How to linearize a game?
Linear or Multilinear. GDA answers the above question by choosing a linear approximation of $f, g\colon \mathbb{R}^m \times \mathbb{R}^n \longrightarrow \mathbb{R}$. This seemingly natural choice has the flaw that linear functions cannot express any interaction between the two players and are thus unable to capture the competitive nature of the underlying problem.
From this point of view, it is not surprising that the convergent modifications of GDA are, implicitly or explicitly, based on higher-order approximations (see also [149]). An equally valid generalization of the linear approximation in the single-player setting is to use a bilinear approximation in the two-player setting. Since the bilinear approximation is the lowest-order approximation that can capture some interaction between the two players, we argue that the natural generalization of gradient descent to competitive optimization is not GDA, but rather the update rule $(x_{k+1}, y_{k+1}) = (x_k, y_k) + (x, y)$, where $(x, y)$ is a Nash equilibrium of the game
\[
\min_{x \in \mathbb{R}^m} \; x^{\top} \nabla_x f + x^{\top} D_{xy}^2 f\, y + y^{\top} \nabla_y f + \frac{1}{2\eta}\, x^{\top} x,
\qquad
\min_{y \in \mathbb{R}^n} \; y^{\top} \nabla_y g + y^{\top} D_{yx}^2 g\, x + x^{\top} \nabla_x g + \frac{1}{2\eta}\, y^{\top} y. \tag{7.3}
\]
Indeed, the (unique) Nash equilibrium of (7.3) can be computed in closed form.
Theorem 23. Among all (possibly randomized) strategies with finite first moment, the only Nash equilibrium of the Game (7.3) is given by
\[
x = -\eta \left(\mathrm{Id} - \eta^2 D_{xy}^2 f\, D_{yx}^2 g\right)^{-1} \left(\nabla_x f - \eta\, D_{xy}^2 f\, \nabla_y g\right), \tag{7.4}
\]
\[
y = -\eta \left(\mathrm{Id} - \eta^2 D_{yx}^2 g\, D_{xy}^2 f\right)^{-1} \left(\nabla_y g - \eta\, D_{yx}^2 g\, \nabla_x f\right), \tag{7.5}
\]
given that the matrix inverses in the above expression exist.²
Proof. Let $X, Y$ be the randomized strategies of the two agents. By subtracting and adding $\mathbb{E}[X]^2/(2\eta)$, $\mathbb{E}[Y]^2/(2\eta)$, and taking expectations, we rewrite the game as
\[
\min_{\mathbb{E}[X] \in \mathbb{R}^m} \; \mathbb{E}[X]^{\top} \nabla_x f + \mathbb{E}[X]^{\top} D_{xy}^2 f\, \mathbb{E}[Y] + \mathbb{E}[Y]^{\top} \nabla_y f + \frac{1}{2\eta}\, \mathbb{E}[X]^{\top}\mathbb{E}[X] + \frac{1}{2\eta}\operatorname{Var}[X], \tag{7.6}
\]
\[
\min_{\mathbb{E}[Y] \in \mathbb{R}^n} \; \mathbb{E}[Y]^{\top} \nabla_y g + \mathbb{E}[Y]^{\top} D_{yx}^2 g\, \mathbb{E}[X] + \mathbb{E}[X]^{\top} \nabla_x g + \frac{1}{2\eta}\, \mathbb{E}[Y]^{\top}\mathbb{E}[Y] + \frac{1}{2\eta}\operatorname{Var}[Y]. \tag{7.7}
\]
Thus, the objective value for both players can always be improved by decreasing the variance while keeping the expectation the same, meaning that the optimal value will always (and only) be achieved by a deterministic strategy. We can then replace $\mathbb{E}[X], \mathbb{E}[Y]$ with $x, y$, set the derivative of the first expression with respect to $x$ and of the second expression with respect to $y$ to zero, and solve the resulting system of two equations,
\[
\nabla_x f + D_{xy}^2 f\, y + \frac{1}{\eta}\, x = 0, \qquad \nabla_y g + D_{yx}^2 g\, x + \frac{1}{\eta}\, y = 0,
\]
for the Nash equilibrium $(x, y)$, which yields (7.4) and (7.5).

According to Theorem 23, the Game (7.3) has exactly one optimal pair of strategies, which is deterministic. Thus, we can use these strategies as an update rule, generalizing the idea of local optimality from the single- to the multi-agent setting and obtaining Algorithm 16.
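The update can be implemented directly from the closed form (7.4)–(7.5). The following is a minimal sketch in Python/JAX (not the thesis' Algorithm 16 verbatim, and only sensible for small problems, since it forms the mixed Hessian blocks as dense matrices; the function and variable names are illustrative):

```python
import jax
import jax.numpy as jnp

def cgd_step(f, g, x, y, eta):
    """One CGD update computed from the closed form (7.4)-(7.5).

    Dense sketch: the mixed Hessian blocks are formed explicitly, so this
    is only meant for small, illustrative problems.
    """
    grad_x_f = jax.grad(f, argnums=0)(x, y)                        # \nabla_x f
    grad_y_g = jax.grad(g, argnums=1)(x, y)                        # \nabla_y g
    D_xy_f = jax.jacfwd(jax.grad(f, argnums=0), argnums=1)(x, y)   # D^2_{xy} f
    D_yx_g = jax.jacfwd(jax.grad(g, argnums=1), argnums=0)(x, y)   # D^2_{yx} g

    m, n = x.shape[0], y.shape[0]
    dx = -eta * jnp.linalg.solve(jnp.eye(m) - eta**2 * D_xy_f @ D_yx_g,
                                 grad_x_f - eta * D_xy_f @ grad_y_g)
    dy = -eta * jnp.linalg.solve(jnp.eye(n) - eta**2 * D_yx_g @ D_xy_f,
                                 grad_y_g - eta * D_yx_g @ grad_x_f)
    return x + dx, y + dy

# Toy zero-sum bilinear game f(x, y) = x^T y, g = -f: simultaneous GDA
# spirals away from the Nash equilibrium at the origin, while CGD contracts.
f = lambda x, y: jnp.dot(x, y)
g = lambda x, y: -jnp.dot(x, y)
x, y = jnp.array([1.0, -2.0]), jnp.array([0.5, 1.5])
for _ in range(100):
    x, y = cgd_step(f, g, x, y, eta=0.5)
print(x, y)  # both iterates approach the origin, the unique Nash equilibrium
```

On this bilinear game one can check by hand that each CGD step contracts the iterates by a factor of $1/\sqrt{1+\eta^2}$, which matches what the loop above prints.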
What I think that they think that I think ... that they do. Another game-theoretic interpretation of CGD follows from the observation that its update rule can be written as
\[
\begin{pmatrix} \Delta x \\ \Delta y \end{pmatrix}
= -\eta
\begin{pmatrix} \mathrm{Id} & \eta\, D_{xy}^2 f \\ \eta\, D_{yx}^2 g & \mathrm{Id} \end{pmatrix}^{-1}
\begin{pmatrix} \nabla_x f \\ \nabla_y g \end{pmatrix}. \tag{7.8}
\]
Applying the expansion $\lambda_{\max}(A) < 1 \Rightarrow (\mathrm{Id} - A)^{-1} = \lim_{N \to \infty} \sum_{k=0}^{N} A^k$ to the above equation, we observe that the first partial sum ($N = 0$) corresponds to the optimal strategy if the other player's strategy stays constant (GDA). The second partial sum ($N = 1$) corresponds to the optimal strategy if the other player thinks that the other player's strategy stays constant (LCGD, see Figure 7.3). The third partial sum ($N = 2$) corresponds to the optimal strategy if the other player thinks that the other player thinks that the other player's strategy stays constant, and so forth, until the Nash equilibrium is recovered in the limit (see Figure 7.2).
²We note that the matrix inverses exist for almost all values of $\eta$, and for all $\eta$ in the case of a zero-sum game.
Figure 7.2: What I think that they think that I think... The partial sums of a Neumann-series representation of Equation (7.8) represent different orders of opponent-awareness, recovering the Nash-equilibrium in the limit.
For small enough $\eta$, we could use the above series expansion to solve for $(\Delta x, \Delta y)$, which is known as Richardson iteration and would recover higher-order lookahead [86]. However, expressing the update as a matrix inverse allows us to use optimal Krylov subspace methods to obtain far more accurate solutions with fewer gradient evaluations.
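To illustrate how such a solve can be carried out matrix-free, here is a sketch for the zero-sum case $g = -f$, where the Schur complement $\mathrm{Id} + \eta^2 D_{xy}^2 f\, D_{yx}^2 f$ of the block system (7.8) is symmetric positive definite, so conjugate gradients apply. It only uses Hessian-vector products, each costing roughly as much as a gradient evaluation; the names and the choice of jax.scipy.sparse.linalg.cg are illustrative, not a description of the solver used in the reference implementation:

```python
import jax
import jax.numpy as jnp
from jax.scipy.sparse.linalg import cg

def cgd_step_zero_sum(f, x, y, eta, tol=1e-8):
    """Matrix-free CGD step for a zero-sum game g = -f."""
    gx = jax.grad(f, argnums=0)(x, y)   # \nabla_x f
    gy = jax.grad(f, argnums=1)(x, y)   # \nabla_y f

    # Mixed Hessian-vector products, without ever forming D^2_{xy} f.
    def hxy(v):  # v in y-space -> D^2_{xy} f @ v
        return jax.jvp(lambda y_: jax.grad(f, argnums=0)(x, y_), (y,), (v,))[1]
    def hyx(v):  # v in x-space -> D^2_{yx} f @ v
        return jax.jvp(lambda x_: jax.grad(f, argnums=1)(x_, y), (x,), (v,))[1]

    # Schur complement of the block system (7.8) for g = -f:
    # (Id + eta^2 D^2_{xy}f D^2_{yx}f) dx = -eta (grad_x f + eta D^2_{xy}f grad_y f)
    def schur(v):
        return v + eta**2 * hxy(hyx(v))

    dx, _ = cg(schur, -eta * (gx + eta * hxy(gy)), tol=tol)
    dy = eta * (gy + hyx(dx))   # second block row of (7.8), specialized to g = -f
    return x + dx, y + dy
```

Truncating a Neumann/Richardson iteration for the same linear system after zero or one term instead recovers the GDA and LCGD updates discussed above.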
Rigorous results on convergence and local stability. We now present some basic convergence results for CGD, the proofs of which can be found in the appendix.
Our results are restricted to the case of a zero-sum game ($g = -f$), but we expect that they can be extended to games that are dominated by competition. To simplify notation, we define
\[
\bar{D} \coloneqq \left(\mathrm{Id} + \eta^2 D_{xy}^2 f\, D_{yx}^2 f\right)^{-1} \eta^2 D_{xy}^2 f\, D_{yx}^2 f,
\qquad
\tilde{D} \coloneqq \left(\mathrm{Id} + \eta^2 D_{yx}^2 f\, D_{xy}^2 f\right)^{-1} \eta^2 D_{yx}^2 f\, D_{xy}^2 f. \tag{7.9}
\]
We furthermore define the spectral function $\phi(\lambda) \coloneqq 2\lambda - |\lambda|$.
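A short observation that may help in interpreting these quantities (an aside, not part of the original statements): in the zero-sum case $D_{yx}^2 f = (D_{xy}^2 f)^{\top}$, so $\bar{D}$ and $\tilde{D}$ are symmetric positive semi-definite with eigenvalues
\[
\frac{\eta^2 \sigma_i^2}{1 + \eta^2 \sigma_i^2} \in [0, 1),
\]
where the $\sigma_i$ are the singular values of $D_{xy}^2 f$. The $\tfrac{1}{3}\bar{D}$ and $\tfrac{1}{3}\tilde{D}$ terms appearing in the results below therefore grow with the strength of the interaction between the players.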
Theorem 24. If $f$ is two times differentiable with $L$-Lipschitz continuous Hessian and the diagonal blocks of its Hessian are bounded as $\eta \lVert D_{xx}^2 f \rVert, \eta \lVert D_{yy}^2 f \rVert \le 1/18$, we have
\[
\begin{aligned}
&\lVert \nabla_x f(x + x_k,\, y + y_k) \rVert^2 + \lVert \nabla_y f(x + x_k,\, y + y_k) \rVert^2 - \lVert \nabla_x f \rVert^2 - \lVert \nabla_y f \rVert^2 \\
&\quad \le\; -\nabla_x f^{\top} \left( 2\eta\, \phi\!\left(D_{xx}^2 f\right) + \tfrac{1}{3}\bar{D} - 32\eta^2 L \left( \lVert \nabla_x f \rVert + \lVert \nabla_y f \rVert \right) - 768\eta^4 L^2 \right) \nabla_x f \\
&\qquad\; -\nabla_y f^{\top} \left( 2\eta\, \phi\!\left(-D_{yy}^2 f\right) + \tfrac{1}{3}\tilde{D} - 32\eta^2 L \left( \lVert \nabla_x f \rVert + \lVert \nabla_y f \rVert \right) - 768\eta^4 L^2 \right) \nabla_y f.
\end{aligned}
\]
Under suitable assumptions on the curvature of $f$, Theorem 24 implies results on the convergence of CGD.
Corollary 1. Under the assumptions of Theorem 24, if for $\alpha > 0$
\[
\begin{aligned}
2\eta\, \phi\!\left(D_{xx}^2 f\right) + \tfrac{1}{3}\bar{D} - 32\eta^2 L \left( \lVert \nabla_x f(x_0, y_0) \rVert + \lVert \nabla_y f(x_0, y_0) \rVert \right) - 768\eta^4 L^2 &\succeq \alpha\, \mathrm{Id}, \\
2\eta\, \phi\!\left(-D_{yy}^2 f\right) + \tfrac{1}{3}\tilde{D} - 32\eta^2 L \left( \lVert \nabla_x f(x_0, y_0) \rVert + \lVert \nabla_y f(x_0, y_0) \rVert \right) - 768\eta^4 L^2 &\succeq \alpha\, \mathrm{Id}
\end{aligned}
\]
for all $(x, y) \in \mathbb{R}^{m+n}$, then CGD started in $(x_0, y_0)$ converges at exponential rate with exponent $\alpha$ to a critical point.
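Spelled out (assuming this is the intended reading of the corollary, obtained by combining its condition with Theorem 24), each CGD step contracts the total squared gradient norm,
\[
\lVert \nabla_x f(x_{k+1}, y_{k+1}) \rVert^2 + \lVert \nabla_y f(x_{k+1}, y_{k+1}) \rVert^2
\le (1 - \alpha) \left( \lVert \nabla_x f(x_{k}, y_{k}) \rVert^2 + \lVert \nabla_y f(x_{k}, y_{k}) \rVert^2 \right),
\]
so the gradients decay geometrically.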
Furthermore, we can deduce the following local stability result.
Theorem 25. Let $(x^*, y^*)$ be a critical point ($(\nabla_x f, \nabla_y f) = (0, 0)$), and assume furthermore that
\[
\lambda_{\min} \coloneqq \min\left\{
\lambda_{\min}\!\left( 2\eta\, \phi\!\left(D_{xx}^2 f\right) + \tfrac{1}{3}\bar{D} - 768\eta^4 L^2 \right),\;
\lambda_{\min}\!\left( 2\eta\, \phi\!\left(-D_{yy}^2 f\right) + \tfrac{1}{3}\tilde{D} - 768\eta^4 L^2 \right)
\right\} > 0
\]
and $f \in C^2(\mathbb{R}^{m+n})$ with Lipschitz continuous Hessian. Then there exists a neighborhood $\mathcal{U}$ of $(x^*, y^*)$ such that CGD started in $(x_1, y_1) \in \mathcal{U}$ converges to a point in $\mathcal{U}$ at an exponential rate that depends only on $\lambda_{\min}$.
The results on local stability for existing modifications of GDA, including those of [61, 172, 173] (see also [154]), all require the stepsize to be chosen inversely proportional to an upper bound on $\lambda_{\max}(D_{xy}^2 f)$, and indeed we will see in our experiments that the existing methods are prone to divergence under strong interactions between the two players (large $\lambda_{\max}(D_{xy}^2 f)$). In contrast to these results, our convergence results only improve as the interaction between the players becomes stronger.
Why not use $D_{xx}^2 f$ and $D_{yy}^2 f$? The use of a bilinear approximation that contains some, but not all, second-order terms is unusual and begs the question why we do not include the diagonal blocks of the Hessian in Equation (7.8), resulting in the damped and regularized Newton's method
\[
\begin{pmatrix} \Delta x \\ \Delta y \end{pmatrix}
= -\eta
\begin{pmatrix} \mathrm{Id} + \eta\, D_{xx}^2 f & \eta\, D_{xy}^2 f \\ \eta\, D_{yx}^2 g & \mathrm{Id} + \eta\, D_{yy}^2 g \end{pmatrix}^{-1}
\begin{pmatrix} \nabla_x f \\ \nabla_y g \end{pmatrix}. \tag{7.10}
\]
For the following reasons, we believe that the bilinear approximation is preferable both from a practical and conceptual point of view.
• Conditioning of matrix inverse: One advantage of competitive gradient descent is that in many cases, including all zero-sum games, the condition number of the matrix inverse in Algorithm 16 is bounded above by $1 + \eta^2 \lVert D_{xy}^2 f \rVert^2$. If we include the diagonal blocks of the Hessian in a non-convex-concave problem, the matrix can even be singular as soon as $\eta \lVert D_{xx}^2 f \rVert \ge 1$ or $\eta \lVert D_{yy}^2 f \rVert \ge 1$ (see the numerical sketch after this list).
• Irrational updates: We can only expect the update rule (7.10) to correspond to a local Nash equilibrium if the problem is convex-concave or $\eta \lVert D_{xx}^2 f \rVert, \eta \lVert D_{yy}^2 f \rVert < 1$. If these conditions are violated, it can instead correspond to the players playing their worst, as opposed to their best, strategy based on the quadratic approximation, leading to behavior that contradicts the game interpretation of the problem.
• Lack of regularity: For the inclusion of the diagonal blocks of the Hessian to be helpful at all, we need to make additional assumptions on the regularity of $f$, for example by bounding the Lipschitz constants of $D_{xx}^2 f$ and $D_{yy}^2 f$. Otherwise, their value at a given point can be totally uninformative about the global structure of the loss functions (consider as an example the minimization of $x \mapsto x^2 + \epsilon^{3/2} \sin(x/\epsilon)$ for $\epsilon \ll 1$). Many problems in competitive optimization, including GANs, have the form $f(x, y) = \Phi(\mathcal{G}(x), \mathcal{D}(y))$, $g(x, y) = \Theta(\mathcal{G}(x), \mathcal{D}(y))$, where $\Phi, \Theta$ are smooth and simple, but $\mathcal{G}$ and $\mathcal{D}$ might only have first-order regularity. In this setting, the bilinear approximation has the advantage of fully exploiting the first-order information of $\mathcal{G}$ and $\mathcal{D}$, without assuming them to have higher-order regularity. This is because the bilinear approximations of $f$ and $g$ then contain only the first derivatives of $\mathcal{G}$ and $\mathcal{D}$, while the quadratic approximation contains the second derivatives $D_{xx}^2 \mathcal{G}$ and $D_{yy}^2 \mathcal{D}$ and therefore needs stronger regularity assumptions on $\mathcal{G}$ and $\mathcal{D}$ to be effective.
• No spurious symmetry: One reason to favor full Taylor approximations of a certain order in single-player optimization is that they are invariant under changes of the coordinate system. For competitive optimization, a change of coordinates of $(x, y) \in \mathbb{R}^{m+n}$ can correspond, for instance, to taking a decision variable of one player and giving it to the other player. This changes the underlying game significantly, and thus we do not want our approximation to be invariant under this transformation. Instead, we want our local approximation to only be invariant to coordinate changes of $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$ in separation, that is, to block-diagonal coordinate changes on $\mathbb{R}^{m+n}$. Mixed-order approximations (bilinear, biquadratic, etc.) have exactly this invariance property and thus are the natural approximation for two-player games.
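To make the conditioning point from the first bullet concrete, here is a small numerical check in Python (a sketch using an arbitrarily chosen one-dimensional zero-sum example, not an experiment from this chapter): for $f(x, y) = -x^2 + xy$, $g = -f$, and $\eta = 1$ we have $\eta \lVert D_{xx}^2 f \rVert = 2 \ge 1$, and the Newton-style matrix from (7.10) is singular while the CGD matrix from (7.8) remains well conditioned.

```python
import numpy as np

# Zero-sum example f(x, y) = -x^2 + x*y (so g = -f) at stepsize eta = 1:
# D^2_xx f = -2, D^2_xy f = 1, D^2_yy f = 0, hence eta * ||D^2_xx f|| >= 1.
eta, dxx, dxy, dyy = 1.0, -2.0, 1.0, 0.0

# CGD matrix from (7.8): only the off-diagonal interaction terms appear.
m_cgd = np.array([[1.0,        eta * dxy],
                  [-eta * dxy, 1.0      ]])

# Newton-style matrix from (7.10): diagonal Hessian blocks included
# (for g = -f the lower-right block is Id + eta * D^2_yy g = 1 - eta * dyy).
m_newton = np.array([[1.0 + eta * dxx, eta * dxy      ],
                     [-eta * dxy,      1.0 - eta * dyy]])

print(np.linalg.cond(m_cgd))    # 1.0: well within the bound 1 + eta^2 * dxy^2 = 2
print(np.linalg.det(m_newton))  # 0.0: the regularized Newton system is singular
```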
While we are convinced that the right notion of first-order competitive optimization is given by quadratically regularized bilinear approximations, we believe that the right notion of second-order competitive optimization is given by cubically regularized biquadratic approximations, in the spirit of [182].