• Tidak ada hasil yang ditemukan

Competitive gradient descent

Dalam dokumen Inference, Computation, and Games (Halaman 158-164)

Chapter VII: Competitive Gradient Descent

7.2 Competitive gradient descent

… Gaussian mixture model. Even in this simple example, trying five different (constant) stepsizes under RMSProp, the existing methods diverge. The typical solution would be to decay the learning rate. However, even with a constant learning rate, CGD succeeds in approximating the main features of the target distribution for all of these stepsize choices. In fact, throughout our experiments we never saw CGD diverge.

In order to measure the convergence speed more quantitatively, we next consider a nonconvex matrix estimation problem, measuring computational complexity in terms of the number of gradient computations performed. We observe that all methods show improved speed of convergence for larger stepsizes, with CGD roughly matching the convergence speed of optimistic gradient descent [61] at the same stepsize. However, as we increase the stepsize, the other methods quickly start diverging, whereas CGD continues to improve, thus attaining significantly better convergence rates (more than two times as fast as the other methods in the noiseless case, with the ratio increasing for larger and more difficult problems). For small stepsizes or games with weak interactions, on the other hand, CGD automatically invests less computational time per update, thus gracefully transitioning to a cheap correction of GDA at minimal computational overhead. We furthermore present an application to constrained optimization problems arising in model-based reinforcement learning and refer to [198] for recent applications of CGD to multi-agent reinforcement learning.

How to linearize a game. To motivate this algorithm, we remind ourselves that gradient descent with stepsize $\eta$ applied to the function $f : \mathbb{R}^m \to \mathbb{R}$ can be written as

π‘₯π‘˜+1=arg min

π‘₯∈Rπ‘š

(π‘₯> βˆ’π‘₯>

π‘˜)βˆ‡π‘₯𝑓(π‘₯π‘˜) + 1 2πœ‚

kπ‘₯βˆ’π‘₯π‘˜k2. (7.2) This can be interpreted as a (single) player solving a local linear approximation of the (minimization) game, subject to a quadratic penalty that expresses her limited confidence in the global accuracy of the model. The natural generalization of this idea to the competitive case should then be given by the two players solving a local approximation of the true game, both subject to a quadratic penalty that expresses their limited confidence in the accuracy of the local approximation.

To implement this idea, we need to generalize the linear approximation in the single-agent setting to the competitive setting: How to linearize a game?

Linear or Multilinear. GDA answers the above question by choosing a linear approximation of $f, g : \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}$. This seemingly natural choice has the flaw that linear functions cannot express any interaction between the two players and are thus unable to capture the competitive nature of the underlying problem.

From this point of view, it is not surprising that the convergent modifications of GDA are, implicitly or explicitly, based on higher-order approximations (see also [149]). An equally valid generalization of the linear approximation in the single-player setting is to use a bilinear approximation in the two-player setting. Since the bilinear approximation is the lowest-order approximation that can capture some interaction between the two players, we argue that the natural generalization of gradient descent to competitive optimization is not GDA, but rather the update rule $(x_{k+1}, y_{k+1}) = (x_k, y_k) + (x, y)$, where $(x, y)$ is a Nash equilibrium of the game

\[
\begin{aligned}
&\min_{x \in \mathbb{R}^m} \; x^\top \nabla_x f + x^\top D^2_{xy} f\, y + y^\top \nabla_y f + \frac{1}{2\eta}\, x^\top x \\
&\min_{y \in \mathbb{R}^n} \; y^\top \nabla_y g + y^\top D^2_{yx} g\, x + x^\top \nabla_x g + \frac{1}{2\eta}\, y^\top y.
\end{aligned} \tag{7.3}
\]

Indeed, the (unique) Nash equilibrium of (7.3) can be computed in closed form.

Theorem 23. Among all (possibly randomized) strategies with finite first moment, the only Nash equilibrium of the game (7.3) is given by

π‘₯ =βˆ’πœ‚

Idβˆ’πœ‚2𝐷2

π‘₯ 𝑦𝑓 𝐷2

𝑦π‘₯𝑔 βˆ’1

βˆ‡π‘₯𝑓 βˆ’πœ‚ 𝐷2

π‘₯ π‘¦π‘“βˆ‡π‘¦π‘”

(7.4) 𝑦 =βˆ’πœ‚

Idβˆ’πœ‚2𝐷2

𝑦π‘₯𝑔 𝐷2

π‘₯ 𝑦𝑓 βˆ’1

βˆ‡π‘¦π‘”βˆ’πœ‚ 𝐷2

𝑦π‘₯π‘”βˆ‡π‘₯𝑓

, (7.5)

given that the matrix inverses in the above expression exist.2

Proof. Let $X, Y$ be the randomized strategies of the two agents. By subtracting and adding $\mathbb{E}[X]^\top \mathbb{E}[X]/(2\eta)$, $\mathbb{E}[Y]^\top \mathbb{E}[Y]/(2\eta)$, and taking expectations, we can rewrite the game as

\[
\min_{\mathbb{E}[X] \in \mathbb{R}^m} \; \mathbb{E}[X]^\top \nabla_x f + \mathbb{E}[X]^\top D^2_{xy} f\, \mathbb{E}[Y] + \mathbb{E}[Y]^\top \nabla_y f + \frac{1}{2\eta}\, \mathbb{E}[X]^\top \mathbb{E}[X] + \frac{1}{2\eta}\, \operatorname{Var}[X] \tag{7.6}
\]
\[
\min_{\mathbb{E}[Y] \in \mathbb{R}^n} \; \mathbb{E}[Y]^\top \nabla_y g + \mathbb{E}[Y]^\top D^2_{yx} g\, \mathbb{E}[X] + \mathbb{E}[X]^\top \nabla_x g + \frac{1}{2\eta}\, \mathbb{E}[Y]^\top \mathbb{E}[Y] + \frac{1}{2\eta}\, \operatorname{Var}[Y]. \tag{7.7}
\]

Thus, the objective value of either player can always be improved by decreasing the variance while keeping the expectation the same, meaning that the optimal value will always (and only) be achieved by a deterministic strategy. We can then replace $\mathbb{E}[X], \mathbb{E}[Y]$ with $x, y$, set the derivative of the first expression with respect to $x$ and of the second expression with respect to $y$ to zero, and solve the resulting system of two equations,

\[
\nabla_x f + D^2_{xy} f\, y + \frac{1}{\eta}\, x = 0, \qquad \nabla_y g + D^2_{yx} g\, x + \frac{1}{\eta}\, y = 0,
\]

for the Nash equilibrium $(x, y)$. □

According to Theorem 23, the game (7.3) has exactly one optimal pair of strategies, which is deterministic. Thus, we can use these strategies as an update rule, generalizing the idea of local optimality from the single- to the multi-agent setting and obtaining Algorithm 16.

What I think that they think that I think ... that they do. Another game-theoretic interpretation of CGD follows from the observation that its update rule can be written as

Ξ”π‘₯ Δ𝑦

!

=βˆ’ Id πœ‚ 𝐷2

π‘₯ 𝑦𝑓 πœ‚ 𝐷2

𝑦π‘₯𝑔 Id

!βˆ’1

βˆ‡π‘₯𝑓

βˆ‡π‘¦π‘”

!

. (7.8)

Applying the expansion $\lambda_{\max}(A) < 1 \Rightarrow (\mathrm{Id} - A)^{-1} = \lim_{N \to \infty} \sum_{k=0}^{N} A^k$ to the above equation, we observe that the first partial sum ($N = 0$) corresponds to the optimal strategy if the other player's strategy stays constant (GDA). The second partial sum ($N = 1$) corresponds to the optimal strategy if the other player thinks that the other player's strategy stays constant (LCGD, see Figure 7.3). The third partial sum ($N = 2$) corresponds to the optimal strategy if the other player thinks that the other player thinks that the other player's strategy stays constant, and so forth, until the Nash equilibrium is recovered in the limit (see Figure 7.2).


Figure 7.2: What I think that they think that I think... The partial sums of a Neumann-series representation of Equation (7.8) represent different orders of opponent-awareness, recovering the Nash equilibrium in the limit.

enough πœ‚, we could use the above series expansion to solve for (Ξ”π‘₯ ,Δ𝑦), which is known as Richardson iteration and would recover high order lookahead [86].

However, expressing it as a matrix inverse will allow us to use optimal Krylov subspace methods to obtain far more accurate solutions with fewer gradient evaluations.
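As a sketch of how this can look in practice (our illustration; `Av` and `ATv` stand for Hessian-vector products that automatic differentiation provides at the cost of a few gradient evaluations): for a zero-sum game ($g = -f$), the $x$-block of Equation (7.8) reduces to the symmetric positive definite Schur complement $\mathrm{Id} + \eta^2\, D^2_{xy} f\, (D^2_{xy} f)^\top$, so the conjugate gradient method applies and needs only matrix-vector products:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def cgd_dx_zero_sum(eta, grad_x_f, grad_y_f, Av, ATv):
    """Delta x for a zero-sum game (g = -f) using only matrix-vector products.

    Av(v)  evaluates D^2_{xy} f @ v  (a Hessian-vector product);
    ATv(v) evaluates D^2_{xy} f.T @ v. The matrix itself is never formed.
    """
    m = grad_x_f.shape[0]
    # Schur complement of the system (7.8) for g = -f: Id + eta^2 A A^T,
    # symmetric positive definite, so conjugate gradients is applicable.
    schur = LinearOperator((m, m), matvec=lambda v: v + eta**2 * Av(ATv(v)))
    # Right-hand side of Equation (7.4), using nabla_y g = -nabla_y f.
    rhs = grad_x_f + eta * Av(grad_y_f)
    sol, info = cg(schur, rhs)
    assert info == 0, "CG did not converge"
    return -eta * sol
```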

Rigorous results on convergence and local stability. We now present some basic convergence results for CGD, the proofs of which can be found in the appendix.

Our results are restricted to the case of a zero-sum game ($f = -g$), but we expect that they can be extended to games that are dominated by competition. To simplify notation, we define

\[
\bar{D} := \big(\mathrm{Id} + \eta^2 D^2_{xy} f\, D^2_{yx} f\big)^{-1} \eta^2\, D^2_{xy} f\, D^2_{yx} f, \qquad
\tilde{D} := \big(\mathrm{Id} + \eta^2 D^2_{yx} f\, D^2_{xy} f\big)^{-1} \eta^2\, D^2_{yx} f\, D^2_{xy} f. \tag{7.9}
\]

We furthermore define the spectral function $\phi(\lambda) := 2\lambda - |\lambda|$.

Theorem 24. If $f$ is two times differentiable with $L$-Lipschitz continuous Hessian and the diagonal blocks of its Hessian are bounded as $\eta \|D^2_{xx} f\|, \eta \|D^2_{yy} f\| \leq 1/18$, we have

\[
\begin{aligned}
&\|\nabla_x f(x + x_k, y + y_k)\|^2 + \|\nabla_y f(x + x_k, y + y_k)\|^2 - \|\nabla_x f\|^2 - \|\nabla_y f\|^2 \\
&\quad \leq -\nabla_x f^\top \Big( 2\eta\, h_{\pm}\big(D^2_{xx} f\big) + \tfrac{1}{3} \bar{D} - 32\, \eta^2 L \big(\|\nabla_x f\| + \|\nabla_y f\|\big) - 768\, \eta^4 L^2 \Big)\, \nabla_x f \\
&\qquad - \nabla_y f^\top \Big( 2\eta\, h_{\pm}\big({-D^2_{yy} f}\big) + \tfrac{1}{3} \tilde{D} - 32\, \eta^2 L \big(\|\nabla_x f\| + \|\nabla_y f\|\big) - 768\, \eta^4 L^2 \Big)\, \nabla_y f.
\end{aligned}
\]

Under suitable assumptions on the curvature of $f$, Theorem 24 implies results on the convergence of CGD.

Corollary 1. Under the assumptions of Theorem 24, if for $\alpha > 0$

\[
\begin{aligned}
2\eta\, h_{\pm}\big(D^2_{xx} f\big) + \tfrac{1}{3} \bar{D} - 32\, \eta^2 L \big(\|\nabla_x f(x_0, y_0)\| + \|\nabla_y f(x_0, y_0)\|\big) - 768\, \eta^4 L^2 &\succeq \alpha\, \mathrm{Id} \\
2\eta\, h_{\pm}\big({-D^2_{yy} f}\big) + \tfrac{1}{3} \tilde{D} - 32\, \eta^2 L \big(\|\nabla_x f(x_0, y_0)\| + \|\nabla_y f(x_0, y_0)\|\big) - 768\, \eta^4 L^2 &\succeq \alpha\, \mathrm{Id}
\end{aligned}
\]

for all $(x, y) \in \mathbb{R}^{m+n}$, then CGD started in $(x_0, y_0)$ converges at exponential rate with exponent $\alpha$ to a critical point.

Furthermore, we can deduce the following local stability result.

Theorem 25. Let $(x^*, y^*)$ be a critical point ($(\nabla_x f, \nabla_y f) = (0, 0)$), and assume furthermore that

\[
\lambda_{\min} := \min\left\{ \lambda_{\min}\Big( 2\eta\, h_{\pm}\big(D^2_{xx} f\big) + \tfrac{1}{3} \bar{D} - 768\, \eta^4 L^2 \Big),\; \lambda_{\min}\Big( 2\eta\, h_{\pm}\big({-D^2_{yy} f}\big) + \tfrac{1}{3} \tilde{D} - 768\, \eta^4 L^2 \Big) \right\} > 0
\]

and $f \in C^2(\mathbb{R}^{m+n})$ with Lipschitz continuous Hessian. Then there exists a neighborhood $\mathcal{U}$ of $(x^*, y^*)$, such that CGD started in $(x_1, y_1) \in \mathcal{U}$ converges to a point in $\mathcal{U}$ at an exponential rate that depends only on $\lambda_{\min}$.

The results on local stability for existing modifications of GDA, including those of [61, 172, 173] (see also [154]), all require the stepsize to be chosen inversely proportional to an upper bound on $\sigma_{\max}(D^2_{xy} f)$, and indeed we will see in our experiments that the existing methods are prone to divergence under strong interactions between the two players (large $\sigma_{\max}(D^2_{xy} f)$). In contrast to these results, our convergence results only improve as the interaction between the players becomes stronger.

Why not use $D^2_{xx} f$ and $D^2_{yy} g$? The use of a bilinear approximation that contains some, but not all, second-order terms is unusual and raises the question of why we do not include the diagonal blocks of the Hessian in Equation (7.8), resulting in the damped and regularized Newton's method

Ξ”π‘₯ Δ𝑦

!

=βˆ’ Id+πœ‚ 𝐷2

π‘₯ π‘₯𝑓 πœ‚ 𝐷2

π‘₯ 𝑦𝑓 πœ‚ 𝐷2

𝑦π‘₯𝑔 Id+πœ‚ 𝐷2

𝑦 𝑦𝑔

!βˆ’1

βˆ‡π‘₯𝑓

βˆ‡π‘¦π‘”

!

. (7.10)
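To see the sense in which (7.10) is a damped and regularized Newton's method, consider the single-player case with no interaction terms, where it reduces to

\[
\Delta x = -\eta\, \big(\mathrm{Id} + \eta\, D^2_{xx} f\big)^{-1} \nabla_x f,
\]

which interpolates between a gradient descent step (as $\eta \to 0$) and a full Newton step (as $\eta \to \infty$).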

For the following reasons, we believe that the bilinear approximation is preferable from both a practical and a conceptual point of view.

• Conditioning of matrix inverse: One advantage of competitive gradient descent is that in many cases, including all zero-sum games, the condition number of the matrix inverse in Algorithm 16 is bounded above by $1 + \eta^2 \|D^2_{xy} f\|^2$. If we include the diagonal blocks of the Hessian in a non-convex-concave problem, the matrix can even be singular as soon as $\eta \|D^2_{xx} f\| \geq 1$ or $\eta \|D^2_{yy} g\| \geq 1$ (see the numerical sketch after this list).

• Irrational updates: We can only expect the update rule (7.10) to correspond to a local Nash equilibrium if the problem is convex-concave or $\eta \|D^2_{xx} f\|, \eta \|D^2_{yy} g\| < 1$. If these conditions are violated, it can instead correspond to the players playing their worst, as opposed to their best, strategy based on the quadratic approximation, leading to behavior that contradicts the game interpretation of the problem.

• Lack of regularity: For the inclusion of the diagonal blocks of the Hessian to be helpful at all, we need to make additional assumptions on the regularity of $f$, for example by bounding the Lipschitz constants of $D^2_{xx} f$ and $D^2_{yy} g$. Otherwise, their value at a given point can be totally uninformative about the global structure of the loss functions (consider as an example the minimization of $x \mapsto x^2 + \epsilon^{3/2} \sin(x/\epsilon)$ for $\epsilon \ll 1$). Many problems in competitive optimization, including GANs, have the form $f(x, y) = \Phi(\mathcal{G}(x), \mathcal{D}(y))$, $g(x, y) = \Theta(\mathcal{G}(x), \mathcal{D}(y))$, where $\Phi, \Theta$ are smooth and simple, but $\mathcal{G}$ and $\mathcal{D}$ might only have first-order regularity. In this setting, the bilinear approximation has the advantage of fully exploiting the first-order information of $\mathcal{G}$ and $\mathcal{D}$, without assuming them to have higher-order regularity. This is because the bilinear approximations of $f$ and $g$ then contain only the first derivatives of $\mathcal{G}$ and $\mathcal{D}$, while the quadratic approximation contains the second derivatives $D^2_{xx} \mathcal{G}$ and $D^2_{yy} \mathcal{D}$ and therefore needs stronger regularity assumptions on $\mathcal{G}$ and $\mathcal{D}$ to be effective.

• No spurious symmetry: One reason to favor full Taylor approximations of a given order in single-player optimization is that they are invariant under changes of the coordinate system. For competitive optimization, a change of coordinates of $(x, y) \in \mathbb{R}^{m+n}$ can correspond, for instance, to taking a decision variable of one player and giving it to the other player. This changes the underlying game significantly, and thus we do not want our approximation to be invariant under this transformation. Instead, we want our local approximation to be invariant only to coordinate changes of $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$ in separation, that is, to block-diagonal coordinate changes on $\mathbb{R}^{m+n}$. Mixed-order approximations (bilinear, biquadratic, etc.) have exactly this invariance property and thus are the natural approximation for two-player games.
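A small numerical illustration of the conditioning point above (ours, with made-up matrices; not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
m, eta = 5, 1.0
A = rng.standard_normal((m, m))              # stand-in for D^2_{xy} f
Dxx = np.diag([0.5, -1.0, 2.0, -0.3, 1.5])   # indefinite stand-in for D^2_{xx} f
I = np.eye(m)

# CGD matrix of Equation (7.8) for a zero-sum game (D^2_{yx} g = -A.T):
M_cgd = np.block([[I, eta * A], [-eta * A.T, I]])
bound = 1 + eta**2 * np.linalg.norm(A, 2)**2
print(np.linalg.cond(M_cgd), "<=", bound)    # the bound always holds

# Newton-type matrix of Equation (7.10) with g = -f (and, purely for the sake
# of a small example, D^2_{yy} g = -D^2_{xx} f): the diagonal block I + eta * Dxx
# is singular once eta * lambda_min(Dxx) = -1, and conditioning degrades accordingly.
M_newton = np.block([[I + eta * Dxx, eta * A], [-eta * A.T, I - eta * Dxx]])
print(np.linalg.cond(M_newton))
```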

While we are convinced that the right notion of first-order competitive optimization is given by quadratically regularized bilinear approximations, we believe that the right notion of second-order competitive optimization is given by cubically regularized biquadratic approximations, in the spirit of [182].
