Introduction
Numerical algorithms as games and estimators
This justification would be easier if our method returned random results, as in randomized linear algebra [165], or a Bayesian posterior over possible solutions, as in probabilistic numerics [116], thereby solving a different problem than classical methods. However, for the work presented in this thesis this is not the case, and the resulting algorithms could equally well be characterized in terms of linear algebra and optimization.
Numerical approximation, fast algorithms, and statistical inference
The proof of the next lemma (on the stability of exponential decay under matrix inversion for well-conditioned matrices) is almost identical to that of [128] (we merely keep track of the constants; see also [68] for related results on inverses of sparse matrices). First-player update rules for (from top to bottom) GDA, LCGD, ConOpt, OGDA, and CGD in a zero-sum game (f = −g).
Elliptic partial differential equations and smooth Gaussian processes
Smooth Gaussian processes
This suggests defining Gaussian fields by choosing an operator ℒ for which [ℒu, u] measures the roughness of the function u. The Sobolev norms, given by the sum of the L² norms of the first s derivatives, provide a natural measure of the roughness of a realization.
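For instance (this particular choice of ℒ is an illustration of ours, not a definition from the text), a Matérn-type precision operator of integer order s on ℝ^d ties the quadratic form directly to a Sobolev norm: integrating by parts, for sufficiently smooth and decaying u,

\[
\mathcal{L} = (1 - \Delta)^s
\quad\Longrightarrow\quad
[\mathcal{L}u, u] = \int_{\mathbb{R}^d} \bigl((1-\Delta)^s u\bigr)\, u \,\mathrm{d}x
= \sum_{|\alpha| \le s} c_\alpha \, \|\partial^\alpha u\|_{L^2}^2 ,
\]

with combinatorial constants c_α > 0, so that [ℒu, u] is an equivalent squared Sobolev norm of order s: larger values correspond to rougher realizations of the associated Gaussian field.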
The cubic bottleneck
The set s_k is the set of row indices contained in the k-th column of the sparsity pattern. This is because, for an index pair (i, j) to be part of the sparsity pattern of L L^T, there must exist a column k whose sparsity set s_k contains both i and j.
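In other words, the potential nonzero pattern of L L^T is the union of s_k × s_k over all columns k. A minimal sketch of this symbolic computation (function and variable names are ours):

from itertools import combinations_with_replacement

def pattern_of_LLT(s):
    """Given s[k] = row indices of column k of L's sparsity pattern, return the
    set of index pairs (i, j) with i >= j that can be nonzero in L @ L.T."""
    pattern = set()
    for s_k in s:
        # any two rows sharing column k can produce a nonzero in L L^T
        for i, j in combinations_with_replacement(s_k, 2):
            pattern.add((max(i, j), min(i, j)))
    return pattern

# example: column 0 of L has nonzeros in rows {0, 2}, column 1 in rows {1, 2}
print(sorted(pattern_of_LLT([[0, 2], [1, 2]])))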
Sparse Cholesky factors by screening
Gaussian elimination and Cholesky factorization
When applied to symmetric positive definite matrices, it is also known as Cholesky factorization and amounts to representing the input matrix as the product A = L L^T of a lower triangular matrix with its transpose, using Algorithm 1. Once the lower triangular factor L has been computed, the determinant of A can be obtained as the squared product of the diagonal entries of L, and linear systems in A can be solved by forward and back substitution (Algorithm 2).
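A minimal NumPy sketch of these two steps (a generic dense implementation for illustration; not necessarily identical to Algorithms 1 and 2):

import numpy as np
from scipy.linalg import solve_triangular

def cholesky(A):
    """Return lower triangular L with A = L @ L.T, for A symmetric positive definite."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        # diagonal entry: subtract contributions of previously computed columns
        L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])
        # entries below the diagonal of column j
        L[j+1:, j] = (A[j+1:, j] - L[j+1:, :j] @ L[j, :j]) / L[j, j]
    return L

def solve_with_cholesky(L, b):
    """Solve A x = b given A = L L^T, by forward then backward substitution."""
    y = solve_triangular(L, b, lower=True)
    return solve_triangular(L.T, y, lower=False)

A = np.array([[4.0, 2.0], [2.0, 3.0]])
L = cholesky(A)
print(np.prod(np.diag(L))**2, np.linalg.det(A))   # determinant as squared product of diag(L)
print(solve_with_cholesky(L, np.array([1.0, 1.0])))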
Sparse Cholesky factorization
Gaussian elimination and Gaussian conditioning
While the sparsity of the j-th column of the Cholesky factor of Θ is determined by conditional independence conditional on the variables appearing before j in the ordering, the sparsity of the Cholesky factor of A is determined by conditional independence conditional on the variables appearing after j in the ordering. Therefore, elimination orderings that induce sparsity for Θ tend, upon reversal, to be elimination orderings that induce sparsity for A, and vice versa.
The screening effect
However, the Markov property depends only on the nonzero pattern of the entries of the precision matrix. For example, precision matrices of the form (A − λI)² often do not admit a screening effect for large λ, even if A itself does. Conversely, important covariance models such as fractional-order Matérn kernels seem to admit a screening effect even though they are not associated with a known local precision operator (see Section 5.5.4, especially Tables 5.5 and 5.6).
The maximin ordering and sparsity pattern
The maximin ordering of the point set {x_i}_{i∈I} ⊂ ℝ^d successively selects the point furthest away from those already selected, with the length scale ℓ_k given by the distance of the k-th point to its predecessors; the reverse maximin ordering is obtained by reversing this order. The associated sparsity pattern contains a pair of indices whenever the corresponding points lie within a distance ρ times the relevant length scale of one another.
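A naive O(N²) sketch of the greedy construction (illustrative only: the first point is chosen arbitrarily here, whereas the text also accounts for distances to the boundary of the domain, and near-linear-time constructions are used in practice):

import numpy as np

def maximin_ordering(points):
    """Greedy maximin ordering: each new point maximizes the distance to the
    already-selected points. Returns (order, length_scales). Naive O(N^2) version."""
    N = len(points)
    order, lengths = [0], [np.inf]                  # start from point 0
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(1, N):
        k = int(np.argmax(dist))                    # furthest point from current selection
        order.append(k)
        lengths.append(float(dist[k]))              # ell_k = distance to predecessors
        dist = np.minimum(dist, np.linalg.norm(points - points[k], axis=1))
    return order, lengths

pts = np.random.rand(100, 2)
order, ell = maximin_ordering(pts)
# the reverse maximin ordering (used for the precision matrix A) is order[::-1]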
Cholesky factorization, numerical homogenization, and gamblets
An attractive feature of our method is that it can be formulated in terms of [...]. In the first figure, we plot the RMSE with respect to the true PDE solution as a function of q ≈ log(N).
Proving exponential decay of Cholesky factors
Overview
In the setting of Theorem 1, let L_{:,1:k} be the rank-k matrix defined by the first k columns of the Cholesky factor L of Θ in the maximin ordering. In the setting of Theorem 1, let L_{:,1:k} be the rank-k matrix defined by the first k columns of the Cholesky factor L of A ≔ Θ^{-1} in the reverse maximin ordering.
Setting and notation
Denoting by ‖·‖ the operator norm and by ℓ_k the length scales of the ordering, we have [...]. In the above, we have discretized the Green's functions of the elliptic operators, so that the resulting Green's matrix is the underlying discrete object.
Algebraic identities and roadmap
In the setting of Examples 1 and 2, denoting by L the lower triangular Cholesky factor of Θ or Θ^{-1}, we will show that [...]. We will carry out this program in the setting of Examples 1 and 2 and prove that (4.13) holds with [...] (4.20). To prove (1), the matrices Θ^{(k)} and A^{(k)} (interpreted as coarse-grained versions of the Green's operator G and the differential operator ℒ) and B^{(k)} will be identified with the stiffness matrices of the operator-adapted wavelets (gamblets) described in Section 3.6.3.
The authors of [142] present a simpler proof of the exponential decay of the LOD basis functions of [163], based on the exponential convergence of subspace iteration methods. The results of [189] are sufficient to show that Condition 1 holds in the setting of Examples 1 and 2.
Bounded condition numbers
To obtain estimates independent of the regularity of Ω, for simplicity of the proof and without loss of generality, we will partially work in the extended space ℝ^d (instead of on Ω). We will now derive the exponential decay of the Cholesky factor L by combining the algebraic identities of Lemma 1 with the bounds on the condition numbers of B^{(k)} (implied by Condition 2) and the exponential decay of (A^{(k)})^{-1} (specified in Condition 1).
Summary of results
In the setting of Theorem 1, let L_{:,1:k} be the rank-k matrix defined by the first k columns of the Cholesky factor L of Θ in the maximin ordering. In the setting of Theorem 11, let L_{:,1:k} be the rank-k matrix defined by the first k columns of the Cholesky factor L of A ≔ Θ^{-1} in the reverse maximin ordering.
Extensions and comparisons
The submatrices Θ_{s_i,s_i} for all i belonging to the supernode k̃ (i.e., i ⇝ k̃) can be formed from the list of entries of s̃_k̃. In the same vein, the CGD update is given by the Nash equilibrium of a regularized bilinear approximation of the underlying game.
Incomplete Cholesky factorization
Zero fill-in incomplete Cholesky factorization
In particular, the above complexities apply to ICHOL(0) in the reverse maximin ordering and sparsity pattern. Analogous results hold in the setting of Theorem 11 when using the maximin [reverse maximin] ordering and sparsity pattern.
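As a concrete illustration, here is a dense-array sketch of zero fill-in incomplete Cholesky factorization restricted to a prescribed sparsity pattern S (function and variable names are ours; the implementation discussed in the text works on packed sparse data structures and supernodes):

import numpy as np

def ichol0(A, S):
    """Incomplete Cholesky with zero fill-in.
    A: symmetric positive definite matrix (dense array, for simplicity).
    S: set of index pairs (i, j) with i >= j that are allowed to be nonzero in L.
    Entries outside S are never computed or updated."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        # diagonal entry of column j
        L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])
        for i in range(j + 1, n):
            if (i, j) in S:   # update only entries inside the sparsity pattern
                L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L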
Implementation of ICHOL(0) for dense kernel matrices
Since the blocks are of size approximately ρ^d × ρ^d, the resulting asymptotic complexity is the same. The amount of parallelism provided by this mechanism depends strongly on the ordering of the degrees of freedom.
Implementation of ICHOL(0) for sparse stiffness matrices
Although this algorithm requires some index bookkeeping within the supernodes, most of the work is done by calls to dense linear algebra routines in Lines 5 and 9. As in the case of the up-looking Cholesky factorization, any set of consecutive supernodes that are pairwise independent under the sparsity pattern S can be processed in parallel in the outer for-loop of Algorithm 8.
Proof of stability of ICHOL(0)
We will now bound the approximation error of the Cholesky factors obtained from Algorithm 3, using the supernodal multicolor ordering and sparsity pattern described in Construction 3. Thus, the above proof can be adapted to show that, using the supernodal multicolor ordering and sparsity pattern, ICHOL(0) applied to Θ^{-1} computes an ε-approximation at a computational complexity of O(N log(N/ε)).
Numerical example: Compression of dense kernel matrices
In particular, as shown in Figure 5.4, it does not cause significant artifacts in the approximation quality near the boundary. Second panel: exponential decay of the relative error in the Frobenius norm as 𝜌 is increased.
Numerical example: Preconditioning finite element matrices
We then compute the reverse maximin ordering ≺_Pr of the prediction points using the modified set Ω̃ ≔ Ω ∪ {x_i}_{i∈I}. In the original formulation due to [96], both players play a zero-sum game with a generator loss function given by the binary cross-entropy.
Cholesky factorization by Kullback-Leibler minimization
Overview
We suspect that the effect of contrast on the accuracy of our method strongly depends on the geometric structure of the coefficients. In other hierarchy-based methods, the computational complexity depends exponentially on the dimension d of the dataset.
Cholesky factorization by KL-minimization
Although the computational complexity is slightly worse than that of incomplete Cholesky factorization, we will show in Theorem 21 that for important choices of S the time complexity can be reduced to O(∑ ...). It can be shown that the formula in Equation (6.3) is equivalent to the formula used to compute the Vecchia approximation [241] in spatial statistics, albeit derived there without explicit awareness of the KL-optimality of the resulting L.
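For concreteness, a sketch of the column-by-column computation, assuming Equation (6.3) takes the standard Vecchia/KL form L_{s_i,i} = Θ_{s_i,s_i}^{-1} e_1 / √(e_1^T Θ_{s_i,s_i}^{-1} e_1) with i listed first in s_i (the indexing conventions here are ours):

import numpy as np

def kl_optimal_column(Theta, s_i):
    """Nonzero entries of column i of the KL-minimizing sparse Cholesky factor.
    s_i: indices in the sparsity set of column i, with i itself listed first."""
    e1 = np.zeros(len(s_i)); e1[0] = 1.0
    v = np.linalg.solve(Theta[np.ix_(s_i, s_i)], e1)   # Theta_{s_i,s_i}^{-1} e_1
    return v / np.sqrt(v[0])                           # e_1^T Theta_{s_i,s_i}^{-1} e_1 = v[0]

# toy usage: a 3x3 covariance matrix with the full sparsity set for column 0
Theta = np.array([[1.0, 0.5, 0.2], [0.5, 1.0, 0.5], [0.2, 0.5, 1.0]])
print(kl_optimal_column(Theta, [0, 1, 2]))

Each column can be computed independently, which is what makes the procedure embarrassingly parallel.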
Ordering and sparsity pattern motivated by the screening effect
In the last panel, we are at the edge of the set of observations, which weakens the screening effect. The screening effect is significantly weakened near the boundary of the domain, as illustrated in the third panel of Figure 6.6 (cf. Figure 5.4).
Extensions
The computation of the mixed covariance matrix ΘPr,Tr can be prohibitively expensive when interpolating with a large number of prediction points. We note that the idea of ordering the prediction points first (last in their notation) has already been proposed by [136] in the context of the Vecchia approximation, although without giving an explicit algorithm.
Applications and numerical results
Time for computing the factor L_ρ with or without aggregation (N = 10^6), as a function of ρ and of the number of nonzero entries. The left plot shows the RMSE of the posterior mean and the right plot that of the posterior standard deviation.
Conclusions
This term allows each player to prefer strategies that are less vulnerable to the actions of the other player. The first experiment shows that the inclusion of the competitive term (Term 2) is enough to solve the cycling problem in the bilinear case.
Competitive Gradient Descent
Introduction
In single-agent optimization, the solution of the problem is the minimizer of the cost function. Recall that in the single-player setting, the gradient descent update is obtained as the optimal solution to a regularized linear approximation of the cost function.
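Concretely (this is the standard derivation, with η denoting the step size):

\[
x_{k+1} \;=\; \operatorname*{arg\,min}_{x \in \mathbb{R}^m} \; f(x_k) + \nabla f(x_k)^\top (x - x_k) + \frac{1}{2\eta}\,\|x - x_k\|^2
\;=\; x_k - \eta\, \nabla f(x_k).
\]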
Competitive gradient descent
An equally valid generalization of the linear approximation in the single-player setting is a bilinear approximation in the two-player setting. Among all (possibly randomized) strategies with finite first moment, the only Nash equilibrium of the game (7.3) is given by

\[
\Delta x = -\eta\,\bigl(\mathrm{Id} - \eta^2 D_{xy}^2 f\, D_{yx}^2 g\bigr)^{-1}\bigl(\nabla_x f - \eta\, D_{xy}^2 f\, \nabla_y g\bigr),
\qquad
\Delta y = -\eta\,\bigl(\mathrm{Id} - \eta^2 D_{yx}^2 g\, D_{xy}^2 f\bigr)^{-1}\bigl(\nabla_y g - \eta\, D_{yx}^2 g\, \nabla_x f\bigr),
\]

where all derivatives are evaluated at the current iterate.
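A small NumPy sketch of this update for a zero-sum bilinear game f(x, y) = xᵀAy = −g(x, y), where the mixed Hessians are simply A and −Aᵀ (a toy illustration of ours, not the thesis's implementation, which uses Hessian-vector products and conjugate gradient solves instead of forming matrices):

import numpy as np

def cgd_step(x, y, A, eta):
    """One competitive gradient descent step for f(x,y) = x^T A y = -g(x,y)."""
    grad_x = A @ y            # gradient of f with respect to x
    grad_y = -A.T @ x         # gradient of g = -f with respect to y
    I_x, I_y = np.eye(len(x)), np.eye(len(y))
    # For f = -g: D_xy f D_yx g = -A A^T, so (Id - eta^2 D_xy f D_yx g) = Id + eta^2 A A^T
    dx = -eta * np.linalg.solve(I_x + eta**2 * A @ A.T, grad_x - eta * A @ grad_y)
    dy = -eta * np.linalg.solve(I_y + eta**2 * A.T @ A, grad_y - eta * (-A.T) @ grad_x)
    return x + dx, y + dy

# on f(x,y) = x*y, simultaneous gradient descent/ascent spirals outward, while CGD converges
x, y, A = np.array([1.0]), np.array([1.0]), np.array([[1.0]])
for _ in range(500):
    x, y = cgd_step(x, y, A, eta=0.2)
print(x, y)   # both approach 0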
Consensus, optimism, or competition?
We further investigate the effect of the consensus term (Term 3) by considering the concave-convex problem f(x, y) = α(−x² + y²) (see Figure 7.5). For α = 1.0, however, ConOpt converges to (0, 0), giving an example of consensus regularization introducing spurious solutions.
Implementation and numerical results
We randomly generate the stabilizing K_0 and initialize the environment agent's parameter L_0 inside the constraint set. Based on the discussion in the last section, any attempt to understand GANs must take into account the inductive biases of the discriminator.
Competitive Mirror Descent
Simplifying constraints by duality
Although there are many ways in which a problem can be cast in the above form, we are interested in the case where f, f̃, g, g̃ are allowed to be complicated, for example given in terms of neural networks, while C̄, K̄, C̃, K̃ are simple and easy to understand. In the following, we assume that the nonlinear constraints of the problem have already been eliminated in this way, leaving us with a problem of the above form (for possibly different C and G).
Projected CGD suffers from empty threats
The explanation for this behavior lies in the mismatch between the unconstrained local game (8.6) and the constrained global game (8.7). This is an empty threat in that it violates the constraints and will therefore be undone by the projection step, but this information is not contained in the local game.
Mirror descent and Bregman potentials
A possible generalization of mirror descent to Problem (8.5) is to obtain x_{k+1} and y_{k+1} as the solutions of the game [...]. However, the local game of this update rule does not have a closed-form solution, as is the case in mirror descent or competitive gradient descent.
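For comparison, the standard single-player mirror descent update with Bregman potential ψ does have a closed form (a standard result, stated here in our notation, for ψ of Legendre type so that the minimizer lies in the interior of C):

\[
x_{k+1} \;=\; \operatorname*{arg\,min}_{x \in \mathcal{C}} \; \eta\,\nabla f(x_k)^\top x + D_\psi(x, x_k)
\;=\; (\nabla \psi)^{-1}\!\bigl(\nabla \psi(x_k) - \eta\,\nabla f(x_k)\bigr),
\]
\[
\text{where } D_\psi(x, x') \;=\; \psi(x) - \psi(x') - \nabla\psi(x')^\top (x - x').
\]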
The information geometry of Bregman divergences
A key feature of Bregman potentials is that they induce a new exponential map on the manifold C. An important property of the dual exponential map is that it inherits the completeness property of 𝜓.
Competitive mirror descent
Numerical comparison
These GAN formulations are fundamentally different from metric-agnostic GANs in that they depend explicitly on the gradient of the discriminator. We believe that the success of GANs is due to their ability to implicitly use the inductive biases of the discriminative network as a notion of similarity.
Implicit competitive regularization
Introduction
Imposing regularity on the discriminator amounts to forcing it to assign similar outputs to similar images. Contrary to this view, we believe that ICR is crucial for overcoming the GAN-dilemma, and thereby for explaining GAN performance in practice, by letting the inductive biases of the discriminator network inform the generative model.
The GAN-dilemma
An important feature of the original GAN is that it depends on the discriminator only through its output when evaluated on samples. The above observation has motivated the introduction of metric-informed GANs that limit the size of the gradient of the discriminator (as a function mapping images to real numbers).
Implicit competitive regularization (ICR)
We believe that GAN training relies on implicit competitive regularization (ICR), an additional implicit regularization arising from the simultaneous training of generator and discriminator. As has been observed before [168], simultaneous gradient descent with step sizes η_x = 0.09 for x and η_y = 0.01 for y converges to (0, 0), despite this being a locally worst strategy for the maximizing player. (See Figure 9.3 for an illustration.) This is a first example of ICR, where the simultaneous optimization of the two agents introduces additional attractive points of the dynamics that are not attractive when optimizing one of the players with gradient descent while the other player is held fixed.
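A small simulation of this phenomenon; the quadratic payoff below is an assumption of ours, chosen to be consistent with the quoted step sizes, and is not necessarily the exact example used in the text:

import numpy as np

# Hypothetical zero-sum game: x minimizes and y maximizes f(x, y) = x^2 + 10*x*y + y^2.
# At x = 0, y = 0 is a local *minimum* of f(0, y) = y^2, i.e. locally worst for the maximizer.
def grad_f(x, y):
    return 2*x + 10*y, 10*x + 2*y    # (df/dx, df/dy)

eta_x, eta_y = 0.09, 0.01
x, y = 1.0, 1.0
for _ in range(2000):
    gx, gy = grad_f(x, y)
    x, y = x - eta_x * gx, y + eta_y * gy    # simultaneous descent / ascent
print(x, y)   # converges towards (0, 0) under these step sizes

# By contrast, gradient ascent on y alone (x held fixed at 0) moves y away from 0:
y = 1e-3
for _ in range(200):
    y = y + eta_y * grad_f(0.0, y)[1]
print(y)      # grows away from 0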
How ICR lets GANs generate
Competitive gradient descent amplifies ICR
Empirical study on CIFAR10