Chapter VI: Cholesky factorization by Kullback-Leibler minimization
6.1 Overview
pairwise distances alone. This means that the algorithm will automatically exploit any low-dimensional structure in the dataset. To illustrate this feature, we artificially construct a dataset with low-dimensional structure by randomly rotating four low-dimensional structures into a 20-dimensional ambient space (see Figure 5.9). Figure 5.8 shows that the resulting approximation is even better than the one obtained in dimension 3, illustrating that our algorithm did indeed exploit the low intrinsic dimension of the dataset.
5.6 Numerical example: Preconditioning finite element matrices
ρ                     2.0   2.5   3.0   3.5   4.0
#iter (averaging)     184   135    85    80    67
#iter (subsampling)   433   278   134   117    72

Table 5.9: Averaging vs. subsampling. For homogeneous Poisson problems and small values of ρ, averaging can improve the preconditioning effect compared to subsampling.
3. We use a reordering akin to Example 1, as opposed to an averaging scheme as in Example 2, even though the Laplace operator (s = 1) with spatial dimension d ∈ {2, 3} violates the condition s > d/2 of Example 1.
Regarding the first point, analogous results could be derived by repeating the proofs of Chapter 4 in the fully discrete setting. Regarding the second point, in agreement with earlier numerical results by [188], we observe that the impact of unstructured high-contrast coefficients on our method is far less severe than could be expected based on the theoretical results in Chapter 4. We suspect that the impact of the contrast on the accuracy of our method depends strongly on the geometric structure of the coefficients. While there should be adversarial choices of geometry with devastating effects, unstructured coefficient fields seem to be surprisingly benign. Regarding the third point, while the numerical homogenization results of Theorem 12 break down if the condition s > d/2 is violated, the exponential decay result does seem to hold. For Poisson problems with slowly varying, low-contrast coefficients and small values of ρ, we observe that using an averaging scheme improves the preconditioning effect, as shown in Table 5.9.
On the other hand, in the presence of high-contrast coefficients, reordering according to Example 1 tends to perform better than averaging according to Example 2, since the resulting incomplete Cholesky factorization is less prone to encountering non-positive pivots. We have no theoretical explanation for this empirical observation.
5.6.2 Accuracy and computational complexity
We begin by empirically investigating the sparsity of the Cholesky factors and the scaling of the computational complexity of our factorization. In Figure 5.10, we show the sparsity pattern for different values of ρ, together with the approximation error. We next investigate the time needed to compute the Cholesky factorization and how it scales with N. As shown in the log-log plot in Figure 5.11, for fixed ρ, our factorization achieves the linear scaling predicted by the theory.
[Figure 5.10 panels: percentage of fill-ins and the relative errors ‖LL^⊤ − A‖_Fro/‖A‖_Fro, ‖(LL^⊤)^{-1} − A^{-1}‖_Fro/‖A^{-1}‖_Fro, and ‖Au − b‖_2/‖b‖_2, plotted against ρ for q = 2, 3, 4, 5.]
Figure 5.10: Sparsity pattern, error, and factor density. When computing the factorization of a Laplacian on a 16q × 16q grid in the reverse maximin ordering of Definition 6, the Cholesky factors are approximately sparse according to the reverse maximin sparsity pattern of Definition 7, despite the condition s > d/2 of Example 1 not being satisfied.
Exact sparse Cholesky factorization, as discussed in Section 3.2, is a strong competitor for many practical two-dimensional problems. However, its superlinear scaling in both time and space complexity limits its applicability to three-dimensional problems. In Figure 5.12, we compare our method against the highly optimized CHOLMOD library on a series of three-dimensional problems of increasing size.
While CHOLMOD is superior for small problems, our method shows superior performance on problems with N > 10^6. We note that CHOLMOD has the advantage that it is not affected by the geometric structure and magnitude of the coefficients and would therefore retain its computational cost if we were to choose a more challenging problem of the same size. Nevertheless, it becomes infeasible for truly large-scale problems, whereas our method retains its near-linear scaling.
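In this comparison, our factors enter the iterative solve as a PCG preconditioner: the approximation LL^⊤ ≈ A is applied through two sparse triangular solves per iteration. The following SciPy sketch illustrates this standard pattern; it is not the thesis's implementation, and it assumes a sparse lower-triangular factor L with LL^⊤ ≈ A is already available.

```python
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def pcg_with_ic_preconditioner(A, b, L):
    """Solve A x = b by CG preconditioned with (L L^T)^{-1}, applied via triangular solves."""
    L_csr = sp.csr_matrix(L)            # spsolve_triangular expects CSR input
    Lt_csr = sp.csr_matrix(L.T)

    def apply_preconditioner(r):
        y = spla.spsolve_triangular(L_csr, r, lower=True)        # solve L y = r
        return spla.spsolve_triangular(Lt_csr, y, lower=False)   # solve L^T z = y

    M = spla.LinearOperator(A.shape, matvec=apply_preconditioner)
    x, info = spla.cg(A, b, M=M)        # info == 0 signals convergence
    return x, info
```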
5.6.3 Comparison with algebraic multigrid methods
Algebraic multigrid methods [41, 42, 253] are a popular choice for solving elliptic partial differential equations. Whereas geometric multigrid methods [81, 111, 113] use predetermined multiresolution schemes that are vulnerable to rough and high-contrast coefficients [20], algebraic multigrid methods use an operator-adapted multiresolution analysis.
[Figure 5.11 panels: factorization time (s) versus N on log-log axes, in 2D (ρ = 7.0, 8.0) and 3D (ρ = 2.5, 3.5).]
Figure 5.11: Scalability. In 2D and 3D, our IC factorization time matches the expected O(N ρ^{2d}) time complexity for a matrix of size N × N and a sparsity parameter ρ; the dashed line indicates a slope of 1 in this log-log plot.
While this improves their performance on many problems, rough coefficients with high contrast still pose challenges [7, 255].
We compare our method to the popular AMG implementations AMGCL [67] and Trilinos [237]. For AMGCL, we choose the Ruge-Stüben method [230] to construct the hierarchy and a sparse approximate inverse [102] for relaxation. For Trilinos, we use smoothed aggregation [240] for coarsening and symmetric Gauss-Seidel for relaxation.
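For readers without access to AMGCL or Trilinos, a comparable AMG-preconditioned CG baseline can be assembled with the open-source pyamg package; note that this is a different library from the two used in our benchmarks and only mirrors the classical Ruge-Stüben setup. A minimal sketch on a model Poisson matrix:

```python
import numpy as np
import pyamg
from scipy.sparse.linalg import cg

A = pyamg.gallery.poisson((256, 256), format='csr')   # model 2D Poisson matrix
b = np.ones(A.shape[0])

ml = pyamg.ruge_stuben_solver(A)        # classical (Ruge-Stüben) AMG hierarchy
M = ml.aspreconditioner(cycle='V')      # expose one V-cycle as a preconditioner
x, info = cg(A, b, M=M)
```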
[Figure 5.12 panels (CHOLMOD vs. our method): factorization time (s), total time (s), and memory cost (GB) versus N.]
Figure 5.12: Direct vs. iterative solvers. For a 3D Poisson solve, CHOLMOD scales nonlinearly in the linear-system size N in factorization time, total solve time (which includes factorization and back-substitution), and memory use, and fails for N > 1M; in contrast, a PCG-based iterative solve using our preconditioner exhibits consistently linear behavior in all three measurements.
We begin our comparison with the Poisson problem (5.44) and the elasticity problem (5.45) in two and three dimensions on a bimaterial generated by randomly assigning stiff regions. The stiff regions form about 1/8 of the problem in two dimensions and 1/64 of the problem in three dimensions. We refer to the ratio of the coefficient magnitudes in stiff and normal regions as the contrast. We furthermore investigate examples on both regular and (uniformly) randomly generated grids.
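To make the setup concrete, a coefficient field of this type can be generated as in the following simplified sketch (i.i.d. cell assignment on a regular 2D grid with illustrative parameters; the exact construction used for the benchmarks, including the irregular meshes, is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
n, contrast = 256, 1.0e4            # grid resolution and coefficient contrast (illustrative values)
stiff = rng.random((n, n)) < 1 / 8  # ~1/8 of the cells are stiff in 2D (1/64 in 3D)
coeff = np.where(stiff, contrast, 1.0)   # per-cell coefficient magnitude
```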
[Figure 5.12, Poisson panels: time (s) on regular and irregular meshes; upper group of plots, x-axis shows contrast (N = 2×10^5, g = 1); lower group, x-axis shows size (contrast = 10^4, g = 1). Curves: Ours, AMGCL, Trilinos.]
As we can see in Figure 5.12, our method beats both of these libraries on all problems, with the exception of Poisson problems on regular grids, where it is outperformed by AMGCL.
In many applications, our method will be used repeatedly during the solution of a nonlinear problem via implicit time-stepping or Newton's method. In Figure 5.13, we use our method to accelerate computations in quasi-statics: we stretch an armadillo model with about 340k nodes (thus over 1M degrees of freedom), in which each tetrahedron is randomly assigned one of two neo-Hookean materials whose contrast between Young's moduli is 10^4. In each step of our Newton descent to find the final shape, we project the neo-Hookean elastic stiffness matrix to enforce positive semi-definiteness [234].
[Figure 5.12, elasticity panels: time (s) on regular and irregular meshes; upper group of plots, x-axis shows contrast (N = 2×10^5, g = 1); lower group, x-axis shows size (contrast = 10^4, g = 1). Curves: Ours, AMGCL, Trilinos.]
Figure 5.12: Comparisons with AMG libraries. The plots show time costs (including factorization and PCG iteration times) as a function of material contrast and problem size. Our method is much less sensitive to contrast and problem size, and is particularly efficient when the size N becomes large and/or for bad condition numbers. For our method, timings lie within the orange region, depending on the actual value of ρ, for which we used the range [6.5, 8.5] in 2D and [2.5, 4.0] in 3D. All meshes are generated by Delaunay triangulation.
The linear solves of the Newton descent exhibit a 2.5× speedup on average compared to AMGCL, with larger speed-ups for smaller error tolerances.
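The projection mentioned above is commonly performed per element by clamping negative eigenvalues of the local stiffness (Hessian) blocks to zero before assembly; the sketch below shows that standard step in NumPy (our paraphrase of the technique, not a quotation of [234]):

```python
import numpy as np

def project_to_psd(H):
    """Project a symmetric element Hessian onto the PSD cone by eigenvalue clamping."""
    w, V = np.linalg.eigh(H)                 # symmetric eigendecomposition
    return (V * np.maximum(w, 0.0)) @ V.T    # zero out negative eigenvalues, reassemble
```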
[Figure 5.13: potential energy value (×10^5) versus time (10^4 s) for our method and AMGCL.]
Figure 5.13: Nonlinear quasi-statics. An armadillo is stretched via lateral gravity with a few nodes (marked in black) fixed at their initial position. We use a trust-region nonlinear optimization algorithm involving the solution of a linear system using our IC preconditioner or AMGCL at each step; timings of the first 20 iterations are plotted.
Chapter 6
CHOLESKY FACTORIZATION BY KULLBACK-LEIBLER MINIMIZATION
6.1 Overview
6.1.1 Factorizing Θ^{-1} from entries of Θ
In this chapter, we propose to compute a sparse approximate inverse Cholesky factor L of Θ by minimizing, with respect to L and subject to a sparsity constraint, the Kullback-Leibler (KL) divergence between two centered multivariate normal distributions with covariance matrices Θ and (LL^⊤)^{-1}. Surprisingly, this minimization problem has a closed-form solution, enabling the efficient computation of optimally accurate Cholesky factors for any specified sparsity pattern.
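Concretely, writing s_i for the set of row indices allowed to be nonzero in column i (ordered so that i is its first element), the optimal column is Θ_{s_i,s_i}^{-1} e_1 / sqrt((Θ_{s_i,s_i}^{-1})_{11}). The NumPy sketch below evaluates this formula column by column; it is a plain dense illustration of the closed form, not the aggregated near-linear-time implementation developed later in this chapter.

```python
import numpy as np

def kl_optimal_inverse_cholesky(Theta, pattern):
    """Columns of the KL-optimal sparse approximate inverse Cholesky factor.

    Theta   : (N, N) symmetric positive-definite covariance matrix.
    pattern : list; pattern[i] holds the row indices allowed in column i,
              sorted so that pattern[i][0] == i (lower-triangular pattern).
    Returns a dense lower-triangular L with L @ L.T ≈ Theta^{-1}.
    """
    N = Theta.shape[0]
    L = np.zeros((N, N))
    for i in range(N):
        s = np.asarray(pattern[i])
        K = Theta[np.ix_(s, s)]            # small dense submatrix Theta_{s_i, s_i}
        e1 = np.zeros(len(s))
        e1[0] = 1.0
        col = np.linalg.solve(K, e1)       # Theta_{s_i, s_i}^{-1} e_1
        L[s, i] = col / np.sqrt(col[0])    # col[0] = (Theta_{s_i, s_i}^{-1})_{11} > 0
    return L
```

For the full lower-triangular pattern, pattern[i] = [i, i+1, ..., N-1], the KL divergence vanishes and the routine returns the exact Cholesky factor of Theta^{-1} in the given ordering.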
The resulting approximation can be shown to be equivalent to the Vecchia approximation of Gaussian processes [241], which has become very popular for the analysis of geospatial data (e.g., [62, 104, 135, 136, 226, 233]); to the best of our knowledge, rigorous convergence rates and error bounds were previously unavailable for Vecchia approximations, and this work is the first to present such results. An equivalent approximation has also been proposed by [133] and [141] in the literature on factorized sparse approximate inverse (FSAI) preconditioners of (typically) sparse matrices (see, e.g., [33] for a review and comparison and [51] for an application to dense kernel matrices); however, its KL-divergence optimality had not been observed before. KL minimization has also been used to obtain sparse lower-triangular transport maps by [166]; while that literature is mostly concerned with the efficient sampling of non-Gaussian probability measures, the present work shows that an analogous approach can be used to obtain fast algorithms for numerical linear algebra if the sparsity pattern is chosen appropriately.
6.1.2 State-of-the-art computational complexity
The computational complexity and approximation accuracy of our approach depend on the choice of elimination ordering and sparsity pattern. When Θ is the covariance matrix of a Gaussian process that is subject to the screening effect, we propose to use the reverse maximin ordering and sparsity pattern proposed in Definitions 6 and 7. By using a grouping algorithm similar to the supernodes of Chapter 5 and the heuristics proposed by [83, 104, 226], we can show that the approximate inverse Cholesky factor can be computed in O(N ρ^{2d}) time and O(N ρ^d) space, using only O(N ρ^d) entries of the original kernel matrix Θ, where ρ is a tuning parameter trading accuracy for computational efficiency.
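For intuition, the ordering itself can be produced by a greedy farthest-point selection using pairwise distances only; the naive O(N^2) sketch below computes such an ordering together with the associated length scales. It is a simplified stand-in for Definition 6 (arbitrary choice of the initial point, no tie-breaking rules) and omits the sparsity-pattern and grouping constructions.

```python
import numpy as np

def reverse_maximin_ordering(X):
    """Greedy farthest-point ('maximin') selection, returned in reverse order.

    X : (N, d) array of point locations.
    Returns (order, ell): a permutation of 0..N-1 and length scales, where each
    selected point maximizes its distance to the points selected before it.
    """
    N = X.shape[0]
    selected = [0]                              # arbitrary first pick (assumption)
    ell = np.full(N, np.inf)                    # the first pick keeps an infinite scale
    dist = np.linalg.norm(X - X[0], axis=1)     # distance of each point to the selected set
    dist[0] = -np.inf
    for _ in range(N - 1):
        i = int(np.argmax(dist))                # farthest remaining point
        ell[i] = dist[i]
        selected.append(i)
        dist = np.minimum(dist, np.linalg.norm(X - X[i], axis=1))
        dist[i] = -np.inf
    return np.array(selected[::-1]), ell        # reversed: coarse, far-apart points come last
```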
In settings where Theorem 10 on the exponential decay of Θ^{-1} holds, it allows us to prove that the approximation error decays exponentially in ρ. In this setting, we can thus compute an ε-approximation of Θ in complexity O(N log^{2d}(N/ε)) by choosing ρ ≍ log(N/ε). This is the best known trade-off between computational complexity and accuracy for inverses of Green's matrices of general elliptic boundary value problems.
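The stated bound is obtained by substituting this choice of ρ into the complexities above; a one-line derivation, using the notation of this section:

```latex
% Substituting \rho \asymp \log(N/\epsilon) into the time and space complexities:
\mathcal{O}\!\left(N\rho^{2d}\right) \longrightarrow \mathcal{O}\!\left(N\log^{2d}(N/\epsilon)\right),
\qquad
\mathcal{O}\!\left(N\rho^{d}\right) \longrightarrow \mathcal{O}\!\left(N\log^{d}(N/\epsilon)\right).
```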
6.1.3 Practical advantages
Our method has important practical advantages complementing its theoretical and asymptotic properties. In many GP regression applications, large values of ρ are computationally intractable with present-day resources. By incorporating the prediction points in the computation of the KL-optimal inverse Cholesky factors, we obtain a GP regression algorithm that is accurate even for small (≈ 3) values of ρ, including in settings where truncating the true Cholesky factor of Θ^{-1} to the same sparsity pattern fails completely.
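The mechanism behind this can be seen from standard Gaussian conditioning in terms of the (approximate) joint precision matrix: if the prediction points are ordered after the training points and LL^⊤ ≈ Θ^{-1} for the joint covariance Θ, the posterior follows from the blocks of the precision. The dense NumPy sketch below spells this identity out for clarity only; the chapter's algorithm never forms these dense blocks, and its ordering conventions may differ.

```python
import numpy as np

def gp_posterior_from_inverse_cholesky(L, y, n_train):
    """Posterior of the prediction points from L with L @ L.T ≈ Θ^{-1} (joint covariance).

    Assumed joint ordering: [training points, prediction points]; y holds the
    training observations of a centered GP.
    """
    A = L @ L.T                                   # approximate joint precision (dense, for clarity)
    A_pp = A[n_train:, n_train:]                  # prediction-prediction block
    A_pt = A[n_train:, :n_train]                  # prediction-training block
    post_mean = -np.linalg.solve(A_pp, A_pt @ y)  # E[x_pred | x_train = y]
    post_cov = np.linalg.inv(A_pp)                # Cov[x_pred | x_train = y]
    return post_mean, post_cov
```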
For other hierarchy-based methods, the computational complexity depends exponentially on the dimension d of the dataset. In contrast, because the construction of the ordering and sparsity pattern uses only pairwise distances between points, our algorithms automatically adapt to low-dimensional structure in the data and operate at the complexities obtained by replacing d with the intrinsic dimension d̃ ≤ d of the dataset.
We have seen in Section 5.5.3 that the screening effect deteriorates for independent sums of two GPs, such as when combining a GP with additive Gaussian white noise.
As noted there, this problem could be overcome if we had access to the inverse of the covariance matrices. Analogs of the Cholesky factorization that compute this inverse by recursive Schur complementation exist, but they are too unstable to be useful in practice. The inherent stability of the KL minimization allows us to realize this approach and thus compute both cheaply and accurately with sums of independent Gaussian processes arising, for instance, from modeling measurement noise. To the best of our knowledge, this is the first time this has been achieved by a method based on the screening effect.
Finally, our algorithm is intrinsically parallel because it allows each column of the sparse factor to be computed independently (as in the setting of the Vecchia approximation, factorized sparse approximate inverses, and lower-triangular transport maps). Furthermore, we show that in the context of GP regression, the log-likelihood, the posterior mean, and the posterior variance can be computed in O(N + ρ^{2d}) space complexity. In a parallel setting, we require O(ρ^d) communication between the different workers for every O(ρ^{3d}) floating-point operations, resulting in a total communication complexity of O(N). Here, most of the floating-point operations arise from calls to highly optimized BLAS and LAPACK routines.
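For instance, the Gaussian log-likelihood of observations y ~ N(0, Θ) is available directly from the factor, since y^⊤ Θ^{-1} y ≈ ‖L^⊤ y‖^2 and log det Θ ≈ −2 Σ_i log L_ii. A short sketch, assuming L is stored as a SciPy sparse matrix or a NumPy array:

```python
import numpy as np

def gaussian_loglikelihood(L, y):
    """log N(y | 0, Θ) using a factor L with L @ L.T ≈ Θ^{-1}."""
    N = y.shape[0]
    quad = float(np.sum(np.asarray(L.T @ y) ** 2))                  # y^T Θ^{-1} y ≈ ||L^T y||^2
    logdet_theta = -2.0 * np.sum(np.log(np.asarray(L.diagonal())))  # log det Θ ≈ -2 Σ log L_ii
    return -0.5 * (quad + logdet_theta + N * np.log(2.0 * np.pi))
```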