Chapter VI: Cholesky factorization by Kullback-Leibler minimization
6.1 Overview
pairwise distances alone. This means that the algorithm will automatically exploit any low-dimensional structure in the dataset. To illustrate this feature, we artificially construct a dataset with low-dimensional structure by randomly rotating four low-dimensional structures into a 20-dimensional ambient space (see Figure 5.9). Figure 5.8 shows that the resulting approximation is even better than the one obtained in dimension 3, illustrating that our algorithm did indeed exploit the low intrinsic dimension of the dataset.
5.6 Numerical example: Preconditioning finite element matrices
ρ                     2.0   2.5   3.0   3.5   4.0
#iter (averaging)     184   135    85    80    67
#iter (subsampling)   433   278   134   117    72

Table 5.9: Averaging vs. subsampling. For homogeneous Poisson problems and small values of ρ, averaging can improve the preconditioning effect compared to subsampling.
3. We use a reordering akin to Example 1, as opposed to an averaging scheme as in Example 2, even though the Laplace operator (s = 1) with spatial dimension d ∈ {2, 3} violates the condition s > d/2 of Example 1.
Regarding the first point, analogous results could be derived by repeating the proofs of Chapter 4 in the fully discrete setting. Regarding the second point, in agreement with earlier numerical results by [188], we observe that the impact of unstructured high-contrast coefficients on our method is far less severe than could be expected based on the theoretical results in Chapter 4. We suspect that the impact of the contrast on the accuracy of our method depends strongly on the geometric structure of the coefficients. While there should be adversarial choices of geometry with devastating effects, unstructured coefficient fields seem to be surprisingly benign. Regarding the third point, while the numerical homogenization results of Theorem 12 break down if the condition s > d/2 is violated, the exponential decay result does seem to hold. For Poisson problems with slowly varying, low-contrast coefficients and small values of ρ, we observe that using an averaging scheme improves the preconditioning effect, as shown in Table 5.9.
On the other hand, in the presence of high-contrast coefficients, reordering according to Example 1 tends to perform better than averaging according to Example 2, since the resulting incomplete Cholesky factorization is less prone to encountering non-positive pivots. We have no theoretical explanation for this empirical observation.
5.6.2 Accuracy and computational complexity
We begin by empirically investigating the sparsity of the Cholesky factors and the scaling of the computational complexity of our factorization. In Figure 5.10, we show the sparsity pattern for different values of ρ, together with the approximation error. We next investigate the time needed to compute the Cholesky factorization and how it scales with N. As shown in the log-log plot in Figure 5.11, for fixed ρ, our factorization achieves the linear scaling predicted by the theory.
[Figure 5.10 panels: percentage of fill-ins and the relative errors ‖LL^⊤ − A‖_Fro/‖A‖_Fro, ‖(LL^⊤)^{-1} − A^{-1}‖_Fro/‖A^{-1}‖_Fro, and ‖Au − b‖_2/‖b‖_2, plotted against ρ for q = 2, 3, 4, 5.]
Figure 5.10: Sparsity pattern, error, and factor density. When computing the factorization of a Laplacian on a 16q × 16q grid in the reverse maximin ordering of Definition 6, the Cholesky factors are approximately sparse according to the reverse maximin sparsity pattern of Definition 7, despite the condition s > d/2 of Example 1 not being satisfied.
Exact sparse Cholesky factorization, as discussed in Section 3.2, is a strong competitor for many practical two-dimensional problems. However, its superlinear scaling in both time and space complexity limits its applicability to three-dimensional problems. In Figure 5.12, we compare our method against the highly optimized CHOLMOD library on a series of three-dimensional problems of increasing size.
While CHOLMOD is superior for small problems, our method shows superior performance on problems with N > 10^6. We note that CHOLMOD has the advantage that it is not affected by the geometric structure and magnitude of the coefficients and would therefore retain its computational cost if we were to choose a more challenging problem of the same size. Nevertheless, it becomes infeasible for truly large-scale problems, whereas our method retains its near-linear scaling.
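In this comparison, our factors enter the iterative solve as a PCG preconditioner: the approximation LL^⊤ ≈ A is applied through two sparse triangular solves per iteration. The following SciPy sketch illustrates this standard pattern; it is not the thesis's implementation, and it assumes a sparse lower-triangular factor L with LL^⊤ ≈ A is already available.

```python
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def pcg_with_ic_preconditioner(A, b, L):
    """Solve A x = b by CG preconditioned with (L L^T)^{-1}, applied via triangular solves."""
    L_csr = sp.csr_matrix(L)            # spsolve_triangular expects CSR input
    Lt_csr = sp.csr_matrix(L.T)

    def apply_preconditioner(r):
        y = spla.spsolve_triangular(L_csr, r, lower=True)        # solve L y = r
        return spla.spsolve_triangular(Lt_csr, y, lower=False)   # solve L^T z = y

    M = spla.LinearOperator(A.shape, matvec=apply_preconditioner)
    x, info = spla.cg(A, b, M=M)        # info == 0 signals convergence
    return x, info
```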
5.6.3 Comparison with algebraic multigrid methods
Algebraic multigrid methods [41, 42, 253] are a popular choice for solving elliptic partial differential equations. Whereas geometric multigrid methods [81, 111, 113] use predetermined multiresolution schemes that are vulnerable to rough and high-contrast coefficients [20], algebraic multigrid methods use an operator-adapted multiresolution analysis.
[Figure 5.11 panels: factorization time (s) versus N on log-log axes, in 2D (ρ = 7.0, 8.0) and 3D (ρ = 2.5, 3.5).]
Figure 5.11: Scalability. In 2D and 3D, our IC factorization time matches the expected O(N ρ^{2d}) time complexity for a matrix of size N × N and a sparsity parameter ρ; the dashed line indicates a slope of 1 in this log-log plot.
While this improves their performance on many problems, rough coefficients with high contrast still pose challenges [7, 255].
We compare our method to the popular AMG implementations AMGCL [67] and Trilinos [237]. For AMGCL, we choose the Ruge-Stüben method [230] to construct the hierarchy and a sparse approximate inverse [102] for relaxation. For Trilinos, we use smoothed aggregation [240] for coarsening and symmetric Gauss-Seidel for relaxation.
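For readers without access to AMGCL or Trilinos, a comparable AMG-preconditioned CG baseline can be assembled with the open-source pyamg package; note that this is a different library from the two used in our benchmarks and only mirrors the classical Ruge-Stüben setup. A minimal sketch on a model Poisson matrix:

```python
import numpy as np
import pyamg
from scipy.sparse.linalg import cg

A = pyamg.gallery.poisson((256, 256), format='csr')   # model 2D Poisson matrix
b = np.ones(A.shape[0])

ml = pyamg.ruge_stuben_solver(A)        # classical (Ruge-Stüben) AMG hierarchy
M = ml.aspreconditioner(cycle='V')      # expose one V-cycle as a preconditioner
x, info = cg(A, b, M=M)
```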
[Figure 5.12 panels (CHOLMOD vs. our method): factorization time (s), total time (s), and memory cost (GB) versus N.]
Figure 5.12: Direct vs. iterative solvers. For a 3D Poisson solve, CHOLMOD scales nonlinearly in the linear-system size N in factorization time, total solve time (which includes factorization and back-substitution), and memory use, and fails for N > 1M; in contrast, a PCG-based iterative solve using our preconditioner exhibits consistently linear behavior in all three measurements.
We begin our comparison with the Poisson problem (5.44) and the elasticity problem (5.45) in two and three dimensions on a bimaterial generated by randomly assigning stiff regions. The stiff regions form about 1/8 of the problem in two dimensions and 1/64 of the problem in three dimensions. We refer to the ratio of the coefficient magnitudes in stiff and normal regions as the contrast. We furthermore investigate examples on both regular and (uniformly) randomly generated grids.
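To make the setup concrete, a coefficient field of this type can be generated as in the following simplified sketch (i.i.d. cell assignment on a regular 2D grid with illustrative parameters; the exact construction used for the benchmarks, including the irregular meshes, is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
n, contrast = 256, 1.0e4            # grid resolution and coefficient contrast (illustrative values)
stiff = rng.random((n, n)) < 1 / 8  # ~1/8 of the cells are stiff in 2D (1/64 in 3D)
coeff = np.where(stiff, contrast, 1.0)   # per-cell coefficient magnitude
```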
[Figure 5.12, Poisson panels: time (s) on regular and irregular meshes; upper group of plots, x-axis shows contrast (N = 2×10^5, g = 1); lower group, x-axis shows size (contrast = 10^4, g = 1). Curves: Ours, AMGCL, Trilinos.]
As we can see in Figure 5.12, our method beats both of these libraries on all problems, with the exception of Poisson problems on regular grids, where it is outperformed by AMGCL.
In many applications, our method will be used repeatedly during the solution of a nonlinear problem via implicit time-stepping or Newton's method. In Figure 5.13, we use our method to accelerate computations in quasi-statics: we stretch an armadillo model with about 340k nodes (thus over 1M degrees of freedom), in which each tetrahedron is randomly assigned one of two neo-Hookean materials whose contrast between Young's moduli is 10^4. In each step of our Newton descent to find the final shape, we project the neo-Hookean elastic stiffness matrix to enforce positive semi-definiteness [234].
[Figure 5.12, elasticity panels: time (s) on regular and irregular meshes; upper group of plots, x-axis shows contrast (N = 2×10^5, g = 1); lower group, x-axis shows size (contrast = 10^4, g = 1). Curves: Ours, AMGCL, Trilinos.]
Figure 5.12: Comparisons with AMG libraries. The plots show time costs (including factorization and PCG iteration times) as a function of material contrast and problem size. Our method is much less sensitive to contrast and problem size, and is particularly efficient when the size N becomes large and/or for bad condition numbers. For our method, timings lie within the orange region, depending on the actual value of ρ, for which we used the range [6.5, 8.5] in 2D and [2.5, 4.0] in 3D. All meshes are generated by Delaunay triangulation.
The linear solves of the Newton descent exhibit a 2.5× speedup on average compared to AMGCL, with larger speed-ups for smaller error tolerances.
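The projection mentioned above is commonly performed per element by clamping negative eigenvalues of the local stiffness (Hessian) blocks to zero before assembly; the sketch below shows that standard step in NumPy (our paraphrase of the technique, not a quotation of [234]):

```python
import numpy as np

def project_to_psd(H):
    """Project a symmetric element Hessian onto the PSD cone by eigenvalue clamping."""
    w, V = np.linalg.eigh(H)                 # symmetric eigendecomposition
    return (V * np.maximum(w, 0.0)) @ V.T    # zero out negative eigenvalues, reassemble
```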
[Figure 5.13: potential energy value (×10^5) versus time (10^4 s) for our method and AMGCL.]
Figure 5.13: Nonlinear quasi-statics. An armadillo is stretched via lateral gravity with a few nodes (marked in black) fixed at their initial position. We use a trust-region nonlinear optimization algorithm involving the solution of a linear system using our IC preconditioner or AMGCL at each step; timings of the first 20 iterations are plotted.
Chapter 6
CHOLESKY FACTORIZATION BY KULLBACK-LEIBLER MINIMIZATION
6.1 Overview
6.1.1 Factorizing Θ^{-1} from entries of Θ
In this chapter, we propose to compute a sparse approximate inverse Cholesky factor L of Θ by minimizing, with respect to L and subject to a sparsity constraint, the Kullback-Leibler (KL) divergence between two centered multivariate normal distributions with covariance matrices Θ and (LL^⊤)^{-1}. Surprisingly, this minimization problem has a closed-form solution, enabling the efficient computation of optimally accurate Cholesky factors for any specified sparsity pattern.
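Concretely, writing s_i for the set of row indices allowed to be nonzero in column i (ordered so that i is its first element), the optimal column is Θ_{s_i,s_i}^{-1} e_1 / sqrt((Θ_{s_i,s_i}^{-1})_{11}). The NumPy sketch below evaluates this formula column by column; it is a plain dense illustration of the closed form, not the aggregated near-linear-time implementation developed later in this chapter.

```python
import numpy as np

def kl_optimal_inverse_cholesky(Theta, pattern):
    """Columns of the KL-optimal sparse approximate inverse Cholesky factor.

    Theta   : (N, N) symmetric positive-definite covariance matrix.
    pattern : list; pattern[i] holds the row indices allowed in column i,
              sorted so that pattern[i][0] == i (lower-triangular pattern).
    Returns a dense lower-triangular L with L @ L.T ≈ Theta^{-1}.
    """
    N = Theta.shape[0]
    L = np.zeros((N, N))
    for i in range(N):
        s = np.asarray(pattern[i])
        K = Theta[np.ix_(s, s)]            # small dense submatrix Theta_{s_i, s_i}
        e1 = np.zeros(len(s))
        e1[0] = 1.0
        col = np.linalg.solve(K, e1)       # Theta_{s_i, s_i}^{-1} e_1
        L[s, i] = col / np.sqrt(col[0])    # col[0] = (Theta_{s_i, s_i}^{-1})_{11} > 0
    return L
```

For the full lower-triangular pattern, pattern[i] = [i, i+1, ..., N-1], the KL divergence vanishes and the routine returns the exact Cholesky factor of Theta^{-1} in the given ordering.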
The resulting approximation can be shown to be equivalent to the Vecchia approximation of Gaussian processes [241], which has become very popular for the analysis of geospatial data (e.g., [62, 104, 135, 136, 226, 233]); to the best of our knowledge, rigorous convergence rates and error bounds were previously unavailable for Vecchia approximations, and this work is the first to present such results. An equivalent approximation has also been proposed by [133] and [141] in the literature on factorized sparse approximate inverse (FSAI) preconditioners of (typically) sparse matrices (see, e.g., [33] for a review and comparison and [51] for an application to dense kernel matrices); however, its KL-divergence optimality had not been observed before. KL minimization has also been used to obtain sparse lower-triangular transport maps by [166]; while that literature is mostly concerned with the efficient sampling of non-Gaussian probability measures, the present work shows that an analogous approach can be used to obtain fast algorithms for numerical linear algebra if the sparsity pattern is chosen appropriately.
6.1.2 State-of-the-art computational complexity
The computational complexity and approximation accuracy of our approach depend on the choice of elimination ordering and sparsity pattern. When Θ is the covariance matrix of a Gaussian process that is subject to the screening effect, we propose to use the reverse maximin ordering and sparsity pattern proposed in Definitions 6 and 7. By using a grouping algorithm similar to the supernodes of Chapter 5 and the heuristics proposed by [83, 104, 226], we can show that the approximate inverse Cholesky factor can be computed in O(N ρ^{2d}) time and O(N ρ^d) space, using only O(N ρ^d) entries of the original kernel matrix Θ, where ρ is a tuning parameter trading accuracy for computational efficiency.
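For intuition, the ordering itself can be produced by a greedy farthest-point selection using pairwise distances only; the naive O(N^2) sketch below computes such an ordering together with the associated length scales. It is a simplified stand-in for Definition 6 (arbitrary choice of the initial point, no tie-breaking rules) and omits the sparsity-pattern and grouping constructions.

```python
import numpy as np

def reverse_maximin_ordering(X):
    """Greedy farthest-point ('maximin') selection, returned in reverse order.

    X : (N, d) array of point locations.
    Returns (order, ell): a permutation of 0..N-1 and length scales, where each
    selected point maximizes its distance to the points selected before it.
    """
    N = X.shape[0]
    selected = [0]                              # arbitrary first pick (assumption)
    ell = np.full(N, np.inf)                    # the first pick keeps an infinite scale
    dist = np.linalg.norm(X - X[0], axis=1)     # distance of each point to the selected set
    dist[0] = -np.inf
    for _ in range(N - 1):
        i = int(np.argmax(dist))                # farthest remaining point
        ell[i] = dist[i]
        selected.append(i)
        dist = np.minimum(dist, np.linalg.norm(X - X[i], axis=1))
        dist[i] = -np.inf
    return np.array(selected[::-1]), ell        # reversed: coarse, far-apart points come last
```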
In settings where Theorem 10 on the exponential decay of Θ^{-1} holds, it allows us to prove that the approximation error decays exponentially in ρ. In this setting, we can thus compute an ε-approximation of Θ in complexity O(N log^{2d}(N/ε)) by choosing ρ ≍ log(N/ε). This is the best known trade-off between computational complexity and accuracy for inverses of Green's matrices of general elliptic boundary value problems.
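The stated bound is obtained by substituting this choice of ρ into the complexities above; a one-line derivation, using the notation of this section:

```latex
% Substituting \rho \asymp \log(N/\epsilon) into the time and space complexities:
\mathcal{O}\!\left(N\rho^{2d}\right) \longrightarrow \mathcal{O}\!\left(N\log^{2d}(N/\epsilon)\right),
\qquad
\mathcal{O}\!\left(N\rho^{d}\right) \longrightarrow \mathcal{O}\!\left(N\log^{d}(N/\epsilon)\right).
```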
6.1.3 Practical advantages
Our method has important practical advantages complementing its theoretical and asymptotic properties. In many GP regression applications, large values of ρ are computationally intractable with present-day resources. By incorporating the prediction points in the computation of the KL-optimal inverse Cholesky factors, we obtain a GP regression algorithm that is accurate even for small (≈ 3) values of ρ, including in settings where truncating the true Cholesky factor of Θ^{-1} to the same sparsity pattern fails completely.
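The mechanism behind this can be seen from standard Gaussian conditioning in terms of the (approximate) joint precision matrix: if the prediction points are ordered after the training points and LL^⊤ ≈ Θ^{-1} for the joint covariance Θ, the posterior follows from the blocks of the precision. The dense NumPy sketch below spells this identity out for clarity only; the chapter's algorithm never forms these dense blocks, and its ordering conventions may differ.

```python
import numpy as np

def gp_posterior_from_inverse_cholesky(L, y, n_train):
    """Posterior of the prediction points from L with L @ L.T ≈ Θ^{-1} (joint covariance).

    Assumed joint ordering: [training points, prediction points]; y holds the
    training observations of a centered GP.
    """
    A = L @ L.T                                   # approximate joint precision (dense, for clarity)
    A_pp = A[n_train:, n_train:]                  # prediction-prediction block
    A_pt = A[n_train:, :n_train]                  # prediction-training block
    post_mean = -np.linalg.solve(A_pp, A_pt @ y)  # E[x_pred | x_train = y]
    post_cov = np.linalg.inv(A_pp)                # Cov[x_pred | x_train = y]
    return post_mean, post_cov
```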
For other hierarchy-based methods, the computational complexity depends exponentially on the dimension d of the dataset. In contrast, because the construction of the ordering and sparsity pattern uses only pairwise distances between points, our algorithms automatically adapt to low-dimensional structure in the data and operate at the complexities obtained by replacing d with the intrinsic dimension d̃ ≤ d of the dataset.
We have seen in Section 5.5.3 that the screening effect deteriorates for independent sums of two GPs, such as when combining a GP with additive Gaussian white noise.
As noted there, this problem could be overcome if we had access to the inverse of the covariance matrices. Analogs of the Cholesky factorization that compute this inverse by recursive Schur complementation exist, but they are too unstable to be useful in practice. The inherent stability of the KL minimization allows us to realize this approach and thus compute both cheaply and accurately with sums of independent Gaussian processes arising, for instance, from modeling measurement noise. To the best of our knowledge, this is the first time this has been achieved by a method based on the screening effect.
Finally, our algorithm is intrinsically parallel because it allows each column of the sparse factor to be computed independently (as in the setting of the Vecchia approximation, factorized sparse approximate inverses, and lower-triangular transport maps). Furthermore, we show that in the context of GP regression, the log-likelihood, the posterior mean, and the posterior variance can be computed in O(N + ρ^{2d}) space complexity. In a parallel setting, we require O(ρ^d) communication between the different workers for every O(ρ^{3d}) floating-point operations, resulting in a total communication complexity of O(N). Here, most of the floating-point operations arise from calls to highly optimized BLAS and LAPACK routines.
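For instance, the Gaussian log-likelihood of observations y ~ N(0, Θ) is available directly from the factor, since y^⊤ Θ^{-1} y ≈ ‖L^⊤ y‖^2 and log det Θ ≈ −2 Σ_i log L_ii. A short sketch, assuming L is stored as a SciPy sparse matrix or a NumPy array:

```python
import numpy as np

def gaussian_loglikelihood(L, y):
    """log N(y | 0, Θ) using a factor L with L @ L.T ≈ Θ^{-1}."""
    N = y.shape[0]
    quad = float(np.sum(np.asarray(L.T @ y) ** 2))                  # y^T Θ^{-1} y ≈ ||L^T y||^2
    logdet_theta = -2.0 * np.sum(np.log(np.asarray(L.diagonal())))  # log det Θ ≈ -2 Σ log L_ii
    return -0.5 * (quad + logdet_theta + N * np.log(2.0 * np.pi))
```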