Chapter V: Incomplete Cholesky factorization
5.1 Zero fill-in incomplete Cholesky factorization
In Chapter 4, we have shown that discretized Green's matrices of elliptic PDEs, as well as their inverses, have exponentially decaying Cholesky factors. By setting small entries to zero, we can approximate these Cholesky factors with exponential accuracy by sparse matrices with near-linearly many nonzero entries. However, computing the exact Cholesky factor in the first place requires access to all $O(N^2)$ nonzero entries and has computational cost $O(N^3)$, and is therefore not practical.
A simple approach to decreasing the computational complexity of Cholesky factorization is the zero fill-in incomplete Cholesky factorization [170] (ICHOL(0)).
When performing Gaussian elimination using ICHOL(0), we treat all entries of both the input matrix and the output factors outside a prescribed sparsity pattern $S \subset I \times I$ as zero and correspondingly ignore all operations in which they are involved.
5.1.2 Complexity vs accuracy
It is well known that the computational complexity of ICHOL(0) can be bounded in terms of the maximal number of elements of $S$ either per row or per column:
Theorem 14. If $m$ upper bounds the number of elements of (the lower triangular part of) $S$ in either each row or each column, then Algorithm 3 has time complexity $O(N m^2)$.
Proof. If $m$ upper bounds the number of elements per column, we see that the number of updates performed for a given $i$ is upper bounded by the number of pairs $(k, j)$ for which $(k, i)$ and $(j, i)$ are contained in $S$. This number is upper bounded by $m^2$, leading to a complexity of $O(N m^2)$. If $m$ upper bounds the number of elements per row, then the number of updates affecting a given row $k$ is upper bounded by the number of pairs $(i, j)$ for which both $(k, i)$ and $(k, j)$ are part of the sparsity pattern. This number is bounded by $m^2$, leading to a complexity of $O(N m^2)$.
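As an illustrative special case (not one of the examples treated in this chapter), consider a banded pattern $S = \{(i, j) : |i - j| \leq m\}$: each column of its lower triangular part has at most $m + 1$ entries, and Theorem 14 recovers the classical $O(N m^2)$ cost of banded Cholesky factorization, as opposed to the $O(N^3)$ cost of the dense factorization.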
We begin by providing a bound on the computational complexity of ICHOL(0) in the coarse-to-fine ordering.
Algorithm 3 Incomplete Cholesky factorization with sparsity pattern $S$.
Input: $A \in \mathbb{R}^{N \times N}$ symmetric, $\mathrm{nz}(A) \subset S$. Output: $L \in \mathbb{R}^{N \times N}$ lower triangular, $\mathrm{nz}(L) \subset S$.
1: for $(i, j) \notin S$ do
2:     $A_{ij} \leftarrow 0$
3: end for
4: for $i \in \{1, \ldots, N\}$ do
5:     $L_{:,i} \leftarrow A_{:,i} / \sqrt{A_{ii}}$
6:     for $j \in \{i+1, \ldots, N\} : (j, i) \in S$ do
7:         for $k \in \{j, \ldots, N\} : (k, i), (k, j) \in S$ do
8:             $A_{kj} \leftarrow A_{kj} - \frac{A_{ki} A_{ji}}{A_{ii}}$
9:         end for
10:     end for
11: end for
12: return $L$
Figure 5.1: ICHOL(0). Incomplete Cholesky factorization, with the differences to ordinary Cholesky factorization (Algorithm 1) highlighted in red. Here, for a matrix $A$, $\mathrm{nz}(A) := \{(i, j) \mid A_{ij} \neq 0\}$ denotes the index set of its nonzero entries.
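For readers who prefer executable code, the following is a minimal Python/NumPy sketch of Algorithm 3 operating on dense arrays, with the sparsity pattern passed as an explicit set of index pairs. The function name ichol0 and the dense-array representation are illustrative choices rather than the data structures used in this thesis.

import numpy as np

def ichol0(A, S):
    """Right-looking incomplete Cholesky factorization (Algorithm 3), dense-array sketch.

    A : symmetric positive-definite matrix (NumPy array).
    S : set of index pairs (k, j) with k >= j describing the lower-triangular
        sparsity pattern; assumed to contain the diagonal. Entries of A outside
        S are treated as zero, as in lines 1-3 of Algorithm 3.
    """
    A = A.astype(float).copy()
    N = A.shape[0]
    L = np.zeros_like(A)
    # Lines 1-3: treat all entries outside the sparsity pattern as zero.
    for k in range(N):
        for j in range(k + 1):
            if (k, j) not in S:
                A[k, j] = 0.0
    # Lines 4-11: eliminate column by column.
    for i in range(N):
        L[i:, i] = A[i:, i] / np.sqrt(A[i, i])
        for j in range(i + 1, N):
            if (j, i) not in S:
                continue
            for k in range(j, N):
                if (k, i) in S and (k, j) in S:
                    A[k, j] -= A[k, i] * A[j, i] / A[i, i]
    return L

With the full lower-triangular pattern, ichol0 reduces to an (inefficient) dense Cholesky factorization, which provides a convenient correctness check.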
Theorem 15. In the setting of Examples 1 and 2, there exists a constant $C(d, \delta)$ such that, for
$$S \subset \left\{(i, j) \,\middle|\, i \in J^{(k)},\, j \in J^{(l)},\, h^{-\min(k,l)} \operatorname{dist}\!\left(\operatorname{supp}(\phi_i), \operatorname{supp}(\phi_j)\right) \leq \rho\right\}, \qquad (5.1)$$
the application of Algorithm 3 in a coarse-to-fine ordering has computational complexity $C(d, \delta) N q \rho^{d}$ in space and $C(d, \delta) N q^{2} \rho^{2d}$ in time. In particular, $q \leq \log N / \ln(1/h)$ implies the upper bounds of $C(d, \delta, h) N \rho^{d} \log N$ on the space complexity and of $C(d, \delta, h) N \rho^{2d} \log^{2} N$ on the time complexity. These complexities apply in particular to ICHOL(0) in the maximin ordering and sparsity pattern.
Proof. For $i \in J^{(k)}$, $j \in J^{(l)}$, write $d(i, j) := h^{-\min(k,l)} \operatorname{dist}\!\left(\operatorname{supp}(\phi_i), \operatorname{supp}(\phi_j)\right)$. We write $i \prec j$ if $i$ precedes $j$ in the coarse-to-fine ordering.
Defining $m := \max_{j \in I,\, 1 \leq k \leq q} \#\{i \in J^{(k)} \mid i \prec j \text{ and } d(i, j) \leq \rho\}$, the separation $|x_i - x_j| \geq \delta h^{k}$ for $i, j \in I^{(k)}$ implies that $m \leq C(d, \delta) \rho^{d}$. Therefore $\#\{i \in I \mid i \prec j \text{ and } d(i, j) \leq \rho\} \leq q m \leq C(d, \delta) q \rho^{d}$, which implies the bound on the space complexity. The bound on the time complexity follows from Theorem 14.
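To make the ordering and sparsity pattern referred to above concrete, the following Python sketch computes a maximin ordering and its length scales by brute force and then forms a lower-triangular pattern that keeps, in each column, the later points lying within $\rho$ times the column's length scale. The choice of starting point, the handling of the first length scale, and the quadratic-cost distance computations are simplifying assumptions; the constructions in this thesis also account for distances to the boundary and attain near-linear complexity.

import numpy as np

def maximin_ordering(x):
    """Brute-force maximin ordering of points x with shape (N, d).

    Each selected point maximizes its distance to the already selected ones;
    l[i] records that distance and plays the role of the length scale of x_i.
    """
    N = x.shape[0]
    D = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    order = [0]                      # assumption: start from the first point
    l = np.full(N, np.inf)           # the first point keeps the coarsest scale
    mindist = D[0].copy()
    for _ in range(1, N):
        i = int(np.argmax(mindist))
        order.append(i)
        l[i] = mindist[i]
        mindist = np.minimum(mindist, D[i])
    return order, l

def coarse_to_fine_pattern(x, order, l, rho):
    """Lower-triangular pattern in the reordered indexing: position (a, b) with
    b <= a is kept when the point of row a lies within rho * l[j] of the point
    j = order[b] owning column b (the earlier, coarser point)."""
    D = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    S = set()
    for a, i in enumerate(order):
        for b in range(a + 1):
            j = order[b]
            if D[i, j] <= rho * l[j]:
                S.add((a, b))
    return S

Applying ichol0 from the sketch above to the correspondingly permuted matrix A[np.ix_(order, order)] with this pattern then mimics the factorization analyzed in Theorem 15.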
When using the reverse ordering and sparsity pattern, the computational complexity is even lower.
Theorem 16. In the setting of Example 1, there exists a constant $C(d, \delta)$ such that, for
$$S \subset \left\{(i, j) \,\middle|\, i \in J^{(k)},\, j \in J^{(l)},\, h^{-\max(k,l)} \operatorname{dist}\!\left(\operatorname{supp}(\phi_i), \operatorname{supp}(\phi_j)\right) \leq \rho\right\}, \qquad (5.2)$$
the application of Algorithm 3 in a fine-to-coarse ordering has computational complexity $C(d, \delta) N \rho^{d}$ in space and $C(d, \delta) N \rho^{2d}$ in time. In particular, the above complexities apply to ICHOL(0) in the reverse maximin ordering and sparsity pattern. In the setting of Example 2, the complexities are $C(d, \delta) N \max\!\left(\log N, \rho^{d}\right)$ in space and $C(d, \delta) N \max\!\left(\log^{2} N, \rho^{2d}\right)$ in time.
Proof. For $i \in J^{(k)}$, $j \in J^{(l)}$, write $d(i, j) := h^{-\max(k,l)} \operatorname{dist}\!\left(\operatorname{supp}(\phi_i), \operatorname{supp}(\phi_j)\right)$. We write $i \prec j$ if $i$ precedes $j$ in the fine-to-coarse ordering.
In the setting of Example 1, for $i \in J^{(k)}$, the $i$-th column has $\lesssim \rho^{d}$ nonzero entries, since it contains only points within radius $\rho h^{k}$, while all points succeeding $i$ in the reverse maximin ordering have a distance of at least $\delta h^{k}$ from each other. In the setting of Example 2, the $i$-th column also has at least a constant number of elements in each of the subsequent levels, which is why, even for constant $\rho$, the number of elements per column, and thus the complexity of ICHOL(0), will still scale logarithmically with $N$. Together with Theorem 14, these bounds on the number of nonzero entries per column yield the claimed complexities.
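Continuing the sketch given after Theorem 15, a fine-to-coarse variant is obtained by simply reversing the maximin ordering; keeping, in each column, the later (coarser) points within $\rho$ times the column's length scale then measures distances at the finer of the two scales, in the spirit of (5.2). As before, this is a simplified stand-in for the reverse maximin ordering and sparsity pattern, and maximin_ordering refers to the helper defined above.

import numpy as np

def fine_to_coarse_pattern(x, order, l, rho):
    """Lower-triangular pattern in the reversed (fine-to-coarse) indexing:
    column b belongs to the fine point j = rev[b], and row a >= b is kept when
    the (coarser) point of row a lies within rho * l[j] of it."""
    rev = order[::-1]
    D = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    S = set()
    for a, i in enumerate(rev):
        for b in range(a + 1):
            j = rev[b]
            if D[i, j] <= rho * l[j]:
                S.add((a, b))
    return S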
We can now make a first attempt at combining the estimates on computational complexity and accuracy.
Theorem 17. In the setting of Theorems 9 and 10, there exist constants $C, c$ depending only on $d$, $\Omega$, $s$, $\|\mathcal{L}\|$, $\|\mathcal{L}^{-1}\|$, $h$, and $\delta$ such that for every $\rho > c \log(N)$ there exists a perturbation $E \in \mathbb{R}^{I \times I}$ with $\log\left(\|E\|_{\mathrm{Fro}}\right) \leq -C\rho$ such that applying Algorithm 3 to the $E$-perturbed input matrix, in the coarse-to-fine [fine-to-coarse] ordering and with a sparsity set $S$ satisfying
$$\{(i, j) \mid \operatorname{dist}(i, j) \leq \rho/2\} \subset S \subset \{(i, j) \mid \operatorname{dist}(i, j) \leq 2\rho\}, \qquad (5.3)$$
returns a Cholesky factor $L^{\rho}$ satisfying
$$\log\left(\left\|L L^{\top} - L^{\rho} L^{\rho,\top}\right\|_{\mathrm{Fro}}\right) \leq -C\rho. \qquad (5.4)$$
Here, $L$ is the exact Cholesky factor. Analogous results hold in the setting of Theorem 11, when using the maximin [reverse maximin] ordering and sparsity pattern.
Proof. Writing $E := L L^{\top} - L^{\rho} L^{\rho,\top}$, the results of Theorems 9 and 10 show that $\log\left(\|E\|_{\mathrm{Fro}}\right) \leq -C\rho$. The Cholesky factor of $L L^{\top} - E$ is equal to $L^{\rho}$ and is thus recovered exactly by Algorithm 3 when applied to the $E$-perturbed input matrix. The results on the computational complexity follow from Theorems 15 and 16.
If we were able to show that the incomplete Cholesky factorization is stable, in the sense that an exponentially small perturbation of the input matrix to ICHOL(0) leads to a perturbation of its output that is amplified by a factor at most polynomial in the size of the matrix, we could use Theorem 17 to obtain a rigorous result on the complexity vs accuracy tradeoff. Unfortunately, despite overwhelming evidence that this is true for the problems considered in this work, we are not able to prove this result in general. We also did not find such a result in the literature, even though works such as [94] would require it to provide rigorous complexity vs accuracy bounds.
We will, however, show that it holds true for a slightly modified ordering and sparsity pattern that is obtained when introducing "supernodes" and multicolor orderings.
Since these techniques are also useful from a computational standpoint, we will first introduce them as part of a description detailing the efficient implementation of our methods, together with numerical results. We will then address and resolve the question of stability again in Section 5.4.3.
5.1.3 Efficient implementation of ICHOL(0)
Theorems 15 and 16 show that, in the limit of $N$ going to infinity, using Algorithm 3 with a sparsity pattern satisfying (5.3) is vastly cheaper than computing a dense Cholesky factorization. However, if we were to naïvely implement Algorithm 3 using a standard compressed sparse row representation of $S$, we would see that the number of floating-point operations per second that our algorithm can perform is much lower than that of good implementations of dense Cholesky factorization.
This has three main reasons: irregular memory access, lack of memory re-use, and parallelism.
Irregular memory access. While the random access memory (RAM) of present-day computers allows memory to be accessed in arbitrary order, the latency of random memory access is often substantial. In order to mitigate this effect, present-day computers have a hierarchy of caches, smaller units of memory that can be accessed much more rapidly. As it is loading data from RAM, the computer constantly tries to predict what data could be requested next and preemptively loads it into the cache. So-called cache misses, instances where data requested by the CPU is not preloaded in the cache but instead has to be loaded from RAM, can lead to a dramatic regression in performance. Arguably the most fundamental heuristic used to predict which data will be required next is known as spatial locality, meaning that whenever we request data from RAM, the data immediately before and after will be pre-fetched into the cache as well. Thus, accessing data in linear order is one of the most reliable ways to avoid cache misses. Without further thought, Algorithm 3 has poor spatial locality. For instance, if we implement it using a compressed sparse row (CSR) format to store (the lower triangular part of) $A$, each iteration of the inner loop over $k$ will access a different row of the matrix that could be located in a part of memory far away from where the previous row was stored. Therefore, every single update could access memory in a different location that is unlikely to have been pre-fetched, resulting in a large number of cache misses.
Our main remedy to this problem will be to change the order of the three for loops in Algorithm 3, resulting in the left-looking and up-looking variants of Cholesky factorization, as opposed to the right-looking variant presented in Algorithm 3.
Choosing the appropriate combination of direction (up, left, or right) and storage format (compressed sparse rows or compressed sparse columns) can greatly decrease the running time of our algorithms.
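To illustrate the effect of reordering the loops, the following sketch shows the up-looking (row-by-row) variant on dense matrices; the incomplete, pattern-restricted versions follow the same loop structure but skip entries outside $S$. This is a generic textbook variant rather than the implementation used in this thesis.

import numpy as np

def cholesky_up_looking(A):
    """Up-looking Cholesky: the i-th row of L is computed from the already
    finished rows 0..i-1, so reads and writes sweep the factor row by row."""
    N = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for i in range(N):
        for j in range(i):
            L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
        L[i, i] = np.sqrt(A[i, i] - L[i, :i] @ L[i, :i])
    return L

When the factor is stored in a compressed sparse row format, this loop order reads each previously computed row contiguously, which is exactly the kind of access pattern that cache prefetching rewards.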
Lack of memory re-use. Even if we ignore questions of cache misses and memory latency, there is still the issue of memory bandwidth. In Algorithm 3, for each loading of $A_{kj}$, $A_{ki}$, and $A_{ji}$, only a minuscule amount of computation is performed. $A_{kj}$ in particular will only be used for a single computation, and it can take arbitrarily long until its next appearance in the computation. This poor temporal locality of Algorithm 3 means that, even with perfect prefetching into the cache, the computation might be limited by the bandwidth with which memory can be loaded into the cache, rather than by the clock speed of the CPU. In contrast, the canonical example of good memory re-use is blocked matrix-matrix multiplication: the product between two blocks of size $n \times n$ that are small enough to fit into cache requires loading only $\approx n^2$ floating-point numbers into the cache in order to perform $\approx n^3$ floating-point operations.
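As an illustration of the blocking idea, independent of the code used in this thesis, the following sketch multiplies two matrices block by block; the block size b is a tuning parameter chosen so that a few b-by-b blocks fit into cache.

import numpy as np

def blocked_matmul(A, B, b=64):
    """Cache-blocked matrix multiply: each pair of b-by-b blocks brought into
    cache is reused for roughly b**3 floating-point operations."""
    n, m = A.shape
    m2, p = B.shape
    assert m == m2
    C = np.zeros((n, p))
    for i0 in range(0, n, b):
        for k0 in range(0, m, b):
            A_blk = A[i0:i0 + b, k0:k0 + b]   # reused across the entire j-loop
            for j0 in range(0, p, b):
                C[i0:i0 + b, j0:j0 + b] += A_blk @ B[k0:k0 + b, j0:j0 + b]
    return C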
We will mitigate this problem by introducing supernodes [158, 209], consisting of successive rows and columns that share the same sparsity pattern. Due to the flexibility for reordering rows and columns within each level of the multiresolution scheme and the geometry of the associated measurements, we can identify supernodes of size $\approx \rho^{d}$ by only marginally increasing the size of the sparsity pattern. Once the supernodes have been identified, Algorithm 3 can be reexpressed as a smaller number of dense linear algebra operations that feature substantially more data re-use.
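One simple way to detect such groups, sketched below, is to merge consecutive columns (in elimination order) whose below-diagonal patterns coincide once the rows inside the block itself are discounted. The thesis identifies supernodes geometrically from the multiresolution structure, so the following is only an illustration of the concept; col_patterns is a hypothetical list holding, for each column j, the set of row indices k >= j in the sparsity pattern.

def identify_supernodes(col_patterns):
    """Group consecutive columns into supernodes: column j joins the supernode
    of column j-1 if its pattern equals that of column j-1 with row j-1 removed."""
    N = len(col_patterns)
    supernodes, start = [], 0
    for j in range(1, N):
        if col_patterns[j] != col_patterns[j - 1] - {j - 1}:
            supernodes.append(list(range(start, j)))
            start = j
    supernodes.append(list(range(start, N)))
    return supernodes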
Parallelism. Modern CPUs have multiple cores that have access to the same memory, but can perform tasks independently. Optimized implementations of dense Cholesky factorization are able to exploit this so-called shared memory parallelism by identifying subproblems that can be solved in parallel and assigning them to the different cores of the CPU. For large computations, distributed parallelism is necessary, where subproblems are solved by different computers that are connected to form a so-called cluster. The amount of parallelism available on modern computational platforms is increasing steadily. Thus, for an algorithm to be practically viable, it is crucial that it allows for the efficient use of parallel computation.
We will exploit parallelism by observing that rows and columns with nonoverlapping sparsity patterns can be eliminated in parallel. Using once again the fact that we have complete freedom to choose the ordering within each level, we can use a multicolor ordering [3, 4, 74, 192], inspired by the classical red-black ordering [127], in order to maximize the number of operations that can be performed in parallel.
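The greedy sketch below assigns colors so that columns sharing a color have pairwise nonoverlapping sparsity patterns and can therefore be eliminated simultaneously. The quadratic pairwise overlap check is a placeholder for the geometric reasoning used in this thesis, where basis functions on the same level that are sufficiently far apart are known not to interact; col_patterns has the same hypothetical meaning as in the supernode sketch.

def greedy_multicoloring(col_patterns):
    """Assign each column the smallest color not used by an earlier column whose
    sparsity pattern overlaps its own; equal-colored columns can run in parallel."""
    N = len(col_patterns)
    colors = [0] * N
    for j in range(N):
        taken = {colors[i] for i in range(j) if col_patterns[i] & col_patterns[j]}
        c = 0
        while c in taken:
            c += 1
        colors[j] = c
    return colors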
5.2 Implementation of ICHOL(0) for dense kernel matrices