Chapter V: Incomplete Cholesky factorization
5.1 Zero fill-in incomplete Cholesky factorization
In Chapter 4, we have shown that discretized Green's matrices of elliptic PDEs, as well as their inverses, have exponentially decaying Cholesky factors. By setting small entries to zero, we can approximate these Cholesky factors with exponential accuracy by sparse matrices with near-linearly many nonzero entries. However, computing the exact Cholesky factor in the first place requires access to all $O(N^2)$ nonzero entries and has computational cost $O(N^3)$, and is therefore not practical.
A simple approach to decreasing the computational complexity of Cholesky factorization is the zero fill-in incomplete Cholesky factorization [170] (ICHOL(0)).
When performing Gaussian elimination using ICHOL(0), we treat all entries of both the input matrix and the output factors outside a prescribed sparsity pattern $S \subset I \times I$ as zero and correspondingly ignore all operations in which they are involved.
5.1.2 Complexity vs accuracy
It is well known that the computational complexity of ICHOL(0) can be bounded in terms of the maximal number of elements of $S$ either per row or per column:
Theorem 14. If $m$ upper bounds the number of elements of (the lower triangular part of) $S$ in either each row or each column, then Algorithm 3 has time complexity $O(N m^2)$.
Proof. If $m$ upper bounds the number of elements per column, we see that the number of updates performed for a given $i$ is upper bounded by the number of pairs $(k, j)$ for which $(k, i)$ and $(j, i)$ are contained in $S$. This number is upper bounded by $m^2$, leading to a complexity of $O(N m^2)$. If $m$ upper bounds the number of elements per row, then the number of updates affecting a given row $k$ is upper bounded by the number of pairs $(i, j)$ for which both $(k, i)$ and $(k, j)$ are part of the sparsity pattern. This number is bounded by $m^2$, leading to a complexity of $O(N m^2)$.
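As an illustrative special case (not one of the examples treated in this chapter), consider a banded pattern $S = \{(i, j) : |i - j| \leq m\}$: each column of its lower triangular part has at most $m + 1$ entries, and Theorem 14 recovers the classical $O(N m^2)$ cost of banded Cholesky factorization, as opposed to the $O(N^3)$ cost of the dense factorization.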
We begin by providing a bound on the computational complexity of ICHOL(0) in the coarse-to-fine ordering.
Algorithm 3 Incomplete Cholesky factorization with sparsity pattern $S$.
Input: $A \in \mathbb{R}^{N \times N}$ symmetric, $\mathrm{nz}(A) \subset S$. Output: $L \in \mathbb{R}^{N \times N}$ lower triangular, $\mathrm{nz}(L) \subset S$.
1: for $(i, j) \notin S$ do
2:     $A_{ij} \leftarrow 0$
3: end for
4: for $i \in \{1, \ldots, N\}$ do
5:     $L_{:,i} \leftarrow A_{:,i} / \sqrt{A_{ii}}$
6:     for $j \in \{i+1, \ldots, N\} : (j, i) \in S$ do
7:         for $k \in \{j, \ldots, N\} : (k, i), (k, j) \in S$ do
8:             $A_{kj} \leftarrow A_{kj} - \frac{A_{ki} A_{ji}}{A_{ii}}$
9:         end for
10:     end for
11: end for
12: return $L$
Figure 5.1: ICHOL(0). Incomplete Cholesky factorization, with the differences to ordinary Cholesky factorization (Algorithm 1) highlighted in red. Here, for a matrix $A$, $\mathrm{nz}(A) := \{(i, j) \mid A_{ij} \neq 0\}$ denotes the index set of its nonzero entries.
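For readers who prefer executable code, the following is a minimal Python/NumPy sketch of Algorithm 3 operating on dense arrays, with the sparsity pattern passed as an explicit set of index pairs. The function name ichol0 and the dense-array representation are illustrative choices rather than the data structures used in this thesis.

import numpy as np

def ichol0(A, S):
    """Right-looking incomplete Cholesky factorization (Algorithm 3), dense-array sketch.

    A : symmetric positive-definite matrix (NumPy array).
    S : set of index pairs (k, j) with k >= j describing the lower-triangular
        sparsity pattern; assumed to contain the diagonal. Entries of A outside
        S are treated as zero, as in lines 1-3 of Algorithm 3.
    """
    A = A.astype(float).copy()
    N = A.shape[0]
    L = np.zeros_like(A)
    # Lines 1-3: treat all entries outside the sparsity pattern as zero.
    for k in range(N):
        for j in range(k + 1):
            if (k, j) not in S:
                A[k, j] = 0.0
    # Lines 4-11: eliminate column by column.
    for i in range(N):
        L[i:, i] = A[i:, i] / np.sqrt(A[i, i])
        for j in range(i + 1, N):
            if (j, i) not in S:
                continue
            for k in range(j, N):
                if (k, i) in S and (k, j) in S:
                    A[k, j] -= A[k, i] * A[j, i] / A[i, i]
    return L

With the full lower-triangular pattern, ichol0 reduces to an (inefficient) dense Cholesky factorization, which provides a convenient correctness check.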
Theorem 15. In the setting of Examples 1 and 2, there exists a constant $C(d, \delta)$ such that, for
$$S \subset \left\{(i, j) \,\middle|\, i \in J^{(k)},\, j \in J^{(l)},\, h^{-\min(k,l)} \operatorname{dist}\!\left(\operatorname{supp}(\phi_i), \operatorname{supp}(\phi_j)\right) \leq \rho\right\}, \qquad (5.1)$$
the application of Algorithm 3 in a coarse-to-fine ordering has computational complexity $C(d, \delta) N q \rho^{d}$ in space and $C(d, \delta) N q^{2} \rho^{2d}$ in time. In particular, $q \leq \log N / \ln(1/h)$ implies the upper bounds of $C(d, \delta, h) N \rho^{d} \log N$ on the space complexity and of $C(d, \delta, h) N \rho^{2d} \log^{2} N$ on the time complexity. These complexities apply in particular to ICHOL(0) in the maximin ordering and sparsity pattern.
Proof. For $i \in J^{(k)}$, $j \in J^{(l)}$, write $d(i, j) := h^{-\min(k,l)} \operatorname{dist}\!\left(\operatorname{supp}(\phi_i), \operatorname{supp}(\phi_j)\right)$. We write $i \prec j$ if $i$ precedes $j$ in the coarse-to-fine ordering.
Defining $m := \max_{j \in I,\, 1 \leq k \leq q} \#\{i \in J^{(k)} \mid i \prec j \text{ and } d(i, j) \leq \rho\}$, the separation $|x_i - x_j| \geq \delta h^{k}$ for $i, j \in I^{(k)}$ implies that $m \leq C(d, \delta) \rho^{d}$. Therefore $\#\{i \in I \mid i \prec j \text{ and } d(i, j) \leq \rho\} \leq q m \leq C(d, \delta) q \rho^{d}$, which implies the bound on the space complexity. The bound on the time complexity follows from Theorem 14.
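To make the ordering and sparsity pattern referred to above concrete, the following Python sketch computes a maximin ordering and its length scales by brute force and then forms a lower-triangular pattern that keeps, in each column, the later points lying within $\rho$ times the column's length scale. The choice of starting point, the handling of the first length scale, and the quadratic-cost distance computations are simplifying assumptions; the constructions in this thesis also account for distances to the boundary and attain near-linear complexity.

import numpy as np

def maximin_ordering(x):
    """Brute-force maximin ordering of points x with shape (N, d).

    Each selected point maximizes its distance to the already selected ones;
    l[i] records that distance and plays the role of the length scale of x_i.
    """
    N = x.shape[0]
    D = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    order = [0]                      # assumption: start from the first point
    l = np.full(N, np.inf)           # the first point keeps the coarsest scale
    mindist = D[0].copy()
    for _ in range(1, N):
        i = int(np.argmax(mindist))
        order.append(i)
        l[i] = mindist[i]
        mindist = np.minimum(mindist, D[i])
    return order, l

def coarse_to_fine_pattern(x, order, l, rho):
    """Lower-triangular pattern in the reordered indexing: position (a, b) with
    b <= a is kept when the point of row a lies within rho * l[j] of the point
    j = order[b] owning column b (the earlier, coarser point)."""
    D = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    S = set()
    for a, i in enumerate(order):
        for b in range(a + 1):
            j = order[b]
            if D[i, j] <= rho * l[j]:
                S.add((a, b))
    return S

Applying ichol0 from the sketch above to the correspondingly permuted matrix A[np.ix_(order, order)] with this pattern then mimics the factorization analyzed in Theorem 15.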
When using the reverse ordering and sparsity pattern, the computational complexity is even lower.
Theorem 16. In the setting of Example 1, there exists a constant $C(d, \delta)$ such that, for
$$S \subset \left\{(i, j) \,\middle|\, i \in J^{(k)},\, j \in J^{(l)},\, h^{-\max(k,l)} \operatorname{dist}\!\left(\operatorname{supp}(\phi_i), \operatorname{supp}(\phi_j)\right) \leq \rho\right\}, \qquad (5.2)$$
the application of Algorithm 3 in a fine-to-coarse ordering has computational complexity $C(d, \delta) N \rho^{d}$ in space and $C(d, \delta) N \rho^{2d}$ in time. In particular, the above complexities apply to ICHOL(0) in the reverse maximin ordering and sparsity pattern. In the setting of Example 2, the complexities are $C(d, \delta) N \max\!\left(\log N, \rho^{d}\right)$ in space and $C(d, \delta) N \max\!\left(\log^{2} N, \rho^{2d}\right)$ in time.
Proof. For $i \in J^{(k)}$, $j \in J^{(l)}$, write $d(i, j) := h^{-\max(k,l)} \operatorname{dist}\!\left(\operatorname{supp}(\phi_i), \operatorname{supp}(\phi_j)\right)$. We write $i \prec j$ if $i$ precedes $j$ in the fine-to-coarse ordering.
In the setting of Example 1, for $i \in J^{(k)}$, the $i$-th column has $\lesssim \rho^{d}$ nonzero entries, since it contains only points within radius $\rho h^{k}$, while all points succeeding $i$ in the reverse maximin ordering have a distance of at least $\delta h^{k}$ from each other. In the setting of Example 2, the $i$-th column also has at least a constant number of elements in each of the subsequent levels, which is why, even for constant $\rho$, the number of elements per column, and thus the complexity of ICHOL(0), will still scale logarithmically with $N$. Together with Theorem 14, these bounds on the number of nonzero entries per column yield the claimed complexities.
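Continuing the sketch given after Theorem 15, a fine-to-coarse variant is obtained by simply reversing the maximin ordering; keeping, in each column, the later (coarser) points within $\rho$ times the column's length scale then measures distances at the finer of the two scales, in the spirit of (5.2). As before, this is a simplified stand-in for the reverse maximin ordering and sparsity pattern, and maximin_ordering refers to the helper defined above.

import numpy as np

def fine_to_coarse_pattern(x, order, l, rho):
    """Lower-triangular pattern in the reversed (fine-to-coarse) indexing:
    column b belongs to the fine point j = rev[b], and row a >= b is kept when
    the (coarser) point of row a lies within rho * l[j] of it."""
    rev = order[::-1]
    D = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    S = set()
    for a, i in enumerate(rev):
        for b in range(a + 1):
            j = rev[b]
            if D[i, j] <= rho * l[j]:
                S.add((a, b))
    return S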
We can now make a first attempt at combining the estimates on computational complexity and accuracy.
Theorem 17. In the setting of Theorems 9 and 10, there exist constants $C, c$ depending only on $d$, $\Omega$, $s$, $\|\mathcal{L}\|$, $\|\mathcal{L}^{-1}\|$, $h$, and $\delta$ such that for every $\rho > c \log(N)$ there exists a perturbation $E \in \mathbb{R}^{I \times I}$ with $\log\left(\|E\|_{\mathrm{Fro}}\right) \leq -C\rho$ such that applying Algorithm 3 to the $E$-perturbed input matrix, in the coarse-to-fine [fine-to-coarse] ordering and with a sparsity set $S$ satisfying
$$\{(i, j) \mid \operatorname{dist}(i, j) \leq \rho/2\} \subset S \subset \{(i, j) \mid \operatorname{dist}(i, j) \leq 2\rho\}, \qquad (5.3)$$
returns a Cholesky factor $L^{\rho}$ satisfying
$$\log\left(\left\|L L^{\top} - L^{\rho} L^{\rho,\top}\right\|_{\mathrm{Fro}}\right) \leq -C\rho. \qquad (5.4)$$
Here, $L$ is the exact Cholesky factor. Analogous results hold in the setting of Theorem 11, when using the maximin [reverse maximin] ordering and sparsity pattern.
Proof. Writing $E := L L^{\top} - L^{\rho} L^{\rho,\top}$, the results of Theorems 9 and 10 show that $\log\left(\|E\|_{\mathrm{Fro}}\right) \leq -C\rho$. The Cholesky factor of $L L^{\top} - E$ is equal to $L^{\rho}$ and is thus recovered exactly by Algorithm 3 when applied to the $E$-perturbed input matrix. The results on the computational complexity follow from Theorems 15 and 16.
If we were able to show that the incomplete Cholesky factorization is stable, in the sense that an exponentially small perturbation of the input matrix to ICHOL(0) leads to a perturbation of its output that is amplified by a factor at most polynomial in the size of the matrix, we could use Theorem 17 to obtain a rigorous result on the complexity vs accuracy tradeoff. Unfortunately, despite overwhelming evidence that this is true for the problems considered in this work, we are not able to prove this result in general. We also did not find such a result in the literature, even though works such as [94] would require it to provide rigorous complexity vs accuracy bounds.
We will, however, show that it holds true for a slightly modified ordering and sparsity pattern that is obtained when introducing "supernodes" and multicolor orderings.
Since these techniques are also useful from a computational standpoint, we will first introduce them as part of a description detailing the efficient implementation of our methods, together with numerical results. We will then address and resolve the question of stability again in Section 5.4.3.
5.1.3 Efficient implementation of ICHOL(0)
Theorems 15 and 16 show that, in the limit of $N$ going to infinity, using Algorithm 3 with a sparsity pattern satisfying (5.3) is vastly cheaper than computing a dense Cholesky factorization. However, if we were to naïvely implement Algorithm 3 using a standard compressed sparse row representation of $S$, we would see that the number of floating-point operations per second that our algorithm can perform is much lower than that of good implementations of dense Cholesky factorization.
This has three main reasons: irregular memory access, lack of memory re-use, and parallelism.
Irregular memory access. While the random access memory (RAM) of present-day computers allows memory to be accessed in arbitrary order, the latency of random memory access is often substantial. In order to mitigate this effect, present-day computers have a hierarchy of caches, smaller units of memory that can be accessed much more rapidly. As it is loading data from RAM, the computer constantly tries to predict what data could be requested next and preemptively loads it into the cache. So-called cache misses, instances where data requested by the CPU is not preloaded in the cache but instead has to be loaded from RAM, can lead to a dramatic regression in performance. Arguably the most fundamental heuristic used to predict which data will be required next is known as spatial locality, meaning that whenever we request data from RAM, the data immediately before and after will be pre-fetched into the cache as well. Thus, accessing data in linear order is one of the most reliable ways to avoid cache misses. Without further thought, Algorithm 3 has poor spatial locality. For instance, if we implement it using a compressed sparse row (CSR) format to store (the lower triangular part of) $A$, each iteration of the inner loop over $k$ will access a different row of the matrix that could be located in a part of memory far away from where the previous row was stored. Therefore, every single update could access memory in a different location that is unlikely to have been pre-fetched, resulting in a large number of cache misses.
Our main remedy to this problem will be to change the order of the three for loops in Algorithm 3, resulting in the left-looking and up-looking variants of Cholesky factorization, as opposed to the right-looking variant presented in Algorithm 3.
Choosing the appropriate combination of direction (up, left, or right) and storage format (compressed sparse rows or compressed sparse columns) can greatly decrease the running time of our algorithms.
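To illustrate the effect of reordering the loops, the following sketch shows the up-looking (row-by-row) variant on dense matrices; the incomplete, pattern-restricted versions follow the same loop structure but skip entries outside $S$. This is a generic textbook variant rather than the implementation used in this thesis.

import numpy as np

def cholesky_up_looking(A):
    """Up-looking Cholesky: the i-th row of L is computed from the already
    finished rows 0..i-1, so reads and writes sweep the factor row by row."""
    N = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for i in range(N):
        for j in range(i):
            L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
        L[i, i] = np.sqrt(A[i, i] - L[i, :i] @ L[i, :i])
    return L

When the factor is stored in a compressed sparse row format, this loop order reads each previously computed row contiguously, which is exactly the kind of access pattern that cache prefetching rewards.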
Lack of memory re-use. Even if we ignore questions of cache misses and memory latency, there is still the issue of memory bandwidth. In Algorithm 3, for each loading of $A_{kj}$, $A_{ki}$, and $A_{ji}$, only a minuscule amount of computation is performed. $A_{kj}$ in particular will only be used for a single computation, and it can take arbitrarily long until its next appearance in the computation. This poor temporal locality of Algorithm 3 means that, even with perfect prefetching into the cache, the computation might be limited by the bandwidth with which memory can be loaded into the cache, rather than by the clock speed of the CPU. In contrast, the canonical example of good memory re-use is blocked matrix-matrix multiplication: the product between two blocks of size $n \times n$ that are small enough to fit into cache requires loading only $\approx n^2$ floating-point numbers into the cache in order to perform $\approx n^3$ floating-point operations.
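As an illustration of the blocking idea, independent of the code used in this thesis, the following sketch multiplies two matrices block by block; the block size b is a tuning parameter chosen so that a few b-by-b blocks fit into cache.

import numpy as np

def blocked_matmul(A, B, b=64):
    """Cache-blocked matrix multiply: each pair of b-by-b blocks brought into
    cache is reused for roughly b**3 floating-point operations."""
    n, m = A.shape
    m2, p = B.shape
    assert m == m2
    C = np.zeros((n, p))
    for i0 in range(0, n, b):
        for k0 in range(0, m, b):
            A_blk = A[i0:i0 + b, k0:k0 + b]   # reused across the entire j-loop
            for j0 in range(0, p, b):
                C[i0:i0 + b, j0:j0 + b] += A_blk @ B[k0:k0 + b, j0:j0 + b]
    return C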
We will mitigate this problem by introducing supernodes [158, 209], consisting of successive rows and columns that share the same sparsity pattern. Due to the flexibility for reordering rows and columns within each level of the multiresolution scheme and the geometry of the associated measurements, we can identify supernodes of size $\approx \rho^{d}$ by only marginally increasing the size of the sparsity pattern. Once the supernodes have been identified, Algorithm 3 can be reexpressed as a smaller number of dense linear algebra operations that feature substantially more data re-use.
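One simple way to detect such groups, sketched below, is to merge consecutive columns (in elimination order) whose below-diagonal patterns coincide once the rows inside the block itself are discounted. The thesis identifies supernodes geometrically from the multiresolution structure, so the following is only an illustration of the concept; col_patterns is a hypothetical list holding, for each column j, the set of row indices k >= j in the sparsity pattern.

def identify_supernodes(col_patterns):
    """Group consecutive columns into supernodes: column j joins the supernode
    of column j-1 if its pattern equals that of column j-1 with row j-1 removed."""
    N = len(col_patterns)
    supernodes, start = [], 0
    for j in range(1, N):
        if col_patterns[j] != col_patterns[j - 1] - {j - 1}:
            supernodes.append(list(range(start, j)))
            start = j
    supernodes.append(list(range(start, N)))
    return supernodes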
Parallelism. Modern CPUs have multiple cores that have access to the same memory, but can perform tasks independently. Optimized implementations of dense Cholesky factorization are able to exploit this so-called shared memory parallelism by identifying subproblems that can be solved in parallel and assigning them to the different cores of the CPU. For large computations, distributed parallelism is necessary, where subproblems are solved by different computers that are connected to form a so-called cluster. The amount of parallelism available on modern computational platforms is increasing steadily. Thus, for an algorithm to be practically viable, it is crucial that it allows for the efficient use of parallel computation.
We will exploit parallelism by observing that rows and columns with nonoverlapping sparsity patterns can be eliminated in parallel. Using once again the fact that we have complete freedom to choose the ordering within each level, we can use a multicolor ordering [3, 4, 74, 192], inspired by the classical red-black ordering [127], in order to maximize the number of operations that can be performed in parallel.
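The greedy sketch below assigns colors so that columns sharing a color have pairwise nonoverlapping sparsity patterns and can therefore be eliminated simultaneously. The quadratic pairwise overlap check is a placeholder for the geometric reasoning used in this thesis, where basis functions on the same level that are sufficiently far apart are known not to interact; col_patterns has the same hypothetical meaning as in the supernode sketch.

def greedy_multicoloring(col_patterns):
    """Assign each column the smallest color not used by an earlier column whose
    sparsity pattern overlaps its own; equal-colored columns can run in parallel."""
    N = len(col_patterns)
    colors = [0] * N
    for j in range(N):
        taken = {colors[i] for i in range(j) if col_patterns[i] & col_patterns[j]}
        c = 0
        while c in taken:
            c += 1
        colors[j] = c
    return colors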
5.2 Implementation of ICHOL(0) for dense kernel matrices