Chapter V: Incomplete Cholesky factorization
We will mitigate this problem by introducing supernodes [158, 209], consisting of successive rows and columns that share the same sparsity pattern. Due to the flexibility of reordering rows and columns within each level of the multiresolution scheme and the geometry of the associated measurements, we can identify supernodes of size $\approx \rho^d$ while only marginally increasing the size of the sparsity pattern.
Once the supernodes have been identified, Algorithm 3 can be reexpressed as a smaller number of dense linear algebra operations that feature substantially more data re-use.
Parallelism. Modern CPUs have multiple cores that have access to the same memory but can perform tasks independently. Optimized implementations of dense Cholesky factorization are able to exploit this so-called shared-memory parallelism by identifying subproblems that can be solved in parallel and assigning them to the different cores of the CPU. For large computations, distributed parallelism is necessary, where subproblems are solved by different computers that are connected to form a so-called cluster. The amount of parallelism available on modern computational platforms is increasing steadily. Thus, for an algorithm to be practically viable, it is crucial that it allows for the efficient use of parallel computation.
We will exploit parallelism by observing that rows and columns with nonoverlapping sparsity patterns can be eliminated in parallel. Using once again the fact that we have complete freedom to choose the ordering within each level, we can use a multicolor ordering [3, 4, 74, 192] inspired by the classical red-black ordering [127] in order to maximize the number of operations that can be performed in parallel.
5.2 Implementation of ICHOL(0) for dense kernel matrices
In Figure 5.2, we describe the algorithm for computing the up-looking Cholesky factorization in the CSR format. We observe that this variant shows significantly improved performance in practice. A possible approach to further improving its performance would be to reorder the degrees of freedom on each level in a way that increases locality, for instance ordering them along a Z-curve.
5.2.2 Supernodes
We will now discuss a supernodal implementation appropriate for Cholesky factorization of $\Theta$ in a coarse-to-fine ordering. In order to improve readability, we do not provide rigorous proofs in this section and refer to the more technical discussion in Section 5.4.3.
The main idea of our implementation of supernodes is to aggregate degrees of freedom with length-scale $\approx h$ that are within a ball of radius $\rho h$ and appear consecutively in the elimination ordering into a so-called supernode.
Definition 9. Let $\{x_i\}_{1 \leq i \leq N} \subset \mathbb{R}^d$. A supernodal aggregation of scale $\ell$ is given by a set $\{\tilde{x}_{\tilde{i}}\}_{1 \leq \tilde{i} \leq \tilde{N}} \subset \mathbb{R}^d$ and an assignment of each point $x_i$ to exactly one supernode $\tilde{x}_{\tilde{i}}$. We denote this assignment as $i \rightsquigarrow \tilde{i}$ and require that for $i \rightsquigarrow \tilde{i}$, we have $\operatorname{dist}(x_i, \tilde{x}_{\tilde{i}}) \leq \ell$.

Given $\{x_i\}_{1 \leq i \leq N} \subset \mathbb{R}^d$, such an aggregation can be constructed efficiently using a greedy algorithm.
Construction 1 (Supernodal aggregation). We successively select the first $x_i$ that is not yet within range of one of the existing supernodes and make it a new supernode, until all $x_i$ are within a distance of $\ell$ of at least one supernode. Then, we assign each $x_i$ to the closest supernode, using an arbitrary mechanism to break ties.
Construction 1 can be implemented efficiently using, for instance, KD-tree-based range search. Under mild regularity conditions, such as the one given in Equation (4.6), the number of resulting supernodes is bounded as $N/\ell^d$.
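As an illustration, the following is a minimal Python sketch of Construction 1; the function name, array layout, and the use of scipy.spatial.cKDTree for the range searches are our own choices rather than part of the text.

    import numpy as np
    from scipy.spatial import cKDTree

    def supernodal_aggregation(x, ell):
        # Greedy supernodal aggregation of scale ell (Construction 1).
        # x: (N, d) array of points. Returns the supernode centers and the
        # assignment of every point to its closest supernode.
        tree = cKDTree(x)
        covered = np.zeros(len(x), dtype=bool)
        centers = []
        # Pass 1: every point not yet within range of an existing
        # supernode becomes a new supernode center.
        for i in range(len(x)):
            if not covered[i]:
                centers.append(x[i])
                covered[tree.query_ball_point(x[i], ell)] = True
        centers = np.array(centers)
        # Pass 2: assign each point to its closest center; the KD-tree
        # query breaks ties arbitrarily.
        assign = cKDTree(centers).query(x)[1]
        return centers, assign

The first pass guarantees that every point lies within $\ell$ of some center, so the closest-center assignment in the second pass satisfies the requirement of Definition 9.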
In a multi-scale setting, we will use separate aggregation scales for each level:
Definition 10. For $1 \leq k \leq q$, let $\{x^{(k)}_i\}_{1 \leq i \leq N^{(k)}} \subset \mathbb{R}^d$. A multiscale supernodal aggregation with scales $\{\ell^{(k)}\}_{1 \leq k \leq q}$ is given by the union of the supernodal aggregations for each $\{x^{(k)}_i\}_{1 \leq i \leq N^{(k)}}$ with the respective length scale $\ell^{(k)}$.
Once a multiscale supernodal aggregation is available, we can use it to derive associated orderings and sparsity patterns.
Algorithm 4 In-place, up-looking ICHOL(0) in CSR format
Input: L ∈ ℝ^(N×N) lower triangular, as CSR with rowptr, colval, nzval
1: for i = 1 : N do
2:   for j̃ = rowptr[i] : (rowptr[i+1] − 1) do
3:     j ← colval[j̃]
4:     nzval[j̃] ← nzval[j̃] − dot(i, j, rowptr, colval, nzval)
5:     if j < i then
6:       nzval[j̃] ← nzval[j̃] / nzval[rowptr[j+1] − 1]
7:     else
8:       nzval[j̃] ← √(nzval[j̃])
9:     end if
10:   end for
11: end for
Algorithm 5 Alg. dot computes the update as an inner product of the i-th and j-th rows of L.
Input: L ∈ ℝ^(N×N) lower triangular, as CSR with rowptr, colval, nzval
1: ĩ ← rowptr[i]; j̃ ← rowptr[j]
2: î ← colval[ĩ]; ĵ ← colval[j̃]
3: out ← 0
4: while î, ĵ < min(i, j) do
5:   if î == ĵ then
6:     out ← out + nzval[ĩ] · nzval[j̃]
7:     ĩ ← ĩ + 1; j̃ ← j̃ + 1
8:     î ← colval[ĩ]; ĵ ← colval[j̃]
9:   else if î > ĵ then
10:     j̃ ← j̃ + 1
11:     ĵ ← colval[j̃]
12:   else
13:     ĩ ← ĩ + 1
14:     î ← colval[ĩ]
15:   end if
16: end while
17: return out
Figure 5.2: Up-looking ICHOL(0) in CSR format. We present an algorithm that computes a Cholesky factorization of a lower triangular matrix in CSR format, in place. The up-looking factorization greatly improves the spatial locality.
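For concreteness, the following is a hedged NumPy transcription of Algorithms 4 and 5, using 0-based indexing; we assume each row of the lower-triangular CSR structure is sorted by column and ends with its diagonal entry, and all names are ours.

    import numpy as np

    def sparse_dot(i, j, rowptr, colval, nzval):
        # Inner product of rows i and j of L over shared columns < min(i, j),
        # mirroring Algorithm 5.
        pi, pj = rowptr[i], rowptr[j]
        m = min(i, j)
        out = 0.0
        while colval[pi] < m and colval[pj] < m:
            if colval[pi] == colval[pj]:
                out += nzval[pi] * nzval[pj]
                pi += 1
                pj += 1
            elif colval[pi] > colval[pj]:
                pj += 1
            else:
                pi += 1
        return out

    def ichol0_up(n, rowptr, colval, nzval):
        # In-place up-looking ICHOL(0) (Algorithm 4). nzval holds the lower
        # triangle row by row; each row is sorted by column and ends with
        # its diagonal entry.
        for i in range(n):
            for p in range(rowptr[i], rowptr[i + 1]):
                j = colval[p]
                nzval[p] -= sparse_dot(i, j, rowptr, colval, nzval)
                if j < i:
                    nzval[p] /= nzval[rowptr[j + 1] - 1]  # divide by L[j, j]
                else:
                    nzval[p] = np.sqrt(nzval[p])          # diagonal entry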
Definition 11. Given a multiscale supernodal aggregation, a coarse-to-fine [fine-to-coarse] multiscale supernodal ordering is any ordering of the underlying points in which points assigned to the same supernode appear consecutively and the supernodes are ordered by scale, coarse-to-fine [fine-to-coarse]. We define the row-supernodal sparsity pattern $S_{\supset}$, the column-supernodal sparsity pattern $S_{\mid}$, and the two-way-supernodal sparsity pattern $S_{\supset\mid}$ as

$S_{\supset} \coloneqq \left\{ (\tilde{i}, j) \mid \exists\, i \rightsquigarrow \tilde{i} : (i, j) \in S \right\} \subset \{1, \dots, \tilde{N}\} \times \{1, \dots, N\}$,   (5.7)

$S_{\mid} \coloneqq \left\{ (i, \tilde{j}) \mid \exists\, j \rightsquigarrow \tilde{j} : (i, j) \in S \right\} \subset \{1, \dots, N\} \times \{1, \dots, \tilde{N}\}$,   (5.8)

$S_{\supset\mid} \coloneqq \left\{ (\tilde{i}, \tilde{j}) \mid \exists\, i \rightsquigarrow \tilde{i},\, j \rightsquigarrow \tilde{j} : (i, j) \in S \right\} \subset \{1, \dots, \tilde{N}\} \times \{1, \dots, \tilde{N}\}$.   (5.9)

An entry $(\tilde{i}, j)$ or $(i, \tilde{j})$ existing in $S_{\supset}$ or $S_{\mid}$ is to be interpreted as adding all interactions between the elements of $\tilde{i}$ and $j$ to the original sparsity pattern. Similarly, an entry $(\tilde{i}, \tilde{j})$ existing in $S_{\supset\mid}$ signifies that all interactions between elements of $\tilde{i}$ and $\tilde{j}$ have been added to the sparsity pattern. This means that we enlarge the sparsity pattern $S$ in order to ensure that, within each supernode, all rows, all columns, or both share the same sparsity pattern.
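To make the lifting in (5.7)-(5.9) concrete, here is a small sketch under our own encoding, in which assign[i] gives the supernode of point i and members[s] lists the points assigned to supernode s:

    def two_way_pattern(S, assign):
        # Lift a pointwise pattern S (set of (i, j) pairs) to the two-way
        # supernodal pattern of (5.9): (s, t) is present as soon as some
        # (i, j) in S has i assigned to s and j assigned to t.
        return {(assign[i], assign[j]) for (i, j) in S}

    def expand(S_super, members):
        # Interpret the supernodal pattern as an enlarged pointwise pattern:
        # every supernodal pair adds all interactions between its members.
        return {(i, j) for (s, t) in S_super
                       for i in members[s] for j in members[t]}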
For a coarse-to-fine sparsity pattern $S$ as described in Theorem 17 and using a multiscale supernodal aggregation with $\ell^{(k)} = \rho h^k$, the triangle inequality implies that the associated $S_{\supset\mid}$ (interpreted, by abuse of notation, as a subset of $I^2$) satisfies

$S \subset S_{\supset\mid} \subset \left\{ (i, j) \mid \operatorname{dist}(i, j) \leq 4(\rho + 1) \right\}$.   (5.10)

Thus, passing to $S_{\supset\mid}$ preserves the asymptotic results on accuracy and complexity.
The benefit of using the supernodal ordering and sparsity pattern is that instead of performing $O\big(N \log^2(N) \rho^{2d}\big)$ element-wise operations on a sparse matrix with $O\big(N \log(N) \rho^{d}\big)$ nonzero entries, we can perform $O\big(N \log^2(N) \rho^{-d}\big)$ block-wise operations on a block-sparse matrix with $O\big(N \log^2(N) \rho^{-d}\big)$ nonzero blocks. Since the size of the blocks is approximately $\rho^d \times \rho^d$, the resulting asymptotic complexity is the same. However, the block-wise operations can be performed using optimized libraries for dense linear algebra. In particular, they require only $\approx \rho^{2d}$ memory accesses for every $\approx \rho^{3d}$ floating point operations.
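In other words, blocking raises the arithmetic intensity of the factorization. As a back-of-the-envelope check of this claim, a dense operation on $b \times b$ blocks with $b \approx \rho^d$ performs

$\#\text{flops} \,/\, \#\text{memory accesses} \approx b^3 / b^2 = b \approx \rho^d$

floating point operations per memory access, so the data re-use grows proportionally with the supernode size.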
We can readily adapt Algorithm 4 to this setting by letting nzval have matrices as entries. To maximize spatial locality, we store these matrices consecutively in column-major format, in a buffer of length $\sum_{B \in \text{nzval}} \#\text{rows}(B) \cdot \#\text{cols}(B)$. Using cholesky to denote the dense Cholesky factorization returning a lower triangular matrix, the resulting algorithm is presented in Figure 5.3.
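A possible realization of this buffer layout is sketched below; the helper names and the offsets array are our own, not from the text.

    import numpy as np

    def pack_blocks(blocks):
        # Store a list of dense blocks consecutively, each in column-major
        # order, in one contiguous buffer; offsets[p] locates block p,
        # mirroring the nzval buffer of the block-CSR format.
        shapes = [b.shape for b in blocks]
        offsets = np.zeros(len(blocks) + 1, dtype=int)
        offsets[1:] = np.cumsum([r * c for (r, c) in shapes])
        buf = np.empty(offsets[-1])
        for p, b in enumerate(blocks):
            buf[offsets[p]:offsets[p + 1]] = b.ravel(order="F")
        return buf, offsets, shapes

    def get_block(buf, offsets, shapes, p):
        # Return block p as a column-major view into the buffer, no copy.
        r, c = shapes[p]
        return buf[offsets[p]:offsets[p + 1]].reshape((r, c), order="F")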
Algorithm 6 In-place, up-looking, supernodal ICHOL(0) in CSR format
Input: Ñ × Ñ block-lower-triangular matrix L, as block-CSR with rowptr, colval, nzval
1: for i = 1 : Ñ do
2:   for j̃ = rowptr[i] : (rowptr[i+1] − 1) do
3:     j ← colval[j̃]
4:     nzval[j̃] ← nzval[j̃] − dot(i, j, rowptr, colval, nzval)
5:     if j < i then
6:       nzval[j̃] ← nzval[j̃] / nzval[rowptr[j+1] − 1]
7:     else
8:       nzval[j̃] ← cholesky(nzval[j̃])
9:     end if
10:   end for
11: end for
Algorithm 7 Alg. dot computes the update as an inner product of the i-th and j-th block rows of L.
Input: Supernodal indices i and j and block-CSR with rowptr, colval, nzval
1: ĩ ← rowptr[i]
2: j̃ ← rowptr[j]
3: î ← colval[ĩ]
4: ĵ ← colval[j̃]
5: out ← 0
6: while î, ĵ < min(i, j) do
7:   if î == ĵ then
8:     out ← out + nzval[ĩ] · nzval[j̃]ᵀ
9:     ĩ ← ĩ + 1
10:     î ← colval[ĩ]
11:     j̃ ← j̃ + 1
12:     ĵ ← colval[j̃]
13:   else if î > ĵ then
14:     j̃ ← j̃ + 1
15:     ĵ ← colval[j̃]
16:   else
17:     ĩ ← ĩ + 1
18:     î ← colval[ĩ]
19:   end if
20: end while
21: return out
Figure 5.3: Up-looking supernodal ICHOL(0) in CSR format. We only need to replace the square root with a dense Cholesky factorization in Algorithm 4 and add a transpose in Algorithm 5 to obtain a supernodal factorization.
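As a minimal illustration of these two changes, the block-level operations could look as follows in NumPy/SciPy; the CSR bookkeeping of Algorithm 6 is elided, the function names are ours, and the division by the lower-triangular diagonal block is realized as a triangular solve from the right.

    import numpy as np
    from scipy.linalg import solve_triangular

    def offdiag_block(A_ij, Li, Lj, L_jj):
        # Off-diagonal update (lines 4-6 of Algorithm 6): Li and Lj are the
        # lists of matching blocks L[i, k] and L[j, k] with k < j. Dividing
        # by the diagonal block means computing S @ inv(L_jj).T.
        S = A_ij - sum(Lik @ Ljk.T for Lik, Ljk in zip(Li, Lj))
        return solve_triangular(L_jj, S.T, lower=True).T

    def diag_block(A_ii, Li):
        # Diagonal update (line 8 of Algorithm 6): dense Cholesky of the
        # Schur complement, returning the lower-triangular factor.
        S = A_ii - sum(Lik @ Lik.T for Lik in Li)
        return np.linalg.cholesky(S)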
5.2.3 Multicolor ordering
We will now address the parallel implementation of Algorithms 4 and 6, focusing on shared-memory parallelism. The key observation is that, depending on the sparsity pattern, many iterations of the outermost for-loop of Algorithms 4 and 6 can be performed in arbitrary order, and in particular in parallel, without changing the result.
Definition 12. A set of indices $J \subset \{1, \dots, N\}$ is pairwise independent under the sparsity pattern $S$ if $\{(i, j) \mid i, j \in J\} \cap S = \emptyset$.
If a consecutive set of indices is pairwise independent, the corresponding iterations of the outermost for-loops of Algorithms 4 and 6 can be performed in parallel.
The amount of parallelism afforded by this mechanism depends strongly on the ordering of the degrees of freedom. This has motivated the development of multicolor orderings [3, 4, 74, 192] that attempt to maximize the size of consecutive, pairwise independent sets of indices. A downside of these techniques, which were designed for the preconditioning of linear systems, is that the orderings leading to the largest amount of parallelism often do not achieve a good preconditioning effect. This limitation has led [50] to develop altogether different, asynchronous, iterative approaches.
A key advantage of the method presented in this chapter is that its exponential accuracy holds for an arbitrary ordering of the degrees of freedom on each scale.
This allows us to greatly increase the amount of parallelism by adopting a multiscale, multicolor ordering of the supernodes.
Construction 2 (Multiscale multicolor ordering). Within each level, we initialize each (supernodal) index as uncolored. We then greedily select a maximal set of uncolored indices that are pairwise independent with respect to $S$ (respectively $S_{\supset\mid}$) and assign a new color to them. We proceed until every index is colored and order the indices within each level by color.
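A minimal single-level sketch of this greedy coloring follows, assuming (our own encoding) that nbrs[i] lists the indices sharing a sparsity entry with i, with symmetric adjacency:

    def multicolor_order(n, nbrs):
        # Greedy multicoloring: repeatedly peel off a maximal set of
        # pairwise independent (mutually non-adjacent) uncolored indices
        # and give it a new color.
        color = [-1] * n
        c = 0
        while any(col < 0 for col in color):
            blocked = set()
            for i in range(n):
                if color[i] < 0 and i not in blocked:
                    color[i] = c              # i joins color class c
                    blocked.update(nbrs[i])   # its neighbors must wait
            c += 1
        # Within the level, order indices by color; each color class can
        # then be eliminated in parallel.
        return sorted(range(n), key=lambda i: color[i])

Indices of the same color are pairwise independent, so the corresponding outer-loop iterations of Algorithms 4 and 6 can run concurrently.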
In the setting of Theorem 17, the number of required colors is $O\big(\log(N) \rho^d\big)$ when coloring individual entries and $O\big(\log(N)\big)$ when coloring supernodes, leading to practically unlimited amounts of parallelism for large problems.