Chapter VI: Cholesky factorization by Kullback-Leibler minimization
6.3 Ordering and sparsity pattern motivated by the screening effect
Figure 6.2: The reverse maximin ordering. To obtain the reverse maximin ordering, for $i = N-1, N-2, \ldots, 1$, we successively select the point $x_i$ that has the largest distance $\ell_i$ to the points $x_{i+1}, \ldots, x_N$ selected previously (shown as enlarged). All previously selected points within distance $\rho \ell_i$ of $x_i$ (here, $\rho = 2$) form the $i$-th column of the sparsity pattern.
Algorithm 12 Without aggregation
Input: $G$, $\{x_i\}_{i \in I}$, $\prec$, $S_{\prec,\ell,\rho}$
Output: $L \in \mathbb{R}^{N \times N}$ l. triang. in $\prec$
1: for $k \in I$ do
2:   for $i, j \in s_k$ do
3:     $(\Theta_{s_k, s_k})_{ij} \leftarrow G(x_i, x_j)$
4:   end for
5:   $L_{s_k, k} \leftarrow \Theta_{s_k, s_k}^{-1} e_k$
6:   $L_{s_k, k} \leftarrow L_{s_k, k} / \sqrt{L_{k,k}}$
7: end for
8: return $L$
Algorithm 13 With aggregation
Input: $G$, $\{x_i\}_{i \in I}$, $\prec$, $\tilde{S}_{\prec,\ell,\rho,\lambda}$
Output: $L \in \mathbb{R}^{N \times N}$ l. triang. in $\prec$
1: for $\tilde{k} \in \tilde{I}$ do
2:   for $i, j \in \tilde{s}_{\tilde{k}}$ do
3:     $(\Theta_{\tilde{s}_{\tilde{k}}, \tilde{s}_{\tilde{k}}})_{ij} \leftarrow G(x_i, x_j)$
4:   end for
5:   $U \leftarrow P^{\updownarrow} \operatorname{chol}(P^{\updownarrow} \Theta_{\tilde{s}_{\tilde{k}}, \tilde{s}_{\tilde{k}}} P^{\updownarrow}) P^{\updownarrow}$
6:   for $k \rightsquigarrow \tilde{k}$ do
7:     $L_{\tilde{s}_{\tilde{k}}, k} \leftarrow U^{-\top} e_k$
8:   end for
9: end for
10: return $L$
Figure 6.3: KL-minimizing Cholesky factorization, with (Algorithm 13) and without (Algorithm 12) aggregation. For notational convenience, all matrices are assumed to have row and column ordering according to $\prec$. $P^{\updownarrow}$ denotes the order-reversing permutation matrix, and $e_k$ is the vector with 1 in the $k$-th component and zero elsewhere.
Write $\ell_i := \operatorname{dist}(x_i, \{x_{i+1}, \ldots, x_N\} \cup \partial\Omega)$, and write $i \prec j$ if $i$ precedes $j$ in the reverse maximin ordering. We collect the $\{\ell_i\}_{i \in I}$ into a vector denoted by $\ell$. For a tuning parameter $\rho \in \mathbb{R}_+$, we select the sparsity set $S_{\prec,\ell,\rho} \subset I \times I$ as
$$
S_{\prec,\ell,\rho} := \left\{ (i, j) : i \succeq j,\ \operatorname{dist}(x_i, x_j) \le \rho \ell_j \right\}. \tag{6.9}
$$
The reverse maximin ordering and sparsity pattern are illustrated in Figure 6.2.
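As an illustration of these definitions, the following Python sketch constructs the reverse maximin ordering, the length scales $\ell$, and the pattern $S_{\prec,\ell,\rho}$ by the naive $O(N^2)$ procedure. The function name, the NumPy implementation, and the choice of seeding the selection with the point farthest from $\partial\Omega$ are our own; the near-linear-time construction referenced in the next paragraph is not reproduced here.

```python
import numpy as np

def reverse_maximin(points, bdist, rho):
    """Naive O(N^2) sketch of the reverse maximin ordering, length scales
    ell, and sparsity pattern S_{<,ell,rho}.

    points : (N, d) array of locations x_i
    bdist  : (N,) array with dist(x_i, boundary of Omega)
    rho    : sparsity parameter
    Returns the permutation `order` (order[0] is first in <), the length
    scales `ell`, and the pattern as (row, col) pairs, all in the new ordering.
    """
    N = len(points)
    dist_to_selected = bdist.copy()          # distance to {selected} u dOmega
    order = np.empty(N, dtype=int)
    ell = np.empty(N)
    # select points from the last (coarsest) to the first (finest) position
    for pos in range(N - 1, -1, -1):
        k = int(np.argmax(dist_to_selected))
        order[pos], ell[pos] = k, dist_to_selected[k]
        d = np.linalg.norm(points - points[k], axis=1)
        dist_to_selected = np.minimum(dist_to_selected, d)
        dist_to_selected[k] = -np.inf        # never pick k again
    # pattern (6.9): (i, j) with i after j in < and dist(x_i, x_j) <= rho*ell_j
    X = points[order]
    pattern = {(i, j) for j in range(N) for i in range(j, N)
               if np.linalg.norm(X[i] - X[j]) <= rho * ell[j]}
    return order, ell, pattern
```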
We can use a minor modification of Algorithm 11 to construct the reverse maximin ordering and sparsity pattern in computational complexity $O(N \log^2(N) \rho^{\tilde{d}})$ in time and $O(N \rho^{\tilde{d}})$ in space, where $\tilde{d} \le d$ is the intrinsic dimension of the dataset, as will be defined in Condition 5. The inverse Cholesky factors $L$ can then be computed using Equation (6.3), as in Algorithm 12.
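A direct transcription of Algorithm 12 into Python could look as follows. This is a minimal sketch under the assumptions that the points are already listed in the reverse maximin ordering, that `G` is any covariance function callable, and that `columns[k]` holds the sorted column pattern $s_k = \{i : (i,k) \in S_{\prec,\ell,\rho}\}$; dense linear algebra is used with no attention to performance.

```python
import numpy as np

def kl_cholesky(G, X, columns):
    """Algorithm 12 (no aggregation): KL-minimizing sparse factor L with
    L @ L.T approximating Theta^{-1} on the pattern S_{<,ell,rho}."""
    N = len(X)
    L = np.zeros((N, N))
    for k in range(N):
        s = columns[k]                          # sorted, so s[0] == k
        Theta_s = np.array([[G(X[i], X[j]) for j in s] for i in s])
        e_k = np.zeros(len(s)); e_k[0] = 1.0
        col = np.linalg.solve(Theta_s, e_k)     # Theta_{s_k,s_k}^{-1} e_k
        L[s, k] = col / np.sqrt(col[0])         # L_{s_k,k} / sqrt(L_{k,k})
    return L
```

With the pattern set from the previous sketch, `columns` can be obtained as `{k: sorted(i for (i, j) in pattern if j == k) for k in range(N)}`.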
6.3.2 Aggregated sparsity pattern
It was already observed by [83] in the context of sparse approximate inverses, and by [104, 226] in the context of the Vecchia approximation, that a suitable grouping of the degrees of freedom makes it possible to reuse Cholesky factorizations of the matrices $\Theta_{s_i, s_i}$ in Equation (6.3) to update multiple columns at once. The authors of [83, 104] propose grouping heuristics based on the sparsity graph of $L$ and show empirically that they lead to improved performance. In contrast, we propose a grouping procedure based on geometric information and prove that it allows us to reach the best asymptotic complexity in the literature in a more concrete setting.

Figure 6.4: Geometric aggregation. The left figure illustrates the original pattern $S_{\prec,\ell,\rho}$. For each orange point, we need to keep track of its interactions with all points within a circle of radius $\approx \rho$. In the right figure, the points have been collected into a supernode, which can be represented by a list of parents (the orange points within an inner sphere of radius $\approx \rho$) and children (all points within a radius $\approx 2\rho$).
Assume that we have already computed the reverse maximin ordering $\prec$ and sparsity pattern $S_{\prec,\ell,\rho}$, and that we have access to the $\ell_i$ as defined above. We will now aggregate the points into groups called supernodes, consisting of points that are close in both location and ordering. To do so, we pick at each step the first (w.r.t. $\prec$) index $i \in I$ that has not been aggregated into a supernode yet, and we then aggregate into a common supernode the indices in $\{j : (j, i) \in S_{\prec,\ell,\rho},\ \ell_j \le \lambda \ell_i\}$, for some $\lambda > 1$ ($\lambda \approx 1.5$ is typically a good choice), that have not been aggregated yet. We proceed with this procedure until every node has been aggregated into a supernode. We write $\tilde{I}$ for the set of all supernodes; for $i \in I$ and $\tilde{k} \in \tilde{I}$, we write $i \rightsquigarrow \tilde{k}$ if $\tilde{k}$ is the supernode to which $i$ has been aggregated. We furthermore define $\tilde{s}_{\tilde{k}} := \{i : \exists k \rightsquigarrow \tilde{k},\ i \in s_k\}$ and introduce the aggregated sparsity pattern $\tilde{S}_{\prec,\ell,\rho,\lambda} := \bigcup_{k \rightsquigarrow \tilde{k}} \{(i, k) : i \succeq k,\ i \in \tilde{s}_{\tilde{k}}\}$. This sparsity pattern, while larger than $S_{\prec,\ell,\rho}$, can be represented efficiently by keeping track of the set of parents (the $k \in I$ such that $k \rightsquigarrow \tilde{k}$) and children (the $i \in \tilde{s}_{\tilde{k}}$) of each supernode, rather than the individual entries (see Figure 6.4 for an illustration). For well-behaved (cf. Theorem 21) sets of points, we obtain $O(N \rho^{-d})$ supernodes, each with $O(\rho^d)$ parents and children, thus improving the cost of storing the sparsity pattern from $O(N \rho^d)$ to $O(N)$.
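The greedy aggregation described above can be sketched in a few lines of Python. The helper names are our own; `columns[k]` is assumed to hold the sorted column pattern $s_k$ and `ell` the length scales, both indexed in the reverse maximin ordering.

```python
def aggregate_supernodes(columns, ell, lam=1.5):
    """Greedily group indices into supernodes of points that are close in
    both location and length scale."""
    N = len(ell)
    supernode_of = [None] * N       # supernode_of[k] = k_tilde with k ~> k_tilde
    parents, children = [], []
    for i in range(N):              # i is the first unaggregated index w.r.t. <
        if supernode_of[i] is not None:
            continue
        # candidates: j with (j, i) in S_{<,ell,rho} and ell_j <= lam * ell_i
        members = [j for j in columns[i]
                   if supernode_of[j] is None and ell[j] <= lam * ell[i]]
        for j in members:
            supernode_of[j] = len(parents)
        parents.append(members)
        # children of the supernode: s~_{k~} = union of s_k over its parents
        children.append(sorted(set().union(*(columns[k] for k in members))))
    return parents, children, supernode_of
```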
Figure 6.5: Reusing Cholesky factors. (Left:) By adding a few nonzero entries to the sparsity pattern, the sparsity patterns of the columns in $\tilde{s}_{\tilde{k}}$ become subsets of one another. (Right:) Therefore, the matrices $\{\Theta_{s_k, s_k}\}_{k \rightsquigarrow \tilde{k}}$, which need to be inverted to compute the columns $L_{:, k}$ for $k \rightsquigarrow \tilde{k}$, become submatrices of one another. Thus, submatrices of the Cholesky factors of $\Theta_{\tilde{s}_{\tilde{k}}, \tilde{s}_{\tilde{k}}}$ can be used as factors of $\Theta_{s_k, s_k}$ for any $k \rightsquigarrow \tilde{k}$.
While the above aggregation procedure can be performed efficiently once $\prec$ and $S_{\prec,\ell,\rho}$ are computed, it is possible to directly compute $\prec$ and an outer approximation $\bar{S}_{\prec,\ell,\rho,\lambda} \supset \tilde{S}_{\prec,\ell,\rho,\lambda}$ in computational complexity $O(N)$ in space and $O(N \log(N))$ in time. $\bar{S}_{\prec,\ell,\rho,\lambda}$ can either be used directly, or it can be used to compute $\tilde{S}_{\prec,\ell,\rho,\lambda}$ in $O(N)$ in space and $O(N \log(N) \rho^d)$ in time, using a simple and embarrassingly parallel algorithm. Details are given in Section .3.4.
In addition to reducing the memory cost, the aggregated ordering and sparsity pattern allow us to compute the Cholesky factorization (in reverse ordering) $\Theta_{\tilde{s}_{\tilde{k}}, \tilde{s}_{\tilde{k}}} = U U^{\top}$ once for each supernode and then use it to compute $L_{\tilde{s}_{\tilde{k}}, k}$ for all $k \rightsquigarrow \tilde{k}$, as in Algorithm 13 (see Figure 6.5 for an illustration).
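In code, the reverse-ordered factorization in line 5 of Algorithm 13 amounts to flipping the supernodal covariance block, taking an ordinary Cholesky factorization, and flipping back; every column of the supernode is then a single triangular solve. The sketch below is our own illustration of this step; it assumes that the children $\tilde{s}_{\tilde{k}}$ are sorted in $\prec$ and that the pattern of column $k$ is the trailing part of $\tilde{s}_{\tilde{k}}$, as in the aggregated pattern.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def supernode_columns(G, X, children, parents, L):
    """Fill the columns k ~> k_tilde of L from a single factorization
    Theta_{s~,s~} = U U^T with U upper triangular in the < ordering."""
    s = list(children)                        # s~_{k_tilde}, sorted w.r.t. <
    Theta = np.array([[G(X[i], X[j]) for j in s] for i in s])
    C = cholesky(np.flip(Theta), lower=True)  # Cholesky of the flipped block
    U = np.flip(C)                            # upper triangular, Theta = U @ U.T
    for k in parents:
        p = s.index(k)
        e = np.zeros(len(s)); e[p] = 1.0
        # U^{-T} e_k is already normalized and vanishes on rows before k
        L[s, k] = solve_triangular(U, e, trans='T', lower=False)
    return L
```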
As we show in the next section, this allows us to reduce the computational complexity from $O(N \rho^{3d})$ to $O(N \rho^{2d})$ for sufficiently well-behaved sets of points.
6.3.3 Theoretical guarantees
We now present our rigorous theoretical result bounding the computational complexity and approximation error of our method. Proofs and additional details are deferred to Section .3.2.
Remark 1. As detailed in Section .3.2, the results below apply to more general reverse $r$-maximin orderings, which can be computed in complexity $O(N \log(N))$, improving over reverse maximin orderings by a factor of $\log(N)$.
Computational complexity
We can derive the following bounds on the computational complexity, depending on $\rho$ and $N$.
Theorem 21 (Informal). Under mild assumptions on $\{x_i\}_{i \in I} \subset \mathbb{R}^d$, the KL-minimizer $L$ is computed in complexity $C N \rho^d$ in space and $C N \rho^{3d}$ in time when using Algorithm 12 with $S_{\prec,\ell,\rho}$, and in complexity $C N \rho^d$ in space and $C_{\rho,\lambda} C N \rho^{2d}$ in time when using Algorithm 13 with $\tilde{S}_{\prec,\ell,\rho,\lambda}$. Here, the constant $C$ depends only on $d$, $\delta$, and the cost of evaluating entries of $\Theta$.
A more formal statement and a proof of Theorem 21 can be found in Section .3.2.
As can be seen from Theorem 21, using the aggregation scheme decreases the computational cost by a factor $\rho^d$. This is because each supernode has $\approx \rho^d$ members that can all be updated by reusing the same Cholesky factorization.
Remark 2. As described in Section .3.2, the computational complexity only depends on the intrinsic dimension of the dataset (as opposed to the potentially much larger ambient dimension $d$). This means that the algorithm automatically exploits low-dimensional structure in the data to decrease the computational complexity.
Approximation error
The rigorous bounds on the approximation error are derived from Theorem 10 and therefore hold under the same assumptions. We assume for the purpose of this section that $\Omega$ is a bounded domain of $\mathbb{R}^d$ with Lipschitz boundary, and for an integer $s > d/2$, we write $H^s_0(\Omega)$ for the usual Sobolev space of functions with zero Dirichlet boundary values and order $s$ derivatives in $L^2$, and $H^{-s}(\Omega)$ for its dual.
Let the operator
$$
\mathcal{L} : H^s_0(\Omega) \mapsto H^{-s}(\Omega), \tag{6.10}
$$
be linear, symmetric ($\int u \mathcal{L} v = \int v \mathcal{L} u$), positive ($\int u \mathcal{L} u \ge 0$), bijective, bounded (write $\|\mathcal{L}\| := \sup_u \|\mathcal{L} u\|_{H^{-s}(\Omega)} / \|u\|_{H^s_0(\Omega)}$ for its operator norm), and local in the sense that $\int u \mathcal{L} v \, \mathrm{d}x = 0$ for all $u, v \in H^s_0(\Omega)$ with disjoint support. By the Sobolev embedding theorem, we have $H^s_0(\Omega) \subset C_0(\Omega)$, and hence $\{\delta_x\}_{x \in \Omega} \subset H^{-s}(\Omega)$. We then define $G$ as the Green's function of $\mathcal{L}$,
$$
G(x_1, x_2) := \int \delta_{x_1} \mathcal{L}^{-1} \delta_{x_2} \, \mathrm{d}x. \tag{6.11}
$$
A simple example, when $d = 1$ and $\Omega = (0, 1)$, is $\mathcal{L} = -\Delta$ and $G(x, y) = \mathbb{1}_{x < y} (1 - y) x + \mathbb{1}_{y \le x} (1 - x) y$.
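The screening effect is particularly transparent in this one-dimensional example: the corresponding process (a Brownian bridge) is Markov, so the precision matrix of its values at any set of ordered points is tridiagonal, and the inverse Cholesky factors are exactly sparse. The short check below is our own illustration; the grid size is arbitrary.

```python
import numpy as np

# Green's function of -Laplace on (0, 1) with zero Dirichlet boundary values
def green(x, y):
    return np.where(x < y, (1 - y) * x, (1 - x) * y)

x = np.linspace(0, 1, 102)[1:-1]          # interior grid points
Theta = green(x[:, None], x[None, :])     # covariance matrix
Prec = np.linalg.inv(Theta)               # precision matrix Theta^{-1}

# conditional on its two neighbours, each point is independent of the rest,
# so everything outside the three central diagonals is pure round-off
off_band = Prec - np.tril(np.triu(Prec, -1), 1)
print(np.abs(off_band).max() / np.abs(Prec).max())
```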
Let us define the following measure of homogeneity of the distribution of $\{x_i\}_{i \in I}$,
$$
\delta := \frac{\min_{x_i \neq x_j} \operatorname{dist}(x_i, \{x_j\} \cup \partial\Omega)}{\max_{x \in \Omega} \operatorname{dist}(x, \{x_i\}_{i \in I} \cup \partial\Omega)}. \tag{6.12}
$$
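As a concrete reading of (6.12), the ratio can be estimated numerically. The sketch below is our own; the fill distance in the denominator is approximated on a dense reference grid, and the unit square is taken as $\Omega$ in the usage example purely for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

def homogeneity_delta(points, bdist, candidates, cand_bdist):
    """Estimate delta from (6.12): minimal separation (including the boundary)
    divided by the fill distance of the point set in Omega."""
    tree = cKDTree(points)
    nearest = tree.query(points, k=2)[0][:, 1]      # nearest distinct point
    numerator = min(nearest.min(), bdist.min())
    fill = np.minimum(tree.query(candidates)[0], cand_bdist)
    return numerator / fill.max()

# usage on the unit square, with dist(x, dOmega) computed in closed form
pts = np.random.rand(500, 2)
bd = np.minimum(pts, 1 - pts).min(axis=1)
grid = np.linspace(0, 1, 200)
cand = np.stack(np.meshgrid(grid, grid), -1).reshape(-1, 2)
cbd = np.minimum(cand, 1 - cand).min(axis=1)
print(homogeneity_delta(pts, bd, cand, cbd))
```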
Using the above definitions, we can rigorously quantify the increase in approximation accuracy as $\rho$ increases.
Theorem 22. There exists a constant $C$ depending only on $d$, $\Omega$, $\lambda$, $s$, $\|\mathcal{L}\|$, $\|\mathcal{L}^{-1}\|$, and $\delta$, such that for $\rho \ge C \log(N/\epsilon)$, we have
$$
\mathbb{D}_{\mathrm{KL}}\!\left( \mathcal{N}(0, \Theta) \,\middle\|\, \mathcal{N}\!\left(0, (L^{\rho} L^{\rho,\top})^{-1}\right) \right) + \left\| \Theta - (L^{\rho} L^{\rho,\top})^{-1} \right\|_{\mathrm{Fro}} \le \epsilon. \tag{6.13}
$$
Thus, Algorithm 12 computes an $\epsilon$-accurate approximation of $\Theta$ in computational complexity $C N \log^d(N/\epsilon)$ in space and $C N \log^{3d}(N/\epsilon)$ in time, from $C N \log^d(N/\epsilon)$ entries of $\Theta$. Similarly, Algorithm 13 computes an $\epsilon$-accurate approximation of $\Theta$ in computational complexity $C N \log^d(N/\epsilon)$ in space and $C N \log^{2d}(N/\epsilon)$ in time, from $C N \log^d(N/\epsilon)$ entries of $\Theta$.
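For moderate $N$, the two error terms appearing in (6.13) can be evaluated directly to check a computed factor. The helper below is our own sketch; it forms dense matrices and is meant only for validation, not for large-scale use.

```python
import numpy as np

def kl_and_frobenius_error(Theta, L):
    """Return D_KL(N(0,Theta) || N(0,(L L^T)^{-1})) and ||Theta - (L L^T)^{-1}||_Fro."""
    N = Theta.shape[0]
    _, logdet_Theta = np.linalg.slogdet(Theta)
    logdet_precision = 2.0 * np.log(np.diag(L)).sum()       # log det(L L^T)
    # KL divergence between centred Gaussians with covariances Theta and (L L^T)^{-1}
    kl = 0.5 * (np.trace(L.T @ Theta @ L) - N
                - logdet_precision - logdet_Theta)
    fro = np.linalg.norm(Theta - np.linalg.inv(L @ L.T), ord='fro')
    return kl, fro
```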
To the best of our knowledge, the above result is the best known complexity/accuracy trade-off for kernel matrices based on Green's functions of elliptic boundary value problems. In particular, we improve upon the methods in Chapter 5, where we showed that the Cholesky factors of $\Theta$ (as opposed to those of $\Theta^{-1}$) can be approximated in computational complexity $O(N \log^2(N) \log^{2d}(N/\epsilon))$ in time and $O(N \log(N) \log^d(N/\epsilon))$ in space using zero-fill-in incomplete Cholesky factorization (Algorithm 15) applied to $\Theta$.
Screening in theory and practice
The theory described in the last section covers any self-adjoint operator $\mathcal{L}$ with an associated quadratic form
$$
\mathcal{L}[u] := \int_{\Omega} u \mathcal{L} u \, \mathrm{d}x = \sum_{k=0}^{s} \int \sigma^{(k)}(x) \, \| D^{(k)} u(x) \|^2 \, \mathrm{d}x
$$
with $\sigma^{(s)} \in L^2(\Omega)$ positive almost everywhere. That is, $\mathcal{L}[u]$ is a weighted average of the squared norms of derivatives of $u$ and thus measures the roughness of $u$. We can formally think of a Gaussian process with covariance function given by $G$ as having density $\sim \exp(-\mathcal{L}[u]/2)$ and therefore assigning an exponentially low probability to "rough" functions, making it a prototypical smoothness prior. Theorems 10 and 9 imply that these Gaussian processes are subject to an exponentially strong screening effect in the sense that, after conditioning on a set of $\ell$-dense points, the conditional covariance of a given point decays exponentially with rate $\sim \ell^{-1}$, as shown in the first panel of Figure 6.6.
Figure 6.6: Limitations of screening. To illustrate the screening effect exploited by our methods, we plot the conditional correlation with the point in red, conditional on the blue points. In the first panel, the points are evenly distributed, leading to a rapidly decreasing conditional correlation. In the second panel, the same number of points is irregularly distributed, slowing the decay. In the last panel, we are at the fringe of the set of observations, weakening the screening effect.
The most closely related model in common use is the Matérn covariance function [167], which is the Green's function of an elliptic PDE of order $s$ when the "smoothness parameter" $\nu$ is chosen as $\nu = s - d/2$. While our theory only covers $s \in \mathbb{N}$, we observed in Section 6.5 that Matérn kernels with non-integer values of $s$, and even the "Cauchy class" [95], seem to be subject to similar behavior. In the second panel of Figure 6.6, we show that as the distribution of conditioning points becomes more irregular, the screening effect weakens. In our theoretical results, this is controlled by the lower bound on $\delta$ in (6.12). The screening effect is significantly weakened close to the boundary of the domain, as illustrated in the third panel of Figure 6.6 (cf. Figure 5.4). This is the reason why our theoretical results, in contrast with the Matérn covariance, are restricted to Green's functions with zero Dirichlet boundary conditions, which corresponds to conditioning the process to be zero on $\partial\Omega$. A final limitation is that the screening effect weakens as we take the order of smoothness to infinity, obtaining, for instance, the Gaussian kernel. However, according to Theorem 2, this results in matrices that have efficient low-rank approximations instead.
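The conditional correlations shown in Figure 6.6 can be reproduced for any covariance function by a standard Schur-complement computation. The sketch below is our own illustration; the Matérn-3/2 kernel, length scale, and point layout are arbitrary choices rather than the exact setup used for the figure.

```python
import numpy as np

def conditional_correlation(G, x_red, x_blue, x_query):
    """Correlation between the field at x_red and at each x_query,
    conditional on observing the field at the points x_blue."""
    pts = np.vstack([x_red[None, :], x_query, x_blue])
    K = G(pts[:, None, :], pts[None, :, :])
    n = 1 + len(x_query)
    A, B, C = K[:n, :n], K[:n, n:], K[n:, n:]
    cond = A - B @ np.linalg.solve(C, B.T)      # conditional covariance
    d = np.sqrt(np.diag(cond))
    return cond[0, 1:] / (d[0] * d[1:])

def matern32(x, y, scale=0.3):                  # an illustrative kernel choice
    r = np.linalg.norm(x - y, axis=-1)
    return (1 + np.sqrt(3) * r / scale) * np.exp(-np.sqrt(3) * r / scale)

blue = np.random.rand(200, 2)                   # conditioning points
red = np.array([0.5, 0.5])                      # target point
query = np.random.rand(400, 2)                  # evaluation points
corr = conditional_correlation(matern32, red, blue, query)
```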