Chapter VI: Cholesky factorization by Kullback-Leibler minimization
6.4 Extensions
Figure 6.6: Limitations of screening. To illustrate the screening effect exploited by our methods, we plot the conditional correlation with the point in red, conditional on the blue points. In the first panel, the points are evenly distributed, leading to a rapidly decreasing conditional correlation. In the second panel, the same number of points is irregularly distributed, slowing the decay. In the last panel, we are at the fringe of the set of observations, weakening the screening effect.
Matérn covariance function [167] that is the Green's function of an elliptic PDE of order $s$, when choosing the "smoothness parameter" $\nu$ as $\nu = s - d/2$. While our theory only covers $s \in \mathbb{N}$, we observed in Section 6.5 that Matérn kernels with non-integer values of $\nu$ and even the "Cauchy class" [95] seem to be subject to similar behavior. In the second panel of Figure 6.6, we show that as the distribution of conditioning points becomes more irregular, the screening effect weakens. In our theoretical results, this is controlled by the upper bound on $\delta$ in (6.12). The screening effect is significantly weakened close to the boundary of the domain, as illustrated in the third panel of Figure 6.6 (cf. Figure 5.4). This is the reason why our theoretical results, unlike in the case of the Matérn covariance, are restricted to Green's functions with zero Dirichlet boundary condition, which corresponds to conditioning the process to be zero on $\partial\Omega$. A final limitation is that the screening effect weakens as we take the order of smoothness to infinity, obtaining, for instance, the Gaussian kernel. However, according to Theorem 2, this results in matrices that have efficient low-rank approximations instead.
In Section 6.4.3, we discuss memory savings and parallel computation for GP inference when we are only interested in computing the likelihood and the posterior mean and covariance (as opposed to, for example, sampling from $\mathcal{N}(0,\Theta)$ or computing products $v \mapsto \Theta v$).
We note that it is not possible to combine the variant in Section 6.4.1 with that in Section 6.4.3, and that the combination of the variants in Sections 6.4.1 and 6.4.2 might diminish accuracy gains from the latter. Furthermore, while Section 6.4.3 can be combined with Section 6.4.2 to compute the posterior mean, this combination cannot be used to compute the full posterior covariance matrix.
6.4.1 Additive noise
Assume that a diagonal noise term is added to $\Theta$, so that $\Sigma = \Theta + R$, where $R$ is diagonal. Extending the Vecchia approximation to this setting has been a major open problem in spatial statistics [62, 135, 136]. Applying our approximation directly to $\Sigma$ would not work well because the noise term attenuates the exponential decay.
Instead, given the approximation $\hat{\Theta}^{-1} = L L^{\top}$ obtained using our method, we can write, following Section 5.5.3,
$$\Sigma \approx \hat{\Theta} + R = \hat{\Theta}\,(R^{-1} + \hat{\Theta}^{-1})\,R.$$
Applying an incomplete Cholesky factorization with zero fill-in (Algorithm 15) to $R^{-1} + \hat{\Theta}^{-1} \approx \hat{L}\hat{L}^{\top}$, we have
$$\Sigma \approx (L L^{\top})^{-1}\hat{L}\hat{L}^{\top} R.$$
The resulting procedure, given in Algorithm 14, has asymptotic complexity $\mathcal{O}(N\rho^{2d})$, because every column of the triangular factors has at most $\mathcal{O}(\rho^{d})$ entries.
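For concreteness, the following is a minimal sketch in Python/SciPy of how this factorization yields matrix-vector products with $\Sigma$ and an approximation of $\log\det\Sigma$; the names L, L_hat, and R_diag are hypothetical, and both factors are assumed to be stored as sparse lower-triangular CSR matrices.

```python
import numpy as np
from scipy.sparse.linalg import spsolve_triangular

# Hypothetical inputs: L, L_hat are sparse lower-triangular CSR matrices with
# Theta_hat^{-1} ~= L @ L.T and R^{-1} + Theta_hat^{-1} ~= L_hat @ L_hat.T;
# R_diag is a 1-D array holding the diagonal of the noise covariance R.

def apply_Sigma(v, L, L_hat, R_diag):
    """Approximate Sigma @ v using Sigma ~= (L L^T)^{-1} L_hat L_hat^T R."""
    w = L_hat @ (L_hat.T @ (R_diag * v))                     # L_hat L_hat^T R v
    w = spsolve_triangular(L, w, lower=True)                 # L^{-1} ...
    return spsolve_triangular(L.T.tocsr(), w, lower=False)   # L^{-T} L^{-1} = (L L^T)^{-1}

def logdet_Sigma(L, L_hat, R_diag):
    """Approximate log det Sigma from the triangular factors, using
    log det Sigma ~= -2 sum log L_ii + 2 sum log L_hat_ii + sum log R_ii."""
    return (-2.0 * np.sum(np.log(L.diagonal()))
            + 2.0 * np.sum(np.log(L_hat.diagonal()))
            + np.sum(np.log(R_diag)))
```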
Following the intuition that $\Theta^{-1}$ is essentially an elliptic partial differential operator, $\Theta^{-1} + R^{-1}$ is essentially a partial differential operator with an added zero-order term, and its Cholesky factors can thus be expected to satisfy an exponential decay property just like those of $\Theta^{-1}$. Indeed, as shown in Figure 5.5, the exponential decay of the Cholesky factors of $R^{-1} + \Theta^{-1}$ is as strong as that for $\Theta^{-1}$, even for large $R$. We suspect that this could be proved rigorously by adapting the proof of exponential decay in [190] to the discrete setting. We note that while independent noise is most commonly used, the above argument leads to an efficient algorithm whenever $R^{-1}$ is approximately given by an elliptic PDE (possibly of order zero).
For small $\rho$, the additional error introduced by the incomplete Cholesky factorization can harm accuracy, which is why we recommend using the conjugate gradient
Algorithm 14 Including independent noise with covariance matrix $R$
Input: $G$, $\{x_i\}_{i \in I}$, $\rho$, ($\lambda$,) and $R$
Output: $L, \hat{L} \in \mathbb{R}^{N \times N}$ l. triang. in $\prec$
1: Comp. $\prec$ and $S \leftarrow S_{\prec,\ell,\rho}$ ($S_{\prec,\ell,\rho,\lambda}$)
2: Comp. $L$ using Alg. 12 (13)
3: for $(i, j) \in S$ do
4: $A_{ij} \leftarrow \langle L_{i,:}, L_{j,:} \rangle$
5: end for
6: $A \leftarrow A + R^{-1}$
7: $\hat{L} \leftarrow \mathrm{ichol}(A, S)$
8: return $L, \hat{L}$
Algorithm 15 Zero fill-in incomplete Cholesky factorization ($\mathrm{ichol}(A, S)$)
Input: $A \in \mathbb{R}^{N \times N}$, $S$
Output: $L \in \mathbb{R}^{N \times N}$ l. triang. in $\prec$
1: $L \leftarrow (0, \ldots, 0)(0, \ldots, 0)^{\top}$
2: for $j \in \{1, \ldots, N\}$ do
3: for $i \in \{j, \ldots, N\} : (i, j) \in S$ do
4: $L_{ij} \leftarrow A_{ij} - \langle L_{i,1:(j-1)}, L_{j,1:(j-1)} \rangle$
5: end for
6: $L_{:,j} \leftarrow L_{:,j}/\sqrt{L_{jj}}$
7: end for
8: return $L$
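For concreteness, a direct and unoptimized transcription of Algorithm 15 into Python could look as follows; here $A$ is a dense array and $S$ a set of index pairs $(i, j)$ with $i \geq j$ that contains all diagonal pairs, and no attempt is made to exploit sparse data structures. This is an illustrative sketch rather than the implementation used in our experiments.

```python
import numpy as np

def ichol(A, S):
    """Zero fill-in incomplete Cholesky: returns a lower-triangular L whose
    nonzeros are restricted to the index set S, with (L L^T)_{ij} ~= A_{ij} on S."""
    N = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(N):
        for i in range(j, N):
            if (i, j) in S:
                L[i, j] = A[i, j] - L[i, :j] @ L[j, :j]
        L[:, j] /= np.sqrt(L[j, j])   # at this point L[j, j] holds the pivot
    return L
```

In Algorithm 14, $A$ would be assembled from the entries $A_{ij} = \langle L_{i,:}, L_{j,:}\rangle$ for $(i, j) \in S$ with the diagonal noise precision added, and $S$ would be the sparsity pattern of $L$ (or, as discussed below, the larger pattern of $L L^{\top}$).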
Figure 6.7: Sums of independent processes. Algorithms for approximating covariance matrices with added independent noise $\Theta + R$ (left), using the zero fill-in incomplete Cholesky factorization (right). Alternatively, the variants discussed in Section 5.3 could be used. See Section 6.4.1.
algorithm (CG) to invert $(R^{-1} + \hat{\Theta}^{-1})$, using $\hat{L}$ as a preconditioner. In our experience, CG converges to single precision in a small number of iterations ($\sim$10).
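The following is a minimal sketch of such a preconditioned CG solve using SciPy, with the same hypothetical variable names as above and the factors assumed to be in CSR format; it is meant as an illustration rather than as our implementation.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg, spsolve_triangular

def solve_noisy_precision(b, L, L_hat, R_diag):
    """Solve (R^{-1} + Theta_hat^{-1}) x = b by CG, preconditioned with L_hat.
    Assumes Theta_hat^{-1} ~= L @ L.T and R^{-1} + Theta_hat^{-1} ~= L_hat @ L_hat.T."""
    N = b.shape[0]
    R_inv = 1.0 / R_diag

    def matvec(v):                    # exact action of R^{-1} + L L^T
        return R_inv * v + L @ (L.T @ v)

    def precond(v):                   # (L_hat L_hat^T)^{-1} v via two triangular solves
        w = spsolve_triangular(L_hat, v, lower=True)
        return spsolve_triangular(L_hat.T.tocsr(), w, lower=False)

    A = LinearOperator((N, N), matvec=matvec)
    M = LinearOperator((N, N), matvec=precond)
    x, info = cg(A, b, M=M)
    return x
```

Under the same assumptions, $\Sigma^{-1}y$ can then be approximated as `(1.0 / R_diag) * solve_noisy_precision(L @ (L.T @ y), L, L_hat, R_diag)`, since $\Sigma^{-1} = R^{-1}(R^{-1} + \hat{\Theta}^{-1})^{-1}\hat{\Theta}^{-1}$.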
Alternatively, higher accuracy can be achieved by using the sparsity pattern of $L L^{\top}$ (as opposed to that of $L$) to compute the incomplete Cholesky factorization of $A$ in Algorithm 14; in fact, in our numerical experiments in Section 6.5.2, this approach was as accurate as using the exact Cholesky factorization of $A$ over a wide range of $\rho$ values and noise levels. The resulting algorithm still requires $\mathcal{O}(N\rho^{2d})$ time, albeit with a larger constant. This is because for an entry $(i, j)$ to be part of the sparsity pattern of $L L^{\top}$, there needs to exist a $k$ such that both $(i, k)$ and $(j, k)$ are part of the sparsity pattern of $L$. By the triangle inequality, this implies that $(i, j)$ is contained in the sparsity pattern of $L$ obtained by doubling $\rho$. In conclusion, we believe that the above modifications allow us to compute an $\epsilon$-accurate factorization in $\mathcal{O}(N\log^{2d}(N/\epsilon))$ time and $\mathcal{O}(N\log^{d}(N/\epsilon))$ space, just as in the noiseless case.
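Symbolically, the enlarged sparsity pattern can be obtained from a Boolean sparse matrix product; a brief sketch (again with hypothetical names) is:

```python
import numpy as np
import scipy.sparse as sp

def pattern_of_LLT(L):
    """Index pairs (i, j) with i >= j in the sparsity pattern of L @ L.T,
    computed symbolically from the pattern of the sparse lower-triangular factor L."""
    B = (abs(L) > 0).astype(np.int8)     # Boolean pattern of L
    P = sp.tril((B @ B.T) > 0)           # lower-triangular part of the product pattern
    return set(zip(*P.nonzero()))
```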
6.4.2 Including the prediction points
In GP regression, we are given $N_{\mathrm{Tr}}$ points of training data and want to compute predictions at $N_{\mathrm{Pr}}$ points of test data. We denote by $\Theta_{\mathrm{Tr,Tr}}$, $\Theta_{\mathrm{Pr,Pr}}$, $\Theta_{\mathrm{Tr,Pr}}$, $\Theta_{\mathrm{Pr,Tr}}$ the covariance matrix of the training data, the covariance matrix of the test data, and the cross-covariance matrices of training and test data. Together, they form the joint covariance matrix
$$\begin{pmatrix} \Theta_{\mathrm{Tr,Tr}} & \Theta_{\mathrm{Tr,Pr}} \\ \Theta_{\mathrm{Pr,Tr}} & \Theta_{\mathrm{Pr,Pr}} \end{pmatrix}$$
of training and test data. In GP regression with training data $y \in \mathbb{R}^{N_{\mathrm{Tr}}}$, we are interested in:
• Computation of the log-likelihood $\sim y^{\top}\Theta_{\mathrm{Tr,Tr}}^{-1}y + \log\det\Theta_{\mathrm{Tr,Tr}} + N\log(2\pi)$.
• Computation of the posterior mean $y^{\top}\Theta_{\mathrm{Tr,Tr}}^{-1}\Theta_{\mathrm{Tr,Pr}}$.
• Computation of the posterior covariance $\Theta_{\mathrm{Pr,Pr}} - \Theta_{\mathrm{Pr,Tr}}\Theta_{\mathrm{Tr,Tr}}^{-1}\Theta_{\mathrm{Tr,Pr}}$.
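If a sparse lower-triangular factor $L$ with $\Theta_{\mathrm{Tr,Tr}}^{-1} \approx L L^{\top}$ is available (as in the noiseless setting), these quantities reduce to sparse matrix-vector products and sums over the diagonal of $L$. The following Python sketch, with hypothetical variable names and the cross-covariance $\Theta_{\mathrm{Tr,Pr}}$ assumed to be available as a dense array, is meant only to illustrate this reduction; it is not the implementation used in our experiments.

```python
import numpy as np

# Hypothetical inputs: L sparse lower-triangular with Theta_TrTr^{-1} ~= L @ L.T,
# y the training observations, K_TrPr a dense N_Tr x N_Pr array of Theta_{Tr,Pr}.

def gp_quantities(L, y, K_TrPr):
    N = y.shape[0]
    Lty = L.T @ y                                             # L^T y
    log_lik = -0.5 * (Lty @ Lty                               # y^T Theta^{-1} y
                      - 2.0 * np.sum(np.log(L.diagonal()))    # log det Theta ~= -2 sum log L_ii
                      + N * np.log(2.0 * np.pi))
    post_mean = (L @ Lty) @ K_TrPr                            # (Theta^{-1} y)^T Theta_{Tr,Pr}
    V = L.T @ K_TrPr
    cov_reduction = V.T @ V                                   # Theta_{Pr,Tr} Theta^{-1} Theta_{Tr,Pr}
    return log_lik, post_mean, cov_reduction                  # posterior cov = Theta_{Pr,Pr} - cov_reduction
```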
In the setting of Theorem 22, our method can be applied to accurately approximate the matrix $\Theta_{\mathrm{Tr,Tr}}$ at near-linear cost. The training covariance matrix can then be replaced by the resulting approximation for all downstream applications.
However, approximating instead the joint covariance matrix of training and prediction variables improves (1) stability and accuracy compared to computing the KL-optimal approximation of the training covariance alone, and (2) computational complexity, by circumventing the computation of most of the $N_{\mathrm{Tr}}N_{\mathrm{Pr}}$ entries of the off-diagonal part $\Theta_{\mathrm{Tr,Pr}}$ of the covariance matrix.
We can add the prediction points before or after the training points in the elimination ordering.
Ordering the prediction points first, for rapid interpolation
The computation of the mixed covariance matrix $\Theta_{\mathrm{Pr,Tr}}$ can be prohibitively expensive when interpolating with a large number of prediction points. This situation is common in spatial statistics when estimating a stochastic field throughout a large domain. In this regime, we propose to order the $\{x_i\}_{i \in I}$ by first computing the reverse maximin ordering $\prec_{\mathrm{Tr}}$ of only the training points as described in Section 6.3.1 using the original $\Omega$, writing $\ell_{\mathrm{Tr}}$ for the corresponding length scales. We then compute the reverse maximin ordering $\prec_{\mathrm{Pr}}$ of the prediction points using the modified $\tilde{\Omega} \coloneqq \Omega \cup \{x_i\}_{i \in I_{\mathrm{Tr}}}$, obtaining the length scales $\ell_{\mathrm{Pr}}$. Since $\tilde{\Omega}$ contains $\{x_i\}_{i \in I_{\mathrm{Tr}}}$, when computing the ordering of the prediction points, prediction points close to the training set will tend to have a smaller length scale $\ell$ than in the naive application of the algorithm, and thus the resulting sparsity pattern will have fewer nonzero entries. We then order the prediction points before the training points and compute $S_{(\prec_{\mathrm{Pr}},\prec_{\mathrm{Tr}}),(\ell_{\mathrm{Pr}},\ell_{\mathrm{Tr}}),\rho}$ or $S_{(\prec_{\mathrm{Pr}},\prec_{\mathrm{Tr}}),(\ell_{\mathrm{Pr}},\ell_{\mathrm{Tr}}),\rho,\lambda}$, following the same procedure as in Sections 6.3.1 and 6.3.2, respectively. The distance of each point in the prediction set to the training set can be computed in near-linear complexity using, for example, a minor variation of Algorithm 11. Writing $L$ for the resulting Cholesky factor of the joint precision matrix, we can approximate the posterior covariance as $L_{\mathrm{Pr,Pr}}^{-\top} L_{\mathrm{Pr,Pr}}^{-1}$ and the posterior mean as $-L_{\mathrm{Pr,Pr}}^{-\top} L_{\mathrm{Tr,Pr}}^{\top} y$, based on submatrices of $L$. See Section .3.3 and Algorithm 21 for additional details. We note that the idea of ordering the prediction points first (last, in their notation) has already been proposed by [136] in the context of the Vecchia approximation, although without providing an explicit algorithm.
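As an illustration of how these quantities can be read off the joint factor, the following sketch assumes that the prediction points occupy the first $N_{\mathrm{Pr}}$ rows and columns of a sparse lower-triangular factor L of the joint precision matrix; the variable names are hypothetical, and the posterior covariance is formed densely, which is only reasonable for moderately many prediction points.

```python
import numpy as np
from scipy.sparse.linalg import spsolve_triangular

def predict_from_joint_factor(L, n_pr, y):
    """Posterior mean and covariance of the prediction variables, given a sparse
    lower-triangular factor L of the joint precision matrix with the prediction
    points ordered first, and the vector y of training observations."""
    L = L.tocsr()
    L_PrPr = L[:n_pr, :n_pr]                 # leading (prediction) block, lower triangular
    L_TrPr = L[n_pr:, :n_pr]                 # training rows, prediction columns
    # Posterior mean: -L_PrPr^{-T} (L_TrPr^T y)
    mean = -spsolve_triangular(L_PrPr.T.tocsr(), L_TrPr.T @ y, lower=False)
    # Posterior covariance: (L_PrPr L_PrPr^T)^{-1}
    cov = spsolve_triangular(L_PrPr.T.tocsr(),
                             spsolve_triangular(L_PrPr, np.eye(n_pr), lower=True),
                             lower=False)
    return mean, cov
```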
If one does not use the method in Section 6.4.1 to treat additive noise, then the method described in this section amounts to making each prediction using only $\mathcal{O}(\rho^{d})$ nearby points. In the extreme case where we only have a single prediction point, this means that we are only using $\mathcal{O}(\rho^{d})$ training values for prediction. On the one hand, this can lead to improved robustness of the resulting estimator, but on the other hand, it can lead to some training data being missed entirely.
Ordering the prediction points last, for improved robustness
If we want to use the improved stability of including the prediction points, maintain near-linear complexity, and use all $N_{\mathrm{Tr}}$ training values for the prediction of even a single point, we have to include the prediction points after the training points in the elimination ordering. Naively, this would lead to a computational complexity of $\mathcal{O}(N_{\mathrm{Tr}}(\rho^{d} + N_{\mathrm{Pr}})^{2})$, which might be prohibitive for large values of $N_{\mathrm{Pr}}$. If it is enough to compute the posterior covariance only among $n_{\mathrm{Pr}}$ small batches of up to $m_{\mathrm{Pr}}$ predictions each (often, it makes sense to choose $m_{\mathrm{Pr}} = 1$), we can avoid this increase of complexity by performing prediction on groups of only $m_{\mathrm{Pr}}$ points at once, with the computation for each batch having computational complexity of only $\mathcal{O}(N_{\mathrm{Tr}}(\rho^{d} + m_{\mathrm{Pr}})^{2})$. A naive implementation would still require us to perform this procedure $n_{\mathrm{Pr}}$ times, eliminating any gains due to the batched procedure. However, careful use of the Sherman-Morrison-Woodbury matrix identity allows us to reuse the biggest part of the computation for each of the batches, thus reducing the computational cost for prediction and computation of the covariance matrix to only $\mathcal{O}(N_{\mathrm{Tr}}((\rho^{d} + m_{\mathrm{Pr}})^{2} + (\rho^{d} + m_{\mathrm{Pr}})\, n_{\mathrm{Pr}}))$. This procedure is detailed in Section .3.3 and summarized in Algorithm 23.
6.4.3 GP regression in $\mathcal{O}(N + \rho^{2d})$ space complexity
When deploying direct methods for approximate inversion of kernel matrices, a major difficulty is the superlinear memory cost that they incur. This, in particular, poses difficulties in a distributed setting or on graphics processing units. In the
following, $I = I_{\mathrm{Tr}}$ denotes the indices of the training data, and we write $\Theta \coloneqq \Theta_{\mathrm{Tr,Tr}}$, while $I_{\mathrm{Pr}}$ denotes those of the test data. In order to compute the log-likelihood, we need to compute the matrix-vector product $L^{\rho,\top} y$, as well as the diagonal entries of $L^{\rho}$. This can be done by computing the columns $L^{\rho}_{:,k}$ of $L^{\rho}$ individually using (6.3) and setting $(L^{\rho,\top} y)_k = (L^{\rho}_{:,k})^{\top} y$ and $L^{\rho}_{kk} = (L^{\rho}_{:,k})_k$, without ever forming the matrix $L^{\rho}$. Similarly, in order to compute the posterior mean, we only need to compute $\Theta^{-1} y = L^{\rho} L^{\rho,\top} y$, which only requires us to compute each column of $L^{\rho}$ twice, without ever forming the entire matrix. In order to compute the posterior covariance, we need to compute the matrix-matrix product $L^{\rho,\top} \Theta_{\mathrm{Tr,Pr}}$, which again can be performed by computing each column of $L^{\rho}$ once, without ever forming the entire matrix $L^{\rho}$. However, it does require us to know beforehand at which points we want to make predictions. The submatrices $\Theta_{s_j, s_j}$ for all $j$ belonging to the supernode $\tilde{k}$ (i.e., $j \leadsto \tilde{k}$) can be formed from a list of the elements of $\tilde{s}_{\tilde{k}}$. Thus, the overall memory complexity of the resulting algorithm is $\mathcal{O}\big(\sum_{\tilde{k} \in \tilde{I}} \#\tilde{s}_{\tilde{k}}\big) = \mathcal{O}(N_{\mathrm{Tr}} + N_{\mathrm{Pr}} + \rho^{2d})$. The above-described procedure is implemented in Algorithms 19 and 20 in Section .3.1. In a distributed setting with workers $W_1, W_2, \ldots$, this requires communicating only $\mathcal{O}(\#\tilde{s}_{\tilde{k}})$ floating-point numbers to worker $W_{\tilde{k}}$, which then performs $\mathcal{O}((\#\tilde{s}_{\tilde{k}})^{3})$ floating-point operations; a naive implementation would require the communication of $\mathcal{O}((\#\tilde{s}_{\tilde{k}})^{2})$ floating-point numbers to perform the same number of floating-point operations.
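As an illustration of the column-by-column strategy, assume a hypothetical routine column(k) that recomputes the $k$-th column of $L^{\rho}$ from $\Theta_{s_k,s_k}$ as in (6.3) and returns its row indices and values; the quantities required for the log-likelihood and the posterior mean can then be accumulated without ever storing $L^{\rho}$, as sketched below.

```python
import numpy as np

# Hypothetical: column(k) returns (rows, vals), the ndarray of row indices and the
# values of the k-th column of L^rho, recomputed on the fly from Theta_{s_k, s_k}.

def streaming_loglik_terms(column, y, N):
    """Accumulate L^{rho,T} y and diag(L^rho) in one pass over the columns."""
    Lty = np.zeros(N)
    diag = np.zeros(N)
    for k in range(N):
        rows, vals = column(k)
        Lty[k] = vals @ y[rows]            # (L^{rho,T} y)_k = (L^rho_{:,k})^T y
        diag[k] = vals[rows == k][0]       # L^rho_{kk}; the diagonal entry is always present
    return Lty, diag

def streaming_theta_inv_y(column, y, N):
    """Compute Theta^{-1} y ~= L^rho (L^{rho,T} y), visiting each column twice."""
    Lty, _ = streaming_loglik_terms(column, y, N)
    out = np.zeros(N)
    for k in range(N):
        rows, vals = column(k)
        out[rows] += vals * Lty[k]         # add the contribution of column k
    return out
```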