6.3 Ordering and sparsity pattern motivated by the screening effect

Figure 6.2: The reverse maximin ordering. To obtain the reverse maximin ordering, for $k = N-1, N-2, \dots, 1$, we successively select the point $x_{i_k}$ that has the largest distance $\ell_{i_k}$ to those points $x_{i_{k+1}}, \dots, x_{i_N}$ selected previously (shown as enlarged). All previously selected points within distance $\rho \ell_{i_k}$ of $x_{i_k}$ (here, $\rho = 2$) form the $k$-th column of the sparsity pattern.

Algorithm 12: Without aggregation
Input: $\mathcal{G}$, $\{x_i\}_{i \in I}$, $\prec$, $S_{\prec, \ell, \rho}$
Output: $L \in \mathbb{R}^{N \times N}$ lower triangular w.r.t. $\prec$
1: for $k \in I$ do
2:   for $i, j \in s_k$ do
3:     $(\Theta_{s_k, s_k})_{ij} \leftarrow \mathcal{G}(x_i, x_j)$
4:   end for
5:   $L_{s_k, k} \leftarrow \Theta_{s_k, s_k}^{-1} e_k$
6:   $L_{s_k, k} \leftarrow L_{s_k, k} / \sqrt{L_{k, k}}$
7: end for
8: return $L$

Algorithm 13: With aggregation
Input: $\mathcal{G}$, $\{x_i\}_{i \in I}$, $\prec$, $\tilde{S}_{\prec, \ell, \rho, \lambda}$
Output: $L \in \mathbb{R}^{N \times N}$ lower triangular w.r.t. $\prec$
1: for $\tilde{k} \in \tilde{I}$ do
2:   for $i, j \in \tilde{s}_{\tilde{k}}$ do
3:     $(\Theta_{\tilde{s}_{\tilde{k}}, \tilde{s}_{\tilde{k}}})_{ij} \leftarrow \mathcal{G}(x_i, x_j)$
4:   end for
5:   $U \leftarrow P^{\updownarrow} \operatorname{chol}(P^{\updownarrow} \Theta_{\tilde{s}_{\tilde{k}}, \tilde{s}_{\tilde{k}}} P^{\updownarrow}) P^{\updownarrow}$
6:   for $k \rightsquigarrow \tilde{k}$ do
7:     $L_{s_k, k} \leftarrow U^{-\top} e_k$
8:   end for
9: end for
10: return $L$

Figure 6.3: KL-minimizing Cholesky factorization. KL-minimization with and without using aggregation. For notational convenience, all matrices are assumed to have row and column ordering according to $\prec$. $P^{\updownarrow}$ denotes the order-reversing permutation matrix, and $e_k$ is the vector with a one in the $k$-th component and zeros elsewhere.

Write $\ell_{i_k} = \operatorname{dist}\bigl(x_{i_k}, \{x_{i_{k+1}}, \dots, x_{i_N}\} \cup \partial\Omega\bigr)$, and write $i \prec j$ if $i$ precedes $j$ in the reverse maximin ordering. We collect the $\{\ell_i\}_{i \in I}$ into a vector denoted by $\ell$.

For a tuning parameter $\rho \in \mathbb{R}_+$, we select the sparsity set $S_{\prec, \ell, \rho} \subset I \times I$ as
$$S_{\prec, \ell, \rho} := \bigl\{(i, j) : j \preceq i, \ \operatorname{dist}(x_i, x_j) \le \rho \ell_j\bigr\}. \qquad (6.9)$$
The reverse maximin ordering and sparsity pattern are illustrated in Figure 6.2.
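As a point of reference, the following is a minimal NumPy sketch of the reverse maximin ordering and of the sparsity set (6.9). It uses a quadratic-time greedy selection rather than the near-linear-time construction based on Algorithm 11 discussed below, it ignores the boundary term $\partial\Omega$, and all function and variable names are our own.

```python
import numpy as np

def reverse_maximin(points):
    """Greedy O(N^2) reverse maximin ordering (boundary term omitted).

    Returns `order` (original indices listed from first to last w.r.t. the
    ordering, i.e. order[N-1] is the point selected first) and the length
    scales `ell` in the same ordered numbering.
    """
    N = len(points)
    order = np.empty(N, dtype=int)
    ell = np.empty(N)
    order[N - 1] = N - 1                      # arbitrary first selection
    ell[N - 1] = np.inf
    dist = np.linalg.norm(points - points[N - 1], axis=1)
    dist[N - 1] = -np.inf                     # never select twice
    for k in range(N - 2, -1, -1):
        i_k = int(np.argmax(dist))            # farthest from the selected set
        order[k], ell[k] = i_k, dist[i_k]
        dist = np.minimum(dist, np.linalg.norm(points - points[i_k], axis=1))
        dist[i_k] = -np.inf
    return order, ell

def sparsity_pattern(points, order, ell, rho):
    """Columns of S in (6.9), in the ordered numbering: column j holds all
    positions i >= j whose point lies within distance rho * ell[j] of x_j."""
    X = points[order]
    return [j + np.flatnonzero(np.linalg.norm(X[j:] - X[j], axis=1)
                               <= rho * ell[j])
            for j in range(len(X))]
```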

We can use a minor modification of Algorithm 11 to construct the reverse maximin ordering and sparsity pattern in computational complexity $O(N \log^2(N) \rho^{\tilde{d}})$ in time and $O(N \rho^{\tilde{d}})$ in space, where $\tilde{d} \le d$ is the intrinsic dimension of the dataset, as will be defined in Condition 5. The inverse Cholesky factors $L$ can then be computed using Equation (6.3), as in Algorithm 12.
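To complement the pseudocode, here is a hedged NumPy sketch of Algorithm 12: for each column $k$, a small dense kernel matrix on the sparsity set $s_k$ is assembled, solved against the unit vector $e_k$, and rescaled by $\sqrt{L_{k,k}}$ (lines 5-6 of the pseudocode). The `columns` argument can be produced by the sketch above; the dense `np.linalg.solve` call and all names are our own choices, not part of the original algorithm.

```python
import numpy as np

def kl_cholesky(X, columns, kernel):
    """Algorithm 12 (no aggregation), assuming the rows of X are already
    listed in the reverse maximin ordering and columns[k] is the sorted
    index set s_k of column k (so columns[k][0] == k).

    Returns a dense lower-triangular L with L @ L.T approximating Theta^{-1}.
    """
    N = len(X)
    L = np.zeros((N, N))
    for k in range(N):
        s = np.asarray(columns[k])
        m = len(s)
        Theta_s = np.empty((m, m))            # lines 2-4: Theta_{s_k, s_k}
        for a in range(m):
            for b in range(m):
                Theta_s[a, b] = kernel(X[s[a]], X[s[b]])
        e_k = np.zeros(m)
        e_k[0] = 1.0                          # k is the first entry of s_k
        col = np.linalg.solve(Theta_s, e_k)   # line 5: Theta^{-1} e_k
        L[s, k] = col / np.sqrt(col[0])       # line 6: divide by sqrt(L_kk)
    return L

# Example usage with an assumed Matern-3/2 kernel:
# kernel = lambda x, y: (1 + np.sqrt(3) * np.linalg.norm(x - y)) * \
#     np.exp(-np.sqrt(3) * np.linalg.norm(x - y))
# L = kl_cholesky(points[order], sparsity_pattern(points, order, ell, rho),
#                 kernel)
```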

6.3.2 Aggregated sparsity pattern

It was already observed by [83] in the context of sparse approximate inverses, and by [104, 226] in the context of the Vecchia approximation, that a suitable grouping of the degrees of freedom makes it possible to reuse Cholesky factorizations of the matrices $\Theta_{s_i, s_i}$ in Equation (6.3) to update multiple columns at once. The authors of [83, 104] propose grouping heuristics based on the sparsity graph of $L$ and show empirically that they lead to improved performance. In contrast, we propose a grouping procedure based on geometric information and prove that it allows us to reach the best asymptotic complexity in the literature in a more concrete setting.

Figure 6.4: Geometric aggregation. The left figure illustrates the original pattern $S_{\prec, \ell, \rho}$. For each orange point, we need to keep track of its interactions with all points within a circle of radius $\approx \rho$. In the right figure, the points have been collected into a supernode, which can be represented by a list of parents (the orange points within an inner sphere of radius $\approx \rho$) and children (all points within a radius $\approx 2\rho$).

Assume that we have already computed the reverse maximin ordering $\prec$ and the sparsity pattern $S_{\prec, \ell, \rho}$, and that we have access to the $\ell_i$ as defined above. We will now aggregate the points into groups called supernodes, consisting of points that are close in both location and ordering. To do so, we pick at each step the first (w.r.t. $\prec$) index $i \in I$ that has not been aggregated into a supernode yet, and then we aggregate into a common supernode the indices in $\{j : (j, i) \in S_{\prec, \ell, \rho}, \ \ell_j \le \lambda \ell_i\}$ that have not been aggregated yet, for some $\lambda > 1$ ($\lambda \approx 1.5$ is typically a good choice). We proceed with this procedure until every node has been aggregated into a supernode. We write $\tilde{I}$ for the set of all supernodes; for $i \in I$, $\tilde{i} \in \tilde{I}$, we write $i \rightsquigarrow \tilde{i}$ if $\tilde{i}$ is the supernode to which $i$ has been aggregated. We furthermore define $\tilde{s}_{\tilde{i}} := \{j : \exists i \rightsquigarrow \tilde{i}, \ j \in s_i\}$ and introduce the aggregated sparsity pattern $\tilde{S}_{\prec, \ell, \rho, \lambda} := \bigcup_{k \rightsquigarrow \tilde{k}} \{(i, k) : i \in \tilde{s}_{\tilde{k}}\}$. This sparsity pattern, while larger than $S_{\prec, \ell, \rho}$, can be represented efficiently by keeping track of the set of parents (the $k \in I$ such that $k \rightsquigarrow \tilde{k}$) and children (the $i \in \tilde{s}_{\tilde{k}}$) of each supernode, rather than the individual entries (see Figure 6.4 for an illustration). For well-behaved (cf. Theorem 21) sets of points, we obtain $O(N \rho^{-d})$ supernodes, each with $O(\rho^d)$ parents and children, thus improving the cost of storing the sparsity pattern from $O(N \rho^d)$ to $O(N)$.


Figure 6.5: Reusing Cholesky factors. (Left:) By adding a few nonzero entries to the sparsity pattern, the sparsity patterns of the columns in $\tilde{s}_{\tilde{k}}$ become subsets of one another. (Right:) Therefore, the matrices $\{\Theta_{s_k, s_k}\}_{k \rightsquigarrow \tilde{k}}$, which need to be inverted to compute the columns $L_{:, k}$ for $k \rightsquigarrow \tilde{k}$, become submatrices of one another. Thus, submatrices of the Cholesky factors of $\Theta_{\tilde{s}_{\tilde{k}}, \tilde{s}_{\tilde{k}}}$ can be used as factors of $\Theta_{s_k, s_k}$ for any $k \rightsquigarrow \tilde{k}$.

While the above aggregation procedure can be performed efficiently once $\prec$ and $S_{\prec, \ell, \rho}$ are computed, it is possible to directly compute $\prec$ and an outer approximation $\bar{S}_{\prec, \ell, \rho, \lambda} \supset \tilde{S}_{\prec, \ell, \rho, \lambda}$ in computational complexity $O(N)$ in space and $O(N \log(N))$ in time. $\bar{S}_{\prec, \ell, \rho, \lambda}$ can either be used directly, or it can be used to compute $\tilde{S}_{\prec, \ell, \rho, \lambda}$ in $O(N)$ in space and $O(N \log(N) \rho^d)$ in time, using a simple and embarrassingly parallel algorithm. Details are given in Section .3.4.

In addition to reducing the memory cost, the aggregated ordering and sparsity pattern allow us to compute the Cholesky factorization (in reverse ordering) $\Theta_{\tilde{s}_{\tilde{k}}, \tilde{s}_{\tilde{k}}} = U U^{\top}$ once for each supernode and then use it to compute $L_{s_k, k}$ for all $k \rightsquigarrow \tilde{k}$, as in Algorithm 13 (see Figure 6.5 for an illustration).

As we show in the next section, this allows us to reduce the computational complexity from $O(N \rho^{3d})$ to $O(N \rho^{2d})$ for sufficiently well-behaved sets of points.

6.3.3 Theoretical guarantees

We now present our rigorous theoretical results bounding the computational complexity and approximation error of our method. Proofs and additional details are deferred to Section .3.2.

Remark 1. As detailed in Section .3.2, the results below apply to more general reverse $r$-maximin orderings, which can be computed in complexity $O(N \log(N))$, improving over reverse maximin orderings by a factor of $\log(N)$.

Computational complexity

We can derive the following bounds on the computational complexity, depending on $\rho$ and $N$.

Theorem 21 (Informal). Under mild assumptions on $\{x_i\}_{i \in I} \subset \mathbb{R}^d$, the KL-minimizer $L$ is computed in complexity $C N \rho^d$ in space and $C N \rho^{3d}$ in time when using Algorithm 12 with $S_{\prec, \ell, \rho}$, and in complexity $C N \rho^d$ in space and $C_{\lambda, \ell}\, C N \rho^{2d}$ in time when using Algorithm 13 with $\tilde{S}_{\prec, \ell, \rho, \lambda}$. Here, the constant $C$ depends only on $d$, $\lambda$, and the cost of evaluating entries of $\Theta$.

A more formal statement and a proof of Theorem 21 can be found in Section .3.2.

As can be seen from Theorem 21, using the aggregation scheme decreases the computational cost by a factor of $\rho^d$. This is because each supernode has $\approx \rho^d$ members that can all be updated by reusing the same Cholesky factorization.
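To make the factor-$\rho^d$ gain explicit, here is the back-of-the-envelope operation count behind Theorem 21 (an informal count under the assumption that every column and every supernode involves $\approx \rho^d$ indices, not a formal proof):
\begin{align*}
\text{Algorithm 12:}\quad & N \cdot \underbrace{O(\rho^{3d})}_{\text{dense solve with } \Theta_{s_k, s_k}} = O(N \rho^{3d}),\\
\text{Algorithm 13:}\quad & \underbrace{N \rho^{-d}}_{\#\text{supernodes}} \cdot \underbrace{O(\rho^{3d})}_{\text{one Cholesky of } \Theta_{\tilde{s}_{\tilde{k}}, \tilde{s}_{\tilde{k}}}} + N \cdot \underbrace{O(\rho^{2d})}_{\text{triangular solve per column}} = O(N \rho^{2d}).
\end{align*}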

Remark 2. As described in Section .3.2, the computational complexity only depends on the intrinsic dimension of the dataset (as opposed to the potentially much larger ambient dimension $d$). This means that the algorithm automatically exploits low-dimensional structure in the data to decrease the computational complexity.

Approximation error

The rigorous bounds on the approximation error are derived from Theorem 10 and therefore hold under the same assumptions. We assume for the purpose of this section that $\Omega$ is a bounded domain of $\mathbb{R}^d$ with Lipschitz boundary, and for an integer $s > d/2$, we write $H^s_0(\Omega)$ for the usual Sobolev space of functions with zero Dirichlet boundary values and order-$s$ derivatives in $L^2$, and $H^{-s}(\Omega)$ for its dual.

Let the operator
$$\mathcal{L} : H^s_0(\Omega) \to H^{-s}(\Omega), \qquad (6.10)$$
be linear, symmetric ($\int u \mathcal{L} v = \int v \mathcal{L} u$), positive ($\int u \mathcal{L} u \ge 0$), bijective, bounded (write $\|\mathcal{L}\| := \sup_u \|\mathcal{L} u\|_{H^{-s}(\Omega)} / \|u\|_{H^s_0(\Omega)}$ for its operator norm), and local in the sense that $\int u \mathcal{L} v \, \mathrm{d}x = 0$ for all $u, v \in H^s_0(\Omega)$ with disjoint support. By the Sobolev embedding theorem, we have $H^s_0(\Omega) \subset C_0(\Omega)$, and hence $\{\delta_x\}_{x \in \Omega} \subset H^{-s}(\Omega)$. We then define $\mathcal{G}$ as the Green's function of $\mathcal{L}$,
$$\mathcal{G}(x_1, x_2) := \int \delta_{x_1} \mathcal{L}^{-1} \delta_{x_2} \, \mathrm{d}x. \qquad (6.11)$$
A simple example, when $d = 1$ and $\Omega = (0, 1)$, is $\mathcal{L} = -\Delta$ and $\mathcal{G}(x, y) = \mathbb{1}_{x < y} (1 - y) x + \mathbb{1}_{y \le x} (1 - x) y$.

Let us define the following measure of homogeneity of the distribution of $\{x_i\}_{i \in I}$,
$$\delta := \frac{\min_{x_i \ne x_j} \operatorname{dist}(x_i, \{x_j\} \cup \partial\Omega)}{\max_{x \in \Omega} \operatorname{dist}(x, \{x_i\}_{i \in I} \cup \partial\Omega)}. \qquad (6.12)$$

Using the above definitions, we can rigorously quantify the increase in approximation accuracy as $\rho$ increases.

Theorem 22. There exists a constant $C$, depending only on $d$, $\Omega$, $\lambda$, $s$, $\|\mathcal{L}\|$, $\|\mathcal{L}^{-1}\|$, and $\delta$, such that for $\rho \ge C \log(N/\epsilon)$, we have
$$\mathbb{D}_{\mathrm{KL}}\Bigl(\mathcal{N}(0, \Theta) \,\Big\|\, \mathcal{N}\bigl(0, (L^{\rho} L^{\rho, \top})^{-1}\bigr)\Bigr) + \bigl\|\Theta - (L^{\rho} L^{\rho, \top})^{-1}\bigr\|_{\mathrm{Fro}} \le \epsilon. \qquad (6.13)$$
Thus, Algorithm 12 computes an $\epsilon$-accurate approximation of $\Theta$ in computational complexity $C N \log^d(N/\epsilon)$ in space and $C N \log^{3d}(N/\epsilon)$ in time, from $C N \log^d(N/\epsilon)$ entries of $\Theta$. Similarly, Algorithm 13 computes an $\epsilon$-accurate approximation of $\Theta$ in computational complexity $C N \log^d(N/\epsilon)$ in space and $C N \log^{2d}(N/\epsilon)$ in time, from $C N \log^d(N/\epsilon)$ entries of $\Theta$.

To the best of our knowledge, the above result is the best known complexity/accuracy trade-off for kernel matrices based on Green's functions of elliptic boundary value problems. In particular, we improve upon the methods in Chapter 5, where we showed that the Cholesky factors of $\Theta$ (as opposed to those of $\Theta^{-1}$) can be approximated in computational complexity $O(N \log^2(N) \log^{2d}(N/\epsilon))$ in time and $O(N \log(N) \log^d(N/\epsilon))$ in space using zero-fill-in incomplete Cholesky factorization (Algorithm 15) applied to $\Theta$.

Screening in theory and practice

The theory described in the last section covers any self-adjoint operator $\mathcal{L}$ with an associated quadratic form
$$\mathcal{L}[u] := \int_{\Omega} u \mathcal{L} u \, \mathrm{d}x = \sum_{k=0}^{s} \int \sigma^{(k)}(x) \, \bigl\| D^{(k)} u(x) \bigr\|^2 \, \mathrm{d}x$$
with $\sigma^{(s)} \in L^2(\Omega)$ positive almost everywhere. That is, $\mathcal{L}[u]$ is a weighted average of the squared norms of derivatives of $u$ and thus measures the roughness of $u$. We can formally think of a Gaussian process with covariance function given by $\mathcal{G}$ as having density $\sim \exp(-\mathcal{L}[u]/2)$ and therefore assigning an exponentially low probability to "rough" functions, making it a prototypical smoothness prior. Theorems 10 and 9 imply that these Gaussian processes are subject to an exponentially strong screening effect in the sense that, after conditioning on a set of $\ell$-dense points, the conditional covariance of a given point decays exponentially with rate $\sim \ell^{-1}$, as shown in the first panel of Figure 6.6.

Figure 6.6: Limitations of screening. To illustrate the screening effect exploited by our methods, we plot the conditional correlation with the point in red, conditional on the blue points. In the first panel, the points are evenly distributed, leading to a rapidly decreasing conditional correlation. In the second panel, the same number of points is irregularly distributed, slowing the decay. In the last panel, we are at the fringe of the set of observations, weakening the screening effect.

The most closely related model in common use is the Matérn covariance function [167], which is the Green's function of an elliptic PDE of order $s$ when choosing the "smoothness parameter" $\nu$ as $\nu = s - d/2$. While our theory only covers $s \in \mathbb{N}$, we observed in Section 6.5 that Matérn kernels with non-integer values of $s$, and even the "Cauchy class" [95], seem to be subject to similar behavior. In the second panel of Figure 6.6, we show that as the distribution of conditioning points becomes more irregular, the screening effect weakens. In our theoretical results, this is controlled by the upper bound on $\delta$ in (6.12). The screening effect is significantly weakened close to the boundary of the domain, as illustrated in the third panel of Figure 6.6 (cf. Figure 5.4). This is the reason why our theoretical results, unlike those for the Matérn covariance, are restricted to Green's functions with zero Dirichlet boundary condition, which corresponds to conditioning the process to be zero on $\partial\Omega$. A final limitation is that the screening effect weakens as we take the order of smoothness to infinity, obtaining, for instance, the Gaussian kernel. However, according to Theorem 2, this results in matrices that have efficient low-rank approximations instead.
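To see the screening effect numerically, one can condition a Gaussian process on a regular grid of observations and inspect how the conditional correlation with a fixed target point decays with distance. The sketch below uses standard Gaussian conditioning and an assumed Matérn-3/2 kernel on $[0,1]^2$ (so it ignores the boundary issue discussed above); all parameter choices are illustrative.

```python
import numpy as np

def matern32(r, length_scale=0.5):
    """Matern covariance with smoothness nu = 3/2."""
    a = np.sqrt(3.0) * r / length_scale
    return (1.0 + a) * np.exp(-a)

# Conditioning set: a regular grid in [0, 1]^2; target point at the centre.
g = np.linspace(0.05, 0.95, 10)
cond = np.array([[xx, yy] for xx in g for yy in g])
dists = np.linspace(0.01, 0.4, 8)
pts = np.vstack([[[0.5, 0.5]], [[0.5 + d, 0.5] for d in dists]])

K_aa = matern32(np.linalg.norm(pts[:, None] - pts[None, :], axis=-1))
K_ab = matern32(np.linalg.norm(pts[:, None] - cond[None, :], axis=-1))
K_bb = matern32(np.linalg.norm(cond[:, None] - cond[None, :], axis=-1))

# Conditional covariance of (target, probes) given the grid observations.
C = K_aa - K_ab @ np.linalg.solve(K_bb, K_ab.T)
corr = C[0, 1:] / np.sqrt(C[0, 0] * np.diag(C)[1:])
for d, c in zip(dists, corr):
    print(f"distance {d:.2f}: conditional correlation {c:+.3f}")
```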
