Chapter VI: Kernel Flows
6.1 Parametric KF Algorithm
Algorithm 5: Parametric KF algorithm
1: Input: dataset $(x_i, y_i)_{1 \le i \le N}$, kernel family $K_\theta$, initial parameter $\theta_0$
2: $\theta \leftarrow \theta_0$
3: repeat
4:   Randomly select $\{s_f(1), \ldots, s_f(N_f)\}$ from $\{1, \ldots, N\}$ without replacement.
5:   $X^f \leftarrow (x_{s_f(i)})_{1 \le i \le N_f}$ and $Y^f \leftarrow (y_{s_f(i)})_{1 \le i \le N_f}$.
6:   Randomly select $\{s_c(1), \ldots, s_c(N_c)\}$ from $\{s_f(1), \ldots, s_f(N_f)\}$ without replacement.
7:   $X^c \leftarrow (x_{s_c(i)})_{1 \le i \le N_c}$ and $Y^c \leftarrow (y_{s_c(i)})_{1 \le i \le N_c}$.
8:   $\theta \leftarrow \theta - \delta \nabla_\theta \rho(\theta, X^f, Y^f, X^c, Y^c)$
9: until end criterion is met
10: Return optimized kernel parameter $\theta^* \leftarrow \theta$
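The steps above translate directly into code. The following is a minimal sketch (not the reference implementation of [103]) of steps 2-10, assuming the data are stored in NumPy arrays and that a loss function `rho(theta, Xf, Yf, Xc, Yc)` implementing the loss defined in (6.1.1) below is supplied; the gradient in step 8 is approximated here by central finite differences purely for simplicity (automatic differentiation could be used instead).

```python
import numpy as np

def kernel_flow(X, Y, rho, theta0, N_f, N_c, delta=0.1, n_iter=500, seed=0):
    """Parametric KF: stochastic gradient descent of rho over the kernel parameter theta."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float)).copy()   # step 2
    for _ in range(n_iter):                                         # steps 3-9
        s_f = rng.choice(len(X), size=N_f, replace=False)           # step 4
        Xf, Yf = X[s_f], Y[s_f]                                     # step 5
        s_c = rng.choice(N_f, size=N_c, replace=False)              # step 6
        Xc, Yc = Xf[s_c], Yf[s_c]                                   # step 7
        # step 8: finite-difference gradient of rho with respect to theta
        grad = np.zeros_like(theta)
        eps = 1e-6
        for i in range(theta.size):
            e = np.zeros_like(theta)
            e[i] = eps
            grad[i] = (rho(theta + e, Xf, Yf, Xc, Yc)
                       - rho(theta - e, Xf, Yf, Xc, Yc)) / (2 * eps)
        theta -= delta * grad
    return theta                                                    # step 10: optimized parameter
```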
As inputs, in step 1, the parametric variant of the KF algorithm takes a training dataset $(x_i, y_i)_{1 \le i \le N}$, a parametric family of kernels $K_\theta$, and an initial parameter $\theta_0$. The kernel parameter $\theta$ is first initialized to $\theta_0$ in step 2. The main iterative process takes place between steps 3 and 9. It begins, in steps 4 and 5, with the random selection of subvectors $X^f$ and $Y^f$ of size $N_f$ of $X = (x_1, \ldots, x_N)$ and $Y = (y_1, \ldots, y_N)$ (through uniform sampling without replacement in the index set $\{1, \ldots, N\}$). This selection is then subsampled further, in steps 6 and 7, into subvectors $X^c$ and $Y^c$ of length $N_c$ of $X^f$ and $Y^f$ (by selecting, uniformly at random and without replacement, $N_c$ of the indices defining $X^f$).
Finally, in step 8, we update the kernel parameter $\theta$ by a gradient descent step on the loss $\rho$. We define $\rho(\theta, X^f, Y^f, X^c, Y^c)$ as the squared relative error (in the RKHS norm^1 $\|\cdot\|_{K_\theta}$ defined by $K_\theta$) between the interpolants $u^{*,f}$ and $u^{*,c}$ obtained from the two nested subsets of the dataset and the kernel $K_\theta$, i.e.^2
\[
\rho(\theta, X^f, Y^f, X^c, Y^c) := \frac{\|u^{*,f} - u^{*,c}\|_{K_\theta}^2}{\|u^{*,f}\|_{K_\theta}^2}
= 1 - \frac{Y^{c,\top} K_\theta(X^c, X^c)^{-1} Y^c}{Y^{f,\top} K_\theta(X^f, X^f)^{-1} Y^f}.
\tag{6.1.1}
\]

Note that the loss $\rho$ is doubly randomized through the selection of the batches $(X^f, Y^f)$ and sub-batches $(X^c, Y^c)$. The gradient of $\rho$ with respect to $\theta$ is then computed and $\theta$ is updated as $\theta \leftarrow \theta - \delta \nabla_\theta \rho$. This process is iterated until a stopping criterion is satisfied, such as a maximum number of iteration steps or $\rho < \rho_{\mathrm{stop}}$. The optimized kernel parameter $\theta^*$ is returned, and the corresponding kernel $K_{\theta^*}$ can then be used to interpolate test points. Since the iteration is a stochastic gradient descent on $\rho$, the KF algorithm is a stochastic gradient descent algorithm.

^1 Note that convergence in the RKHS norm implies pointwise convergence.
^2 Here $u^{*,f}(\cdot) = K_\theta(\cdot, X^f) K_\theta(X^f, X^f)^{-1} Y^f$ and $u^{*,c}(\cdot) = K_\theta(\cdot, X^c) K_\theta(X^c, X^c)^{-1} Y^c$. Moreover, $\rho$ admits the representation on the right-hand side of (6.1.1), which enables its computation [103, Prop. 3.1].
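For concreteness, here is a minimal NumPy sketch of the loss (6.1.1), computed through its right-hand-side representation. It assumes one-dimensional scalar inputs and a generic kernel family supplied as a Python callable; the Gaussian family shown is only an illustrative placeholder, not a kernel used in this chapter.

```python
import numpy as np
from functools import partial

def rho(theta, Xf, Yf, Xc, Yc, kernel):
    """Loss of eq. (6.1.1): 1 - (Yc' K(Xc,Xc)^{-1} Yc) / (Yf' K(Xf,Xf)^{-1} Yf)."""
    Kff = kernel(theta, Xf, Xf) + 1e-10 * np.eye(len(Xf))   # small jitter for numerical stability
    Kcc = kernel(theta, Xc, Xc) + 1e-10 * np.eye(len(Xc))
    num = Yc @ np.linalg.solve(Kcc, Yc)
    den = Yf @ np.linalg.solve(Kff, Yf)
    return 1.0 - num / den

def gaussian_kernel(theta, A, B):
    """Placeholder parametric family: Gaussian kernel with length-scale theta[0] on scalar inputs."""
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * theta[0] ** 2))

# Example: bind the kernel family and feed the resulting loss to the loop sketched above, e.g.
#   theta_star = kernel_flow(X, Y, partial(rho, kernel=gaussian_kernel), theta0=[1.0], N_f=32, N_c=16)
loss = partial(rho, kernel=gaussian_kernel)   # loss(theta, Xf, Yf, Xc, Yc)
```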
The $\ell_2$-norm variant. In the application of the KF algorithm to NNs, we will consider the $\ell_2$-norm variant of this algorithm (introduced in [103, Sec. 10]), in which the instantaneous loss $\rho$ in (6.1.1) is replaced by the error (writing $\|\cdot\|_2$ for the Euclidean $\ell_2$ norm) $\rho_2 := \|Y^f - u^{*,c}(X^f)\|_2^2$ of $u^{*,c}$ in predicting the labels $Y^f$, i.e.
\[
\rho_2(\theta, X^f, Y^f, X^c, Y^c) := \big\|Y^f - K_\theta(X^f, X^c) K_\theta(X^c, X^c)^{-1} Y^c\big\|_2^2.
\tag{6.1.2}
\]
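A corresponding sketch of (6.1.2), under the same assumed conventions as the previous snippet (scalar inputs, kernel family passed as a callable):

```python
import numpy as np

def rho_2(theta, Xf, Yf, Xc, Yc, kernel):
    """l2 variant of eq. (6.1.2): squared Euclidean error of the coarse interpolant on the batch."""
    Kfc = kernel(theta, Xf, Xc)
    Kcc = kernel(theta, Xc, Xc) + 1e-10 * np.eye(len(Xc))   # small jitter for numerical stability
    pred = Kfc @ np.linalg.solve(Kcc, Yc)                   # u^{*,c}(X^f)
    return float(np.sum((Yf - pred) ** 2))
```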
A simple PDE model

To motivate, illustrate and study the parametric KF algorithm, it is useful to start with an application [103, Sec. 5] to the following simple PDE model, which is amenable to detailed analysis [101]. Let $u$ be the solution of the second-order elliptic PDE
\[
\begin{cases}
-\operatorname{div}\big(a(x)\nabla u(x)\big) = f(x), & x \in \Omega;\\
u = 0 & \text{on } \partial\Omega,
\end{cases}
\tag{6.1.3}
\]
with interval $\Omega \subset \mathbb{R}^1$ and uniformly elliptic symmetric matrix $a$ with entries in $L^\infty(\Omega)$. We write $\mathcal{L} := -\operatorname{div}(a\nabla\cdot)$ for the corresponding linear bijection from $H^1_0(\Omega)$ to $H^{-1}(\Omega)$. In this simple application we seek both to estimate the conductivity coefficient $a$ and to recover the solution of (6.1.3) from the data $(x_i, y_i)_{1 \le i \le N}$ and the information $u(x_i) = y_i$. In this example, we use the kernel family $\{G_b \mid b \in L^\infty(\Omega),\ \operatorname{ess\,inf}_\Omega(b) > 0\}$, where each $G_b$ is the Green's function of the operator $-\operatorname{div}(b\nabla\cdot)$. It is known that kernel recovery with $G_a$ is minimax optimal in the energy norm $\|\cdot\|_a$ [101]. In what follows, we will numerically demonstrate that the KF algorithm applied to this problem recovers a kernel $G_{b^*}$ such that $b^* \approx a$.
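To make the setting concrete, the following is a minimal sketch of the discretized model, assuming $\Omega = (0,1)$, a uniform mesh of size $2^{-8}$, and piecewise linear finite elements. The conductivity and right-hand side below are arbitrary placeholders (not the $a$ and $f$ of Fig. 6.1), and the inverse stiffness matrix is used as a standard discrete surrogate for the Green's-function kernel on the grid nodes.

```python
import numpy as np

def stiffness(a_mid, h):
    """P1 finite-element stiffness matrix of -div(a grad .) on (0,1) with u = 0 on the boundary.
    a_mid[e] is the conductivity on element e (between nodes e and e+1); h is the mesh size."""
    d = (a_mid[:-1] + a_mid[1:]) / h        # node i touches elements i and i+1
    off = -a_mid[1:-1] / h                  # coupling through the element shared by nodes i and i+1
    return np.diag(d) + np.diag(off, 1) + np.diag(off, -1)

m = 2 ** 8                                  # number of elements; interior nodes i/2^8, i = 1..m-1
h = 1.0 / m
x_mid = (np.arange(m) + 0.5) * h            # element midpoints
a = np.exp(np.sin(2 * np.pi * x_mid))       # placeholder conductivity (illustrative only)
f = np.ones(m - 1)                          # placeholder right-hand side at the interior nodes
A = stiffness(a, h)
u = np.linalg.solve(A, h * f)               # discrete solution of (6.1.3)

# Discrete Green's-function kernel of -div(a grad .): G_a(x_i, x_j) corresponds to (A^{-1})_{ij}
# on the interior nodes, so kernel interpolation with G_a uses sub-blocks of this matrix.
G = np.linalg.inv(A)
```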
Figure 6.1: [103, Fig. 5]. (1) $a$, (2) $f$, (3) $u$, (4) $\rho(a)$ and $\rho(b)$ (where $b \equiv 1$) vs $q$, (5) $e(a)$ and $e(b)$ vs $q$, (6) 20 random realizations of $\rho(a)$ and $\rho(b)$, (7) 20 random realizations of $e(a)$ and $e(b)$.
Fig. 6.1 provides a numerical illustration of the setting of our example, where $\Omega$ is discretized over $2^8$ equally spaced interior points (with piecewise linear tent finite elements); Fig. 6.1.1-3 show $a$, $f$ and $u$. For $q \in \{1, \ldots, 8\}$ and $i \in \mathcal{I}^{(q)} := \{1, \ldots, 2^q - 1\}$, let $x^{(q)}_i = i/2^q$ and write $v^{(q)}_a$ for the interpolation of the data $(x^{(q)}_i, u(x^{(q)}_i))_{i \in \mathcal{I}^{(q)}}$ using the kernel $G_a$ (note that $v^{(8)}_a = u$). Let $\|v\|_a$ be the energy norm $\|v\|_a^2 = \int_\Omega (\nabla v)^\top a \nabla v$. Take $b \equiv 1$. Fig. 6.1.4 shows (in semilogarithmic scale) the values of $\rho(a) = \|v^{(q)}_a - v^{(8)}_a\|_a^2 / \|v^{(8)}_a\|_a^2$ and $\rho(b) = \|v^{(q)}_b - v^{(8)}_b\|_a^2 / \|v^{(8)}_b\|_a^2$ vs $q$. Note that the value of the ratio $\rho$ is much smaller when the kernel $G_a$ is used for the interpolation of the data. The geometric decay $\rho(a) \le C\, 2^{-2q}\, \|f\|_{L^2(\Omega)}^2 / \|u\|_a^2$ is well known and has been extensively studied in Numerical Homogenization [101].
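As an illustration of these quantities (using the placeholder fields of the previous sketch, so the numbers will not match Fig. 6.1), the nested-grid ratio $\rho(a)$ can be computed by interpolating $u$ at the coarse nodes with the kernel $G_a$ and measuring the error in the discrete energy norm $v^\top A v$:

```python
import numpy as np
# A, G, u are the stiffness matrix, kernel matrix and discrete solution from the sketch after (6.1.3).

def interpolate(G, idx, values):
    """Kernel interpolant v = G(., X) G(X, X)^{-1} values, evaluated at every interior node."""
    return G[:, idx] @ np.linalg.solve(G[np.ix_(idx, idx)], values)

def energy_sq(A, v):
    """Discrete energy norm ||v||_a^2 = v^T A v."""
    return float(v @ A @ v)

for q in range(1, 8):
    idx = np.arange(1, 2 ** q) * 2 ** (8 - q) - 1       # positions of the nodes i/2^q among i/2^8
    v_q = interpolate(G, idx, u[idx])                   # v^(q)_a; note that v^(8)_a = u
    print(q, energy_sq(A, v_q - u) / energy_sq(A, u))   # rho(a) at level q
```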
Fig. 6.1.5 shows (in semilogarithmic scale) the values of the prediction errors $e(a)$ and $e(b)$ (vs $q$), defined (after normalization) to be proportional to $\|v^{(q)}_a - u\|_{L^2(\Omega)}$ and $\|v^{(q)}_b - u\|_{L^2(\Omega)}$. Note again that the prediction error is much smaller when the kernel $G_a$ is used for the interpolation.
Now, let us consider the case where the interpolation points form a random subset of the discretization points. Take $N_f = 2^7$ and $N_c = 2^6$. Let $\{x_1, \ldots, x_{N_f}\}$ be a subset of $N_f$ distinct points of (the discretization points) $\{i/2^8 \mid i \in \mathcal{I}^{(8)}\}$ sampled with uniform distribution. Let $\{z_1, \ldots, z_{N_c}\}$ be a subset of $N_c$ distinct points of $\{x_1, \ldots, x_{N_f}\}$ sampled with uniform distribution. Write $v^f_a$ for the interpolation of the data $(x_i, u(x_i))$ using the kernel $G_a$ and $v^c_a$ for the interpolation of the data $(z_i, u(z_i))$ using the kernel $G_a$ (with $v^f_b$ and $v^c_b$ defined analogously using the kernel $G_b$). Fig. 6.1.6 shows (in semilogarithmic scale) 20 independent random realizations^3 of the values of $\rho(a) = \|v^f_a - v^c_a\|_a^2 / \|v^f_a\|_a^2$ and $\rho(b) = \|v^f_b - v^c_b\|_a^2 / \|v^f_b\|_a^2$. Fig. 6.1.7 shows (in semilogarithmic scale) 20 independent random realizations of the values of the prediction errors $e(a) \propto \|u - v^f_a\|_{L^2(\Omega)}$ and $e(b) \propto \|u - v^f_b\|_{L^2(\Omega)}$. Note again that the values of $\rho(a)$ and $e(a)$ are consistently and significantly lower than those of $\rho(b)$ and $e(b)$.

^3 Random realizations of the subsets.
Figure 6.2: [103, Fig. 6]. (1) $a$ and $a_w$ for $n = 1$, (2) $a$ and $a_w$ for $n = 350$, (3) $\rho(n)$ vs $n$, (4) $e(n)$ vs $n$.
Fig. 6.2 provides a numerical illustration of an implementation of Alg. 5 with $N = N_f = 2^7$, $N_c = 2^6$, and $n$ indexing the iterations of the KF algorithm starting with $n = 1$. In this implementation $a$, $f$ and $u$ are as in Fig. 6.1.1-3. The training data corresponds to $N$ points $X = \{x_1, \ldots, x_N\}$ uniformly sampled (without replacement) from $\{i/2^8 \mid i \in \mathcal{I}^{(8)}\}$. Note that $X$ remains fixed and, since $N = N_f$, the larger batch (as in step 4 of Alg. 5) is always selected as $X$. The purpose of the algorithm is to learn the kernel $G_a$ within the set of kernels $\{G_{a_w} \mid w\}$ parameterized by the vector $w$ via
\[
\log a_w(x) = \sum_{j=1}^{2^6} \big( w_{j,1}\cos(2\pi j x) + w_{j,2}\sin(2\pi j x) \big).
\tag{6.1.4}
\]
Using $n$ to label its progression, Alg. 5 is initialized at $n = 1$ with the guess $a_w \equiv 1$ (i.e., $w \equiv 0$) (Fig. 6.2.1). Fig. 6.2.2 shows the value of $a_w$ after $n = 350$ iterations, which can be seen to approximate $a$. Fig. 6.2.3 shows the value of $\rho(n)$ vs $n$, and Fig. 6.2.4 shows the value of the prediction error $e(n) \propto \|u - v_n\|_{L^2(\Omega)}$ vs $n$, where $v_n$ denotes the interpolant of the training data obtained with the kernel at iteration $n$. The lack of smoothness of the plots of $\rho(n)$ and $e(n)$ vs $n$ originates from the re-sampling of the sub-batch $X^c$ at each step $n$. Further details of this application of the KF algorithm can be found in [103, Sec. 5].
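As a final illustration, here is a sketch of how the parameterization (6.1.4) and the associated kernel family $\{G_{a_w}\}$ could be evaluated numerically. It re-uses the `stiffness` assembly from the sketch after (6.1.3); the coefficient layout `w[j-1] = (w_{j,1}, w_{j,2})` and the name `grid_kernel` are assumptions of this sketch, not notation from [103].

```python
import numpy as np

def a_w(x, w):
    """Conductivity of eq. (6.1.4): log a_w(x) = sum_j w[j-1,0] cos(2 pi j x) + w[j-1,1] sin(2 pi j x)."""
    j = np.arange(1, w.shape[0] + 1)[:, None]          # frequencies j = 1, ..., 2^6
    phase = 2 * np.pi * j * x[None, :]
    return np.exp(w[:, 0] @ np.cos(phase) + w[:, 1] @ np.sin(phase))

def grid_kernel(w, rows, cols, m=2 ** 8):
    """Green's-function kernel G_{a_w} between interior grid nodes, via the inverse stiffness matrix.
    `rows`/`cols` are 0-based indices of the nodes i/2^8 (node i <-> index i-1)."""
    h = 1.0 / m
    x_mid = (np.arange(m) + 0.5) * h
    G = np.linalg.inv(stiffness(a_w(x_mid, w), h))     # stiffness() as defined after (6.1.3)
    return G[np.ix_(rows, cols)]

w0 = np.zeros((2 ** 6, 2))                             # initial guess w = 0, i.e. a_w = 1
```

Feeding `grid_kernel` (with batches and sub-batches specified by node indices) into the loss $\rho$ of (6.1.1) and running the KF loop sketched earlier then mirrors, in spirit, the experiment of Fig. 6.2.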