
Chapter VI: Kernel Flows

6.1 Parametric KF Algorithm

1: Input: dataset $(d_i, y_i)_{1 \le i \le N}$, kernel family $k_{\boldsymbol{\theta}}$, initial parameter $\boldsymbol{\theta}_0$
2: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}_0$
3: repeat
4: Randomly select $\{s_f(1), \ldots, s_f(N_f)\}$ from $\{1, \ldots, N\}$ without replacement.
5: $D_f \leftarrow (d_{s_f(i)})_{1 \le i \le N_f}$ and $Y_f \leftarrow (y_{s_f(i)})_{1 \le i \le N_f}$.
6: Randomly select $\{s_c(1), \ldots, s_c(N_c)\}$ from $\{s_f(1), \ldots, s_f(N_f)\}$ without replacement.
7: $D_c \leftarrow (d_{s_c(i)})_{1 \le i \le N_c}$ and $Y_c \leftarrow (y_{s_c(i)})_{1 \le i \le N_c}$.
8: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \epsilon \nabla_{\boldsymbol{\theta}} \rho(\boldsymbol{\theta}, D_f, Y_f, D_c, Y_c)$
9: until the end criterion is met
10: Return the optimized kernel parameter $\boldsymbol{\theta}^* \leftarrow \boldsymbol{\theta}$

As inputs, in step 1, the parametric variant of the KF algorithm takes a training dataset $(d_i, y_i)_{1 \le i \le N}$, a parametric family of kernels $k_{\boldsymbol{\theta}}$, and an initial parameter $\boldsymbol{\theta}_0$. The kernel parameter $\boldsymbol{\theta}$ is first initialized to $\boldsymbol{\theta}_0$ in step 2. The main iterative process follows between steps 3 and 9, beginning with the random selection of subvectors $D_f$ and $Y_f$ of $D$ and $Y$ of size $N_f$ (through uniform sampling without replacement in the index set $\{1, \ldots, N\}$) in steps 4 and 5. This selection is sampled further into subvectors $D_c$ and $Y_c$ of $D_f$ and $Y_f$ of size $N_c$ (by selecting, uniformly at random and without replacement, $N_c$ of the indices defining $D_f$) in steps 6 and 7.
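The double sampling in steps 4-7 and the update in step 8 can be organized as in the following minimal sketch (Python/JAX). The function and argument names are illustrative assumptions, $D$ and $Y$ are assumed to be arrays, and `rho` stands for a differentiable instantaneous loss such as (6.1.1) below.

```python
import numpy as np
import jax

def kernel_flow(D, Y, rho, theta0, N_f, N_c, lr=1e-2, n_iters=1000):
    """Parametric KF sketch: stochastic gradient descent on the doubly
    randomized loss rho(theta, D_f, Y_f, D_c, Y_c)."""
    theta = theta0                                   # step 2
    grad_rho = jax.grad(rho)                         # gradient w.r.t. theta
    N = len(Y)
    for _ in range(n_iters):                         # steps 3-9
        # Steps 4-5: batch of size N_f, sampled uniformly without replacement.
        s_f = np.random.choice(N, size=N_f, replace=False)
        D_f, Y_f = D[s_f], Y[s_f]
        # Steps 6-7: sub-batch of size N_c drawn from the batch.
        s_c = np.random.choice(N_f, size=N_c, replace=False)
        D_c, Y_c = D_f[s_c], Y_f[s_c]
        # Step 8: gradient step on the instantaneous loss.
        theta = theta - lr * grad_rho(theta, D_f, Y_f, D_c, Y_c)
    return theta                                     # step 10
```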

Finally, in step 8, we update the kernel parameter $\boldsymbol{\theta}$ by gradient descent on the loss $\rho$. We define $\rho(\boldsymbol{\theta}, D_f, Y_f, D_c, Y_c)$ to be the squared relative error (in the RKHS norm¹ $\|\cdot\|_{k_{\boldsymbol{\theta}}}$ defined by $k_{\boldsymbol{\theta}}$) between the interpolants $u^{\dagger,f}$ and $u^{\dagger,c}$ obtained from the two nested subsets of the dataset and the kernel $k_{\boldsymbol{\theta}}$, i.e.²

$$\rho(\boldsymbol{\theta}, D_f, Y_f, D_c, Y_c) := \frac{\|u^{\dagger,f} - u^{\dagger,c}\|_{k_{\boldsymbol{\theta}}}^2}{\|u^{\dagger,f}\|_{k_{\boldsymbol{\theta}}}^2} = 1 - \frac{Y_c^{\top} k_{\boldsymbol{\theta}}(D_c, D_c)^{-1} Y_c}{Y_f^{\top} k_{\boldsymbol{\theta}}(D_f, D_f)^{-1} Y_f}. \tag{6.1.1}$$

Note that the loss $\rho$ is doubly randomized through the selection of batches $(D_f, Y_f)$ and sub-batches $(D_c, Y_c)$. The gradient of $\rho$ with respect to $\boldsymbol{\theta}$ is then computed and $\boldsymbol{\theta}$ is updated as $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \epsilon \nabla_{\boldsymbol{\theta}} \rho$. This process is iterated until an ending condition is satisfied, such as a maximum number of iteration steps or $\rho < \rho_{\mathrm{stop}}$. The optimized kernel parameter $\boldsymbol{\theta}^*$ is returned, and the corresponding kernel $k_{\boldsymbol{\theta}^*}$ can then be used to interpolate at testing points. The KF algorithm is therefore a stochastic gradient descent algorithm applied to $\rho$.
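The right-hand side of (6.1.1) makes $\rho$ computable directly from the two Gram matrices. A minimal sketch, assuming for concreteness a Gaussian kernel family $k_{\boldsymbol{\theta}}$ with a scalar lengthscale on one-dimensional inputs and a small nugget for numerical stability (both are illustrative assumptions, not part of the algorithm):

```python
import jax.numpy as jnp

def k_rbf(theta, X1, X2):
    # Illustrative kernel family: Gaussian kernel with lengthscale theta.
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return jnp.exp(-d2 / (2.0 * theta ** 2))

def rho(theta, D_f, Y_f, D_c, Y_c, nugget=1e-8):
    """Squared relative RKHS error (6.1.1):
    rho = 1 - (Y_c^T k(D_c, D_c)^{-1} Y_c) / (Y_f^T k(D_f, D_f)^{-1} Y_f)."""
    K_c = k_rbf(theta, D_c, D_c) + nugget * jnp.eye(len(D_c))
    K_f = k_rbf(theta, D_f, D_f) + nugget * jnp.eye(len(D_f))
    num = Y_c @ jnp.linalg.solve(K_c, Y_c)
    den = Y_f @ jnp.linalg.solve(K_f, Y_f)
    return 1.0 - num / den
```

Passed to the loop sketched after the algorithm, `jax.grad(rho)` supplies the update of step 8.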

The $\ell^2$-norm variant. In the application of the KF algorithm to NNs, we will consider the $\ell^2$-norm variant of this algorithm (introduced in [103, Sec. 10]), in which the instantaneous loss $\rho$ in (6.1.1) is replaced by the squared Euclidean error (writing $\|\cdot\|_2$ for the Euclidean $\ell^2$ norm) $e_2 := \|Y_f - u^{\dagger,c}(D_f)\|_2^2$ of $u^{\dagger,c}$ in predicting the labels $Y_f$, i.e.

$$e_2(\boldsymbol{\theta}, D_f, Y_f, D_c, Y_c) := \|Y_f - k_{\boldsymbol{\theta}}(D_f, D_c)\, k_{\boldsymbol{\theta}}(D_c, D_c)^{-1} Y_c\|_2^2. \tag{6.1.2}$$
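Under the same illustrative assumptions (the hypothetical `k_rbf` family of the previous sketch), the $\ell^2$ variant could be written as:

```python
def e2(theta, D_f, Y_f, D_c, Y_c, nugget=1e-8):
    """l2-norm variant (6.1.2): squared Euclidean error of the sub-batch
    interpolant u^{dagger,c} in predicting the batch labels Y_f."""
    K_cc = k_rbf(theta, D_c, D_c) + nugget * jnp.eye(len(D_c))
    K_fc = k_rbf(theta, D_f, D_c)
    pred = K_fc @ jnp.linalg.solve(K_cc, Y_c)   # u^{dagger,c}(D_f)
    return jnp.sum((Y_f - pred) ** 2)
```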

A simple PDE model

To motivate, illustrate and study the parametric KF algorithm, it is useful to start with an application [103, Sec. 5] to the following simple PDE model amenable to detailed analysis [101].

¹ Note that convergence in the RKHS norm implies pointwise convergence.

² Where $u^{\dagger,f}(d) = k_{\boldsymbol{\theta}}(d, D_f) k_{\boldsymbol{\theta}}(D_f, D_f)^{-1} Y_f$ and $u^{\dagger,c}(d) = k_{\boldsymbol{\theta}}(d, D_c) k_{\boldsymbol{\theta}}(D_c, D_c)^{-1} Y_c$. Further, $\rho$ admits the Gram-matrix representation on the right-hand side of equation (6.1.1), enabling its computation [103, Prop. 3.1].

Let $u$ be the solution of the second order elliptic PDE

$$\begin{cases} -\operatorname{div}\big(a(x) \nabla u(x)\big) = f(x), & x \in \Omega; \\ u = 0 & \text{on } \partial\Omega, \end{cases} \tag{6.1.3}$$

with an interval $\Omega \subset \mathbb{R}^1$ and a uniformly elliptic symmetric matrix $a$ with entries in $L^{\infty}(\Omega)$. We write $\mathcal{L} := -\operatorname{div}(a \nabla \cdot)$ for the corresponding linear bijection from $H_0^1(\Omega)$ to $H^{-1}(\Omega)$. In this simple application we seek both to estimate the conductivity coefficient $a$ and to recover the solution of (6.1.3) from the data $(x_i, y_i)_{1 \le i \le N}$ and the information $u(x_i) = y_i$. In this example, we use the kernel family $\{G_b \mid b \in L^{\infty}(\Omega),\ \operatorname{ess\,inf}_{\Omega}(b) > 0\}$, where each $G_b$ is the Green's function of the operator $-\operatorname{div}(b \nabla \cdot)$. It is known that kernel recovery with $G_a$ is minimax optimal in the $\|\cdot\|_a$ norm [101]. In what follows, we numerically demonstrate that the KF algorithm applied to this problem recovers a kernel $G_{b^*}$ such that $a \approx b^*$.
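As a sketch of how the kernel family $\{G_b\}$ could be realized numerically on the discretization grid introduced below: with piecewise linear tent elements on a uniform grid and $b$ taken piecewise constant on the elements (assumptions made only for this sketch), the nodal values of the discrete Green's function of $-\operatorname{div}(b\nabla\cdot)$ are the entries of the inverse stiffness matrix.

```python
import jax.numpy as jnp

def stiffness_matrix(b_elem, h):
    """Tridiagonal P1 stiffness matrix of -div(b grad u) with zero Dirichlet
    boundary conditions; b_elem holds b on each of the n+1 elements of a
    uniform grid with spacing h and n interior nodes."""
    main = (b_elem[:-1] + b_elem[1:]) / h    # diagonal: two elements shared by each node
    off = -b_elem[1:-1] / h                  # off-diagonal: element between adjacent nodes
    return jnp.diag(main) + jnp.diag(off, 1) + jnp.diag(off, -1)

def green_kernel_matrix(b_elem, h):
    """Kernel matrix on the grid: entry (i, j) approximates G_b(x_i, x_j)
    at the interior nodes."""
    return jnp.linalg.inv(stiffness_matrix(b_elem, h))
```

Kernel interpolation with $G_b$ on a subset of the nodes then only requires the corresponding rows and columns of this matrix.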

Figure 6.1: [103, Fig. 5], (1) $a$ (2) $f$ (3) $u$ (4) $\rho(a)$ and $\rho(b)$ (where $b \equiv 1$) vs $k$ (5) $e(a)$ and $e(b)$ vs $k$ (6) 20 random realizations of $\rho(a)$ and $\rho(b)$ (7) 20 random realizations of $e(a)$ and $e(b)$.

Fig. 6.1 provides a numerical illustration of the setting of our example, where $\Omega$ is discretized over $2^8$ equally spaced interior points (and piecewise linear tent finite elements) and Fig. 6.1.1-3 show $a$, $f$ and $u$. For $k \in \{1, \ldots, 8\}$ and $i \in \mathcal{I}^{(k)} := \{1, \ldots, 2^k - 1\}$, let $x_i^{(k)} = i/2^k$ and write $v_b^{(k)}$ for the interpolation of the data $(x_i^{(k)}, u(x_i^{(k)}))_{i \in \mathcal{I}^{(k)}}$ using the kernel $G_b$ (note that $v_b^{(8)} = u$). Let $\|v\|_b$ be the energy norm $\|v\|_b^2 = \int_{\Omega} (\nabla v)^{T} b \nabla v$. Take $b \equiv 1$. Fig. 6.1.4 shows (in semilog scale) the values of

$$\rho(a) = \frac{\|v_a^{(k)} - v_a^{(8)}\|_a^2}{\|v_a^{(8)}\|_a^2} \quad \text{and} \quad \rho(b) = \frac{\|v_b^{(k)} - v_b^{(8)}\|_b^2}{\|v_b^{(8)}\|_b^2}$$

vs $k$. Note that the value of the ratio $\rho$ is much smaller when the kernel $G_a$ is used for the interpolation of the data. The geometric decay $\rho(a) \le C\, 2^{-2k} \|f\|_{L^2(\Omega)}^2 / \|u\|_a^2$ is well known and has been extensively studied in Numerical Homogenization [101].
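For a piecewise linear $v$ on this grid the energy norm reduces to an element-wise sum; a minimal sketch, reusing the element-wise $b$ and spacing `h` of the earlier Green's function sketch:

```python
def energy_norm_sq(v_nodes, b_elem, h):
    """Energy norm squared of a P1 function with interior nodal values v_nodes
    and zero boundary values: sum over elements of b * slope^2 * h."""
    v = jnp.concatenate([jnp.zeros(1), v_nodes, jnp.zeros(1)])  # append boundary zeros
    slopes = jnp.diff(v) / h
    return jnp.sum(b_elem * slopes ** 2 * h)
```

With such a helper, the ratio $\rho(a)$ above is `energy_norm_sq(v_k - v_8, a_elem, h) / energy_norm_sq(v_8, a_elem, h)` for the nodal values `v_k`, `v_8` of the two interpolants.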

Fig. 6.1.5 shows (in semilog scale) the values of the prediction errors $e(a)$ and $e(b)$ (vs $k$), defined (after normalization) to be proportional to $\|v_a^{(k)} - u\|_{L^2(\Omega)}$ and $\|v_b^{(k)} - u\|_{L^2(\Omega)}$. Note again that the prediction error is much smaller when the kernel $G_a$ is used for the interpolation.

Now let us consider the case where the interpolation points form a random subset of the discretization points. Take $N_f = 2^7$ and $N_c = 2^6$. Let $X = \{x_1, \ldots, x_{N_f}\}$ be a subset of $N_f$ distinct points of (the discretization points) $\{i/2^8 \mid i \in \mathcal{I}^{(8)}\}$ sampled with uniform distribution. Let $\{z_1, \ldots, z_{N_c}\}$ be a subset of $N_c$ distinct points of $X$ sampled with uniform distribution. Write $v_b^f$ for the interpolation of the data $(x_i, u(x_i))$ using the kernel $G_b$, and write $v_b^c$ for the interpolation of the data $(z_i, u(z_i))$ using the kernel $G_b$. Fig. 6.1.6 shows (in semilog scale) 20 independent random realizations³ of the values of $\rho(a) = \|v_a^f - v_a^c\|_a^2 / \|v_a^f\|_a^2$ and $\rho(b) = \|v_b^f - v_b^c\|_b^2 / \|v_b^f\|_b^2$. Fig. 6.1.7 shows (in semilog scale) 20 independent random realizations of the values of the prediction errors $e(a) \propto \|u - v_a^c\|_{L^2(\Omega)}$ and $e(b) \propto \|u - v_b^c\|_{L^2(\Omega)}$. Note again that the values of $\rho(a)$ and $e(a)$ are consistently and significantly lower than those of $\rho(b)$ and $e(b)$.
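One random realization of this experiment might be assembled from the earlier sketches as follows; `u_nodes` (the nodal values of $u$), `b_elem` and `h` are assumed to be supplied by the discretization, and the $G_a$ case is analogous.

```python
import numpy as np

def interpolant_on_grid(G, idx, y):
    """Kernel interpolant of the data (x_idx, y) at every interior node:
    u(x) = G(x, X) G(X, X)^{-1} y, with G the nodal kernel matrix."""
    return G[:, idx] @ jnp.linalg.solve(G[idx][:, idx], y)

def one_realization(u_nodes, b_elem, h, N_f=2**7, N_c=2**6):
    """One random batch/sub-batch realization of rho(b) and e(b) for the
    kernel G_b."""
    G_b = green_kernel_matrix(b_elem, h)
    s_f = np.random.choice(len(u_nodes), size=N_f, replace=False)
    s_c = np.random.choice(s_f, size=N_c, replace=False)
    v_f = interpolant_on_grid(G_b, s_f, u_nodes[s_f])
    v_c = interpolant_on_grid(G_b, s_c, u_nodes[s_c])
    rho_b = energy_norm_sq(v_f - v_c, b_elem, h) / energy_norm_sq(v_f, b_elem, h)
    e_b = jnp.sqrt(h * jnp.sum((u_nodes - v_c) ** 2))   # discrete L2 prediction error
    return rho_b, e_b
```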

Figure 6.2: [103, Fig. 6], (1) $a$ and $b$ for $n = 1$ (2) $a$ and $b$ for $n = 350$ (3) $\rho(b)$ vs $n$ (4) $e(b)$ vs $n$.

Fig. 6.2 provides a numerical illustration of an implementation of Alg. 5 with $N = N_f = 2^7$, $N_c = 2^6$, and $n$ indexing each iteration of the KF algorithm starting with $n = 1$. In this implementation $a$, $f$ and $u$ are as in Fig. 6.1.1-3. The training data corresponds to $N$ points $X = \{x_1, \ldots, x_N\}$ uniformly sampled (without replacement) from $\{i/2^8 \mid i \in \mathcal{I}^{(8)}\}$. Note that $X$ remains fixed and, since $N = N_f$, the larger batch (as in step 4 of Alg. 5) is always selected as $X$.

³ Random realizations of the subsets.

The purpose of the algorithm is to learn the kernel $G_a$ within the set of kernels $\{G_{b_W} \mid W\}$ parameterized by the vector $W$ via

$$\log b_W = \sum_{i=1}^{2^6} \big( W_i^c \cos(2\pi i x) + W_i^s \sin(2\pi i x) \big). \tag{6.1.4}$$

Using $n$ to label its progression, Alg. 5 is initialized at $n = 1$ with the guess $b_0 \equiv 1$ (i.e., $W_0 \equiv 0$) (Fig. 6.2.1). Fig. 6.2.2 shows the value of $b$ after $n = 350$ iterations, which can be seen to approximate $a$. Fig. 6.2.3 shows the value of $\rho(b)$ vs $n$. Fig. 6.2.4 shows the value of the prediction error $e(b) \propto \|u - v_b^c\|_{L^2(\Omega)}$ vs $n$. The lack of smoothness of the plots of $\rho(b)$ and $e(b)$ vs $n$ originates from the re-sampling of the sub-batch $Z = \{z_1, \ldots, z_{N_c}\}$ at each step $n$. Further details of this application of the KF algorithm can be found in [103, Sec. 5].
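The parameterization (6.1.4) is straightforward to evaluate on the grid; in the sketch below $W$ is split into its cosine and sine coefficients and $b_W$ is evaluated at the element midpoints `x_elem` (an illustrative choice).

```python
def b_from_W(W_cos, W_sin, x_elem):
    """Conductivity b_W of (6.1.4) at the points x_elem:
    log b_W(x) = sum_i W^c_i cos(2 pi i x) + W^s_i sin(2 pi i x)."""
    i = jnp.arange(1, len(W_cos) + 1)                     # frequencies 1, ..., 2^6
    phases = 2.0 * jnp.pi * i[None, :] * x_elem[:, None]
    log_b = jnp.cos(phases) @ W_cos + jnp.sin(phases) @ W_sin
    return jnp.exp(log_b)
```

With $W_0 \equiv 0$ this gives $b \equiv 1$, matching the initialization in Fig. 6.2.1; differentiating the composition of (6.1.4), the stiffness assembly, the interpolation and the loss $\rho$ (e.g. with `jax.grad`) then yields the gradient step of Alg. 5 with respect to $W$.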
