Chapter VI: Kernel Flows
6.1 Parametric KF Algorithm
Algorithm 5: Parametric KF algorithm
1: Input: dataset $(x_i, y_i)_{1 \le i \le N}$, kernel family $K_\theta$, initial parameter $\theta_0$
2: $\theta \leftarrow \theta_0$
3: repeat
4:   Randomly select $\{s_f(1), \ldots, s_f(N_f)\}$ from $\{1, \ldots, N\}$ without replacement.
5:   $X^f \leftarrow (x_{s_f(i)})_{1 \le i \le N_f}$ and $Y^f \leftarrow (y_{s_f(i)})_{1 \le i \le N_f}$.
6:   Randomly select $\{s_c(1), \ldots, s_c(N_c)\}$ from $\{s_f(1), \ldots, s_f(N_f)\}$ without replacement.
7:   $X^c \leftarrow (x_{s_c(i)})_{1 \le i \le N_c}$ and $Y^c \leftarrow (y_{s_c(i)})_{1 \le i \le N_c}$.
8:   $\theta \leftarrow \theta - \delta \nabla_\theta \rho(\theta, X^f, Y^f, X^c, Y^c)$
9: until end criterion is met
10: Return optimized kernel parameter $\theta^* \leftarrow \theta$
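The steps above translate directly into code. The following is a minimal sketch (not the reference implementation of [103]) of steps 2-10, assuming the data are stored in NumPy arrays and that a loss function `rho(theta, Xf, Yf, Xc, Yc)` implementing the loss defined in (6.1.1) below is supplied; the gradient in step 8 is approximated here by central finite differences purely for simplicity (automatic differentiation could be used instead).

```python
import numpy as np

def kernel_flow(X, Y, rho, theta0, N_f, N_c, delta=0.1, n_iter=500, seed=0):
    """Parametric KF: stochastic gradient descent of rho over the kernel parameter theta."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float)).copy()   # step 2
    for _ in range(n_iter):                                         # steps 3-9
        s_f = rng.choice(len(X), size=N_f, replace=False)           # step 4
        Xf, Yf = X[s_f], Y[s_f]                                     # step 5
        s_c = rng.choice(N_f, size=N_c, replace=False)              # step 6
        Xc, Yc = Xf[s_c], Yf[s_c]                                   # step 7
        # step 8: finite-difference gradient of rho with respect to theta
        grad = np.zeros_like(theta)
        eps = 1e-6
        for i in range(theta.size):
            e = np.zeros_like(theta)
            e[i] = eps
            grad[i] = (rho(theta + e, Xf, Yf, Xc, Yc)
                       - rho(theta - e, Xf, Yf, Xc, Yc)) / (2 * eps)
        theta -= delta * grad
    return theta                                                    # step 10: optimized parameter
```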
As inputs, in step 1, the parametric variant of the KF algorithm takes a training dataset $(x_i, y_i)_{1 \le i \le N}$, a parametric family of kernels $K_\theta$, and an initial parameter $\theta_0$. The kernel parameter $\theta$ is first initialized to $\theta_0$ in step 2. The main iterative process takes place between steps 3 and 9. It begins, in steps 4 and 5, with the random selection of subvectors $X^f$ and $Y^f$ of size $N_f$ of $X = (x_1, \ldots, x_N)$ and $Y = (y_1, \ldots, y_N)$ (through uniform sampling without replacement in the index set $\{1, \ldots, N\}$). This selection is then subsampled further, in steps 6 and 7, into subvectors $X^c$ and $Y^c$ of length $N_c$ of $X^f$ and $Y^f$ (by selecting, uniformly at random and without replacement, $N_c$ of the indices defining $X^f$).
Finally, in step 8, we update the kernel parameter $\theta$ by a gradient descent step on the loss $\rho$. We define $\rho(\theta, X^f, Y^f, X^c, Y^c)$ as the squared relative error (in the RKHS norm^1 $\|\cdot\|_{K_\theta}$ defined by $K_\theta$) between the interpolants $u^{*,f}$ and $u^{*,c}$ obtained from the two nested subsets of the dataset and the kernel $K_\theta$, i.e.^2
\[
\rho(\theta, X^f, Y^f, X^c, Y^c) := \frac{\|u^{*,f} - u^{*,c}\|_{K_\theta}^2}{\|u^{*,f}\|_{K_\theta}^2}
= 1 - \frac{Y^{c,\top} K_\theta(X^c, X^c)^{-1} Y^c}{Y^{f,\top} K_\theta(X^f, X^f)^{-1} Y^f}.
\tag{6.1.1}
\]

Note that the loss $\rho$ is doubly randomized through the selection of the batches $(X^f, Y^f)$ and sub-batches $(X^c, Y^c)$. The gradient of $\rho$ with respect to $\theta$ is then computed and $\theta$ is updated as $\theta \leftarrow \theta - \delta \nabla_\theta \rho$. This process is iterated until a stopping criterion is satisfied, such as a maximum number of iteration steps or $\rho < \rho_{\mathrm{stop}}$. The optimized kernel parameter $\theta^*$ is returned, and the corresponding kernel $K_{\theta^*}$ can then be used to interpolate test points. Since the iteration is a stochastic gradient descent on $\rho$, the KF algorithm is a stochastic gradient descent algorithm.

^1 Note that convergence in the RKHS norm implies pointwise convergence.
^2 Here $u^{*,f}(\cdot) = K_\theta(\cdot, X^f) K_\theta(X^f, X^f)^{-1} Y^f$ and $u^{*,c}(\cdot) = K_\theta(\cdot, X^c) K_\theta(X^c, X^c)^{-1} Y^c$. Moreover, $\rho$ admits the representation on the right-hand side of (6.1.1), which enables its computation [103, Prop. 3.1].
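For concreteness, here is a minimal NumPy sketch of the loss (6.1.1), computed through its right-hand-side representation. It assumes one-dimensional scalar inputs and a generic kernel family supplied as a Python callable; the Gaussian family shown is only an illustrative placeholder, not a kernel used in this chapter.

```python
import numpy as np
from functools import partial

def rho(theta, Xf, Yf, Xc, Yc, kernel):
    """Loss of eq. (6.1.1): 1 - (Yc' K(Xc,Xc)^{-1} Yc) / (Yf' K(Xf,Xf)^{-1} Yf)."""
    Kff = kernel(theta, Xf, Xf) + 1e-10 * np.eye(len(Xf))   # small jitter for numerical stability
    Kcc = kernel(theta, Xc, Xc) + 1e-10 * np.eye(len(Xc))
    num = Yc @ np.linalg.solve(Kcc, Yc)
    den = Yf @ np.linalg.solve(Kff, Yf)
    return 1.0 - num / den

def gaussian_kernel(theta, A, B):
    """Placeholder parametric family: Gaussian kernel with length-scale theta[0] on scalar inputs."""
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * theta[0] ** 2))

# Example: bind the kernel family and feed the resulting loss to the loop sketched above, e.g.
#   theta_star = kernel_flow(X, Y, partial(rho, kernel=gaussian_kernel), theta0=[1.0], N_f=32, N_c=16)
loss = partial(rho, kernel=gaussian_kernel)   # loss(theta, Xf, Yf, Xc, Yc)
```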
The $\ell_2$-norm variant. In the application of the KF algorithm to NNs, we will consider the $\ell_2$-norm variant of this algorithm (introduced in [103, Sec. 10]), in which the instantaneous loss $\rho$ in (6.1.1) is replaced by the error (writing $\|\cdot\|_2$ for the Euclidean $\ell_2$ norm) $\rho_2 := \|Y^f - u^{*,c}(X^f)\|_2^2$ of $u^{*,c}$ in predicting the labels $Y^f$, i.e.
\[
\rho_2(\theta, X^f, Y^f, X^c, Y^c) := \big\|Y^f - K_\theta(X^f, X^c) K_\theta(X^c, X^c)^{-1} Y^c\big\|_2^2.
\tag{6.1.2}
\]
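A corresponding sketch of (6.1.2), under the same assumed conventions as the previous snippet (scalar inputs, kernel family passed as a callable):

```python
import numpy as np

def rho_2(theta, Xf, Yf, Xc, Yc, kernel):
    """l2 variant of eq. (6.1.2): squared Euclidean error of the coarse interpolant on the batch."""
    Kfc = kernel(theta, Xf, Xc)
    Kcc = kernel(theta, Xc, Xc) + 1e-10 * np.eye(len(Xc))   # small jitter for numerical stability
    pred = Kfc @ np.linalg.solve(Kcc, Yc)                   # u^{*,c}(X^f)
    return float(np.sum((Yf - pred) ** 2))
```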
A simple PDE model

To motivate, illustrate and study the parametric KF algorithm, it is useful to start with an application [103, Sec. 5] to the following simple PDE model, which is amenable to detailed analysis [101]. Let $u$ be the solution of the second-order elliptic PDE
\[
\begin{cases}
-\operatorname{div}\big(a(x)\nabla u(x)\big) = f(x), & x \in \Omega;\\
u = 0 & \text{on } \partial\Omega,
\end{cases}
\tag{6.1.3}
\]
with interval $\Omega \subset \mathbb{R}^1$ and uniformly elliptic symmetric matrix $a$ with entries in $L^\infty(\Omega)$. We write $\mathcal{L} := -\operatorname{div}(a\nabla\cdot)$ for the corresponding linear bijection from $H^1_0(\Omega)$ to $H^{-1}(\Omega)$. In this simple application we seek both to estimate the conductivity coefficient $a$ and to recover the solution of (6.1.3) from the data $(x_i, y_i)_{1 \le i \le N}$ and the information $u(x_i) = y_i$. In this example, we use the kernel family $\{G_b \mid b \in L^\infty(\Omega),\ \operatorname{ess\,inf}_\Omega(b) > 0\}$, where each $G_b$ is the Green's function of the operator $-\operatorname{div}(b\nabla\cdot)$. It is known that kernel recovery with $G_a$ is minimax optimal in the energy norm $\|\cdot\|_a$ [101]. In what follows, we will numerically demonstrate that the KF algorithm applied to this problem recovers a kernel $G_{b^*}$ such that $b^* \approx a$.
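To make the setting concrete, the following is a minimal sketch of the discretized model, assuming $\Omega = (0,1)$, a uniform mesh of size $2^{-8}$, and piecewise linear finite elements. The conductivity and right-hand side below are arbitrary placeholders (not the $a$ and $f$ of Fig. 6.1), and the inverse stiffness matrix is used as a standard discrete surrogate for the Green's-function kernel on the grid nodes.

```python
import numpy as np

def stiffness(a_mid, h):
    """P1 finite-element stiffness matrix of -div(a grad .) on (0,1) with u = 0 on the boundary.
    a_mid[e] is the conductivity on element e (between nodes e and e+1); h is the mesh size."""
    d = (a_mid[:-1] + a_mid[1:]) / h        # node i touches elements i and i+1
    off = -a_mid[1:-1] / h                  # coupling through the element shared by nodes i and i+1
    return np.diag(d) + np.diag(off, 1) + np.diag(off, -1)

m = 2 ** 8                                  # number of elements; interior nodes i/2^8, i = 1..m-1
h = 1.0 / m
x_mid = (np.arange(m) + 0.5) * h            # element midpoints
a = np.exp(np.sin(2 * np.pi * x_mid))       # placeholder conductivity (illustrative only)
f = np.ones(m - 1)                          # placeholder right-hand side at the interior nodes
A = stiffness(a, h)
u = np.linalg.solve(A, h * f)               # discrete solution of (6.1.3)

# Discrete Green's-function kernel of -div(a grad .): G_a(x_i, x_j) corresponds to (A^{-1})_{ij}
# on the interior nodes, so kernel interpolation with G_a uses sub-blocks of this matrix.
G = np.linalg.inv(A)
```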
Figure 6.1: [103, Fig. 5]. (1) $a$, (2) $f$, (3) $u$, (4) $\rho(a)$ and $\rho(b)$ (where $b \equiv 1$) vs $q$, (5) $e(a)$ and $e(b)$ vs $q$, (6) 20 random realizations of $\rho(a)$ and $\rho(b)$, (7) 20 random realizations of $e(a)$ and $e(b)$.
Fig. 6.1 provides a numerical illustration of the setting of our example, where $\Omega$ is discretized over $2^8$ equally spaced interior points (with piecewise linear tent finite elements); Fig. 6.1.1-3 show $a$, $f$ and $u$. For $q \in \{1, \ldots, 8\}$ and $i \in \mathcal{I}^{(q)} := \{1, \ldots, 2^q - 1\}$, let $x^{(q)}_i = i/2^q$ and write $v^{(q)}_a$ for the interpolation of the data $(x^{(q)}_i, u(x^{(q)}_i))_{i \in \mathcal{I}^{(q)}}$ using the kernel $G_a$ (note that $v^{(8)}_a = u$). Let $\|v\|_a$ be the energy norm $\|v\|_a^2 = \int_\Omega (\nabla v)^\top a \nabla v$. Take $b \equiv 1$. Fig. 6.1.4 shows (in semilogarithmic scale) the values of $\rho(a) = \|v^{(q)}_a - v^{(8)}_a\|_a^2 / \|v^{(8)}_a\|_a^2$ and $\rho(b) = \|v^{(q)}_b - v^{(8)}_b\|_a^2 / \|v^{(8)}_b\|_a^2$ vs $q$. Note that the value of the ratio $\rho$ is much smaller when the kernel $G_a$ is used for the interpolation of the data. The geometric decay $\rho(a) \le C\, 2^{-2q}\, \|f\|_{L^2(\Omega)}^2 / \|u\|_a^2$ is well known and has been extensively studied in Numerical Homogenization [101].
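As an illustration of these quantities (using the placeholder fields of the previous sketch, so the numbers will not match Fig. 6.1), the nested-grid ratio $\rho(a)$ can be computed by interpolating $u$ at the coarse nodes with the kernel $G_a$ and measuring the error in the discrete energy norm $v^\top A v$:

```python
import numpy as np
# A, G, u are the stiffness matrix, kernel matrix and discrete solution from the sketch after (6.1.3).

def interpolate(G, idx, values):
    """Kernel interpolant v = G(., X) G(X, X)^{-1} values, evaluated at every interior node."""
    return G[:, idx] @ np.linalg.solve(G[np.ix_(idx, idx)], values)

def energy_sq(A, v):
    """Discrete energy norm ||v||_a^2 = v^T A v."""
    return float(v @ A @ v)

for q in range(1, 8):
    idx = np.arange(1, 2 ** q) * 2 ** (8 - q) - 1       # positions of the nodes i/2^q among i/2^8
    v_q = interpolate(G, idx, u[idx])                   # v^(q)_a; note that v^(8)_a = u
    print(q, energy_sq(A, v_q - u) / energy_sq(A, u))   # rho(a) at level q
```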
Fig. 6.1.5 shows (in semilogarithmic scale) the values of the prediction errors $e(a)$ and $e(b)$ (vs $q$), defined (after normalization) to be proportional to $\|v^{(q)}_a - u\|_{L^2(\Omega)}$ and $\|v^{(q)}_b - u\|_{L^2(\Omega)}$. Note again that the prediction error is much smaller when the kernel $G_a$ is used for the interpolation.
Now, let us consider the case where the interpolation points form a random subset of the discretization points. Take $N_f = 2^7$ and $N_c = 2^6$. Let $\{x_1, \ldots, x_{N_f}\}$ be a subset of $N_f$ distinct points of (the discretization points) $\{i/2^8 \mid i \in \mathcal{I}^{(8)}\}$ sampled with uniform distribution. Let $\{z_1, \ldots, z_{N_c}\}$ be a subset of $N_c$ distinct points of $\{x_1, \ldots, x_{N_f}\}$ sampled with uniform distribution. Write $v^f_a$ for the interpolation of the data $(x_i, u(x_i))$ using the kernel $G_a$ and $v^c_a$ for the interpolation of the data $(z_i, u(z_i))$ using the kernel $G_a$ (with $v^f_b$ and $v^c_b$ defined analogously using the kernel $G_b$). Fig. 6.1.6 shows (in semilogarithmic scale) 20 independent random realizations^3 of the values of $\rho(a) = \|v^f_a - v^c_a\|_a^2 / \|v^f_a\|_a^2$ and $\rho(b) = \|v^f_b - v^c_b\|_a^2 / \|v^f_b\|_a^2$. Fig. 6.1.7 shows (in semilogarithmic scale) 20 independent random realizations of the values of the prediction errors $e(a) \propto \|u - v^f_a\|_{L^2(\Omega)}$ and $e(b) \propto \|u - v^f_b\|_{L^2(\Omega)}$. Note again that the values of $\rho(a)$ and $e(a)$ are consistently and significantly lower than those of $\rho(b)$ and $e(b)$.

^3 Random realizations of the subsets.
Figure 6.2: [103, Fig. 6]. (1) $a$ and $a_w$ for $n = 1$, (2) $a$ and $a_w$ for $n = 350$, (3) $\rho(n)$ vs $n$, (4) $e(n)$ vs $n$.
Fig. 6.2 provides a numerical illustration of an implementation of Alg. 5 with $N = N_f = 2^7$, $N_c = 2^6$, and $n$ indexing the iterations of the KF algorithm starting with $n = 1$. In this implementation $a$, $f$ and $u$ are as in Fig. 6.1.1-3. The training data corresponds to $N$ points $X = \{x_1, \ldots, x_N\}$ uniformly sampled (without replacement) from $\{i/2^8 \mid i \in \mathcal{I}^{(8)}\}$. Note that $X$ remains fixed and, since $N = N_f$, the larger batch (as in step 4 of Alg. 5) is always selected as $X$. The purpose of the algorithm is to learn the kernel $G_a$ within the set of kernels $\{G_{a_w} \mid w\}$ parameterized by the vector $w$ via
\[
\log a_w(x) = \sum_{j=1}^{2^6} \big( w_{j,1}\cos(2\pi j x) + w_{j,2}\sin(2\pi j x) \big).
\tag{6.1.4}
\]
Using $n$ to label its progression, Alg. 5 is initialized at $n = 1$ with the guess $a_w \equiv 1$ (i.e., $w \equiv 0$) (Fig. 6.2.1). Fig. 6.2.2 shows the value of $a_w$ after $n = 350$ iterations, which can be seen to approximate $a$. Fig. 6.2.3 shows the value of $\rho(n)$ vs $n$, and Fig. 6.2.4 shows the value of the prediction error $e(n) \propto \|u - v_n\|_{L^2(\Omega)}$ vs $n$, where $v_n$ denotes the interpolant of the training data obtained with the kernel at iteration $n$. The lack of smoothness of the plots of $\rho(n)$ and $e(n)$ vs $n$ originates from the re-sampling of the sub-batch $X^c$ at each step $n$. Further details of this application of the KF algorithm can be found in [103, Sec. 5].
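As a final illustration, here is a sketch of how the parameterization (6.1.4) and the associated kernel family $\{G_{a_w}\}$ could be evaluated numerically. It re-uses the `stiffness` assembly from the sketch after (6.1.3); the coefficient layout `w[j-1] = (w_{j,1}, w_{j,2})` and the name `grid_kernel` are assumptions of this sketch, not notation from [103].

```python
import numpy as np

def a_w(x, w):
    """Conductivity of eq. (6.1.4): log a_w(x) = sum_j w[j-1,0] cos(2 pi j x) + w[j-1,1] sin(2 pi j x)."""
    j = np.arange(1, w.shape[0] + 1)[:, None]          # frequencies j = 1, ..., 2^6
    phase = 2 * np.pi * j * x[None, :]
    return np.exp(w[:, 0] @ np.cos(phase) + w[:, 1] @ np.sin(phase))

def grid_kernel(w, rows, cols, m=2 ** 8):
    """Green's-function kernel G_{a_w} between interior grid nodes, via the inverse stiffness matrix.
    `rows`/`cols` are 0-based indices of the nodes i/2^8 (node i <-> index i-1)."""
    h = 1.0 / m
    x_mid = (np.arange(m) + 0.5) * h
    G = np.linalg.inv(stiffness(a_w(x_mid, w), h))     # stiffness() as defined after (6.1.3)
    return G[np.ix_(rows, cols)]

w0 = np.zeros((2 ** 6, 2))                             # initial guess w = 0, i.e. a_w = 1
```

Feeding `grid_kernel` (with batches and sub-batches specified by node indices) into the loss $\rho$ of (6.1.1) and running the KF loop sketched earlier then mirrors, in spirit, the experiment of Fig. 6.2.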