Chapter I: Introduction
1.6 Patterns and kernels
transform described in Theorem 2.3.1 is shown to be within a constant factor of the minimax optimal recovery. This denoising can be interpreted as Gaussian conditioning with a covariance kernel derived from $\mathcal{L}$; the length-scale of the conditioning depends on the signal-to-noise ratio assumed in the problem. A conventional method for denoising smooth signals is to threshold empirical wavelet coefficients. In Section 2.4 we numerically compare the near-minimax optimal recovery with thresholding of the gamblet transform coefficients.
Our approach [102] to the mode decomposition problem, kernel mode decomposition (KMD), is outlined in Section 4. The algorithm has two main components. The first, which we name max-pooling, estimates the instantaneous phase and frequency of the lowest-frequency mode in a signal and is presented in Section 4.1; it is a close variant of the continuous wavelet transform (CWT) [24], which we summarize in Section 3.2. The second component uses GPR to estimate the instantaneous amplitude and phase of this lowest-frequency mode and is summarized in Section 4.2. We define a GP with a covariance kernel constructed from Gaussian-windowed trigonometric waves, i.e., Gabor wavelets [48]; GPR then estimates the instantaneous phase and amplitude of the lowest-frequency mode. Further, when the base waveform of each mode, i.e., $y_i$ in $v(t) = \sum_i a_i(t)\, y_i(\theta_i(t))$, is unknown, GPR can be applied to estimate $y_i$, as presented in Section 5.1. These algorithms were extended in [102, Sec. 10], showing that the method can be made robust to noise, vanishing amplitudes, and modes with crossing frequencies. We find that patterns within each mode can be estimated with GPR from the sum of modes, even when these patterns are visually indistinguishable in the composite signal.
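To fix notation, the following minimal sketch (hypothetical amplitudes, phases, and sampling; not data from [102]) synthesizes a two-mode signal of the form $v(t) = \sum_i a_i(t)\, y_i(\theta_i(t))$ with $y_i = \cos$; KMD seeks to recover each $a_i$ and $\theta_i$ from $v$ alone.

```python
import numpy as np

# Hypothetical two-mode signal v(t) = sum_i a_i(t) * cos(theta_i(t)):
# KMD aims to recover each amplitude a_i and phase theta_i from v alone.
t = np.linspace(0.0, 1.0, 2048)

a1, theta1 = 1.0 + 0.3 * t, 2 * np.pi * 40 * t                          # low-frequency mode
a2, theta2 = 0.5 * np.exp(-2 * t), 2 * np.pi * (90 * t + 10 * t ** 2)   # chirping mode

v = a1 * np.cos(theta1) + a2 * np.cos(theta2)   # observed composite signal
```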
Learning kernels from patterns
We will discuss the Kernel Flow (KF) algorithm [154] next. At a high level, the algorithm can be interpreted as learning kernels from patterns with a method inspired by cross-validation. The algorithm is described in Section 6 and operates under the principle that a kernel is desirable when it can make low-error predictions from small samples of the whole data set. This error is quantified by selecting $N$ random training points, computing the kernel interpolant of all $N$ points, and comparing it with the kernel interpolant of a random subset of $N/2$ of these points. Writing these interpolants as $u^{\dagger,f}$ and $u^{\dagger,c}$, respectively, the error is quantified by the KF loss function
$$\rho := \frac{\|u^{\dagger,f} - u^{\dagger,c}\|^2_K}{\|u^{\dagger,f}\|^2_K}, \qquad (1.6.1)$$
where $\|\cdot\|_K$ denotes the norm of the reproducing kernel Hilbert space of the kernel under consideration, and the KF algorithm selects a kernel by optimizing this loss. Since the loss depends mainly on the training data, the method selects a kernel based on patterns in that data. We present examples throughout Section 6, including on the MNIST dataset. We find that this technique is able to learn kernels which can predict classes accurately while observing only one point per class. Additionally, we observe evidence of unsupervised learning, since archetypes within each class appear to be learned. A further example of pattern learning with the KF algorithm can be found in [57], where it has been applied to data from chaotic dynamical systems to learn a model kernel.
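As an illustration, the sketch below (with a placeholder Gaussian kernel family, synthetic data, and assumed parameter names) evaluates the KF loss (1.6.1) for one random fine/coarse split, using the standard closed-form expression $\rho = 1 - y_c^{\top} K_{cc}^{-1} y_c / (y_f^{\top} K_{ff}^{-1} y_f)$.

```python
import numpy as np

def gaussian_kernel(X, Y, gamma):
    """k(x, y) = exp(-gamma * |x - y|^2), a placeholder kernel family."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kf_loss(X, y, gamma, rng):
    """One random-split evaluation of the KF loss rho for kernel parameter gamma."""
    n = len(y)
    fine = rng.choice(n, size=n // 2, replace=False)                 # N sampled points
    coarse = rng.choice(fine, size=len(fine) // 2, replace=False)    # random half of them
    Kff = gaussian_kernel(X[fine], X[fine], gamma)
    Kcc = gaussian_kernel(X[coarse], X[coarse], gamma)
    num = y[coarse] @ np.linalg.solve(Kcc, y[coarse])
    den = y[fine] @ np.linalg.solve(Kff, y[fine])
    return 1.0 - num / den

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.sin(4 * X[:, 0]) + np.cos(3 * X[:, 1])
print(kf_loss(X, y, gamma=5.0, rng=rng))
```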
An application of the KF algorithm to Artificial Neural Networks (ANNs) is further demonstrated in Section 7 to improve key performance statistics in MNIST and CIFAR image classification problems. These ANNs are widely used to address such problems and are defined as the mapping
$$f_\theta(x) = f^{(D)}_{\theta_D} \circ f^{(D-1)}_{\theta_{D-1}} \circ \cdots \circ f^{(1)}_{\theta_1}(x). \qquad (1.6.2)$$
This map has input $x$ and $D$ layers $f^{(i)}_{\theta_i}(z) = a(W_i z + b_i)$, parameterized¹⁷ by the weights and biases $\theta_i := (W_i, b_i)$, $\theta := \{\theta_1, \ldots, \theta_D\}$. The output of $f_\theta$ lies in $\mathbb{R}^m$, where $m$ is the number of classes in the dataset; it is converted into a classifier by selecting the component with the largest value. The parameters $\theta$ that best model the patterns of the data are learned by optimizing the error of the classifier, usually with the cross-entropy loss¹⁸, on the training data.
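A minimal NumPy sketch of this composition and of the largest-component classifier (layer widths, initialization, and the ReLU choice from footnote 17 are illustrative assumptions):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def f_theta(x, theta):
    """Apply f^(D) o ... o f^(1) with theta = [(W_1, b_1), ..., (W_D, b_D)]."""
    z = x
    for W, b in theta:
        z = relu(W @ z + b)
    return z  # one component per class

rng = np.random.default_rng(0)
sizes = [784, 128, 10]                      # hypothetical layer widths (e.g. MNIST)
theta = [(rng.normal(scale=0.05, size=(m, n)), np.zeros(m))
         for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=784)
label = int(np.argmax(f_theta(x, theta)))   # classifier: largest component
```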
Kernels can be incorporated into ANNs by allowing $f_\theta$ to map into a higher-dimensional space and applying kernel interpolation on the result, which leads to an improvement in error rates [103, Sec. 10]. Further improvements can be made by reverting to the standard largest-component classifier on $f_\theta$ and constructing a kernel that depends on the intermediate-layer outputs
$$h^{(i)}(x) := f^{(i)}_{\theta_i} \circ f^{(i-1)}_{\theta_{i-1}} \circ \cdots \circ f^{(1)}_{\theta_1}(x), \qquad (1.6.3)$$
for $i = 1, \ldots, D$. The KF loss corresponding to this kernel is then used in tandem with the standard cross-entropy loss, which leads to improvements in testing error, generalization gap, and robustness to distributional shift. Details on our numerical findings can be found in Section 7.1. Note that kernel interpolation itself is not directly used as a classifier; the KF loss is used only as a regularizer added to the loss function used in the optimization of the ANN parameters. This application of kernels is a novel method for training and clustering intermediate-layer outputs in conjunction with the final output $f_\theta$. We present numerical experiments showing that the KF loss function aids in learning parameters which most accurately classify patterns in images.

¹⁷Weights $W_i$ are linear operators and biases $b_i$ are vectors. The function $a$ is an arbitrary activation function, typically taken to be the ReLU, $a(z) = \max(0, z)$.

¹⁸This loss is defined in equation (7.0.4).
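Schematically, the training objective of Section 7 combines the two losses. The sketch below is a hedged illustration: the Gaussian kernel on intermediate features, the random batch split, the trace form of the loss for one-hot labels, and the weight `lambda_kf` are placeholder choices, not the exact construction used there.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy(logits, labels):
    """Standard softmax cross-entropy, averaged over the batch."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(labels)), labels].mean()

def kf_loss(features, y_onehot, gamma=1.0):
    """KF-style loss on intermediate-layer features h^(i) of a batch: compare the
    kernel interpolant of the full batch with that of a random half."""
    K = np.exp(-gamma * ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1))
    half = rng.choice(len(features), size=len(features) // 2, replace=False)
    num = np.trace(y_onehot[half].T @ np.linalg.solve(K[np.ix_(half, half)], y_onehot[half]))
    den = np.trace(y_onehot.T @ np.linalg.solve(K, y_onehot))
    return 1.0 - num / den

# Toy batch of intermediate-layer outputs, labels, and final-layer logits.
h = rng.normal(size=(64, 32))
labels = rng.integers(0, 10, size=64)
y = np.eye(10)[labels]
logits = rng.normal(size=(64, 10))

lambda_kf = 0.1   # placeholder weight of the KF regularizer
total_loss = cross_entropy(logits, labels) + lambda_kf * kf_loss(h, y)
```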
Chapter 2
DENOISING
2.1 Introduction to the denoising problem
[37–39] addressed the problem of recovering a smooth signal from noisy observations by soft-thresholding empirical wavelet coefficients [37]. More recently, [33] considered the recovery of $x \in X$ based on the observation of $Ax + \zeta$, where $\zeta$ is i.i.d. $\mathcal{N}(0, \sigma^2)$ noise and $A$ is a compact linear operator between Hilbert spaces $X$ and $Y$, with the prior that $x$ lies in an ellipsoid defined by the eigenvectors of $A^* A$. [33] showed that thresholding the coefficients of the corrupted signal $Ax + \zeta$ in the basis formed by the singular value decomposition (SVD) of $A$ (which can be computed in $O(N^3)$ complexity) approaches the minimax recovery up to a fixed multiplicative constant.
The contributions presented in this section [153] address denoising in the following formulation. Suppose
$$\mathcal{L} : H^s_0(\Omega) \to H^{-s}(\Omega) \qquad (2.1.1)$$
is a symmetric, positive, local¹ linear bijection with $s \in \mathbb{N}^*$ and regular bounded $\Omega \subset \mathbb{R}^d$ ($d \in \mathbb{N}$). Let $\|\cdot\|$ be the energy norm defined by
$$\|u\|^2 := \int_\Omega u \,\mathcal{L}u, \qquad (2.1.2)$$
and write
$$\langle u, v \rangle := \int_\Omega u \,\mathcal{L}v \qquad (2.1.3)$$
for the associated scalar product. Further, define
$$V_M := \{u \in H^s_0(\Omega) : \mathcal{L}u \in L^2(\Omega) \text{ and } \|\mathcal{L}u\|_{L^2(\Omega)} \le M\}. \qquad (2.1.4)$$
Further, let
$$\zeta \sim \mathcal{N}(0, \sigma^2 \delta(x - y)) \qquad (2.1.5)$$
be white noise on the domain $\Omega$ with variance $\sigma^2$. The following is the continuous version of the denoising problem studied in this section.
Problem 5. Let $u$ be an unknown element of $V_M$. Given the noisy observation $\eta = u + \zeta$, find an approximation of $u$ that is as accurate as possible in the energy norm $\|\cdot\|$.
¹Symmetric, positive, and local are defined as $\int_\Omega u \,\mathcal{L}v = \int_\Omega v \,\mathcal{L}u$, $\int_\Omega u \,\mathcal{L}u > 0$ for $u \neq 0$, and $\int_\Omega u \,\mathcal{L}v = 0$ for $u, v \in H^s_0(\Omega)$ with disjoint supports, respectively.
This problem will be illustrated in the $s = 1$ case with the linear differential operator $\mathcal{L} = -\operatorname{div}(a(x)\nabla \cdot)$, where the conductivity $a$ is a uniformly elliptic symmetric $d \times d$ matrix with entries in $L^\infty(\Omega)$. This example is of practical importance in groundwater flow modeling (where $a$ is the porosity of the medium) and in electrostatics (where $a$ is the dielectric constant), and in both applications $a$ may be rough (non-smooth) [7, 19].
Example 2.1.1. Assuming
$$\mathcal{L} = -\operatorname{div}(a(x)\nabla\cdot) : H^1_0(\Omega) \to H^{-1}(\Omega), \qquad (2.1.6)$$
Problem 5 then corresponds to the problem of recovering the solution of the PDE
$$\begin{cases} -\operatorname{div}(a(x)\nabla u(x)) = g(x), & x \in \Omega; \\ u = 0 & \text{on } \partial\Omega, \end{cases} \qquad (2.1.7)$$
from its noisy observation $\eta = u + \zeta$ with the knowledge $\|g\|_{L^2(\Omega)} < M$.
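For intuition, a minimal sketch of Example 2.1.1 in one dimension (mesh size, conductivity, source, and noise level below are illustrative assumptions, not the settings used in Section 2.3):

```python
import numpy as np

# 1D illustration of Example 2.1.1: discretize -(a(x) u'(x))' = g(x) on (0,1)
# with u(0) = u(1) = 0 using P1 finite elements on a uniform mesh.
n = 255                                   # interior nodes
h = 1.0 / (n + 1)
x = np.linspace(h, 1.0 - h, n)
g = np.sin(3 * np.pi * x)                 # illustrative source term

# Conductivity evaluated at the element midpoints.
a_mid = 1.0 + 0.5 * np.cos(2 * np.pi * (np.arange(n + 1) + 0.5) * h)

# Tridiagonal stiffness matrix of -(a u')'.
A = (np.diag(a_mid[:-1] + a_mid[1:])
     - np.diag(a_mid[1:-1], 1) - np.diag(a_mid[1:-1], -1)) / h

u = np.linalg.solve(A, g * h)             # discrete solution of (2.1.7)
eta = u + 1e-3 * np.random.default_rng(0).normal(size=n)   # noisy observation eta = u + zeta
```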
This problem is addressed by expressing $\eta$ in the gamblet transform adapted to the operator $\mathcal{L}$ and truncating the resulting series. This method is proved to yield a recovery within a constant factor of the minimax optimal recovery [153]. It is numerically compared to thresholding the gamblet transform coefficients as well as to regularization, i.e., the minimization of
$$\|v(\eta) - \eta\|^2_{L^2(\Omega)} + \alpha\|v(\eta)\|^2. \qquad (2.1.8)$$

2.2 Summary of operator-adapted wavelets
We proceed by reviewing operator-adapted wavelets as in [153, Sec. 2], also named gamblets in reference to their game-theoretic interpretation, and their main properties [99, 101, 104, 119]. They are constructed from a hierarchy of measurement functions and an operator. Theorem 2.2.4 shows that these gamblets are simultaneously associated with Gaussian conditioning, optimal recovery, and game theory. By selecting the measurement functions to be pre-Haar wavelets, the gamblets are localized both in space and in the eigenspace of the operator.
Hierarchy of measurement functions
Let $q \in \mathbb{N}^*$ (used to represent a number of scales). Let $(\mathcal{I}^{(k)})_{1 \le k \le q}$ be a hierarchy of labels defined as follows. $\mathcal{I}^{(q)}$ is a set of $q$-tuples consisting of elements $i = (i_1, \ldots, i_q)$. For $1 \le k \le q$ and $i \in \mathcal{I}^{(q)}$, write $i^{(k)} := (i_1, \ldots, i_k)$, and let $\mathcal{I}^{(k)}$ be the set of $k$-tuples $\mathcal{I}^{(k)} = \{i^{(k)} \mid i \in \mathcal{I}^{(q)}\}$. For $1 \le l \le k \le q$ and $i = (i_1, \ldots, i_k) \in \mathcal{I}^{(k)}$, we write $i^{(l)} = (i_1, \ldots, i_l)$. We say that $M$ is an $\mathcal{I}^{(k)} \times \mathcal{I}^{(l)}$ matrix if its rows and columns are indexed by elements of $\mathcal{I}^{(k)}$ and $\mathcal{I}^{(l)}$, respectively.

Let $\{\varphi^{(k)}_i \mid k \in \{1, \ldots, q\},\, i \in \mathcal{I}^{(k)}\}$ be a nested hierarchy of elements of $H^{-s}(\Omega)$ such that the $(\varphi^{(k)}_i)_{i \in \mathcal{I}^{(k)}}$ are linearly independent and
$$\varphi^{(k)}_i = \sum_{j \in \mathcal{I}^{(k+1)}} \pi^{(k,k+1)}_{i,j} \varphi^{(k+1)}_j \qquad (2.2.1)$$
for $i \in \mathcal{I}^{(k)}$, $k \in \{1, \ldots, q-1\}$, where $\pi^{(k,k+1)}$ is an $\mathcal{I}^{(k)} \times \mathcal{I}^{(k+1)}$ matrix and
$$\pi^{(k,k+1)} \pi^{(k+1,k)} = I^{(k)}. \qquad (2.2.2)$$
In (2.2.2), $\pi^{(k+1,k)}$ is the transpose of $\pi^{(k,k+1)}$ and $I^{(k)}$ is the $\mathcal{I}^{(k)} \times \mathcal{I}^{(k)}$ identity matrix.
Hierarchy of operator-adapted pre-wavelets

Let $(\psi^{(k)}_i)_{i \in \mathcal{I}^{(k)}}$ be the hierarchy of optimal recovery splines associated with $(\varphi^{(k)}_i)_{i \in \mathcal{I}^{(k)}}$, i.e., for $k \in \{1, \ldots, q\}$ and $i \in \mathcal{I}^{(k)}$,
$$\psi^{(k)}_i = \sum_{j \in \mathcal{I}^{(k)}} A^{(k)}_{i,j} \mathcal{L}^{-1}\varphi^{(k)}_j, \qquad (2.2.3)$$
where
$$A^{(k)} := (\Theta^{(k)})^{-1} \qquad (2.2.4)$$
and $\Theta^{(k)}$ is the $\mathcal{I}^{(k)} \times \mathcal{I}^{(k)}$ symmetric positive definite Gramian matrix with entries (writing $[\varphi, v]$ for the duality pairing between $\varphi \in H^{-s}(\Omega)$ and $v \in H^s_0(\Omega)$)
$$\Theta^{(k)}_{i,j} = [\varphi^{(k)}_i, \mathcal{L}^{-1}\varphi^{(k)}_j]. \qquad (2.2.5)$$
Note that $A^{(k)}$ is the stiffness matrix of the elements $(\psi^{(k)}_i)_{i \in \mathcal{I}^{(k)}}$ in the sense that
$$A^{(k)}_{i,j} = \langle \psi^{(k)}_i, \psi^{(k)}_j \rangle. \qquad (2.2.6)$$
Writing $\Phi^{(k)} := \operatorname{span}\{\varphi^{(k)}_i \mid i \in \mathcal{I}^{(k)}\}$ and $\Psi^{(k)} := \operatorname{span}\{\psi^{(k)}_i \mid i \in \mathcal{I}^{(k)}\}$, $\Phi^{(k)} \subset \Phi^{(k+1)}$ and $\Psi^{(k)} = \mathcal{L}^{-1}\Phi^{(k)}$ imply $\Psi^{(k)} \subset \Psi^{(k+1)}$. We further write $[\varphi^{(k)}, u] = ([\varphi^{(k)}_i, u])_{i \in \mathcal{I}^{(k)}} \in \mathbb{R}^{\mathcal{I}^{(k)}}$. The $(\varphi^{(k)}_i)_{i \in \mathcal{I}^{(k)}}$ and $(\psi^{(k)}_i)_{i \in \mathcal{I}^{(k)}}$ form a bi-orthogonal system in the sense that
$$[\varphi^{(k)}_i, \psi^{(k)}_j] = \delta_{i,j} \quad \text{for } i, j \in \mathcal{I}^{(k)}, \qquad (2.2.7)$$
and the $\langle \cdot, \cdot \rangle$-orthogonal projection of $u \in H^s_0(\Omega)$ on $\Psi^{(k)}$ is
$$u^{(k)} := \sum_{i \in \mathcal{I}^{(k)}} [\varphi^{(k)}_i, u]\, \psi^{(k)}_i. \qquad (2.2.8)$$
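In a finite-dimensional stand-in (a random SPD matrix playing the role of $\mathcal{L}$ and random vectors playing the role of the measurement functions; illustrative assumptions only), the construction (2.2.3)-(2.2.8) reads:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 16

# Discrete stand-in for the operator L: a random SPD matrix on R^n,
# with energy inner product <u, v> = u^T L v.
G = rng.normal(size=(n, n))
L = G @ G.T + n * np.eye(n)

# Rows of Phi represent measurement functionals: [phi_i, u] = Phi[i] @ u.
Phi = rng.normal(size=(m, n))

Linv_Phi = np.linalg.solve(L, Phi.T)          # columns L^{-1} phi_j
Theta = Phi @ Linv_Phi                        # Gramian (2.2.5)
A = np.linalg.inv(Theta)                      # (2.2.4)
Psi = Linv_Phi @ A                            # columns psi_i, per (2.2.3)

# Bi-orthogonality (2.2.7) and stiffness identity (2.2.6):
assert np.allclose(Phi @ Psi, np.eye(m))
assert np.allclose(Psi.T @ L @ Psi, A)

u = rng.normal(size=n)
u_k = Psi @ (Phi @ u)                         # projection u^(k) as in (2.2.8)
```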
Multiple interpretations of operator-adapted pre-wavelets

Using the operator-adapted pre-wavelets $\psi^{(k)}_i$, we summarize the connections between optimal recovery, game theory, and Gaussian conditioning. First, we define Gaussian fields, a generalization of Gaussian processes.

Definition 2.2.1. The canonical Gaussian field $\xi$ associated with the operator $\mathcal{L} : H^s_0(\Omega) \to H^{-s}(\Omega)$ is defined such that $\varphi \mapsto [\varphi, \xi]$ is the linear isometry from $H^{-s}(\Omega)$ to a Gaussian space characterized by
$$\begin{cases} [\varphi, \xi] \sim \mathcal{N}(0, \|\varphi\|_*^2), \\ \operatorname{Cov}\big([\varphi, \xi], [\phi, \xi]\big) = \langle \varphi, \phi \rangle_*, \end{cases} \qquad (2.2.9)$$
where $\|\varphi\|_* = \sup_{u \in H^s_0(\Omega)} \frac{\int_\Omega \varphi u}{\|u\|}$ is the dual norm of $\|\cdot\|$.

Remark 2.2.2. When $s > d/2$, the evaluation functional $\delta_x(f) = f(x)$ is continuous. Hence $\xi$, paired with the functionals $\{\delta_x\}_{x \in \Omega}$, is naturally isomorphic to a Gaussian process with covariance function $K(x, x') = \langle \delta_x, \delta_{x'} \rangle_*$.
Several notable properties of these pre-wavelets are summarized in the following result. Recall that we write $[\varphi^{(k)}, u] = ([\varphi^{(k)}_i, u])_{i \in \mathcal{I}^{(k)}} \in \mathbb{R}^{\mathcal{I}^{(k)}}$.

Theorem 2.2.3. Consider pre-wavelets $\psi^{(k)}_i$ adapted to the operator $\mathcal{L}$ and constructed with measurement functions $\varphi^{(k)}_i$. Further, suppose that for $u \in H^s_0(\Omega)$ we define $v^\dagger(u) = u^{(k)} = \sum_{i \in \mathcal{I}^{(k)}} [\varphi^{(k)}_i, u]\, \psi^{(k)}_i$.

1. For fixed $u \in H^s_0(\Omega)$, $v^\dagger(u)$ is the minimizer of
$$\begin{cases} \text{Minimize } \|\psi\| \\ \text{Subject to } \psi \in H^s_0(\Omega) \text{ and } [\varphi^{(k)}, \psi] = [\varphi^{(k)}, u]. \end{cases} \qquad (2.2.10)$$

2. For fixed $u \in H^s_0(\Omega)$, $v^\dagger(u)$ is the minimizer of
$$\begin{cases} \text{Minimize } \|u - \psi\| \\ \text{Subject to } \psi \in \operatorname{span}\{\psi^{(k)}_i : i \in \mathcal{I}^{(k)}\}. \end{cases} \qquad (2.2.11)$$

3. For the canonical Gaussian field $\xi \sim \mathcal{N}(0, \mathcal{L}^{-1})$,
$$v^\dagger(u) = \mathbb{E}\big[\xi \mid [\varphi^{(k)}, \xi] = [\varphi^{(k)}, u]\big]. \qquad (2.2.12)$$

4. It holds true that²
$$v^\dagger \in \operatorname{argmin}_{v \in L(\Phi, H^s_0(\Omega))} \sup_{u \in H^s_0(\Omega)} \frac{\|u - v(u)\|}{\|u\|}. \qquad (2.2.13)$$

Proof. (1) is a result of [101, Cor. 3.4], (2) is equivalent to [101, Thm. 12.2], and (3) and (4) are results in [101, Sec. 8.5].

²$L(\Phi, H^s_0(\Omega))$ is defined as the set of functions $v : H^s_0(\Omega) \to H^s_0(\Omega)$ of the form $v(u) = \Psi([\varphi^{(k)}, u])$ with measurable $\Psi : \mathbb{R}^{\mathcal{I}^{(k)}} \to H^s_0(\Omega)$.
This result shows that the operator-adapted pre-wavelet transform defined by $v^\dagger(u) = u^{(k)}$ is an optimal recovery in the sense of Theorem 2.2.3.1-2. Simultaneously, $v^\dagger(u)$ is the conditional expectation of the canonical Gaussian field given the measurements $[\varphi^{(k)}, \cdot]$, as in Theorem 2.2.3.3. Another interpretation of the transform is game-theoretic, as expressed in Theorem 2.2.3.4: equation (2.2.13) represents the adversarial two-player game in which player I selects $u \in H^s_0(\Omega)$ and player II approximates $u$ with $v(u)$ from the measurements $[\varphi^{(k)}, u]$; players I and II aim, respectively, to maximize and minimize the recovery error of $v(u)$. This game-theoretic interpretation inspires the name gamblets for operator-adapted wavelets.
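A finite-dimensional sanity check of the equivalence between the optimal-recovery and Gaussian-conditioning interpretations (with a random SPD stand-in for $\mathcal{L}$; an illustration, not the constructions used later):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 150, 12
G = rng.normal(size=(n, n))
L = G @ G.T + n * np.eye(n)          # discrete stand-in for the operator
Phi = rng.normal(size=(m, n))        # measurement functionals
u = rng.normal(size=n)

# Optimal-recovery form: u^(k) = sum_i [phi_i, u] psi_i  (Theorem 2.2.3.1-2).
Linv_Phi = np.linalg.solve(L, Phi.T)
Psi = Linv_Phi @ np.linalg.inv(Phi @ Linv_Phi)
u_k = Psi @ (Phi @ u)

# Gaussian-conditioning form (Theorem 2.2.3.3): for xi ~ N(0, L^{-1}),
# E[xi | Phi xi = Phi u] = L^{-1} Phi^T (Phi L^{-1} Phi^T)^{-1} Phi u.
cond_mean = Linv_Phi @ np.linalg.solve(Phi @ Linv_Phi, Phi @ u)

assert np.allclose(u_k, cond_mean)   # the two interpretations coincide
```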
Note that the pre-wavelets $\psi^{(k)}_i$ live on a single level of the hierarchy. The following addresses the construction of a wavelet decomposition of $H^s_0(\Omega)$ across all hierarchical levels.
Operator-adapted wavelets
Let $(\mathcal{J}^{(k)})_{2 \le k \le q}$ be a hierarchy of labels such that, writing $|\mathcal{J}^{(k)}|$ for the cardinality of $\mathcal{J}^{(k)}$,
$$|\mathcal{J}^{(k)}| = |\mathcal{I}^{(k)}| - |\mathcal{I}^{(k-1)}|. \qquad (2.2.14)$$
For $k \in \{2, \ldots, q\}$, let $W^{(k)}$ be a $\mathcal{J}^{(k)} \times \mathcal{I}^{(k)}$ matrix such that³
$$\operatorname{Ker}(\pi^{(k-1,k)}) = \operatorname{Im}(W^{(k),T}). \qquad (2.2.15)$$
For $k \in \{2, \ldots, q\}$ and $i \in \mathcal{J}^{(k)}$, define
$$\chi^{(k)}_i := \sum_{j \in \mathcal{I}^{(k)}} W^{(k)}_{i,j} \psi^{(k)}_j, \qquad (2.2.16)$$
and write $X^{(k)} := \operatorname{span}\{\chi^{(k)}_i \mid i \in \mathcal{J}^{(k)}\}$. Then $X^{(k)}$ is the $\langle \cdot, \cdot \rangle$-orthogonal complement of $\Psi^{(k-1)}$ in $\Psi^{(k)}$, i.e., $\Psi^{(k)} = \Psi^{(k-1)} \oplus X^{(k)}$, and
$$\Psi^{(k)} = X^{(1)} \oplus X^{(2)} \oplus \cdots \oplus X^{(k)}. \qquad (2.2.17)$$

³We write $W^{(k),T}$ and $W^{(k),-1}$ for the transpose and inverse of a matrix $W^{(k)}$.
For $k \in \{2, \ldots, q\}$, write
$$B^{(k)} := W^{(k)} A^{(k)} W^{(k),T}. \qquad (2.2.18)$$
Note that $B^{(k)}$ is the stiffness matrix of the elements $(\chi^{(k)}_i)_{i \in \mathcal{J}^{(k)}}$, i.e.,
$$B^{(k)}_{i,j} = \langle \chi^{(k)}_i, \chi^{(k)}_j \rangle. \qquad (2.2.19)$$
Further, for $k \in \{2, \ldots, q\}$, define
$$N^{(k)} := A^{(k)} W^{(k),T} B^{(k),-1} \qquad (2.2.20)$$
and, for $i \in \mathcal{J}^{(k)}$,
$$\chi^{(k),*}_i := \sum_{j \in \mathcal{I}^{(k)}} N^{(k),T}_{i,j} \varphi^{(k)}_j. \qquad (2.2.21)$$
Then, defining $u^{(k)}$ as in (2.2.8), it holds true that for $k \in \{2, \ldots, q\}$, $u^{(k)} - u^{(k-1)}$ is the $\langle \cdot, \cdot \rangle$-orthogonal projection of $u$ on $X^{(k)}$ and
$$u^{(k)} - u^{(k-1)} = \sum_{i \in \mathcal{J}^{(k)}} [\chi^{(k),*}_i, u]\, \chi^{(k)}_i. \qquad (2.2.22)$$

To simplify notations, write $\mathcal{J}^{(1)} := \mathcal{I}^{(1)}$, $B^{(1)} := A^{(1)}$, $W^{(1)} := I^{(1)}$, $\chi^{(1),*}_i := \varphi^{(1)}_i$ for $i \in \mathcal{J}^{(1)}$, $\mathcal{J} := \mathcal{J}^{(1)} \cup \cdots \cup \mathcal{J}^{(q)}$, and $\chi_i := \chi^{(k)}_i$, $\chi^*_i := \chi^{(k),*}_i$ for $i \in \mathcal{J}^{(k)}$ and⁴ $1 \le k \le q$. Then the $\chi^*_i$ and $\chi_i$ form a bi-orthogonal system, i.e.,
$$[\chi^*_i, \chi_j] = \delta_{i,j} \quad \text{for } i, j \in \mathcal{J}, \qquad (2.2.23)$$
and
$$u^{(q)} = \sum_{i \in \mathcal{J}} [\chi^*_i, u]\, \chi_i. \qquad (2.2.24)$$
Simplifying notations further, we will write $[\chi^*, u]$ for the $\mathcal{J}$-vector with entries $[\chi^*_i, u]$ and $\chi$ for the $\mathcal{J}$-vector with entries $\chi_i$, so that (2.2.24) can be written
$$u^{(q)} = [\chi^*, u] \cdot \chi. \qquad (2.2.25)$$
Further, define the $\mathcal{J} \times \mathcal{J}$ block-diagonal matrix $B$ by $B_{i,j} = B^{(k)}_{i,j}$ if $i, j \in \mathcal{J}^{(k)}$ and $B_{i,j} = 0$ otherwise. Note that $B_{i,j} = \langle \chi_i, \chi_j \rangle$. When $q = \infty$ and $\cup_{k=1}^\infty \Phi^{(k)}$ is dense in $H^{-s}(\Omega)$, then, writing $X^{(1)} := \Psi^{(1)}$,
$$H^s_0(\Omega) = \oplus_{k=1}^\infty X^{(k)}, \qquad (2.2.26)$$
$u^{(q)} = u$, and (2.2.24) is the corresponding multi-resolution decomposition of $u$. When $q < \infty$, $u^{(q)}$ is the projection of $u$ on $\oplus_{k=1}^q X^{(k)}$ and (2.2.25) is the corresponding multi-resolution decomposition. Note that the optimal recovery, game theory, and Gaussian conditioning results of Theorem 2.2.3 also hold for the wavelets.

⁴The dependence on $k$ is left implicit to simplify notation: for $i \in \mathcal{J}$ there exists a unique $k$ such that $i \in \mathcal{J}^{(k)}$.
Theorem 2.2.4. Consider wavelets $\chi_i$ adapted to the operator $\mathcal{L}$ and constructed with measurement functions $\varphi_i$. Further, suppose that for $u \in H^s_0(\Omega)$ we define $v^\dagger(u) = u^{(q)} = [\chi^*, u] \cdot \chi$.

1. For fixed $u \in H^s_0(\Omega)$, $v^\dagger(u)$ is the minimizer of
$$\begin{cases} \text{Minimize } \|\psi\| \\ \text{Subject to } \psi \in H^s_0(\Omega) \text{ and } [\chi^*, \psi] = [\chi^*, u]. \end{cases} \qquad (2.2.27)$$

2. For fixed $u \in H^s_0(\Omega)$, $v^\dagger(u)$ is the minimizer of
$$\begin{cases} \text{Minimize } \|u - \psi\| \\ \text{Subject to } \psi \in \operatorname{span}\{\chi_i : i \in \mathcal{J}\}. \end{cases} \qquad (2.2.28)$$

3. For the canonical Gaussian field $\xi \sim \mathcal{N}(0, \mathcal{L}^{-1})$,
$$v^\dagger(u) = \mathbb{E}\big[\xi \mid [\chi^*, \xi] = [\chi^*, u]\big]. \qquad (2.2.29)$$

4. It holds true that⁵
$$v^\dagger \in \operatorname{argmin}_{v \in L(\Phi, H^s_0(\Omega))} \sup_{u \in H^s_0(\Omega)} \frac{\|u - v(u)\|}{\|u\|}. \qquad (2.2.30)$$

⁵Here $L(\Phi, H^s_0(\Omega))$ is defined as the set of functions $v : H^s_0(\Omega) \to H^s_0(\Omega)$ of the form $v(u) = \Psi([\chi^*, u])$ with measurable $\Psi : \mathbb{R}^{\mathcal{J}} \to H^s_0(\Omega)$.
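To make the two-level structure concrete, the following sketch (a discrete stand-in in which a random SPD matrix plays the role of $\mathcal{L}$ and Haar-type $\pi$ and $W$ matrices are used) builds $\psi$, $\chi$, and $\chi^*$ on two levels and checks the decomposition (2.2.22) and the bi-orthogonality (2.2.23).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m1 = 256, 4
m2 = 2 * m1

G = rng.normal(size=(n, n))
L = G @ G.T + n * np.eye(n)                              # discrete stand-in for the operator

Phi2 = rng.normal(size=(m2, n))                          # level-2 measurement functionals
pi12 = np.kron(np.eye(m1), [[1.0, 1.0]]) / np.sqrt(2)    # nesting matrix pi^(1,2), cf. (2.2.1)
Phi1 = pi12 @ Phi2                                       # level-1 functionals
W2 = np.kron(np.eye(m1), [[1.0, -1.0]]) / np.sqrt(2)     # rows span Ker(pi^(1,2)), cf. (2.2.15)

def prewavelets(Phi):
    """Return (Psi, A): columns of Psi are the psi_i of (2.2.3), A as in (2.2.4)."""
    LinvPhiT = np.linalg.solve(L, Phi.T)
    A = np.linalg.inv(Phi @ LinvPhiT)
    return LinvPhiT @ A, A

Psi1, A1 = prewavelets(Phi1)
Psi2, A2 = prewavelets(Phi2)

B2 = W2 @ A2 @ W2.T                                      # (2.2.18)
Chi2 = Psi2 @ W2.T                                       # columns chi_i, (2.2.16)
ChiStar2 = Phi2.T @ (A2 @ W2.T @ np.linalg.inv(B2))      # columns chi*_i, (2.2.20)-(2.2.21)

u = rng.normal(size=n)
u1 = Psi1 @ (Phi1 @ u)                                   # level-1 projection
u2 = Psi2 @ (Phi2 @ u)                                   # level-2 projection

assert np.allclose(ChiStar2.T @ Chi2, np.eye(m1))        # bi-orthogonality (2.2.23)
assert np.allclose(u2 - u1, Chi2 @ (ChiStar2.T @ u))     # decomposition (2.2.22)
assert np.allclose(Psi1.T @ L @ Chi2, 0)                 # chi orthogonal to Psi^(1) in <.,.>
```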
Pre-Haar wavelet measurement functions
The gamblets used in the subsequent developments will use pre-Haar wavelets (as defined below) as measurement functions $\varphi^{(k)}_i$, and our main near-optimal denoising estimates will be derived from their properties (summarized in Theorem 2.2.5).

Let $\delta, h \in (0,1)$. Let $(\tau^{(k)}_i)_{i \in \mathcal{I}^{(k)}}$ be uniformly Lipschitz convex sets forming a nested partition of $\Omega$, i.e., such that $\Omega = \cup_{i \in \mathcal{I}^{(k)}} \tau^{(k)}_i$, $k \in \{1, \ldots, q\}$, is a disjoint union except for the boundaries, and $\tau^{(k)}_i = \cup_{j \in \mathcal{I}^{(k+1)} : j^{(k)} = i}\, \tau^{(k+1)}_j$, $k \in \{1, \ldots, q-1\}$. Assume that each $\tau^{(k)}_i$ contains a ball of radius $\delta h^k$ and is contained in a ball of radius $\delta^{-1} h^k$. Writing $|\tau^{(k)}_i|$ for the volume of $\tau^{(k)}_i$, take
$$\varphi^{(k)}_i := 1_{\tau^{(k)}_i} |\tau^{(k)}_i|^{-\frac{1}{2}}. \qquad (2.2.31)$$
The nesting relation (2.2.1) is then satisfied with $\pi^{(k,k+1)}_{i,j} := |\tau^{(k+1)}_j|^{\frac{1}{2}} |\tau^{(k)}_i|^{-\frac{1}{2}}$ for $j^{(k)} = i$ and $\pi^{(k,k+1)}_{i,j} := 0$ otherwise.

For $k \in \{2, \ldots, q\}$, let $\mathcal{J}^{(k)}$ be a finite set of $k$-tuples of the form $j = (j_1, \ldots, j_k)$ such that $\{j^{(k-1)} \mid j \in \mathcal{J}^{(k)}\} = \mathcal{I}^{(k-1)}$ and, for $i \in \mathcal{I}^{(k-1)}$, $\operatorname{Card}\{j \in \mathcal{J}^{(k)} \mid j^{(k-1)} = i\} = \operatorname{Card}\{l \in \mathcal{I}^{(k)} \mid l^{(k-1)} = i\} - 1$. Note that the cardinalities of these sets satisfy (2.2.14).

Write $J^{(k)}$ for the $\mathcal{J}^{(k)} \times \mathcal{J}^{(k)}$ identity matrix. For $k = 2, \ldots, q$, let $W^{(k)}$ be a $\mathcal{J}^{(k)} \times \mathcal{I}^{(k)}$ matrix such that $\operatorname{Im}(W^{(k),T}) = \operatorname{Ker}(\pi^{(k-1,k)})$, $W^{(k)}(W^{(k)})^T = J^{(k)}$, and $W^{(k)}_{i,j} = 0$ for $i^{(k-1)} \neq j^{(k-1)}$.
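A short sketch of the pre-Haar construction on $\Omega = [0,1]$ (discretized on a dyadic grid; an illustration of (2.2.31) and of the associated $\pi^{(k,k+1)}$ and $W^{(k)}$ matrices):

```python
import numpy as np

# Pre-Haar measurement functions on Omega = [0, 1], discretized on a grid of
# n = 2^q cells: phi^(k)_i = indicator of the i-th dyadic interval at level k,
# normalized by |tau|^{-1/2} as in (2.2.31). (Illustrative discretization.)
q = 4
n = 2 ** q

def phi_level(k):
    """Rows are the 2^k pre-Haar measurement functions at level k."""
    cells_per_interval = n // 2 ** k
    tau_volume = cells_per_interval / n
    Phi = np.zeros((2 ** k, n))
    for i in range(2 ** k):
        Phi[i, i * cells_per_interval:(i + 1) * cells_per_interval] = 1.0
    return Phi * tau_volume ** -0.5

for k in range(1, q):
    # Nesting matrix pi^(k,k+1): each parent is (1/sqrt(2)) times the sum of its two children.
    pi = np.kron(np.eye(2 ** k), np.array([[1.0, 1.0]])) / np.sqrt(2)
    # W^(k+1): one Haar-type difference per parent, supported on its children.
    W = np.kron(np.eye(2 ** k), np.array([[1.0, -1.0]])) / np.sqrt(2)

    assert np.allclose(phi_level(k), pi @ phi_level(k + 1))   # nesting (2.2.1)
    assert np.allclose(pi @ pi.T, np.eye(2 ** k))             # (2.2.2)
    assert np.allclose(W @ W.T, np.eye(2 ** k))               # W W^T = J^(k+1)
    assert np.allclose(pi @ W.T, 0)                           # rows of W span Ker(pi), cf. (2.2.15)
```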
Theorem 2.2.5. With pre-Haar wavelet measurement functions, it holds true that

1. For $k \in \{1, \ldots, q\}$ and $u \in \mathcal{L}^{-1}L^2(\Omega)$,
$$\|u - u^{(k)}\| \le C h^{ks} \|\mathcal{L}u\|_{L^2(\Omega)}. \qquad (2.2.32)$$

2. Writing $\operatorname{Cond}(M)$ for the condition number of a matrix $M$, we have, for $k \in \{1, \ldots, q\}$,
$$C^{-1} h^{-2(k-1)s} J^{(k)} \le B^{(k)} \le C h^{-2ks} J^{(k)} \qquad (2.2.33)$$
and $\operatorname{Cond}(B^{(k)}) \le C h^{-2s}$.

3. For $i \in \mathcal{I}^{(k)}$ and $x^{(k)}_i \in \tau^{(k)}_i$,
$$\|\chi_i\|_{H^s(\Omega \setminus B(x^{(k)}_i,\, n h^k))} \le C h^{-ks} e^{-n/C}. \qquad (2.2.34)$$

4. The wavelets $\psi^{(k)}_i$, $\chi^{(k)}_i$ and the stiffness matrices $A^{(k)}$, $B^{(k)}$ can be computed to precision $\varepsilon$ (in the $\|\cdot\|$ energy norm for elements of $H^s_0(\Omega)$ and in the Frobenius norm for matrices) in $O(N \log^{3d} \frac{N}{\varepsilon})$ complexity.

Furthermore, the constant $C$ depends only on $\delta$, $\Omega$, $d$, $s$,
$$\|\mathcal{L}\| := \sup_{u \in H^s_0(\Omega)} \frac{\|\mathcal{L}u\|_{H^{-s}(\Omega)}}{\|u\|_{H^s_0(\Omega)}} \quad \text{and} \quad \|\mathcal{L}^{-1}\| := \sup_{u \in H^s_0(\Omega)} \frac{\|u\|_{H^s_0(\Omega)}}{\|\mathcal{L}u\|_{H^{-s}(\Omega)}}. \qquad (2.2.35)$$

Proof. (1) and (2) follow from an application of Prop. 4.17 and Theorems 4.14 and 3.19 of [100]. (3) follows from Thm. 2.23 of [100]. (4) follows from the complexity analysis of Alg. 6 of [100]. See [101] for detailed proofs.
Remark 2.2.6. The wavelets $\psi^{(k)}_i$, $\chi^{(k)}_i$ and the stiffness matrices $A^{(k)}$, $B^{(k)}$ can also be computed in $O(N \log^2 N \log^{2d} \frac{N}{\varepsilon})$ complexity using the incomplete Cholesky factorization approach of [119].
Theorem 2.2.5.2-3 implies that the gamblets are localized both in the eigenspace of the operator $\mathcal{L}$ and in the physical space $\Omega$. Further, Theorem 2.2.5.1 shows that the accuracy of the recovery $u^{(k)}$ in the energy norm is bounded by the $L^2$ norm of $\mathcal{L}u$. These results are used in the proofs of the denoising results presented in the following section.
2.3 Denoising by truncating the gamblet transform

Near minimax recovery
In this section, we will present the result that truncating the gamblet transform of $\eta = u + \zeta$ in a discrete variant of Problem 5 produces an approximation of $u$ that is minimax optimal up to a multiplicative constant [153, Sec. 4], i.e., near minimax. The discretized version of $H^s_0(\Omega)$ is the finite-dimensional space spanned by the gamblet wavelets, using the pre-Haar measurement functions defined in Section 2.2, taken to the $q$-th level⁶. In addition, the discrete noise used in this problem, $\zeta \in \Psi^{(q)}$, is the projection of the noise (2.1.5) onto $\Psi^{(q)}$ (due to (2.2.8)).

Problem 6. Let $u$ be an unknown element of $\Psi^{(q)} \subset H^s_0(\Omega)$ for $q < \infty$. Let $\zeta$ be a centered Gaussian vector in $\Psi^{(q)}$ such that
$$\mathbb{E}\big[[\varphi^{(q)}_i, \zeta]\, [\varphi^{(q)}_j, \zeta]\big] = \sigma^2 \delta_{i,j}. \qquad (2.3.1)$$
Given the noisy observation $\eta = u + \zeta$ and a prior bound $M$ on $\|\mathcal{L}u\|_{L^2}$, find an approximation of $u$ in $\Psi^{(q)}$ that is as accurate as possible in the energy norm $\|\cdot\|$.
To justify this discrete approximation, recall that by Theorem 2.2.5 we have $\|u - u^{(q)}\| \le C h^{qs} \|\mathcal{L}u\|_{L^2(\Omega)}$. Hence, with the prior bound on $\|\mathcal{L}u\|_{L^2}$, this approximation is arbitrarily accurate for $q$ large enough. Let $\eta$ be as in Problem 6 and let the gamblets be defined as in Section 2.2 with pre-Haar measurement functions. For $k \in \{1, \ldots, q\}$, let
$$\eta^{(k)} := \sum_{l=1}^{k} [\chi^{(l),*}, \eta] \cdot \chi^{(l)} \qquad (2.3.2)$$
and $\eta^{(0)} := 0 \in \Psi^{(q)}$. When $k < q$, $\eta^{(k)}$ is a truncation of the full gamblet transform $\eta = \eta^{(q)} = [\chi^*, \eta] \cdot \chi$. Let $M > 0$ and write
$$V^{(q)}_M = \{u \in \Psi^{(q)} \mid \|\mathcal{L}u\|_{L^2(\Omega)} \le M\}. \qquad (2.3.3)$$
Assume that $\sigma > 0$ and write
$$k^* = \operatorname{argmin}_{k \in \{0, \ldots, q\}} \beta_k, \qquad (2.3.4)$$
for
$$\beta_k = \begin{cases} h^{2s} M^2 & \text{if } k = 0, \\ \sigma^2 h^{-(2s+d)k} + h^{2s(k+1)} M^2 & \text{if } 1 \le k \le q-1, \\ h^{-(2s+d)q} \sigma^2 & \text{if } k = q. \end{cases} \qquad (2.3.5)$$
The following theorem asserts that $\eta^{(k^*)}$ is a near minimax recovery of $u$, by which we mean that the $\|\cdot\|^2$ recovery error is minimax optimal up to a multiplicative constant (depending only on $\|\mathcal{L}\|$, $\|\mathcal{L}^{-1}\|$, $\Omega$, $s$, $\delta$, and whose value can be made explicit using the estimates of [101]). We will also refer to $\eta^{(k^*)}$ as the smooth recovery of $u$ because, with probability close to 1, it is nearly as regular in the energy norm as $u$.

⁶Note that there are no mathematical constraints on the number of levels taken in the decomposition.
Theorem 2.3.1. Suppose $v^\dagger(\eta) = \eta^{(k^*)}$; then there exists a constant $C$ depending only on $h$, $s$, $\|\mathcal{L}\|$, $\|\mathcal{L}^{-1}\|$, $\Omega$, $d$, and $\delta$ such that
$$\sup_{u \in V^{(q)}_M} \mathbb{E}\big[\|u - v^\dagger(\eta)\|^2\big] < C \inf_{v(\eta)} \sup_{u \in V^{(q)}_M} \mathbb{E}\big[\|u - v(\eta)\|^2\big], \qquad (2.3.6)$$
where the infimum is taken over all measurable functions $v : \Psi^{(q)} \to \Psi^{(q)}$. Furthermore, if $k^* \neq 0$, then with probability at least $1 - \rho$,
$$\|\eta^{(k^*)}\| \le \|u\| + C \sqrt{\log\tfrac{1}{\rho}}\; \sigma^{\frac{2s+d}{4s+d}} M^{\frac{2s+2d}{4s+d}}. \qquad (2.3.7)$$

Proof. See [153, Sec. 7].
Note that $k^* = q$ occurs (approximately) when $q$ is such that $h^q > (\frac{\sigma}{M})^{\frac{2}{4s+d}}$, i.e., when
$$\frac{\sigma}{M} < h^{q\frac{4s+d}{2}}, \qquad (2.3.8)$$
and in this case $\eta^{(q)}$ is a near minimax optimal recovery of $u^{(q)}$. At the other extreme, $k^* = 0$ occurs (approximately) when $(\frac{\sigma}{M})^{\frac{2}{4s+d}} > h$, i.e., when
$$\frac{\sigma}{M} > h^{\frac{4s+d}{2}}, \qquad (2.3.9)$$
and in this case the zero signal is a near optimal recovery. The signal-to-noise ratio thus determines the hierarchical level at which the truncation occurs; this level represents the length-scale of the Gaussian conditioning, as can be seen from
$$\eta^{(k)} = \mathbb{E}\big[\xi \mid [\varphi^{(k)}, \xi] = [\varphi^{(k)}, \eta]\big], \qquad (2.3.10)$$
which is a conditioning on the level-$k$ hierarchical pre-Haar wavelets $\varphi^{(k)}$. The trade-off between recovering an overly smooth or a noisy signal is illustrated in Fig. 2.2.
Numerical illustrations

Example 2.1.1 with $d = 1$

Figure 2.1: [153, Fig. 1], the plots of $a$, $g$, $u$, $\eta$, the near minimax recovery $v(\eta) = \eta^{(k^*)}$, its error from $u$, and the derivatives of $u$ and $v(\eta)$.

Figure 2.2: A comparison of the truncations $\eta^{(k)}$. In this example $k^* = 4$.
Consider Example 2.1.1 with $d = 1$. Take $\Omega = [0,1] \subset \mathbb{R}$, $q = 10$, and pre-Haar measurement functions supported on the dyadic intervals $\tau^{(k)}_i = [\frac{i-1}{2^k}, \frac{i}{2^k}]$ for $1 \le i \le 2^k$. Let $W^{(k)}$ be the $2^{k-1} \times 2^k$ matrix with non-zero entries defined by $W_{i,2i-1} = \frac{1}{\sqrt{2}}$ and $W_{i,2i} = -\frac{1}{\sqrt{2}}$. Let $\mathcal{L} := -\operatorname{div}(a\nabla\cdot)$ with
$$a(x) := \prod_{k=1}^{10} \big(1 + 0.25\cos(2^k \pi x)\big). \qquad (2.3.11)$$
In Fig. 2.1 we select $g(x)$ uniformly at random over the unit $L^2(\Omega)$-sphere of $\Phi^{(q)}$ and let $\zeta$ be white noise (as in (2.1.5)) with $\sigma = 0.001$ and $\eta = u + \zeta$.
We next consider a case where $g$ is smooth, i.e., $g(x) = \frac{\sin(\pi x)}{x}$ on $x \in (0,1]$ and $g(0) = \pi$. Let $\zeta$ be white noise with standard deviation $\sigma = 0.01$. See Fig. 2.3 for the corresponding numerical illustrations.

Both figures show that (1) $v(\eta)$ and $\nabla v(\eta)$ are accurate approximations of $u$ and $\nabla u$, and (2) the accuracy of these approximations increases with the regularity of $g$.

Figure 2.3: [153, Fig. 2], the plots of $a$, smooth $g$, $u$, $\eta$, $v(\eta) = \eta^{(k^*)}$, its error from $u$, and the derivatives of $u$ and $v(\eta)$.
Example 2.1.1 with $d = 2$

Consider Example 2.1.1 with $d = 2$. Take $\Omega = [0,1]^2$ and $q = 7$. Use pre-Haar wavelets supported on the dyadic squares $\tau^{(k)}_{i,j} = [\frac{i-1}{2^k}, \frac{i}{2^k}] \times [\frac{j-1}{2^k}, \frac{j}{2^k}]$ for $1 \le i, j \le 2^k$. Let $W^{(k)}$ be the $3 \cdot 4^{k-1} \times 4^k$ matrix defined as in Construction 4.13 of [99].

In Fig. 2.4 we select $g(x)$ uniformly at random over the unit $L^2(\Omega)$-sphere of $\Phi^{(q)}$ and let $\zeta$ be white noise (as in (2.1.5)) with $\sigma = 0.001$ and $\eta = u + \zeta$.

Figure 2.4: [153, Fig. 3], the plots of $a$, $g$, $u$, $\eta$, $v(\eta) = \eta^{(k^*)}$, its error from $u$, and the gradients of $u$ and $v(\eta)$.

Figure 2.5: The plots of $a$, smooth $g$, $u$, $\eta$, $v(\eta) = \eta^{(k^*)}$, its error from $u$, and the gradients of $u$ and $v(\eta)$ [153, Fig. 4].
Let $\mathcal{L} = -\operatorname{div}(a\nabla\cdot)$ with
$$a(x, y) := \prod_{k=1}^{7} \Big[ \frac{1 + \frac{1}{4}\cos\big(2^k \pi (x + y)\big)}{1 + \frac{1}{4}\cos\big(2^k \pi (x - 3y)\big)} \Big]. \qquad (2.3.12)$$
Next consider a case where $g$ is smooth, i.e., $g(x, y) = \cos(3x + y) + \sin(3y) + \sin(7x - 5y)$. Let $\zeta$ be white noise with standard deviation $\sigma = 0.01$. See Fig. 2.5 for the corresponding numerical illustrations. As with the $d = 1$ plots, the $d = 2$ plots show the accuracy of the recovery of $u$ and $\nabla u$ and the positive impact of the regularity of $g$ on that accuracy.
2.4 Comparisons
Hard- and soft-thresholding
Since hard- and soft-thresholding have been used by Donoho and Johnstone [36–38] for the near minimax recovery of regular signals, we will compare the accuracy of (2.3.2) with that of hard- and soft-thresholding the gamblet transform of the noisy signal [153, Sec. 5]. We call hard-thresholding the recovery of $u$ with
$$v(\eta) = \sum_{k=1}^{q} \sum_{i \in \mathcal{J}^{(k)}} H_{t^{(k)}}\big([\chi^{(k),*}_i, \eta]\big)\, \chi^{(k)}_i \qquad (2.4.1)$$
and
$$H_\beta(x) = \begin{cases} x & |x| > \beta, \\ 0 & |x| \le \beta. \end{cases} \qquad (2.4.2)$$
We call soft-thresholding the recovery of $u$ with
$$v(\eta) = \sum_{k=1}^{q} \sum_{i \in \mathcal{J}^{(k)}} S_{t^{(k)}}\big([\chi^{(k),*}_i, \eta]\big)\, \chi^{(k)}_i \qquad (2.4.3)$$
and
$$S_\beta(x) = \begin{cases} x - \beta \operatorname{sgn}(x) & |x| > \beta, \\ 0 & |x| \le \beta. \end{cases} \qquad (2.4.4)$$
The parameters $(t^{(1)}, \ldots, t^{(q)})$ are adjusted to achieve minimal average errors. Since the mass matrix of the $\chi_i$ is comparable to the identity (see [153, Thm. 10]) and by the bi-orthogonality identities $[\chi^*_i, \chi_j] = \delta_{i,j}$, the vector $[g, \chi]$ is approximately uniformly sampled on the unit sphere of $\mathbb{R}^{\mathcal{J}}$ and the variance of $[g, \chi^{(k)}_i]$ can be approximated by $1/|\mathcal{J}|$. Therefore $[\chi^{(k),*}, u] = B^{(k),-1}[g, \chi^{(k)}]$ and (2.2.33) imply that the standard deviation of $[\chi^{(k),*}_i, u]$ can be approximated by $h^{2ks}/\sqrt{|\mathcal{J}|}$. Therefore optimal choices for the threshold on the $k$-th hierarchical level follow the power law $t^{(k)} = h^{2ks} t_0$ for some parameter $t_0$.
Regularization
We call regularization the recovery of $u$ with $v(\eta)$ defined as the minimizer of
$$\|v(\eta) - \eta\|^2_{L^2(\Omega)} + \alpha\|v(\eta)\|^2. \qquad (2.4.5)$$
For practical implementation, we consider $\bar{A}_{i,j} = \langle \tilde{\varphi}_i, \tilde{\varphi}_j \rangle$, the $N \times N$ stiffness matrix obtained by discretizing $\mathcal{L}$ with finite elements $\tilde{\varphi}_1, \ldots, \tilde{\varphi}_N$, and write $\eta = \sum_{i=1}^{N} y_i \tilde{\varphi}_i$
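In the discretized setting, the minimizer of (2.4.5) solves a linear system. The following sketch uses a hypothetical 1D P1 finite-element mass matrix and a stiffness matrix for $-u''$ as stand-ins for the actual discretization of $\mathcal{L}$; the variable names are illustrative assumptions.

```python
import numpy as np

def tikhonov_recovery(Mbar, Abar, y, alpha):
    """Minimize ||v - eta||_{L^2}^2 + alpha * ||v||^2 over v = sum_i c_i phi_tilde_i.

    With mass matrix Mbar and stiffness matrix Abar, the first-order optimality
    condition gives (Mbar + alpha * Abar) c = Mbar y.
    """
    return np.linalg.solve(Mbar + alpha * Abar, Mbar @ y)

# Illustrative 1D P1 finite elements on a uniform grid (hypothetical setup):
N = 255
hgrid = 1.0 / (N + 1)
Abar = (2 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)) / hgrid        # stiffness for -u''
Mbar = hgrid * (4 * np.eye(N) + np.eye(N, k=1) + np.eye(N, k=-1)) / 6    # P1 mass matrix
y = np.random.default_rng(0).normal(size=N)                              # noisy coefficients of eta
c = tikhonov_recovery(Mbar, Abar, y, alpha=1e-3)
```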