Chapter I: Introduction
1.6 Patterns and kernels
transform described in Theorem 2.3.1 is shown to be within a constant factor of the minimax optimal recovery. This denoising can be interpreted as Gaussian conditioning with a covariance kernel derived from $\mathcal{L}$; the length-scale of the conditioning depends on the signal-to-noise ratio assumed in the problem. A conventional method for denoising smooth signals is to threshold empirical wavelet coefficients. In Section 2.4 we numerically compare the near-minimax optimal recovery with thresholding of the gamblet transform coefficients.
Our approach [102] to the mode decomposition problem, kernel mode decomposition (KMD), is outlined in Section 4. The algorithm has two main components. The first, which we name max-pooling, estimates the instantaneous phase and frequency of the lowest-frequency mode in a signal and is presented in Section 4.1; it is a close variant of the continuous wavelet transform (CWT) [24], which we summarize in Section 3.2. The second component uses GPR to estimate the instantaneous amplitude and phase of this lowest-frequency mode and is summarized in Section 4.2. We define a GP with a covariance kernel constructed from Gaussian-windowed trigonometric waves, i.e., Gabor wavelets [48]; GPR then estimates the instantaneous phase and amplitude of the lowest-frequency mode. Further, when the base waveform of each mode, i.e., $y_i$ in $v(t) = \sum_i a_i(t)\, y_i(\theta_i(t))$, is unknown, GPR can be applied to estimate $y_i$, as presented in Section 5.1. These algorithms were extended in [102, Sec. 10], showing that the method can be made robust to noise, vanishing amplitudes, and modes with crossing frequencies. We find that patterns within each mode can be estimated with GPR from the sum of modes, even when these patterns are visually indistinguishable in the composite signal.
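To fix notation, the following minimal sketch (hypothetical amplitudes, phases, and sampling; not data from [102]) synthesizes a two-mode signal of the form $v(t) = \sum_i a_i(t)\, y_i(\theta_i(t))$ with $y_i = \cos$; KMD seeks to recover each $a_i$ and $\theta_i$ from $v$ alone.

```python
import numpy as np

# Hypothetical two-mode signal v(t) = sum_i a_i(t) * cos(theta_i(t)):
# KMD aims to recover each amplitude a_i and phase theta_i from v alone.
t = np.linspace(0.0, 1.0, 2048)

a1, theta1 = 1.0 + 0.3 * t, 2 * np.pi * 40 * t                          # low-frequency mode
a2, theta2 = 0.5 * np.exp(-2 * t), 2 * np.pi * (90 * t + 10 * t ** 2)   # chirping mode

v = a1 * np.cos(theta1) + a2 * np.cos(theta2)   # observed composite signal
```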
Learning kernels from patterns
We will discuss the Kernel Flow (KF) algorithm [154] next. At a high level, the algorithm can be interpreted as learning kernels from patterns with a method inspired by cross-validation. The algorithm is described in Section 6 and operates under the principle that a kernel is desirable when it can make low-error predictions from small samples of the whole data set. This error is quantified by selecting $N$ random training points, computing the kernel interpolant of all $N$ points, and comparing it with the kernel interpolant of a random subset of $N/2$ of these points. Writing these interpolants as $u^{\dagger,f}$ and $u^{\dagger,c}$, respectively, the error is quantified by the KF loss function
$$\rho := \frac{\|u^{\dagger,f} - u^{\dagger,c}\|^2_K}{\|u^{\dagger,f}\|^2_K}, \qquad (1.6.1)$$
where $\|\cdot\|_K$ denotes the norm of the reproducing kernel Hilbert space of the kernel under consideration, and the KF algorithm selects a kernel by optimizing this loss. Since the loss depends mainly on the training data, the method selects a kernel based on patterns in that data. We present examples throughout Section 6, including on the MNIST dataset. We find that this technique is able to learn kernels which can predict classes accurately while observing only one point per class. Additionally, we observe evidence of unsupervised learning, since archetypes within each class appear to be learned. A further example of pattern learning with the KF algorithm can be found in [57], where it has been applied to data from chaotic dynamical systems to learn a model kernel.
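As an illustration, the sketch below (with a placeholder Gaussian kernel family, synthetic data, and assumed parameter names) evaluates the KF loss (1.6.1) for one random fine/coarse split, using the standard closed-form expression $\rho = 1 - y_c^{\top} K_{cc}^{-1} y_c / (y_f^{\top} K_{ff}^{-1} y_f)$.

```python
import numpy as np

def gaussian_kernel(X, Y, gamma):
    """k(x, y) = exp(-gamma * |x - y|^2), a placeholder kernel family."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kf_loss(X, y, gamma, rng):
    """One random-split evaluation of the KF loss rho for kernel parameter gamma."""
    n = len(y)
    fine = rng.choice(n, size=n // 2, replace=False)                 # N sampled points
    coarse = rng.choice(fine, size=len(fine) // 2, replace=False)    # random half of them
    Kff = gaussian_kernel(X[fine], X[fine], gamma)
    Kcc = gaussian_kernel(X[coarse], X[coarse], gamma)
    num = y[coarse] @ np.linalg.solve(Kcc, y[coarse])
    den = y[fine] @ np.linalg.solve(Kff, y[fine])
    return 1.0 - num / den

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.sin(4 * X[:, 0]) + np.cos(3 * X[:, 1])
print(kf_loss(X, y, gamma=5.0, rng=rng))
```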
An application of the KF algorithm to Artificial Neural Networks (ANNs) is further demonstrated in Section 7 to improve key performance statistics in MNIST and CIFAR image classification problems. These ANNs are widely used to address such problems and are defined as the mapping
$$f_\theta(x) = f^{(D)}_{\theta_D} \circ f^{(D-1)}_{\theta_{D-1}} \circ \cdots \circ f^{(1)}_{\theta_1}(x). \qquad (1.6.2)$$
This map has input $x$ and $D$ layers $f^{(i)}_{\theta_i}(z) = a(W_i z + b_i)$, parameterized¹⁷ by the weights and biases $\theta_i := (W_i, b_i)$, $\theta := \{\theta_1, \ldots, \theta_D\}$. The output of $f_\theta$ lies in $\mathbb{R}^m$, where $m$ is the number of classes in the dataset; it is converted into a classifier by selecting the component with the largest value. The parameters $\theta$ that best model the patterns of the data are learned by optimizing the error of the classifier, usually with the cross-entropy loss¹⁸, on the training data.
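A minimal NumPy sketch of this composition and of the largest-component classifier (layer widths, initialization, and the ReLU choice from footnote 17 are illustrative assumptions):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def f_theta(x, theta):
    """Apply f^(D) o ... o f^(1) with theta = [(W_1, b_1), ..., (W_D, b_D)]."""
    z = x
    for W, b in theta:
        z = relu(W @ z + b)
    return z  # one component per class

rng = np.random.default_rng(0)
sizes = [784, 128, 10]                      # hypothetical layer widths (e.g. MNIST)
theta = [(rng.normal(scale=0.05, size=(m, n)), np.zeros(m))
         for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=784)
label = int(np.argmax(f_theta(x, theta)))   # classifier: largest component
```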
Kernels can be incorporated into ANNs by allowing $f_\theta$ to map into a higher-dimensional space and applying kernel interpolation on the result, which leads to an improvement in error rates [103, Sec. 10]. Further improvements can be made by reverting to the standard largest-component classifier on $f_\theta$ and constructing a kernel that depends on the intermediate-layer outputs
$$h^{(i)}(x) := f^{(i)}_{\theta_i} \circ f^{(i-1)}_{\theta_{i-1}} \circ \cdots \circ f^{(1)}_{\theta_1}(x), \qquad (1.6.3)$$
for $i = 1, \ldots, D$. The KF loss corresponding to this kernel is then used in tandem with the standard cross-entropy loss, which leads to improvements in testing error, generalization gap, and robustness to distributional shift. Details on our numerical findings can be found in Section 7.1. Note that kernel interpolation itself is not directly used as a classifier; the KF loss is used only as a regularizer added to the loss function used in the optimization of the ANN parameters. This application of kernels is a novel method for training and clustering intermediate-layer outputs in conjunction with the final output $f_\theta$. We present numerical experiments showing that the KF loss function aids in learning parameters which most accurately classify patterns in images.

¹⁷Weights $W_i$ are linear operators and biases $b_i$ are vectors. The function $a$ is an arbitrary activation function, typically taken to be the ReLU, $a(z) = \max(0, z)$.

¹⁸This loss is defined in equation (7.0.4).
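Schematically, the training objective of Section 7 combines the two losses. The sketch below is a hedged illustration: the Gaussian kernel on intermediate features, the random batch split, the trace form of the loss for one-hot labels, and the weight `lambda_kf` are placeholder choices, not the exact construction used there.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy(logits, labels):
    """Standard softmax cross-entropy, averaged over the batch."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(labels)), labels].mean()

def kf_loss(features, y_onehot, gamma=1.0):
    """KF-style loss on intermediate-layer features h^(i) of a batch: compare the
    kernel interpolant of the full batch with that of a random half."""
    K = np.exp(-gamma * ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1))
    half = rng.choice(len(features), size=len(features) // 2, replace=False)
    num = np.trace(y_onehot[half].T @ np.linalg.solve(K[np.ix_(half, half)], y_onehot[half]))
    den = np.trace(y_onehot.T @ np.linalg.solve(K, y_onehot))
    return 1.0 - num / den

# Toy batch of intermediate-layer outputs, labels, and final-layer logits.
h = rng.normal(size=(64, 32))
labels = rng.integers(0, 10, size=64)
y = np.eye(10)[labels]
logits = rng.normal(size=(64, 10))

lambda_kf = 0.1   # placeholder weight of the KF regularizer
total_loss = cross_entropy(logits, labels) + lambda_kf * kf_loss(h, y)
```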
Chapter 2
DENOISING
2.1 Introduction to the denoising problem
[37–39] addressed the problem of recovering a smooth signal from noisy observations by soft-thresholding empirical wavelet coefficients [37]. More recently, [33] considered the recovery of $x \in X$ based on the observation of $Ax + \zeta$, where $\zeta$ is i.i.d. $\mathcal{N}(0, \sigma^2)$ noise and $A$ is a compact linear operator between Hilbert spaces $X$ and $Y$, with the prior that $x$ lies in an ellipsoid defined by the eigenvectors of $A^* A$. [33] showed that thresholding the coefficients of the corrupted signal $Ax + \zeta$ in the basis formed by the singular value decomposition (SVD) of $A$ (which can be computed in $O(N^3)$ complexity) approaches the minimax recovery up to a fixed multiplicative constant.
The contributions presented in this section [153] address denoising in the following formulation. Suppose
$$\mathcal{L} : H^s_0(\Omega) \to H^{-s}(\Omega) \qquad (2.1.1)$$
is a symmetric, positive, local¹ linear bijection with $s \in \mathbb{N}^*$ and regular bounded $\Omega \subset \mathbb{R}^d$ ($d \in \mathbb{N}$). Let $\|\cdot\|$ be the energy norm defined by
$$\|u\|^2 := \int_\Omega u \,\mathcal{L}u, \qquad (2.1.2)$$
and write
$$\langle u, v \rangle := \int_\Omega u \,\mathcal{L}v \qquad (2.1.3)$$
for the associated scalar product. Further, define
$$V_M := \{u \in H^s_0(\Omega) : \mathcal{L}u \in L^2(\Omega) \text{ and } \|\mathcal{L}u\|_{L^2(\Omega)} \le M\}. \qquad (2.1.4)$$
Further, let
$$\zeta \sim \mathcal{N}(0, \sigma^2 \delta(x - y)) \qquad (2.1.5)$$
be white noise on the domain $\Omega$ with variance $\sigma^2$. The following is the continuous version of the denoising problem studied in this section.
Problem 5. Let $u$ be an unknown element of $V_M$. Given the noisy observation $\eta = u + \zeta$, find an approximation of $u$ that is as accurate as possible in the energy norm $\|\cdot\|$.
¹Symmetric, positive, and local are defined as $\int_\Omega u \,\mathcal{L}v = \int_\Omega v \,\mathcal{L}u$, $\int_\Omega u \,\mathcal{L}u > 0$ for $u \neq 0$, and $\int_\Omega u \,\mathcal{L}v = 0$ for $u, v \in H^s_0(\Omega)$ with disjoint supports, respectively.
This problem will be illustrated in the $s = 1$ case with the linear differential operator $\mathcal{L} = -\operatorname{div}(a(x)\nabla \cdot)$, where the conductivity $a$ is a uniformly elliptic symmetric $d \times d$ matrix with entries in $L^\infty(\Omega)$. This example is of practical importance in groundwater flow modeling (where $a$ is the porosity of the medium) and in electrostatics (where $a$ is the dielectric constant), and in both applications $a$ may be rough (non-smooth) [7, 19].
Example 2.1.1. Assuming
$$\mathcal{L} = -\operatorname{div}(a(x)\nabla\cdot) : H^1_0(\Omega) \to H^{-1}(\Omega), \qquad (2.1.6)$$
Problem 5 then corresponds to the problem of recovering the solution of the PDE
$$\begin{cases} -\operatorname{div}(a(x)\nabla u(x)) = g(x), & x \in \Omega; \\ u = 0 & \text{on } \partial\Omega, \end{cases} \qquad (2.1.7)$$
from its noisy observation $\eta = u + \zeta$ with the knowledge $\|g\|_{L^2(\Omega)} < M$.
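For intuition, a minimal sketch of Example 2.1.1 in one dimension (mesh size, conductivity, source, and noise level below are illustrative assumptions, not the settings used in Section 2.3):

```python
import numpy as np

# 1D illustration of Example 2.1.1: discretize -(a(x) u'(x))' = g(x) on (0,1)
# with u(0) = u(1) = 0 using P1 finite elements on a uniform mesh.
n = 255                                   # interior nodes
h = 1.0 / (n + 1)
x = np.linspace(h, 1.0 - h, n)
g = np.sin(3 * np.pi * x)                 # illustrative source term

# Conductivity evaluated at the element midpoints.
a_mid = 1.0 + 0.5 * np.cos(2 * np.pi * (np.arange(n + 1) + 0.5) * h)

# Tridiagonal stiffness matrix of -(a u')'.
A = (np.diag(a_mid[:-1] + a_mid[1:])
     - np.diag(a_mid[1:-1], 1) - np.diag(a_mid[1:-1], -1)) / h

u = np.linalg.solve(A, g * h)             # discrete solution of (2.1.7)
eta = u + 1e-3 * np.random.default_rng(0).normal(size=n)   # noisy observation eta = u + zeta
```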
This problem is addressed by expressing $\eta$ in the gamblet transform adapted to the operator $\mathcal{L}$ and truncating the resulting series. This method is proved to yield a recovery within a constant factor of the minimax optimal recovery [153]. It is numerically compared to thresholding the gamblet transform coefficients as well as to regularization, i.e., the minimization of
$$\|v(\eta) - \eta\|^2_{L^2(\Omega)} + \alpha\|v(\eta)\|^2. \qquad (2.1.8)$$

2.2 Summary of operator-adapted wavelets
We proceed by reviewing operator-adapted wavelets as in [153, Sec. 2], also named gamblets in reference to their game-theoretic interpretation, and their main properties [99, 101, 104, 119]. They are constructed from a hierarchy of measurement functions and an operator. Theorem 2.2.4 shows that these gamblets are simultaneously associated with Gaussian conditioning, optimal recovery, and game theory. By selecting the measurement functions to be pre-Haar wavelets, the gamblets are localized both in space and in the eigenspace of the operator.
Hierarchy of measurement functions
Let $q \in \mathbb{N}^*$ (used to represent a number of scales). Let $(\mathcal{I}^{(k)})_{1 \le k \le q}$ be a hierarchy of labels defined as follows. $\mathcal{I}^{(q)}$ is a set of $q$-tuples consisting of elements $i = (i_1, \ldots, i_q)$. For $1 \le k \le q$ and $i \in \mathcal{I}^{(q)}$, write $i^{(k)} := (i_1, \ldots, i_k)$, and let $\mathcal{I}^{(k)}$ be the set of $k$-tuples $\mathcal{I}^{(k)} = \{i^{(k)} \mid i \in \mathcal{I}^{(q)}\}$. For $1 \le l \le k \le q$ and $i = (i_1, \ldots, i_k) \in \mathcal{I}^{(k)}$, we write $i^{(l)} = (i_1, \ldots, i_l)$. We say that $M$ is an $\mathcal{I}^{(k)} \times \mathcal{I}^{(l)}$ matrix if its rows and columns are indexed by elements of $\mathcal{I}^{(k)}$ and $\mathcal{I}^{(l)}$, respectively.

Let $\{\varphi^{(k)}_i \mid k \in \{1, \ldots, q\},\, i \in \mathcal{I}^{(k)}\}$ be a nested hierarchy of elements of $H^{-s}(\Omega)$ such that the $(\varphi^{(k)}_i)_{i \in \mathcal{I}^{(k)}}$ are linearly independent and
$$\varphi^{(k)}_i = \sum_{j \in \mathcal{I}^{(k+1)}} \pi^{(k,k+1)}_{i,j} \varphi^{(k+1)}_j \qquad (2.2.1)$$
for $i \in \mathcal{I}^{(k)}$, $k \in \{1, \ldots, q-1\}$, where $\pi^{(k,k+1)}$ is an $\mathcal{I}^{(k)} \times \mathcal{I}^{(k+1)}$ matrix and
$$\pi^{(k,k+1)} \pi^{(k+1,k)} = I^{(k)}. \qquad (2.2.2)$$
In (2.2.2), $\pi^{(k+1,k)}$ is the transpose of $\pi^{(k,k+1)}$ and $I^{(k)}$ is the $\mathcal{I}^{(k)} \times \mathcal{I}^{(k)}$ identity matrix.
Hierarchy of operator-adapted pre-wavelets

Let $(\psi^{(k)}_i)_{i \in \mathcal{I}^{(k)}}$ be the hierarchy of optimal recovery splines associated with $(\varphi^{(k)}_i)_{i \in \mathcal{I}^{(k)}}$, i.e., for $k \in \{1, \ldots, q\}$ and $i \in \mathcal{I}^{(k)}$,
$$\psi^{(k)}_i = \sum_{j \in \mathcal{I}^{(k)}} A^{(k)}_{i,j} \mathcal{L}^{-1}\varphi^{(k)}_j, \qquad (2.2.3)$$
where
$$A^{(k)} := (\Theta^{(k)})^{-1} \qquad (2.2.4)$$
and $\Theta^{(k)}$ is the $\mathcal{I}^{(k)} \times \mathcal{I}^{(k)}$ symmetric positive definite Gramian matrix with entries (writing $[\varphi, v]$ for the duality pairing between $\varphi \in H^{-s}(\Omega)$ and $v \in H^s_0(\Omega)$)
$$\Theta^{(k)}_{i,j} = [\varphi^{(k)}_i, \mathcal{L}^{-1}\varphi^{(k)}_j]. \qquad (2.2.5)$$
Note that $A^{(k)}$ is the stiffness matrix of the elements $(\psi^{(k)}_i)_{i \in \mathcal{I}^{(k)}}$ in the sense that
$$A^{(k)}_{i,j} = \langle \psi^{(k)}_i, \psi^{(k)}_j \rangle. \qquad (2.2.6)$$
Writing $\Phi^{(k)} := \operatorname{span}\{\varphi^{(k)}_i \mid i \in \mathcal{I}^{(k)}\}$ and $\Psi^{(k)} := \operatorname{span}\{\psi^{(k)}_i \mid i \in \mathcal{I}^{(k)}\}$, $\Phi^{(k)} \subset \Phi^{(k+1)}$ and $\Psi^{(k)} = \mathcal{L}^{-1}\Phi^{(k)}$ imply $\Psi^{(k)} \subset \Psi^{(k+1)}$. We further write $[\varphi^{(k)}, u] = ([\varphi^{(k)}_i, u])_{i \in \mathcal{I}^{(k)}} \in \mathbb{R}^{\mathcal{I}^{(k)}}$. The $(\varphi^{(k)}_i)_{i \in \mathcal{I}^{(k)}}$ and $(\psi^{(k)}_i)_{i \in \mathcal{I}^{(k)}}$ form a bi-orthogonal system in the sense that
$$[\varphi^{(k)}_i, \psi^{(k)}_j] = \delta_{i,j} \quad \text{for } i, j \in \mathcal{I}^{(k)}, \qquad (2.2.7)$$
and the $\langle \cdot, \cdot \rangle$-orthogonal projection of $u \in H^s_0(\Omega)$ on $\Psi^{(k)}$ is
$$u^{(k)} := \sum_{i \in \mathcal{I}^{(k)}} [\varphi^{(k)}_i, u]\, \psi^{(k)}_i. \qquad (2.2.8)$$
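In a finite-dimensional stand-in (a random SPD matrix playing the role of $\mathcal{L}$ and random vectors playing the role of the measurement functions; illustrative assumptions only), the construction (2.2.3)-(2.2.8) reads:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 16

# Discrete stand-in for the operator L: a random SPD matrix on R^n,
# with energy inner product <u, v> = u^T L v.
G = rng.normal(size=(n, n))
L = G @ G.T + n * np.eye(n)

# Rows of Phi represent measurement functionals: [phi_i, u] = Phi[i] @ u.
Phi = rng.normal(size=(m, n))

Linv_Phi = np.linalg.solve(L, Phi.T)          # columns L^{-1} phi_j
Theta = Phi @ Linv_Phi                        # Gramian (2.2.5)
A = np.linalg.inv(Theta)                      # (2.2.4)
Psi = Linv_Phi @ A                            # columns psi_i, per (2.2.3)

# Bi-orthogonality (2.2.7) and stiffness identity (2.2.6):
assert np.allclose(Phi @ Psi, np.eye(m))
assert np.allclose(Psi.T @ L @ Psi, A)

u = rng.normal(size=n)
u_k = Psi @ (Phi @ u)                         # projection u^(k) as in (2.2.8)
```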
Multiple interpretations of operator-adapted pre-wavelets

Using the operator-adapted pre-wavelets $\psi^{(k)}_i$, we summarize the connections between optimal recovery, game theory, and Gaussian conditioning. First, we define Gaussian fields, a generalization of Gaussian processes.

Definition 2.2.1. The canonical Gaussian field $\xi$ associated with the operator $\mathcal{L} : H^s_0(\Omega) \to H^{-s}(\Omega)$ is defined such that $\varphi \mapsto [\varphi, \xi]$ is the linear isometry from $H^{-s}(\Omega)$ to a Gaussian space characterized by
$$\begin{cases} [\varphi, \xi] \sim \mathcal{N}(0, \|\varphi\|_*^2), \\ \operatorname{Cov}\big([\varphi, \xi], [\phi, \xi]\big) = \langle \varphi, \phi \rangle_*, \end{cases} \qquad (2.2.9)$$
where $\|\varphi\|_* = \sup_{u \in H^s_0(\Omega)} \frac{\int_\Omega \varphi u}{\|u\|}$ is the dual norm of $\|\cdot\|$.

Remark 2.2.2. When $s > d/2$, the evaluation functional $\delta_x(f) = f(x)$ is continuous. Hence $\xi$, paired with the functionals $\{\delta_x\}_{x \in \Omega}$, is naturally isomorphic to a Gaussian process with covariance function $K(x, x') = \langle \delta_x, \delta_{x'} \rangle_*$.
Several notable properties of these pre-wavelets are summarized in the following result. Recall that we write $[\varphi^{(k)}, u] = ([\varphi^{(k)}_i, u])_{i \in \mathcal{I}^{(k)}} \in \mathbb{R}^{\mathcal{I}^{(k)}}$.

Theorem 2.2.3. Consider pre-wavelets $\psi^{(k)}_i$ adapted to the operator $\mathcal{L}$ and constructed with measurement functions $\varphi^{(k)}_i$. Further, suppose that for $u \in H^s_0(\Omega)$ we define $v^\dagger(u) = u^{(k)} = \sum_{i \in \mathcal{I}^{(k)}} [\varphi^{(k)}_i, u]\, \psi^{(k)}_i$.

1. For fixed $u \in H^s_0(\Omega)$, $v^\dagger(u)$ is the minimizer of
$$\begin{cases} \text{Minimize } \|\psi\| \\ \text{Subject to } \psi \in H^s_0(\Omega) \text{ and } [\varphi^{(k)}, \psi] = [\varphi^{(k)}, u]. \end{cases} \qquad (2.2.10)$$

2. For fixed $u \in H^s_0(\Omega)$, $v^\dagger(u)$ is the minimizer of
$$\begin{cases} \text{Minimize } \|u - \psi\| \\ \text{Subject to } \psi \in \operatorname{span}\{\psi^{(k)}_i : i \in \mathcal{I}^{(k)}\}. \end{cases} \qquad (2.2.11)$$

3. For the canonical Gaussian field $\xi \sim \mathcal{N}(0, \mathcal{L}^{-1})$,
$$v^\dagger(u) = \mathbb{E}\big[\xi \mid [\varphi^{(k)}, \xi] = [\varphi^{(k)}, u]\big]. \qquad (2.2.12)$$

4. It holds true that²
$$v^\dagger \in \operatorname{argmin}_{v \in L(\Phi, H^s_0(\Omega))} \sup_{u \in H^s_0(\Omega)} \frac{\|u - v(u)\|}{\|u\|}. \qquad (2.2.13)$$

Proof. (1) is a result of [101, Cor. 3.4], (2) is equivalent to [101, Thm. 12.2], and (3) and (4) are results in [101, Sec. 8.5].

²$L(\Phi, H^s_0(\Omega))$ is defined as the set of functions $v : H^s_0(\Omega) \to H^s_0(\Omega)$ of the form $v(u) = \Psi([\varphi^{(k)}, u])$ with measurable $\Psi : \mathbb{R}^{\mathcal{I}^{(k)}} \to H^s_0(\Omega)$.
This result shows that the operator-adapted pre-wavelet transform defined by $v^\dagger(u) = u^{(k)}$ is an optimal recovery in the sense of Theorem 2.2.3.1-2. Simultaneously, $v^\dagger(u)$ is the conditional expectation of the canonical Gaussian field given the measurements $[\varphi^{(k)}, \cdot]$, as in Theorem 2.2.3.3. Another interpretation of the transform is game-theoretic, as expressed in Theorem 2.2.3.4: equation (2.2.13) represents the adversarial two-player game in which player I selects $u \in H^s_0(\Omega)$ and player II approximates $u$ with $v(u)$ from the measurements $[\varphi^{(k)}, u]$; players I and II aim, respectively, to maximize and minimize the recovery error of $v(u)$. This game-theoretic interpretation inspires the name gamblets for operator-adapted wavelets.
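A finite-dimensional sanity check of the equivalence between the optimal-recovery and Gaussian-conditioning interpretations (with a random SPD stand-in for $\mathcal{L}$; an illustration, not the constructions used later):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 150, 12
G = rng.normal(size=(n, n))
L = G @ G.T + n * np.eye(n)          # discrete stand-in for the operator
Phi = rng.normal(size=(m, n))        # measurement functionals
u = rng.normal(size=n)

# Optimal-recovery form: u^(k) = sum_i [phi_i, u] psi_i  (Theorem 2.2.3.1-2).
Linv_Phi = np.linalg.solve(L, Phi.T)
Psi = Linv_Phi @ np.linalg.inv(Phi @ Linv_Phi)
u_k = Psi @ (Phi @ u)

# Gaussian-conditioning form (Theorem 2.2.3.3): for xi ~ N(0, L^{-1}),
# E[xi | Phi xi = Phi u] = L^{-1} Phi^T (Phi L^{-1} Phi^T)^{-1} Phi u.
cond_mean = Linv_Phi @ np.linalg.solve(Phi @ Linv_Phi, Phi @ u)

assert np.allclose(u_k, cond_mean)   # the two interpretations coincide
```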
Note that the pre-wavelets $\psi^{(k)}_i$ live on a single level of the hierarchy. The following addresses the construction of a wavelet decomposition of $H^s_0(\Omega)$ across all hierarchical levels.
Operator-adapted wavelets
Let $(\mathcal{J}^{(k)})_{2 \le k \le q}$ be a hierarchy of labels such that, writing $|\mathcal{J}^{(k)}|$ for the cardinality of $\mathcal{J}^{(k)}$,
$$|\mathcal{J}^{(k)}| = |\mathcal{I}^{(k)}| - |\mathcal{I}^{(k-1)}|. \qquad (2.2.14)$$
For $k \in \{2, \ldots, q\}$, let $W^{(k)}$ be a $\mathcal{J}^{(k)} \times \mathcal{I}^{(k)}$ matrix such that³
$$\operatorname{Ker}(\pi^{(k-1,k)}) = \operatorname{Im}(W^{(k),T}). \qquad (2.2.15)$$
For $k \in \{2, \ldots, q\}$ and $i \in \mathcal{J}^{(k)}$, define
$$\chi^{(k)}_i := \sum_{j \in \mathcal{I}^{(k)}} W^{(k)}_{i,j} \psi^{(k)}_j, \qquad (2.2.16)$$
and write $X^{(k)} := \operatorname{span}\{\chi^{(k)}_i \mid i \in \mathcal{J}^{(k)}\}$. Then $X^{(k)}$ is the $\langle \cdot, \cdot \rangle$-orthogonal complement of $\Psi^{(k-1)}$ in $\Psi^{(k)}$, i.e., $\Psi^{(k)} = \Psi^{(k-1)} \oplus X^{(k)}$, and
$$\Psi^{(k)} = X^{(1)} \oplus X^{(2)} \oplus \cdots \oplus X^{(k)}. \qquad (2.2.17)$$

³We write $W^{(k),T}$ and $W^{(k),-1}$ for the transpose and inverse of a matrix $W^{(k)}$.
For $k \in \{2, \ldots, q\}$, write
$$B^{(k)} := W^{(k)} A^{(k)} W^{(k),T}. \qquad (2.2.18)$$
Note that $B^{(k)}$ is the stiffness matrix of the elements $(\chi^{(k)}_i)_{i \in \mathcal{J}^{(k)}}$, i.e.,
$$B^{(k)}_{i,j} = \langle \chi^{(k)}_i, \chi^{(k)}_j \rangle. \qquad (2.2.19)$$
Further, for $k \in \{2, \ldots, q\}$, define
$$N^{(k)} := A^{(k)} W^{(k),T} B^{(k),-1} \qquad (2.2.20)$$
and, for $i \in \mathcal{J}^{(k)}$,
$$\chi^{(k),*}_i := \sum_{j \in \mathcal{I}^{(k)}} N^{(k),T}_{i,j} \varphi^{(k)}_j. \qquad (2.2.21)$$
Then, defining $u^{(k)}$ as in (2.2.8), it holds true that for $k \in \{2, \ldots, q\}$, $u^{(k)} - u^{(k-1)}$ is the $\langle \cdot, \cdot \rangle$-orthogonal projection of $u$ on $X^{(k)}$ and
$$u^{(k)} - u^{(k-1)} = \sum_{i \in \mathcal{J}^{(k)}} [\chi^{(k),*}_i, u]\, \chi^{(k)}_i. \qquad (2.2.22)$$

To simplify notations, write $\mathcal{J}^{(1)} := \mathcal{I}^{(1)}$, $B^{(1)} := A^{(1)}$, $W^{(1)} := I^{(1)}$, $\chi^{(1),*}_i := \varphi^{(1)}_i$ for $i \in \mathcal{J}^{(1)}$, $\mathcal{J} := \mathcal{J}^{(1)} \cup \cdots \cup \mathcal{J}^{(q)}$, and $\chi_i := \chi^{(k)}_i$, $\chi^*_i := \chi^{(k),*}_i$ for $i \in \mathcal{J}^{(k)}$ and⁴ $1 \le k \le q$. Then the $\chi^*_i$ and $\chi_i$ form a bi-orthogonal system, i.e.,
$$[\chi^*_i, \chi_j] = \delta_{i,j} \quad \text{for } i, j \in \mathcal{J}, \qquad (2.2.23)$$
and
$$u^{(q)} = \sum_{i \in \mathcal{J}} [\chi^*_i, u]\, \chi_i. \qquad (2.2.24)$$
Simplifying notations further, we will write $[\chi^*, u]$ for the $\mathcal{J}$-vector with entries $[\chi^*_i, u]$ and $\chi$ for the $\mathcal{J}$-vector with entries $\chi_i$, so that (2.2.24) can be written
$$u^{(q)} = [\chi^*, u] \cdot \chi. \qquad (2.2.25)$$
Further, define the $\mathcal{J} \times \mathcal{J}$ block-diagonal matrix $B$ by $B_{i,j} = B^{(k)}_{i,j}$ if $i, j \in \mathcal{J}^{(k)}$ and $B_{i,j} = 0$ otherwise. Note that $B_{i,j} = \langle \chi_i, \chi_j \rangle$. When $q = \infty$ and $\cup_{k=1}^\infty \Phi^{(k)}$ is dense in $H^{-s}(\Omega)$, then, writing $X^{(1)} := \Psi^{(1)}$,
$$H^s_0(\Omega) = \oplus_{k=1}^\infty X^{(k)}, \qquad (2.2.26)$$
$u^{(q)} = u$, and (2.2.24) is the corresponding multi-resolution decomposition of $u$. When $q < \infty$, $u^{(q)}$ is the projection of $u$ on $\oplus_{k=1}^q X^{(k)}$ and (2.2.25) is the corresponding multi-resolution decomposition. Note that the optimal recovery, game theory, and Gaussian conditioning results of Theorem 2.2.3 also hold for the wavelets.

⁴The dependence on $k$ is left implicit to simplify notation: for $i \in \mathcal{J}$ there exists a unique $k$ such that $i \in \mathcal{J}^{(k)}$.
Theorem 2.2.4. Consider wavelets $\chi_i$ adapted to the operator $\mathcal{L}$ and constructed with measurement functions $\varphi_i$. Further, suppose that for $u \in H^s_0(\Omega)$ we define $v^\dagger(u) = u^{(q)} = [\chi^*, u] \cdot \chi$.

1. For fixed $u \in H^s_0(\Omega)$, $v^\dagger(u)$ is the minimizer of
$$\begin{cases} \text{Minimize } \|\psi\| \\ \text{Subject to } \psi \in H^s_0(\Omega) \text{ and } [\chi^*, \psi] = [\chi^*, u]. \end{cases} \qquad (2.2.27)$$

2. For fixed $u \in H^s_0(\Omega)$, $v^\dagger(u)$ is the minimizer of
$$\begin{cases} \text{Minimize } \|u - \psi\| \\ \text{Subject to } \psi \in \operatorname{span}\{\chi_i : i \in \mathcal{J}\}. \end{cases} \qquad (2.2.28)$$

3. For the canonical Gaussian field $\xi \sim \mathcal{N}(0, \mathcal{L}^{-1})$,
$$v^\dagger(u) = \mathbb{E}\big[\xi \mid [\chi^*, \xi] = [\chi^*, u]\big]. \qquad (2.2.29)$$

4. It holds true that⁵
$$v^\dagger \in \operatorname{argmin}_{v \in L(\Phi, H^s_0(\Omega))} \sup_{u \in H^s_0(\Omega)} \frac{\|u - v(u)\|}{\|u\|}. \qquad (2.2.30)$$

⁵Here $L(\Phi, H^s_0(\Omega))$ is defined as the set of functions $v : H^s_0(\Omega) \to H^s_0(\Omega)$ of the form $v(u) = \Psi([\chi^*, u])$ with measurable $\Psi : \mathbb{R}^{\mathcal{J}} \to H^s_0(\Omega)$.
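To make the two-level structure concrete, the following sketch (a discrete stand-in in which a random SPD matrix plays the role of $\mathcal{L}$ and Haar-type $\pi$ and $W$ matrices are used) builds $\psi$, $\chi$, and $\chi^*$ on two levels and checks the decomposition (2.2.22) and the bi-orthogonality (2.2.23).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m1 = 256, 4
m2 = 2 * m1

G = rng.normal(size=(n, n))
L = G @ G.T + n * np.eye(n)                              # discrete stand-in for the operator

Phi2 = rng.normal(size=(m2, n))                          # level-2 measurement functionals
pi12 = np.kron(np.eye(m1), [[1.0, 1.0]]) / np.sqrt(2)    # nesting matrix pi^(1,2), cf. (2.2.1)
Phi1 = pi12 @ Phi2                                       # level-1 functionals
W2 = np.kron(np.eye(m1), [[1.0, -1.0]]) / np.sqrt(2)     # rows span Ker(pi^(1,2)), cf. (2.2.15)

def prewavelets(Phi):
    """Return (Psi, A): columns of Psi are the psi_i of (2.2.3), A as in (2.2.4)."""
    LinvPhiT = np.linalg.solve(L, Phi.T)
    A = np.linalg.inv(Phi @ LinvPhiT)
    return LinvPhiT @ A, A

Psi1, A1 = prewavelets(Phi1)
Psi2, A2 = prewavelets(Phi2)

B2 = W2 @ A2 @ W2.T                                      # (2.2.18)
Chi2 = Psi2 @ W2.T                                       # columns chi_i, (2.2.16)
ChiStar2 = Phi2.T @ (A2 @ W2.T @ np.linalg.inv(B2))      # columns chi*_i, (2.2.20)-(2.2.21)

u = rng.normal(size=n)
u1 = Psi1 @ (Phi1 @ u)                                   # level-1 projection
u2 = Psi2 @ (Phi2 @ u)                                   # level-2 projection

assert np.allclose(ChiStar2.T @ Chi2, np.eye(m1))        # bi-orthogonality (2.2.23)
assert np.allclose(u2 - u1, Chi2 @ (ChiStar2.T @ u))     # decomposition (2.2.22)
assert np.allclose(Psi1.T @ L @ Chi2, 0)                 # chi orthogonal to Psi^(1) in <.,.>
```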
Pre-Haar wavelet measurement functions
The gamblets used in the subsequent developments will use pre-Haar wavelets (as defined below) as measurement functions $\varphi^{(k)}_i$, and our main near-optimal denoising estimates will be derived from their properties (summarized in Theorem 2.2.5).

Let $\delta, h \in (0,1)$. Let $(\tau^{(k)}_i)_{i \in \mathcal{I}^{(k)}}$ be uniformly Lipschitz convex sets forming a nested partition of $\Omega$, i.e., such that $\Omega = \cup_{i \in \mathcal{I}^{(k)}} \tau^{(k)}_i$, $k \in \{1, \ldots, q\}$, is a disjoint union except for the boundaries, and $\tau^{(k)}_i = \cup_{j \in \mathcal{I}^{(k+1)} : j^{(k)} = i}\, \tau^{(k+1)}_j$, $k \in \{1, \ldots, q-1\}$. Assume that each $\tau^{(k)}_i$ contains a ball of radius $\delta h^k$ and is contained in a ball of radius $\delta^{-1} h^k$. Writing $|\tau^{(k)}_i|$ for the volume of $\tau^{(k)}_i$, take
$$\varphi^{(k)}_i := 1_{\tau^{(k)}_i} |\tau^{(k)}_i|^{-\frac{1}{2}}. \qquad (2.2.31)$$
The nesting relation (2.2.1) is then satisfied with $\pi^{(k,k+1)}_{i,j} := |\tau^{(k+1)}_j|^{\frac{1}{2}} |\tau^{(k)}_i|^{-\frac{1}{2}}$ for $j^{(k)} = i$ and $\pi^{(k,k+1)}_{i,j} := 0$ otherwise.

For $k \in \{2, \ldots, q\}$, let $\mathcal{J}^{(k)}$ be a finite set of $k$-tuples of the form $j = (j_1, \ldots, j_k)$ such that $\{j^{(k-1)} \mid j \in \mathcal{J}^{(k)}\} = \mathcal{I}^{(k-1)}$ and, for $i \in \mathcal{I}^{(k-1)}$, $\operatorname{Card}\{j \in \mathcal{J}^{(k)} \mid j^{(k-1)} = i\} = \operatorname{Card}\{l \in \mathcal{I}^{(k)} \mid l^{(k-1)} = i\} - 1$. Note that the cardinalities of these sets satisfy (2.2.14).

Write $J^{(k)}$ for the $\mathcal{J}^{(k)} \times \mathcal{J}^{(k)}$ identity matrix. For $k = 2, \ldots, q$, let $W^{(k)}$ be a $\mathcal{J}^{(k)} \times \mathcal{I}^{(k)}$ matrix such that $\operatorname{Im}(W^{(k),T}) = \operatorname{Ker}(\pi^{(k-1,k)})$, $W^{(k)}(W^{(k)})^T = J^{(k)}$, and $W^{(k)}_{i,j} = 0$ for $i^{(k-1)} \neq j^{(k-1)}$.
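A short sketch of the pre-Haar construction on $\Omega = [0,1]$ (discretized on a dyadic grid; an illustration of (2.2.31) and of the associated $\pi^{(k,k+1)}$ and $W^{(k)}$ matrices):

```python
import numpy as np

# Pre-Haar measurement functions on Omega = [0, 1], discretized on a grid of
# n = 2^q cells: phi^(k)_i = indicator of the i-th dyadic interval at level k,
# normalized by |tau|^{-1/2} as in (2.2.31). (Illustrative discretization.)
q = 4
n = 2 ** q

def phi_level(k):
    """Rows are the 2^k pre-Haar measurement functions at level k."""
    cells_per_interval = n // 2 ** k
    tau_volume = cells_per_interval / n
    Phi = np.zeros((2 ** k, n))
    for i in range(2 ** k):
        Phi[i, i * cells_per_interval:(i + 1) * cells_per_interval] = 1.0
    return Phi * tau_volume ** -0.5

for k in range(1, q):
    # Nesting matrix pi^(k,k+1): each parent is (1/sqrt(2)) times the sum of its two children.
    pi = np.kron(np.eye(2 ** k), np.array([[1.0, 1.0]])) / np.sqrt(2)
    # W^(k+1): one Haar-type difference per parent, supported on its children.
    W = np.kron(np.eye(2 ** k), np.array([[1.0, -1.0]])) / np.sqrt(2)

    assert np.allclose(phi_level(k), pi @ phi_level(k + 1))   # nesting (2.2.1)
    assert np.allclose(pi @ pi.T, np.eye(2 ** k))             # (2.2.2)
    assert np.allclose(W @ W.T, np.eye(2 ** k))               # W W^T = J^(k+1)
    assert np.allclose(pi @ W.T, 0)                           # rows of W span Ker(pi), cf. (2.2.15)
```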
Theorem 2.2.5. With pre-Haar wavelet measurement functions, it holds true that

1. For $k \in \{1, \ldots, q\}$ and $u \in \mathcal{L}^{-1}L^2(\Omega)$,
$$\|u - u^{(k)}\| \le C h^{ks} \|\mathcal{L}u\|_{L^2(\Omega)}. \qquad (2.2.32)$$

2. Writing $\operatorname{Cond}(M)$ for the condition number of a matrix $M$, we have, for $k \in \{1, \ldots, q\}$,
$$C^{-1} h^{-2(k-1)s} J^{(k)} \le B^{(k)} \le C h^{-2ks} J^{(k)} \qquad (2.2.33)$$
and $\operatorname{Cond}(B^{(k)}) \le C h^{-2s}$.

3. For $i \in \mathcal{I}^{(k)}$ and $x^{(k)}_i \in \tau^{(k)}_i$,
$$\|\chi_i\|_{H^s(\Omega \setminus B(x^{(k)}_i,\, n h^k))} \le C h^{-ks} e^{-n/C}. \qquad (2.2.34)$$

4. The wavelets $\psi^{(k)}_i$, $\chi^{(k)}_i$ and the stiffness matrices $A^{(k)}$, $B^{(k)}$ can be computed to precision $\varepsilon$ (in the $\|\cdot\|$ energy norm for elements of $H^s_0(\Omega)$ and in the Frobenius norm for matrices) in $O(N \log^{3d} \frac{N}{\varepsilon})$ complexity.

Furthermore, the constant $C$ depends only on $\delta$, $\Omega$, $d$, $s$,
$$\|\mathcal{L}\| := \sup_{u \in H^s_0(\Omega)} \frac{\|\mathcal{L}u\|_{H^{-s}(\Omega)}}{\|u\|_{H^s_0(\Omega)}} \quad \text{and} \quad \|\mathcal{L}^{-1}\| := \sup_{u \in H^s_0(\Omega)} \frac{\|u\|_{H^s_0(\Omega)}}{\|\mathcal{L}u\|_{H^{-s}(\Omega)}}. \qquad (2.2.35)$$

Proof. (1) and (2) follow from an application of Prop. 4.17 and Theorems 4.14 and 3.19 of [100]. (3) follows from Thm. 2.23 of [100]. (4) follows from the complexity analysis of Alg. 6 of [100]. See [101] for detailed proofs.
Remark 2.2.6. The wavelets $\psi^{(k)}_i$, $\chi^{(k)}_i$ and the stiffness matrices $A^{(k)}$, $B^{(k)}$ can also be computed in $O(N \log^2 N \log^{2d} \frac{N}{\varepsilon})$ complexity using the incomplete Cholesky factorization approach of [119].
Theorem 2.2.5.2-3 implies that the gamblets are localized both in the eigenspace of the operator $\mathcal{L}$ and in the physical space $\Omega$. Further, Theorem 2.2.5.1 shows that the accuracy of the recovery $u^{(k)}$ in the energy norm is bounded by the $L^2$ norm of $\mathcal{L}u$. These results are used in the proofs of the denoising results presented in the following section.
2.3 Denoising by truncating the gamblet transform

Near minimax recovery
In this section, we will present the result that truncating the gamblet transform of $\eta = u + \zeta$ in a discrete variant of Problem 5 produces an approximation of $u$ that is minimax optimal up to a multiplicative constant [153, Sec. 4], i.e., near minimax. The discretized version of $H^s_0(\Omega)$ is the finite-dimensional space spanned by the gamblet wavelets, using the pre-Haar measurement functions defined in Section 2.2, taken to the $q$-th level⁶. In addition, the discrete noise used in this problem, $\zeta \in \Psi^{(q)}$, is the projection of the noise (2.1.5) onto $\Psi^{(q)}$ (due to (2.2.8)).

Problem 6. Let $u$ be an unknown element of $\Psi^{(q)} \subset H^s_0(\Omega)$ for $q < \infty$. Let $\zeta$ be a centered Gaussian vector in $\Psi^{(q)}$ such that
$$\mathbb{E}\big[[\varphi^{(q)}_i, \zeta]\, [\varphi^{(q)}_j, \zeta]\big] = \sigma^2 \delta_{i,j}. \qquad (2.3.1)$$
Given the noisy observation $\eta = u + \zeta$ and a prior bound $M$ on $\|\mathcal{L}u\|_{L^2}$, find an approximation of $u$ in $\Psi^{(q)}$ that is as accurate as possible in the energy norm $\|\cdot\|$.
To justify this discrete approximation, recall that by Theorem 2.2.5 we have $\|u - u^{(q)}\| \le C h^{qs} \|\mathcal{L}u\|_{L^2(\Omega)}$. Hence, with the prior bound on $\|\mathcal{L}u\|_{L^2}$, this approximation is arbitrarily accurate for $q$ large enough. Let $\eta$ be as in Problem 6 and let the gamblets be defined as in Section 2.2 with pre-Haar measurement functions. For $k \in \{1, \ldots, q\}$, let
$$\eta^{(k)} := \sum_{l=1}^{k} [\chi^{(l),*}, \eta] \cdot \chi^{(l)} \qquad (2.3.2)$$
and $\eta^{(0)} := 0 \in \Psi^{(q)}$. When $k < q$, $\eta^{(k)}$ is a truncation of the full gamblet transform $\eta = \eta^{(q)} = [\chi^*, \eta] \cdot \chi$. Let $M > 0$ and write
$$V^{(q)}_M = \{u \in \Psi^{(q)} \mid \|\mathcal{L}u\|_{L^2(\Omega)} \le M\}. \qquad (2.3.3)$$
Assume that $\sigma > 0$ and write
$$k^* = \operatorname{argmin}_{k \in \{0, \ldots, q\}} \beta_k, \qquad (2.3.4)$$
for
$$\beta_k = \begin{cases} h^{2s} M^2 & \text{if } k = 0, \\ \sigma^2 h^{-(2s+d)k} + h^{2s(k+1)} M^2 & \text{if } 1 \le k \le q-1, \\ h^{-(2s+d)q} \sigma^2 & \text{if } k = q. \end{cases} \qquad (2.3.5)$$
The following theorem asserts that $\eta^{(k^*)}$ is a near minimax recovery of $u$, by which we mean that the $\|\cdot\|^2$ recovery error is minimax optimal up to a multiplicative constant (depending only on $\|\mathcal{L}\|$, $\|\mathcal{L}^{-1}\|$, $\Omega$, $s$, $\delta$, and whose value can be made explicit using the estimates of [101]). We will also refer to $\eta^{(k^*)}$ as the smooth recovery of $u$ because, with probability close to 1, it is nearly as regular in the energy norm as $u$.

⁶Note that there are no mathematical constraints on the number of levels taken in the decomposition.
Theorem 2.3.1. Suppose $v^\dagger(\eta) = \eta^{(k^*)}$; then there exists a constant $C$ depending only on $h$, $s$, $\|\mathcal{L}\|$, $\|\mathcal{L}^{-1}\|$, $\Omega$, $d$, and $\delta$ such that
$$\sup_{u \in V^{(q)}_M} \mathbb{E}\big[\|u - v^\dagger(\eta)\|^2\big] < C \inf_{v(\eta)} \sup_{u \in V^{(q)}_M} \mathbb{E}\big[\|u - v(\eta)\|^2\big], \qquad (2.3.6)$$
where the infimum is taken over all measurable functions $v : \Psi^{(q)} \to \Psi^{(q)}$. Furthermore, if $k^* \neq 0$, then with probability at least $1 - \rho$,
$$\|\eta^{(k^*)}\| \le \|u\| + C \sqrt{\log\tfrac{1}{\rho}}\; \sigma^{\frac{2s+d}{4s+d}} M^{\frac{2s+2d}{4s+d}}. \qquad (2.3.7)$$

Proof. See [153, Sec. 7].
Note that $k^* = q$ occurs (approximately) when $q$ is such that $h^q > (\frac{\sigma}{M})^{\frac{2}{4s+d}}$, i.e., when
$$\frac{\sigma}{M} < h^{q\frac{4s+d}{2}}, \qquad (2.3.8)$$
and in this case $\eta^{(q)}$ is a near minimax optimal recovery of $u^{(q)}$. At the other extreme, $k^* = 0$ occurs (approximately) when $(\frac{\sigma}{M})^{\frac{2}{4s+d}} > h$, i.e., when
$$\frac{\sigma}{M} > h^{\frac{4s+d}{2}}, \qquad (2.3.9)$$
and in this case the zero signal is a near optimal recovery. The signal-to-noise ratio thus determines the hierarchical level at which the truncation occurs; this level represents the length-scale of the Gaussian conditioning, as can be seen from
$$\eta^{(k)} = \mathbb{E}\big[\xi \mid [\varphi^{(k)}, \xi] = [\varphi^{(k)}, \eta]\big], \qquad (2.3.10)$$
which is a conditioning on the level-$k$ hierarchical pre-Haar wavelets $\varphi^{(k)}$. The trade-off between recovering an overly smooth or a noisy signal is illustrated in Fig. 2.2.
Numerical illustrations

Example 2.1.1 with $d = 1$

Figure 2.1: [153, Fig. 1], the plots of $a$, $g$, $u$, $\eta$, the near minimax recovery $v(\eta) = \eta^{(k^*)}$, its error from $u$, and the derivatives of $u$ and $v(\eta)$.

Figure 2.2: A comparison of the truncations $\eta^{(k)}$. In this example $k^* = 4$.
Consider Example 2.1.1 with $d = 1$. Take $\Omega = [0,1] \subset \mathbb{R}$, $q = 10$, and pre-Haar measurement functions supported on the dyadic intervals $\tau^{(k)}_i = [\frac{i-1}{2^k}, \frac{i}{2^k}]$ for $1 \le i \le 2^k$. Let $W^{(k)}$ be the $2^{k-1} \times 2^k$ matrix with non-zero entries defined by $W_{i,2i-1} = \frac{1}{\sqrt{2}}$ and $W_{i,2i} = -\frac{1}{\sqrt{2}}$. Let $\mathcal{L} := -\operatorname{div}(a\nabla\cdot)$ with
$$a(x) := \prod_{k=1}^{10} \big(1 + 0.25\cos(2^k \pi x)\big). \qquad (2.3.11)$$
In Fig. 2.1 we select $g(x)$ uniformly at random over the unit $L^2(\Omega)$-sphere of $\Phi^{(q)}$ and let $\zeta$ be white noise (as in (2.1.5)) with $\sigma = 0.001$ and $\eta = u + \zeta$.
We next consider a case where $g$ is smooth, i.e., $g(x) = \frac{\sin(\pi x)}{x}$ on $x \in (0,1]$ and $g(0) = \pi$. Let $\zeta$ be white noise with standard deviation $\sigma = 0.01$. See Fig. 2.3 for the corresponding numerical illustrations.

Both figures show that (1) $v(\eta)$ and $\nabla v(\eta)$ are accurate approximations of $u$ and $\nabla u$, and (2) the accuracy of these approximations increases with the regularity of $g$.

Figure 2.3: [153, Fig. 2], the plots of $a$, smooth $g$, $u$, $\eta$, $v(\eta) = \eta^{(k^*)}$, its error from $u$, and the derivatives of $u$ and $v(\eta)$.
Example 2.1.1 with $d = 2$

Consider Example 2.1.1 with $d = 2$. Take $\Omega = [0,1]^2$ and $q = 7$. Use pre-Haar wavelets supported on the dyadic squares $\tau^{(k)}_{i,j} = [\frac{i-1}{2^k}, \frac{i}{2^k}] \times [\frac{j-1}{2^k}, \frac{j}{2^k}]$ for $1 \le i, j \le 2^k$. Let $W^{(k)}$ be the $3 \cdot 4^{k-1} \times 4^k$ matrix defined as in Construction 4.13 of [99].

In Fig. 2.4 we select $g(x)$ uniformly at random over the unit $L^2(\Omega)$-sphere of $\Phi^{(q)}$ and let $\zeta$ be white noise (as in (2.1.5)) with $\sigma = 0.001$ and $\eta = u + \zeta$.

Figure 2.4: [153, Fig. 3], the plots of $a$, $g$, $u$, $\eta$, $v(\eta) = \eta^{(k^*)}$, its error from $u$, and the gradients of $u$ and $v(\eta)$.

Figure 2.5: The plots of $a$, smooth $g$, $u$, $\eta$, $v(\eta) = \eta^{(k^*)}$, its error from $u$, and the gradients of $u$ and $v(\eta)$ [153, Fig. 4].
Let $\mathcal{L} = -\operatorname{div}(a\nabla\cdot)$ with
$$a(x, y) := \prod_{k=1}^{7} \Big[ \frac{1 + \frac{1}{4}\cos\big(2^k \pi (x + y)\big)}{1 + \frac{1}{4}\cos\big(2^k \pi (x - 3y)\big)} \Big]. \qquad (2.3.12)$$
Next consider a case where $g$ is smooth, i.e., $g(x, y) = \cos(3x + y) + \sin(3y) + \sin(7x - 5y)$. Let $\zeta$ be white noise with standard deviation $\sigma = 0.01$. See Fig. 2.5 for the corresponding numerical illustrations. As with the $d = 1$ plots, the $d = 2$ plots show the accuracy of the recovery of $u$ and $\nabla u$ and the positive impact of the regularity of $g$ on that accuracy.
2.4 Comparisons
Hard- and soft-thresholding
Since hard- and soft-thresholding have been used by Donoho and Johnstone [36–38] for the near minimax recovery of regular signals, we will compare the accuracy of (2.3.2) with that of hard- and soft-thresholding the gamblet transform of the noisy signal [153, Sec. 5]. We call hard-thresholding the recovery of $u$ with
$$v(\eta) = \sum_{k=1}^{q} \sum_{i \in \mathcal{J}^{(k)}} H_{t^{(k)}}\big([\chi^{(k),*}_i, \eta]\big)\, \chi^{(k)}_i \qquad (2.4.1)$$
and
$$H_\beta(x) = \begin{cases} x & |x| > \beta, \\ 0 & |x| \le \beta. \end{cases} \qquad (2.4.2)$$
We call soft-thresholding the recovery of $u$ with
$$v(\eta) = \sum_{k=1}^{q} \sum_{i \in \mathcal{J}^{(k)}} S_{t^{(k)}}\big([\chi^{(k),*}_i, \eta]\big)\, \chi^{(k)}_i \qquad (2.4.3)$$
and
$$S_\beta(x) = \begin{cases} x - \beta \operatorname{sgn}(x) & |x| > \beta, \\ 0 & |x| \le \beta. \end{cases} \qquad (2.4.4)$$
The parameters $(t^{(1)}, \ldots, t^{(q)})$ are adjusted to achieve minimal average errors. Since the mass matrix of the $\chi_i$ is comparable to the identity (see [153, Thm. 10]) and by the bi-orthogonality identities $[\chi^*_i, \chi_j] = \delta_{i,j}$, the vector $[g, \chi]$ is approximately uniformly sampled on the unit sphere of $\mathbb{R}^{\mathcal{J}}$ and the variance of $[g, \chi^{(k)}_i]$ can be approximated by $1/|\mathcal{J}|$. Therefore $[\chi^{(k),*}, u] = B^{(k),-1}[g, \chi^{(k)}]$ and (2.2.33) imply that the standard deviation of $[\chi^{(k),*}_i, u]$ can be approximated by $h^{2ks}/\sqrt{|\mathcal{J}|}$. Therefore optimal choices for the threshold on the $k$-th hierarchical level follow the power law $t^{(k)} = h^{2ks} t_0$ for some parameter $t_0$.
Regularization
We call regularization the recovery of $u$ with $v(\eta)$ defined as the minimizer of
$$\|v(\eta) - \eta\|^2_{L^2(\Omega)} + \alpha\|v(\eta)\|^2. \qquad (2.4.5)$$
For practical implementation, we consider $\bar{A}_{i,j} = \langle \tilde{\varphi}_i, \tilde{\varphi}_j \rangle$, the $N \times N$ stiffness matrix obtained by discretizing $\mathcal{L}$ with finite elements $\tilde{\varphi}_1, \ldots, \tilde{\varphi}_N$, and write $\eta = \sum_{i=1}^{N} y_i \tilde{\varphi}_i$
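In the discretized setting, the minimizer of (2.4.5) solves a linear system. The following sketch uses a hypothetical 1D P1 finite-element mass matrix and a stiffness matrix for $-u''$ as stand-ins for the actual discretization of $\mathcal{L}$; the variable names are illustrative assumptions.

```python
import numpy as np

def tikhonov_recovery(Mbar, Abar, y, alpha):
    """Minimize ||v - eta||_{L^2}^2 + alpha * ||v||^2 over v = sum_i c_i phi_tilde_i.

    With mass matrix Mbar and stiffness matrix Abar, the first-order optimality
    condition gives (Mbar + alpha * Abar) c = Mbar y.
    """
    return np.linalg.solve(Mbar + alpha * Abar, Mbar @ y)

# Illustrative 1D P1 finite elements on a uniform grid (hypothetical setup):
N = 255
hgrid = 1.0 / (N + 1)
Abar = (2 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)) / hgrid        # stiffness for -u''
Mbar = hgrid * (4 * np.eye(N) + np.eye(N, k=1) + np.eye(N, k=-1)) / 6    # P1 mass matrix
y = np.random.default_rng(0).normal(size=N)                              # noisy coefficients of eta
c = tikhonov_recovery(Mbar, Abar, y, alpha=1e-3)
```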