Chapter IV: Machine learning of model error
4.5 Underpinning Theory
4.5.2 Learning Theory for Markovian Models with Linear Hypothesis Class

We now provide a learning theory in the context of learning $m^\dagger$ in (4.2) from a trajectory over time horizon $T$. We study ergodic continuous-time models in the setting of Section 4.4.1. To this end we consider the very general linear hypothesis class given by
$$\mathcal{M} = \Bigl\{\, m : \mathbb{R}^{d_x} \to \mathbb{R}^{d_x} \;\Big|\; \exists\, \theta \in \mathbb{R}^{p}: \; m(x) = \sum_{\ell=1}^{p} \theta_\ell \phi_\ell(x) \,\Bigr\}; \tag{4.31}$$
we note that if the $\{\phi_\ell\}$ are i.i.d. draws of a random function $\varphi$ in the case $D = d_x$, then this too reduces to a random features model, but that our analysis in the context of statistical learning does not rely on the random features structure. In fact our analysis can be used to provide learning theory for other linear settings, where $\{\phi_\ell\}$ represents a dictionary of hypothesized features whose coefficients are to be learnt from data. Nonetheless, universal approximation for random features [335] provides an important example of an approximation class for which the loss function $I_\infty$ may be made arbitrarily small by choice of $p$ large enough and appropriate choice of parameters, and the reader may find it useful to focus on this case. We also note that the theory we present in this subsection is readily generalized to working with hypothesis class (4.26).
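As a concrete illustration of the hypothesis class (4.31), the following sketch fits the coefficients $\theta$ of a random-features dictionary by least squares. The feature map, the target function, and all numerical choices here are hypothetical stand-ins, not taken from the text; a scalar-valued target is used for simplicity, the vector-valued case being handled componentwise.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, p = 2, 200  # state dimension and number of features

# Random features phi_ell(x) = tanh(w_ell . x + b_ell) with i.i.d. (w_ell, b_ell);
# any other fixed dictionary {phi_ell} fits the same linear-in-theta framework.
W = rng.normal(size=(p, d_x))
b = rng.uniform(-np.pi, np.pi, size=p)

def features(x):
    """Feature matrix Phi with Phi[n, ell] = phi_ell(x_n)."""
    return np.tanh(x @ W.T + b)

# Hypothetical scalar-valued "model error" to be learned (illustration only).
def m_dagger(x):
    return np.sin(x[:, 0]) * np.cos(x[:, 1])

X = rng.uniform(-2, 2, size=(1000, d_x))
theta, *_ = np.linalg.lstsq(features(X), m_dagger(X), rcond=None)

def m(x):
    """The fitted element m(.; theta) of the hypothesis class M in (4.31)."""
    return features(x) @ theta
```

Because $m$ depends linearly on $\theta$, the fit reduces to a linear least-squares problem, which is the structure exploited throughout this subsection.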
We make the following ergodicity assumption about the data generation process:
Assumption 4.5.1. Equation (4.2) possesses a compact attractor $\mathcal{A}$ supporting invariant measure $\mu$. Furthermore, the dynamical system on $\mathcal{A}$ is ergodic with respect to $\mu$ and satisfies a central limit theorem of the following form: for all Hölder continuous $\varphi : \mathbb{R}^{d_x} \to \mathbb{R}$, there is $\sigma^2 = \sigma^2(\varphi)$ such that
$$\sqrt{T}\left(\frac{1}{T}\int_0^T \varphi\bigl(x(t)\bigr)\,dt - \int_{\mathbb{R}^{d_x}} \varphi(x)\,\mu(dx)\right) \Rightarrow \mathcal{N}(0,\sigma^2) \tag{4.32}$$
where $\Rightarrow$ denotes convergence in distribution with respect to $x(0) \sim \mu$. Furthermore, a law of the iterated logarithm holds: almost surely with respect to $x(0) \sim \mu$,
$$\limsup_{T\to\infty}\left(\frac{T}{\log\log T}\right)^{1/2}\left|\frac{1}{T}\int_0^T \varphi\bigl(x(t)\bigr)\,dt - \int_{\mathbb{R}^{d_x}} \varphi(x)\,\mu(dx)\right| = \sigma. \tag{4.33}$$

Remark 4.5.2. Note that in both (4.32) and (4.33) $\varphi(\cdot)$ is only evaluated on the compact set $\mathcal{A}$, obviating the need for any boundedness assumptions on $\varphi(\cdot)$. In the work of Melbourne and co-workers, Assumption 4.5.1 is proven to hold for a class of differential equations, including the Lorenz '63 model at, and in a neighbourhood of, the classical parameter values: in [178] the central limit theorem is established; and in [27] the continuity of $\sigma$ in $\varphi$ is proven. Whilst it is in general very difficult to prove such results for any given chaotic dynamical system, there is strong empirical evidence for such results in many chaotic dynamical systems that arise in practice.
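The kind of empirical evidence referred to above is easy to probe numerically. The following sketch (assuming the classical Lorenz '63 parameters; the RK4 integrator, ensemble size, and horizons are illustrative choices, not from the text) computes time averages of $\varphi(x) = x_3$ over two horizons and checks that their ensemble spread shrinks in the manner predicted by (4.32):

```python
import numpy as np

def lorenz(u, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    # Classical Lorenz '63 vector field, acting on an ensemble of states.
    x, y, z = u[..., 0], u[..., 1], u[..., 2]
    return np.stack([sigma * (y - x), x * (rho - z) - y, x * y - beta * z], axis=-1)

def rk4_step(u, dt):
    k1 = lorenz(u)
    k2 = lorenz(u + 0.5 * dt * k1)
    k3 = lorenz(u + 0.5 * dt * k2)
    k4 = lorenz(u + dt * k3)
    return u + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

rng = np.random.default_rng(1)
dt, n_ens = 0.01, 64
u = rng.normal(size=(n_ens, 3)) + np.array([0.0, 0.0, 25.0])

# Burn-in so the ensemble is (approximately) distributed on the attractor, x(0) ~ mu.
for _ in range(1000):  # 10 time units
    u = rk4_step(u, dt)

# Time averages of phi(x) = z over horizons T = 25 and T = 100 (2500 and 10000 steps).
n_short, n_long = 2500, 10000
total = np.zeros(n_ens)
for step in range(1, n_long + 1):
    u = rk4_step(u, dt)
    total += u[:, 2]
    if step == n_short:
        avg_short = total / n_short
avg_long = total / n_long

# The CLT (4.32) predicts spread of the time average shrinking like 1/sqrt(T),
# i.e. a std ratio near sqrt(100/25) = 2 between the two horizons.
print(avg_long.mean(), avg_short.std(), avg_long.std())
```

With modest ensemble sizes the observed ratio fluctuates around the CLT prediction, consistent with the empirical evidence described above.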
This combination of theory and empirical evidence justifies studying the learning of model error under Assumption 4.5.1. Tran and Ward [408] were the first to make use of the theory of Melbourne and co-workers to study learning of chaotic differential equations from time-series.
Given $m$ from hypothesis class $\mathcal{M}$ defined by (4.31), we define
$$\theta^*_\infty = \arg\min_{\theta\in\mathbb{R}^p} I_\infty\bigl(m(\cdot;\theta)\bigr) = \arg\min_{\theta\in\mathbb{R}^p} L_\mu\bigl(m(\cdot;\theta)\bigr) \tag{4.34}$$
and
$$\theta^*_T = \arg\min_{\theta\in\mathbb{R}^p} I_T\bigl(m(\cdot;\theta)\bigr). \tag{4.35}$$
(Regularization is not needed in this setting because the data, a continuous-time trajectory, is plentiful and the number of parameters is finite.) Then $\theta^*_\infty$, $\theta^*_T$ solve the linear systems
$$A_\infty \theta^*_\infty = v_\infty, \qquad A_T \theta^*_T = v_T,$$
where
$$(A_\infty)_{ij} = \int_{\mathbb{R}^{d_x}} \bigl\langle \phi_i(x), \phi_j(x) \bigr\rangle \, \mu(dx), \qquad (v_\infty)_i = \int_{\mathbb{R}^{d_x}} \bigl\langle m^\dagger(x), \phi_i(x) \bigr\rangle \, \mu(dx),$$
$$(A_T)_{ij} = \frac{1}{T}\int_0^T \bigl\langle \phi_i\bigl(x(t)\bigr), \phi_j\bigl(x(t)\bigr) \bigr\rangle \, dt, \qquad (v_T)_i = \frac{1}{T}\int_0^T \bigl\langle m^\dagger\bigl(x(t)\bigr), \phi_i\bigl(x(t)\bigr) \bigr\rangle \, dt.$$
These facts can be derived analogously to the derivation in Section 4.8.5. Given $\theta^*_\infty$ and $\theta^*_T$ we also define
$$m^*_\infty = m(\cdot;\theta^*_\infty), \qquad m^*_T = m(\cdot;\theta^*_T).$$
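For orientation, the origin of these linear systems can be sketched briefly; the following assumes that $I_\infty$ is the $\mu$-weighted squared residual against $m^\dagger$ (the full derivation being the one referenced in Section 4.8.5):

```latex
% On the class (4.31), the risk is quadratic in theta:
I_\infty\bigl(m(\cdot;\theta)\bigr)
  = \int_{\mathbb{R}^{d_x}} \Bigl\| m^\dagger(x)
    - \sum_{\ell=1}^{p} \theta_\ell \phi_\ell(x) \Bigr\|^2 \mu(dx).
% Differentiating in theta_i and setting the derivative to zero:
\frac{\partial I_\infty}{\partial \theta_i}
  = -2 \int_{\mathbb{R}^{d_x}} \Bigl\langle m^\dagger(x)
    - \sum_{\ell=1}^{p} \theta_\ell \phi_\ell(x),\; \phi_i(x) \Bigr\rangle \mu(dx)
  = 0
% which rearranges to the normal equations
\quad\Longrightarrow\quad
\sum_{\ell=1}^{p} (A_\infty)_{i\ell}\, \theta_\ell = (v_\infty)_i .
```

The system for $\theta^*_T$ follows identically, with the empirical time average along the trajectory replacing integration against $\mu$.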
Recall that it is assumed that $f^\dagger$, $f_0$, and $m^\dagger$ are $C^1$. We make the following assumption regarding the vector fields defining hypothesis class $\mathcal{M}$.

Assumption 4.5.3. The functions $\{\phi_\ell\}_{\ell=1}^{p}$ appearing in definition (4.31) of the hypothesis class $\mathcal{M}$ are Hölder continuous on $\mathbb{R}^{d_x}$. In addition, the matrix $A_\infty$ is invertible.
Theorem 4.5.4. Let Assumptions 4.5.1 and 4.5.3 hold. Then the scaled excess risk $\sqrt{T}\,R_T$ in (4.14) (resp. scaled generalization error $\sqrt{T}\,|G_T|$ in (4.15)) is bounded above by $\|\mathcal{E}_R\|$ (resp. $\|\mathcal{E}_G\|$), where random variable $\mathcal{E}_R \in \mathbb{R}^p$ (resp. $\mathcal{E}_G \in \mathbb{R}^{p+1}$) converges in distribution to $\mathcal{N}(0,\Sigma_R)$ (resp. $\mathcal{N}(0,\Sigma_G)$) with respect to $x(0) \sim \mu$ as $T \to \infty$. Furthermore, there is a constant $C > 0$ such that, almost surely with respect to $x(0) \sim \mu$,
$$\limsup_{T\to\infty}\left(\frac{T}{\log\log T}\right)^{1/2}\bigl(R_T + |G_T|\bigr) \le C.$$
The proof is provided in Section 4.8.1.
Remark 4.5.5. The convergence in distribution shows that, with high probability with respect to initial data, the excess risk and the generalization error are bounded above by terms of size $1/\sqrt{T}$. This can be improved to give an almost sure result, at the cost of a factor of $\sqrt{\log\log T}$. The theorem shows that (ignoring log factors and acknowledging the probabilistic nature of any such statements) trajectories of length $\mathcal{O}(\varepsilon^{-2})$ are required to produce bounds on the excess risk and generalization error of size $\mathcal{O}(\varepsilon)$.
The bounds on excess risk and generalization error also show that empirical risk minimization (of $I_T$) approaches the theoretically analyzable concept of risk minimization (of $I_\infty$) over hypothesis class (4.31). The sum of the excess risk $R_T$ and the generalization error $G_T$ gives
$$E_T := I_T(m^*_T) - I_\infty(m^*_\infty).$$
We note that $I_T(m^*_T)$ is computable once the approximate solution $m^*_T$ has been identified; thus, when combined with an estimate for $E_T$, this leads to an estimate for the risk associated with the hypothesis class used.
If the approximating space $\mathcal{M}$ is rich enough, then approximation theory may be combined with Theorem 4.5.4 to estimate the trajectory error resulting from the learned dynamical system. Such an approach is pursued in Proposition 3 of [453] for SDEs. Furthermore, in that setting, knowledge of the rate of mixing/decay of correlations for SDEs may be used to quantify constants appearing in the error bounds. It would be interesting to pursue such an analysis for chaotic ODEs with known mixing rates/decay of correlations. Such results on mixing are, however, less well-developed for chaotic ODEs; see the discussion of this point in [178], and the recent work [27].
Work by Zhang, Harlim, and Li [452] demonstrates that error bounds on learned model error terms can be extended to bound error on reproduction of invariant statistics for ergodic SDEs. Moreover, E, Ma, and Wu [120] provide a direction for proving similar bounds on model error learning using nonlinear function classes (e.g.
two-layer neural networks).
Finally, we remark on the dependence of the risk and generalization error bounds on the size of the model error. It is intuitive that the amount of data required to learn model error should decrease as the size of the model error decreases. This is demonstrated numerically in Section 4.6.2.3 (cf. Figures 4.2a and 4.2b). Here we comment that Theorem 4.5.4 also exhibits this feature: examination of the proof in Section 4.8.1 shows that all upper bounds on terms appearing in the excess risk and generalization error are proportional to $m^\dagger$ itself or to $m^*_\infty$, its approximation given an infinite amount of data; note that $m^*_\infty = m^\dagger$ if the hypothesis class contains the truth.