In the document Incomplete Models and Noisy Data (pages 125–129)

Chapter IV: Machine learning of model error

4.5 Underpinning Theory

4.5.2 Learning Theory for Markovian Models with Linear Hypothesis Classes

We work in the context of learning $m^\dagger$ in (4.2) from a trajectory over time horizon $T$. We study ergodic continuous-time models in the setting of Section 4.4.1. To this end we consider the very general linear hypothesis class given by

M ={π‘š: R𝑑π‘₯ β†’R𝑑π‘₯ | βˆƒπœƒ ∈R𝑝: π‘š(π‘₯) =

𝑝

Γ•

β„“=1

πœƒβ„“π‘“β„“(π‘₯)}; (4.31)

We note that if the $\{f_\ell\}$ are i.i.d. draws of a function $\phi$ in the case $D = d_x$, then this too reduces to a random features model, but our analysis in the context of statistical learning does not rely on the random features structure. In fact our analysis can be used to provide learning theory for other linear settings, where $\{f_\ell\}$ represents a dictionary of hypothesized features whose coefficients are to be learnt from data. Nonetheless, universal approximation for random features [335] provides an important example of an approximation class for which the loss function $I_\infty$ may be made arbitrarily small by choosing $p$ large enough and the parameters appropriately, and the reader may find it useful to focus on this case. We also note that the theory we present in this subsection is readily generalized to working with hypothesis class (4.26).
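As a concrete illustration of the class (4.31), the sketch below builds a vector field that is linear in $\theta$ from frozen random features. The tanh ridge functions and all dimensions are hypothetical choices for illustration, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, p = 2, 100  # state dimension and number of features (illustrative values)

# Each random feature is a vector field f_l(x) = a_l * tanh(w_l . x + b_l),
# with (a_l, w_l, b_l) drawn once and then frozen; m(.; theta) is therefore
# linear in theta even though each feature is nonlinear in x.
A = rng.normal(size=(p, d_x))
W = rng.normal(size=(p, d_x))
b = rng.uniform(-1.0, 1.0, size=p)

def features(x):
    """Stack of f_1(x), ..., f_p(x); returns an array of shape (p, d_x)."""
    return A * np.tanh(W @ x + b)[:, None]

def m(x, theta):
    """m(x; theta) = sum_l theta_l f_l(x), an element of the class M in (4.31)."""
    return theta @ features(x)  # shape (d_x,)
```

Because the feature parameters are frozen, minimizing a quadratic loss over this class is a finite-dimensional linear least-squares problem in $\theta$.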

We make the following ergodicity assumption about the data generation process:

Assumption 4.5.1. Equation (4.2) possesses a compact attractor $\mathcal{A}$ supporting an invariant measure $\mu$. Furthermore, the dynamical system on $\mathcal{A}$ is ergodic with respect to $\mu$ and satisfies a central limit theorem of the following form: for all Hölder continuous $\varphi : \mathbb{R}^{d_x} \mapsto \mathbb{R}$, there is $\sigma^2 = \sigma^2(\varphi)$ such that

\[
\sqrt{T}\left( \frac{1}{T}\int_0^T \varphi\big(x(t)\big)\,dt \;-\; \int_{\mathbb{R}^{d_x}} \varphi(x)\,\mu(dx) \right) \;\Rightarrow\; N(0, \sigma^2), \tag{4.32}
\]
where $\Rightarrow$ denotes convergence in distribution with respect to $x(0) \sim \mu$. Furthermore, a law of the iterated logarithm holds: almost surely with respect to $x(0) \sim \mu$,

limsupπ‘‡β†’βˆž

𝑇 log log𝑇

12 1 𝑇

∫ 𝑇

0

πœ‘ π‘₯(𝑑) 𝑑 π‘‘βˆ’

∫

R𝑑 π‘₯

πœ‘ π‘₯ πœ‡(𝑑π‘₯)

!

=𝜎 . (4.33) Remark 4.5.2. Note that in both (4.32) and (4.33)πœ‘(Β·)is only evaluated on (compact) A obviating the need for any boundedness assumptions on πœ‘(Β·). In the work of Melbourne and co-workers, Assumption 4.5.1 is proven to hold for a class of differential equations, including the Lorenz ’63 model at, and in a neighbourhood of, the classical parameter values: in [178] the central limit theorem is established;

and in [27] the continuity of $\sigma$ in $\varphi$ is proven. Whilst it is in general very difficult to prove such results for any given chaotic dynamical system, there is strong empirical evidence for such results in many chaotic dynamical systems that arise in practice.

This combination of theory and empirical evidence justifies studying the learning of model error under Assumption 4.5.1. Tran and Ward [408] were the first to make use of the theory of Melbourne and co-workers to study learning of chaotic differential equations from time-series data.
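The empirical evidence mentioned above can be probed numerically. A minimal sketch, using the logistic map $x \mapsto 4x(1-x)$ as a discrete-time stand-in for the continuous-time systems of the text: its invariant measure has density $1/\big(\pi\sqrt{x(1-x)}\big)$, so the spatial average of $\varphi(x) = x$ is exactly $1/2$, and the CLT-type behaviour in (4.32) predicts that the time-average error decays like $1/\sqrt{N}$. The map, observable, and trajectory lengths are all illustrative choices, not taken from the text.

```python
import numpy as np

def time_average(x0, N):
    """Time average of phi(x) = x along a logistic-map trajectory of length N."""
    x, s = x0, 0.0
    for _ in range(N):
        x = 4.0 * x * (1.0 - x)
        s += x
    return s / N

# Spatial average of phi(x) = x under mu(dx) = dx / (pi * sqrt(x(1-x))) is 1/2.
mu_phi = 0.5

# sqrt(N) * error should stay O(1) (fluctuating), the discrete analogue of (4.32).
for N in [10_000, 40_000, 160_000]:
    err = abs(time_average(0.3, N) - mu_phi)
    print(f"N={N:>7d}  |time avg - 1/2| = {err:.2e}  sqrt(N)*err = {np.sqrt(N) * err:.2f}")
```

The scaled error fluctuates rather than converging pointwise, which is exactly what convergence in distribution (rather than almost-sure convergence) predicts.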

Givenπ‘šfrom hypothesis classM defined by (4.31) we define πœƒβˆ—βˆž =arg min

πœƒβˆˆR𝑝

I∞ π‘š(Β·;πœƒ)

=arg min

πœƒβˆˆR𝑝

Lπœ‡ π‘š(Β·;πœƒ)

(4.34) and

πœƒβˆ—

𝑇 =arg min

πœƒβˆˆR𝑝

I𝑇 π‘š(Β·;πœƒ)

. (4.35)

(Regularization is not needed in this setting because the data, a continuous-time trajectory, is plentiful and the number of parameters is finite.) Then $\theta^*_\infty$ and $\theta^*_T$ solve the linear systems
\[
A_\infty \theta^*_\infty = b_\infty, \qquad A_T \theta^*_T = b_T,
\]
where
\[
(A_\infty)_{ij} = \int_{\mathbb{R}^{d_x}} \big\langle f_i(x), f_j(x)\big\rangle\, \mu(dx), \qquad
(b_\infty)_j = \int_{\mathbb{R}^{d_x}} \big\langle m^\dagger(x), f_j(x)\big\rangle\, \mu(dx),
\]
\[
(A_T)_{ij} = \frac{1}{T}\int_0^T \big\langle f_i\big(x(t)\big), f_j\big(x(t)\big)\big\rangle\, dt, \qquad
(b_T)_j = \frac{1}{T}\int_0^T \big\langle m^\dagger\big(x(t)\big), f_j\big(x(t)\big)\big\rangle\, dt.
\]
These facts can be derived analogously to the derivation in Section 4.8.5. Given $\theta^*_\infty$ and $\theta^*_T$ we also define
\[
m^*_\infty = m(\cdot\,;\theta^*_\infty), \qquad m^*_T = m(\cdot\,;\theta^*_T).
\]
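The normal-equations characterization of $\theta^*_T$ can be sketched directly: given samples of a trajectory and of the residual $m^\dagger(x(t))$ (which in practice would come from data, e.g. by differencing to estimate $\dot{x} - f_0(x)$), one forms Riemann-sum approximations of $A_T$ and $b_T$ and solves the linear system. The dictionary, trajectory, and stand-in for $m^\dagger$ below are all illustrative assumptions.

```python
import numpy as np

# Toy scalar setup: dictionary of monomials f_j(x) = x^j, and a "true" model
# error m_dagger evaluated along sampled trajectory points x(t_k).
p = 4
f = [lambda x, j=j: x**j for j in range(1, p + 1)]
m_dagger = np.sin  # stands in for the unknown truth

T, dt = 50.0, 0.01
t = np.arange(0.0, T, dt)
x = np.cos(0.7 * t)  # placeholder sample path; a trajectory of (4.2) in practice

# Riemann-sum approximations of (A_T)_{ij} and (b_T)_j from the text.
F = np.stack([fj(x) for fj in f])        # feature matrix, shape (p, len(t))
A_T = (F @ F.T) * dt / T                 # (1/T) int_0^T <f_i, f_j> dt
b_T = (F @ m_dagger(x)) * dt / T         # (1/T) int_0^T <m_dagger, f_j> dt

theta_T = np.linalg.solve(A_T, b_T)      # theta*_T, assuming A_T invertible
```

On this trajectory the fitted $m^*_T$ is close to the best odd-polynomial approximation of $\sin$, as expected from least squares with respect to the empirical measure.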

Recall that it is assumed that $f^\dagger$, $f_0$, and $m^\dagger$ are $C^1$. We make the following assumption regarding the vector fields defining the hypothesis class $\mathcal{M}$.

Assumption 4.5.3. The functions $\{f_\ell\}_{\ell=0}^{p}$ appearing in the definition (4.31) of the hypothesis class $\mathcal{M}$ are Hölder continuous on $\mathbb{R}^{d_x}$. In addition, the matrix $A_\infty$ is invertible.

Theorem 4.5.4. Let Assumptions 4.5.1 and 4.5.3 hold. Then the scaled excess risk $\sqrt{T}\, R_T$ in (4.14) (resp. scaled generalization error $\sqrt{T}\, |G_T|$ in (4.15)) is bounded above by $\|\mathcal{E}_R\|$ (resp. $\|\mathcal{E}_G\|$), where the random variable $\mathcal{E}_R \in \mathbb{R}^p$ (resp. $\mathcal{E}_G \in \mathbb{R}^{p+1}$) converges in distribution to $N(0, \Sigma_R)$ (resp. $N(0, \Sigma_G)$) with respect to $x(0) \sim \mu$ as $T \to \infty$. Furthermore, there is a constant $C > 0$ such that, almost surely with respect to $x(0) \sim \mu$,
\[
\limsup_{T\to\infty} \left(\frac{T}{\log\log T}\right)^{1/2} \big( R_T + |G_T| \big) \le C.
\]
The proof is provided in Section 4.8.1.

Remark 4.5.5. The convergence in distribution shows that, with high probability with respect to the initial data, the excess risk and the generalization error are bounded above by terms of size $1/\sqrt{T}$. This can be improved to give an almost sure result, at the cost of a factor of $\sqrt{\log\log T}$. The theorem shows that (ignoring log factors, and acknowledging the probabilistic nature of any such statements) trajectories of length $\mathcal{O}(\epsilon^{-2})$ are required to produce bounds on the excess risk and generalization error of size $\mathcal{O}(\epsilon)$.

The bounds on excess risk and generalization error also show that empirical risk minimization (of $I_T$) approaches the theoretically analyzable concept of risk minimization (of $I_\infty$) over the hypothesis class (4.31). The sum of the excess risk $R_T$ and the generalization error $G_T$ gives
\[
E_T := I_T(m^*_T) - I_\infty(m^*_\infty).
\]
We note that $I_T(m^*_T)$ is computable once the approximate solution $m^*_T$ has been identified; thus, when combined with an estimate for $E_T$, this leads to an estimate for the risk associated with the hypothesis class used.

If the approximating space $\mathcal{M}$ is rich enough, then approximation theory may be combined with Theorem 4.5.4 to estimate the trajectory error resulting from the learned dynamical system. Such an approach is pursued in Proposition 3 of [453] for SDEs. Furthermore, in that setting, knowledge of the rate of mixing/decay of correlations for SDEs may be used to quantify the constants appearing in the error bounds. It would be interesting to pursue such an analysis for chaotic ODEs with known mixing rates/decay of correlations. Such results on mixing are, however, less well-developed for chaotic ODEs; see the discussion of this point in [178] and the recent work [27].

Work by Zhang, Harlim, and Li [452] demonstrates that error bounds on learned model error terms can be extended to bound the error in reproducing invariant statistics for ergodic SDEs. Moreover, E, Ma, and Wu [120] provide a direction for proving similar bounds on model error learning using nonlinear function classes (e.g. two-layer neural networks).

Finally, we remark on the dependence of the risk and generalization error bounds on the size of the model error. It is intuitive that the amount of data required to learn the model error should decrease as the size of the model error decreases. This is demonstrated numerically in Section 4.6.2.3 (cf. Figures 4.2a and 4.2b). Here we comment that Theorem 4.5.4 also exhibits this feature: examination of the proof in Section 4.8.1 shows that all upper bounds on terms appearing in the excess risk and generalization error are proportional to $m^\dagger$ itself or to $m^*_\infty$, its approximation given an infinite amount of data; note that $m^*_\infty = m^\dagger$ if the hypothesis class contains the truth.
