Chapter IV: Machine learning of model error
4.5 Underpinning Theory
4.5.2 Learning Theory for Markovian Models with Linear Hypothesis Class

We now provide a learning theory in the context of learning $m^\dagger$ in (4.2) from a trajectory over time horizon $T$. We study ergodic continuous-time models in the setting of Section 4.4.1. To this end we consider the very general linear hypothesis class given by
$$\mathcal{M} = \Bigl\{\, m : \mathbb{R}^{d_x} \to \mathbb{R}^{d_x} \;\Big|\; \exists\, \theta \in \mathbb{R}^{p}: \; m(x) = \sum_{\ell=1}^{p} \theta_\ell \phi_\ell(x) \,\Bigr\}; \tag{4.31}$$
we note that if the $\{\phi_\ell\}$ are i.i.d. draws of a random function $\varphi$ in the case $D = d_x$, then this too reduces to a random features model, but that our analysis in the context of statistical learning does not rely on the random features structure. In fact our analysis can be used to provide learning theory for other linear settings, where $\{\phi_\ell\}$ represents a dictionary of hypothesized features whose coefficients are to be learnt from data. Nonetheless, universal approximation for random features [335] provides an important example of an approximation class for which the loss function $I_\infty$ may be made arbitrarily small by choice of $p$ large enough and appropriate choice of parameters, and the reader may find it useful to focus on this case. We also note that the theory we present in this subsection is readily generalized to working with hypothesis class (4.26).
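As a concrete illustration of the hypothesis class (4.31), the following sketch fits the coefficients $\theta$ of a random-features dictionary by least squares. The feature map, the target function, and all numerical choices here are hypothetical stand-ins, not taken from the text; a scalar-valued target is used for simplicity, the vector-valued case being handled componentwise.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, p = 2, 200  # state dimension and number of features

# Random features phi_ell(x) = tanh(w_ell . x + b_ell) with i.i.d. (w_ell, b_ell);
# any other fixed dictionary {phi_ell} fits the same linear-in-theta framework.
W = rng.normal(size=(p, d_x))
b = rng.uniform(-np.pi, np.pi, size=p)

def features(x):
    """Feature matrix Phi with Phi[n, ell] = phi_ell(x_n)."""
    return np.tanh(x @ W.T + b)

# Hypothetical scalar-valued "model error" to be learned (illustration only).
def m_dagger(x):
    return np.sin(x[:, 0]) * np.cos(x[:, 1])

X = rng.uniform(-2, 2, size=(1000, d_x))
theta, *_ = np.linalg.lstsq(features(X), m_dagger(X), rcond=None)

def m(x):
    """The fitted element m(.; theta) of the hypothesis class M in (4.31)."""
    return features(x) @ theta
```

Because $m$ depends linearly on $\theta$, the fit reduces to a linear least-squares problem, which is the structure exploited throughout this subsection.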
We make the following ergodicity assumption about the data generation process:
Assumption 4.5.1. Equation (4.2) possesses a compact attractor $\mathcal{A}$ supporting invariant measure $\mu$. Furthermore, the dynamical system on $\mathcal{A}$ is ergodic with respect to $\mu$ and satisfies a central limit theorem of the following form: for all Hölder continuous $\varphi : \mathbb{R}^{d_x} \to \mathbb{R}$, there is $\sigma^2 = \sigma^2(\varphi)$ such that
$$\sqrt{T}\left(\frac{1}{T}\int_0^T \varphi\bigl(x(t)\bigr)\,dt - \int_{\mathbb{R}^{d_x}} \varphi(x)\,\mu(dx)\right) \Rightarrow \mathcal{N}(0,\sigma^2) \tag{4.32}$$
where $\Rightarrow$ denotes convergence in distribution with respect to $x(0) \sim \mu$. Furthermore, a law of the iterated logarithm holds: almost surely with respect to $x(0) \sim \mu$,
$$\limsup_{T\to\infty}\left(\frac{T}{\log\log T}\right)^{1/2}\left|\frac{1}{T}\int_0^T \varphi\bigl(x(t)\bigr)\,dt - \int_{\mathbb{R}^{d_x}} \varphi(x)\,\mu(dx)\right| = \sigma. \tag{4.33}$$

Remark 4.5.2. Note that in both (4.32) and (4.33) $\varphi(\cdot)$ is only evaluated on the compact set $\mathcal{A}$, obviating the need for any boundedness assumptions on $\varphi(\cdot)$. In the work of Melbourne and co-workers, Assumption 4.5.1 is proven to hold for a class of differential equations, including the Lorenz '63 model at, and in a neighbourhood of, the classical parameter values: in [178] the central limit theorem is established; and in [27] the continuity of $\sigma$ in $\varphi$ is proven. Whilst it is in general very difficult to prove such results for any given chaotic dynamical system, there is strong empirical evidence for such results in many chaotic dynamical systems that arise in practice.
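The kind of empirical evidence referred to above is easy to probe numerically. The following sketch (assuming the classical Lorenz '63 parameters; the RK4 integrator, ensemble size, and horizons are illustrative choices, not from the text) computes time averages of $\varphi(x) = x_3$ over two horizons and checks that their ensemble spread shrinks in the manner predicted by (4.32):

```python
import numpy as np

def lorenz(u, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    # Classical Lorenz '63 vector field, acting on an ensemble of states.
    x, y, z = u[..., 0], u[..., 1], u[..., 2]
    return np.stack([sigma * (y - x), x * (rho - z) - y, x * y - beta * z], axis=-1)

def rk4_step(u, dt):
    k1 = lorenz(u)
    k2 = lorenz(u + 0.5 * dt * k1)
    k3 = lorenz(u + 0.5 * dt * k2)
    k4 = lorenz(u + dt * k3)
    return u + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

rng = np.random.default_rng(1)
dt, n_ens = 0.01, 64
u = rng.normal(size=(n_ens, 3)) + np.array([0.0, 0.0, 25.0])

# Burn-in so the ensemble is (approximately) distributed on the attractor, x(0) ~ mu.
for _ in range(1000):  # 10 time units
    u = rk4_step(u, dt)

# Time averages of phi(x) = z over horizons T = 25 and T = 100 (2500 and 10000 steps).
n_short, n_long = 2500, 10000
total = np.zeros(n_ens)
for step in range(1, n_long + 1):
    u = rk4_step(u, dt)
    total += u[:, 2]
    if step == n_short:
        avg_short = total / n_short
avg_long = total / n_long

# The CLT (4.32) predicts spread of the time average shrinking like 1/sqrt(T),
# i.e. a std ratio near sqrt(100/25) = 2 between the two horizons.
print(avg_long.mean(), avg_short.std(), avg_long.std())
```

With modest ensemble sizes the observed ratio fluctuates around the CLT prediction, consistent with the empirical evidence described above.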
This combination of theory and empirical evidence justifies studying the learning of model error under Assumption 4.5.1. Tran and Ward [408] were the first to make use of the theory of Melbourne and co-workers to study learning of chaotic differential equations from time-series.
Given $m$ from hypothesis class $\mathcal{M}$ defined by (4.31), we define
$$\theta^*_\infty = \arg\min_{\theta\in\mathbb{R}^p} I_\infty\bigl(m(\cdot;\theta)\bigr) = \arg\min_{\theta\in\mathbb{R}^p} L_\mu\bigl(m(\cdot;\theta)\bigr) \tag{4.34}$$
and
$$\theta^*_T = \arg\min_{\theta\in\mathbb{R}^p} I_T\bigl(m(\cdot;\theta)\bigr). \tag{4.35}$$
(Regularization is not needed in this setting because the data, a continuous-time trajectory, is plentiful and the number of parameters is finite.) Then $\theta^*_\infty$, $\theta^*_T$ solve the linear systems
$$A_\infty \theta^*_\infty = v_\infty, \qquad A_T \theta^*_T = v_T,$$
where
$$(A_\infty)_{ij} = \int_{\mathbb{R}^{d_x}} \bigl\langle \phi_i(x), \phi_j(x) \bigr\rangle \, \mu(dx), \qquad (v_\infty)_i = \int_{\mathbb{R}^{d_x}} \bigl\langle m^\dagger(x), \phi_i(x) \bigr\rangle \, \mu(dx),$$
$$(A_T)_{ij} = \frac{1}{T}\int_0^T \bigl\langle \phi_i\bigl(x(t)\bigr), \phi_j\bigl(x(t)\bigr) \bigr\rangle \, dt, \qquad (v_T)_i = \frac{1}{T}\int_0^T \bigl\langle m^\dagger\bigl(x(t)\bigr), \phi_i\bigl(x(t)\bigr) \bigr\rangle \, dt.$$
These facts can be derived analogously to the derivation in Section 4.8.5. Given $\theta^*_\infty$ and $\theta^*_T$ we also define
$$m^*_\infty = m(\cdot;\theta^*_\infty), \qquad m^*_T = m(\cdot;\theta^*_T).$$
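For orientation, the origin of these linear systems can be sketched briefly; the following assumes that $I_\infty$ is the $\mu$-weighted squared residual against $m^\dagger$ (the full derivation being the one referenced in Section 4.8.5):

```latex
% On the class (4.31), the risk is quadratic in theta:
I_\infty\bigl(m(\cdot;\theta)\bigr)
  = \int_{\mathbb{R}^{d_x}} \Bigl\| m^\dagger(x)
    - \sum_{\ell=1}^{p} \theta_\ell \phi_\ell(x) \Bigr\|^2 \mu(dx).
% Differentiating in theta_i and setting the derivative to zero:
\frac{\partial I_\infty}{\partial \theta_i}
  = -2 \int_{\mathbb{R}^{d_x}} \Bigl\langle m^\dagger(x)
    - \sum_{\ell=1}^{p} \theta_\ell \phi_\ell(x),\; \phi_i(x) \Bigr\rangle \mu(dx)
  = 0
% which rearranges to the normal equations
\quad\Longrightarrow\quad
\sum_{\ell=1}^{p} (A_\infty)_{i\ell}\, \theta_\ell = (v_\infty)_i .
```

The system for $\theta^*_T$ follows identically, with the empirical time average along the trajectory replacing integration against $\mu$.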
Recall that it is assumed that $f^\dagger$, $f_0$, and $m^\dagger$ are $C^1$. We make the following assumption regarding the vector fields defining hypothesis class $\mathcal{M}$.

Assumption 4.5.3. The functions $\{\phi_\ell\}_{\ell=1}^{p}$ appearing in definition (4.31) of the hypothesis class $\mathcal{M}$ are Hölder continuous on $\mathbb{R}^{d_x}$. In addition, the matrix $A_\infty$ is invertible.
Theorem 4.5.4. Let Assumptions 4.5.1 and 4.5.3 hold. Then the scaled excess risk $\sqrt{T}\,R_T$ in (4.14) (resp. scaled generalization error $\sqrt{T}\,|G_T|$ in (4.15)) is bounded above by $\|\mathcal{E}_R\|$ (resp. $\|\mathcal{E}_G\|$), where random variable $\mathcal{E}_R \in \mathbb{R}^p$ (resp. $\mathcal{E}_G \in \mathbb{R}^{p+1}$) converges in distribution to $\mathcal{N}(0,\Sigma_R)$ (resp. $\mathcal{N}(0,\Sigma_G)$) with respect to $x(0) \sim \mu$ as $T \to \infty$. Furthermore, there is a constant $C > 0$ such that, almost surely with respect to $x(0) \sim \mu$,
$$\limsup_{T\to\infty}\left(\frac{T}{\log\log T}\right)^{1/2}\bigl(R_T + |G_T|\bigr) \le C.$$
The proof is provided in Section 4.8.1.
Remark 4.5.5. The convergence in distribution shows that, with high probability with respect to initial data, the excess risk and the generalization error are bounded above by terms of size $1/\sqrt{T}$. This can be improved to give an almost sure result, at the cost of a factor of $\sqrt{\log\log T}$. The theorem shows that (ignoring log factors and acknowledging the probabilistic nature of any such statements) trajectories of length $\mathcal{O}(\varepsilon^{-2})$ are required to produce bounds on the excess risk and generalization error of size $\mathcal{O}(\varepsilon)$.
The bounds on excess risk and generalization error also show that empirical risk minimization (of $I_T$) approaches the theoretically analyzable concept of risk minimization (of $I_\infty$) over hypothesis class (4.31). The sum of the excess risk $R_T$ and the generalization error $G_T$ gives
$$E_T := I_T(m^*_T) - I_\infty(m^*_\infty).$$
We note that $I_T(m^*_T)$ is computable once the approximate solution $m^*_T$ has been identified; thus, when combined with an estimate for $E_T$, this leads to an estimate for the risk associated with the hypothesis class used.
If the approximating space $\mathcal{M}$ is rich enough, then approximation theory may be combined with Theorem 4.5.4 to estimate the trajectory error resulting from the learned dynamical system. Such an approach is pursued in Proposition 3 of [453] for SDEs. Furthermore, in that setting, knowledge of the rate of mixing/decay of correlations for SDEs may be used to quantify constants appearing in the error bounds. It would be interesting to pursue such an analysis for chaotic ODEs with known mixing rates/decay of correlations. Such results on mixing are, however, less well-developed for chaotic ODEs; see the discussion of this point in [178], and the recent work [27].
Work by Zhang, Harlim, and Li [452] demonstrates that error bounds on learned model error terms can be extended to bound error on reproduction of invariant statistics for ergodic SDEs. Moreover, E, Ma, and Wu [120] provide a direction for proving similar bounds on model error learning using nonlinear function classes (e.g.
two-layer neural networks).
Finally, we remark on the dependence of the risk and generalization error bounds on the size of the model error. It is intuitive that the amount of data required to learn model error should decrease as the size of the model error decreases. This is demonstrated numerically in Section 4.6.2.3 (cf. Figures 4.2a and 4.2b). Here we comment that Theorem 4.5.4 also exhibits this feature: examination of the proof in Section 4.8.1 shows that all upper bounds on terms appearing in the excess risk and generalization error are proportional to $m^\dagger$ itself or to $m^*_\infty$, its approximation given an infinite amount of data; note that $m^*_\infty = m^\dagger$ if the hypothesis class contains the truth.