Chapter IV: Machine learning of model error
4.5 Underpinning Theory
4.5.3 Non-Markovian Modeling with Recurrent Neural Networks
Recall from the preceding discussion that the bounds on the generalization error are proportional to $m^\dagger$ itself or to $m^\dagger_\infty$, its approximation given an infinite amount of data; note that $m^\dagger_\infty = m^\dagger$ if the hypothesis class contains the truth.
Note that Assumption 4.5.9 can be satisfied by any parametric function class possessing a universal approximation property for maps between finite-dimensional Euclidean spaces, such as neural networks, polynomials and random feature methods. The next theorem transfers this universal approximation property for maps between Euclidean spaces to a universal approximation property for the representation of model error with memory; this is a form of infinite-dimensional approximation since, via its own dynamics, the memory variable $r$ maps the past history of $x$ into the model-error correction term in the dynamics for $x$.
Theorem 4.5.10. Let Assumptions 4.5.6–4.5.9 hold. Fix any $T > 0$ and $r_0 > 0$, let $x(\cdot), y(\cdot)$ denote the solution of (4.8) with $\varepsilon = 1$, and let $x_\delta(\cdot), r_\delta(\cdot)$ denote the solution of (4.18) with parameters $\theta \in \mathbb{R}^p$. Then, for any $\delta > 0$, there is a parameter dimension $p = p_\delta \in \mathbb{N}$ and a parameterization $\theta = \theta_\delta \in \mathbb{R}^{p_\delta}$ with the property that, for any initial condition $(x(0), y(0)) \in B(0, r_0)$ for (4.8), there is an initial condition $(x_\delta(0), r_\delta(0)) \in \mathbb{R}^{d_x + d_y}$ for (4.18) such that
$$\sup_{t \in [0,T]} \|x - x_\delta\| \le \delta.$$
The proof is provided in Section 4.8.2; it is a direct consequence of the approximation power of $f_1, f_2$ and the Gronwall lemma.
Remark 4.5.11. Note that this existence theorem also holds for $d_r > d_y$ by freezing the dynamics in the excess dimensions and initializing them at, for example, 0. However, it is possible for augmentations with $d_r > d_y$ to introduce numerical instability when imperfectly initialized in the excess dimensions, despite their provable correctness when perfectly initialized (see Section 4.6.4). Nevertheless, we did not encounter such issues when training the general model class on the examples considered in this paper; see Section 4.6.3.
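As a minimal sketch of the general model class, and assuming (4.18) takes the form $\dot{x} = f_0(x) + f_1(x, r; \theta)$, $\dot{r} = f_2(x, r; \theta)$ with $f_1, f_2$ parameterized by neural networks, one possible PyTorch realization is given below; all names, the network widths and the placeholder $f_0$ are hypothetical and not taken from the thesis.

```python
import torch
import torch.nn as nn

class GeneralRNNClosure(nn.Module):
    """Hypothetical realization of the non-Markovian closure described above:
    dx/dt = f0(x) + f1(x, r; theta),  dr/dt = f2(x, r; theta)."""

    def __init__(self, d_x, d_r, width=64):
        super().__init__()
        self.f1 = nn.Sequential(nn.Linear(d_x + d_r, width), nn.Tanh(),
                                nn.Linear(width, d_x))
        self.f2 = nn.Sequential(nn.Linear(d_x + d_r, width), nn.Tanh(),
                                nn.Linear(width, d_r))

    def f0(self, x):
        # Placeholder known-physics term (hypothetical); in practice this is the
        # imperfect mechanistic model.
        return -x

    def vector_field(self, x, r):
        xr = torch.cat([x, r], dim=-1)
        return self.f0(x) + self.f1(xr), self.f2(xr)

    def rk4_step(self, x, r, dt):
        # One explicit RK4 step of the coupled (x, r) system.
        k1x, k1r = self.vector_field(x, r)
        k2x, k2r = self.vector_field(x + 0.5 * dt * k1x, r + 0.5 * dt * k1r)
        k3x, k3r = self.vector_field(x + 0.5 * dt * k2x, r + 0.5 * dt * k2r)
        k4x, k4r = self.vector_field(x + dt * k3x, r + dt * k3r)
        return (x + dt / 6.0 * (k1x + 2 * k2x + 2 * k3x + k4x),
                r + dt / 6.0 * (k1r + 2 * k2r + 2 * k3r + k4r))
```

Rolling `rk4_step` forward from an initial pair $(x(0), r(0))$ produces the closure trajectory; the parameter vector $\theta$ comprises the weights of `f1` and `f2`.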
4.5.3.2 Linear Coupling
We now study a particular form of RNN in which the coupling term $f_1$ appearing in (4.18) is linear and depends only on the hidden variable:
\begin{align}
\dot{x} &= f_0(x) + C r, \tag{4.36a}\\
\dot{r} &= \sigma(A r + B x + c). \tag{4.36b}
\end{align}
Here $\sigma$ is an activation function. The specific linear-coupling form is of particular interest because of the connection we make (see Remark 4.5.18 below) to reservoir computing. The goal is to choose $A, B, C, c$ so that the output $\{x(t)\}_{t \ge 0}$ matches the output of (4.8), without observation of $\{y(t)\}_{t \ge 0}$ or knowledge of $f^\dagger$ and $g^\dagger$. As in the general case from the preceding subsection, inherent in choosing the matrices $A, B, C$ and vector $c$ is a choice of embedding dimension for the variable $r$, which will typically be larger than the dimension of $y$ itself. The idea is to create a recurrent state $r$ of sufficiently large dimension $d_r$ whose evolution equation takes $x$ as input and, after a final linear transformation, approximates the missing dynamics $m^\dagger(x, y)$. There is existing approximation theory for discrete-time RNNs [370] showing that a discrete-time analog of our linear-coupling set-up can be used to approximate discrete-time systems arbitrarily well; see also Theorem 3 of [165]. There is also a general approximation theorem for continuous-time RNNs proved in [139], but it does not apply to the linear-coupling setting. We thus extend the work in these three papers to the context of residual-based learning as in (4.36). We state the theorem after making three assumptions upon which it rests:
Assumption 4.5.12. The functions $f^\dagger$, $g^\dagger$, $f_0$ are all globally Lipschitz.

Note that this implies that $m^\dagger$ is also globally Lipschitz.
Assumption 4.5.13. Let $\sigma_0 \in C^1(\mathbb{R};\mathbb{R})$ be bounded and monotonic, with bounded first derivative. Then $\sigma$, defined componentwise by $\sigma(u)_i = \sigma_0(u_i)$, satisfies $\sigma \in C^1(\mathbb{R}^n;\mathbb{R}^n)$.

Assumption 4.5.14. Fix $T > 0$. There exist $r_0 \in \mathbb{R}$ and $r_T \in \mathbb{R}$ such that, for equation (4.8), $(x(0), y(0)) \in B(0, r_0)$ implies that $(x(t), y(t)) \in B(0, r_T)$ for all $t \in [0, T]$.

Theorem 4.5.15. Let Assumptions 4.5.12–4.5.14 hold. Fix any $T > 0$ and $r_0 > 0$, let $x(\cdot), y(\cdot)$ denote the solution of (4.8) with $\varepsilon = 1$, and let $x_\delta(\cdot), r_\delta(\cdot)$ denote the solution of (4.36) with parameters $\theta \in \mathbb{R}^p$. Then, for any $\delta > 0$, there is an embedding dimension $d_r \in \mathbb{N}$, a parameter dimension $p = p_\delta \in \mathbb{N}$ and a parameterization $\theta = \theta_\delta = \{A_\delta, B_\delta, C_\delta, c_\delta\}$ with the property that, for any initial condition $(x(0), y(0)) \in B(0, r_0)$ for (4.8), there is an initial condition $(x_\delta(0), r_\delta(0)) \in \mathbb{R}^{d_x + d_r}$ for (4.36) such that
$$\sup_{t \in [0,T]} \|x - x_\delta\| \le \delta.$$
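Before turning to the proof, we note that the model class in Theorem 4.5.15 is straightforward to simulate. The following sketch (our own illustration, not code from the thesis) integrates (4.36) with SciPy for fixed, randomly drawn $(A, B, C, c)$; the choice $f_0(x) = -x$, the dimensions and all names are hypothetical.

```python
import numpy as np
from scipy.integrate import solve_ivp

d_x, d_r = 2, 50
rng = np.random.default_rng(0)

# Fixed parameters of the linear-coupling RNN (4.36); here drawn at random.
A = rng.normal(scale=1.0 / np.sqrt(d_r), size=(d_r, d_r))
B = rng.normal(scale=1.0, size=(d_r, d_x))
C = rng.normal(scale=1.0 / d_r, size=(d_x, d_r))
c = rng.normal(size=d_r)

def f0(x):
    # Placeholder known physics: linear damping (hypothetical choice).
    return -x

def rnn_rhs(t, z):
    # z stacks (x, r); implements dx/dt = f0(x) + C r, dr/dt = tanh(A r + B x + c).
    x, r = z[:d_x], z[d_x:]
    return np.concatenate([f0(x) + C @ r, np.tanh(A @ r + B @ x + c)])

z0 = np.concatenate([np.ones(d_x), np.zeros(d_r)])  # (x(0), r(0))
sol = solve_ivp(rnn_rhs, (0.0, 10.0), z0, max_step=0.01)
x_traj = sol.y[:d_x]  # surrogate trajectory {x(t)}
```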
The complete proof is provided in Section 4.8.3; here we describe its basic structure. Define $m(t) := m^\dagger\bigl(x(t), y(t)\bigr)$ and, with the aim of finding a differential equation for $m(t)$, recall (4.8) with $\varepsilon = 1$ and define the vector field
\begin{equation}
h^\dagger(x, y) := \nabla_x m^\dagger(x, y)\,\bigl[f_0(x) + m^\dagger(x, y)\bigr] + \nabla_y m^\dagger(x, y)\, g^\dagger(x, y). \tag{4.37}
\end{equation}
Since $\dot{m}(t)$ is the time derivative of $m^\dagger\bigl(x(t), y(t)\bigr)$, when $(x, y)$ solve (4.8) we have $\dot{m} = h^\dagger(x, y)$.
Motivated by these observations, we now introduce a new system of autonomous ODEs for the variables $(x, y, m) \in \mathbb{R}^{d_x} \times \mathbb{R}^{d_y} \times \mathbb{R}^{d_x}$:
\begin{align}
\dot{x} &= f_0(x) + m, \tag{4.38a}\\
\dot{y} &= g^\dagger(x, y), \tag{4.38b}\\
\dot{m} &= h^\dagger(x, y). \tag{4.38c}
\end{align}
To avoid a proliferation of symbols we use the same letters for $(x, y)$ solving equation (4.38) as for $(x, y)$ solving equation (4.8). We now show that $m = m^\dagger(x, y)$ is an invariant manifold for (4.38); clearly, on this manifold, the dynamics of $(x, y)$ governed by (4.38) reduces to the dynamics of $(x, y)$ governed by (4.8). Thus $m(t)$ must be initialized at $m^\dagger\bigl(x(0), y(0)\bigr)$ to ensure equivalence between the solutions of (4.38) and (4.8).
The desired invariance of the manifold $m = m^\dagger(x, y)$ under the dynamics (4.38) follows from the identity
\begin{equation}
\frac{d}{dt}\bigl(m - m^\dagger(x, y)\bigr) = -\nabla_x m^\dagger(x, y)\,\bigl(m - m^\dagger(x, y)\bigr). \tag{4.39}
\end{equation}
The identity is derived by noting that, recalling the definition (4.37) of $h^\dagger$ and then using (4.38),
\begin{align*}
\frac{d}{dt} m = h^\dagger(x, y) &= \nabla_x m^\dagger(x, y)\,\bigl[f_0(x) + m^\dagger(x, y)\bigr] + \nabla_y m^\dagger(x, y)\, g^\dagger(x, y)\\
&= \nabla_x m^\dagger(x, y)\,\bigl[f_0(x) + m\bigr] + \nabla_y m^\dagger(x, y)\, g^\dagger(x, y) - \nabla_x m^\dagger(x, y)\,\bigl(m - m^\dagger(x, y)\bigr)\\
&= \frac{d}{dt} m^\dagger(x, y) - \nabla_x m^\dagger(x, y)\,\bigl(m - m^\dagger(x, y)\bigr).
\end{align*}
We emphasize that this calculation is performed under the dynamics defined by (4.38), so that $\frac{d}{dt} m^\dagger(x, y) = \nabla_x m^\dagger(x, y)\,\dot{x} + \nabla_y m^\dagger(x, y)\,\dot{y}$ with $\dot{x} = f_0(x) + m$ and $\dot{y} = g^\dagger(x, y)$.
The proof of the RNN approximation property proceeds by approximating the vector fields $m^\dagger(x, y)$ and $h^\dagger(x, y)$ by neural networks and introducing linear transformations of $y$ and $m$ to rewrite the approximate version of system (4.38) in the form (4.36). The effect of the approximation of the vector fields on the true solution is then propagated through the system and controlled via a straightforward Gronwall argument.
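As a numerical sanity check on the construction (4.38)–(4.39), the following sketch (our own worked example, not taken from the thesis) uses the toy choices $f_0(x) = -x$, $f^\dagger(x,y) = -x + y$ and $g^\dagger(x,y) = x - y$, for which $m^\dagger(x,y) = y$ and $h^\dagger(x,y) = g^\dagger(x,y)$; initializing $m(0) = m^\dagger(x(0), y(0))$ then makes the $x$-trajectory of (4.38) reproduce that of (4.8) to solver accuracy.

```python
import numpy as np
from scipy.integrate import solve_ivp

def true_rhs(t, z):
    # True system (4.8) with epsilon = 1: dx/dt = -x + y, dy/dt = x - y (toy example).
    x, y = z
    return [-x + y, x - y]

def augmented_rhs(t, z):
    # Augmented system (4.38): dx/dt = f0(x) + m, dy/dt = g_dagger, dm/dt = h_dagger.
    x, y, m = z
    return [-x + m, x - y, x - y]

x0, y0 = 1.0, -0.5
m0 = y0  # m(0) = m_dagger(x(0), y(0)) = y(0): start on the invariant manifold

sol_true = solve_ivp(true_rhs, (0.0, 10.0), [x0, y0],
                     rtol=1e-10, atol=1e-12, dense_output=True)
sol_aug = solve_ivp(augmented_rhs, (0.0, 10.0), [x0, y0, m0],
                    rtol=1e-10, atol=1e-12, dense_output=True)

ts = np.linspace(0.0, 10.0, 200)
err = np.max(np.abs(sol_true.sol(ts)[0] - sol_aug.sol(ts)[0]))
print(f"max |x_true - x_augmented| on [0, 10]: {err:.2e}")  # agrees to solver tolerance
```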
Remark 4.5.16. The details of training continuous-time RNNs to ensure accuracy and long-time stability are a subject of current research [72,125,304,82], and in this paper we confine the training of RNNs to an example in the general setting, not the case of linear coupling. Discrete-time RNN training, on the other hand, is much more mature and has produced satisfactory accuracy and stability in settings with uniform sample rates that are consistent across training and testing scenarios [165]. The form with linear coupling is widely studied in discrete-time models. Furthermore, sophisticated variants on RNNs, such as Long Short-Term Memory (LSTM) RNNs [177] and Gated Recurrent Units (GRUs) [87], are often more effective, although similar in nature to RNNs. However, the potential formulation, implementation and advantages of these variants in the continuous-time setting [299] are not yet understood.
We refer readers to [149] for background on discrete-time RNN implementations and backpropagation through time (BPTT). For implementations of continuous-time RNNs, it is common to leverage the success of the automatic BPTT code written in PyTorch and TensorFlow by discretizing (4.36) with an ODE solver that is compatible with these autodifferentiation tools (e.g. torchdiffeq by [358], NbedDyn by [304], and AD-EnKF by [82]). This compatibility can also be achieved by use of explicit Runge–Kutta schemes [332]. Note that the discretization of (4.36) can (and perhaps should) be much finer than the data sampling rate $\Delta t$, but that this requires reliable estimation of $x(t)$, $\dot{x}(t)$ from discrete data.
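A minimal sketch of this strategy (our own illustration, not the implementation used in the thesis) discretizes (4.36) with an explicit RK4 scheme, at a resolution finer than the data sampling rate, and trains $(A, B, C, c)$ by backpropagation through time; the synthetic data, the placeholder $f_0$, and all hyperparameters are hypothetical.

```python
import torch

torch.manual_seed(0)
d_x, d_r, dt, N = 2, 100, 0.05, 200

def f0(x):
    # Placeholder "known physics"; in practice this is the imperfect mechanistic model.
    return -x

# Synthetic training trajectory (stand-in for observed data {x(n dt)}).
L = torch.tensor([[-1.0, 2.0], [-2.0, -1.0]])
x_data = [torch.tensor([1.0, 0.0])]
for _ in range(N - 1):
    x_data.append(x_data[-1] + dt * (L @ x_data[-1]))  # forward-Euler surrogate truth
x_data = torch.stack(x_data)

# Parameters of the linear-coupling RNN (4.36).
A = torch.nn.Parameter(0.1 * torch.randn(d_r, d_r))
B = torch.nn.Parameter(torch.randn(d_r, d_x))
C = torch.nn.Parameter(torch.zeros(d_x, d_r))
c = torch.nn.Parameter(torch.zeros(d_r))

def rhs(x, r):
    # dx/dt = f0(x) + C r,  dr/dt = tanh(A r + B x + c)
    return f0(x) + C @ r, torch.tanh(A @ r + B @ x + c)

def rk4(x, r, h):
    # One explicit RK4 step; autodiff-compatible, which is what enables BPTT.
    k1x, k1r = rhs(x, r)
    k2x, k2r = rhs(x + 0.5 * h * k1x, r + 0.5 * h * k1r)
    k3x, k3r = rhs(x + 0.5 * h * k2x, r + 0.5 * h * k2r)
    k4x, k4r = rhs(x + h * k3x, r + h * k3r)
    return (x + h / 6 * (k1x + 2 * k2x + 2 * k3x + k4x),
            r + h / 6 * (k1r + 2 * k2r + 2 * k3r + k4r))

opt = torch.optim.Adam([A, B, C, c], lr=1e-3)
substeps = 5  # integrator resolution finer than the data sampling rate dt
for epoch in range(200):
    x, r = x_data[0], torch.zeros(d_r)
    loss = torch.zeros(())
    for n in range(1, N):
        for _ in range(substeps):
            x, r = rk4(x, r, dt / substeps)
        loss = loss + torch.sum((x - x_data[n]) ** 2)
    opt.zero_grad()
    loss.backward()   # backpropagation through time through all RK4 steps
    opt.step()
```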
Remark 4.5.17. The need for data assimilation [21,227,346] to learn the initialization of recurrent neural networks may be understood as follows. Since $m^\dagger$ is not known and $y$ is not observed (and in particular $y(0)$ is not known), the desired initialization for (4.38), and thus also for approximations of this equation in which $m^\dagger$ and $h^\dagger$ are replaced by neural networks, is not known. Hence, if an RNN is trained on a particular trajectory, the initial condition required for accurate approximation of (4.8) from an unseen initial condition is not known. Furthermore, the invariant manifold $m = m^\dagger(x, y)$ may be unstable under numerical approximation. However, if some observations of the trajectory starting at the new initial condition are used, then data assimilation techniques can potentially learn the initialization for the RNN and also stabilize the invariant manifold. Ad hoc initialization methods are common practice [170,88,26,315], and rely on forcing the learned RNN with a short sequence of observed data to synchronize the hidden state. The success of these approaches likely relies on the ability of RNNs to emulate data assimilators [166]; however, a more careful treatment of the initialization problem may enable substantial advances.
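A minimal sketch of such an ad hoc "washout" synchronization step (our own illustration; the forward-Euler discretization, function name and arguments are hypothetical):

```python
import numpy as np

def synchronize_hidden_state(x_obs, dt, A, B, c, r0=None):
    """Drive dr/dt = tanh(A r + B x_obs(t) + c) with a short window of observed x
    (forward Euler for simplicity); return the synchronized hidden state r."""
    r = np.zeros(A.shape[0]) if r0 is None else r0
    for x in x_obs:                      # x_obs: array of shape (N_washout, d_x)
        r = r + dt * np.tanh(A @ r + B @ x + c)
    return r

# Hypothetical usage: r_init = synchronize_hidden_state(x_obs_window, dt, A, B, c),
# after which (4.36) is integrated forward from (x_obs_window[-1], r_init) for prediction.
```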
Remark 4.5.18. Reservoir computing (RC) is a variant on RNNs which has the advantage of leading to a quadratic optimization problem [190,272,154]. Within the context of the continuous-time RNN (4.36), it corresponds to randomizing $(A, B, c)$ in (4.36b) and then choosing only the parameter $C$ to fit the data. To be concrete, this leads to
$$r(t) = G_t\bigl(\{x(s)\}_{s=0}^{t};\, r(0), A, B, c\bigr);$$
here $G_t$ may be viewed as a random function of the path-history of $x$ up to time $t$ and of the initial condition for $r$. Then $C$ is determined by minimizing the quadratic function
$$J_T(C) = \frac{1}{2T}\int_0^T \bigl\|\dot{x}(t) - f_0(x(t)) - C\,r(t)\bigr\|^2\, dt + \frac{\lambda}{2}\,\|C\|^2.$$
This may be viewed as a random feature approach on the Banach space $X_T$; the use of random features for the learning of mappings between Banach spaces is studied by Nelsen and Stuart [295], and connections between random features and reservoir computing were introduced by Dong et al. [109]. In the specific setting described here, care will be needed in choosing the probability measure on $(A, B, c)$ to ensure a well-behaved map $G_t$; furthermore, data assimilation ideas [21,227,346] will be needed to learn an appropriate $r(0)$ in the prediction phase, as discussed in Remark 4.5.17 for RNNs.
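In a discrete-time implementation, minimizing a discretized version of $J_T(C)$ reduces to ridge regression of the residual $\dot{x} - f_0(x)$ onto the reservoir states. A minimal sketch (our own, with hypothetical array names, and assuming $\dot{x}$ has been estimated from the data, e.g. by finite differences):

```python
import numpy as np

def fit_readout(R, V, lam):
    """Ridge-regression readout for a discretized reservoir-computing objective.

    R   : (N, d_r) array of reservoir states r(t_n), generated by (4.36b) with
          randomized (A, B, c) driven by the observed x-trajectory.
    V   : (N, d_x) array of residuals dx/dt(t_n) - f0(x(t_n)), e.g. from finite differences.
    lam : Tikhonov regularization parameter (lambda in the objective).
    Returns C of shape (d_x, d_r) minimizing the discretized quadratic objective.
    """
    N, d_r = R.shape
    gram = R.T @ R / N + lam * np.eye(d_r)   # regularized normal equations
    rhs = R.T @ V / N
    return np.linalg.solve(gram, rhs).T      # C such that C @ r approximates the residual
```

Because the objective is quadratic in $C$, no iterative optimization or backpropagation is required; this is the computational appeal of RC noted above.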