
Illustrations

In the document Texts in Applied Mathematics 62 (Pages 86-92)

Theorem 3.12. Consider the data-assimilation problem for deterministic dynamics (2.3), (2.2) with Ψ ∈ C1(Rn, Rn) and h ∈ C1(Rn, Rm). Then:

• (i) The infimum of Idet(· ; y) given in (2.29) is attained at at least one point v0 in Rn. It follows that the density ρ(v0) = P(v0|y) on Rn associated with the posterior probability ν given by Theorem 2.11 is maximized at v0.

• (ii) Furthermore, if B(z, δ) denotes a ball in Rn of radius δ centered at z, then

lim_{δ→0} Pν(B(z1, δ)) / Pν(B(z2, δ)) = exp(Idet(z2; y) − Idet(z1; y)).

As in the case of stochastic dynamics, we do not discuss optimization methods to perform the minimization associated with variational problems; this is because optimization is a well-established and mature research area that is hard to do justice to within the confines of this book. However, we conclude this section with an example that illustrates certain advantages of the Bayesian perspective over the optimization, or variational, perspective. Recall from Theorem 2.15 that the Bayesian posterior distribution is continuous with respect to small changes in the data. In contrast, computation of the global maximizer of the probability may be discontinuous as a function of the data. To illustrate this, consider the probability measure με on R with Lebesgue density proportional to exp(−V(u; ε)), where

V(u; ε) = (1/4)(1 − u²)² + εu. (3.23)

It is a straightforward application of the methodology behind the proof of Theorem 2.15 to show that με is Lipschitz continuous in ε, with respect to the Hellinger metric. Furthermore, the methodology behind Theorems 3.10 and 3.12 shows that the probability with respect to this measure is maximized wherever V is minimized. The global minimum, however, changes discontinuously, even though the posterior distribution changes smoothly. This is illustrated in Figure 3.1, where the left-hand panel shows the continuous evolution of the probability density function, while the right-hand panel shows the discontinuity in the global maximizer of the probability (minimizer of V) as ε passes through zero. The explanation for this difference between the fully Bayesian approach and MAP estimation is as follows. The measure με has two peaks, for small ε, close to ±1. The Bayesian approach accounts for both of these peaks simultaneously and weights their contribution to expectations. In contrast, the MAP estimation approach leads to a global minimum located near u = −1 for ε > 0 and near u = +1 for ε < 0, resulting in a discontinuity.
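The contrast between the continuously varying measure and the discontinuously varying minimizer is easy to check numerically. The following sketch (Python is used here purely for illustration; it is not one of the MATLAB programs accompanying the text) tabulates the double-well potential of (3.23) on a grid and prints the global minimizer together with the mean of the density proportional to exp(−V(u; ε)):

```python
import numpy as np

def V(u, eps):
    """Double-well potential V(u; eps) = (1/4)(1 - u^2)^2 + eps*u from (3.23)."""
    return 0.25 * (1.0 - u ** 2) ** 2 + eps * u

u = np.linspace(-2.0, 2.0, 4001)
for eps in (-0.01, -0.001, 0.001, 0.01):
    u_star = u[np.argmin(V(u, eps))]              # global minimizer (the MAP point)
    density = np.exp(-V(u, eps))
    mean = np.sum(u * density) / np.sum(density)  # mean of the (unnormalized) density
    print(f"eps = {eps:+.3f}: argmin = {u_star:+.3f}, mean = {mean:+.3f}")
```

The minimizer jumps from near +1 to near −1 as ε crosses zero, while the mean varies continuously, reflecting the robustness of the fully Bayesian answer.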

Fig. 3.1: Plot of (3.23) shows discontinuity of the global maximum as a function of ε.

Fig. 3.2: Comparison of the posterior for Example 2.4 for r = 4 using random walk Metropolis and equation (2.29) directly, as in the MATLAB program p2.m. We have used J = 5, C0 = 0.01, m0 = 0.5, γ = 0.2, and true initial condition v0 = 0.3; see also p3.m in Section 5.2.1. We have used N = 10⁸ samples from the MCMC algorithm.

of the output of the Markov chain) with the true posterior pdf ρ. The two distributions are almost indistinguishable when plotted together in Figure 3.2a; in Figure 3.2b, we plot their difference, which, as we can see, is small relative to the true value. We deduce that the number of samples used, N = 10⁸, results here in an accurate sampling of the posterior.
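To make the comparison in Figure 3.2 concrete, here is a minimal random walk Metropolis sketch in Python. It is an illustration only, not a transcription of p2.m or p3.m: the logistic-map forward model and the direct observation of the state are assumptions standing in for Example 2.4, while the numerical values follow the caption of Figure 3.2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup (illustrative): logistic-map dynamics v_{j+1} = r v_j (1 - v_j),
# observations of the state with N(0, gamma^2) noise, Gaussian prior N(m0, C0) on v0.
r, J, gamma = 4.0, 5, 0.2
m0, C0, v0_true = 0.5, 0.01, 0.3

def trajectory(v0):
    v = np.empty(J + 1)
    v[0] = v0
    for j in range(J):
        v[j + 1] = r * v[j] * (1.0 - v[j])
    return v

y = trajectory(v0_true)[1:] + gamma * rng.standard_normal(J)

def I_det(v0):
    """Negative log-posterior for v0, up to an additive constant (cf. (2.29))."""
    misfit = y - trajectory(v0)[1:]
    return 0.5 * np.sum(misfit ** 2) / gamma ** 2 + 0.5 * (v0 - m0) ** 2 / C0

# Random walk Metropolis targeting the density proportional to exp(-I_det(v0)).
beta, N = 0.05, 20000
v, Iv, samples = m0, I_det(m0), []
for _ in range(N):
    w = v + beta * rng.standard_normal()          # symmetric Gaussian proposal
    Iw = I_det(w)
    if rng.random() < np.exp(min(0.0, Iv - Iw)):  # accept with prob min(1, e^{Iv - Iw})
        v, Iv = w, Iw
    samples.append(v)
print("posterior mean of v0 (second half of chain):", np.mean(samples[N // 2:]))
```

The chain here is short; far longer chains, as used for Figure 3.2, are needed for close agreement with the true posterior pdf.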

We now turn to the use of MCMC methods to sample the smoothing pdf P(v|y) in the case of stochastic dynamics (2.1), using the independence dynamics sampler and both pCN methods. Before describing the application of numerical methods, we study the ergodicity of the independence dynamics sampler in a simple but illustrative setting. For simplicity, assume that the observation operator h is bounded, so that for all u we have |h(u)| ≤ hmax. Then, recalling the notation Y_j = {y_l}_{l=1}^{j} from the filtering problem, we have

Φ(u; y) ≤ Σ_{j=0}^{J−1} ( |Γ^{−1/2} y_{j+1}|² + |Γ^{−1/2} h(u_{j+1})|² )
        ≤ |Γ^{−1/2}|² ( Σ_{j=0}^{J−1} |y_{j+1}|² + J hmax² )
        ≤ |Γ^{−1/2}|² ( |Y_J|² + J hmax² ) =: Φmax.

Since Φ ≥ 0, this shows that every proposed step is accepted with probability exceeding e^{−Φmax}; hence, since proposals are made with the prior measure μ0 describing the unobserved stochastic dynamics,

p(u, A) ≥ e^{−Φmax} μ0(A).

Thus Theorem 3.3 applies, and in particular, (3.10) and (3.11) hold with ε = e^{−Φmax} under these assumptions. This positive result about the ergodicity of the MCMC method also indicates the potential difficulties with the independence dynamics sampler. The independence sampler relies on draws from the prior matching the data well. When the data set is large (J ≫ 1) or the noise covariance small (|Γ| ≪ 1), this will happen infrequently, because Φmax will be large, and the MCMC method will reject frequently and be inefficient. To illustrate this, we consider application of the method to Example 2.3, using the same parameters as in Figure 2.3; specifically, we take α = 2.5 and Σ = σ² = 1. We now sample the posterior distribution and then plot the resulting accept–reject ratio a for the independence dynamics sampler, employing different values of the noise Γ and different sizes of the data set J. This is illustrated in Figure 3.3.
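The mechanism of the independence dynamics sampler can be sketched in a few lines: propose a fresh trajectory from the prior dynamics and accept with probability min{1, exp(Φ(u) − Φ(u′))}. In the Python sketch below, the sin-map dynamics standing in for Example 2.3, the observation model, and all numerical values are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed setup (illustrative): scalar dynamics v_{j+1} = alpha*sin(v_j) + xi_j,
# xi_j ~ N(0, sigma^2); observations y_{j+1} = v_{j+1} + eta, eta ~ N(0, gamma^2);
# an assumed N(0, 1) prior on the initial condition.
alpha, sigma, gamma, J = 2.5, 1.0, 1.0, 10

def sample_prior():
    """Draw a trajectory u = (u_0, ..., u_J) from the unobserved stochastic dynamics."""
    u = np.empty(J + 1)
    u[0] = rng.standard_normal()
    for j in range(J):
        u[j + 1] = alpha * np.sin(u[j]) + sigma * rng.standard_normal()
    return u

y = sample_prior()[1:] + gamma * rng.standard_normal(J)  # synthetic data

def Phi(u):
    """Model-data misfit Phi(u; y)."""
    return 0.5 * np.sum((y - u[1:]) ** 2) / gamma ** 2

# Independence dynamics sampler: propose from the prior dynamics,
# accept with probability min(1, exp(Phi(u) - Phi(u'))).
N, accepts = 5000, 0
u = sample_prior()
for _ in range(N):
    u_prop = sample_prior()
    if rng.random() < np.exp(min(0.0, Phi(u) - Phi(u_prop))):
        u, accepts = u_prop, accepts + 1
print("average acceptance probability:", accepts / N)
```

With J = 10 and γ = 1, the printed acceptance fraction is typically small, consistent with the inefficiency discussed above.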

Fig. 3.3: Accept–reject probability of the independence sampler for Example 2.3 for α = 2.5, Σ = σ² = 1, and Γ = γ², for different values of γ and J.

In addition, in Figure 3.4, we plot the output and the running average of the output projected onto the first element of the vector v(k), the initial condition (recall that we are defining a Markov chain on R^{J+1}), for N = 10⁵ steps. Figure 3.4a clearly exhibits the fact that there are many rejections, caused by the low average acceptance probability. Figure 3.4b

Fig. 3.4: Output and running average of the independence dynamics sampler after K = 10⁵ steps, for Example 2.3 for α = 2.5, Σ = σ² = 1, and Γ = γ² = 1, with J = 10; see also p4.m in Section 5.2.2.

shows that the running average has not converged after 10⁵ steps, indicating that the chain needs to be run for longer. If we run the Markov chain over N = 10⁸ steps, then we do get convergence. This is illustrated in Figure 3.5. In Figure 3.5a, we see that the running average has converged to its limiting value when this many steps are used. In Figure 3.5b, we plot the marginal probability distribution for the first element of v(k), calculated from this converged Markov chain.

Fig. 3.5: Running average and probability density of the first element of v(k) for the independence dynamics sampler after K = 10⁸ steps, for Example 2.3 for α = 2.5, Σ = σ² = 1, and Γ = γ², with γ = 1 and J = 10; see also p4.m in Section 5.2.2.

In order to get faster convergence when sampling the posterior distribution, we turn to application of the pCN method. Unlike the independence dynamics sampler, this contains a tunable parameter that can vary the size of the proposals. In particular, the possibility of

making small moves, with resultant higher acceptance probability, makes this a more flexible method than the independence dynamics sampler. In Figure 3.6, we show application of the pCN sampler, again considering Example 2.3 for α = 2.5, Σ = σ² = 1, and Γ = γ² = 1, with J = 10, the same parameters used in Figure 3.4.
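The pCN proposal itself is simple to state and to code: w = (1 − β²)^{1/2} u + βζ, with ζ drawn from the Gaussian reference measure, accepted with probability min{1, exp(Φ(u) − Φ(w))}. The following Python sketch applies it to an assumed toy target with a known answer, to make the role of the tunable parameter β concrete:

```python
import numpy as np

rng = np.random.default_rng(2)

def pcn_step(u, Phi, beta, rng):
    """One pCN step for a target exp(-Phi(u)) dN(0, I)(u): the proposal preserves
    the Gaussian reference measure, and beta in (0, 1] tunes the move size."""
    w = np.sqrt(1.0 - beta ** 2) * u + beta * rng.standard_normal(u.shape)
    if rng.random() < np.exp(min(0.0, Phi(u) - Phi(w))):
        return w, True
    return u, False

# Toy assumed target: reference N(0, I) on R^3 with potential Phi(u) = |u|^2,
# so the posterior is N(0, (1/3) I) and the answer is known in closed form.
Phi = lambda u: np.dot(u, u)

u, acc, samples = np.zeros(3), 0, []
for _ in range(20000):
    u, accepted = pcn_step(u, Phi, beta=0.3, rng=rng)
    acc += accepted
    samples.append(u.copy())
samples = np.array(samples[5000:])  # discard burn-in
print("acceptance rate:", acc / 20000)
print("sample variance per coordinate:", samples.var(axis=0))
```

Decreasing β shrinks the moves and raises the acceptance rate at the price of slower exploration; this is the flexibility referred to above.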

In the case that the dynamics significantly influence the trajectory, i.e., the regime of large Ψ or small σ, it may be that the standard pCN method is ineffective, due to the large effects of the G term and the improbability of Gaussian samples being close to samples of the prior on the dynamics. The pCN dynamics sampler, recall, acts on the space comprising the initial condition and the forcing, both of which are Gaussian under the prior, and so may sometimes have an advantage, given that pCN-type methods are based on Gaussian proposals. The use of this method is explored in Figure 3.7 for Example 2.3 for α = 2.5, Σ = σ² = 1, and Γ = γ² = 1, with J = 10.

Fig. 3.6: Trace plot and running average of the first element of v(k) for the pCN sampler after K = 10⁵ steps, for Example 2.3 with α = 2.5, Σ = σ² = 1, and Γ = γ² = 1, with J = 10; see also p5.m in Section 5.2.3.

We now turn to variational methods; recall Theorems 3.10 and 3.12 in the stochastic and deterministic cases, respectively. In Figure 3.8a, we plot the MAP (4DVAR) estimator for our Example 2.1, choosing exactly the same parameters and data as for Figure 2.10a, in the case J = 10². In this case, the function Idet(· ; y) is quadratic and has a unique global minimum. A straightforward minimization routine will easily find this: we employed standard MATLAB optimization software initialized at three different points. From all three starting points chosen, the algorithm finds the correct global minimizer.

In Figure 3.8b, we plot the MAP (4DVAR) estimator for our Example 2.4 for the case r = 4, choosing exactly the same parameters and data as for Figure 2.13. We again employ a MATLAB optimization routine, and we again initialize it at three different points. The value obtained for the MAP estimator depends crucially on the starting point of the minimization procedure: of the three initializations shown, it is only when we start from 0.2 that we are able to find the global minimum of Idet(v0; y). By Theorem 3.12, this global minimum corresponds to the maximum of the posterior distribution, and we see that finding the MAP estimator is a difficult task for this problem. Starting with the other two initial conditions displayed, we converge to one of the many local minima of Idet(v0; y); these local minima are in fact regions of very low probability,

Fig. 3.7: Trace plot and running average of the first element of v(k) for the pCN dynamics sampler after K = 10⁵ steps, for Example 2.3 with α = 2.5, Σ = σ² = 1, and Γ = γ² = 1, with J = 10; see also p6.m in Section 5.2.3.

Fig. 3.8: Finding local minima of I(v0; y) for Examples 2.1 and 2.4. The values and the data used are the same as for Figures 2.10a and 2.13b. Three plotting symbols denote the three different initial conditions for starting the minimization process: (−8, −2, 8) for Example 2.1 and (0.05, 0.2, 0.4) for Example 2.4.

as we can see in Figure 2.13a. This illustrates the care required in computing 4DVAR solutions in cases in which the forward problem exhibits sensitivity to initial conditions.
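The dependence of the computed minimizer on the initialization can be reproduced with any multimodal objective. The Python sketch below uses an assumed stand-in objective (not Idet for Example 2.4) and a deliberately crude finite-difference gradient descent in place of the MATLAB optimization software; only the qualitative point, that different starts find different local minima, carries over:

```python
import numpy as np

# Multi-start local minimization sketch; the objective is an assumed stand-in
# for a multimodal Idet(v0; y).
def objective(v):
    return np.sin(8.0 * v) ** 2 + (v - 0.2) ** 2  # many wells from the oscillatory term

def descend(v, lr=1e-3, steps=20000, h=1e-6):
    """Crude finite-difference gradient descent (illustrative only)."""
    for _ in range(steps):
        grad = (objective(v + h) - objective(v - h)) / (2.0 * h)
        v = v - lr * grad
    return v

starts = [0.05, 0.2, 0.4]  # cf. the initializations used for Example 2.4
results = [(s, descend(s)) for s in starts]
for s, v in results:
    print(f"start {s:+.2f} -> minimizer {v:+.4f}, value {objective(v):.4f}")
best = min(results, key=lambda sv: objective(sv[1]))
print("best local minimum found at v0 =", best[1])
```

Running all starts and keeping the best value found is the standard pragmatic remedy, though, as the text notes, it gives no certificate of global optimality.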

Figure 3.9 shows application of the w4DVAR method, or MAP estimator, given by Theorem 3.10, in the case of Example 2.3 with parameters set at J = 5, γ = σ = 0.1. In contrast to the previous example, this is no longer a one-dimensional minimization problem: we are minimizing I(v; y) given by (2.21) over v ∈ R⁶, given the data y ∈ R⁵. The figure shows that there are at least two local minimizers for this problem, with v(1) closer to the truth than v(2), and with I(v(1); y) considerably smaller than I(v(2); y). However, v(2) has a larger basin of attraction for the optimization software used: many initial conditions lead to v(2), while fewer lead to v(1). Furthermore, while we believe that v(1) is the global minimizer, it is difficult

to state this with certainty, even for this relatively low-dimensional model; establishing it would require an exhaustive, and expensive, search of the six-dimensional parameter space.
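A sketch of a weak-constraint minimization of this type follows. The objective mirrors the structure of I(v; y) in (2.21), but the sin-map dynamics standing in for Example 2.3, the N(0, 1) prior on v0, the synthetic data, and the crude finite-difference descent in place of the optimization software used for Figure 3.9 are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed setup: dynamics Psi(v) = alpha*sin(v), model noise sigma, obs noise gamma.
alpha, sigma, gamma, J = 2.5, 0.1, 0.1, 5

def Psi(v):
    return alpha * np.sin(v)

truth = np.empty(J + 1)
truth[0] = 0.5
for j in range(J):
    truth[j + 1] = Psi(truth[j]) + sigma * rng.standard_normal()
y = truth[1:] + gamma * rng.standard_normal(J)

def I(v):
    """Background + model-error + data-misfit terms of the weak-constraint objective."""
    background = 0.5 * v[0] ** 2                              # assumed N(0,1) prior on v_0
    model = 0.5 * np.sum((v[1:] - Psi(v[:-1])) ** 2) / sigma ** 2
    data = 0.5 * np.sum((y - v[1:]) ** 2) / gamma ** 2
    return background + model + data

def descend(v, lr=2e-4, steps=3000, h=1e-6):
    """Finite-difference gradient descent over v in R^{J+1} (illustrative only)."""
    v = v.copy()
    for _ in range(steps):
        grad = np.array([(I(v + h * e) - I(v - h * e)) / (2 * h) for e in np.eye(J + 1)])
        v -= lr * grad
    return v

# Different starting points can be attracted to different local minimizers.
results = []
for start in (truth + 0.1 * rng.standard_normal(J + 1), rng.standard_normal(J + 1)):
    v_star = descend(start)
    results.append((I(start), I(v_star)))
    print("I(start) =", I(start), " ->  I(v*) =", I(v_star))
```

Different starts can terminate at different values of I, mirroring the two minimizers v(1) and v(2) seen in Figure 3.9.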

Fig. 3.9: Weak constraint 4DVAR for J = 5, γ = σ = 0.1, illustrating two local minimizers v(1) and v(2); see also p7.m in Section 5.2.5.
