
This example clearly shows the power of Gaussian process modeling for data interpolation.

From Figure 3.1, it is apparent that the point selection algorithm tends to pick points on the boundary of the original set. This is expected: the Gaussian process model needs these points in order to maintain accuracy over the entire region. Only a relatively small number of points are needed in the interior, because of the interpolative accuracy of the model.

It is also interesting to note that the decrease in maximum prediction error is not strictly monotonic. Adding some points may actually worsen the predictive capability of the Gaussian process model in other regions of the parameter space. Nevertheless, until matrix ill-conditioning issues begin to take effect, the overall trend should still show a decrease in maximum prediction error.

3.5 Accounting for observation uncertainty

Suppose the observations include measurement noise in addition to the underlying process. Then the covariance matrix associated with the training points can be written as Q ≡ λR + Σ_exp, where Σ_exp is the covariance matrix that characterizes the observation noise. If the observations are independent, then Σ_exp = diag(λ_exp,1, …, λ_exp,m), where λ_exp,i is the variance associated with the observation error for the ith observation. The observation variances may be assumed known, but it would also be possible to estimate them from the training data using maximum likelihood estimation (which would only be meaningful if repeated observations are present).
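To make the last point concrete, here is a minimal Python sketch (with hypothetical names, not from the text) of the maximum likelihood estimate of a single observation variance from repeated observations at one input site; for Gaussian errors with unknown mean, the MLE is the mean squared deviation about the sample mean, with divisor n:

```python
import numpy as np

def mle_observation_variance(replicates):
    """MLE of a single observation-error variance lambda_exp,i from
    repeated observations at one input site, assuming Gaussian errors
    with unknown mean: the mean squared deviation with divisor n."""
    y = np.asarray(replicates, dtype=float)
    n = y.size
    if n < 2:
        raise ValueError("repeated observations are required")
    return float(np.sum((y - y.mean()) ** 2) / n)

# Three repeats of the same experiment:
lam_exp_i = mle_observation_variance([1.9, 2.1, 2.0])
```

Note that this divisor-n estimate is biased low in small samples; with only a handful of repeats per site, the distinction from the unbiased divisor-(n−1) estimate is material.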

The preceding sections used a special notation in which the data covariance matrix is decomposed as λR, the product of the process variance and the data correlation matrix. However, as is apparent from the above definition of the new data covariance matrix, this separation is no longer possible. Previously, r(x) was defined as the vector of correlations between x and the training points, but k(x) is now used instead, which is defined as the vector of covariances between x and the training points. Note that k(x) is not a function of Σ_exp.

Using the new notation, the conditional distribution of a point x can be expressed by the mean and variance (previously Eqs. (3.4) and (3.5)):

$$\mathrm{E}\left[Y(x) \mid Y\right] = f^{T}(x)\,\beta + k^{T}(x)\,Q^{-1}(Y - F\beta) \qquad (3.27)$$

and

$$\mathrm{Var}\left[Y(x) \mid Y\right] = \lambda - k^{T} Q^{-1} k. \qquad (3.28)$$

Note that when the experimental variances are zero, these equations are equivalent to Eqs. (3.4) and (3.5). Also, even if the λ_exp,i are large, the variance predicted by Eq. (3.28) at or near one of the training points is still bounded above by λ, even though the uncertainty associated with that observation, λ_exp,i, may be greater than λ. This reinforces the important role that the parameter selection process plays in the predictions and their uncertainty estimates.
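To make Eqs. (3.27) and (3.28) concrete, the following Python sketch evaluates the conditional mean and variance for a one-dimensional input, under an assumed squared-exponential correlation and a constant trend f(x) = 1; both of these are illustrative choices, not fixed by the text:

```python
import numpy as np

def sq_exp_corr(X1, X2, omega):
    """Assumed squared-exponential correlation in one dimension:
    c(x, x') = exp(-e^omega (x - x')^2)."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-np.exp(omega) * d2)

def predict(x, X, Y, beta, lam, lam_exp, omega):
    """Conditional mean and variance of Eqs. (3.27)-(3.28):
    Q = lam * R + diag(lam_exp); k(x) carries no Sigma_exp term."""
    Q = lam * sq_exp_corr(X, X, omega) + np.diag(lam_exp)
    k = lam * sq_exp_corr(np.atleast_1d(float(x)), X, omega)[0]
    F = np.ones_like(X)                  # constant trend f(x) = 1
    mean = beta + k @ np.linalg.solve(Q, Y - F * beta)
    var = lam - k @ np.linalg.solve(Q, k)
    return float(mean), float(var)
```

With all observation variances zero, the model interpolates: at a training point the mean reproduces the observation and the variance collapses to zero. With non-zero λ_exp the variance at a training point stays positive but, as noted above, never exceeds λ.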

3.5.1 Parameter estimation

Maximum Likelihood Estimation (as discussed in Section 3.3) can also be used for this model, but some additional considerations come into play. First, consider the negative log of the likelihood function, as before. The likelihood function is again derived from Eq. (3.14), but the multiplicative constant 1/2 will be retained here so that the likelihood can be combined with a prior distribution for the Gaussian process parameters, if desired. The negative log likelihood is

$$NL = \tfrac{1}{2}\log|Q| + \tfrac{1}{2}\,(Y - F\beta)^{T} Q^{-1} (Y - F\beta). \qquad (3.29)$$

Recall from Section 3.3 that by taking the derivative of NL with respect to λ, it was possible to find its conditional optimum value. However, now that λ can no longer be separated from the data covariance matrix, that result no longer applies. Moreover, a bigger problem has arisen:

with non-zero experimental variances λ_exp, the optimum value of λ may tend to zero, which is obviously infeasible and of no practical use. There are two possibilities for dealing with this problem:

1. One could include a prior distribution for λ that naturally counteracts the insistence of the likelihood for λ to go to zero. Thus, instead of searching for values that maximize the likelihood function, one would search for parameter values that maximize the posterior distribution (this is sometimes referred to as maximum a posteriori (MAP) estimation).

2. Alternatively, one could work with a “penalized” or “restricted” likelihood function. In this case, an additional term is simply added to NL.

In practice, these two alternatives are essentially the same. However, some analysts may be reluctant to include a prior distribution for λ because of the apparent subjectivity involved in choosing an appropriate prior. Thus, restricted maximum likelihood estimation (RMLE) is presented here.

The RMLE method was first proposed by Patterson and Thompson (1971), and Harville (1974) later presented a more convenient representation. The motivation behind the development of RMLE appears to be the fact that when the covariance parameters are chosen based on regular MLE, their maximum likelihood estimates take no account of the loss in degrees of freedom that results from estimating β. The idea is based on re-formulating the likelihood as a function of error contrasts, where an error contrast is simply any linear combination bᵀY of the observations such that E[bᵀY] = 0. The technique is thus based on maximizing the likelihood function associated with a particular set of m − q linearly independent error contrasts, rather than the full likelihood function.

The resulting likelihood function, which will be denoted by NL_R, is

$$NL_R = \tfrac{1}{2}\log|Q| + \tfrac{1}{2}\left(Y - F\hat{\beta}\right)^{T} Q^{-1}\left(Y - F\hat{\beta}\right) + \tfrac{1}{2}\log\left|F^{T} Q^{-1} F\right|, \qquad (3.30)$$

where β̂ is the same as in Eq. (3.26), but with R replaced by Q. The only differences from Eq. (3.29) are the additional term at the end and the replacement of β by β̂. The use of β̂ directly inside the likelihood function does not add anything new, however, since β̂ would have been chosen as the optimal value anyway. The additional term in Eq. (3.30) will effectively prevent the optimum value of λ from being zero by penalizing small values of λ. Further, the use of the restricted likelihood function takes appropriate account of the fact that q degrees of freedom are lost in the estimation of β.
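A direct transcription of Eq. (3.30) might look like the following Python sketch (function and variable names are mine; F is the m × q trend matrix, and β̂ is the generalized least squares estimate with R replaced by Q):

```python
import numpy as np

def nlr(Y, F, R, lam, lam_exp):
    """Restricted negative log likelihood, Eq. (3.30), with
    Q = lam * R + diag(lam_exp) and the GLS estimate
    beta_hat = (F^T Q^-1 F)^-1 F^T Q^-1 Y substituted for beta."""
    Q = lam * R + np.diag(lam_exp)
    A = F.T @ np.linalg.solve(Q, F)          # F^T Q^-1 F
    beta_hat = np.linalg.solve(A, F.T @ np.linalg.solve(Q, Y))
    r = Y - F @ beta_hat                     # residuals about the GLS trend
    quad = r @ np.linalg.solve(Q, r)
    logdet_Q = np.linalg.slogdet(Q)[1]       # log|Q| without overflow
    logdet_A = np.linalg.slogdet(A)[1]       # the penalty term log|F^T Q^-1 F|
    return 0.5 * logdet_Q + 0.5 * quad + 0.5 * logdet_A
```

Using `slogdet` rather than `det` avoids overflow and underflow in the determinants, which matters here because poorly chosen λ values are exactly the regime a numerical optimizer will visit.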

3.5.2 Gradient information

As before, the gradients of the negative log likelihood can be made available to the optimization algorithm to improve its performance significantly. The gradients of NL_R differ from those presented in Section 3.3.2 because of the inclusion of λ_exp and because of the additional term in the likelihood function. The derivations are based on matrix calculus, and the relevant equations are given below.

The gradient of NL_R with respect to any covariance parameter, θ, is given by

$$\frac{\partial NL_R}{\partial \theta} = \tfrac{1}{2}\,\mathrm{trace}\left[Q^{-1}\dot{Q}\right] - \tfrac{1}{2}\left(Y - F\hat{\beta}\right)^{T} Q^{-1}\dot{Q}\,Q^{-1}\left(Y - F\hat{\beta}\right) - \tfrac{1}{2}\,\mathrm{trace}\left[F\left(F^{T} Q^{-1} F\right)^{-1} F^{T} Q^{-1}\dot{Q}\,Q^{-1}\right], \qquad (3.31)$$

where Q̇ is the matrix of derivatives of Q with respect to θ. For the log correlation scale parameter, ω, the matrix of derivatives is

$$\frac{\partial Q}{\partial \omega_k} = \left[-e^{\omega_k}\left(x_k^{(i)} - x_k^{(j)}\right)^{2} \lambda\, c\!\left(x^{(i)}, x^{(j)}\right)\right]_{i,j}. \qquad (3.32)$$

When dealing with a covariance matrix that cannot be decomposed into a variance term and a correlation matrix, it makes sense to work with the log of λ, since it will need to be optimized numerically. Defining γ = log(λ) gives

$$\frac{\partial Q}{\partial \gamma} = \left[e^{\gamma}\, c\!\left(x^{(i)}, x^{(j)}\right)\right]_{i,j}. \qquad (3.33)$$
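The derivative matrices in Eqs. (3.32) and (3.33) are straightforward to assemble. The sketch below does so for an assumed squared-exponential correlation c(x, x′) = exp(−Σ_k e^{ω_k}(x_k − x′_k)²), since the text leaves c(·, ·) generic; a finite-difference check against Q(γ, ω) = e^γ R + Σ_exp is a cheap way to validate such code:

```python
import numpy as np

def corr(X, omega):
    """Assumed squared-exponential correlation for X of shape (m, d):
    c(x, x') = exp(-sum_k e^{omega_k} (x_k - x'_k)^2)."""
    d2 = (X[:, None, :] - X[None, :, :]) ** 2      # (m, m, d)
    return np.exp(-(d2 * np.exp(omega)).sum(axis=-1))

def dQ_domega(X, omega, gamma, k):
    """Eq. (3.32): entry (i, j) is -e^{omega_k} (x_k^(i) - x_k^(j))^2
    times lam * c(x^(i), x^(j)), with lam = e^gamma."""
    d2k = (X[:, None, k] - X[None, :, k]) ** 2
    return -np.exp(omega[k]) * d2k * np.exp(gamma) * corr(X, omega)

def dQ_dgamma(X, omega, gamma):
    """Eq. (3.33): entry (i, j) is e^gamma c(x^(i), x^(j)) --
    the noise-free part of Q; diag(lam_exp) contributes nothing."""
    return np.exp(gamma) * corr(X, omega)
```

A central finite difference of Q in γ (or in each ω_k) should agree with these matrices to within the truncation error of the difference scheme; the observation-noise term Σ_exp drops out of both derivatives.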

3.5.3 Computational considerations

As mentioned above, one consequence of including measurement uncertainties in the Gaussian process model is that an analytical optimum value of the process variance, λ, is no longer available. Thus, unlike before, it becomes necessary to choose a starting value for λ (or, preferably, the log of λ). The starting value matters because a bad choice can lead to numerical problems with the likelihood computations, which will in turn cause trouble for the numerical optimization algorithm. This may happen if one attempts to compute NL_R with a value of λ that is grossly inconsistent with the scale of the observed response values.

One possibility is to set the initial value of λ equal to the variance of the observed response values, which will generally be on the same scale as the process variance. An alternative approach is to first scale the data in Y to have unit variance, in which case unity is an appropriate starting value for λ.

Several steps must be taken if the response values are to be scaled. Suppose one simply wants to transform the original response values Y to have unit variance. Denote the sample variance of the observed response values by s²_Y. The transformation is effected by dividing each value in Y by s_Y. The observation variances, λ_exp,i, must also be scaled accordingly, by dividing each by s²_Y. Finally, it is necessary to rescale the conditional mean and variance after applying Eqs. (3.27) and (3.28): simply multiply the conditional expected value by s_Y and the conditional variance by s²_Y.
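The scaling recipe above can be captured in a pair of small helpers (a hypothetical sketch; the GP fitting itself is omitted):

```python
import numpy as np

def scale_problem(Y, lam_exp):
    """Divide the responses by their sample standard deviation s_Y so
    that unity (gamma = 0) is a safe starting value for lam; the
    observation variances must be divided by s_Y^2 to stay consistent."""
    s_Y = float(np.std(Y, ddof=1))
    return np.asarray(Y) / s_Y, np.asarray(lam_exp) / s_Y**2, s_Y

def unscale_prediction(mean_scaled, var_scaled, s_Y):
    """Map the conditional moments of Eqs. (3.27)-(3.28), computed on
    the scaled data, back to the original response scale."""
    return mean_scaled * s_Y, var_scaled * s_Y**2
```

Note `ddof=1` for the sample variance; the text divides by s_Y only, so the responses are rescaled but not centered, and any centering would instead be absorbed by the trend term Fβ.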