4.2 Uncertainty propagation
4.2.1 Background
matrix (Srivastava, 2002). If the principal components are given by $A_{(k)}^T x$, then the test of $H_0 : A_{(k)}^T \mu = A_{(k)}^T \mu_0$ is based on the statistic
\[
\frac{f-k+1}{f\,k}\; n \left[ A_{(k)}^T (\bar{x} - \mu_0) \right]^T \left[ A_{(k)}^T S A_{(k)} \right]^{-1} A_{(k)}^T (\bar{x} - \mu_0), \tag{4.8}
\]
which has an $F$-distribution with $k$ and $f-k+1$ degrees of freedom under $H_0$. If the covariance is not estimated from the data, then the test statistic becomes
\[
n \left[ A_{(k)}^T (\bar{x} - \mu_0) \right]^T \left[ A_{(k)}^T \Sigma A_{(k)} \right]^{-1} A_{(k)}^T (\bar{x} - \mu_0), \tag{4.9}
\]
which has a chi-square distribution on $k$ degrees of freedom under the null hypothesis.
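As a rough numerical sketch of how the statistic in Eq. (4.8) could be evaluated, the code below assumes that the columns of $A_{(k)}$ are the first $k$ eigenvectors of the sample covariance matrix $S$ and that $f = n - 1$; the function name and these conventions are assumptions made for illustration, not prescriptions from Srivastava (2002).

```python
# Sketch: evaluating the PC-based test statistic of Eq. (4.8).
# Assumptions (for illustration only): the columns of A_k are the first k
# eigenvectors of the sample covariance S, and f = n - 1.
import numpy as np
from scipy import stats

def pc_mean_test(x, mu0, k):
    n, p = x.shape
    f = n - 1
    xbar = x.mean(axis=0)
    S = np.cov(x, rowvar=False)                # p x p sample covariance
    _, eigvecs = np.linalg.eigh(S)             # eigenvalues in ascending order
    A_k = eigvecs[:, ::-1][:, :k]              # first k principal directions
    d = A_k.T @ (xbar - mu0)                   # projected mean difference
    T = (f - k + 1) / (f * k) * n * d @ np.linalg.solve(A_k.T @ S @ A_k, d)
    p_value = stats.f.sf(T, k, f - k + 1)      # F(k, f - k + 1) under H0
    return T, p_value
```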
The methods discussed in this section are geared towards uncertainty propagation when the input/output relationship is only observable by running the simulation for a particular set of inputs.
The most straightforward approach for this case is known as Monte Carlo simulation.
Monte Carlo simulation is a sampling-based approach in which a large number (typically tens of thousands) of random realizations of the input parameters are generated, and the simulator is run for each sample. The output samples are then used to make inferences about the model output distribution, for example by estimating the mean, estimating a reliability level, or plotting a histogram.
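The sketch below illustrates this basic Monte Carlo recipe with a hypothetical, inexpensive stand-in for the simulator; the input distributions, sample size, and failure threshold are purely illustrative.

```python
# Sketch of basic Monte Carlo uncertainty propagation.
import numpy as np

rng = np.random.default_rng(0)

def simulator(x):
    # Hypothetical stand-in for an expensive computer simulation
    return x[0] ** 2 + np.sin(x[1])

n_samples = 10_000
inputs = rng.normal(loc=[1.0, 0.0], scale=[0.1, 0.5], size=(n_samples, 2))
outputs = np.array([simulator(x) for x in inputs])

print("mean estimate:", outputs.mean())
print("std. dev. estimate:", outputs.std(ddof=1))
print("P(output > 1.2) estimate:", (outputs > 1.2).mean())
```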
The problem with basic Monte Carlo sampling, however, is that it requires a very large number of evaluations of the computer simulation in order to accurately characterize the output distribution, and in practical applications, obtaining this number of evaluations is often not feasible. For this reason, several techniques are available for reducing the variance of sampling-based estimators. Latin hypercube sampling (McKay et al., 1979) is a common approach whose goal is a more even distribution of the sample points in the parameter space. When reliability estimation is of interest, importance sampling (cf. Haldar and Mahadevan, 2000) is popular. Importance sampling uses a sampling density that is concentrated in the failure region, so that samples are not needlessly wasted in other regions of the parameter space.
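As a sketch of the Latin hypercube idea, the following assumes SciPy's qmc module (SciPy 1.7 or later) and maps stratified uniform points through normal marginals; the particular inputs are illustrative.

```python
# Sketch: Latin hypercube sampling of two independent normal inputs.
from scipy import stats
from scipy.stats import qmc

sampler = qmc.LatinHypercube(d=2, seed=0)
u = sampler.random(n=100)                        # stratified points on [0, 1)^2
x1 = stats.norm.ppf(u[:, 0], loc=1.0, scale=0.1)
x2 = stats.norm.ppf(u[:, 1], loc=0.0, scale=0.5)
```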
An alternative to efficient sampling techniques is to use an inexpensive approximation to the input/output relationship in lieu of the expensive computer simulation. Such approximations are often called surrogate models, or response surface approximations. A variety of methods are available for developing response surface approximations, including the development of models with reduced degrees of freedom, polynomial regression, multivariate adaptive regression splines (Friedman, 1991), neural networks, non-intrusive polynomial chaos (Isukapalli et al., 1998), and Gaussian process interpolation. The use of Gaussian process interpolation for surrogate modeling has been of particular interest within the scientific community for studies involving both uncertainty quantification and optimization (examples include Bichon et al., 2008; Jones et al., 1998; Kennedy and O'Hagan, 2001; Bayarri et al., 2002; Simpson et al., 2001; Kaymaz, 2005; Kennedy et al., 2006; Oakley and O'Hagan, 2002).
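As an illustrative sketch (not the particular construction used in any of the works cited above), a Gaussian process surrogate can be trained on a modest number of simulator runs and then sampled cheaply in place of the simulator; the example assumes scikit-learn is available and uses a hypothetical test function.

```python
# Sketch: Gaussian process surrogate for uncertainty propagation.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(1)

def simulator(x):
    # Hypothetical stand-in for an expensive computer simulation
    return np.sin(3.0 * x[:, 0]) + 0.5 * x[:, 1] ** 2

X_train = rng.uniform(0.0, 1.0, size=(30, 2))     # a few expensive runs
y_train = simulator(X_train)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3)).fit(X_train, y_train)

X_mc = rng.uniform(0.0, 1.0, size=(100_000, 2))   # cheap Monte Carlo on surrogate
y_mc = gp.predict(X_mc)
print("surrogate-based mean estimate:", y_mc.mean())
```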
Giunta et al. (2006) compare response-surface-based methods for uncertainty propagation to efficient sampling techniques (including Latin hypercube sampling, LHS). Sample sizes ranging from 10 to 121 were considered, and the methods were used to estimate the output mean and variance, as well as a 5% probability level. It was found that a kriging (Gaussian process) response surface approximation performed significantly better than LHS, particularly for failure probability estimation, and particularly for moderate to high sample sizes (> 25). Although it may not be appropriate to generalize these results to other functional forms and dimensions, they are suggestive of the power of the Gaussian process model for uncertainty quantification.
In any case, before the model output distribution can be estimated, the joint distribution of the model inputs must first be characterized. This often entails using a set of samples, or observations, of the model inputs to estimate an associated probability density function.
One of the most widely used and straightforward methods for characterizing randomness is the normal distribution, which is defined in terms of two parameters, a mean and a variance. The ubiquity of the normal distribution is due in part to the fact that its use is often justified by the central limit theorem, and that it is frequently an appropriate model for randomness observed in the physical world. Algorithms for generating random realizations of normal random variables are also well established, making the normal distribution a convenient choice for probabilistic simulations.
The multivariate extension of the normal distribution is also a widely used representation for multivariate data. The multivariate normal distribution is simple to specify because it is fully defined by a mean vector and covariance matrix (equivalently, a set of marginal normal distributions and the pairwise correlation coefficients among the variables). As with the univariate normal distribution, algorithms are readily available for generating random samples from the multivariate normal model.
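For example, a multivariate normal input model can be specified and sampled directly from its mean vector and covariance matrix (the numerical values below are illustrative only).

```python
# Sketch: specifying and sampling a multivariate normal input model.
import numpy as np

mean = np.array([10.0, 2.5])
cov = np.array([[4.0, 1.2],
                [1.2, 0.9]])        # symmetric, positive definite

rng = np.random.default_rng(2)
samples = rng.multivariate_normal(mean, cov, size=5000)
print("sample correlation:\n", np.corrcoef(samples, rowvar=False))
```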
In some cases, however, the normal or multivariate normal model will not be appropriate. A variety of procedures are available for assessing the suitability of the normal model for univariate data, including the normal probability plot (cf. Haldar and Mahadevan, 2000), the Lilliefors test (Lilliefors, 1967), and the Shapiro-Wilk test (Shapiro and Wilk, 1965). Srivastava (2002) also discusses a few procedures available for assessing the suitability of the multivariate normal model for a set of multivariate data, a task that is not as straightforward.
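For instance, the Shapiro-Wilk and Lilliefors tests are available in SciPy and statsmodels, respectively; the sketch below applies both to an illustrative univariate sample.

```python
# Sketch: two univariate normality checks applied to a data sample x.
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

x = np.random.default_rng(3).gamma(shape=2.0, size=200)   # illustrative data

W, p_sw = stats.shapiro(x)              # Shapiro-Wilk test
D, p_lf = lilliefors(x, dist="norm")    # Lilliefors (KS with estimated parameters)
print(f"Shapiro-Wilk p = {p_sw:.3f}, Lilliefors p = {p_lf:.3f}")
```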
In the univariate case, or when dealing with multiple independent variables, non-normality is not typically a problem for uncertainty propagation. Simulation techniques exist for a variety of parametric non-normal probability distributions, and several transformations are also available to help achieve normality (see Rebba, 2005, for an overview of transformations).
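As one example of such a transformation (the particular choice is illustrative, not prescribed by the reference above), a Box-Cox transformation can be estimated directly from positive-valued data using SciPy.

```python
# Sketch: Box-Cox as one example of a normalizing transformation.
import numpy as np
from scipy import stats

x = np.random.default_rng(4).lognormal(mean=0.0, sigma=0.6, size=500)
x_t, lam = stats.boxcox(x)              # data must be strictly positive
print("estimated Box-Cox lambda:", lam)
print("Shapiro-Wilk p after transform:", stats.shapiro(x_t)[1])
```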
However, when a group of dependent variables does not fit the multivariate normal model, the situation is not as simple. Transformations that achieve joint normality are more difficult to find, and simulation techniques are not as widely available. One lesser-known difficulty is that when the variables are non-normal, specifying the marginal distributions and correlation coefficients does not necessarily result in a fully specified joint distribution. Some specialized techniques are available to sample from fully specified parametric joint distributions (Johnson, 1981), but they will not be applicable to most practical problems. Other sampling techniques that have been developed include approximations based on partially specified joint distributions (Lurie and Goldberg, 1998; Iman and Conover, 1982) and the use of bivariate copulas (Haas, 1999).
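For illustration, a Gaussian copula construction, which is one simple way to sample from a partially specified joint distribution, draws correlated standard normals and maps them through the desired marginals; the marginals and correlation below are purely illustrative, and the induced product-moment correlation will differ somewhat from that of the latent normals.

```python
# Sketch: Gaussian copula sampling of dependent non-normal variables.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
corr = np.array([[1.0, 0.7],
                 [0.7, 1.0]])                   # correlation of latent normals

z = rng.multivariate_normal(np.zeros(2), corr, size=5000)
u = stats.norm.cdf(z)                           # uniform marginals, dependence kept
x1 = stats.gamma.ppf(u[:, 0], a=2.0, scale=1.5)        # desired marginal 1
x2 = stats.weibull_min.ppf(u[:, 1], c=1.8, scale=3.0)  # desired marginal 2
```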
As an alternative to parametric techniques, which are only applicable when the data conform to the specified probability model, several non-parametric techniques are available that are more generally applicable. Two such non-parametric techniques are the polynomial chaos expansion (cf. Ghanem and Spanos, 1991; Ghanem and Dham, 1998; Debusschere et al., 2004; Ghanem et al., 2008; Ghanem, 1999) and kernel density estimation. The polynomial chaos expansion is a method whereby a random variable is expanded in a set of orthogonal basis functions of independent, standardized random variables. Kernel density estimation is discussed below in Section 4.2.2.
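As a brief preview of the kernel density approach (treated in more detail in Section 4.2.2), SciPy's gaussian_kde provides a simple non-parametric density estimate for a univariate output sample; the data below are illustrative.

```python
# Sketch: kernel density estimate of an output sample.
import numpy as np
from scipy import stats

y = np.random.default_rng(6).normal(size=500) ** 2   # illustrative sample
kde = stats.gaussian_kde(y)
grid = np.linspace(y.min(), y.max(), 200)
density = kde(grid)                                  # estimated pdf on the grid
```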
When dealing with a large number of dependent random variables, uncertainty propagation can often be simplified by finding a more compact representation of the high-dimensional random vector. If the original high-dimensional set can be well approximated by a lower-dimensional set, then the problem of density estimation need only be concerned with the lower-dimensional set. In probability and statistics, the canonical approach to finding a compact representation of a high-dimensional random vector is principal component analysis; this approach is outlined below in Section 4.2.3.
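A minimal sketch of this idea, using an eigendecomposition of the sample covariance matrix and an illustrative rule of retaining 95% of the variance (the data, threshold, and variable names are assumptions made for illustration only):

```python
# Sketch: principal component analysis for dimension reduction prior to
# density estimation.
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 10)) @ rng.normal(size=(10, 10))   # correlated data

X_c = X - X.mean(axis=0)
S = np.cov(X_c, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.95) + 1)     # smallest k capturing 95%
scores = X_c @ eigvecs[:, :k]                     # lower-dimensional representation
print("retained components:", k)
```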