Figure 3.2: Gaussian Process curve fit to sample points from a hypothetical rangefinder. Sample points are given by the red circles, with the mean function of the Gaussian Process in blue and the variance envelope in gray.

Figure 3.3: Three Gaussian processes for the same data set. Crosses represent the prior data and the smooth line shows the corresponding Gaussian process, with the gray region showing the 95% confidence interval. (a) has the optimal set of hyperparameters, (b) has the length scale set too small, and (c) has the length scale set too high [32].
the dimension to a more manageable lower dimension, often called the latent space. This is often the case when the intrinsic dimensionality of the problem is likely to be less than the dimension of the data, so that there are redundancies in the data across its dimensions. An example of this might be a warped plane in three dimensions: if the exact nature of the warping is known, any analysis can be carried out more easily in two dimensions and the results then warped back into three dimensions. A number of dimensionality reduction techniques exist for different types of problems.
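As a rough, hypothetical illustration of such redundancy, the short NumPy sketch below generates a warped plane: three-dimensional observations that possess only two intrinsic degrees of freedom. The warping function, sample count, and variable names are arbitrary choices for illustration only.

```python
import numpy as np

# A hypothetical 2D latent sheet (u, v) warped into 3D: the data occupy three
# coordinates, but only two degrees of freedom are actually present.
rng = np.random.default_rng(0)
u, v = rng.uniform(-1.0, 1.0, size=(2, 500))           # intrinsic 2D coordinates
Y = np.column_stack([u, v, np.sin(np.pi * u) * v**2])  # an arbitrary smooth warping into 3D

print(Y.shape)  # (500, 3): 3D observations with 2D intrinsic dimensionality
```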
3.2.1 Existing Dimensionality Reduction Techniques
Principal Component Analysis (PCA) is a linear dimensionality reduction technique that finds the directions of highest variance within a multidimensional data set. PCA does this by decomposing the data into an orthogonal basis ordered by the magnitude of variance along each axis [37]. This works well for problems where the data is linear, but is less effective when the data is non-linear.
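A minimal sketch of this decomposition is given below, using the singular value decomposition of the centered data as a stand-in for an explicit eigendecomposition of the covariance matrix; the function name and dimensions are illustrative.

```python
import numpy as np

def pca(Y, q):
    """Linear dimensionality reduction: project centered data onto the
    q orthogonal directions of highest variance (a standard PCA sketch)."""
    Yc = Y - Y.mean(axis=0)                    # center the data
    # Right singular vectors give the orthogonal basis, ordered by the
    # variance explained along each axis.
    U, S, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Yc @ Vt[:q].T, Vt[:q]               # low-dimensional scores and basis

# e.g. X2, basis = pca(Y, q=2) for the warped-plane data sketched above
```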
Isomap [40] is another dimensionality reduction technique, one that can accurately model some non-linear data. It works by computing distances between points and forming a graph that connects points based on k-nearest neighbors or a fixed radius. The resulting graph distances are used in a multi-dimensional scaling algorithm [29] to provide the dimensionality reduction; in effect, Isomap finds the geodesics on the high dimensional set. When using time series data from a dynamical system, subsequent points can be assumed to be connected, providing some additional information. However, Isomap is sensitive to noise, especially in initial conditions [2]. Given slightly different initial conditions, highly similar processes may lead to significantly different manifolds (in shape and extent) and to latent mappings that exhibit little similarity. This limits the applicability of Isomap to nondeterministic physical systems.
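The pipeline described above can be sketched roughly as follows, assuming the neighborhood-graph and shortest-path utilities from scikit-learn and SciPy; the final step is classical multidimensional scaling, and the neighborhood size is an illustrative choice.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def isomap(Y, q=2, k=10):
    """Sketch of the Isomap pipeline: k-NN graph, graph-shortest-path
    approximation of geodesic distance, then classical MDS."""
    # 1. Connect each point to its k nearest neighbors; edge weight = Euclidean distance.
    G = kneighbors_graph(Y, n_neighbors=k, mode='distance')
    # 2. Geodesic distances approximated by shortest paths through the graph.
    D = shortest_path(G, method='D', directed=False)
    # 3. Classical MDS: double-center the squared distances, take the top eigenvectors.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D**2) @ J
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:q]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```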
Kernel PCA is a non-linear approach similar to PCA in which the data is first mapped to a higher-dimensional space where points can be easier to separate [34]. PCA is then performed in this higher-dimensional space to give the final non-linear low-dimensional representation.
This is accomplished using a kernel method similar to that of a support vector machine, which avoids explicitly mapping to the higher-dimensional space by defining a kernel function that represents an inner product in that space. While Kernel PCA is capable of handling non-linear structure within a data set and performs well for data clustering, it is not particularly applicable to the regressions necessary to approximate dynamical systems.
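A rough sketch of kernel PCA with an RBF kernel is shown below; the kernel matrix plays the role of inner products in the implicit higher-dimensional space, and the kernel choice and gamma value are assumptions made for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

def kernel_pca(Y, q=2, gamma=1.0):
    """Sketch of kernel PCA with an RBF kernel: the kernel matrix stands in
    for inner products in the (implicit) higher-dimensional feature space."""
    K = np.exp(-gamma * cdist(Y, Y, 'sqeuclidean'))   # kernel trick: no explicit feature map
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                                    # center the data in feature space
    w, V = np.linalg.eigh(Kc)
    idx = np.argsort(w)[::-1][:q]                     # leading principal components
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 1e-12))
```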
3.2.2 GP-LVM Optimization
Gaussian Processes can be used for dimensionality reduction as well, by treating the change in dimension as a mapping from the low dimensional latent space X to the high dimensional space Y. The mapping is chosen from low to high so that it is possible to have an injective (one-to-one) mapping between the two spaces; if the mapping were performed from high to low, it would not necessarily be possible to ensure this constraint. The low dimensional
space X is then taken as an input to a Gaussian Process with output Y, and the marginal likelihood of a given X can be found from the following equation taken from Lawrence [20]:
$$
p(Y \mid X, \alpha) = \frac{1}{(2\pi)^{DN/2}\,|K|^{D/2}}\, e^{-\operatorname{trace}\left(K^{-1} Y Y^\top\right)/2} \tag{3.12}
$$
Here D is the dimension of the data set Y, and N is the number of data points. The covariance matrix K is constructed over the latent variables using the covariance function, k(x, x'). Maximizing this likelihood with respect to X gives a latent space that best reproduces the data set Y. The hyperparameters are also included in the maximization to ensure that the mapping between the two spaces is as accurate as possible. The resulting Gaussian Process Latent Variable Model (GP-LVM) is a non-linear probabilistic dimensionality reduction method that can be used in a wide range of applications [20].
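For reference, a sketch of this marginal likelihood in log form is given below, assuming the RBF covariance k(x, x') = α1 exp(−α2/2 ‖x − x'‖²) that is consistent with the hyperparameter derivatives in (3.18)–(3.19); the small jitter term is an added numerical-stability device, not part of the model.

```python
import numpy as np
from scipy.spatial.distance import cdist

def covariance(X, alpha1, alpha2):
    """RBF covariance over the latent points,
    k(x, x') = alpha1 * exp(-alpha2/2 * ||x - x'||^2)."""
    return alpha1 * np.exp(-0.5 * alpha2 * cdist(X, X, 'sqeuclidean'))

def log_marginal_likelihood(X, Y, alpha1, alpha2, jitter=1e-6):
    """ln p(Y | X, alpha) from equation (3.12)."""
    N, D = Y.shape
    K = covariance(X, alpha1, alpha2) + jitter * np.eye(N)  # jitter: numerical stability only
    sign, logdetK = np.linalg.slogdet(K)
    return (-0.5 * D * N * np.log(2 * np.pi)
            - 0.5 * D * logdetK
            - 0.5 * np.trace(np.linalg.solve(K, Y @ Y.T)))
```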
The optimization itself is typically performed using a scaled conjugate gradient method, initialized using some other approach such as Principal Component Analysis or graph diffusion. The negative log likelihood is taken and minimized to simplify computations:
$$
\begin{aligned}
L(X) &= -\ln\!\left(\frac{1}{(2\pi)^{DN/2}\,|K|^{D/2}}\, e^{-\operatorname{trace}\left(K^{-1} Y Y^\top\right)/2}\right) && (3.13)\\
&= -\ln\!\left(\frac{1}{(2\pi)^{DN/2}\,|K|^{D/2}}\right) - \left(-\frac{\operatorname{trace}\left(K^{-1} Y Y^\top\right)}{2}\right) && (3.14)\\
&= \ln\!\left((2\pi)^{DN/2}\right) + \frac{D}{2}\ln|K| + \frac{1}{2}\operatorname{trace}\!\left(K^{-1} Y Y^\top\right) && (3.15)
\end{aligned}
$$
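A minimal sketch of this minimization is shown below, reusing the pca and log_marginal_likelihood functions from the earlier sketches; a generic conjugate-gradient optimizer with fixed hyperparameters stands in here for the scaled conjugate gradient method, so this illustrates the structure rather than the full procedure.

```python
import numpy as np
from scipy.optimize import minimize

def negative_log_likelihood(X, Y, alpha1, alpha2):
    """L(X) from equation (3.15); minimizing this maximizes (3.12)."""
    return -log_marginal_likelihood(X, Y, alpha1, alpha2)

def fit_gplvm(Y, q=2, alpha1=1.0, alpha2=1.0):
    """Minimal GP-LVM sketch: initialize the latent space with PCA, then
    minimize the negative log likelihood over the latent coordinates."""
    N = Y.shape[0]
    X0, _ = pca(Y, q)                                    # PCA initialization
    objective = lambda x: negative_log_likelihood(x.reshape(N, q), Y, alpha1, alpha2)
    # Analytic gradients (3.16)-(3.19) would normally be supplied to the optimizer.
    res = minimize(objective, X0.ravel(), method='CG')
    return res.x.reshape(N, q)
```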
For optimization, the gradient of this likelihood must be taken with respect to the latent variables X and the hyperparameters α. This gradient is most easily broken into two parts, the first being the gradient of L with respect to K:
$$
\frac{dL}{dK} = \frac{D}{2} K^{-1} - \frac{1}{2} K^{-1} Y Y^\top K^{-1} \tag{3.16}
$$
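This term might be computed directly as follows (a sketch; the explicit matrix inverse is used for clarity rather than numerical robustness):

```python
import numpy as np

def dL_dK(K, Y):
    """Gradient of the negative log likelihood with respect to the
    covariance matrix K, equation (3.16)."""
    D = Y.shape[1]
    Kinv = np.linalg.inv(K)
    return 0.5 * D * Kinv - 0.5 * Kinv @ (Y @ Y.T) @ Kinv
```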
The final gradient with respect to latent variables and hyperparameters can then be found through the chain rule:
"
∂K
∂Xij
#
mn
=
α2 (xnj−xij)k(xi,xn), for m=i, α2 (xmj −xij)k(xm,xi), for n=i, 0, otherwise
(3.17)
∂K
∂α1
= K
α1 (3.18)
∂K
∂α2 = −12dist2(X)K (3.19)
Details on these derivations can be found in the Appendix.
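As a rough illustration of how (3.16)–(3.19) combine through the chain rule, the sketch below reuses the covariance and dL_dK functions from the earlier sketches. The closed form for the latent-variable gradient follows from summing (3.17) against dL/dK and using the symmetry of K; all names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def gplvm_gradients(X, Y, alpha1, alpha2):
    """Chain-rule gradients of L with respect to the latent variables and
    hyperparameters, combining (3.16) with (3.17)-(3.19)."""
    K = covariance(X, alpha1, alpha2)
    G = dL_dK(K, Y)
    sqdist = cdist(X, X, 'sqeuclidean')

    # Hyperparameter gradients: dL/dalpha = sum over entries of (dL/dK) * (dK/dalpha).
    dK_da1 = K / alpha1                      # equation (3.18)
    dK_da2 = -0.5 * sqdist * K               # equation (3.19), elementwise product
    dL_da1 = np.sum(G * dK_da1)
    dL_da2 = np.sum(G * dK_da2)

    # Latent-variable gradient: summing (3.17) against dL/dK and using the
    # symmetry of K and G collapses the case analysis to a single expression.
    A = G * K                                # elementwise
    dL_dX = 2.0 * alpha2 * (A @ X - A.sum(axis=1, keepdims=True) * X)
    return dL_dX, dL_da1, dL_da2
```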