Multivariate and relational learning is de facto standard in time series analysis across many domains. The first contribution of this thesis is a general framework for modeling multiple time series that provides descriptive relationships between them.
Thesis scope
Compositionality can be understood as the process of building structures from the small and simple to the complex and rich. In machine learning, composition appears, for example, in natural language processing, where sentences are built from words.
Challenges
First, multiple time series are considered the primary target data upon which an automated machine learning framework is built, resulting in interpretable patterns. Finally, this thesis extends the theoretical study of a deep hierarchical version of the Gaussian process [Damianou and Lawrence, 2013].
The contributions of this thesis
A general framework for modeling covariance structure in multiple time series
To address this, Chapter 3 of this thesis proposes Semi-Relational Kernel Learning, which maintains the global kernel assumption, as in RKL, but allows an individual kernel function for each time series. The responsibility of this individual kernel function is to fit the residual that the global kernel function cannot capture in each time series.
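As a rough illustration of this design (our own sketch; the exact formulation appears in Chapter 3), the covariance of the n-th time series can be read as an additive combination of the shared and the individual kernel:

\[
k_n(x, x') \;=\; \underbrace{k_g(x, x')}_{\text{shared across all series}} \;+\; \underbrace{k^{(n)}(x, x')}_{\text{residual of series } n},
\]

so the shared part k_g carries the interpretable global description while each k^(n) absorbs whatever k_g cannot explain in series n.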
A scalable method for learning compositional kernel functions
The output of this model still benefits from the global kernel assumption, so the discovered kernel structure captures the informative description shared between time series. More importantly, the question is how to make a model automatically return the relationships between time series instead of fixing a relationship assumption in advance.
A theoretical understanding of the extension to deep Gaussian processes
The outline of this thesis
Publication notes
Experiments further confirm that the proposed model outperforms advanced Gaussian process methods in extrapolation tasks. This chapter presents an introduction to the Gaussian process (GP), starting from the weight-space view and then transitioning to the function-space view.
Function space
With this hypothesis, the mean and covariance of function evaluations f(x) can be obtained directly, as sketched below. That is, one can obtain the same posterior if the dot product between two feature vectors is replaced by a kernel function.
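A minimal sketch of that computation in the weight-space view (standard material following Rasmussen and Williams; the symbols φ and Σ_p are assumed here rather than taken from this chapter): with f(x) = φ(x)^⊤ w and w ∼ N(0, Σ_p),

\[
\mathbb{E}[f(\mathbf{x})] = \boldsymbol{\phi}(\mathbf{x})^{\top}\mathbb{E}[\mathbf{w}] = 0,
\qquad
\operatorname{cov}\!\big(f(\mathbf{x}), f(\mathbf{x}')\big)
= \boldsymbol{\phi}(\mathbf{x})^{\top}\Sigma_p\,\boldsymbol{\phi}(\mathbf{x}')
=: k(\mathbf{x}, \mathbf{x}'),
\]

which is exactly why replacing the feature dot product by a kernel function leaves the posterior unchanged.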
Covariance function
That is, the kernel function is invariant when swapping the two inputs in the pair, i.e. k(x, x') = k(x', x).
Periodic kernel function
The periodic (Per) kernel function is constructed based on the idea of folding the input x into a new feature space u(x) = [cos(x), sin(x)]^⊤.
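A short worked step behind this construction (standard, assuming the usual periodic-kernel derivation): the folded features depend on the inputs only through their difference,

\[
u(x)^{\top} u(x') = \cos x \cos x' + \sin x \sin x' = \cos(x - x'),
\qquad
\|u(x) - u(x')\|^{2} = 2 - 2\cos(x - x') = 4\sin^{2}\!\Big(\frac{x - x'}{2}\Big),
\]

so applying an SE kernel to u yields the familiar form k_Per(x, x') = σ² exp(−2 sin²((x − x')/2) / ℓ²), up to the period parameterization.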
The Automatic Statistician System
The globally shared kernel function is the main target and is searched in the same way as in CKL. The role of this additional kernel function is to fit the residual between the data and the globally shared kernel function.
Review of Relational Kernel Learning
To handle variations in magnitude between these time series, RKL introduces scale and shift coefficients for each time series, which are optimized along with the kernel hyperparameters. Semi-relational kernel learning, the main contribution of this chapter, additionally introduces an individual kernel for each time series.
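To make the role of the RKL coefficients concrete, a minimal sketch (assuming a per-series scale a_n and shift b_n; the exact RKL parameterization may differ):

\[
y_n(t) \;\approx\; a_n f_n(t) + b_n,
\qquad
f_n \sim \mathcal{GP}\big(0, k_{\text{shared}}\big),
\]

so a_n and b_n absorb differences in magnitude and offset across series while the shared kernel captures the common structure; both are optimized jointly with the kernel hyperparameters.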
Semi-Relational Kernel Learning
The role of this distinctive kernel is to model the residual part of the time series, which can be thought of as the small branches and leaves. The choice of the SM kernel is natural, since its expressiveness makes it suitable for fitting the residual of any time series.
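For reference, the one-dimensional spectral mixture kernel of Wilson and Adams [2013] with Q components has the form (with τ = x − x')

\[
k_{\text{SM}}(\tau) \;=\; \sum_{q=1}^{Q} w_q \exp\!\big(-2\pi^{2}\tau^{2} v_q\big)\cos\!\big(2\pi\tau\mu_q\big),
\]

where w_q, μ_q, and v_q are the weight, spectral mean, and spectral variance of the q-th component; with enough components it can approximate any stationary covariance, which is what makes it suitable for absorbing residuals.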
Experimental Results
Data sets
[Figure: the NLLs of k_j and k_S coincide; (b) overfitting is observed from level 3 of the search grammar.]
The currency data set contains 4 exchange rates: Indonesian Rupiah (IDR), South African Rand (ZAR), Russian Ruble (RUB), and Malaysian Ringgit (MYR). A key observation in this data set is that the financial market fluctuated widely from late September 2015 to early October 2015.
Quantitative evaluations
However, SRKL maintains high BIC scores due to the number of hyperparameters in the SM kernel. The test sets for the inventory, housing, and currency data contain the following 14 days, 13 months, and 13 days of data, respectively. According to Table 3.1, SRKL performs better on most data sets although it has higher BIC scores and NLLs.
Qualitative Comparisons
This is because SRKL overcomes the underfitting problem in RKL by having an SM kernel to complement the common covariance.
Related work and final remark
Using such properties of compositional structure, this chapter proposes a core compositional framework that generates interpretable outputs with improved predictive performance for multiple time series. The model presented in this chapter offers a new way to understand multiple time series by analyzing IBP latent features and interpretable kernel functions. That is, the output of the model provides the relationships between time series with human-readable descriptions.
Latent Kernel Model
- Indian Buffet Process
- Model definition
- Properties
- Inference algorithm
Here, at the d-th time step indexed by t_d, the value of the n-th time series is x_nd. Projected to the multiple time series setting, the decomposition over the same period may differ for each time series. The approach in the previous chapter relies heavily on a globally shared kernel for all time series and allows a characteristic kernel C_n for each time series.
Model discovery in multiple time series
We emphasize that PSE with LKM considers a larger number of kernel structures than CKL does. All possible search candidates in CKL amount to O(RK2L + R2K) kernels, while PSE incorporated with LKM considers O(K2R2L + K) kernels. When the maximum number of grammar rules per substructure is R, the total number of candidates at depth d + 1 is O(RK2L + R2K).
Experimental evaluations
Real-world time series data
Strongly Correlated Data Sets
Three data sets are considered: US stock prices, US housing markets, and currency exchanges. These data sets, described in the previous chapter, contain multiple time series that are highly correlated with each other.
Heterogeneous data set
To emphasize the ability of LKM to handle a more general setting, time series are collected from several domains.
Qualitative results
We show that the latent matrix Z represents certain relationships between time series, with additional information coming from the kernel interpretation. Given a set of N time series, the output of our model can generate N² reports comparing each pair of time series. The remaining plots contain shared components and individual components, with descriptions and posterior functions for each time series.
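A schematic reading of how Z enters (our own sketch, assuming the IBP-style construction described above; the exact model is defined earlier in this chapter): the binary entry Z_nk switches the k-th shared kernel component on or off for series n,

\[
k_n(x, x') \;=\; \sum_{k} Z_{nk}\, k_k(x, x'),
\qquad
Z \sim \mathrm{IBP}(\alpha),
\]

so two series are related precisely when they share active components, and the pairwise reports verbalize which components those are.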
Quantitative results
Rather than assuming that all time series share a single global kernel, our model recognizes which structures are globally or only partially shared. The Spike-and-Slab and GPRN models outperform ABCD and R-ABCD on the currency data set, which contains highly volatile data. Although our model shares some computational procedures with ABCD and R-ABCD, it is more robust in handling different types of time series data.
Related work and final remark
Variational Sparse Gaussian process
The history of sparse Gaussian process methods dates back to the work of Snelson and Ghahramani [2006]. The central idea of sparse Gaussian processes is to introduce pseudo-inducing points u, which are jointly Gaussian with the latent function values f. Variational inference maximizes the evidence lower bound (ELBO) [Hensman et al., 2013], written below.
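A compact form of that bound, as it is usually written for a factorized likelihood (y_i are observations, f_i the latent function values, u the inducing variables):

\[
\log p(\mathbf{y}) \;\ge\; \mathcal{L}
\;=\; \sum_{i=1}^{N} \mathbb{E}_{q(f_i)}\big[\log p(y_i \mid f_i)\big]
\;-\; \mathrm{KL}\big[q(\mathbf{u}) \,\|\, p(\mathbf{u})\big],
\]

where q(u) is a Gaussian variational distribution and q(f_i) = ∫ p(f_i | u) q(u) du; the sum over data points is what makes stochastic (mini-batch) optimization possible.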
Shrinkage prior
Horseshoe prior
The horseshoe prior [Carvalho et al., 2009] provides a way to sample a sparse vector β. On the other hand, the infinite spike at the origin keeps the weights w_i close to zero. Compared to the horseshoe prior, the spike-and-slab prior incurs a significant computational burden as the dimension of the sparse vector increases.
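For concreteness, the horseshoe generative procedure of Carvalho et al. [2009] for a sparse vector β reads

\[
\beta_i \mid \lambda_i, \tau \;\sim\; \mathcal{N}\big(0, \lambda_i^{2}\tau^{2}\big),
\qquad
\lambda_i \;\sim\; \mathcal{C}^{+}(0, 1),
\qquad
\tau \;\sim\; \mathcal{C}^{+}(0, \tau_0),
\]

where C⁺ denotes the half-Cauchy distribution; the heavy-tailed local scales λ_i let individual coefficients escape shrinkage, while the global scale τ pulls the bulk of them toward zero.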
Kernel selection with shrinkage prior
Kernel selection with Horseshoe prior
That is, we add the covariance term k_i(x, x') to the step of sampling β_i in the horseshoe generative procedure. The sparseness assumption over the kernel functions k_i encourages simple kernels, consistent with model selection principles such as Occam's razor [Rasmussen and Ghahramani, 2001] and the BIC used in Lloyd et al. [2014].
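One way to read this construction (a hedged sketch of our understanding rather than the exact definition used in this chapter): each candidate kernel k_i receives its own horseshoe-scaled component,

\[
f_i \;\sim\; \mathcal{GP}\big(0, \tau^{2}\lambda_i^{2} k_i\big),
\qquad
f \;=\; \sum_i f_i
\quad\Longleftrightarrow\quad
k(x, x') \;=\; \sum_i \tau^{2}\lambda_i^{2}\, k_i(x, x'),
\]

so kernels whose scales τ²λ_i² are shrunk toward zero effectively drop out of the composition, realizing kernel selection in the spirit of Occam's razor.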
Multi-inducing sparse Gaussian process
The posterior obtained by our approach is close to that of the true model as well as that of the full GP model. From the posterior mean and covariance in Equation (5.1.1), we recover the same formula as in Equation (5.2.3). Following Burt et al. [2019] and applying Markov's inequality, the stated bound holds with probability at least 1 − δ.
Variational inference with shrinkage prior
Detail of variational inference
In particular, since the product τ²λ_i² is also log-normal, it can be reparameterized as exp(μ_τ + μ_λi + ε(σ_τ + σ_λi)) with ε ∼ N(0, 1). We intentionally do not write the explicit forms of H[q(φ_τ)], H[q(φ_λi)], E_q(φ_τ)[log p(φ_τ)], and E_q(φ_λi)[log p(φ_λi)], because the variables φ_τ and φ_λ are not optimized directly but are updated as follows.
Closed-form update for q(φ_τ) and q(φ_λi)
Under the mean-field assumption on the variational variables τ, λ, φ_τ, φ_λ, we can obtain the closed-form optimal solutions with respect to these variables.
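For the reparameterization step above, a generic sketch in Python (variable names are illustrative and not taken from the thesis' implementation); it samples τ² and λ_i² from independent log-normals and multiplies them, so the product remains log-normal and gradients flow through the variational parameters:

    import numpy as np

    def sample_tau2_lambda2(mu_tau, sigma_tau, mu_lam, sigma_lam,
                            rng=np.random.default_rng()):
        """Reparameterized draw of the product tau^2 * lambda_i^2.

        mu_*, sigma_* are the log-normal variational parameters; the only
        randomness enters through eps, so the sample is differentiable
        with respect to the variational parameters.
        """
        eps_tau = rng.standard_normal()
        eps_lam = rng.standard_normal(size=np.shape(mu_lam))
        tau2 = np.exp(mu_tau + sigma_tau * eps_tau)    # tau^2 ~ LogNormal
        lam2 = np.exp(mu_lam + sigma_lam * eps_lam)    # lambda_i^2 ~ LogNormal
        return tau2 * lam2                             # product is log-normal

The thesis writes the product with a single noise variable ε; the sketch uses two independent draws, which is the standard choice under a mean-field variational family.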
Experimental Evaluations
Kernel function pool
We compare our model with SVGP without a prior and SVGP with Softmax [Teng et al., 2020]. Our model outperforms the alternatives in terms of root mean square error (RMSE) and test negative log-likelihood (NLL). In the earlier regression task, our model performs poorly on the concrete data set, as first-order additive kernels fit these data best according to Duvenaud et al. [2011].
Related work and conclusion
The deep Gaussian process (DGP) [Damianou and Lawrence, 2013] is a promising new class of models constructed from a hierarchical composition of Gaussian processes. Another work [Dunlop et al., 2018] studies the ergodicity of the Markov chain to explain the pathology. The asymptotic property of the recurrence relation of this quantity between two consecutive layers determines the existence of the pathology in a very deep model.
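A common way to write a DGP with L layers (standard notation, not specific to this chapter) is as a composition of GP-distributed maps,

\[
f^{(L)}(\mathbf{x}) \;=\; \big(f_L \circ f_{L-1} \circ \cdots \circ f_1\big)(\mathbf{x}),
\qquad
f_\ell \sim \mathcal{GP}\big(0, k_\ell\big) \ \text{independently for each layer } \ell,
\]

so each layer's output serves as the input of the next, which is the hierarchical structure whose pathology is analyzed below.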
Moment-generating function of distance quantity
A useful property of the chi-squared distribution is that the moment-generating function of Z_n | f_{n−1} can be written in analytical form for t < 1/2. We will see that the expectation of the distance quantity Z_n is calculated via a kernel function which, in most cases, involves powers of its input. Given that the kernel input follows a χ² distribution, the moment-generating function is a suitable tool for obtaining the desired expectations.
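The property in question is standard: for Z ∼ χ²_k with k degrees of freedom,

\[
M_Z(t) \;=\; \mathbb{E}\big[e^{tZ}\big] \;=\; (1 - 2t)^{-k/2},
\qquad t < \tfrac{1}{2},
\]

so any conditional expectation of the form E[exp(t Z_n) | f_{n−1}] arising from an exponential-of-distance kernel can be written down in closed form.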
Analyzing dynamic systems with chaos theory
Recurrence plots or bifurcation plots have been used to analyze the behavior of chaotic systems. The logistic map is used to describe the characteristics of a system that models a population. Looking at the parameter r, the plot reveals the state of the system: the population is extinct (0 < r < 1), stable (1 < r < 3), or fluctuating (r > 3.4).
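A small sketch of the logistic map under discussion (the regime thresholds quoted above, e.g. r > 3.4, are taken from the text):

    import numpy as np

    def logistic_map_tail(r, x0=0.5, n_iter=500, n_keep=50):
        """Iterate x_{t+1} = r * x_t * (1 - x_t); return the last n_keep states.

        The tail of the trajectory reveals the long-run behavior:
        extinction (x -> 0), a stable fixed point, or oscillation/chaos.
        """
        x = x0
        for _ in range(n_iter - n_keep):
            x = r * x * (1.0 - x)
        tail = []
        for _ in range(n_keep):
            x = r * x * (1.0 - x)
            tail.append(x)
        return np.array(tail)

    for r in (0.5, 2.5, 3.5):   # extinct, stable, and oscillating regimes
        print(r, np.unique(np.round(logistic_map_tail(r), 4)))

Plotting these tails over a fine grid of r values produces the bifurcation plot referred to here.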
Squared exponential kernel function
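Throughout this section the squared exponential kernel is assumed to take its standard form, with lengthscale ℓ and signal variance σ²,

\[
k_{\text{SE}}(\mathbf{x}, \mathbf{x}') \;=\; \sigma^{2}\exp\!\Big(-\frac{\|\mathbf{x} - \mathbf{x}'\|^{2}}{2\ell^{2}}\Big),
\]

whose exponential-of-squared-distance structure is what allows the χ²-based moment-generating function above to be applied directly.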
Substituting s_{n−1} into Equation (6.4.4) and applying the law of total expectation to Z_{n−1}, we obtain the recurrence relation between layer n−1 and layer n. If we instead construct the recurrence relation based on Dunlop et al. [2018], E[Z_n] is bounded accordingly.
A guideline for obtaining a recurrence ratio
Given a specific kernel function, one can follow these steps to obtain the corresponding recurrence ratio: (1) consider the form of the kernel input, which can be distributed according to either the chi-squared distribution or one of its variants (presented in the following sections); (2) check whether the kernel function can be represented in a form for which the statistical properties of the kernel inputs are known; (3) verify the convexity of the resulting function after choosing a proper representation (as we connected the expectation with Jensen's inequality in the proof of Theorem 6.4.1).
Cosine kernel function
As the third step in the guideline, we perform a sanity check on the convexity of the confluent hypergeometric function ₁F₁.
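Here ₁F₁ denotes the confluent hypergeometric function, whose standard series definition, with (a)_n the Pochhammer symbol, is

\[
{}_1F_1(a; b; z) \;=\; \sum_{n=0}^{\infty} \frac{(a)_n}{(b)_n}\,\frac{z^n}{n!},
\qquad
(a)_n = a(a+1)\cdots(a+n-1),
\]

so the sanity check amounts to verifying convexity of this series in z on the relevant domain.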
Periodic kernel function
Rational quadratic kernel function
Spectral mixture kernel
Extension to non-pathological cases
Analysis of recurrence relations
Identify the pathology
Rate of convergence
Experimental results
- Correctness of recurrence relations
- Justifying the conditions of pathology
- Using recurrence relations in DGPs
- High-dimensional data set with zero-mean DGPs
We trained our models on the Boston housing data set of Dheeru and Karra Taniskidou [2017b] and the diabetes data set of Efron et al. [2004]. We provide detailed figures and an additional result on the diabetes data set, with a similar observation, in Figure 6.20. We test on the MNIST data set [LeCun and Cortes, 2010] with both models, as in the previous experiments.
Future work
One of the main contributions of this thesis is a framework that automatically discovers relational structures in multiple time series. The development of this framework takes place in two models: semi-relational kernel learning and the latent kernel model. Another contribution of this thesis is scalable inference for Gaussian process models whose kernel functions are additive compositional kernels.
Conclusion
The deep Gaussian process is a remarkable extension of the Gaussian process obtained by stacking Gaussian process layers. Despite the expressiveness of this new model, many interesting open questions remain in establishing its foundations: how the model behaves when introducing a compositional kernel or injecting inductive biases into it, and practical inference algorithms for selecting compositional kernels in deep Gaussian processes.
Standard zero-mean DGPs: Results of Boston housing data set
Constrained DGPs: Results of Boston housing data set
Standard zero-mean DGPs: Results of diabetes data set
Constrained DGPs: Results of diabetes data set
RMSEs and NMLPs for each data set with corresponding methods (5 independent runs)
Recently, inspired by deep neural networks, Sun et al. [2018] proposed a differentiable extension of compositional kernel learning. We consider the baseline models GP-NKN [Sun et al., 2018], SVGP without a shrinkage prior on w (no prior), and SVGP with an SE kernel (SVGP-SE). We perform the experiments on three data sets, heart, liver, and pima [Duvenaud et al., 2011], for the classification task.
There is a known pathology whereby increasing the number of layers degrades the learning power of DGPs [Duvenaud et al., 2014].
Discussion
Note that the relation between E[Z_n] and E[Z_{n−1}] yields a tighter bound than existing work [Dunlop et al., 2018].
Extrapolation performance in UCI benchmarks. Results are aggregated from 10 independent runs.
Description of UCI data sets
Description of heart, liver, pima data set
Classification error (in %) on three data sets