
Learning Basis Functions of Deep Spatio-Temporal Neural Networks with Covariance Loss



A Gaussian prior over the parameters of a neural network induces a Gaussian prior over the functions the network represents. The purpose of this paper is to guide this Gaussian prior of the network to converge to the true distributions of the target variables. To demonstrate the validity of the proposed loss function, we experiment with two models, the Spatio-Temporal Graph Convolutional Network (STGCN) and Graph WaveNet (GWNET), on the popular benchmark datasets PeMSD7 and METR-LA.

From the Bayesian perspective on neural networks, assuming a Gaussian prior over the parameters of a linear regression unit results in a Gaussian prior over the output of the regression unit [5].
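To make this precise, the standard Bayesian linear regression identity behind this statement can be sketched as follows; the notation (feature map Φ, weight prior covariance Σ_p) is assumed for this summary rather than taken from the paper:

    f(x) = \Phi(x)^{\top} w, \qquad w \sim \mathcal{N}(0, \Sigma_p)
    \mathbb{E}[f(x)] = 0, \qquad \operatorname{cov}\big(f(x), f(x')\big) = \Phi(x)^{\top} \Sigma_p \, \Phi(x')

In other words, a zero-mean Gaussian prior over the weights induces a zero-mean Gaussian prior over the outputs whose covariance is determined entirely by the feature map, which is exactly the quantity Φ(x)^T Σ_P Φ(x′) that the proposed loss later compares to the covariance of the target values.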

Deep Neural Networks

The traditional convolution operator considers the spatial locations specified by filters to extract meaningful features from grid-shaped data structures. The graph Laplacian matrix specifies the neighbor relationships between nodes in a graph and how strong those relationships are. As shown in Equation II.2, with the graph Laplacian matrix, node i's signal, f(i), can be derived from the neighboring nodes' signals, f(j), which is analogous to the traditional convolution operation that considers spatial location to extract meaningful features.

Another interesting property of the graph Laplacian matrix is that it admits an eigendecomposition, L = UΛU^T, when the matrix is symmetric. With the definition of the normalized graph Laplacian matrix, L = I_N − D^{−1/2} A D^{−1/2}, and the further assumption that θ_0 ≈ θ_1, the authors of GCN introduce the well-known form of the graph convolution operation. With this mathematically well-established graph convolution operation, GCN achieves strong performance while remaining computationally efficient.
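As a minimal sketch of how this operation is typically implemented (using the renormalized adjacency with self-loops in place of the spectral form; the function and variable names are illustrative and not taken from the paper):

    import numpy as np

    def gcn_layer(A, X, W):
        # One graph convolution step: D^{-1/2} (A + I) D^{-1/2} X W followed by ReLU.
        # A: (N, N) adjacency matrix, X: (N, F_in) node signals, W: (F_in, F_out) weights.
        N = A.shape[0]
        A_hat = A + np.eye(N)                      # add self-loops
        d = A_hat.sum(axis=1)                      # node degrees
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # D^{-1/2}
        A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # normalized neighborhood operator
        return np.maximum(A_norm @ X @ W, 0.0)     # aggregate neighbor signals, project, ReLU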

STGCN extends GCN by stacking 1-dimensional convolutional layers and graph convolutional layers to extract features while considering temporal and spatial locality, as shown in Figure 4.1. STGCN consists of two ST-Conv blocks, in each of which two 1-dimensional gated linear unit (GLU) convolution layers sandwich a graph convolution layer, as shown in Figure 4.1. After extracting the temporal features, STGCN extracts the spatial features through a graph convolution operation that considers the signals of neighboring nodes determined by the graph Laplacian matrix.
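A simplified PyTorch-style sketch of one such ST-Conv block is given below; the layer sizes, the GLU gating, and the einsum-based neighbor aggregation are assumptions made for illustration and do not reproduce the exact STGCN configuration:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class STConvBlock(nn.Module):
        # Temporal gated conv -> graph conv -> temporal gated conv (simplified sketch).
        def __init__(self, c_in, c_hid, c_out, kernel_t=3):
            super().__init__()
            self.temp1 = nn.Conv2d(c_in, 2 * c_hid, (kernel_t, 1))   # 2x channels for GLU gating
            self.theta = nn.Linear(c_hid, c_hid)                     # graph-convolution weight
            self.temp2 = nn.Conv2d(c_hid, 2 * c_out, (kernel_t, 1))

        def forward(self, x, A_norm):
            # x: (batch, channels, time, nodes); A_norm: (nodes, nodes) normalized adjacency.
            x = F.glu(self.temp1(x), dim=1)                  # temporal features with GLU gating
            x = torch.einsum('nm,bctm->bctn', A_norm, x)     # mix signals of neighboring nodes
            x = torch.relu(self.theta(x.transpose(1, 3)).transpose(1, 3))
            return F.glu(self.temp2(x), dim=1)               # second temporal gated convolution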

Gaussian Processes

Here, the number of features in the basis function is the same as the width of the filters. For example, the squared exponential (SE) covariance function, σ² exp(−|x_p − x_q|²/(2l²)), specifies the covariance between pairs of random variables with two hyperparameters, σ and l, which control the variation and smoothness of the functions, respectively. For this particular covariance function, the covariance of the outputs depends on how close the inputs are.
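For concreteness, a small NumPy sketch of this covariance function is shown below; the vectorized distance computation and the parameter names are implementation choices, not prescribed by the text:

    import numpy as np

    def se_kernel(X1, X2, sigma=1.0, length=1.0):
        # k(x_p, x_q) = sigma^2 * exp(-|x_p - x_q|^2 / (2 * length^2))
        # X1: (n, d), X2: (m, d)  ->  (n, m) covariance matrix.
        sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return sigma ** 2 * np.exp(-sq_dists / (2.0 * length ** 2))

Nearby inputs produce covariances close to σ², while distant inputs produce covariances close to zero, which is the "covariance depends on how close the inputs are" behavior described above.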

The proposed spatio-temporal covariance loss function likewise exploits the covariance of the variables a DNN learns to achieve a more accurate model approximation. The performance of a GP is highly dependent on a carefully chosen or manually designed kernel function that defines the relationship between x and x′, and it can be degraded when predicting highly fluctuating variables. By penalizing the network's learning according to the discrepancy between Φ(x)^T Σ_P Φ(x′) and the covariance matrix of the target values, which we expect to approximate the true distributions of the target variables as the number of samples increases, we direct the network to learn the true kernel function.

That work proved that a single-layer neural network with an infinite number of hidden units becomes a non-parametric GP model when independent zero-mean Gaussian priors are placed on all network weights and the weights are integrated out. Subsequently, the last hidden layer of a pre-trained deep network can be interpreted as a Bayesian linear regression. Unlike standard GPs, DNN-GP avoids the inversion of the kernel matrix and therefore scales better with the size of the training set.

Spatio-Temporal Covariance Loss

For a given dataset, D = {(x_i, y_i) | i = 1, ..., T}, x_i ∈ R^N, y_i ∈ R^N, where T and N denote the size of the dataset D and the number of nodes in the graph, respectively, the prediction target is y_i = x_{i+τ}, where τ is the time interval of the prediction. For two given time series variables, v and w, we define the spatio-temporal covariance Σ̃_{v,w} such that Σ̃_{v,w} = (v − v̄)(w − w̄)^T, where v̄ and w̄ denote the averages of v and w. To use the spatio-temporal covariance during STGCN training, we additionally define a spatio-temporal covariance loss function.
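A minimal sketch of this definition, assuming that v and w are matrices of node signals of shape (N, T) and that the means are taken over the T time steps (the averaging axis is not spelled out in the text), could look as follows:

    import torch

    def spatio_temporal_cov(v, w):
        # Sigma_{v,w} = (v - v_bar)(w - w_bar)^T; entry (i, j) relates node i's signal in v
        # to node j's signal in w.  v, w: (N, T) tensors of node signals over T time steps.
        v_c = v - v.mean(dim=1, keepdim=True)
        w_c = w - w.mean(dim=1, keepdim=True)
        return v_c @ w_c.T                         # (N, N) spatio-temporal covariance matrix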

Thus, with the proposed Spatio-Temporal Covariance Loss, we measure the degrees of their relationships with the spatio-temporal covariance of y_i and y_j and the inner product of Φ_i(x) and Φ_j(x), as shown in Figure 4.1, and penalize the model by comparing the relations given in Equation IV.2. Finally, we add the Spatio-Temporal Covariance Loss to the overall mean squared error (MSE) loss of STGCN with a hyper-parameter λ that adjusts the importance of the Spatio-Temporal Covariance Loss, so that L = L_MSE + λ·L_Cov. Therefore, we would like to guide the output of the last hidden layer, representing the basis functions of STGCN, with the following principles.

In contrast, we expect that the basis functions can represent the variance and covariance of y with our proposed loss, which measures the difference between the spatio-temporal covariance Σ̃_y and ΦΦ^T. Unlike previous studies, our proposed loss function works on CNN-based networks and does not require any derivation of the GPs' covariance functions or matrices. Also note that we do not need to pre-train on the given data pairs (x, y) to find the basis functions.
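Putting the two terms together, a hedged sketch of the overall objective is given below, reusing spatio_temporal_cov from the previous snippet; the squared-error penalty on the matrix entries and the illustrative value of λ are assumptions, since the text only states that the discrepancy between Σ̃_y and ΦΦ^T is penalized with weight λ:

    import torch
    import torch.nn.functional as F

    def covariance_loss(phi, y):
        # phi: (N, F) basis-function outputs (last hidden layer, one row per node).
        # y:   (N, T) target signals of the N nodes over T time steps.
        sigma_y = spatio_temporal_cov(y, y)        # covariance of the target variables
        gram = phi @ phi.T                         # inner products of the basis functions
        return F.mse_loss(gram, sigma_y)           # penalize the discrepancy entrywise

    def total_loss(y_pred, y_true, phi, lam=0.1):
        # L = L_MSE + lambda * L_Cov, with lambda weighting the covariance term.
        return F.mse_loss(y_pred, y_true) + lam * covariance_loss(phi, y_true)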

Figure 4.1: An architecture of a Spatio-Temporal Graph Convolutional Network with Covariance Loss.

Experimental Evaluations

Regardless of whether the two variables are independent (left, nodes 52 and 129) or correlated (right, nodes 1 and 7), STGCN-Cov learns such basis functions successfully. The purpose of the proposed loss function is to guide a neural network to automatically learn the true kernel matrix of the target variables, which consists of the variances and covariances of the target variables. To achieve this goal, it is important to learn basis functions whose inner product is equivalent to the variances and covariances of the target variables.

In this section, we demonstrate the validity of the proposed loss function by comparing the inner product of the basis functions with the true variances and covariances of the target variables. For this set of experiments, we apply the proposed loss function to STGCN with the PeMSD7 dataset and analyze the characteristics of the basis functions. In contrast to STGCN, which learns basis functions whose inner product is unrelated to the variance of the target variable, STGCN-Cov successfully learns basis functions whose inner product is equivalent to the variance of the target variable.

In Figure 5.2, we present the covariance of two different target variables and the corresponding off-diagonal element of the basis inner product as a function of time. As shown in the figure, STGCN-Cov successfully learns basis functions whose inner products are equivalent to the covariance of the target variables, regardless of how strongly the two target variables are correlated.
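One simple way to reproduce this kind of check is sketched below: compare the diagonal and off-diagonal entries of the basis inner product ΦΦ^T with the empirical variances and covariances of the targets. The shapes and the unnormalized covariance estimator follow the earlier snippets (including spatio_temporal_cov) and are assumptions of this sketch:

    import torch

    def compare_with_targets(phi, y):
        # phi: (N, F) basis-function outputs; y: (N, T) target signals.
        gram = phi @ phi.T                      # basis inner products
        sigma_y = spatio_temporal_cov(y, y)     # target variances (diagonal) and covariances
        var_gap = torch.diag(gram - sigma_y)    # per-node variance mismatch (diagonal entries)
        cov_gap = gram - sigma_y - torch.diag(var_gap)   # remaining off-diagonal mismatch
        return var_gap, cov_gap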

Figure 5.1: Learning variance of target variables: with the proposed loss, the diagonal elements of the basis inner product (σ_{φ_i φ_i}) indicate the variances of the target variables.

Effect of the basis functions on PeMSD7 dataset

In contrast, the proposed loss function further constrains the basis functions to account for the temporal and spatial dependencies among themselves, and thus each basis function has a shape similar to the prediction results and differs only in scale, as shown in the right diagram of the figure. We also observe that, with STGCN-Cov, the distance between the best parameters for conflicting predictions at times t and t′ becomes smaller than with STGCN. Next, we analyze the prediction results of STGCN-Cov to see where the degraded MAE and MAPE come from.

Fig. 5.4 shows prediction results and the corresponding basis functions that report less accurate results than STGCN. We found that the large mispredictions in the figure result from dependencies extracted from the training dataset that are significantly different from those of the test dataset, as shown in Fig. 5.5. Considering that predictions from a 1-D FCN, which captures only temporal dependencies, report prediction errors quite similar to those of STGCN/STGCN-Cov, the temporal dependencies in the training dataset are the main reason for the large prediction error.

We can observe that the prediction results of STGCN-Cov are even worse than those of STGCN in these cases. The figure shows that the proposed loss function makes a model learn temporal and spatial dependencies more strongly; thus, in cases where the data distribution shifts significantly, it can worsen the overall accuracy.

Figure 5.4: Prediction and basis of STGCN (left) and STGCN-Cov (right) on Node 110 of PeMSD7

Effect of the basis functions on METR-LA dataset

We can see that GWNET-Cov successfully learns basis functions that can represent temporal and spatial dependencies of target variables. In addition, GWNET-Cov detects the end of the peak time much earlier than GWNET as shown in the figure.

Noise robustness

As shown in the figure, the prediction results of STGCN seriously deteriorate when the number of noisy nodes is more than 5, but STGCN-Cov still shows accurate predictions.

Figure 5.9: The effect of noisy nodes on prediction results

Case studies on the effect of noisy nodes

Comparisons with L1 and L2 regularizations

In this paper, we propose a new loss function, the Spatio-Temporal Covariance Loss, that guides a neural network to learn the underlying kernel function of the target values, i.e., the spatial and temporal dependencies of highly varying time series. With the proposed Spatio-Temporal Covariance Loss, neural networks learn basis functions whose inner product is equal to the kernel matrix. Our extensive sets of experimental evaluations have shown that the proposed loss function provides more accurate prediction results on real-world multivariate time series datasets.

[2] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein, "Deep Neural Networks as Gaussian Processes," arXiv preprint.
[3] Tamir Hazan and Tommi Jaakkola, "Steps Toward Deep Kernel Methods from Infinite Neural Networks," arXiv preprint.
[4] Roberto Calandra, Jan Peters, Carl Edward Rasmussen, and Marc Peter Deisenroth, "Manifold Gaussian Processes for Regression," in Proceedings of the International Joint Conference on Neural Networks.
[6] Youngmin Cho and Lawrence K. Saul, "Kernel Methods for Deep Learning," in Proceedings of Advances in Neural Information Processing Systems, 2009.
[7] Wenbing Huang, Deli Zhao, Fuchun Sun, Huaping Liu, and Edward Chang, "Scalable Gaussian Process Regression Using Deep Neural Networks," in Proceedings of the International Joint Conference on Artificial Intelligence, 2015.

Figure 5.10: RMSE comparison of STGCN with L1, L2 and Covariance loss for prediction step 3.

Figure 5.2: Learning covariance of target variables: off-diagonal elements of the basis inner product (σ²_{φ_i φ_j}) indicate the covariances of the target variables.
Table 5.1 shows prediction results of traffic speed over the next 15, 30, and 45 minutes.
