Multivariate time series forecasting, which has been studied for a long time, aims to predict future values from past and present observations. The Dual Self-Attention Network [30], unlike DA-RNN, predicts future values without exogenous information and feeds each series of the multivariate input individually into two parallel convolution components. The complex structure of time series data hinders value prediction and degrades the predictive performance of deep neural networks.
To evaluate this method, three deep temporal neural networks that represent the state of the art in time series forecasting are compared, with and without trend filtering features as input. Extensive experiments on real-world multivariate time series data show that the proposed method is effective and significantly outperforms the original methods.
Autoregressive Integrated Moving Average (ARIMA)
Deep Neural Network
Fully Convolutional Network (FCN)
Each layer of data in a convolutional network has three dimensions, h × w × d, where h and w denote the height and width of the image data and d denotes the number of features. A receptive field maps a location in a higher layer to the region of the lower layer it is connected to. However, for time series data, some basic components of the FCN must be modified to accept multivariate time series as input.
Since the data are not spatial but temporal, the kernel size of a layer is 1 × k, where k is the number of input time steps (window size) of the layer. Kernel size and stride compose between layers according to the transformation rule
$$f_{ks} \circ g_{k's'} = (f \circ g)_{k' + (k-1)s',\; ss'},$$
so an FCN naturally computes a non-linear filter that takes the input and produces the output.
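The composition rule can be checked numerically. The sketch below is an illustration rather than code from the paper; it computes the effective kernel size and stride of two stacked 1-D convolution layers.

```python
# A minimal sketch illustrating f_{ks} ∘ g_{k's'} = (f ∘ g)_{k' + (k-1)s', ss'}:
# stacking a kernel-size-k, stride-s layer on top of a kernel-size-k', stride-s'
# layer behaves like a single layer with the kernel and stride computed below.

def compose(k, s, k_prime, s_prime):
    """Effective (kernel, stride) of f_{ks} applied after g_{k's'}."""
    return k_prime + (k - 1) * s_prime, s * s_prime

# Example: a 1x3 conv with stride 1 stacked on a 1x5 conv with stride 2
# acts on the input like a single 1x9 filter with stride 2.
print(compose(3, 1, 5, 2))  # -> (9, 2)
```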
Dual-stage Attention-Based Recurrent Neural Network (DA-RNN)
The sequence-to-sequence architecture is essentially an RNN structure that encodes each input time series into a representation and decodes those representations to produce the output. The DA-RNN model uses LSTM units as both encoder and decoder to capture time-dependent information. Each LSTM layer of the encoder and decoder has a memory cell with a cell state $s_t$ and a hidden state $h_t$ at time $t$.
The memory cell is controlled by three sigmoid gates: the input gate $i_t$, the forget gate $f_t$, and the output gate $o_t$. The input attention mechanism in DA-RNN refers to the previous hidden state $h_{t-1}$ and cell state $s_{t-1} \in \mathbb{R}^m$ of the encoder LSTM layer. Using the previous decoder hidden state $d_{t-1} \in \mathbb{R}^p$ and cell state $s'_{t-1} \in \mathbb{R}^p$, the temporal attention in the decoder computes the attention weight of each encoder hidden state at time $t$.
The attention mechanism of DA-RNN then computes the context vector as the weighted sum of the temporal attention weights and the encoder hidden states $h_1, \ldots, h_T$, so a context vector is obtained at each individual time step. We did not use the context vector to generate the prediction output, since omitting it yielded a performance gain.
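As an illustration of the temporal attention described above, the following sketch (in PyTorch; the layer names and the additive scoring form are assumptions, not the authors' code) computes the attention weights from the previous decoder states and forms the context vector as a weighted sum of $h_1, \ldots, h_T$.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Sketch of DA-RNN-style temporal attention (dimensions m, p follow the text)."""
    def __init__(self, m, p):
        super().__init__()
        self.W = nn.Linear(2 * p, m)   # projects [d_{t-1}; s'_{t-1}]
        self.U = nn.Linear(m, m)       # projects each encoder hidden state h_i
        self.v = nn.Linear(m, 1)       # scores each time step

    def forward(self, enc_h, d_prev, s_prev):
        # enc_h: (batch, T, m); d_prev, s_prev: (batch, p)
        query = self.W(torch.cat([d_prev, s_prev], dim=-1)).unsqueeze(1)  # (batch, 1, m)
        scores = self.v(torch.tanh(query + self.U(enc_h)))                # (batch, T, 1)
        beta = torch.softmax(scores, dim=1)                               # temporal attention weights
        context = (beta * enc_h).sum(dim=1)                               # weighted sum of h_1..h_T
        return context, beta
```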
Dual Self-Attention Network (DSANet)
This scaled dot-product self-attention module, which computes the similarity between a query and each key to weight a set of values, is defined as $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d_k}\right)V$. The position-wise feed-forward layer, which consists of two linear layers with a ReLU activation, is defined as $\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)W_2 + b_2$. In addition, each sub-layer has a residual connection followed by layer normalization [46] to make training easier and improve generalization.
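A minimal sketch of these two sub-layers, assuming standard Transformer-style shapes rather than the exact DSANet configuration:

```python
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # similarity between queries and keys
    return torch.softmax(scores, dim=-1) @ v        # weighted sum of the values

class SelfAttentionBlock(nn.Module):
    """Self-attention + position-wise FFN, each with residual connection and layer norm."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                            # x: (batch, n_series, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        x = self.norm1(x + scaled_dot_product_attention(q, k, v))
        return self.norm2(x + self.ffn(x))
```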
To improve robustness, an autoregressive (AR) linear model is combined in parallel, because the output scale of the non-linear components is not sensitive to the scale of the input. Finally, the prediction of DSANet is obtained by combining a layer that summarizes the outputs of the two self-attention modules with the AR prediction.
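The parallel combination can be sketched as follows; the split into two module outputs and the per-series AR window are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class ParallelARHead(nn.Module):
    """Sketch: summarize the two self-attention outputs and add a linear AR prediction."""
    def __init__(self, d_global, d_local, window):
        super().__init__()
        self.dense = nn.Linear(d_global + d_local, 1)   # summarizes the two self-attention modules
        self.ar = nn.Linear(window, 1)                   # per-series linear autoregression

    def forward(self, global_out, local_out, x_window):
        # global_out/local_out: (batch, n_series, d_*); x_window: (batch, n_series, window)
        nonlinear = self.dense(torch.cat([global_out, local_out], dim=-1)).squeeze(-1)
        linear = self.ar(x_window).squeeze(-1)           # component sensitive to the input scale
        return nonlinear + linear                        # final prediction, (batch, n_series)
```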
L1 Trend Filtering
In addition, λ is a non-negative parameter used to trade off between the magnitude of the residual and the smoothness of x. Figure 5 shows that, after trend filtering, the fluctuating behavior of the original time series is stabilized and the filtered series simply follows the general trend between two adjacent nodes. Many experiments in this paper demonstrate the effect of trend filtering for three deep temporal neural network models.
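For reference, the standard $\ell_1$ trend filtering problem described by this trade-off (notation $y$, $x$, $D$, $\lambda$ follows the surrounding text; this is the usual formulation, reproduced here rather than copied from the paper) is
$$\min_{x \in \mathbb{R}^n} \; \frac{1}{2}\lVert y - x \rVert_2^2 \;+\; \lambda \lVert D x \rVert_1, \qquad (Dx)_i = x_i - 2x_{i+1} + x_{i+2},$$
where $y$ is the observed series, $x$ the estimated trend, and $D \in \mathbb{R}^{(n-2)\times n}$ the second-order difference matrix.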
The blue line is the original series and the red line is the optimal trend of the original time series. To derive a Lagrange dual from the primal problem of minimizing Eq. 16, we introduce a new variable z with the constraint $z = Dx$ to obtain the following formulation. The greater the slope difference between the previous and subsequent points, the greater the loss from the $\ell_1$ trend term ($z$), the second (smoothness) term of the objective.
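As a concrete illustration, the primal problem can be solved directly with a generic convex solver. The following sketch uses cvxpy with an illustrative λ; it is not the authors' implementation.

```python
import numpy as np
import cvxpy as cp

def l1_trend_filter(y, lam=50.0):
    """Solve min_x 0.5*||y - x||_2^2 + lam*||D x||_1 for the piecewise-linear trend x."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Second-order difference matrix D of shape (n-2, n): (Dx)_i = x_i - 2*x_{i+1} + x_{i+2}
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    x = cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.sum_squares(y - x) + lam * cp.norm1(D @ x))
    cp.Problem(objective).solve()
    return x.value
```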
As shown in Figure 7, a new framework is proposed that uses deep temporal neural networks with an additional trend filtering feature, which is the trend of the target value. Three deep temporal neural networks, which are state-of-the-art in predicting future time series, are used to compare models that receive the trend filtering feature as input against models that do not. The trend filtering feature (TF feature) is obtained by applying the trend filtering technique to the original target data and is used as additional input for predicting future values, as sketched below.
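A minimal sketch of this input construction (reusing the l1_trend_filter helper sketched above; the two-column feature layout is illustrative, not the paper's exact format):

```python
import numpy as np

def build_inputs(series, lam=50.0):
    """Stack the original target series with its l1-trend-filtered version (the TF feature)."""
    tf_feature = l1_trend_filter(series, lam=lam)     # trend of the target value
    return np.stack([np.asarray(series), tf_feature], axis=-1)  # shape (T, 2)
```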
In this section, five stock index datasets are described, and then the parameter settings of the proposed methods and evaluation metrics are presented.
Datasets
Finally, the performance of the proposed methods is compared with that of the original methods to demonstrate their effectiveness. The FTSE 100 index represents the share prices of the 100 largest companies by market capitalization listed on the London Stock Exchange (LSE). In this paper, the first 61,907 data samples are used as the training set and the next 6,879 data samples are used as the test set.
The TSX 60 index is a stock index of 60 large companies, ranked by market capitalization, listed on the Toronto Stock Exchange (TSX). In this paper, the first 49,404 data samples are used as the training set and the next 5,490 data samples are used as the test set.
Parameters and Investment Model Settings
In the FCN, every layer has 32 filters, and the kernel sizes are 7, 5, and 3 in the successive layers. Since the FCN does not perform well in these experiments, only the index value is used as input. To compare a model with the trend filtering feature against one without it, only the trend filtering feature is added to the index value.
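A configuration sketch consistent with the stated settings (32 filters, kernel sizes 7, 5, 3); the input-channel count, padding, pooling, and output head are assumptions for illustration.

```python
import torch.nn as nn

# Sketch of a 1-D FCN over inputs of shape (batch, channels, window);
# channels = 2 assumes the index value plus the TF feature.
fcn = nn.Sequential(
    nn.Conv1d(2, 32, kernel_size=7, padding=3), nn.BatchNorm1d(32), nn.ReLU(),
    nn.Conv1d(32, 32, kernel_size=5, padding=2), nn.BatchNorm1d(32), nn.ReLU(),
    nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.BatchNorm1d(32), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1),  # one-step index prediction
)
```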
To ensure that the predictions move in the desired direction, a simple investment model that buys or sells only one unit per trade is applied to the output of the regression model. The position is determined by comparing the last predicted value $\hat{y}_{T_o}$ with the actual value $y_T$.
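A minimal sketch of this investment rule; the tie-breaking and the return accounting are assumptions, not the paper's exact procedure.

```python
def decide_position(y_hat_next, y_last):
    """+1: buy one unit, -1: sell one unit, based on predicted vs. last actual value."""
    return 1 if y_hat_next > y_last else -1

def total_return(prices, predictions):
    """Accumulate the profit of one-unit trades over the test period."""
    ret = 0.0
    for t in range(len(prices) - 1):
        position = decide_position(predictions[t], prices[t])
        ret += position * (prices[t + 1] - prices[t])
    return ret
```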
Evaluation Metrics
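The error metrics reported in the following sections (RMSE, MAE, and MAPE) are assumed to follow their standard definitions, sketched below; the paper's exact aggregation may differ.

```python
import numpy as np

def rmse(y, y_hat):
    return np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))

def mae(y, y_hat):
    return np.mean(np.abs(np.asarray(y) - np.asarray(y_hat)))

def mape(y, y_hat):
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean(np.abs((y - y_hat) / y)) * 100  # percentage error
```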
Lookahead Baseline Method
Each dataset occupies one row, with the two columns showing two different time ranges, and each figure compares the predictions of the two model variants. One difference between the basic Lookahead method and the other models is that the predicted value $\hat{y}_{T_o}$ is not used to determine the current position.
As shown in the figure, DA-RNN+L1TF outperforms the other methods, because DA-RNN+LPT predicts the future values too sensitively compared to DA-RNN+L1TF. Compared to the original models, the models with the trend filtering technique achieve lower RMSE, MAE, and MAPE and similar or higher rates of return.
Prediction Results
In the results on the NASDAQ 100 index, as the price rises rapidly, DSANet+L1TF predicts the future index value less sensitively than DSANet without the trend filtering feature. The trend filtering feature is also useful for forecasting the future time series in the DJIA, FTSE 100, and TSX 60 results. Furthermore, a paired test of the two methods, with and without trend filtering, shows that the proposed method achieves statistically significantly better performance than the models without filtering.
Across the five datasets and three deep neural network models, twelve of the fifteen cases are statistically significant at the 0.05 significance level; the results are shown in Table 3.
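The paired comparison can be reproduced along the following lines; the specific choice of a paired t-test is an assumption, since the text only states that a paired test was used.

```python
from scipy import stats

def paired_test(errors_with_tf, errors_without_tf, alpha=0.05):
    """Paired comparison of per-run errors for the same dataset, with and without the TF feature."""
    t_stat, p_value = stats.ttest_rel(errors_with_tf, errors_without_tf)
    return p_value, p_value < alpha   # significant at the 0.05 level?
```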
Summary
Ramakrishnan, "Fine-grained photovoltaic output forecasting using a Bayesian ensemble," in Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
Liu, "Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks," in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018.
Ayo, "Stock price prediction using the ARIMA model," in 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation.
Choi, "Automatic Construction of Nonparametric Relational Regression Models for Multiple Time Series," in International Conference on Machine Learning, 2016.
Choi, "Discovering Latent Covariance Structures for Multiple Time Series," in International Conference on Machine Learning, 2019.
Mohammed, "Stock market forecast using multivariate analysis with bidirectional and stacked (LSTM, GRU)," in 2018 21st Saudi Computer Society National Computing Conference (NCC).
Hinton, "Speech Recognition with Deep Recurrent Neural Networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.
Wu, "Applied attention-based LSTM neural networks in stock forecasting," in 2018 IEEE International Conference on Big Data (Big Data).
Tang, "DSANet: Dual Self-Attention Network for Multivariate Time Series Forecasting," in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019.
Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.