Continuous Spatio-Temporal Data Reconstruction
5.5 Experiments
In this section, we evaluate the proposed model on several real-world datasets.
5.5.1 Experimental Setup
5.5.1.1 Datasets
We conduct experiments on three real-world datasets: AQI-Seoul, Weather-AUS, and PeMS-Bay. The AQI-Seoul dataset was provided by the Seoul Metropolitan Government and is available on Kaggle². It records the air quality index in Seoul from January 2017 to December 2019: 25 sensors measure air-quality indicators, including SO2, NO2, O3, CO, PM10, and PM2.5, and save the data once per hour. The dataset contains 25,906 time slices in total, spanning three years. The second dataset, Weather-AUS, can also be downloaded from Kaggle³; its source is the Australian Bureau of Meteorology. It contains weather information from 49 major cities spread across the whole country, recorded once per day. The main features are temperature, rainfall, evaporation, sunshine, wind, etc. Because detailed sensor locations are unavailable, we use the city coordinates in place of the sensor coordinates. The last dataset is PeMS-Bay [79], which records traffic conditions in the California Bay Area in early 2017 via 325 on-road sensors. The data are aggregated into 5-minute windows of the average speed of all vehicles passing each sensor. Statistics of the datasets are summarized in Table 5.1.
Since the datasets are collected in a discrete manner, we simulate the continuous reconstruction problem by removing some of the measurements in order to quantify the performance of the reconstructed results. For ease of presentation, we refer to the number of missing time steps between two adjacent observed time steps as the sample interval, denoted by r. For example, given a dataset with a sequence of time steps {0, 1, 2, 3, 4, 5, ...}, sample interval r = 1 is the case in which we have observations at positions {0, 2, 4, ...} and aim to infer data at positions {1, 3, 5, ...}, while r = 4 corresponds to a sparser observed sequence such as {0, 5, 10, ...}, where the goal is to reconstruct data at {1, 2, 3, 4, 6, 7, 8, 9, ...}. We restrict the sample interval here only for quantitative comparison of the models. In fact, an important contribution of our model is its flexible granularity: given r = 1, for example, our model can infer data at timestamps corresponding to 0.5, 1, 1.5, or 1.8 time steps rather than at a single fixed step.
² https://www.kaggle.com/datasets/bappekim/air-pollution-in-seoul
³ https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package
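As a minimal illustration of this sampling scheme (not our actual data pipeline; the function and variable names below are ours for exposition), the observed/target split for a given sample interval r can be simulated as follows, assuming the raw data is a (T, N) array of T time steps and N sensors:

```python
import numpy as np

def split_by_sample_interval(data: np.ndarray, r: int):
    """Keep every (r+1)-th time step as observed input; the rest are targets."""
    T = data.shape[0]
    observed_idx = np.arange(0, T, r + 1)      # r=4 -> {0, 5, 10, ...}
    target_mask = np.ones(T, dtype=bool)
    target_mask[observed_idx] = False          # r=4 -> {1, 2, 3, 4, 6, ...}
    return data[observed_idx], data[target_mask], observed_idx

# r = 1: observe {0, 2, 4, ...} and reconstruct {1, 3, 5, ...}.
toy = np.arange(12, dtype=float).reshape(12, 1)   # 12 time steps, 1 sensor
obs, targets, idx = split_by_sample_interval(toy, r=1)
print(idx)   # [ 0  2  4  6  8 10]
```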
Table 5.1: Statistics of datasets

Dataset     | #sensors | #time steps | #events    | Interval
------------|----------|-------------|------------|----------
AQI-Seoul   | 25       | 25,906      | 647,650    | 1 hour
Weather-AUS | 49       | 1,577       | 77,273     | 1 day
PeMS-Bay    | 325      | 52,116      | 16,937,700 | 5 minutes
5.5.1.2 Baselines
We compare our model with the following baselines. Several of the more recent methods are imputation methods, but they can be adapted for comparison by modifying the data mask.
• Mean: We substitute the masked data with the node-level average of the unmasked time steps.
• KNN: We use the k-nearest neighbors algorithm to find similar samples along the time axis and fill in the missing data for each sensor station.
• Linear: We use linear interpolation [14] to replace the missing data between every two known time steps.
• Cubic Spline: Cubic spline interpolation fits a curve that connects the data points with third-order polynomials and fills in the missing data using the resulting function (a sketch of the two interpolation baselines follows this list).
• MICE: We use Multiple Imputation by Chained Equations (MICE) [5], a widely used imputation method, to fill in the missing values. The maximum number of iterations is limited to 100, and the number of nearest features to 10.
• BRITS: A bidirectional recurrent imputation method for time series data [20].
• GRIN: A graph neural network-based time series imputation method [33].
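To make the Linear and Cubic Spline baselines concrete, the following is a rough per-sensor sketch using SciPy; the helper name interpolate_missing is illustrative and not taken from the original baseline implementations.

```python
import numpy as np
from scipy.interpolate import CubicSpline, interp1d

def interpolate_missing(series: np.ndarray, observed_idx: np.ndarray,
                        kind: str = "linear") -> np.ndarray:
    """Fill a 1-D series at all time steps from values at observed_idx."""
    all_idx = np.arange(len(series))
    if kind == "cubic":
        f = CubicSpline(observed_idx, series[observed_idx])
    else:
        f = interp1d(observed_idx, series[observed_idx], kind="linear",
                     fill_value="extrapolate")
    return f(all_idx)

# Usage on a toy series with sample interval r = 4 (observations at 0, 5, 10, 15).
y = np.sin(np.linspace(0, 3, 16))
filled = interpolate_missing(y, np.arange(0, 16, 5), kind="cubic")
```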
The research task discussed in this paper is new; therefore, no deep learning-based baselines are available. Instead, we adopt state-of-the-art models for a similar problem, imputation, for comparison. Imputation models aim to fill in missing values in the input data, and the missing entries are usually spread randomly across the whole dataset. In our evaluation setup, where the ground-truth
data comes at a regular time interval, we apply the imputation methods by assigning the “missing mask” to all time steps between two known adjacent timestamps.
For example, when the sample interval in our evaluation setting is set to r = 1, the task can be considered a special imputation problem with a missing value at every other time step for all sensor stations. Similarly, r = 4 corresponds to a missing rate of 80%, with 4 missing values between any pair of known adjacent time steps.
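A small sketch of this mask construction (names are ours, not the baseline code), showing that the missing rate implied by sample interval r is r / (r + 1):

```python
import numpy as np

def imputation_mask(T: int, r: int) -> np.ndarray:
    """True = observed, False = treated as missing, for all sensors."""
    mask = np.zeros(T, dtype=bool)
    mask[::r + 1] = True          # keep every (r+1)-th time step
    return mask

for r in (1, 4):
    m = imputation_mask(1000, r)
    print(r, 1 - m.mean())        # r=1 -> ~0.50 missing, r=4 -> ~0.80 missing
```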
5.5.1.3 Experiment Settings
Our model was implemented using the PyTorch framework, and the experiments were performed on the following hardware setup: an AMD Ryzen 9 5900X CPU and an NVIDIA GeForce RTX 3090 GPU. The key hyperparameters are as follows. In the encoder, three spatio-temporal blocks are stacked, and the embedding dimension is 128 for all layers, including the GCN and Bi-LSTM layers. The dropout rate for the GCN is 0.3. In each Bi-LSTM module, two layers are stacked with a dropout rate of 0.2 and a hidden size of 64 for each direction. All leaky ReLU activation functions use a negative slope of 0.2. In the decoder, the hidden sizes of the linear layers are 1024, 512, 256, 128, and 32, respectively. The Adam optimizer with a learning rate of 1e-3 and a weight decay of 5e-4 was used for training. The three datasets are split chronologically as 70%/10%/20% for training, validation, and testing, respectively. Finally, we set the number of epochs to 100 and the batch size to 8 for all experiments.
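The sketch below illustrates only the decoder MLP and optimizer configuration listed above; the full STRL encoder (GCN and Bi-LSTM blocks) is omitted, and the Decoder class, its input dimension of 128, and its scalar output are our illustrative assumptions rather than the original code.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Stand-in decoder MLP with the hidden sizes quoted in the text."""
    def __init__(self, in_dim: int = 128, out_dim: int = 1):
        super().__init__()
        dims = [in_dim, 1024, 512, 256, 128, 32]   # hidden sizes from the text
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.LeakyReLU(0.2)]
        layers.append(nn.Linear(dims[-1], out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)

model = Decoder()
# Optimizer settings as stated: Adam, lr 1e-3, weight decay 5e-4.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)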
5.5.2 Results
In this section, we discuss the performance comparison between our model and the baselines on the three datasets with different sample intervals. Specifically, we report the mean absolute error (MAE), mean squared error (MSE), and mean absolute percentage error (MAPE) of all methods for sample intervals r = 1 and r = 4.
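For reference, the three metrics can be computed as follows (a straightforward sketch, assuming predictions and targets are NumPy arrays of the same shape; the eps guard is our addition to avoid division by zero):

```python
import numpy as np

def mae(pred, true):
    return np.abs(pred - true).mean()

def mse(pred, true):
    return ((pred - true) ** 2).mean()

def mape(pred, true, eps=1e-8):
    # Percentage error; eps guards against zero-valued readings.
    return (np.abs(pred - true) / (np.abs(true) + eps)).mean() * 100
```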
Table 5.2 summarizes the results of the different models on the AQI-Seoul, Weather-AUS, and PeMS-Bay datasets. Because no deep learning baseline is available for the continuous reconstruction task, we adopt state-of-the-art imputation models such as BRITS and GRIN for comparison. The experimental results show that STRL outperforms the baselines by a large margin in several scenarios, especially when the sample interval is long. For example, STRL decreases MAE relative to the strongest baseline, GRIN, by 17% on the AQI-Seoul dataset with a short interval, and even gains a 50% improvement in the long-interval setting.
Table 5.2: Performance comparison under different numbers of missing steps on three datasets (each cell: MAE / MSE / MAPE)

Sample interval r = 1:

Method        | AQI-Seoul                | Weather-AUS              | PeMS-Bay
--------------|--------------------------|--------------------------|--------------------------
Mean          | 38.561 / 11386 / 191.50  | 2.7680 / 25.081 / 1518.7 | 4.5461 / 59.137 / 10.125
Linear Spline | 13.345 / 9508.1 / 146.65 | 2.0211 / 20.254 / 563.65 | 0.7679 / 1.6148 / 1.4182
Cubic Spline  | 11.580 / 1407.5 / 96.841 | 2.1507 / 28.713 / 568.33 | 0.9420 / 1.9887 / 1.9031
KNN           | 7.2500 / 1222.5 / 47.359 | 2.0182 / 27.742 / 492.85 | 0.8320 / 1.7938 / 1.8410
MICE          | 6.5829 / 1048.1 / 45.203 | 1.9720 / 26.342 / 487.27 | 0.8620 / 1.6323 / 1.7341
BRITS         | 4.0135 / 942.13 / 43.249 | 1.7926 / 24.289 / 401.48 | 0.7375 / 1.3129 / 1.4242
GRIN          | 3.6322 / 810.59 / 38.281 | 1.6720 / 21.421 / 379.50 | 0.6923 / 0.9228 / 1.3142
STRL (Ours)   | 3.0188 / 758.82 / 33.352 | 1.4272 / 19.102 / 313.93 | 0.6737 / 0.8855 / 1.1440

Sample interval r = 4:

Method        | AQI-Seoul                | Weather-AUS              | PeMS-Bay
--------------|--------------------------|--------------------------|--------------------------
Mean          | 38.707 / 11392 / 364.51  | 2.8960 / 25.471 / 1448.9 | 4.1482 / 44.635 / 8.4522
Linear Spline | 21.574 / 16557 / 197.74  | 2.7322 / 45.177 / 675.02 | 1.2971 / 5.4092 / 2.3344
Cubic Spline  | 19.980 / 13336 / 96.953  | 3.0898 / 61.475 / 707.17 | 1.3243 / 5.1892 / 2.6996
KNN           | 14.732 / 12496 / 79.668  | 3.0848 / 96.897 / 900.46 | 1.2516 / 4.9342 / 2.5653
MICE          | 13.830 / 6839.7 / 71.294 | 2.9874 / 79.248 / 821.48 | 1.2528 / 4.2843 / 2.6021
BRITS         | 11.558 / 4892.1 / 65.293 | 2.8134 / 49.089 / 706.04 | 1.2129 / 3.8928 / 2.3419
GRIN          | 9.8025 / 4102.9 / 62.754 | 2.8325 / 48.104 / 739.29 | 1.2502 / 4.1692 / 2.5890
STRL (Ours)   | 5.1254 / 1546.6 / 51.355 | 2.2133 / 37.683 / 607.31 | 1.1333 / 2.9576 / 2.2040
The large gap for GRIN under different sample-interval lengths is mainly caused by the percentage of missing data. In the original settings of the GRIN model, the missing-data rate is around 5%–25%. When r = 1, the missing rate is about 50%, which might still be within reach of GRIN. However, the missing rate jumps to 80% when r = 4, where the majority of the data are missing. This poses severe challenges for imputation models such as GRIN and BRITS, as they are usually designed for scenarios in which missing data are the minority and most of the data are available.
In general, longer intervals tend to be more challenging than short intervals.
Almost all methods perform better with sample interval r = 1 than with sample interval r = 4. For example, the MAE of our model on the AQI-Seoul dataset increases from 3.02 to 5.12 when r grows from 1 to 4. Other models show similar patterns, but some of them, such as GRIN, are more sensitive to the length of the missing span. The non-deep-learning methods, including linear spline, cubic spline, KNN, and MICE, are not very competitive, but they appear more stable with respect to the sample interval than GRIN or BRITS. For instance, the MAE of cubic spline on the AQI-Seoul dataset is about 1.7 times higher for r = 4 than for r = 1, whereas GRIN's is about 2.7 times higher.
Compared with AQI-Seoul, the temporal features of the PeMS-Bay sensors fluctuate less; therefore, the difference in MAE is not as drastic as on the other two datasets. Nevertheless, our model still surpasses the baselines on all three datasets.
5.5.3 Flexible Granularity
The performance gain discussed above is only part of our contribution. The most prominent one is the arbitrary granularity supported by our model at the inference stage. In this section, we demonstrate this capability through two concrete examples.
The first example is shown in Figure 5.3. We took the PM2.5 readings from sensor 0 in the AQI-Seoul dataset and kept only 16 time steps (10,350 to 10,365, relabeled 0 to 15) for demonstration. The sample interval is 4; therefore, time steps {0, 5, 10, 15} serve as inputs fed into the model, while the remaining time steps serve as labels for training or testing. Blue dots denote the input time steps, and red triangles are the ground-truth targets. Everything between two adjacent markers is unobserved. The goal of our model is to reconstruct the detailed temporal patterns at a very fine, or even continuous, granularity for unknown timestamps.
Figure 5.3: Continuous reconstruction of PM2.5 readings for sensor 0 in the AQI-Seoul dataset from time steps 10,350 to 10,365.
Figure 5.4: Continuous reconstruction of traffic conditions for sensor 0 in the PeMS-Bay dataset from time steps 1400 to 1445.
The predictions of linear spline and GRIN are shown as a gray dotted line and a solid black line, respectively.
For simplicity, we focus on time steps {0, 1, 2, 3, 4, 5} only. During training or testing, data at steps {0, 5} are fed into the model, and steps {1, 2, 3, 4} serve as targets. To support arbitrary granularity, we normalize the time difference between every two adjacent input time steps to [0, 1] and use a parameter t ∈ [0, 1] to refer to time steps. In this case, t = 0 denotes time step 0, t = 0.2 refers to time step 1, t = 0.4 refers to time step 2, and so on. Combined with Equation (5.13), we can train the STRL model on the observed data and infer a value for any t ∈ [0, 1]. In particular, the increment of t is set to 0.02 in this example. In other words, from time step 0 to time step 5, the ground truth contains six records, but we reconstruct 50 data points to obtain a better estimate of the underlying continuous temporal flow. The results of STRL are shown in orange in Figure 5.3.
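A minimal sketch of this query-time construction (variable names are ours): with r = 4 and an increment of 0.02, 50 query positions are generated per observed segment, and each normalized t maps back to an original time step via t(r + 1).

```python
import numpy as np

r = 4                                  # sample interval used in this example
t = np.arange(0.0, 1.0, 0.02)          # 50 query positions in [0, 1)
original_steps = t * (r + 1)           # t=0.2 -> step 1, t=0.4 -> step 2, ...
# Each t would be fed to the trained model together with the learned
# spatio-temporal embedding (cf. Equation (5.13)) to reconstruct a reading.
```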
In Figure 5.3, all available sensor readings lie on the blue line: those serving as inputs are shown as blue dots, while those serving as targets for training or testing are shown as red triangles.
The results of the second case study are presented in Figure 5.4. This study uses the traffic speed data collected from sensor station 0 in the PeMS-Bay dataset for a total of 46 time steps, ranging from 1400 to 1445 and relabeled 0 to 45 for simplicity. The sample interval is again fixed at four. The figure displays the input time steps as blue dots and the reference time steps used for training or evaluation as red triangles. The increment of the time variable t was again set to 0.02 to ensure a smooth plot, resulting in a total of 450 reconstructed data points evenly distributed across the 46 time steps and yielding a more refined curve than the ground truth or GRIN. The figure demonstrates that our model effectively captures the general trends in traffic variation over time.