Continuous Spatio-Temporal Data Reconstruction
5.4 Our Approach
5.4.2 Spacetime Encoder
The spatio-temporal encoder seeks to learn embeddings for spacetime events by aggregating information from their local spatio-temporal context, taking into account spatial correlations, temporal correlations, and spatio-temporal inter-dependencies. Motivated by state-of-the-art spatio-temporal prediction models, we construct a graph to retain spatial proximity (Section 5.4.2.1). The encoder comprises multiple spatio-temporal blocks, each containing a GCN module (Section 5.4.2.2), two Bi-LSTM layers (Section 5.4.2.3), a linear layer, and residual connections. The input feature tensor X contains T × N spacetime events, each described by a K-dimensional vector. The final output has dimensions R^{T×N×d}, where d is the dimension of the embedding. The encoder is trained in an end-to-end framework, back-propagating gradients based on the error between the decoder's reconstructed output and the ground truth.
5.4.2.1 Build the Graph
We explicitly represent the spatial relationships of sensor stations using a graph structure. Specifically, we build a graph G = (V, E) where each node is a sensor station and each edge represents the spatial correlation between two sensor stations, reflected by their distance. We apply the geodesic algorithm [69] to calculate the distance between the coordinates of every two sensors¹. Intuitively, a longer distance often implies a lower spatial correlation. Thus, we estimate the pairwise spatial correlation between two nodes v_i and v_j with an exponential decay function, following existing work [79]:
A_{i,j} = { exp(−dist(i,j)/σ²)   if exp(−dist(i,j)/σ²) > θ
          { 0                    otherwise                    (5.1)
where dist(i,j) denotes the geodesic distance between nodes v_i and v_j; the standard deviation σ and the threshold θ are hyper-parameters that determine the sparsity of the connectivity between nodes.
By Equation 5.1, we can build a weighted adjacency matrix A = {e_{1:N,1:N}} ∈ R^{N×N} that summarizes all pairwise spatial correlations. Note that the threshold θ is introduced to overcome the over-smoothing problem of graph convolutional networks [64]. A larger value of θ results in a smaller neighborhood and enables the stacking of more graph convolutional layers before over-smoothing occurs; a smaller value of θ, on the other hand, increases the neighborhood size, making it easier to perceive the overall graph structure. In our experiments, we set θ to 0.1.

¹The geodesic distance is the shortest distance on the surface of an ellipsoidal model of the earth.
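The graph construction of Equation (5.1) can be sketched as follows. This is a minimal illustration, not the chapter's implementation: for self-containedness it approximates the geodesic distance of [69] with the haversine great-circle distance, and the function name and argument layout are our own.

```python
import numpy as np

def build_adjacency(coords, sigma, theta):
    """Thresholded exponential-decay adjacency matrix, Eq. (5.1).

    coords: (N, 2) array of (latitude, longitude) in degrees.
    A haversine great-circle distance stands in for the geodesic
    distance used in the chapter (ellipsoidal earth model).
    """
    lat = np.radians(coords[:, 0])[:, None]
    lon = np.radians(coords[:, 1])[:, None]
    dlat = lat - lat.T
    dlon = lon - lon.T
    a = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(lat.T) * np.sin(dlon / 2) ** 2
    dist = 2 * 6371.0 * np.arcsin(np.sqrt(a))  # pairwise distances in km

    A = np.exp(-dist / sigma ** 2)  # exponential decay of Eq. (5.1)
    A[A <= theta] = 0.0             # threshold theta sparsifies the graph
    return A
```

Raising `theta` prunes weak long-range edges, which is exactly the sparsity knob discussed above.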
5.4.2.2 Graph Convolution Module
The graph convolution module is designed to capture the spatial dependencies among nodes in a network by combining the information of a node with that of its neighbors. The module consists of several stacked graph convolutional (GCN) layers. The standard GCN, first proposed by Kipf and Welling in 2016 [134], updates a node's representation by aggregating information from its neighborhood through a first-order approximation of the Chebyshev spectral filter. One key advantage of the GCN is that its filter is localized in the graph structure, which makes it fast to compute and easy to parallelize.
To handle spatio-temporal data, we extend the standard graph convolution by adding the time dimension to the convolution process, which allows us to perform graph convolution operations at all time slices simultaneously. Note that we assume the sensor locations remain unchanged across time steps, i.e., the graph structure is fixed.
For the l-th GCN layer, the input is the feature tensor H[l−1] ∈ R^{T×N×C[l−1]} and the adjacency matrix A ∈ R^{N×N}. Here, C[l−1] denotes the embedding size of the spacetime events before layer l. As depicted in Figure 5.2, the first GCN layers are positioned after the Bidirectional LSTM module, which handles the temporal dynamics of the events. By applying the GCN module to the temporal embeddings learned by the Bi-LSTM module, the graph convolutional layers effectively integrate spatial dependencies and generate the final spatio-temporal embedding for each spatio-temporal event. Specifically, we first calculate the normalized Laplacian matrix L̃ as
L̃ = (D + I_N)^{−1}(A + I_N)   (5.2)

where D is the degree matrix of the graph represented by the adjacency matrix A, and I_N is the identity matrix. We then use L̃ instead of A to prevent values from exploding under repeated multiplication across multiple layers.
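Equation (5.2) amounts to a row-wise normalization of the self-loop-augmented adjacency matrix; a minimal NumPy sketch (function name ours) is:

```python
import numpy as np

def normalized_laplacian(A):
    """Compute L_tilde = (D + I)^(-1) (A + I), as in Eq. (5.2).

    Adding the identity inserts self-loops; dividing each row by the
    corresponding diagonal entry of (D + I) makes every row sum to 1,
    so stacked multiplications cannot blow up the activations.
    """
    N = A.shape[0]
    A_hat = A + np.eye(N)
    # Row sums of (A + I) equal the diagonal of (D + I).
    D_hat = A_hat.sum(axis=1)
    return A_hat / D_hat[:, None]  # row-wise division == (D + I)^(-1) (A + I)
```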
Omitting the batch dimension, the propagation rule for the l-th GCN layer can be written as:
S[l] = H[l−1] W[l] ∈ R^{T×N×C[l]}   (5.3)

O[l]_{t,v,c} = Σ_n S[l]_{t,n,c} L̃_{n,v},  O[l] ∈ R^{T×N×C[l]}   (5.4)

H[l] = σ(O[l] + b[l]) ∈ R^{T×N×C[l]}   (5.5)

where W[l] ∈ R^{C[l−1]×C[l]} denotes the GCN kernel weight matrix for layer l, S[l] ∈ R^{T×N×C[l]} denotes the support tensor, and O[l] ∈ R^{T×N×C[l]} is the aggregated neighborhood information, computed via Einstein summation, which performs the aggregation for all time steps in parallel. σ is the activation function and b[l] is the bias vector. In Equation (5.4), the subscripts t, n, c, and v are Einstein-notation indices over the tensor dimensions. The operation can be viewed as expanding both tensors along the shared dimension n: the support tensor S[l]_{t,n,c} is expanded along its second dimension, exposing all node features across the different time steps, while the Laplacian matrix L̃_{n,v} is expanded along its first dimension, exposing the connectivity of each node. The two expanded tensors are then multiplied element-wise and summed over n, aggregating the spatial information for every node at every time step at once. In short, this is a convenient way to perform message passing over the same graph structure, but with different temporal node features, at all time slices independently and in parallel. Finally, the bias is added and the activation function is applied to obtain the layer output.
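The three steps above map directly onto two `einsum` calls. The sketch below is illustrative (function name and the choice of ReLU as σ are ours); the second `einsum` contracts the neighbor index n for all T time steps at once, exactly as described for Equation (5.4).

```python
import numpy as np

def gcn_layer(H, L_tilde, W, b):
    """One GCN layer applied at every time slice, Eqs. (5.3)-(5.5).

    H: (T, N, C_in) node features; L_tilde: (N, N) normalized Laplacian;
    W: (C_in, C_out) kernel weight matrix; b: (C_out,) bias vector.
    """
    # Eq. (5.3): per-node feature transform, shared over all time steps.
    S = np.einsum('tnc,co->tno', H, W)
    # Eq. (5.4): aggregate neighbor features, contracting the index n
    # for every time step in parallel.
    O = np.einsum('tnc,nv->tvc', S, L_tilde)
    # Eq. (5.5): bias plus activation (ReLU used here as the sigma).
    return np.maximum(O + b, 0.0)
```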
Having integrated the spatial dynamics into the node embeddings for each time slice, the next phase is to incorporate the temporal dynamics.
5.4.2.3 Bi-LSTM Module
The Bidirectional LSTM module aims to capture the temporal correlations hidden in the nodes' temporal features. In each spatio-temporal block, we add one Bi-LSTM module before the GCN module and another Bi-LSTM module after it.
Unlike many other existing models for spatio-temporal data analysis, here we use the bidirectional LSTM rather than the unidirectional LSTM to leverage both the past and the future information of the query timestamp. It has been demonstrated that bidirectional models outperform unidirectional ones by a wide margin in many fields, such as phoneme classification [50] and speech recognition [49].
Specifically, a Bi-LSTM processes sequential data in both forward and backward directions with two separate hidden layers, whose hidden states are connected to the same output layer. In our case, the input of the l-th Bi-LSTM layer is either the raw feature tensor X, the output of a GCN module, or the output of another Bi-LSTM layer. Given an input tensor H[l−1] ∈ R^{T×N×C[l−1]}, we perform the Bi-LSTM over all nodes' temporal sequences simultaneously. In each Bi-LSTM layer, each LSTM cell corresponds to a single time step. Specifically, for each station, an LSTM cell takes the feature vector at time t, i.e., x_t ∈ R^{C[l−1]}, as input and outputs the corresponding hidden state h_t at time t, which is formulated as:
f_t = σ_g(W_f x_t + U_f h_{t−1} + b_f)   (5.6)
i_t = σ_g(W_i x_t + U_i h_{t−1} + b_i)   (5.7)
o_t = σ_g(W_o x_t + U_o h_{t−1} + b_o)   (5.8)
c̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)   (5.9)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t   (5.10)
h_t = o_t ⊙ tanh(c_t)   (5.11)

where i_t, f_t, and o_t are the input gate, forget gate, and output gate, respectively; c̃_t, c_t, and h_t denote the cell input state, cell state, and cell output. W_f, W_i, W_o, and W_c are the weight matrices mapping the cell input to the different gates, while U_f, U_i, U_o, and U_c are the weight matrices connecting the previous cell output. b_f, b_i, b_o, and b_c are four bias vectors, σ_g and tanh are activation functions, and ⊙ denotes the Hadamard product.
A Bi-LSTM layer contains both forward and backward LSTM cells. The forward LSTM cells compute the hidden state →h_t for each time step t by applying Equations (5.6)-(5.11), while the backward cells compute the hidden state ←h_t by applying Equations (5.6)-(5.11) on the reversed input sequence. Consequently, the output of a Bi-LSTM layer is the concatenation of both directions' outputs:

y_t = [→h_t ; ←h_t] ∈ R^{C[l]}   (5.12)
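A compact NumPy sketch of Equations (5.6)-(5.12) for a single node's sequence is given below. It is illustrative only: the stacked parameter layout (gates along the first axis) and function names are our own conventions, and a real implementation would batch over nodes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM cell update, Eqs. (5.6)-(5.11). Gate parameters are stacked
    along axis 0 in the order forget, input, output, cell."""
    f = sigmoid(W[0] @ x_t + U[0] @ h_prev + b[0])        # Eq. (5.6)
    i = sigmoid(W[1] @ x_t + U[1] @ h_prev + b[1])        # Eq. (5.7)
    o = sigmoid(W[2] @ x_t + U[2] @ h_prev + b[2])        # Eq. (5.8)
    c_tilde = np.tanh(W[3] @ x_t + U[3] @ h_prev + b[3])  # Eq. (5.9)
    c = f * c_prev + i * c_tilde                          # Eq. (5.10)
    h = o * np.tanh(c)                                    # Eq. (5.11)
    return h, c

def bilstm(X, params_fwd, params_bwd, C_out):
    """Run forward and backward LSTMs over one node's sequence X of shape
    (T, C_in) and concatenate the hidden states, Eq. (5.12)."""
    def run(seq, params):
        h, c, out = np.zeros(C_out), np.zeros(C_out), []
        for x_t in seq:
            h, c = lstm_step(x_t, h, c, *params)
            out.append(h)
        return np.stack(out)
    h_fwd = run(X, params_fwd)
    h_bwd = run(X[::-1], params_bwd)[::-1]  # reverse input, then re-align
    return np.concatenate([h_fwd, h_bwd], axis=-1)  # (T, 2 * C_out)
```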
In general, applying Bi-LSTM layers before and after the GCN module enables us to learn the representation of a spacetime event from spatial and temporal correlations alternately. Subsequent GCN and Bi-LSTM modules deepen the spatio-temporal correlations and incorporate broader spatio-temporal context information into the learned embedding for each spacetime event.
5.4.2.4 Residual Connections & Output
Besides the GCN and Bi-LSTM modules, the spatio-temporal encoder also contains linear layers, residual connections, and a normalization layer. As demonstrated in Figure 5.2, residual connections are applied between the GCN and Bi-LSTM modules. Then a normalization layer, a linear layer, and the leaky ReLU activation function are stacked to transform the data further. At last, the outputs from the different spatio-temporal blocks are combined and fed into the encoder's output linear layer.
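The residual-and-normalization wiring can be sketched as below. This is a hypothetical illustration of the described pattern, not the chapter's exact architecture: the operation order, the identity placeholder for the linear layer, and the leaky-ReLU slope are our assumptions.

```python
import numpy as np

def st_block_output(H_in, H_mod):
    """Illustrative residual/normalization step for one spatio-temporal block.

    H_in:  (T, N, d) input to a module (GCN or Bi-LSTM).
    H_mod: (T, N, d) output of that module.
    """
    H = H_in + H_mod  # residual connection between modules
    # Normalization layer over the feature axis (layer-norm style).
    mu = H.mean(axis=-1, keepdims=True)
    sd = H.std(axis=-1, keepdims=True) + 1e-5
    H = (H - mu) / sd
    d = H.shape[-1]
    W = np.eye(d)        # placeholder for a learned linear layer
    Z = H @ W
    return np.where(Z > 0, Z, 0.01 * Z)  # leaky ReLU activation
```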
In the spacetime encoder, we learn the representation for each spacetime event by capturing and aggregating both temporal and spatial patterns into the embeddings. The output of the encoder can be denoted by H_e ∈ R^{T×N×d}, where d is the embedding vector size.