A Survey of Spatio-temporal Data Mining
2.4 Preliminary of Deep Learning Methods in STDM
2.4.2 Attention and Transformer
Both LSTMs and GRUs employ gating mechanisms to selectively remember and forget information, allowing them to capture long-range dependencies effectively. The main difference between the two is that GRUs have a simpler structure, which results in fewer learnable parameters and consequently faster training.
A GRU consists of two gates: update and reset. These gates control the flow of information within the network and are defined by the following equations:
$$ z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1} + b_z) \quad (2.9) $$
$$ r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1} + b_r) \quad (2.10) $$

where $z_t$ and $r_t$ represent the update and reset gates, respectively, at time step $t$. The weight matrices $W_{xz}$, $W_{hz}$, $W_{xr}$, and $W_{hr}$ are the learnable parameters of the gates, and $b_z$ and $b_r$ are the corresponding bias terms. The function $\sigma$ denotes the sigmoid activation function, and $x_t$ and $h_{t-1}$ represent the input and previous hidden state, respectively. The candidate hidden state at time step $t$, denoted by $\tilde{h}_t$, is calculated as follows:

$$ \tilde{h}_t = \tanh(W_{xh} x_t + W_{hh} (r_t \odot h_{t-1}) + b_h) \quad (2.11) $$

where $\odot$ denotes element-wise multiplication, and $W_{xh}$, $W_{hh}$, and $b_h$ are the learnable parameters and bias term associated with the candidate hidden state. Finally, the hidden state at time step $t$, $h_t$, is updated using the following equation:
$$ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad (2.12) $$

Compared to LSTMs, GRUs have demonstrated competitive performance, making them a popular choice for tasks that require modeling complex temporal patterns.
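To make Eqs. (2.9)-(2.12) concrete, the following is a minimal sketch of a single GRU step in NumPy. The function name `gru_step`, the toy dimensions, and the random initialization are illustrative assumptions made for this example, not part of the formulation above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU forward step implementing Eqs. (2.9)-(2.12).

    x_t:    input at time t, shape (input_dim,)
    h_prev: previous hidden state h_{t-1}, shape (hidden_dim,)
    p:      dict of weight matrices and biases (hypothetical names).
    """
    z_t = sigmoid(p["W_xz"] @ x_t + p["W_hz"] @ h_prev + p["b_z"])      # update gate, Eq. (2.9)
    r_t = sigmoid(p["W_xr"] @ x_t + p["W_hr"] @ h_prev + p["b_r"])      # reset gate, Eq. (2.10)
    h_tilde = np.tanh(p["W_xh"] @ x_t + p["W_hh"] @ (r_t * h_prev) + p["b_h"])  # candidate, Eq. (2.11)
    return (1.0 - z_t) * h_prev + z_t * h_tilde                          # new hidden state, Eq. (2.12)

# Toy usage: random parameters for input_dim = 4, hidden_dim = 3.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
params = {
    "W_xz": rng.standard_normal((hidden_dim, input_dim)),
    "W_hz": rng.standard_normal((hidden_dim, hidden_dim)),
    "b_z": np.zeros(hidden_dim),
    "W_xr": rng.standard_normal((hidden_dim, input_dim)),
    "W_hr": rng.standard_normal((hidden_dim, hidden_dim)),
    "b_r": np.zeros(hidden_dim),
    "W_xh": rng.standard_normal((hidden_dim, input_dim)),
    "W_hh": rng.standard_normal((hidden_dim, hidden_dim)),
    "b_h": np.zeros(hidden_dim),
}
h = np.zeros(hidden_dim)
for x in rng.standard_normal((5, input_dim)):  # process a length-5 input sequence
    h = gru_step(x, h, params)
```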
Attention-based architectures, in contrast, directly model the dependencies and relationships in the input data, enabling more efficient parallelization of computations and faster training.
The attention mechanism and the Transformer architecture have demonstrated advantages over RNNs, GRUs, and LSTMs in handling sequential data. They are better at capturing long-range dependencies, thanks to their selective focus on different parts of the input sequence. Furthermore, their non-recurrent structure enables efficient parallelization of computations, leading to faster training. The attention mechanism also provides interpretability by highlighting input-output relationships, while the Transformer architecture is more scalable for handling long sequences. As a result, they have come to dominate the natural language processing field, and their application has expanded to various tasks beyond language, such as computer vision, sequential modeling, and more.
2.4.2.1 Attention mechanism
The attention mechanism, first introduced by Bahdanau et al. (2014) [6], was designed to improve the performance of sequence-to-sequence models in neural machine translation tasks. It allows the model to selectively focus on different parts of the input sequence, thereby capturing long-range dependencies more effectively than traditional RNNs.
The attention mechanism can be described as a weighted sum of input features, where the weights are dynamically computed based on the input and a query vector. Given a query vector $q$ and a set of key-value pairs $(k_i, v_i)$, the attention mechanism computes the output as follows:

$$ \mathrm{Attention}(q, K, V) = \sum_i w_i v_i, \quad (2.13) $$

where $K$ represents the set of key vectors, $V$ represents the set of value vectors, and $w_i$ are the attention weights. The attention weights are computed using a softmax function over the dot products of the query vector and the key vectors:

$$ w_i = \frac{\exp(\mathrm{score}(q, k_i))}{\sum_j \exp(\mathrm{score}(q, k_j))}, \quad (2.14) $$

where the score function measures the similarity between the query vector $q$ and the key vector $k_i$. A common choice for the score function is the dot product:

$$ \mathrm{score}(q, k) = q^\top k. \quad (2.15) $$
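As an illustration of Eqs. (2.13)-(2.15), the following is a minimal NumPy sketch of dot-product attention for a single query. The function name `dot_product_attention` and the toy dimensions are assumptions made for this example.

```python
import numpy as np

def dot_product_attention(q, K, V):
    """Attention(q, K, V) = sum_i w_i v_i with dot-product scores.

    q: query vector, shape (d,)
    K: key matrix, shape (n, d)     -- one key k_i per row
    V: value matrix, shape (n, d_v) -- one value v_i per row
    """
    scores = K @ q                             # score(q, k_i) = q^T k_i, Eq. (2.15)
    scores = scores - scores.max()             # shift for numerical stability of the softmax
    w = np.exp(scores) / np.exp(scores).sum()  # attention weights, Eq. (2.14)
    return w @ V                               # weighted sum of values, Eq. (2.13)

# Toy usage: 5 key-value pairs with d = d_v = 8.
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 8))
output = dot_product_attention(q, K, V)  # shape (8,)
```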
The attention mechanism has been widely adopted in various deep learning models due to its ability to capture long-range dependencies and selectively focus on relevant parts of the input sequence. It has been particularly successful in the natural language processing domain, where it has led to significant improvements in tasks such as machine translation, text summarization, and question-answering.
One notable extension of the attention mechanism is the multi-head attention proposed in the Transformer model. Multi-head attention allows the model to attend to different parts of the input sequence with multiple attention heads, each with its own set of learnable parameters. This enables the model to capture diverse aspects of the input, which generally leads to better performance, particularly when training on large amounts of data.
2.4.2.2 Transformer
The Transformer, introduced by Vaswani et al. (2017) [127], is a novel deep learning architecture that relies solely on self-attention mechanisms, discarding the use of recurrent layers or convolutions. This architecture has revolutionized the field of natural language processing, achieving state-of-the-art performance on a wide range of tasks, and has been extended to many other fields, including computer vision, time series analysis, graph learning, and spatio-temporal data mining.
The core component of the Transformer is the multi-head self-attention mechanism, which allows the model to capture dependencies between different parts of the input sequence. In the multi-head attention layer, the input sequence is transformed into a set of queries, keys, and values through linear projections.
The attention mechanism is then applied multiple times in parallel, with each attention head having its own set of learnable parameters. The outputs from all attention heads are concatenated and linearly transformed to produce the final output. Mathematically, the multi-head attention can be defined as:
$$ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O, \quad (2.16) $$

where $h$ is the number of attention heads, and $W^O$ is a learnable linear transformation matrix. Each attention head $\mathrm{head}_i$ is computed as:

$$ \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V), \quad (2.17) $$

where $W_i^Q$, $W_i^K$, and $W_i^V$ are learnable linear transformation matrices for the $i$-th attention head.
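The sketch below illustrates Eqs. (2.16)-(2.17) in NumPy. Following Vaswani et al. [127], the per-head attention uses scaled dot-product scores (dividing by the square root of the key dimension); the parameter layout, dimensions, and function names are assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, params, num_heads):
    """Sketch of Eqs. (2.16)-(2.17) with scaled dot-product attention.

    Q, K, V: (seq_len, d_model) input matrices.
    params:  per-head projections W_i^Q, W_i^K, W_i^V of shape (d_model, d_k)
             and an output projection W_O of shape (num_heads * d_k, d_model).
    """
    heads = []
    for i in range(num_heads):
        Qi = Q @ params["W_Q"][i]                  # Q W_i^Q
        Ki = K @ params["W_K"][i]                  # K W_i^K
        Vi = V @ params["W_V"][i]                  # V W_i^V
        d_k = Qi.shape[-1]
        scores = Qi @ Ki.T / np.sqrt(d_k)          # scaled dot-product scores
        heads.append(softmax(scores) @ Vi)         # head_i, Eq. (2.17)
    return np.concatenate(heads, axis=-1) @ params["W_O"]  # concat + output projection, Eq. (2.16)

# Toy usage: self-attention (Q = K = V = X) with seq_len=6, d_model=16, 4 heads of size 4.
rng = np.random.default_rng(0)
seq_len, d_model, h, d_k = 6, 16, 4, 4
X = rng.standard_normal((seq_len, d_model))
params = {
    "W_Q": rng.standard_normal((h, d_model, d_k)),
    "W_K": rng.standard_normal((h, d_model, d_k)),
    "W_V": rng.standard_normal((h, d_model, d_k)),
    "W_O": rng.standard_normal((h * d_k, d_model)),
}
out = multi_head_attention(X, X, X, params, num_heads=h)  # shape (6, 16)
```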
The Transformer architecture consists of a stack of identical layers, each containing a multi-head self-attention layer followed by a position-wise feed-forward
layer. A residual connection is applied around each of these sub-layers, followed by layer normalization, which helps to stabilize the training process and alleviate the vanishing gradient problem. Additionally, since the Transformer does not use recurrent layers, positional encoding is added to the input embeddings to inject positional information into the model.
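Putting these pieces together, the following is a minimal sketch of one Transformer encoder layer in PyTorch (a library choice assumed here for brevity): multi-head self-attention and a position-wise feed-forward network, each wrapped in a residual connection and layer normalization, with sinusoidal positional encoding added to the inputs. The hyperparameters and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer (post-norm variant)."""

    def __init__(self, d_model=64, num_heads=4, d_ff=256):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x)  # multi-head self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)           # residual connection + layer normalization
        x = self.norm2(x + self.ffn(x))        # position-wise FFN with residual + normalization
        return x

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding as in the original Transformer."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    angles = pos / (10000.0 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# Toy usage: add positional information to the input embeddings, then encode.
x = torch.randn(2, 10, 64) + sinusoidal_positional_encoding(10, 64)
y = EncoderLayer()(x)  # shape (2, 10, 64)
```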
The Transformer architecture is highly scalable and can be easily adapted to various tasks by modifying its encoder and decoder components. For example, the BERT model [36] adopts the Transformer’s encoder architecture for pretraining on large-scale language modeling tasks, while GPT [98] and its successors [99, 18] utilize the decoder architecture for generative language modeling. These pretrained models have significantly advanced the performance of downstream NLP tasks, highlighting the versatility and effectiveness of the Transformer architecture.
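As a brief illustration of reusing such pretrained Transformer encoders, the sketch below loads BERT with the Hugging Face `transformers` library and extracts contextual token representations for a downstream task. The library and checkpoint choice are assumptions for the example and are not prescribed by the text above.

```python
# Assumes: pip install transformers torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")  # pretrained Transformer encoder (BERT)

inputs = tokenizer("Spatio-temporal data mining with Transformers.", return_tensors="pt")
outputs = model(**inputs)
token_embeddings = outputs.last_hidden_state  # (1, num_tokens, 768) contextual token features
```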