A Survey of Spatio-temporal Data Mining
2.4 Preliminary of Deep Learning Methods in STDM
2.4.2 Attention and Transformer
Both LSTMs and GRUs employ gating mechanisms to selectively remember and forget information, allowing them to capture long-range dependencies effectively. The main difference between the two is that GRUs have a simpler structure, which results in fewer learnable parameters and consequently faster training.
A GRU consists of two gates: update and reset. These gates control the flow of information within the network and are defined by the following equations:
$$ z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1} + b_z) \quad (2.9) $$
$$ r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1} + b_r) \quad (2.10) $$

where $z_t$ and $r_t$ represent the update and reset gates, respectively, at time step $t$. The weight matrices $W_{xz}$, $W_{hz}$, $W_{xr}$, and $W_{hr}$ are the learnable parameters of the gates, and $b_z$ and $b_r$ are the corresponding bias terms. The function $\sigma$ denotes the sigmoid activation function, and $x_t$ and $h_{t-1}$ represent the input and previous hidden state, respectively. The candidate hidden state at time step $t$, denoted by $\tilde{h}_t$, is calculated as follows:

$$ \tilde{h}_t = \tanh(W_{xh} x_t + W_{hh} (r_t \odot h_{t-1}) + b_h) \quad (2.11) $$

where $\odot$ denotes element-wise multiplication, and $W_{xh}$, $W_{hh}$, and $b_h$ are the learnable parameters and bias term associated with the candidate hidden state. Finally, the hidden state at time step $t$, $h_t$, is updated using the following equation:
$$ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad (2.12) $$

Compared to LSTMs, GRUs have demonstrated competitive performance, making them a popular choice for tasks that require modeling complex temporal patterns.
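To make Eqs. (2.9)-(2.12) concrete, the following is a minimal sketch of a single GRU step in NumPy. The function name `gru_step`, the toy dimensions, and the random initialization are illustrative assumptions made for this example, not part of the formulation above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU forward step implementing Eqs. (2.9)-(2.12).

    x_t:    input at time t, shape (input_dim,)
    h_prev: previous hidden state h_{t-1}, shape (hidden_dim,)
    p:      dict of weight matrices and biases (hypothetical names).
    """
    z_t = sigmoid(p["W_xz"] @ x_t + p["W_hz"] @ h_prev + p["b_z"])      # update gate, Eq. (2.9)
    r_t = sigmoid(p["W_xr"] @ x_t + p["W_hr"] @ h_prev + p["b_r"])      # reset gate, Eq. (2.10)
    h_tilde = np.tanh(p["W_xh"] @ x_t + p["W_hh"] @ (r_t * h_prev) + p["b_h"])  # candidate, Eq. (2.11)
    return (1.0 - z_t) * h_prev + z_t * h_tilde                          # new hidden state, Eq. (2.12)

# Toy usage: random parameters for input_dim = 4, hidden_dim = 3.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
params = {
    "W_xz": rng.standard_normal((hidden_dim, input_dim)),
    "W_hz": rng.standard_normal((hidden_dim, hidden_dim)),
    "b_z": np.zeros(hidden_dim),
    "W_xr": rng.standard_normal((hidden_dim, input_dim)),
    "W_hr": rng.standard_normal((hidden_dim, hidden_dim)),
    "b_r": np.zeros(hidden_dim),
    "W_xh": rng.standard_normal((hidden_dim, input_dim)),
    "W_hh": rng.standard_normal((hidden_dim, hidden_dim)),
    "b_h": np.zeros(hidden_dim),
}
h = np.zeros(hidden_dim)
for x in rng.standard_normal((5, input_dim)):  # process a length-5 input sequence
    h = gru_step(x, h, params)
```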
Attention-based architectures, in contrast, directly model the dependencies and relationships in the input data, enabling more efficient parallelization of computations and faster training.
The attention mechanism and the Transformer architecture have demonstrated advantages over RNNs, GRUs, and LSTMs in handling sequential data. They are better at capturing long-range dependencies, thanks to their selective focus on different parts of the input sequence. Furthermore, their non-recurrent structure enables efficient parallelization of computations, leading to faster training. The attention mechanism also provides interpretability by highlighting input-output relationships, while the Transformer architecture is more scalable for handling long sequences. As a result, they have come to dominate the natural language processing field, and their application has expanded to various tasks beyond language, such as computer vision, sequential modeling, and more.
2.4.2.1 Attention mechanism
The attention mechanism, first introduced by Bahdanau et al. (2014) [6], was designed to improve the performance of sequence-to-sequence models in neural machine translation tasks. It allows the model to selectively focus on different parts of the input sequence, thereby capturing long-range dependencies more effectively than traditional RNNs.
The attention mechanism can be described as a weighted sum of input features, where the weights are dynamically computed based on the input and a query vector. Given a query vector $q$ and a set of key-value pairs $(k_i, v_i)$, the attention mechanism computes the output as follows:

$$ \mathrm{Attention}(q, K, V) = \sum_i w_i v_i, \quad (2.13) $$

where $K$ represents the set of key vectors, $V$ represents the set of value vectors, and $w_i$ are the attention weights. The attention weights are computed using a softmax function over the dot products of the query vector and the key vectors:

$$ w_i = \frac{\exp(\mathrm{score}(q, k_i))}{\sum_j \exp(\mathrm{score}(q, k_j))}, \quad (2.14) $$

where the score function measures the similarity between the query vector $q$ and the key vector $k_i$. A common choice for the score function is the dot product:

$$ \mathrm{score}(q, k) = q^\top k. \quad (2.15) $$
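As an illustration of Eqs. (2.13)-(2.15), the following is a minimal NumPy sketch of dot-product attention for a single query. The function name `dot_product_attention` and the toy dimensions are assumptions made for this example.

```python
import numpy as np

def dot_product_attention(q, K, V):
    """Attention(q, K, V) = sum_i w_i v_i with dot-product scores.

    q: query vector, shape (d,)
    K: key matrix, shape (n, d)     -- one key k_i per row
    V: value matrix, shape (n, d_v) -- one value v_i per row
    """
    scores = K @ q                             # score(q, k_i) = q^T k_i, Eq. (2.15)
    scores = scores - scores.max()             # shift for numerical stability of the softmax
    w = np.exp(scores) / np.exp(scores).sum()  # attention weights, Eq. (2.14)
    return w @ V                               # weighted sum of values, Eq. (2.13)

# Toy usage: 5 key-value pairs with d = d_v = 8.
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 8))
output = dot_product_attention(q, K, V)  # shape (8,)
```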
The attention mechanism has been widely adopted in various deep learning models due to its ability to capture long-range dependencies and selectively focus on relevant parts of the input sequence. It has been particularly successful in the natural language processing domain, where it has led to significant improvements in tasks such as machine translation, text summarization, and question-answering.
One notable extension of the attention mechanism is the multi-head attention proposed in the Transformer model. Multi-head attention allows the model to attend to different parts of the input sequence with multiple attention heads, each with its own set of learnable parameters. This enables the model to capture diverse aspects of the input, which generally leads to better performance, particularly when training on large amounts of data.
2.4.2.2 Transformer
The Transformer, introduced by Vaswani et al. (2017) [127], is a novel deep learning architecture that relies solely on self-attention mechanisms, discarding the use of recurrent layers or convolutions. This architecture has revolutionized the field of natural language processing, achieving state-of-the-art performance on a wide range of tasks, and has been extended to many other fields, including computer vision, time series analysis, graph learning, and spatio-temporal data mining.
The core component of the Transformer is the multi-head self-attention mechanism, which allows the model to capture dependencies between different parts of the input sequence. In the multi-head attention layer, the input sequence is transformed into a set of queries, keys, and values through linear projections.
The attention mechanism is then applied multiple times in parallel, with each attention head having its own set of learnable parameters. The outputs from all attention heads are concatenated and linearly transformed to produce the final output. Mathematically, the multi-head attention can be defined as:
$$ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O, \quad (2.16) $$

where $h$ is the number of attention heads, and $W^O$ is a learnable linear transformation matrix. Each attention head $\mathrm{head}_i$ is computed as:

$$ \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V), \quad (2.17) $$

where $W_i^Q$, $W_i^K$, and $W_i^V$ are learnable linear transformation matrices for the $i$-th attention head.
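The sketch below illustrates Eqs. (2.16)-(2.17) in NumPy. Following Vaswani et al. [127], the per-head attention uses scaled dot-product scores (dividing by the square root of the key dimension); the parameter layout, dimensions, and function names are assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, params, num_heads):
    """Sketch of Eqs. (2.16)-(2.17) with scaled dot-product attention.

    Q, K, V: (seq_len, d_model) input matrices.
    params:  per-head projections W_i^Q, W_i^K, W_i^V of shape (d_model, d_k)
             and an output projection W_O of shape (num_heads * d_k, d_model).
    """
    heads = []
    for i in range(num_heads):
        Qi = Q @ params["W_Q"][i]                  # Q W_i^Q
        Ki = K @ params["W_K"][i]                  # K W_i^K
        Vi = V @ params["W_V"][i]                  # V W_i^V
        d_k = Qi.shape[-1]
        scores = Qi @ Ki.T / np.sqrt(d_k)          # scaled dot-product scores
        heads.append(softmax(scores) @ Vi)         # head_i, Eq. (2.17)
    return np.concatenate(heads, axis=-1) @ params["W_O"]  # concat + output projection, Eq. (2.16)

# Toy usage: self-attention (Q = K = V = X) with seq_len=6, d_model=16, 4 heads of size 4.
rng = np.random.default_rng(0)
seq_len, d_model, h, d_k = 6, 16, 4, 4
X = rng.standard_normal((seq_len, d_model))
params = {
    "W_Q": rng.standard_normal((h, d_model, d_k)),
    "W_K": rng.standard_normal((h, d_model, d_k)),
    "W_V": rng.standard_normal((h, d_model, d_k)),
    "W_O": rng.standard_normal((h * d_k, d_model)),
}
out = multi_head_attention(X, X, X, params, num_heads=h)  # shape (6, 16)
```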
The Transformer architecture consists of a stack of identical layers, each containing a multi-head self-attention layer followed by a position-wise feed-forward
layer. A residual connection is applied around each of these sub-layers, followed by layer normalization, which helps to stabilize the training process and alleviate the vanishing gradient problem. Additionally, since the Transformer does not use recurrent layers, positional encoding is added to the input embeddings to inject positional information into the model.
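Putting these pieces together, the following is a minimal sketch of one Transformer encoder layer in PyTorch (a library choice assumed here for brevity): multi-head self-attention and a position-wise feed-forward network, each wrapped in a residual connection and layer normalization, with sinusoidal positional encoding added to the inputs. The hyperparameters and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer (post-norm variant)."""

    def __init__(self, d_model=64, num_heads=4, d_ff=256):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x)  # multi-head self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)           # residual connection + layer normalization
        x = self.norm2(x + self.ffn(x))        # position-wise FFN with residual + normalization
        return x

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding as in the original Transformer."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    angles = pos / (10000.0 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# Toy usage: add positional information to the input embeddings, then encode.
x = torch.randn(2, 10, 64) + sinusoidal_positional_encoding(10, 64)
y = EncoderLayer()(x)  # shape (2, 10, 64)
```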
The Transformer architecture is highly scalable and can be easily adapted to various tasks by modifying its encoder and decoder components. For example, the BERT model [36] adopts the Transformer’s encoder architecture for pretraining on large-scale language modeling tasks, while GPT [98] and its successors [99, 18] utilize the decoder architecture for generative language modeling. These pretrained models have significantly advanced the performance of downstream NLP tasks, highlighting the versatility and effectiveness of the Transformer architecture.
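As a brief illustration of reusing such pretrained Transformer encoders, the sketch below loads BERT with the Hugging Face `transformers` library and extracts contextual token representations for a downstream task. The library and checkpoint choice are assumptions for the example and are not prescribed by the text above.

```python
# Assumes: pip install transformers torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")  # pretrained Transformer encoder (BERT)

inputs = tokenizer("Spatio-temporal data mining with Transformers.", return_tensors="pt")
outputs = model(**inputs)
token_embeddings = outputs.last_hidden_state  # (1, num_tokens, 768) contextual token features
```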