Friday, July 19, 2024

Transformer and self-attention

GPT 4o

Sure, here's a revised explanation including the dimensions of the matrices involved:

Transformer Architecture

The Transformer consists of an encoder-decoder structure:

  • Encoder: Processes the input sequence.
  • Decoder: Generates the output sequence.

Each of these parts is a stack of layers that share the same structure but have their own learned weights. Encoder and decoder layers are built from:

  1. Multi-Head Self-Attention Mechanism
  2. Feed-Forward Neural Network
  3. Add and Norm Layers

Self-Attention Mechanism

Self-attention allows the model to weigh the importance of different words in a sequence relative to each other. Here's how it's calculated:

  1. Input Vectors: The input words are first converted into vectors through an embedding layer. These vectors are denoted as X.

    • Dimension of X: (n, d_{\text{model}}), where n is the sequence length (number of words) and d_{\text{model}} is the dimensionality of the embedding space.
  2. Linear Projections: Three different linear projections are applied to the input vectors to generate the Query (Q), Key (K), and Value (V) matrices.

    Q = XW^Q, \quad K = XW^K, \quad V = XW^V

    Here, W^Q, W^K, and W^V are weight matrices learned during training.

    • Dimension of W^Q, W^K, W^V: (d_{\text{model}}, d_k)
    • Dimension of Q, K, V: (n, d_k)
  3. Scaled Dot-Product Attention:

    \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
    • QK^T: Computes the dot products between the Query and Key matrices to get the raw attention scores.
      • Dimension of QK^T: (n, n)
    • \sqrt{d_k}: The dot products are scaled by the square root of the key dimension d_k to stabilize gradients.
    • \text{softmax}: Applied row-wise to obtain attention weights that sum to 1.
    • V: The Value matrix is then weighted by these attention scores to produce the output.
      • Dimension of output: (n, d_k)
    A minimal NumPy sketch of this computation (including the Q, K, V projections) appears right after this list.
  4. Multi-Head Attention: Instead of performing a single self-attention operation, the Transformer employs multiple attention heads. Each head performs the attention mechanism independently with different learned projections, and the results are concatenated and linearly transformed (a corresponding sketch follows this list).

    \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
    • Dimension of each head \text{head}_i: (n, d_k)
    • Number of heads: h
    • Dimension after concatenation: (n, h \cdot d_k)
    • Dimension of W^O: (h \cdot d_k, d_{\text{model}})
    • Dimension of output: (n, d_{\text{model}})
  5. Feed-Forward Network: After the multi-head attention, the output passes through a feed-forward neural network, which consists of two linear transformations with a ReLU activation in between (sketched together with Add and Norm after this list).

    \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
    • Dimension of W_1: (d_{\text{model}}, d_{\text{ff}})
    • Dimension of W_2: (d_{\text{ff}}, d_{\text{model}})
    • Dimension of b_1: (d_{\text{ff}})
    • Dimension of b_2: (d_{\text{model}})
  6. Add and Norm Layers: Each sub-layer (self-attention and feed-forward) has a residual connection around it, followed by layer normalization.

    \text{Output} = \text{LayerNorm}(x + \text{Sublayer}(x))
    • Dimension of output: (n, d_{\text{model}})
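
To make the attention steps concrete, here is a minimal NumPy sketch of the Q/K/V projections and scaled dot-product attention. The shapes follow the dimensions above ((n, d_{\text{model}}) inputs, (d_{\text{model}}, d_k) projections); the specific sizes, the random weights, and the softmax helper are illustrative assumptions rather than part of any particular implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (n, d_k). Raw scores QK^T have shape (n, n).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # (n, d_k)

# Illustrative sizes (assumptions, not from the text).
n, d_model, d_k = 5, 16, 8
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d_model))            # token embeddings, (n, d_model)
W_Q = rng.normal(size=(d_model, d_k))        # learned in a real model; random here
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # each (n, d_k)
out = scaled_dot_product_attention(Q, K, V)  # (n, d_k)
print(out.shape)                             # (5, 8)
```

Each output row is a weighted average of the value vectors, with the weights given by that token's attention distribution.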
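
Multi-head attention can be sketched the same way: h independent heads, each with its own projections, concatenated and mixed by W^O. Again the weights are random placeholders; a real model learns them and typically chooses d_k = d_{\text{model}} / h so the concatenated width equals d_{\text{model}}.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # X: (n, d_model); W_Q/W_K/W_V: (h, d_model, d_k); W_O: (h*d_k, d_model).
    h, _, d_k = W_Q.shape
    heads = []
    for i in range(h):
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]   # each (n, d_k)
        scores = Q @ K.T / np.sqrt(d_k)                # (n, n)
        heads.append(softmax(scores, axis=-1) @ V)     # (n, d_k)
    concat = np.concatenate(heads, axis=-1)            # (n, h*d_k)
    return concat @ W_O                                # (n, d_model)

# Illustrative sizes (assumptions): 2 heads with d_k = d_model / h.
n, d_model, h = 5, 16, 2
d_k = d_model // h
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d_model))
W_Q = rng.normal(size=(h, d_model, d_k))
W_K = rng.normal(size=(h, d_model, d_k))
W_V = rng.normal(size=(h, d_model, d_k))
W_O = rng.normal(size=(h * d_k, d_model))

print(multi_head_attention(X, W_Q, W_K, W_V, W_O).shape)  # (5, 16)
```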
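
Finally, the feed-forward sub-layer and the Add and Norm wrapper. The sketch below uses random weights and a hand-rolled layer normalization purely for illustration; the learned gain and bias of a real LayerNorm are omitted for brevity.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at each position.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def layer_norm(x, eps=1e-5):
    # Normalize each position's d_model features to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)  # learned gain/bias omitted for brevity

def add_and_norm(x, sublayer_output):
    # Residual connection followed by layer normalization.
    return layer_norm(x + sublayer_output)

# Illustrative sizes (assumptions): d_ff is usually several times d_model.
n, d_model, d_ff = 5, 16, 64
rng = np.random.default_rng(0)

x = rng.normal(size=(n, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = add_and_norm(x, feed_forward(x, W1, b1, W2, b2))
print(out.shape)  # (5, 16): the (n, d_model) shape is preserved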

Putting It All Together

  1. Encoding: The input sequence is processed through multiple layers of self-attention and feed-forward networks. The output of the encoder layers is a set of continuous representations of the input sequence.

  2. Decoding: The decoder also uses self-attention and feed-forward networks, but its self-attention is masked so that each position can attend only to earlier positions. In addition, an "encoder-decoder attention" sub-layer lets each decoder position attend to the encoder's output. Together these allow the model to generate the output sequence one element at a time.

By relying on self-attention, the Transformer efficiently captures dependencies between words regardless of their distance in the sequence, enabling powerful performance on a variety of NLP tasks. 

