Friday, July 19, 2024

Transformer and self-attention

GPT 4o

Sure, here's a revised explanation including the dimensions of the matrices involved:

Transformer Architecture

The Transformer consists of an encoder-decoder structure:

  • Encoder: Processes the input sequence.
  • Decoder: Generates the output sequence.

Each of these parts is a stack of layers that share the same structure but have their own learned weights. Encoder and decoder layers are built from:

  1. Multi-Head Self-Attention Mechanism
  2. Feed-Forward Neural Network
  3. Add and Norm Layers

Self-Attention Mechanism

Self-attention allows the model to weigh the importance of different words in a sequence relative to each other. Here's how it's calculated:

  1. Input Vectors: The input words are first converted into vectors through an embedding layer. These vectors are denoted as X.

    • Dimension of X: (n, d_{\text{model}}), where n is the sequence length (number of words) and d_{\text{model}} is the dimensionality of the embedding space.
  2. Linear Projections: Three different linear projections are applied to the input vectors to generate the Query (Q), Key (K), and Value (V) matrices.

    Q = XW^Q, \quad K = XW^K, \quad V = XW^V

    Here, W^Q, W^K, and W^V are weight matrices learned during training.

    • Dimension of W^Q, W^K, W^V: (d_{\text{model}}, d_k)
    • Dimension of Q, K, V: (n, d_k)
  3. Scaled Dot-Product Attention:

    \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
    • QK^T: Computes the dot products between the Query and Key matrices to get the raw attention scores.
      • Dimension of QK^T: (n, n)
    • \sqrt{d_k}: The dot products are scaled by the square root of the key dimension d_k to stabilize gradients.
    • \text{softmax}: Applied row-wise to obtain attention weights that sum to 1.
    • V: The Value matrix is then weighted by these attention scores to produce the output.
      • Dimension of output: (n, d_k)
    A minimal NumPy sketch of this computation (including the Q, K, V projections) appears right after this list.
  4. Multi-Head Attention: Instead of performing a single self-attention operation, the Transformer employs multiple attention heads. Each head performs the attention mechanism independently with different learned projections, and the results are concatenated and linearly transformed (a corresponding sketch follows this list).

    \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
    • Dimension of each head \text{head}_i: (n, d_k)
    • Number of heads: h
    • Dimension after concatenation: (n, h \cdot d_k)
    • Dimension of W^O: (h \cdot d_k, d_{\text{model}})
    • Dimension of output: (n, d_{\text{model}})
  5. Feed-Forward Network: After the multi-head attention, the output passes through a feed-forward neural network, which consists of two linear transformations with a ReLU activation in between (sketched together with Add and Norm after this list).

    \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
    • Dimension of W_1: (d_{\text{model}}, d_{\text{ff}})
    • Dimension of W_2: (d_{\text{ff}}, d_{\text{model}})
    • Dimension of b_1: (d_{\text{ff}})
    • Dimension of b_2: (d_{\text{model}})
  6. Add and Norm Layers: Each sub-layer (self-attention and feed-forward) has a residual connection around it, followed by layer normalization.

    \text{Output} = \text{LayerNorm}(x + \text{Sublayer}(x))
    • Dimension of output: (n, d_{\text{model}})
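
To make the attention steps concrete, here is a minimal NumPy sketch of the Q/K/V projections and scaled dot-product attention. The shapes follow the dimensions above ((n, d_{\text{model}}) inputs, (d_{\text{model}}, d_k) projections); the specific sizes, the random weights, and the softmax helper are illustrative assumptions rather than part of any particular implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (n, d_k). Raw scores QK^T have shape (n, n).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # (n, d_k)

# Illustrative sizes (assumptions, not from the text).
n, d_model, d_k = 5, 16, 8
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d_model))            # token embeddings, (n, d_model)
W_Q = rng.normal(size=(d_model, d_k))        # learned in a real model; random here
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # each (n, d_k)
out = scaled_dot_product_attention(Q, K, V)  # (n, d_k)
print(out.shape)                             # (5, 8)
```

Each output row is a weighted average of the value vectors, with the weights given by that token's attention distribution.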
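
Multi-head attention can be sketched the same way: h independent heads, each with its own projections, concatenated and mixed by W^O. Again the weights are random placeholders; a real model learns them and typically chooses d_k = d_{\text{model}} / h so the concatenated width equals d_{\text{model}}.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # X: (n, d_model); W_Q/W_K/W_V: (h, d_model, d_k); W_O: (h*d_k, d_model).
    h, _, d_k = W_Q.shape
    heads = []
    for i in range(h):
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]   # each (n, d_k)
        scores = Q @ K.T / np.sqrt(d_k)                # (n, n)
        heads.append(softmax(scores, axis=-1) @ V)     # (n, d_k)
    concat = np.concatenate(heads, axis=-1)            # (n, h*d_k)
    return concat @ W_O                                # (n, d_model)

# Illustrative sizes (assumptions): 2 heads with d_k = d_model / h.
n, d_model, h = 5, 16, 2
d_k = d_model // h
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d_model))
W_Q = rng.normal(size=(h, d_model, d_k))
W_K = rng.normal(size=(h, d_model, d_k))
W_V = rng.normal(size=(h, d_model, d_k))
W_O = rng.normal(size=(h * d_k, d_model))

print(multi_head_attention(X, W_Q, W_K, W_V, W_O).shape)  # (5, 16)
```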
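
Finally, the feed-forward sub-layer and the Add and Norm wrapper. The sketch below uses random weights and a hand-rolled layer normalization purely for illustration; the learned gain and bias of a real LayerNorm are omitted for brevity.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at each position.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def layer_norm(x, eps=1e-5):
    # Normalize each position's d_model features to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)  # learned gain/bias omitted for brevity

def add_and_norm(x, sublayer_output):
    # Residual connection followed by layer normalization.
    return layer_norm(x + sublayer_output)

# Illustrative sizes (assumptions): d_ff is usually several times d_model.
n, d_model, d_ff = 5, 16, 64
rng = np.random.default_rng(0)

x = rng.normal(size=(n, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = add_and_norm(x, feed_forward(x, W1, b1, W2, b2))
print(out.shape)  # (5, 16): the (n, d_model) shape is preserved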

Putting It All Together

  1. Encoding: The input sequence is processed through multiple layers of self-attention and feed-forward networks. The output of the encoder layers is a set of continuous representations of the input sequence.

  2. Decoding: The decoder also uses self-attention and feed-forward networks, but its self-attention is masked so that each position can attend only to earlier positions. In addition, an "encoder-decoder attention" sub-layer lets each decoder position attend to the encoder's output. Together these allow the model to generate the output sequence one element at a time.

By relying on self-attention, the Transformer efficiently captures dependencies between words regardless of their distance in the sequence, enabling powerful performance on a variety of NLP tasks. 

