
Wednesday, March 5, 2025

math formula between transformer encoder and decoder





The transformer architecture splits into an encoder and a decoder, and while both use the scaled dot‐product attention mechanism, they differ in how and where this mechanism is applied.


Scaled Dot-Product Attention (Common to Both)

At the core of both components is the scaled dot‐product attention defined as:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V

where:

  • Q is the matrix of queries,
  • K is the matrix of keys,
  • V is the matrix of values, and
  • d_k is the dimensionality of the keys.
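
To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention; the sequence length and dimensions below are toy values chosen only for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) raw attention scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # (n_q, d_v)

# toy example: 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```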

Encoder

In the encoder, every layer performs self-attention on the input sequence. Here, the queries, keys, and values are all derived from the same input x:

Q = xW^Q,\quad K = xW^K,\quad V = xW^V

Thus, the encoder’s self-attention is computed as:

\text{SelfAttention}_{\text{enc}}(x) = \text{softmax}\!\left(\frac{(xW^Q)(xW^K)^T}{\sqrt{d_k}}\right)(xW^V)

This mechanism allows each token in the input to attend to all other tokens, integrating contextual information across the entire sequence.


Decoder

The decoder is more complex because it has to generate an output sequence while incorporating information from the encoder. It uses two main attention mechanisms:

  1. Masked Self-Attention:

    The decoder first applies self-attention to its own previous outputs. To maintain the autoregressive property (i.e., ensuring a token only depends on earlier tokens), a mask M is applied. This mask typically sets the upper triangular part of the attention matrix to a very negative value, so that the softmax zeroes out any attention weights corresponding to future tokens. Formally:

    \text{MaskedAttention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V

    Here, Q, K, and V are derived from the decoder’s own input (previously generated tokens); both decoder attention steps are sketched in code after this list.

  2. Encoder-Decoder (Cross) Attention:

    After the masked self-attention, the decoder incorporates information from the encoder. In this step, the queries come from the decoder (from the output of the masked self-attention), while the keys and values come from the encoder’s final output. The formula is:

    \text{EncDecAttention}(Q_{\text{dec}}, K_{\text{enc}}, V_{\text{enc}}) = \text{softmax}\!\left(\frac{Q_{\text{dec}} K_{\text{enc}}^T}{\sqrt{d_k}}\right)V_{\text{enc}}

    This step allows the decoder to "look" at the input sequence and incorporate context from the encoder into the output generation.
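
The two decoder attention steps can be sketched in NumPy as follows; the causal mask construction and the toy shapes are illustrative assumptions rather than any particular implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, M=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if M is not None:
        scores = scores + M          # M adds a very negative value above the diagonal
    return softmax(scores) @ V

n_dec, n_enc, d = 5, 7, 8            # decoder length, encoder length, model dim (toy values)
rng = np.random.default_rng(0)
dec_x = rng.normal(size=(n_dec, d))    # stands in for the decoder input after the W^Q/W^K/W^V projections
enc_out = rng.normal(size=(n_enc, d))  # stands in for the encoder's final output

# 1. Masked self-attention: an upper-triangular mask blocks future positions
M = np.triu(np.full((n_dec, n_dec), -1e9), k=1)
self_out = attention(dec_x, dec_x, dec_x, M)

# 2. Cross-attention: queries from the decoder, keys/values from the encoder
cross_out = attention(self_out, enc_out, enc_out)
print(self_out.shape, cross_out.shape)  # (5, 8) (5, 8)
```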


Summary of Differences

  • Input Sources:

    • Encoder: Uses self-attention where Q, K, and V are all derived from the same input x.
    • Decoder: Uses two attention mechanisms: masked self-attention on its own output (with a mask M) and cross-attention that uses the encoder's outputs for K and V.
  • Masking:

    • Encoder: No masking is necessary; all tokens attend to each other.
    • Decoder: Uses a mask in the self-attention to prevent future tokens from being attended to, preserving the autoregressive property.
  • Attention Layers:

    • Encoder: A single self-attention layer per encoder block.
    • Decoder: Two sequential attention layers (masked self-attention followed by encoder-decoder attention) in each decoder block.

These differences in the attention formulas are key to enabling the decoder to generate coherent output sequences while leveraging the complete context provided by the encoder. 


Saturday, September 14, 2024

traffic prediction

 

https://arxiv.org/abs/2409.03282

Interpretable mixture of experts for time series prediction under recurrent and non-recurrent conditions


temporal fusion transformer (TFT)

Time Features:

  • Time of day (cyclic encoding)
  • Day of week (cyclic encoding)
  • Month of year (cyclic encoding)
  • Holiday indicator (binary)
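
A possible sketch of how these time features could be produced with pandas; the column names, timestamps, and holiday list are placeholder assumptions, not taken from the paper:

```python
import numpy as np
import pandas as pd

def cyclic_encode(values, period):
    """Map a periodic feature onto the unit circle so that, e.g., 23:00 sits next to 00:00."""
    angle = 2 * np.pi * values / period
    return np.sin(angle), np.cos(angle)

ts = pd.date_range("2024-01-01", periods=48, freq="h")   # hypothetical hourly timestamps
df = pd.DataFrame({"timestamp": ts})
df["hour_sin"], df["hour_cos"] = cyclic_encode(df.timestamp.dt.hour, 24)
df["dow_sin"], df["dow_cos"] = cyclic_encode(df.timestamp.dt.dayofweek, 7)
df["month_sin"], df["month_cos"] = cyclic_encode(df.timestamp.dt.month - 1, 12)
df["is_holiday"] = df.timestamp.dt.date.isin([pd.Timestamp("2024-01-01").date()]).astype(int)
print(df.head())
```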

The Temporal Fusion Transformer (TFT) is a deep learning model designed for multi-horizon time series forecasting, which means it predicts future values over multiple time steps. It is particularly useful in scenarios involving complex temporal dependencies and multiple input features. TFT is both powerful for prediction and interpretable, making it stand out among time series forecasting models.

Key Components of TFT:

  1. Variable Selection:

    • TFT dynamically selects relevant features (both static and time-varying) that are important for making predictions at each time step. This is done using a gated residual network which assigns importance weights to different features. It allows the model to focus on the most relevant inputs for prediction, enhancing interpretability.
  2. LSTM-based Encoder-Decoder:

    • TFT employs long short-term memory (LSTM) networks, a type of recurrent neural network, for encoding past data (in a context window) and decoding future data (in the prediction window). The LSTM captures temporal patterns from the input data, which are crucial for accurate forecasting.
  3. Multi-Head Attention:

    • One of the standout features of TFT is the use of multi-head attention, inspired by the Transformer model. This mechanism helps the model focus on different parts of the time series and various time steps. Attention helps identify important temporal dependencies, such as sudden changes or long-term trends, at multiple time points.
  4. Gating Mechanisms:

    • TFT uses gating mechanisms throughout the model to regulate how information flows through its layers. These gates help prevent irrelevant information from propagating forward, improving efficiency and reducing noise in predictions.
  5. Quantile Regression:

    • Instead of just predicting a single point estimate, TFT can output quantile predictions (e.g., predictions at the 10th, 50th, and 90th percentiles), making it possible to estimate uncertainties in the forecast. This is particularly helpful when making forecasts under uncertain or volatile conditions.
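
For the quantile outputs in point 5, here is a minimal sketch of the pinball (quantile) loss that such models are typically trained with; the traffic-speed numbers and forecasts are made up for illustration:

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Quantile loss: under-prediction is penalized with weight q, over-prediction with 1 - q."""
    error = y_true - y_pred
    return np.mean(np.maximum(q * error, (q - 1) * error))

y_true = np.array([55.0, 60.0, 42.0, 30.0])           # e.g. observed traffic speeds (mph)
preds = {0.1: np.array([48, 50, 35, 22]),             # hypothetical 10th/50th/90th percentile forecasts
         0.5: np.array([54, 58, 43, 29]),
         0.9: np.array([62, 68, 50, 38])}
total = sum(pinball_loss(y_true, y_pred, q) for q, y_pred in preds.items())
print(f"summed quantile loss: {total:.3f}")
```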

Interpretability in TFT:

TFT is designed with interpretability in mind. Two main methods of interpretation are:

  1. Feature Importance: TFT quantifies the importance of each input feature in predicting the target value. This allows users to understand which features, such as weather conditions, traffic incidents, or the time of day, play the most crucial role in predictions.
  2. Temporal Attention: By utilizing multi-head attention, TFT can show which time steps in the past (within the context window) are the most influential for making predictions at future time steps.

Why TFT is Suitable for Traffic Prediction:

  • Capturing Complex Temporal Dependencies: Traffic patterns often involve recurring trends (like rush hours) as well as non-recurring events (like accidents or severe weather). TFT’s attention mechanism helps capture both short-term and long-term dependencies between these events and traffic speed.
  • Interpretability: Understanding the factors that influence traffic speeds, such as weather or incidents, is crucial for decision-making. TFT’s interpretability allows for insights into how these features affect predictions in different conditions (recurrent vs. non-recurrent).
  • Multi-Source Inputs: TFT can efficiently handle multiple sources of data (like traffic incidents, weather conditions, etc.), making it well-suited for multi-variable prediction problems like traffic speed forecasting.

In this paper, TFT is used as the backbone for expert models in both recurrent and non-recurrent traffic prediction, benefiting from its ability to handle temporal dependencies and provide interpretability.

Thursday, September 12, 2024

transformer for time series prediction

 To use a Transformer for power prediction, one of the first steps is tokenization, which converts raw data (such as time series of power measurements) into a structured format that the Transformer can process. This approach, common in NLP tasks, can be adapted for time series data like power usage or generation. Here's how tokenization works and how it's applied in the context of power prediction:

Tokenization in Time Series for Power Prediction

In natural language processing, tokenization refers to breaking down a sentence into smaller units, like words or subwords (tokens), which the Transformer can process. When applied to time series data, tokenization involves transforming the sequence of numerical data points (e.g., power consumption or generation values at specific time intervals) into discrete "tokens" that the Transformer model can use to learn patterns.

Here’s a step-by-step breakdown of the tokenization process for time series data in power prediction:

1. Discretizing Continuous Data

Time series data, such as power consumption or generation values, is typically continuous (e.g., measured every minute, hour, or day). To apply a Transformer model, this continuous data needs to be represented as discrete tokens. There are a few strategies to accomplish this:

  • Binning or Quantization: Continuous power data is divided into intervals (bins), and each value is assigned a discrete category (token) based on the bin it falls into. For example, power usage values might be categorized as "low," "medium," and "high," and each category is assigned a token.

  • Value Ranges as Tokens: You could define a range of power values and assign a token to each range. For instance, values from 0-10 kW could be assigned token T1, values from 10-20 kW assigned T2, and so on.

  • Direct Numerical Encoding: Instead of binning, the actual power values could be encoded directly as tokens, allowing the Transformer to process the numeric sequence itself. In this case, the time-series data can be treated like sequences of numerical tokens without further quantization.
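
A tiny sketch of the binning/quantization strategy above; the bin edges and token labels are arbitrary assumptions:

```python
import numpy as np

power_kw = np.array([3.2, 8.7, 12.5, 19.9, 27.0, 41.3])   # hypothetical hourly readings

# Bin edges every 10 kW: 0-10 kW -> token T1, 10-20 kW -> token T2, and so on
bin_edges = np.arange(0, 60, 10)
tokens = np.digitize(power_kw, bin_edges)                  # 1-based bin index per value
print(tokens)                         # [1 1 2 2 3 5]
print([f"T{t}" for t in tokens])      # ['T1', 'T1', 'T2', 'T2', 'T3', 'T5']
```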

2. Incorporating Time as a Token

Power consumption and generation typically follow cyclical patterns, such as daily, weekly, and seasonal trends. Time information (e.g., time of day, day of the week, or season) is critical for accurate predictions. Thus, time-related features are often tokenized and incorporated into the Transformer model:

  • Time Stamps as Tokens: Each time step in the sequence (such as the hour of the day or the day of the week) can be encoded as a separate token. For example, 6 a.m. could be token T_time1, and 12 p.m. could be token T_time2.

  • Positional Encoding: The Transformer architecture uses positional encoding to capture the order of the time steps, since Transformers do not inherently understand sequence order (unlike recurrent neural networks). Positional encoding adds information about the position of each token in the sequence. For power prediction, this could represent how far a particular time step is from a reference point (like the beginning of a day or week).
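
For reference, here is a standard sinusoidal positional-encoding sketch in NumPy (the formulation from the original Transformer paper); the sequence length and model dimension are placeholder values:

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(n_positions)[:, None]          # (n, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2)
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = positional_encoding(n_positions=168, d_model=64)   # e.g. one week of hourly steps
print(pe.shape)                                         # (168, 64)
```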

3. Multi-Feature Tokenization

Power prediction often relies not only on the power usage values themselves but also on other contextual data, such as weather conditions, temperature, or grid data. Each of these additional features can be tokenized and input into the Transformer model:

  • Weather Conditions as Tokens: Data like temperature, wind speed, and solar radiation (relevant for renewable energy prediction) can be discretized into tokens. For example, temperature ranges (e.g., 10°C-20°C) could be assigned tokens T_temp1, T_temp2, etc.

  • Categorical Features as Tokens: Features like the type of day (weekday or weekend), holiday status, or operational settings for power plants can be treated as categorical variables and tokenized. Each category becomes a token that is fed into the model.

4. Encoding Sequential Data into Transformer Input

Once the raw power values, time steps, and other contextual data are tokenized, the next step is to encode them in a way that the Transformer can process. Each token is embedded into a high-dimensional vector space, where similar tokens (e.g., power usage patterns at similar times of day) are placed closer together. These embeddings, combined with positional encodings, are then input into the Transformer layers.

Example Tokenization Workflow for Power Prediction:

Let's consider a scenario where we want to predict hourly power consumption for the next 24 hours, given historical data from the past week.

  1. Time Series Data: The power usage values for each hour over the past 7 days (168 hours) are the primary data.

    • Tokenization: Each power usage value is transformed into a token using binning or direct numerical encoding.
  2. Time of Day Encoding: Each hour is associated with a token representing its time (e.g., T_hour_12 for 12:00 p.m.).

  3. Weather Data Encoding: For each hour, weather features (e.g., temperature, wind speed) are tokenized by discretizing them into ranges or using direct numerical encoding.

  4. Embedding and Positional Encoding: These tokens are then embedded into vectors, and positional encodings are added to preserve the order of the sequence (since time order is crucial in time series forecasting).

  5. Transformer Model Input: The Transformer receives the embedded sequence of tokens, which includes the historical power values, time of day, and weather conditions, and processes them through its layers to capture dependencies and make future predictions.
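
Putting steps 1-4 together, here is a rough NumPy sketch that tokenizes power values, hour of day, and temperature, embeds each token stream, and sums the embeddings into a single sequence for the Transformer; every size, bin count, and the additive fusion are assumptions made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, d_model = 168, 64                                       # one week of hourly history

# Step 1: tokenize power values by binning (10 bins assumed)
power_kw = rng.uniform(0, 100, size=n_steps)
power_tok = np.digitize(power_kw, np.linspace(0, 100, 11)) - 1   # token ids 0..9

# Steps 2-3: hour-of-day token and a coarse temperature token (5 bins assumed)
hour_tok = np.arange(n_steps) % 24
temp_tok = np.digitize(rng.uniform(-5, 35, size=n_steps), np.linspace(-5, 35, 6)) - 1

# Step 4: embed each token stream (random lookup tables stand in for learned embeddings) and sum
emb_power = rng.normal(size=(10, d_model))
emb_hour = rng.normal(size=(24, d_model))
emb_temp = rng.normal(size=(5, d_model))
x = emb_power[power_tok] + emb_hour[hour_tok] + emb_temp[temp_tok]
# positional encodings (see the earlier sketch) would be added to x before step 5
print(x.shape)                                                   # (168, 64) -> Transformer input
```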

Advantages of Tokenization for Power Prediction Using Transformers:

  1. Handling Long Sequences: Transformers, thanks to tokenization and self-attention mechanisms, are well-suited to handling long sequences of power data, such as hourly data over several weeks or months.

  2. Capturing Temporal Dependencies: The tokenization of both power values and time-related features helps the model capture daily, weekly, and seasonal cycles, which are important in power prediction tasks.

  3. Multimodal Fusion: Tokenization allows the inclusion of various input features like weather, time, and operational settings, enabling the Transformer to learn from multiple data sources simultaneously.

  4. Flexibility: Tokenization offers flexibility in how data is represented. By choosing appropriate discretization or embedding methods, different types of input data (e.g., continuous power values, categorical weather states) can be effectively handled.

Conclusion

In summary, tokenization is a critical preprocessing step when using Transformers for power prediction. It involves converting continuous power values, time-related features, and other contextual data into tokens that the Transformer can process. Through the use of embedding and positional encodings, Transformers can capture the complex temporal dependencies and external factors affecting power generation and consumption, making them highly effective for time series forecasting in power systems.


The use of Transformers for power prediction, particularly in time series forecasting, leverages the self-attention mechanism of Transformers to capture long-range dependencies within sequential data, making it well-suited for modeling complex relationships in power systems. Here’s a detailed description of how Transformers are used in power prediction:

1. Self-Attention for Temporal Dependencies

Transformers rely on a self-attention mechanism that allows them to focus on different parts of the input sequence to identify dependencies. In the context of power prediction, this means the Transformer can analyze how past power consumption or generation values influence future values, even if those dependencies span across long time intervals. For example, power consumption patterns might repeat daily, weekly, or seasonally, and Transformers can capture these repeating patterns over long time horizons.

2. Handling High Variability in Power Data

Power data, particularly from renewable energy sources like wind or solar, is highly variable and influenced by external factors like weather. Transformers are effective at modeling such complex, multi-factor systems because of their ability to weigh different time points and features based on importance. This allows them to understand when past data is most relevant to making accurate predictions, even when the input time series has fluctuating intervals.

3. Combining Multiple Input Modalities

For power prediction, transformers can take inputs from multiple sources (such as weather data, historical power generation, and grid data) and fuse them to create more informed predictions. This multimodal input processing is key when forecasting power generation from renewable energy sources that are highly dependent on weather conditions, such as solar or wind power.

4. Sequence-to-Sequence Architecture

In power prediction tasks, transformers can be used in a sequence-to-sequence (Seq2Seq) architecture, where the input sequence consists of past power consumption/generation values, and the output sequence represents future predictions. This Seq2Seq approach allows the model to generate multi-step forecasts, predicting power generation or consumption over several hours or days into the future.
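
A minimal PyTorch sketch of such a Seq2Seq setup using torch.nn.Transformer; the history length, forecast horizon, and the simple linear input/output projections are illustrative assumptions, not a reference implementation:

```python
import torch
import torch.nn as nn

d_model, n_hist, horizon = 64, 168, 24        # model width, 7 days of history, 24-step forecast

in_proj = nn.Linear(1, d_model)               # project scalar power values up to d_model
out_proj = nn.Linear(d_model, 1)              # project decoder states back to scalar forecasts
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(8, n_hist, 1)               # batch of 8 historical sequences
tgt = torch.randn(8, horizon, 1)              # decoder inputs (e.g. shifted targets during training)

tgt_mask = model.generate_square_subsequent_mask(horizon)   # causal mask for the decoder
out = model(in_proj(src), in_proj(tgt), tgt_mask=tgt_mask)
forecast = out_proj(out)                      # (8, 24, 1): multi-step power forecast
print(forecast.shape)
```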

5. Forecasting with Uncertainty

Transformers can be extended with Neural Processes (NPs) to provide not just point forecasts but also uncertainty estimates in their predictions. This is particularly important in power systems, where accurate uncertainty estimation can help grid operators make informed decisions about balancing supply and demand, ensuring the stability of the grid. Neural Processes bring in a probabilistic framework to model the uncertainty around predictions, enhancing the robustness of the forecast.

6. Fine-tuning for Power Systems

Transformers used for power prediction can be fine-tuned to handle domain-specific features of power systems. This includes adjusting the attention mechanism to give more weight to key variables like time of day (peak vs. non-peak hours), weather conditions (sunlight, wind speeds), and previous day’s power generation or consumption trends. The transformer’s architecture can be adapted to capture the periodicity in power usage, such as daily cycles, weekly patterns, or seasonal variations.

Example Application in Power Systems:

  1. Renewable Energy Forecasting: Transformers can be used to predict the power output of renewable energy sources, such as wind turbines or solar panels, by incorporating historical power data, weather forecasts, and other environmental variables. This helps in anticipating the variability of these power sources and planning for energy storage or backup generation.

  2. Load Forecasting: Power load forecasting is critical for balancing supply and demand in electrical grids. Transformers are effective at forecasting short-term and long-term load demand by identifying patterns in historical power usage data and predicting future consumption patterns.

Benefits of Using Transformers in Power Prediction:

  • Scalability: Transformers can handle large datasets and long sequences, which is useful for power systems that generate large volumes of data.
  • Interpretability: The attention mechanism in transformers can provide insights into which time steps or features were most important for the prediction, allowing for a better understanding of the underlying dynamics in power systems.
  • Flexibility: Transformers can process various forms of input data (historical data, external features like weather), making them versatile tools for complex forecasting tasks.

In summary, Transformers are increasingly being used in power prediction due to their ability to handle complex time series data, capture long-term dependencies, and incorporate uncertainty into forecasts, making them highly suitable for predicting dynamic and volatile power systems.

Tuesday, July 30, 2024

Decoders pay attention only to the words before them, as opposed to encoders, which pay attention to every word regardless of order.

https://www.turing.com/kb/brief-introduction-to-transformers-and-their-power#the-transformer-encoder

 

Decoders pay attention only to the words before them, as opposed to encoders, which pay attention to every word regardless of order. As a result, the prediction for the word at position i depends only on the words preceding it in the sequence.


encoder and decoder, BERT and GPT model

 

https://chatgpt.com/share/98a22c79-4119-4d07-ac17-9712301d7ae0


BERT and scGPT use different architectures (encoder-only and decoder-only, respectively) due to their specific pre-training objectives and intended applications. Here's an explanation of why each model uses its chosen architecture and how it relates to their pre-training:


BERT (Bidirectional Encoder Representations from Transformers):


1. Encoder-only architecture:

   - BERT uses only the encoder part of the transformer architecture.

   - This allows for bidirectional processing of input text, meaning it can look at both left and right context simultaneously.


2. Pre-training objective:

   - Masked Language Modeling (MLM): BERT is trained to predict masked tokens in a sentence, given both left and right context.

   - Next Sentence Prediction (NSP): It also predicts whether two sentences follow each other in the original text.


3. Relation to pre-training:

   - The encoder-only architecture is well-suited for these tasks because they require understanding the full context of a sentence or pair of sentences.

   - Bidirectional attention allows BERT to capture complex relationships between words in a sentence, which is crucial for predicting masked tokens accurately.


4. Intended use:

   - BERT is designed for tasks that require deep understanding of text, such as classification, named entity recognition, and question answering.

   - It's not meant for text generation tasks.


scGPT (Single-Cell Generative Pre-trained Transformer):


1. Decoder-only architecture:

   - scGPT uses only the decoder part of the transformer architecture.

   - This allows for autoregressive prediction, where each token is predicted based on the previous tokens.


2. Pre-training objective:

   - Generative pre-training: scGPT is trained to predict gene expression values in a forced sequential manner.

   - It uses a specialized attention masking technique to handle non-sequential single-cell data.


3. Relation to pre-training:

   - The decoder-only architecture is well-suited for this task because it allows the model to generate expression values one at a time, considering the previously generated values.

   - This approach enables the model to capture complex dependencies between genes and learn cell-type specific patterns.


4. Intended use:

   - scGPT is designed for tasks related to single-cell multi-omics data, such as cell type annotation, multi-omics integration, and perturbation prediction.

   - It can generate unknown gene expression values given known gene values or generate whole genome expression given a known cell type condition.


In summary, the choice of encoder-only for BERT and decoder-only for scGPT is directly related to their pre-training objectives and intended applications. BERT's bidirectional processing is ideal for understanding context in natural language, while scGPT's autoregressive approach is suitable for generating gene expression values in single-cell data. These architectural choices allow each model to excel in its specific domain and tasks.
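
One compact way to visualize this architectural difference is through the attention masks each style implies; below is a small NumPy sketch with a toy sequence length, not tied to either model's actual code:

```python
import numpy as np

n = 5  # toy sequence length

# Encoder-only (BERT-style): every position may attend to every other position
bidirectional_mask = np.ones((n, n), dtype=int)

# Decoder-only (GPT/scGPT-style): position i may attend only to positions <= i
causal_mask = np.tril(np.ones((n, n), dtype=int))

print(bidirectional_mask)
print(causal_mask)
```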


Citations:

[1] https://vaclavkosar.com/ml/Encoder-only-Decoder-only-vs-Encoder-Decoder-Transfomer

[2] https://datascience.stackexchange.com/questions/65241/why-is-the-decoder-not-a-part-of-bert-architecture

[3] https://ai.stackexchange.com/questions/41505/which-situation-will-helpful-using-encoder-or-decoder-or-both-in-transformer-mod

[4] https://www.linkedin.com/pulse/transformer-architectures-dummies-part-2-decoder-only-bhaskar-t-hj9xc

[5] https://www.reddit.com/r/MLQuestions/comments/l1eiuo/when_would_we_use_a_transformer_encoder_only/


No, scGPT's masked training approach is not similar to BERT's. While both models use masking techniques, they differ significantly in their implementation and purpose:


1. BERT's Masked Language Model (MLM):

   - BERT randomly masks tokens in the input sequence and predicts these masked tokens.

   - It uses bidirectional context, looking at both left and right contexts to predict the masked tokens.

   - This approach is designed for understanding the full context of text in natural language processing tasks.


2. scGPT's Specialized Attention Masking:

   - scGPT uses a unique attention masking technique designed specifically for non-sequential single-cell data[1].

   - It supports both gene-prompt and cell-prompt generations in a unified way.

   - The masking defines the order of prediction based on attention scores, not on sequential order as in text.

   - It allows attention computation only between embeddings of "known genes" and the query gene itself[1].


Key differences:


1. Purpose: BERT's masking is for understanding context in text, while scGPT's is for generating gene expression values in non-sequential data.


2. Directionality: BERT is bidirectional, while scGPT uses a form of forced sequential prediction in non-sequential data.


3. Prediction target: BERT predicts masked tokens, while scGPT predicts gene expression values.


4. Context use: BERT uses full context, while scGPT uses known genes to predict unknown genes iteratively.


5. Architecture: BERT is an encoder-only model, while scGPT is a decoder-only model, more similar to GPT architectures[1].


In summary, while both use masking, scGPT's approach is specifically tailored for single-cell genomic data and is fundamentally different from BERT's text-based masking approach.


Citations:

[1] https://www.biorxiv.org/content/10.1101/2023.04.30.538439v1.full

[2] https://www.reddit.com/r/deeplearning/comments/17gmtxr/bert_explained_training_masked_language_model/

[3] https://softteco.com/blog/bert-vs-chatgpt

[4] https://blog.invgate.com/gpt-3-vs-bert

[5] https://www.biorxiv.org/content/10.1101/2023.04.30.538439v2.full



Friday, July 19, 2024

Transformer and self attention

GPT 4o

Here's an explanation of the Transformer architecture, including the dimensions of the matrices involved:

Transformer Architecture

The Transformer consists of an encoder-decoder structure:

  • Encoder: Processes the input sequence.
  • Decoder: Generates the output sequence.

Each of these parts is made up of multiple layers that are identical but have different weights. The encoder and decoder layers consist of:

  1. Multi-Head Self-Attention Mechanism
  2. Feed-Forward Neural Network
  3. Add and Norm Layers

Self-Attention Mechanism

Self-attention allows the model to weigh the importance of different words in a sequence relative to each other. Here's how it's calculated:

  1. Input Vectors: The input words are first converted into vectors through an embedding layer. These vectors are denoted as X.

    • Dimension of X: (n, d_model), where n is the sequence length (number of words) and d_model is the dimensionality of the embedding space.
  2. Linear Projections: Three different linear projections are applied to the input vectors to generate the Query (Q), Key (K), and Value (V) matrices.

    Q = XW^Q,\quad K = XW^K,\quad V = XW^V

    Here, W^Q, W^K, and W^V are weight matrices learned during training.

    • Dimension of W^Q, W^K, W^V: (d_model, d_k)
    • Dimension of Q, K, V: (n, d_k)
  3. Scaled Dot-Product Attention:

    \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
    • QK^T: Computes the dot product between the Query and Key matrices to get the raw attention scores.
      • Dimension of QK^T: (n, n)
    • √d_k: The dot products are scaled by the square root of the dimension of the key vectors (d_k) to stabilize gradients.
    • softmax: Applies the softmax function to obtain attention weights that sum to 1.
    • V: The Value matrix is then weighted by these attention scores to produce the output.
      • Dimension of output: (n, d_k)
  4. Multi-Head Attention: Instead of performing a single self-attention operation, the Transformer employs multiple attention heads. Each head performs the attention mechanism independently with different learned projections and the results are concatenated and linearly transformed.

    \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
    • Dimension of each head head_i: (n, d_k)
    • Number of heads: h
    • Dimension after concatenation: (n, h·d_k)
    • Dimension of W^O: (h·d_k, d_model)
    • Dimension of output: (n, d_model)
  5. Feed-Forward Network: After the multi-head attention, the output passes through a feed-forward neural network, which consists of two linear transformations with a ReLU activation in between.

    \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
    • Dimension of W_1: (d_model, d_ff)
    • Dimension of W_2: (d_ff, d_model)
    • Dimension of b_1: (d_ff)
    • Dimension of b_2: (d_model)
  6. Add and Norm Layers: Each sub-layer (self-attention and feed-forward) has a residual connection around it, followed by layer normalization.

    \text{Output} = \text{LayerNorm}(x + \text{Sublayer}(x))
    • Dimension of output: (n, d_model)
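
Pulling items 1-6 together, here is a minimal single-head NumPy sketch of one encoder layer with the shapes annotated as above; n, d_model, d_k, and d_ff are toy values and the weights are random rather than learned:

```python
import numpy as np

n, d_model, d_k, d_ff = 10, 64, 64, 256
rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

X = rng.normal(size=(n, d_model))                   # (n, d_model) embedded inputs
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(d_k, d_model))               # single head, so h * d_k = d_k here
W_1, b_1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W_2, b_2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V                 # each (n, d_k)
attn = softmax(Q @ K.T / np.sqrt(d_k)) @ V          # (n, d_k)
x = layer_norm(X + attn @ W_O)                      # add & norm -> (n, d_model)
ffn = np.maximum(0, x @ W_1 + b_1) @ W_2 + b_2      # (n, d_model)
out = layer_norm(x + ffn)                           # add & norm -> (n, d_model)
print(out.shape)
```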

Putting It All Together

  1. Encoding: The input sequence is processed through multiple layers of self-attention and feed-forward networks. The output of the encoder layers is a set of continuous representations of the input sequence.

  2. Decoding: The decoder also uses self-attention and feed-forward networks. Additionally, it uses "encoder-decoder attention" to incorporate the encoder's output. This helps generate the output sequence one element at a time.

By relying on self-attention, the Transformer efficiently captures dependencies between words regardless of their distance in the sequence, enabling powerful performance on a variety of NLP tasks. 


Sunday, December 10, 2023

Longformer Self Attention

 

https://huggingface.co/docs/transformers/model_doc/longformer

Longformer Self Attention

Longformer self attention employs self attention on both a “local” context and a “global” context. Most tokens only attend “locally” to each other, meaning that each token attends to its ½w previous tokens and ½w succeeding tokens, with w being the window length as defined in config.attention_window. Note that config.attention_window can be of type List to define a different w for each layer. A selected few tokens attend “globally” to all other tokens, as is conventionally done for all tokens in BertSelfAttention.

Note that “locally” and “globally” attending tokens are projected by different query, key and value matrices. Also note that every “locally” attending token not only attends to tokens within its window, but also to all “globally” attending tokens so that global attention is symmetric.

The user can define which tokens attend “locally” and which tokens attend “globally” by setting the tensor global_attention_mask at run-time appropriately. All Longformer models employ the following logic for global_attention_mask:

  • 0: the token attends “locally”,
  • 1: the token attends “globally”
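
A small NumPy sketch of the resulting attention pattern, built from a global_attention_mask-style vector; the sequence length, window size, and the choice of which token is global are illustrative assumptions (in practice you pass global_attention_mask to the Longformer model rather than building this matrix yourself):

```python
import numpy as np

seq_len, w = 12, 4                         # toy sequence length and attention window (½w on each side)
global_attention_mask = np.zeros(seq_len, dtype=int)
global_attention_mask[0] = 1               # e.g. make the first ([CLS]-like) token attend globally

allowed = np.zeros((seq_len, seq_len), dtype=int)
for i in range(seq_len):
    lo, hi = max(0, i - w // 2), min(seq_len, i + w // 2 + 1)
    allowed[i, lo:hi] = 1                  # local: attend to ½w tokens on each side

g = np.where(global_attention_mask == 1)[0]
allowed[g, :] = 1                          # global tokens attend to all tokens ...
allowed[:, g] = 1                          # ... and all tokens attend to global tokens (symmetric)

print(allowed)                             # 1 = attention allowed, 0 = blocked
```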