Monday, September 23, 2024

https://toloka.ai/blog/transformer-architecture/






self-attention maps in gene-based transformer models

 

Based on the search results, self-attention maps offer several key advantages over traditional methods for analyzing gene interaction networks:


1. Dynamic and context-dependent relationships:

- Self-attention maps can capture complex, non-linear relationships between genes that change based on the specific cellular context or condition being analyzed.

- Traditional interaction networks are typically static and do not adapt to different contexts.


2. Long-range dependencies:

- Transformer architectures with self-attention can effectively model long-range dependencies between distant elements in the genome sequence.

- This allows capturing interactions between genes or regulatory elements that are far apart in the linear sequence, which is challenging for traditional methods.


3. Learning novel interactions:

- Self-attention maps are learned from data during training, potentially discovering novel relationships between genes that are not captured in existing interaction databases.

- This data-driven approach can reveal previously unknown interactions.


4. Improved prediction accuracy:

- Models using self-attention have demonstrated superior performance on tasks like gene expression prediction compared to previous approaches.

- For example, the Enformer model showed improved correlation between predictions and measured data relative to previous state-of-the-art models without self-attention.


5. Capturing regulatory relationships:

- Studies have shown that attention maps can reveal meaningful biological patterns like regulatory elements, coding vs non-coding regions, and gene expression relationships.

- The Enformer model, for instance, learned about the role of tissue-specific enhancers, promoters, and insulator elements.


6. Integration of multiple data types:

- Self-attention mechanisms can integrate information from multiple omics data types and experimental assays to build a more comprehensive view of gene interactions.


7. Interpretability:

- While more abstract than traditional networks, attention weights can be analyzed to understand which gene relationships the model deems important for a given prediction.

- This lets researchers see which regions of the input sequence the model weights most heavily for a given prediction, offering insight into sequence-function relationships (see the attention-map sketch after this list).


8. Parallel processing:

- Because self-attention operates on all sequence positions simultaneously rather than step by step, these models can be trained and run with a high degree of parallelism, greatly improving computational efficiency over recurrent, sequential processing.


In summary, self-attention maps offer a more flexible, data-driven approach to modeling gene interactions that can adapt to specific contexts, capture long-range dependencies, and potentially reveal novel biological insights beyond what is captured in static interaction networks.
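To make points 1 and 7 concrete, here is a minimal sketch of how a self-attention map can be pulled out of a transformer with the Hugging Face `transformers` library. The checkpoint identifier is a placeholder, not a specific gene model; substitute whichever DNA or gene language model is being analyzed.

```python
# Minimal sketch: extracting a self-attention map from a transformer encoder
# with the Hugging Face `transformers` API. The checkpoint name below is a
# placeholder -- substitute the DNA/gene language model you are using.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "your-org/your-dna-language-model"  # hypothetical identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, output_attentions=True)
model.eval()

sequence = "ACGTACGTGGCTAGCTAACGT"  # toy input; real inputs are far longer
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer, each shaped
# (batch, num_heads, seq_len, seq_len). Averaging over heads of the last
# layer gives a single token-by-token attention map that can be inspected
# for long-range or context-dependent interactions.
last_layer = outputs.attentions[-1]          # (1, heads, L, L)
attention_map = last_layer.mean(dim=1)[0]    # (L, L)
print(attention_map.shape)
```

Per-head maps (before averaging) can also be inspected individually, since different heads often specialize in different kinds of relationships.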


Citations:

[1] https://www.nature.com/articles/s41592-021-01252-x

[2] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10772984/

[3] https://academic.oup.com/bib/article/25/1/bbad467/7512647

[4] https://academic.oup.com/nar/article/49/13/e77/6266414

[5] https://www.biorxiv.org/content/10.1101/2023.04.30.538439v1.full

[6] https://www.instadeep.com/2024/04/building-the-next-generation-of-ai-models-to-decipher-human-biology/

[7] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10376273/

[8] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10400306/





nucleotide GPT models

 

There are several nucleotide-based GPT-style models being developed and applied in genomics. Here are some key examples:


1. GROVER (Genome Rules Obtained Via Extracted Representations):

- A DNA language model that learns sequence context in the human genome

- Uses a frequency-balanced vocabulary created through Byte Pair Encoding (BPE) on nucleotide sequences

- Employs a BERT-like architecture for masked token prediction

- Learns to capture important genomic features like GC content, AG content, and functional annotations[1]

https://www.nature.com/articles/s42256-024-00872-0

Based on the search results, GROVER (Genome Rules Obtained Via Extracted Representations) uses a custom tokenization approach that incorporates k-mers, but it is more sophisticated than simply using fixed-length k-mers as tokens. Here are the key points about GROVER's tokenization:

  1. Byte Pair Encoding (BPE): GROVER uses a modified version of BPE adapted for genomic sequences to create its vocabulary.
  2. Frequency-balanced vocabulary: The tokenization creates a frequency-balanced vocabulary of up to 5,000 tokens.
  3. Variable token lengths: Unlike fixed k-mer approaches, GROVER's tokens have variable lengths:
    • The average token length is 4.07 nucleotides.
    • Token lengths range from 1-mer (a single guanine) to 16-mers (A16 and T16).
    • Most tokens in the dictionary are 5-mers and 6-mers, with 213 and 224 tokens, respectively.
  4. Heterogeneous representation: Not all possible k-mers are generated as tokens, as some smaller tokens may be combined into larger, more frequent combinations.
  5. Special tokens: GROVER includes special tokens like CLS, PAD, UNK, SEP, and MASK, in addition to the genomic sequence tokens.
  6. Vocabulary size: The final GROVER model uses a vocabulary of 601 tokens.

This approach allows GROVER to capture more nuanced patterns in genomic sequences compared to fixed k-mer tokenization. It creates a balance between capturing short, frequent patterns and longer, potentially meaningful sequences, while maintaining a manageable vocabulary size.
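As a rough illustration of how BPE produces variable-length tokens from nucleotide sequences, here is a toy from-scratch sketch. It is not GROVER's implementation (which balances token frequencies and trains on the whole genome); it only shows the core merge loop.

```python
# Toy byte-pair-encoding sketch on nucleotide sequences. This is NOT the
# GROVER implementation -- just an illustration of how BPE merges frequent
# adjacent symbols into longer, variable-length tokens.
from collections import Counter

def learn_bpe_merges(sequence: str, num_merges: int):
    """Greedily merge the most frequent adjacent token pair num_merges times."""
    tokens = list(sequence)            # start from single nucleotides (1-mers)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best_pair, _ = pair_counts.most_common(1)[0]
        merges.append(best_pair)
        # Rebuild the token list with the chosen pair fused into one token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best_pair:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

dna = "ACGTACGTACGGGCGCGCTATATATACGT" * 10
merges, tokens = learn_bpe_merges(dna, num_merges=8)
print(merges)        # learned pair merges, e.g. ('A', 'C'), ('AC', 'G'), ...
print(set(tokens))   # variable-length vocabulary entries (1-mers up to longer k-mers)
```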



2. Nucleotide Transformer:

- A self-supervised learning model trained on large amounts of unlabeled genomic data

- Uses a BERT-like architecture with bidirectional attention, since informative sequence context in DNA lies both upstream and downstream of any given position (a minimal masked-token objective sketch appears below, after this list)

- Trained on genomes from humans and other species, with tokens representing groups of nucleotides

- Acquires biological knowledge during pre-training, capturing patterns like regulatory elements and coding/non-coding regions

- Can be fine-tuned for specific genomic prediction tasks[2]


3. GeneTuring:

- While not a model itself, it's a comprehensive question-answering database used to evaluate GPT models' performance in genomics

- Tests various GPT models, including GPT-3, ChatGPT, and specialized biomedical models like BioGPT and BioMedLM, on genomics-related tasks[3]


4. GeneGPT:

- A method that teaches large language models to use NCBI Web APIs for answering genomics questions

- Utilizes Codex (a code-generation model) with in-context learning to access and use genomic databases

- Achieves state-of-the-art performance on several genomics tasks in the GeneTuring benchmark[4]


These models demonstrate the growing application of GPT-like architectures in genomics, leveraging large-scale pre-training on nucleotide sequences to capture biological knowledge and perform various genomic tasks. They show promise in areas such as predicting functional elements, understanding genetic variations, and answering complex genomics questions.
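Several of these models (GROVER, Nucleotide Transformer) are pretrained with BERT-style masked-token prediction. Below is a minimal PyTorch sketch of that objective on fake token ids; the tiny encoder, the 15% masking rate, and the id assignments are illustrative stand-ins, not either model's actual configuration.

```python
# Minimal masked-token prediction objective in PyTorch, in the spirit of the
# BERT-style pretraining used by DNA language models. Shapes, ids, and the
# 15% masking rate are illustrative only.
import torch
import torch.nn as nn

vocab_size, mask_id = 601, 0                 # illustrative ids only
batch, seq_len, d_model = 4, 128, 256

token_ids = torch.randint(2, vocab_size, (batch, seq_len))   # fake tokenized DNA

# Choose ~15% of positions to mask and replace them with the MASK token.
mask = torch.rand(batch, seq_len) < 0.15
inputs = token_ids.masked_fill(mask, mask_id)

# Stand-in encoder: embedding + transformer encoder + vocabulary head.
embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
)
lm_head = nn.Linear(d_model, vocab_size)

logits = lm_head(encoder(embed(inputs)))                      # (batch, seq, vocab)

# Loss is computed only at masked positions: predict the original token there.
targets = token_ids.masked_fill(~mask, -100)                  # -100 = ignore index
loss = nn.functional.cross_entropy(
    logits.view(-1, vocab_size), targets.view(-1), ignore_index=-100
)
print(float(loss))
```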


Citations:

[1] https://www.nature.com/articles/s42256-024-00872-0

[2] https://www.instadeep.com/2024/04/building-the-next-generation-of-ai-models-to-decipher-human-biology/

[3] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10054955/

[4] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10153281/

[5] https://codeocean.com/blog/map-of-foundational-models-for-use-in-biotech-and-pharma-rd

[6] https://openreview.net/pdf?id=m_wC36dmA3

[7] https://blog.salesforceairesearch.com/progen/

[8] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10400306/



ProGen (2023) and conditional transformer model

 

https://www.nature.com/articles/s41587-022-01618-2 


ProGen (2023):

- Trained on 280 million protein sequences from over 19,000 families  

- Can generate functional protein sequences across diverse protein families

- Demonstrated ability to generate artificial lysozymes with catalytic activity

ProGen uses next token prediction training, which is a common approach for language models. Based on the information from the search results, here are the key points about ProGen's training approach:

  1. Next-token prediction: ProGen is trained using a self-supervision task of next-token prediction. As stated in the Salesforce Research Blog: "ProGen takes each training sample and formulates a guessing game per word, more precisely a self-supervision task of next-token prediction."
  2. Training objective: The model is trained to predict the probability of the next amino acid given the past amino acids in a protein sequence. From the NCBI article: "ProGen is trained to generate artificial sequences by minimizing the loss over the next amino acid prediction problem on the universal protein sequence dataset."
  3. Iterative process: The training involves multiple rounds where the model plays this prediction game for every amino acid in all protein sequences in the training dataset. The Salesforce blog mentions: "By the end of training, ProGen has become an expert at predicting the next amino acid by playing this game approximately 1 trillion times."
  4. Autoregressive generation: After training, ProGen can generate protein sequences in an autoregressive manner, predicting one amino acid at a time based on the previously generated sequence. As stated in the NCBI article: "ProGen is a decoder transformer tailored for autoregressive generation: it generates a sequence in a left-to-right manner, token-by-token, where the next token is conditioned on all previously generated tokens."

This next-token prediction approach allows ProGen to learn the patterns and relationships in protein sequences, enabling it to generate novel, functional protein sequences after training.
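A minimal PyTorch sketch of this next-token objective is shown below. The small decoder stack is a stand-in (not ProGen's 36-layer, 1.2-billion-parameter model); the point is the causal mask plus the shift-by-one cross-entropy loss.

```python
# Sketch of the next-token (autoregressive) prediction objective described
# above. A decoder-only model is implemented here as transformer encoder
# blocks with a causal mask; only the loss formulation is the point.
import torch
import torch.nn as nn

vocab_size, d_model, batch, seq_len = 32, 128, 4, 64   # e.g. amino acids + specials
token_ids = torch.randint(0, vocab_size, (batch, seq_len))

embed = nn.Embedding(vocab_size, d_model)
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
)
lm_head = nn.Linear(d_model, vocab_size)

# Additive causal mask: -inf above the diagonal so position t only sees <= t.
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

hidden = decoder(embed(token_ids), mask=causal_mask)
logits = lm_head(hidden)                                # (batch, seq, vocab)

# Shift by one: the logits at position t are scored against the token at t+1.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    token_ids[:, 1:].reshape(-1),
)
print(float(loss))
```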


ProGen uses a character-level tokenization approach for protein sequences. Here are the key aspects of ProGen's tokenization process:

  1. Amino Acid Tokens: Each individual amino acid in a protein sequence is treated as a separate token. This means the model works at the finest possible granularity for protein sequences, with each amino acid represented as its own token.
  2. Special Tokens: The tokenization process includes special tokens to represent the start and end of sequences. These tokens help the model understand the boundaries of protein sequences during training and generation.
  3. Control Tags: ProGen incorporates control tags as additional tokens. These tags are used to specify desired protein properties and provide context for the generation process. The model utilizes over 100,000 conditioning tags, which include information such as:
    • Organism taxonomic information
    • Molecular function
    • Cellular component
    • Biological process
    • Other relevant metadata
  4. Vocabulary Size: While the exact vocabulary size isn't specified in the search results, it's likely to include:
    • 20 standard amino acids
    • Special tokens (start, end)
    • Control tags
  5. Sequence Representation: A protein sequence is represented as a series of tokens, starting with control tags, followed by amino acid tokens, and ending with a special end token.
  6. No Subword Tokenization: Unlike some language models that use subword tokenization methods like Byte Pair Encoding (BPE) or WordPiece, ProGen operates directly on individual amino acids. This approach is well-suited for the discrete nature of protein sequences.
  7. Robustness to Rare Characters: Because every residue character is its own token, character-level tokenization avoids out-of-vocabulary issues; any symbol present in the training data, including non-standard amino acid codes, maps directly to a token.

This tokenization approach allows ProGen to work directly with the fundamental units of protein sequences (amino acids) while incorporating additional contextual information through control tags. This enables the model to learn and generate protein sequences with fine-grained control and accuracy.
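The following sketch shows, schematically, how a control-tag-prefixed protein sequence might be encoded into integer ids. The tag names, special tokens, and id assignments are invented for illustration and are not ProGen's actual vocabulary.

```python
# Illustrative encoding of a ProGen-style input: control tags prepended to a
# character-level amino acid sequence, plus start/end specials. Tag names and
# id assignments here are made up for the example.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIALS = ["<bos>", "<eos>", "<pad>"]
CONTROL_TAGS = ["<tag:lysozyme>", "<tag:bacteria>"]   # hypothetical tags

vocab = {tok: i for i, tok in enumerate(SPECIALS + CONTROL_TAGS + list(AMINO_ACIDS))}

def encode(sequence: str, tags: list[str]) -> list[int]:
    """Prepend control tags, wrap the residues in start/end tokens, map to ids."""
    tokens = tags + ["<bos>"] + list(sequence) + ["<eos>"]
    return [vocab[t] for t in tokens]

ids = encode("MKVLAA", ["<tag:lysozyme>", "<tag:bacteria>"])
print(ids)   # e.g. [3, 4, 0, 15, 13, ...]
```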

Based on the search results and the information provided, here's an explanation of the conditional language transformer used by ProGen:

  1. Architecture:
    • ProGen is based on a decoder-style Transformer variant.
    • It has 36 layers and 8 attention heads per layer.
    • The model has a total of 1.2 billion trainable parameters.
  2. Conditional Generation:
    • ProGen is a conditional language model, meaning it can generate protein sequences based on input control tags.
    • These control tags allow for steerable generation by specifying desired protein properties.
  3. Tokenization:
    • ProGen uses character-level tokenization for protein sequences.
    • Each amino acid is treated as a separate token.
    • Special tokens are used to represent the start and end of sequences.
    • Control tags are incorporated as additional tokens.
  4. Training Objective:
    • The model is trained using a next-token prediction task.
    • It learns to predict the probability of the next amino acid given the past amino acids in a sequence.
    • This is formulated as a self-supervised learning task.
  5. Training Data:
    • ProGen was trained on a dataset of 280 million protein sequences from over 19,000 protein families.
    • The training data includes associated metadata formulated as conditioning tags.
  6. Conditional Generation Process:
    • During generation, the model takes input control tags and any initial sequence.
    • It then generates the protein sequence in a left-to-right, autoregressive manner.
    • Each new amino acid is predicted based on all previously generated tokens and the input control tags.
  7. Control Tags:
    • ProGen utilizes over 100,000 conditioning tags.
    • These tags include information such as organism taxonomic data, molecular function, cellular component, biological process, and other relevant metadata.
  8. Attention Mechanism:
    • The model uses self-attention to capture comprehensive residue-residue interactions within protein sequences.

This architecture allows ProGen to learn the patterns and relationships in protein sequences while providing controllable generation capabilities through the use of conditioning tags. The large-scale training on diverse protein families enables the model to generate functional protein sequences across multiple families and functions.
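Schematically, the conditional generation process described above looks like the sketch below, where `model` is any stand-in callable returning next-token logits (not ProGen itself) and `control_tag_ids` are the ids of the prepended conditioning tags.

```python
# Schematic conditional generation loop: starting from control-tag tokens,
# repeatedly ask the model for next-token logits and append a sampled amino
# acid. `model` maps a (1, seq_len) id tensor to (1, seq_len, vocab) logits.
import torch

def generate(model, control_tag_ids, eos_id, max_len=100, temperature=1.0):
    ids = torch.tensor([control_tag_ids])            # (1, prefix_len)
    for _ in range(max_len):
        logits = model(ids)[:, -1, :] / temperature  # logits for the next position
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # sample one token
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == eos_id:                 # stop at end-of-sequence token
            break
    return ids[0].tolist()
```

Greedy decoding would take the argmax instead of sampling; the temperature controls how diverse the sampled sequences are.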

Decoder units are very common in autoregressive generative AI models. Here are the key points explaining why:

  1. Autoregressive generation: Autoregressive models generate output sequences one token at a time, conditioning each new token on the previously generated tokens. This process aligns well with the functionality of decoder units.
  2. Left-to-right generation: Decoder units are designed to generate sequences in a left-to-right manner, which is exactly how autoregressive models operate. As stated in the search results, ProGen "generates a sequence in a left-to-right manner, token-by-token, where the next token is conditioned on all previously generated tokens."
  3. Transformer architecture: Many modern autoregressive generative AI models are based on the Transformer architecture, specifically using the decoder part. For example, the GPT (Generative Pre-trained Transformer) family of models uses only the decoder portion of the original Transformer architecture.
  4. Self-attention mechanism: Decoder units in Transformer-based models use a masked self-attention mechanism, which prevents the model from attending to future positions. This ensures that predictions for a given position depend only on known outputs at previous positions, maintaining the autoregressive property.
  5. Language modeling: Decoder-only architectures have proven highly effective for language modeling tasks, which are fundamentally autoregressive in nature. Models like GPT use "only the decoder for autoregressive language modeling."
  6. Versatility: Decoder-only models have shown great success in various generative tasks beyond just text generation, including image synthesis and time-series prediction.

In summary, the decoder unit's ability to generate sequences in an autoregressive manner, combined with its effectiveness in capturing long-range dependencies through self-attention mechanisms, makes it a natural and common choice for autoregressive generative AI models.
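Point 4 (masked self-attention) can be made concrete with a short from-scratch sketch: attention scores for future positions are set to negative infinity before the softmax, producing a lower-triangular attention pattern. This is a single-head, illustrative version, not any particular model's implementation.

```python
# From-scratch sketch of masked (causal) self-attention: scores for future
# positions are set to -inf before the softmax, so each position can only
# attend to itself and earlier positions.
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_head)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)              # (seq_len, seq_len)
    # Upper-triangular entries correspond to attending to future tokens.
    causal_mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal_mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)                  # rows sum to 1
    return weights @ v, weights

seq_len, d_model, d_head = 6, 16, 8
x = torch.randn(seq_len, d_model)
w = [torch.randn(d_model, d_head) for _ in range(3)]
out, attn = causal_self_attention(x, *w)
print(attn)   # lower-triangular attention pattern
```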


ProGen and scGPT are both large language models designed for biological applications, but they have some key differences in their architecture, training data, and intended use cases. Here's a comparison of the two models:

Architecture

ProGen:

  • Uses a decoder-only transformer architecture
  • Has 36 layers and 8 attention heads per layer
  • Contains 1.2 billion trainable parameters

scGPT:

  • Uses an encoder-decoder transformer architecture
  • Specific layer and parameter counts are not provided in the search results

Training Data

ProGen:

  • Trained on 280 million protein sequences from over 19,000 protein families
  • Includes associated metadata as conditioning tags

scGPT:

  • Trained on single-cell RNA sequencing (scRNA-seq) data
  • Specific dataset size not mentioned in the search results

Tokenization

ProGen:

  • Uses character-level tokenization for protein sequences
  • Each amino acid is treated as a separate token
  • Incorporates control tags as additional tokens

scGPT:

  • Tokenization details not specified in the search results

Training Objective

ProGen:

  • Trained using next-token prediction (autoregressive generation)
  • Learns to predict the probability of the next amino acid in a sequence

scGPT:

  • Specific training objective not detailed in the search results

Application

ProGen:

  • Designed for generating functional protein sequences across diverse families
  • Can be used for protein engineering and design tasks

scGPT:

  • Focused on single-cell genomics applications
  • Used for tasks like cell type annotation, batch effect removal, and gene expression prediction

Conditional Generation

ProGen:

  • Uses control tags to enable conditional generation of protein sequences
  • Can generate sequences based on specified protein properties

scGPT:

  • Conditional generation capabilities not explicitly mentioned in the search results

Evaluation

ProGen:

  • Evaluated using NLP metrics like sample perplexity
  • Also assessed using bioinformatics and biophysics metrics

scGPT:

  • Evaluation metrics not specified in the search results

In summary, while both models apply language modeling techniques to biological data, ProGen is specialized for protein sequence generation, while scGPT is tailored for single-cell genomics tasks. Their architectures and training data reflect these different focuses within the broader field of computational biology.


The conditional token approach used in ProGen is similar to that used in conditional language models like CTRL (Conditional Transformer Language). Here are the key similarities:

  1. Conditional Generation: Both ProGen and CTRL use conditional tokens to steer the generation process. In ProGen, these are called "control tags", while CTRL refers to them as "control codes".
  2. Input Prepending: Both models prepend the conditional tokens to the input sequence. In ProGen, the training input is formed by prepending the control-tag sequence to the amino acid sequence, so the tags condition every subsequent prediction.
  3. Diverse Control: Both models allow for diverse types of conditioning. ProGen uses over 100,000 conditioning tags that include information such as "organism taxonomic information, molecular function, cellular component, biological process, and more." Similarly, CTRL uses control codes for various attributes like style, content, and task specification.
  4. Training Objective: Both models incorporate the conditional tokens into their training objective. ProGen is "trained to generate artificial sequences by minimizing the loss over the next amino acid prediction problem" while considering the control tags.
  5. Transformer Architecture: Both ProGen and CTRL are based on the Transformer architecture, specifically using decoder-style models for autoregressive generation.
  6. Large-scale Pretraining: Both models are pretrained on large datasets. ProGen uses 280 million protein sequences, while CTRL was trained on a large corpus of internet text.
  7. Controllable Generation: After training, both models can generate sequences conditioned on the input control tokens, allowing for steerable generation.

The main difference lies in their application domains: ProGen is specialized for protein sequences, while CTRL is designed for natural language text. However, the underlying principle of using conditional tokens to control generation is similar in both models, demonstrating the versatility of this approach across different domains of sequence generation.