Monday, September 23, 2024

ProGen (2023) and conditional transformer model

 

https://www.nature.com/articles/s41587-022-01618-2 


ProGen (2023):

- Trained on 280 million protein sequences from over 19,000 families  

- Can generate functional protein sequences across diverse protein families

- Demonstrated ability to generate artificial lysozymes with catalytic activity

ProGen uses next-token prediction training, the standard approach for autoregressive language models. Here are the key points about ProGen's training approach:

  1. Next-token prediction: ProGen is trained using a self-supervision task of next-token prediction. As stated in the Salesforce Research Blog: "ProGen takes each training sample and formulates a guessing game per word, more precisely a self-supervision task of next-token prediction."
  2. Training objective: The model is trained to predict the probability of the next amino acid given the past amino acids in a protein sequence. From the NCBI article: "ProGen is trained to generate artificial sequences by minimizing the loss over the next amino acid prediction problem on the universal protein sequence dataset."
  3. Iterative process: The training involves multiple rounds where the model plays this prediction game for every amino acid in all protein sequences in the training dataset. The Salesforce blog mentions: "By the end of training, ProGen has become an expert at predicting the next amino acid by playing this game approximately 1 trillion times."
  4. Autoregressive generation: After training, ProGen can generate protein sequences in an autoregressive manner, predicting one amino acid at a time based on the previously generated sequence. As stated in the NCBI article: "ProGen is a decoder transformer tailored for autoregressive generation: it generates a sequence in a left-to-right manner, token-by-token, where the next token is conditioned on all previously generated tokens."

This next-token prediction approach allows ProGen to learn the patterns and relationships in protein sequences, enabling it to generate novel, functional protein sequences after training.
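To make this objective concrete, here is a minimal PyTorch sketch of next-token prediction over tokenized sequences. The toy model, vocabulary size and random batch are placeholders for illustration only, not ProGen's actual 1.2-billion-parameter implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 32   # placeholder: amino acids + special tokens + a handful of control tags

class ToyDecoder(nn.Module):
    """Stand-in for a decoder-only transformer that maps token ids to next-token logits."""
    def __init__(self, vocab_size=VOCAB_SIZE, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, token_ids):                    # (batch, seq_len)
        return self.head(self.embed(token_ids))      # (batch, seq_len, vocab_size)

def next_token_loss(model, token_ids):
    """Cross-entropy for predicting each token from the tokens that precede it."""
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

model = ToyDecoder()
fake_batch = torch.randint(0, VOCAB_SIZE, (4, 128))  # pretend these are tokenized protein sequences
loss = next_token_loss(model, fake_batch)
loss.backward()                                      # one round of the "guessing game"
```

Repeating this over every position of every sequence in the training set is what the Salesforce blog describes as playing the prediction game roughly a trillion times.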


ProGen uses a character-level tokenization approach for protein sequences. Here are the key aspects of ProGen's tokenization process:

  1. Amino Acid Tokens: Each individual amino acid in a protein sequence is treated as a separate token, so the model works at the finest granularity possible for protein sequences.
  2. Special Tokens: The tokenization process includes special tokens to represent the start and end of sequences. These tokens help the model understand the boundaries of protein sequences during training and generation.
  3. Control Tags: ProGen incorporates control tags as additional tokens. These tags are used to specify desired protein properties and provide context for the generation process. The model utilizes over 100,000 conditioning tags, which include information such as:
    • Organism taxonomic information
    • Molecular function
    • Cellular component
    • Biological process
    • Other relevant metadata
  4. Vocabulary Size: While the exact vocabulary size isn't specified in the search results, it's likely to include:
    • 20 standard amino acids
    • Special tokens (start, end)
    • Control tags
  5. Sequence Representation: A protein sequence is represented as a series of tokens, starting with control tags, followed by amino acid tokens, and ending with a special end token.
  6. No Subword Tokenization: Unlike some language models that use subword tokenization methods like Byte Pair Encoding (BPE) or WordPiece, ProGen operates directly on individual amino acids. This approach is well-suited for the discrete nature of protein sequences.
  7. Non-Standard Residues: While not explicitly documented for ProGen, character-level tokenization can also accommodate ambiguous or non-standard residue codes (such as X for unknown amino acids) if they appear in the input data.

This tokenization approach allows ProGen to work directly with the fundamental units of protein sequences (amino acids) while incorporating additional contextual information through control tags. This enables the model to learn and generate protein sequences with fine-grained control and accuracy.
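A toy version of this tokenization scheme might look like the sketch below. The tag strings, special tokens and index layout are assumptions made for illustration; they are not ProGen's actual vocabulary.

```python
# Toy character-level tokenizer in the spirit of the scheme described above.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"                      # 20 standard amino acids
SPECIAL_TOKENS = ["<start>", "<end>"]                     # <start> kept for completeness
CONTROL_TAGS = ["<tag:Homo sapiens>", "<tag:lysozyme>"]   # hypothetical conditioning tags

vocab = SPECIAL_TOKENS + CONTROL_TAGS + list(AMINO_ACIDS)
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def encode(tags, sequence):
    """Per point 5 above: control tags first, one token per amino acid, then the end token."""
    tokens = list(tags) + list(sequence) + ["<end>"]
    return [token_to_id[t] for t in tokens]

ids = encode(["<tag:Homo sapiens>", "<tag:lysozyme>"], "MKTAYIAKQR")
print(ids)   # [2, 3, 14, 12, ...] -- one integer per tag or amino acid, ending with <end>
```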

Here's an explanation of the conditional language transformer architecture used by ProGen:

  1. Architecture:
    • ProGen is based on a decoder-style Transformer variant.
    • It has 36 layers and 8 attention heads per layer.
    • The model has a total of 1.2 billion trainable parameters.
  2. Conditional Generation:
    • ProGen is a conditional language model, meaning it can generate protein sequences based on input control tags.
    • These control tags allow for steerable generation by specifying desired protein properties.
  3. Tokenization:
    • ProGen uses character-level tokenization for protein sequences.
    • Each amino acid is treated as a separate token.
    • Special tokens are used to represent the start and end of sequences.
    • Control tags are incorporated as additional tokens.
  4. Training Objective:
    • The model is trained using a next-token prediction task.
    • It learns to predict the probability of the next amino acid given the past amino acids in a sequence.
    • This is formulated as a self-supervised learning task.
  5. Training Data:
    • ProGen was trained on a dataset of 280 million protein sequences from over 19,000 protein families.
    • The training data includes associated metadata formulated as conditioning tags.
  6. Conditional Generation Process:
    • During generation, the model takes input control tags and any initial sequence.
    • It then generates the protein sequence in a left-to-right, autoregressive manner.
    • Each new amino acid is predicted based on all previously generated tokens and the input control tags.
  7. Control Tags:
    • ProGen utilizes over 100,000 conditioning tags.
    • These tags include information such as organism taxonomic data, molecular function, cellular component, biological process, and other relevant metadata.
  8. Attention Mechanism:
    • The model uses self-attention to capture comprehensive residue-residue interactions within protein sequences.

This architecture allows ProGen to learn the patterns and relationships in protein sequences while providing controllable generation capabilities through the use of conditioning tags. The large-scale training on diverse protein families enables the model to generate functional protein sequences across multiple families and functions.
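Putting the conditional generation process together, the sampling loop can be sketched as follows. This assumes any decoder-style model that returns per-position logits (for example, the ToyDecoder in the training sketch above); it is not ProGen's actual inference code.

```python
import torch

@torch.no_grad()
def generate(model, control_tag_ids, end_id, max_len=300, temperature=1.0):
    """Left-to-right sampling conditioned on a prefix of control-tag token ids."""
    ids = torch.tensor([control_tag_ids])                # shape (1, num_tags)
    for _ in range(max_len):
        logits = model(ids)[:, -1, :]                    # logits for the next position only
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)          # condition on everything generated so far
        if next_id.item() == end_id:                     # stop at the end-of-sequence token
            break
    return ids[0].tolist()

# e.g. generate(model, control_tag_ids=[2, 3], end_id=1) with the toy tokenizer above
```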

Decoder units are very common in autoregressive generative AI models. Here are the key points explaining why:

  1. Autoregressive generation: Autoregressive models generate output sequences one token at a time, conditioning each new token on the previously generated tokens. This process aligns well with the functionality of decoder units.
  2. Left-to-right generation: Decoder units are designed to generate sequences in a left-to-right manner, which is exactly how autoregressive models operate. As stated in the search results, ProGen "generates a sequence in a left-to-right manner, token-by-token, where the next token is conditioned on all previously generated tokens."
  3. Transformer architecture: Many modern autoregressive generative AI models are based on the Transformer architecture, specifically using the decoder part. For example, the GPT (Generative Pre-trained Transformer) family of models uses only the decoder portion of the original Transformer architecture.
  4. Self-attention mechanism: Decoder units in Transformer-based models use a masked self-attention mechanism, which prevents the model from attending to future positions. This ensures that predictions for a given position depend only on known outputs at previous positions, maintaining the autoregressive property.
  5. Language modeling: Decoder-only architectures have proven highly effective for language modeling tasks, which are fundamentally autoregressive in nature. Models like GPT use "only the decoder for autoregressive language modeling."
  6. Versatility: Decoder-only models have shown great success in various generative tasks beyond just text generation, including image synthesis and time-series prediction.

In summary, the decoder unit's ability to generate sequences in an autoregressive manner, combined with its effectiveness in capturing long-range dependencies through self-attention mechanisms, makes it a natural and common choice for autoregressive generative AI models.
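The masked self-attention mentioned in point 4 is what enforces this left-to-right constraint. A minimal single-head version, written here as an illustrative sketch rather than any particular model's implementation, shows how the causal mask hides future positions:

```python
import math
import torch

def causal_self_attention(x, w_q, w_k, w_v):
    """x: (batch, seq_len, dim); w_q/w_k/w_v: (dim, dim) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
    seq_len = x.size(1)
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))   # each position sees only itself and the past
    return torch.softmax(scores, dim=-1) @ v

x = torch.randn(2, 10, 16)                             # toy batch of embedded tokens
w_q, w_k, w_v = (torch.randn(16, 16) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)          # (2, 10, 16)
```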


ProGen and scGPT are both large language models designed for biological applications, but they have some key differences in their architecture, training data, and intended use cases. Here's a comparison of the two models:

Architecture

ProGen:

  • Uses a decoder-only transformer architecture
  • Has 36 layers and 8 attention heads per layer
  • Contains 1.2 billion trainable parameters

scGPT:

  • Uses an encoder-decoder transformer architecture
  • Specific layer and parameter counts are not provided in the search results

Training Data

ProGen:

  • Trained on 280 million protein sequences from over 19,000 protein families
  • Includes associated metadata as conditioning tags

scGPT:

  • Trained on single-cell RNA sequencing (scRNA-seq) data
  • Specific dataset size not mentioned in the search results

Tokenization

ProGen:

  • Uses character-level tokenization for protein sequences
  • Each amino acid is treated as a separate token
  • Incorporates control tags as additional tokens

scGPT:

  • Tokenization details not specified in the search results

Training Objective

ProGen:

  • Trained using next-token prediction (autoregressive generation)
  • Learns to predict the probability of the next amino acid in a sequence

scGPT:

  • Specific training objective not detailed in the search results

Application

ProGen:

  • Designed for generating functional protein sequences across diverse families
  • Can be used for protein engineering and design tasks

scGPT:

  • Focused on single-cell genomics applications
  • Used for tasks like cell type annotation, batch effect removal, and gene expression prediction

Conditional Generation

ProGen:

  • Uses control tags to enable conditional generation of protein sequences
  • Can generate sequences based on specified protein properties

scGPT:

  • Conditional generation capabilities not explicitly mentioned in the search results

Evaluation

ProGen:

  • Evaluated using NLP metrics like sample perplexity
  • Also assessed using bioinformatics and biophysics metrics

scGPT:

  • Evaluation metrics not specified in the search results

In summary, while both models apply language modeling techniques to biological data, ProGen is specialized for protein sequence generation, whereas scGPT is tailored for single-cell genomics tasks. Their architectures and training data reflect these different focuses within the broader field of computational biology.


The conditional token approach used in ProGen is similar to that used in conditional language models like CTRL (the Conditional Transformer Language model). Here are the key similarities:

  1. Conditional Generation: Both ProGen and CTRL use conditional tokens to steer the generation process. In ProGen, these are called "control tags", while CTRL refers to them as "control codes".
  2. Input Prepending: Both models prepend the conditional tokens to the input sequence. In the ProGen formulation, the model input x = [c; a] is the sequence formed by prepending a control tag sequence c to an amino acid sequence a.
  3. Diverse Control: Both models allow for diverse types of conditioning. ProGen uses over 100,000 conditioning tags that include information such as "organism taxonomic information, molecular function, cellular component, biological process, and more." Similarly, CTRL uses control codes for various attributes like style, content, and task specification.
  4. Training Objective: Both models incorporate the conditional tokens into their training objective. ProGen is "trained to generate artificial sequences by minimizing the loss over the next amino acid prediction problem" while considering the control tags.
  5. Transformer Architecture: Both ProGen and CTRL are based on the Transformer architecture, specifically using decoder-style models for autoregressive generation.
  6. Large-scale Pretraining: Both models are pretrained on large datasets. ProGen uses 280 million protein sequences, while CTRL was trained on a large corpus of internet text.
  7. Controllable Generation: After training, both models can generate sequences conditioned on the input control tokens, allowing for steerable generation.

The main difference lies in their application domains: ProGen is specialized for protein sequences, while CTRL is designed for natural language text. However, the underlying principle of using conditional tokens to control generation is similar in both models, demonstrating the versatility of this approach across different domains of sequence generation.
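The shared mechanism can be stated compactly. Writing c for the control tokens (tags or codes) and a for the generated sequence, both models prepend c and factorize the probability of the combined sequence autoregressively; the notation below is the standard conditional language-model formulation rather than a quote from either paper:

```latex
x = [c; a], \qquad
p(x) = \prod_{i=1}^{|x|} p(x_i \mid x_{<i}), \qquad
\mathcal{L} = -\sum_{i=1}^{|x|} \log p(x_i \mid x_{<i})
```

Training minimizes the negative log-likelihood over the dataset, and generation samples from p(x_i | x_<i) one token at a time after the control prefix.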




