Monday, September 23, 2024

ProtGPT2 (2022)


ProtGPT2 is a deep unsupervised language model for protein design


  • Transformer-based model trained on ~50 million protein sequences
  • Generates de novo protein sequences that follow natural amino acid propensities
  • Can sample unexplored regions of protein sequence space
Pretraining was a central part of ProtGPT2's development. Here are the key points:
  1. Pretraining Data: ProtGPT2 was pretrained on about 50 million non-annotated protein sequences spanning the entire protein space. Specifically, it used the UniRef50 database (version 2021_04) for pretraining.
  2. Unsupervised Learning: The pretraining was done in an unsupervised fashion, meaning the raw sequence data was used without including any functional annotations of the sequences.
  3. Training Objective: ProtGPT2 was trained with a causal language modeling objective, in which the model learns to predict the next token (an amino acid or oligomer) in the sequence; a minimal code sketch is given below.
  4. Model Architecture: ProtGPT2 is based on the GPT2 Transformer architecture and contains 36 layers with a model dimensionality of 1280, totaling 738 million parameters.
  5. Purpose of Pretraining: Through this pretraining process, ProtGPT2 learned an internal representation of proteins and became able to "speak" the protein language.
  6. Outcome: The pretraining enabled ProtGPT2 to generate de novo protein sequences that follow natural amino acid propensities and explore unseen regions of the protein space.

This pretraining allowed ProtGPT2 to learn the general patterns and statistics of protein sequences, which it can then apply to downstream tasks such as generating novel sequences out of the box or, after fine-tuning, designing members of specific protein families.
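To make the causal language-modeling objective concrete, here is a minimal sketch that scores a protein sequence by its next-token prediction loss. It assumes the transformers library and the publicly released nferruz/ProtGPT2 checkpoint on the HuggingFace Hub; the example sequence is arbitrary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Publicly released ProtGPT2 checkpoint on the HuggingFace Hub.
model_id = "nferruz/ProtGPT2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# The paper reports 36 layers, a model width of 1280, and ~738M parameters.
print(f"parameters: {sum(p.numel() for p in model.parameters()):,}")

# An arbitrary example sequence in single-letter amino acid code.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    # Passing input_ids as labels makes the library shift them internally,
    # so the loss is the mean negative log-likelihood of each next token.
    outputs = model(**inputs, labels=inputs["input_ids"])

print(f"per-token cross-entropy: {outputs.loss.item():.3f}")
```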


ProtGPT2 is a generative protein language model developed for de novo protein design. Here are the key details about the ProtGPT2 model:

  1. Architecture:
  • Based on the GPT2 Transformer architecture
  • Contains 36 layers with a model dimensionality of 1280
  • Has a total of 738 million parameters
  2. Training Data:
  • Pretrained on about 50 million non-annotated protein sequences
  • Used the UniRef50 database (version 2021_04) for training
  • Training was done on raw sequences without including FASTA headers or functional annotations
  3. Training Objective:
  • Used a causal language modeling objective
  • The model learns to predict the next token (amino acid or oligomer) in the sequence
  4. Tokenization:
  • Used the Byte Pair Encoding (BPE) algorithm for tokenization
  • Each token represents about four amino acids on average (see the tokenizer sketch after this list)
  5. Capabilities:
  • Can generate de novo protein sequences that follow natural amino acid propensities
  • Able to explore unseen regions of the protein sequence space
  • Can be used in a zero-shot fashion or fine-tuned on specific datasets
  6. Key Features:
  • Generates sequences with amino acid and disorder propensities similar to natural proteins
  • 88% of generated sequences are predicted to be globular, in line with natural proteins
  • Produces sequences that are evolutionarily distant from known proteins
  • Can generate sequences in seconds on standard workstations
  7. Availability:
  • The model and datasets are freely available on the HuggingFace Hub (a generation example appears at the end of this post)
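As a quick check of that tokenization behavior, the sketch below, again assuming the transformers library and the nferruz/ProtGPT2 checkpoint, tokenizes an arbitrary example sequence and reports the residue-to-token ratio, which should come out near 4:1.

```python
from transformers import AutoTokenizer

# ProtGPT2's BPE tokenizer, loaded from the HuggingFace Hub.
tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")

# An arbitrary example sequence in single-letter amino acid code.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"

tokens = tokenizer.tokenize(sequence)
print(tokens[:8])                                   # BPE tokens spanning several residues each
print(f"residues: {len(sequence)}, tokens: {len(tokens)}")
print(f"residues per token: {len(sequence) / len(tokens):.1f}")  # roughly 4 on average
```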

ProtGPT2 represents a significant advance in protein language modeling, allowing for efficient high-throughput protein engineering and design by generating novel protein sequences that follow the principles of natural proteins while exploring new areas of the protein space.
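As a usage illustration for the points above, the snippet below generates sequences zero-shot with the transformers text-generation pipeline. The checkpoint name and the sampling settings (top-k of 950, repetition penalty of 1.2) follow the authors' HuggingFace model card; treat them as a reasonable starting point rather than the only valid configuration.

```python
from transformers import pipeline

# Load the published ProtGPT2 checkpoint from the HuggingFace Hub.
protgpt2 = pipeline("text-generation", model="nferruz/ProtGPT2")

# "<|endoftext|>" separates sequences during training, so it serves as an
# unconditional prompt; sampling settings follow the authors' model card.
outputs = protgpt2(
    "<|endoftext|>",
    max_length=100,            # length in BPE tokens (~4 residues per token)
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=5,
    eos_token_id=0,
)

for out in outputs:
    # Generated text keeps FASTA-style line breaks; strip them for a flat sequence.
    print(out["generated_text"].replace("<|endoftext|>", "").replace("\n", ""))
```

Fine-tuning on a family-specific sequence set follows the standard causal language-modeling recipe in transformers; no ProtGPT2-specific training code is required.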



