Monday, September 23, 2024

ProtGPT2 (2022)


ProtGPT2 is a deep unsupervised language model for protein design


  • Transformer-based model trained on ~50 million protein sequences
  • Generates de novo protein sequences that follow natural amino acid propensities
  • Can sample unexplored regions of protein sequence space
Pretraining was a central part of ProtGPT2's development. Here are the key points:
  1. Pretraining Data: ProtGPT2 was pretrained on about 50 million non-annotated protein sequences spanning the entire protein space. Specifically, it used the UniRef50 database (version 2021_04) for pretraining.
  2. Unsupervised Learning: The pretraining was done in an unsupervised fashion, meaning the raw sequence data was used without including any functional annotations of the sequences.
  3. Training Objective: ProtGPT2 was trained with a causal language modeling objective, in which the model learns to predict the next token (an amino acid or oligomer) in the sequence; a minimal code sketch is given below.
  4. Model Architecture: ProtGPT2 is based on the GPT2 Transformer architecture and contains 36 layers with a model dimensionality of 1280, totaling 738 million parameters.
  5. Purpose of Pretraining: Through this pretraining process, ProtGPT2 learned an internal representation of proteins and became able to "speak" the protein language.
  6. Outcome: The pretraining enabled ProtGPT2 to generate de novo protein sequences that follow natural amino acid propensities and explore unseen regions of the protein space.

This pretraining allowed ProtGPT2 to learn the general patterns and statistics of protein sequences, which it can then apply to downstream tasks such as generating novel sequences out of the box or, after fine-tuning, designing members of specific protein families.
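To make the causal language-modeling objective concrete, here is a minimal sketch that scores a protein sequence by its next-token prediction loss. It assumes the transformers library and the publicly released nferruz/ProtGPT2 checkpoint on the HuggingFace Hub; the example sequence is arbitrary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Publicly released ProtGPT2 checkpoint on the HuggingFace Hub.
model_id = "nferruz/ProtGPT2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# The paper reports 36 layers, a model width of 1280, and ~738M parameters.
print(f"parameters: {sum(p.numel() for p in model.parameters()):,}")

# An arbitrary example sequence in single-letter amino acid code.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    # Passing input_ids as labels makes the library shift them internally,
    # so the loss is the mean negative log-likelihood of each next token.
    outputs = model(**inputs, labels=inputs["input_ids"])

print(f"per-token cross-entropy: {outputs.loss.item():.3f}")
```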


ProtGPT2 is a generative protein language model developed for de novo protein design. Here are the key details about the ProtGPT2 model:

  1. Architecture:
  • Based on the GPT2 Transformer architecture
  • Contains 36 layers with a model dimensionality of 1280
  • Has a total of 738 million parameters
  2. Training Data:
  • Pretrained on about 50 million non-annotated protein sequences
  • Used the UniRef50 database (version 2021_04) for training
  • Training was done on raw sequences without including FASTA headers or functional annotations
  3. Training Objective:
  • Used a causal language modeling objective
  • The model learns to predict the next token (amino acid or oligomer) in the sequence
  4. Tokenization:
  • Used the Byte Pair Encoding (BPE) algorithm for tokenization
  • Each token represents about four amino acids on average (see the tokenizer sketch after this list)
  5. Capabilities:
  • Can generate de novo protein sequences that follow natural amino acid propensities
  • Able to explore unseen regions of the protein sequence space
  • Can be used in a zero-shot fashion or fine-tuned on specific datasets
  6. Key Features:
  • Generates sequences with amino acid and disorder propensities similar to natural proteins
  • 88% of generated sequences are predicted to be globular, in line with natural proteins
  • Produces sequences that are evolutionarily distant from known proteins
  • Can generate sequences in seconds on standard workstations
  7. Availability:
  • The model and datasets are freely available on the HuggingFace Hub (a generation example appears at the end of this post)
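As a quick check of that tokenization behavior, the sketch below, again assuming the transformers library and the nferruz/ProtGPT2 checkpoint, tokenizes an arbitrary example sequence and reports the residue-to-token ratio, which should come out near 4:1.

```python
from transformers import AutoTokenizer

# ProtGPT2's BPE tokenizer, loaded from the HuggingFace Hub.
tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")

# An arbitrary example sequence in single-letter amino acid code.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"

tokens = tokenizer.tokenize(sequence)
print(tokens[:8])                                   # BPE tokens spanning several residues each
print(f"residues: {len(sequence)}, tokens: {len(tokens)}")
print(f"residues per token: {len(sequence) / len(tokens):.1f}")  # roughly 4 on average
```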

ProtGPT2 represents a significant advance in protein language modeling, allowing for efficient high-throughput protein engineering and design by generating novel protein sequences that follow the principles of natural proteins while exploring new areas of the protein space.
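As a usage illustration for the points above, the snippet below generates sequences zero-shot with the transformers text-generation pipeline. The checkpoint name and the sampling settings (top-k of 950, repetition penalty of 1.2) follow the authors' HuggingFace model card; treat them as a reasonable starting point rather than the only valid configuration.

```python
from transformers import pipeline

# Load the published ProtGPT2 checkpoint from the HuggingFace Hub.
protgpt2 = pipeline("text-generation", model="nferruz/ProtGPT2")

# "<|endoftext|>" separates sequences during training, so it serves as an
# unconditional prompt; sampling settings follow the authors' model card.
outputs = protgpt2(
    "<|endoftext|>",
    max_length=100,            # length in BPE tokens (~4 residues per token)
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=5,
    eos_token_id=0,
)

for out in outputs:
    # Generated text keeps FASTA-style line breaks; strip them for a flat sequence.
    print(out["generated_text"].replace("<|endoftext|>", "").replace("\n", ""))
```

Fine-tuning on a family-specific sequence set follows the standard causal language-modeling recipe in transformers; no ProtGPT2-specific training code is required.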



