Several language models and GPT-based tools are being developed and applied in genomics, some of them trained directly on nucleotide sequences. Here are some key examples:
1. GROVER (Genome Rules Obtained Via Extracted Representations):
- A DNA language model that learns sequence context in the human genome
- Uses a frequency-balanced vocabulary created through Byte Pair Encoding (BPE) on nucleotide sequences
- Employs a BERT-like architecture for masked token prediction
- Learns to capture important genomic features like GC content, AG content, and functional annotations[1]
GROVER uses a custom tokenization approach that incorporates k-mers but is more sophisticated than using fixed-length k-mers as tokens. Here are the key points about GROVER's tokenization:
- Byte Pair Encoding (BPE): GROVER uses a modified version of BPE adapted for genomic sequences to create its vocabulary.
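- Frequency-balanced vocabulary: The tokenization creates a frequency-balanced vocabulary. Vocabularies of up to 5,000 tokens are considered, while the final model uses a much smaller one (see "Vocabulary size" below).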
- Variable token lengths: Unlike fixed k-mer approaches, GROVER's tokens have variable lengths:
  - The average token length is 4.07 nucleotides.
  - Token lengths range from a 1-mer (a single guanine) to 16-mers (A16 and T16).
  - Most tokens in the dictionary are 5-mers and 6-mers, with 213 and 224 tokens, respectively.
- Heterogeneous representation: Not all possible k-mers are generated as tokens, as some smaller tokens may be combined into larger, more frequent combinations.
- Special tokens: GROVER includes special tokens like CLS, PAD, UNK, SEP, and MASK, in addition to the genomic sequence tokens.
- Vocabulary size: The final GROVER model uses a vocabulary of 601 tokens.
This approach allows GROVER to capture more nuanced patterns in genomic sequences compared to fixed k-mer tokenization. It creates a balance between capturing short, frequent patterns and longer, potentially meaningful sequences, while maintaining a manageable vocabulary size.
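The merging idea behind BPE on nucleotides can be illustrated with a minimal sketch. This is plain textbook BPE run on a toy string, not GROVER's modified version; the function name `train_bpe`, the toy sequence, and the number of merges are all assumptions chosen for illustration.

```python
from collections import Counter


def train_bpe(sequence, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent token pair.

    Returns the final tokenization of `sequence` and the learned merges.
    Illustrative only; GROVER's actual procedure is a modified BPE.
    """
    tokens = list(sequence)  # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])  # fuse the pair
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges


tokens, merges = train_bpe("ATATATGCGC", num_merges=2)
print(tokens)  # variable-length tokens covering the whole sequence
print(merges)  # merge rules in the order they were learned
```

Each merge adds one token to the vocabulary, so the number of merge cycles controls vocabulary size, and frequent short patterns get fused into longer tokens while rare k-mers never appear as tokens at all.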
2. Nucleotide Transformer:
- A self-supervised learning model trained on large amounts of unlabeled genomic data
- Uses a BERT-like architecture trained bidirectionally, so each masked token is predicted from the sequence context on both sides
- Trained on genomes from humans and other species, with tokens representing groups of nucleotides
- Acquires biological knowledge during pre-training, capturing patterns like regulatory elements and coding/non-coding regions
- Can be fine-tuned for specific genomic prediction tasks[2]
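As a rough sketch of what "tokens representing groups of nucleotides" and BERT-like masked training can look like in practice, the snippet below splits a sequence into fixed-size groups and hides some of them for prediction. The group size of 6, the handling of the leftover tail, the 15% masking rate, and the `<mask>` string are assumptions for illustration, not details confirmed by the source.

```python
import random


def kmer_tokenize(sequence, k=6):
    """Split a sequence into non-overlapping k-mers.

    Any leftover tail shorter than k is emitted as single nucleotides
    (an assumption made for this sketch).
    """
    cut = len(sequence) - len(sequence) % k
    tokens = [sequence[i:i + k] for i in range(0, cut, k)]
    tokens.extend(sequence[cut:])  # tail as individual characters
    return tokens


def mask_tokens(tokens, mask_rate=0.15, mask_token="<mask>", seed=0):
    """Randomly hide tokens; labels keep the original token where masked."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append(mask_token)
            labels.append(tok)   # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)  # position not scored
    return masked, labels


tokens = kmer_tokenize("ATGCGTACGTTAGC", k=6)
masked, labels = mask_tokens(tokens, mask_rate=0.5, seed=1)
print(tokens)
print(masked)
```

During pre-training, the model sees `masked` and is scored only on the positions where `labels` holds a token, which is how it can use context from both directions.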
3. GeneTuring:
- While not a model itself, it's a comprehensive question-answering database used to evaluate GPT models' performance in genomics
- Tests various GPT models, including GPT-3, ChatGPT, and specialized biomedical models like BioGPT and BioMedLM, on genomics-related tasks[3]
4. GeneGPT:
- A method that teaches large language models to use NCBI Web APIs for answering genomics questions
- Utilizes Codex (a code-generation model) with in-context learning to access and use genomic databases
- Achieves state-of-the-art performance on several genomics tasks in the GeneTuring benchmark[4]
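To make the idea of "using NCBI Web APIs" concrete, here is a small sketch of the kind of request URL involved. It only builds an E-utilities `esearch` URL and makes no network call; in GeneGPT the language model itself writes such calls, so the helper name `build_esearch_url`, the default parameters, and the example gene term are assumptions for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities web API.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, db="gene", retmax=5):
    """Build an esearch URL that looks up `term` in an NCBI database."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Hypothetical lookup for a gene alias question.
print(build_esearch_url("LMP10"))
```

The response to such a request is then fed back into the model's context so that it can compose the final answer from database results rather than from memory.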
These examples demonstrate the growing application of transformer language models in genomics. GROVER and Nucleotide Transformer leverage large-scale pre-training on nucleotide sequences to capture biological knowledge, while GeneTuring and GeneGPT evaluate and extend general-purpose GPT models on genomics questions. Together they show promise in areas such as predicting functional elements, understanding genetic variation, and answering complex genomics questions.
Citations:
[1] https://www.nature.com/articles/s42256-024-00872-0
[2] https://www.instadeep.com/2024/04/building-the-next-generation-of-ai-models-to-decipher-human-biology/
[3] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10054955/
[4] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10153281/
[5] https://codeocean.com/blog/map-of-foundational-models-for-use-in-biotech-and-pharma-rd
[6] https://openreview.net/pdf?id=m_wC36dmA3
[7] https://blog.salesforceairesearch.com/progen/
[8] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10400306/
 