Tuesday, July 30, 2024


https://www.turing.com/kb/brief-introduction-to-transformers-and-their-power#the-transformer-encoder

 

Decoders attend only to the words before them, whereas encoders attend to every word in the sequence, both earlier and later. As a result, a decoder's prediction for the word at position i depends only on the words preceding it in the sequence.
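A minimal sketch of the difference, written as attention masks in plain Python. The function name `attention_mask` is made up for illustration. It follows the common convention that a decoder position may see itself and everything earlier, so the output at one position can be used to predict the next word.

```python
def attention_mask(n_tokens, causal):
    """Return an n x n matrix; mask[i][j] is True if position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(n_tokens)]
            for i in range(n_tokens)]

encoder_mask = attention_mask(4, causal=False)  # every position sees every position
decoder_mask = attention_mask(4, causal=True)   # position i sees only positions 0..i

# Print the decoder mask: "x" = allowed, "." = blocked.
for row in decoder_mask:
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

The lower-triangular shape is what makes a decoder's prediction independent of later words.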


scGPT

 

scGPT determines "known" and "unknown" genes in the following ways:


1. During Training:

   - A random proportion of genes are selected as "unknown" genes.

   - The expression values of these "unknown" genes are omitted from the input.

   - The remaining genes are considered "known" genes.


2. Attention Masking:

   - The model uses a specialized attention mask that only allows attention computation between the "known" genes and the query gene itself.

   - "Unknown" genes cannot attend to other "unknown" genes.


3. Iterative Prediction Process:

   - The model predicts expression values for "unknown" genes based on information from "known" genes.

   - During inference, the process is conducted in K iterative steps.


4. Confidence-based Selection:

   - In each iteration, scGPT selects the top 1/K genes from the "unknown" set with the highest prediction confidence.

   - These newly predicted genes become "known" genes for the next iteration.


5. Auto-regressive Generation:

   - This process creates a form of auto-regressive generation for non-sequential data.

   - Gene expressions predicted with highest confidence are generated first and then used to help predict subsequent genes.


6. Cell-prompt vs. Gene-prompt Generation:

   - For cell-prompt generation, it starts with a cell embedding representing the cell type condition.

   - For gene-prompt generation, it begins with a set of known genes with observed expression values.


This approach allows scGPT to handle the non-sequential nature of single-cell data while still leveraging the power of transformer models for prediction tasks.
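The iterative, confidence-based process in points 3 to 5 can be sketched roughly as follows. This is not scGPT's code: `iterative_generation`, `toy_predict`, and `TOY_CONFIDENCE` are hypothetical names, the predictor is a toy stand-in for the model, and "top 1/K" is assumed to mean 1/K of the original unknown set so that the loop finishes in K rounds.

```python
import math

def iterative_generation(known, unknown, predict, k_steps):
    """Fill in `unknown` genes over k_steps rounds, most confident first.

    known:   dict gene -> expression value
    unknown: list of gene names to generate
    predict: hypothetical callable (known, gene) -> (value, confidence)
    """
    known = dict(known)
    remaining = list(unknown)
    per_step = math.ceil(len(remaining) / k_steps)  # assumed: 1/K of the original set
    order = []
    while remaining:
        # Predict every remaining gene from the currently known genes only.
        scored = {g: predict(known, g) for g in remaining}
        ranked = sorted(remaining, key=lambda g: scored[g][1], reverse=True)
        for g in ranked[:per_step]:
            known[g] = scored[g][0]  # promote to "known" for the next round
            order.append(g)
        remaining = [g for g in remaining if g not in known]
    return known, order

# Toy stand-in for the model: value = mean of known values, fixed confidences.
TOY_CONFIDENCE = {"G3": 0.9, "G4": 0.2, "G5": 0.6, "G6": 0.4}

def toy_predict(known, gene):
    value = sum(known.values()) / len(known)
    return value, TOY_CONFIDENCE[gene]

filled, order = iterative_generation(
    {"G1": 2.0, "G2": 4.0}, ["G3", "G4", "G5", "G6"], toy_predict, k_steps=2
)
print(order)  # ['G3', 'G5', 'G6', 'G4']
```

With K = 2, the two most confident genes (G3, G5) are generated in the first round and then serve as context for G6 and G4 in the second.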


Citations:

[1] https://twitter.com/simocristea/status/1676323087959179264

[2] https://www.linkedin.com/pulse/new-generative-ai-tool-predicts-gene-expression-single-colangelo-x7ebf

[3] https://www.the-scientist.com/a-new-ai-tool-predicts-gene-expression-in-a-single-cell-71295

[4] https://www.biorxiv.org/content/10.1101/2023.04.30.538439v2.full

[5] https://www.biorxiv.org/content/10.1101/2023.04.30.538439v1.full


encoder and decoder, BERT and GPT model

 

https://chatgpt.com/share/98a22c79-4119-4d07-ac17-9712301d7ae0


BERT and scGPT use different architectures (encoder-only and decoder-only, respectively) due to their specific pre-training objectives and intended applications. Here's an explanation of why each model uses its chosen architecture and how it relates to their pre-training:


BERT (Bidirectional Encoder Representations from Transformers):


1. Encoder-only architecture:

   - BERT uses only the encoder part of the transformer architecture.

   - This allows for bidirectional processing of input text, meaning it can look at both left and right context simultaneously.


2. Pre-training objective:

   - Masked Language Modeling (MLM): BERT is trained to predict masked tokens in a sentence, given both left and right context.

   - Next Sentence Prediction (NSP): It also predicts whether two sentences follow each other in the original text.


3. Relation to pre-training:

   - The encoder-only architecture is well-suited for these tasks because they require understanding the full context of a sentence or pair of sentences.

   - Bidirectional attention allows BERT to capture complex relationships between words in a sentence, which is crucial for predicting masked tokens accurately.


4. Intended use:

   - BERT is designed for tasks that require deep understanding of text, such as classification, named entity recognition, and question answering.

   - It's not meant for text generation tasks.
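A simplified sketch of how MLM training inputs can be built. The helper `mask_tokens` is hypothetical. The original BERT recipe masks about 15% of tokens and sometimes substitutes a random or unchanged token; this sketch always substitutes `[MASK]` to keep the idea visible.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, labels); labels hold the original token at masked positions."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            labels.append(tok)    # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)   # position is not scored
    return masked, labels

sentence = "the cell expresses many genes at once".split()
masked, labels = mask_tokens(sentence, mask_rate=0.3, seed=1)
print(masked)
print(labels)
```

Because the encoder sees the whole masked sentence at once, each `[MASK]` is predicted from both its left and right neighbors.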


scGPT (Single-Cell Generative Pre-trained Transformer):


1. Decoder-only architecture:

   - scGPT uses only the decoder part of the transformer architecture.

   - This allows for autoregressive prediction, where each token is predicted based on the previous tokens.


2. Pre-training objective:

   - Generative pre-training: scGPT is trained to predict gene expression values in a forced sequential manner.

   - It uses a specialized attention masking technique to handle non-sequential single-cell data.


3. Relation to pre-training:

   - The decoder-only architecture is well-suited for this task because it allows the model to generate expression values one at a time, considering the previously generated values.

   - This approach enables the model to capture complex dependencies between genes and learn cell-type specific patterns.


4. Intended use:

   - scGPT is designed for tasks related to single-cell multi-omics data, such as cell type annotation, multi-omics integration, and perturbation prediction.

   - It can generate unknown gene expression values given known gene values or generate whole genome expression given a known cell type condition.


In summary, the choice of encoder-only for BERT and decoder-only for scGPT is directly related to their pre-training objectives and intended applications. BERT's bidirectional processing is ideal for understanding context in natural language, while scGPT's autoregressive approach is suitable for generating gene expression values in single-cell data. These architectural choices allow each model to excel in its specific domain and tasks.


Citations:

[1] https://vaclavkosar.com/ml/Encoder-only-Decoder-only-vs-Encoder-Decoder-Transfomer

[2] https://datascience.stackexchange.com/questions/65241/why-is-the-decoder-not-a-part-of-bert-architecture

[3] https://ai.stackexchange.com/questions/41505/which-situation-will-helpful-using-encoder-or-decoder-or-both-in-transformer-mod

[4] https://www.linkedin.com/pulse/transformer-architectures-dummies-part-2-decoder-only-bhaskar-t-hj9xc

[5] https://www.reddit.com/r/MLQuestions/comments/l1eiuo/when_would_we_use_a_transformer_encoder_only/


No, scGPT's masked training approach is not similar to BERT's. While both models use masking techniques, they differ significantly in their implementation and purpose:


1. BERT's Masked Language Model (MLM):

   - BERT randomly masks tokens in the input sequence and predicts these masked tokens.

   - It uses bidirectional context, looking at both left and right contexts to predict the masked tokens.

   - This approach is designed for understanding the full context of text in natural language processing tasks.


2. scGPT's Specialized Attention Masking:

   - scGPT uses a unique attention masking technique designed specifically for non-sequential single-cell data[1].

   - It supports both gene-prompt and cell-prompt generations in a unified way.

   - The masking defines the order of prediction based on attention scores, not on sequential order as in text.

   - It allows attention computation only between embeddings of "known genes" and the query gene itself[1].
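The masking rule in the last bullet can be written down as a small sketch. The function name `known_gene_mask` and the gene labels are illustrative, not taken from scGPT: a query gene may attend to any known gene and to itself, but never to another unknown gene.

```python
def known_gene_mask(is_known):
    """mask[i][j] is True if query gene i may attend to gene j."""
    n = len(is_known)
    return [[is_known[j] or i == j for j in range(n)] for i in range(n)]

genes = ["G1", "G2", "G3", "G4"]
is_known = [True, True, False, False]  # G3 and G4 are "unknown"
mask = known_gene_mask(is_known)

# Print one row per query gene: "x" = allowed, "." = blocked.
for gene, row in zip(genes, mask):
    print(gene, "".join("x" if allowed else "." for allowed in row))
# G1 xx..
# G2 xx..
# G3 xxx.
# G4 xx.x
```

Unlike a causal text mask, the pattern depends on which genes are known, not on their position in the input.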


Key differences:


1. Purpose: BERT's masking is for understanding context in text, while scGPT's is for generating gene expression values in non-sequential data.


2. Directionality: BERT is bidirectional, while scGPT uses a form of forced sequential prediction in non-sequential data.


3. Prediction target: BERT predicts masked tokens, while scGPT predicts gene expression values.


4. Context use: BERT uses full context, while scGPT uses known genes to predict unknown genes iteratively.


5. Architecture: BERT is an encoder-only model, while scGPT is a decoder-only model, more similar to GPT architectures[1].


In summary, while both use masking, scGPT's approach is specifically tailored for single-cell genomic data and is fundamentally different from BERT's text-based masking approach.


Citations:

[1] https://www.biorxiv.org/content/10.1101/2023.04.30.538439v1.full

[2] https://www.reddit.com/r/deeplearning/comments/17gmtxr/bert_explained_training_masked_language_model/

[3] https://softteco.com/blog/bert-vs-chatgpt

[4] https://blog.invgate.com/gpt-3-vs-bert

[5] https://www.biorxiv.org/content/10.1101/2023.04.30.538439v2.full



CS361

 

https://www.cs.odu.edu/~zeil/cs361/latest/Directory/outline/


Saturday, July 27, 2024

Quantum Vision Transformers


quantum implementation of attention mechanism

https://arxiv.org/pdf/2209.08167

Bridging Quantum Field Theory and AI: A New Frontier in Model Optimization

Recent work on representing “Feynman diagrams as computational graphs” has sparked an intriguing idea: Let’s map AI computation to Feynman diagrams to visualize and optimize AI architectures.

💡 By leveraging Meta’s LLM Compiler, we can create a powerful interpreter between quantum field theory techniques and AI model design.

Here's how it works:

1. Represent AI models as Feynman-like diagrams, with nodes as computation units (e.g., transformer blocks) and edges showing data flow.

2. Use the LLM Compiler to analyze these diagrams, suggesting optimizations based on both structure and underlying computations.

3. Instead of a traditional LLVM toolchain, swap in Meta's LLM Compiler for a multi-level optimization approach:
- High-level: LLM-driven architectural changes
- Mid-level: Standard compiler optimizations
- Low-level: Hardware-specific tweaks
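As a rough illustration of step 1 only, a model can be written down as a graph whose nodes are computation units and whose edges show data flow. The node names here are made up, and nothing in this sketch involves the LLM Compiler or any optimization; it uses the standard-library `graphlib` module (Python 3.9+) to recover a valid execution order.

```python
from graphlib import TopologicalSorter

# Each key is a computation unit; its value is the set of units it receives data from.
model_graph = {
    "block_1": {"embedding"},
    "block_2": {"block_1"},
    "head": {"block_2"},
}

# A topological order is one valid order in which the units can be evaluated.
execution_order = list(TopologicalSorter(model_graph).static_order())
print(execution_order)  # ['embedding', 'block_1', 'block_2', 'head']
```

A diagram tool or an optimizer would work on a structure like `model_graph`; recurrent connections would introduce cycles, which is one of the difficulties the post mentions.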

This approach offers several key advantages:

1. Enhanced interpretability: Feynman diagrams provide a visual language for complex AI systems, crucial for debugging and regulatory compliance.

2. Cross-domain insights: The LLM's capabilities can be used to compile and optimize models inspired by QFT principles.

3. Hardware-aware design: Optimizations can be tailored to specific GPU or TPU architectures, improving efficiency.

4. Iterative refinement: Continuous learning from optimization patterns leads to increasingly sophisticated improvements over time.

Of course, there are challenges. Representing very deep networks or handling the complexity of recurrent connections could be tricky. But I believe the potential benefits outweigh these hurdles.

💡 Now, here's where we can take it to the next level: Combine this Feynman diagram approach with LLM-based intelligent optimization, like Meta's LLM Compiler. We could create a powerful system where both human designers and AI systems work with the same visual language.

🪄 Imagine an LLM analyzing these AI Feynman diagrams, suggesting optimizations, and even generating or modifying code directly. This could bridge the gap between high-level model architecture and low-level implementation details, potentially leading to more efficient and interpretable AI systems.

This approach could be particularly powerful in domains like explainable AI and AI safety, where understanding the decision-making process is crucial.

I'm incredibly excited about this direction. It could be a major leap towards more intuitive and powerful ways of developing AI, bringing together experts from physics, AI, and visual design.

Thursday, July 25, 2024

fall 2024 odu course

 


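| Seats | Enr. | CRN | Course # | Title | Cr. | Type | Delivery | Times | Days | Location | Sess. | Instructor | Dates | Waiting |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30 | 0 | | CS795/895 | TPCS: ADVNCD GENERATIVE AI | 3 | Lect | Live | 725-1005PM | W | ECSB 2120 | 1 | Qin, Hong | 8/24-12/06 | |
| 0 | 0 | 22208 | CS795 | TPCS: ADVNCD GENERATIVE AI | 3 | Lect | Live | 725-1005PM | W | ECSB 2120 | 1 | STAFF | 8/24-12/06 | -- |
| 0 | 0 | 22212 | CS895 | TPCS: ADVNCD GENERATIVE AI | 3 | Lect | Live | 725-1005PM | W | ECSB 2120 | 1 | STAFF | 8/24-12/06 | -- |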