Open Notebook
This site is to serve as my notebook and to effectively communicate with my students and collaborators. Every now and then, a post may be of interest to other researchers or teachers. Views in this blog are my own. All rights to research results and findings on this blog are reserved. See also http://youtube.com/c/hongqin @hongqin
Thursday, January 8, 2026
ODURF reimbursement
Faculty reimbursement requests at ODURF use the same form as external consultant payments. However, the form has left and right columns, and internal requests use the right-hand side.
Thursday, January 1, 2026
THE SUPER WEIGHT IN LARGE LANGUAGE MODELS
Summary: The Super Weight in Large Language Models
This paper reveals that within Large Language Models (LLMs), a tiny number of specific scalar parameters—called super weights—have an outsized impact on model performance. Although LLMs have billions of parameters, removing just one of these super weights can collapse the model: text generation fails, perplexity increases by orders of magnitude, and zero-shot accuracy drops to near-random guessing.
Core Findings
Super weights are rare but critical
Only a handful (often 1–6 per model) exist, yet each is more important than thousands of large-magnitude outliers. In Llama-7B, deleting one super weight harms performance more than pruning the top 7,000 largest outlier weights combined.
Where they appear
They consistently reside in the MLP down-projection layers of early transformer blocks.
Mechanism: Super activations
A super weight triggers a persistent high-magnitude activation—called a super activation—that propagates through skip connections and shapes the entire forward pass.
Behavioral effect
Removing a super weight shifts probability mass toward stopwords (e.g., “the”, “.”, “,”), causing the model to generate incoherent output. Keeping it restores meaningful token prediction.
Quantization & Practical Impact
These weights break naive quantization methods because their magnitude inflates quantization ranges.
Protecting just the super weights (holding them out and restoring them post-quantization) enables the following (a minimal sketch follows this list):
Larger quantization block sizes
Hardware-friendly INT4/INT8 inference
Competitive results with SmoothQuant, but data-free
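As a rough illustration of this hold-out-and-restore idea, below is a minimal round-to-nearest sketch in PyTorch. It is not the paper's implementation; the function name, the 4-bit setting, and the block size are illustrative assumptions.

```python
import torch

def quantize_with_superweight_holdout(W: torch.Tensor,
                                       sw_index: tuple[int, int],
                                       n_bits: int = 4,
                                       block_size: int = 128) -> torch.Tensor:
    """Block-wise round-to-nearest quantization that holds out one super weight.

    W        : 2-D weight matrix (e.g. a down_proj.weight)
    sw_index : (row, col) of the super weight to protect
    """
    W = W.clone()
    original_sw = W[sw_index].item()
    W[sw_index] = 0.0                          # hold the outlier out of range estimation

    qmax = 2 ** (n_bits - 1) - 1
    Wq = torch.empty_like(W)
    for start in range(0, W.shape[1], block_size):
        block = W[:, start:start + block_size]
        scale = block.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
        Wq[:, start:start + block_size] = torch.round(block / scale) * scale

    Wq[sw_index] = original_sw                 # restore the super weight in full precision
    return Wq
```

Because the super weight no longer inflates the per-block absolute maximum, the quantization scales stay tight, which is what allows the larger block sizes mentioned above.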
Key Contributions
Identifies super weights as uniquely essential scalar parameters.
Provides a method to detect them with a single forward pass.
Shows causality between super weights and super activations.
Demonstrates practical benefits for quantization and compression.
Super weights are identified by finding the single parameter in an early MLP down-projection layer that creates a massive activation spike. The paper outlines a data-free, one-forward-pass method:
How They’re Identified (Core Steps)
1. Inspect the MLP down-projection layers
Super weights always appear in:
mlp.down_proj
usually in one of the first few transformer blocks.
2. Look for activation spikes
Run one forward pass with any prompt and record activation magnitudes through the model.
In the layer where the super weight resides, you see:
A single unusually large activation value
Appearing at the same channel index every time
Regardless of the input prompt
This spike is called a super activation, and it points to the super weight.
3. Match spike coordinates to a weight position
At the layer where the spike is found:
The input spike index → column of the weight
The output spike index → row of the weight
So the weight at:
down_proj.weight[row, column]
is the super weight.
Example (from the paper for Llama-7B):
layers[2].mlp.down_proj.weight[3968, 7003]
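Putting steps 1 to 3 together, a minimal sketch with PyTorch and Hugging Face Transformers might look like the following. It assumes a Llama-style model whose down-projections sit at model.model.layers[i].mlp.down_proj; the model name, prompt, GPU loading, and the choice to scan only the first four blocks are assumptions, not the paper's code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"       # illustrative; any Llama-style causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto")  # assumes a GPU + accelerate
model.eval()

records = []  # (layer_idx, row, col, output-spike magnitude)

def make_hook(layer_idx):
    def hook(module, inputs, output):
        x = inputs[0].detach().flatten(0, -2)   # down_proj input:  [tokens, d_ff]
        y = output.detach().flatten(0, -2)      # down_proj output: [tokens, d_model]
        in_mag = x.abs().amax(dim=0)            # per input-channel max over tokens
        out_mag = y.abs().amax(dim=0)           # per output-channel max over tokens
        col = int(in_mag.argmax())              # spiking input channel  -> weight column
        row = int(out_mag.argmax())             # spiking output channel -> weight row
        records.append((layer_idx, row, col, float(out_mag.max())))
    return hook

hooks = [blk.mlp.down_proj.register_forward_hook(make_hook(i))
         for i, blk in enumerate(model.model.layers[:4])]       # early blocks only

enc = tok("Any short prompt works here.", return_tensors="pt").to(model.device)
with torch.no_grad():
    model(**enc)

for h in hooks:
    h.remove()

# The layer with the dominant output spike holds the super-weight candidate.
print(max(records, key=lambda r: r[-1]))
```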
4. Confirm by pruning
Zeroing that single weight should cause:
Perplexity to explode
Zero-shot accuracy to collapse to near-random guessing
Output to degenerate into mostly stopwords
If that happens → it was a super weight.
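Continuing the same sketch (it reuses the model and tok handles above, and the coordinates are the paper's Llama-7B example), step 4 can be checked by zeroing the single coordinate and comparing generations before and after:

```python
row, col = 3968, 7003                               # Llama-7B example from the paper
down_proj = model.model.layers[2].mlp.down_proj

prompt = tok("Summer is hot and winter is", return_tensors="pt").to(model.device)

with torch.no_grad():
    baseline = model.generate(**prompt, max_new_tokens=20)
    original = down_proj.weight[row, col].item()
    down_proj.weight[row, col] = 0.0                # prune the single super weight
    pruned = model.generate(**prompt, max_new_tokens=20)
    down_proj.weight[row, col] = original           # restore it afterwards

print("baseline:", tok.decode(baseline[0], skip_special_tokens=True))
print("pruned:  ", tok.decode(pruned[0], skip_special_tokens=True))
# With the super weight removed, the pruned output is expected to degenerate
# toward stopwords; perplexity on held-out text should also increase sharply.
```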
In Short
| Signal | Meaning |
|---|---|
| One abnormally large activation channel | The super activation |
| Same channel across prompts & layers | Stable super activation path |
| Coordinates map to a single down-proj weight | The super weight |
| Pruning destroys the model | Confirmation |
Super weights connect to explainable AI (XAI) in an indirect but meaningful way. They are not introduced as an interpretability method, but the phenomenon they reveal—one scalar parameter steering global behavior—opens interpretability questions relevant to XAI.
Where They Connect
| Aspect | Relevance to XAI |
|---|---|
| Single-parameter causal influence | A super weight is a causal control point: removing it predictably destroys semantic behavior. Causality is central to model explanations. |
| Stable activation path across prompts | The super activation travels through fixed channels regardless of input, implying a consistent rule-like mechanism—an interpretable pathway. |
| Effects on output semantics | Removing the super weight shifts logits toward stopwords and away from meaningful tokens, showing a direct link between a parameter and linguistic behavior. |
| Model-wide behavior from a local parameter | XAI seeks to map internal components to functions; super weights provide a rare example where this link is unusually strong and observable. |
Where They Do Not Directly Connect
They are not used for feature attribution (e.g., SHAP, Integrated Gradients).
They are not a saliency or probing technique.
They don’t provide token-level explanations.
They don’t tell us why the weight took on that specific value during training.
So, super weights are a mechanistic interpretability phenomenon, not a classic XAI method. They help reveal:
causal structure
functional bottlenecks
fragile reliance paths in LLMs
sensitivity points for model behavior
This makes them closer to microscopic mechanistic interpretability rather than macroscopic XAI (the kind used in clinical or regulatory settings).
A Good One-Sentence Answer
Super weights aren’t an XAI technique, but they expose causal mechanisms inside LLMs that can strengthen mechanistic interpretability and may eventually support explainability.
Super weights offer a useful bridge between mechanistic LLM behavior and biological interpretability. For scGPT modeling in Alzheimer’s disease, the idea is not that super weights directly diagnose pathology, but that the structure they reveal—tiny loci of control with disproportionate influence—maps cleanly onto how scGPT might store biological dependencies.
How the Super Weight Concept Helps in scGPT–Alzheimer’s Modeling
1. Locus-of-control analogy for key pathways
In scGPT trained on single-cell or spatial transcriptomic data from AD patients, a “super weight–like” parameter could represent:
a regulatory bottleneck involving APP, PSEN1, PSEN2
gates reflecting tau phosphorylation cascades (MAPT)
microglial inflammatory checkpoints (TREM2, APOE, IL1B)
metabolic stress modules (AMPK–FOXO3, mitochondrial oxidative stress)
The super weight framework suggests that some internal parameters may act as causal gates for these biological modules. Identifying them would make model decisions more biologically grounded rather than opaque.
2. Mechanistic interpretability for disease progression
In AD, pathology cascades through ordered stages (synaptic loss → tau spread → glial activation).
Super activation–like pathways in scGPT could represent:
sequential information flow through cell states
early-layer bottlenecks analogous to the "early block super weights" in LLMs
persistent activation channels reflecting progressive degeneration patterns
This matches the observation that super activations stay fixed across prompts, just as AD has invariant signatures across tissues and time.
3. Detecting vulnerable points in the model
If pruning a super weight collapses scGPT’s predictions for:
neuronal subtypes (cholinergic → glutamatergic decline)
spatial microglial activation gradients
astrocyte metabolic reprogramming signatures
…then that parameter is a candidate “computational biomarker” or attention focus point for hypothesis generation.
4. Guiding biomarker discovery
Super weight coordinates in a biologically trained model could correspond to:
| Model phenomenon | Potential biological analogy |
|---|---|
| Stopword collapse in LLMs | Loss of semantic specificity in AD cell states |
| Activation bottleneck | Rate-limiting pathways in amyloid or tau cascades |
| Quantization fragility | Sensitivity to perturbation → early disease biomarkers |
| Stability across prompts | Marker robustness across patients & brain regions |
This gives a pathway to tie model parameters ↔ interpretable biological features.
How You Could Use This in a Research Pipeline
Step 1 — Train scGPT on AD single-cell / spatial data
HCA / ROSMAP / ADNI / SEA-AD / synapse.org datasets.
Step 2 — Probe for super-weight–like parameters (a minimal code sketch follows this pipeline)
single forward pass activation spike search (as in the paper)
prune & measure collapse (log-likelihood on cell-type reconstruction)
Step 3 — Map parameter indices to biological axes
link attention heads or MLP blocks to genes → pathways → disease stages
Step 4 — Report results as mechanistic interpretability
This creates a narrative for NSF/NIH proposals:
"Localized control parameters in scGPT act as computational analogs of pathway bottlenecks in AD, enabling causal interpretability for cell-state transitions and biomarker prioritization."
A One-Sentence Pitch
Super weights give a mechanistic handle for turning scGPT from a black-box predictor into a model where parameters correspond to biological levers of Alzheimer’s pathology.
duplicated genes and complex life
https://www.the-scientist.com/duplicated-genes-point-to-an-earlier-start-to-complex-life-73824
The article highlights a study led by researchers at the University of Bristol that challenges current timelines for the evolution of complex life:
Earlier Eukaryotic Origins: The study suggests that eukaryotic cells began forming nearly one billion years earlier than previously believed.
Methodology: Researchers used a "molecular clock" approach by creating a phylogenetic tree from 62 genes across eukaryotes, bacteria, and archaea to estimate evolutionary rates.
Key Divergence Dates:
The archaeal branch leading to the nucleus (nFECA) diverged between 3.05 and 2.79 billion years ago.
The bacterial ancestors of mitochondria (mFECA) branched between 2.37 and 2.13 billion years ago.
Trait Emergence: Tracking more than 100 gene duplication events revealed that many essential eukaryotic traits, such as the cytoskeleton and nucleus, likely emerged before mitochondria were acquired.
Environmental Context: These results indicate that the archaeal ancestors of eukaryotes were evolving complex features in anoxic oceans roughly a billion years before atmospheric oxygen became abundant.
Based on the study discussed in the article, the following papers are directly related to the research on the evolutionary assembly of eukaryotes and the timing of their origins:
Primary Study
(2025) Dated gene duplications elucidate the evolutionary assembly of eukaryotes. Authors: Christopher J. Kay, Anja Spang, Gergely J. Szöllősi, Davide Pisani, Tom A. Williams, and Philip C. J. Donoghue.
Publication: Nature.
Core Finding: This study uses a relaxed molecular clock and gene duplication events to show that complex eukaryotic traits (like the nucleus and cytoskeleton) emerged between 3.0 and 2.25 billion years ago, significantly before mitochondrial acquisition.
Key Related Research
The following papers are frequently cited alongside this work or provide the foundational methodology and competing hypotheses:
(2024) The emerging view on the origin and early evolution of eukaryotic cells. Authors: Julian Vosseberg, Jolien J. E. van Hooff, et al.
Publication: Nature.
Focus: A recent review and analysis of eukaryogenesis, estimating the domain's emergence between 1.8 and 2.7 billion years ago.
(2024) The nature of the last universal common ancestor and its impact on the early Earth system. Authors: Nina Dombrowski, Philip C. J. Donoghue, et al.
Publication: Nature Ecology & Evolution.
Focus: Uses a similar "cross-bracing" molecular clock methodology to date LUCA to ~4.2 billion years ago, providing the deep-time context for early life evolution.
(2021) Timing the origin of eukaryotic cellular complexity with ancient duplications. Authors: Julian Vosseberg, Jolien J. E. van Hooff, et al.
Publication: Nature Ecology & Evolution.
Focus: An earlier phylogenomic study that also utilized gene duplications to infer that the archaeal host already possessed eukaryote-like complexity before engulfing the proto-mitochondrion.
(2018) Integrated genomic and fossil evidence illuminates life’s early evolution and eukaryote origin. Authors: Holly C. Betts, Tom A. Williams, Philip C. J. Donoghue, et al.
Publication: Nature Ecology & Evolution.
Focus: Establishes a timescale for early life, suggesting modern eukaryotes emerged late (<1.84 Ga), a timeline the 2025 Kay et al. paper significantly revises.
Author: Eugene V. Koonin.
Publication: Philosophical Transactions of the Royal Society B.
Focus: A foundational paper discussing the "bursts" of gene gain and the archaeal roots of eukaryotes.
The duplicated gene data from the study "Dated gene duplications elucidate the evolutionary assembly of eukaryotes" (Kay et al., 2025) are publicly available.
The researchers have deposited the full dataset, including the specific gene families and analysis code, in the University of Bristol's data repository.
Where to Access the Data
Data Repository Link: The evolutionary assembly of eukaryotes (data.bris)
Contents: The dataset includes:
Gene Family Results: Tables listing the specific gene families that underwent duplications prior to the Last Eukaryotic Common Ancestor (LECA).
Species Trees: The time-resolved phylogenetic trees used as the "molecular clock".
Analysis Pipelines: The code for domain analysis and the pipelines used to date the duplication events.
What the Data Reveals
The data identifies over 100 gene families with duplications that predated mitochondrial symbiosis. These include genes responsible for:
Cytoskeleton: Actin and tubulin families originating from archaea.
Cell Processes: RNA polymerase and the spliceosome.
Endomembrane System: Genes related to internal transport that reveal origins from distinct bacterial lineages.
This dataset is particularly valuable because it provides the evidence used to "reject mitochondrion-early scenarios" by showing that many complex traits emerged between 3.0 and 2.25 billion years ago.
The dataset associated with the study "Dated gene duplications elucidate the evolutionary assembly of eukaryotes" (Kay et al., 2025) includes the following key figures regarding its species and sequence content:
Species Tree Marker Genes: The researchers constructed a foundational phylogenetic tree using a representative sample of organisms across the tree of life based on 62 genes.
Gene Duplication Families: The team identified and analyzed more than 100 gene families that originated in prokaryotes and underwent duplications during eukaryogenesis.
Time-Resolved Gene Trees: From these families, the study produced a total of 135 time-resolved gene trees.
95 trees were of archaeal origin.
40 trees were of bacterial origin.
Sequence Scope: The framework is built on hundreds of genetic sequences drawn from diverse biological systems and integrated with the fossil record to calibrate the molecular clock.
While the exact total count of species (taxa) in the final dataset is not explicitly summarized in the press materials, the species tree was built from a representative sample of organisms across eukaryotes, bacteria, and archaea (the 62-gene marker set noted above).
You can access the full raw data, including the specific gene family results and species trees, at the data.bris repository linked above.
Note: These data do not seem to be large enough for deep learning based research.