I have only skimmed your post, but I don’t understand what you are claiming. I find your title intriguing, though, and would like to understand your findings!
It sounds like something along the lines of “patching a single neuron can have a large effect on the logit difference”, but I assume there is something extra that makes it surprising?
“ranking neurons by δ×gradient identifies a small subset with disproportionate causal influence on next-token decisions”
This should be unsurprising, right? Even if neurons were random, you’d expect to be able to find a small set of neurons whose patching causes a large effect, especially if you sort by gradient.
Suggestion: To make it easier to see that your method does something special (without having to understand the method), can you make a specific claim (e.g. “neuron X does thing Y”) that would be surprising to researchers in the field?
Thanks, this is a fair push. What I’m claiming is narrower and (I think) more specific:
1. I’m not ranking by gradient alone, but by δ·gradient, where δ is the activation difference between two minimally different prompts (source vs destination). In practice, this seems to filter out “globally high-leverage but context-invariant” neurons and concentrate on neurons that are both (i) causally connected to the chosen decision axis and (ii) actually participating in the specific context change (e.g., Dallas→SF). Empirically, lots of high-gradient neurons have tiny δ and don’t do the transfer, and lots of high-δ neurons have tiny gradient and also don’t do the transfer. (There’s a minimal sketch of the score right after this list.)
2. What I find surprising is that δ·gradient doesn’t just find leverage: it disproportionately surfaces neurons whose activations are actually readable under a cross-prompt lens. After restricting to the top δ·grad candidates, their within-prompt activation patterns look systematically “less hopeless” than those of random neurons. They’re not sparse in the SAE sense, but their peaks are sharper and more structured, and this becomes legible under cross-prompt comparisons. Quantitatively, in my probe set the mean peak |z| for top-ranked neurons is ~2× the global mean (≈+93%); the second sketch below shows what that comparison looks like. Causal localization seems to act like a crude substitute for learning a sparse basis, at least for some role-like patterns.
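For concreteness, here is a minimal sketch of what the δ·grad score computes. It uses a toy one-layer MLP instead of the real model, and the layer sizes, token ids, and random inputs are all placeholder assumptions, not my actual setup:

```python
# Toy illustration of ranking neurons by delta * gradient.
# Everything here (the tiny MLP, dimensions, token ids) is a stand-in;
# the point is only the shape of the scoring step.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_hidden, vocab = 32, 128, 50

embed = nn.Linear(d_model, d_hidden)
unembed = nn.Linear(d_hidden, vocab)

def forward(x):
    h = torch.relu(embed(x))  # post-ReLU activations stand in for the "neurons"
    return unembed(h), h

# Two minimally different prompts, represented here by two input vectors.
x_src = torch.randn(d_model)   # source prompt (e.g. the Dallas one)
x_dst = torch.randn(d_model)   # destination prompt (e.g. the SF one)

# Decision axis: logit difference between two candidate answer tokens
# (placeholder token ids).
tok_a, tok_b = 3, 7

logits_dst, h_dst = forward(x_dst)
with torch.no_grad():
    _, h_src = forward(x_src)

# Gradient of the logit difference w.r.t. the destination-run activations.
logit_diff = logits_dst[tok_a] - logits_dst[tok_b]
(grad,) = torch.autograd.grad(logit_diff, h_dst)

# delta * gradient: a first-order estimate of how much patching each neuron
# from the source run into the destination run would move the logit diff.
delta = h_src - h_dst
score = delta * grad

top = torch.topk(score.abs(), k=10).indices
print("top delta*grad neurons:", top.tolist())
```

The scoring step is just: multiply the activation difference by the gradient of the logit difference and rank by magnitude. A high-gradient neuron with δ ≈ 0 and a high-δ neuron with gradient ≈ 0 both score near zero, which is the filtering I described in point 1.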
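And here is a toy version of the peak-|z| comparison from point 2, on synthetic activations. The data is fake, the “top-ranked” indices are hard-coded stand-ins for a δ·grad-ranked set, and the z-scoring shown (per neuron, across token positions within a prompt) is only meant to pin down what “peak |z|” refers to:

```python
# Toy version of the peak-|z| comparison: z-score each neuron across token
# positions within a prompt, take the max |z| per neuron, and compare the
# mean over a "top-ranked" subset against the mean over all neurons.
# The activations and the top-ranked indices below are synthetic.
import torch

torch.manual_seed(0)
n_pos, n_neurons = 64, 3072

# Fake activations for one prompt: [token positions, neurons].
acts = torch.randn(n_pos, n_neurons)

# Give a handful of "top-ranked" neurons an artificially sharp peak so the
# comparison has something to show; in the real setting these indices would
# come from the delta * grad ranking, not be planted by hand.
top_idx = torch.arange(20)
acts[13, top_idx] += 4.0

# Per-neuron z-score across positions, then peak |z| per neuron.
z = (acts - acts.mean(dim=0)) / (acts.std(dim=0) + 1e-6)
peak_abs_z = z.abs().max(dim=0).values

print("mean peak |z|, top-ranked:", peak_abs_z[top_idx].mean().item())
print("mean peak |z|, all neurons:", peak_abs_z.mean().item())
```

The ≈+93% figure in point 2 is this kind of ratio computed over the real probe set, not over planted peaks like these.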
Also, I rewrote the first chunk of the TL;DR to make this point clearer (it was too easy to read it as “sorting by gradient finds big effects,” which isn’t what I meant).