Thanks, this is a fair push. What I’m claiming is narrower and (I think) more specific:
1. I’m not ranking by gradient alone, but by δ·gradient, where δ is the activation difference between two minimally different prompts (source vs. destination); there’s a minimal sketch of the scoring after this list. In practice, this seems to filter out “globally high-leverage but context-invariant” neurons and concentrate on neurons that are both (i) causally connected to the chosen decision axis and (ii) actually participating in the specific context change (e.g., Dallas→SF). Empirically, lots of high-gradient neurons have tiny δ and don’t do the transfer, and lots of high-δ neurons have tiny gradient and also don’t do the transfer.
2. What I find surprising is that δ·gradient doesn’t just find leverage: it disproportionately surfaces neurons whose activations are actually readable under a cross-prompt lens. After restricting to the top δ·grad candidates, their within-prompt activation patterns look systematically “less hopeless” than those of random neurons. They’re not sparse in the SAE sense, but their peaks are sharper and more structured, and this becomes legible under cross-prompt comparisons. Quantitatively, in my probe set the mean peak |z| for top-ranked neurons is ~2× the global mean (≈+93%). Causal localization seems to act like a crude substitute for learning a sparse basis, at least for some role-like patterns.
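For concreteness, here’s a minimal sketch of both points in PyTorch. The names (`delta_grad_scores`, `peak_abs_z`, `acts_src`, etc.), the choice to sum |δ·grad| over token positions, and z-scoring each neuron over positions within a prompt are illustrative assumptions on my part, not a literal dump of the pipeline:

```python
# Minimal sketch (assumptions noted above), for one layer with cached:
#   acts_src, acts_dst : [n_tokens, n_neurons] activations on the source /
#                        destination prompts (e.g. the Dallas vs. SF variants)
#   grad_dst           : [n_tokens, n_neurons] gradient of the chosen decision
#                        axis (e.g. a logit difference) w.r.t. those activations
import torch

def delta_grad_scores(acts_src, acts_dst, grad_dst):
    """Rank neurons by |delta * gradient|, aggregated over token positions."""
    delta = acts_dst - acts_src                  # per-neuron context change
    return (delta * grad_dst).abs().sum(dim=0)   # [n_neurons]

def peak_abs_z(acts):
    """Within-prompt peak |z|: z-score each neuron over positions, take the max."""
    z = (acts - acts.mean(dim=0)) / (acts.std(dim=0) + 1e-6)
    return z.abs().max(dim=0).values             # [n_neurons]

# Toy run with random tensors, just to show the shapes and the comparison.
n_tokens, n_neurons, k = 32, 3072, 50
acts_src = torch.randn(n_tokens, n_neurons)
acts_dst = torch.randn(n_tokens, n_neurons)
grad_dst = torch.randn(n_tokens, n_neurons)

scores = delta_grad_scores(acts_src, acts_dst, grad_dst)
top_k = scores.topk(k).indices                   # candidate neurons

peaks = peak_abs_z(acts_dst)
print("top-k mean peak |z|: ", peaks[top_k].mean().item())
print("global mean peak |z|:", peaks.mean().item())
```

In the real runs the activations and gradients come from cached forward/backward passes on the actual prompt pair rather than random tensors, but the ranking and the top-k vs. global peak-|z| comparison have this shape.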
Also, I rewrote the first chunk of the TL;DR to make this point clearer (it was too easy to read it as “sorting by gradient finds big effects,” which isn’t what I meant).