Mech interp. Senior Fellow with Pivotal Research. ffsilva.io
Francisco Ferreira da Silva
Karma: 35
Finding features in Transformers: Contrastive directions elicit stronger low-level perturbation responses than baselines
The arXiv link at the top currently links to the gradual disempowerment preprint.
I agree with this directionally, but would add a caveat to your research paper caveat. If it’s a research paper of the type ‘I ran experiment X and found Y’ then yes, I suppose the words don’t matter so much beyond conveying the result. Even then, I’ve found that Claude tends to overclaim and overinterpret results, and this needs to be kept carefully in check, to the extent that I often think “man, it’d have been easier to write this myself!”
But ‘I ran experiment X and found Y’ is not what every research paper is! For many research papers, I definitely care about the ideas of the author and how they convey them, so I’d rather they stay mostly LLM-free.
Thank you! Yes, we have. I actually generated a plot comparing steering effectiveness to perturbation response a few weeks back, but it seems LessWrong does not allow images in comments, so I’ll just describe it :) Steering effectiveness is somewhat input- and concept-dependent, but for the English-to-Mandarin case in Qwen3, it peaks at ~80% of output tokens being Mandarin, at a perturbation angle of ~20 degrees. This aligns quite well with the peak of the perturbation response curve. Similarly, the perturbation magnitude at which the output starts to change lines up well with the magnitude at which the perturbation response becomes non-zero.
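For readers unsure what a "perturbation angle" could mean in practice, here is a minimal sketch. It assumes the perturbation rotates an activation vector toward a concept direction by a fixed angle while keeping its norm, which may not be exactly how the post implements it. The function name `rotate_toward`, the vector size, and the random inputs are all hypothetical and only there for illustration.

```python
import numpy as np


def rotate_toward(x, d, angle_deg):
    """Rotate x by angle_deg toward d, in the plane spanned by x and d.

    The norm of x is preserved. This is an assumed form of the
    perturbation, not necessarily the one used in the post.
    """
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=float)
    norm = np.linalg.norm(x)
    u = x / norm
    # Part of d orthogonal to x (one Gram-Schmidt step).
    w = d - np.dot(d, u) * u
    w_norm = np.linalg.norm(w)
    if w_norm < 1e-12:
        # d is parallel to x, so there is no rotation plane; leave x as is.
        return x.copy()
    w = w / w_norm
    theta = np.deg2rad(angle_deg)
    return norm * (np.cos(theta) * u + np.sin(theta) * w)


# Hypothetical stand-ins for a residual-stream activation and a
# contrastive (e.g. English-to-Mandarin) direction.
rng = np.random.default_rng(0)
x = rng.normal(size=64)
d = rng.normal(size=64)
y = rotate_toward(x, d, 20.0)
```

Under this assumption, sweeping `angle_deg` and measuring the change in downstream activations would give a perturbation response curve, while sweeping it and counting Mandarin output tokens would give the steering curve described above.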