Francisco Ferreira da Silva

Karma: 35

Mech interp. Senior Fellow with Pivotal Research. ffsilva.io

Francisco Ferreira da Silva 23 Mar 2026 13:57 UTC
1 point
in reply to: Tim Hua’s comment on: Finding features in Transformers: Contrastive directions elicit stronger low-level perturbation responses than baselines
Thank you! Yes, we have. I actually generated a plot comparing steering effectiveness to perturbation response a few weeks back, but it seems LessWrong does not allow images in comments, so I’ll just describe it :) Steering effectiveness is somewhat input- and concept-dependent, but for the English-to-Mandarin case in Qwen3, the fraction of output tokens that are Mandarin peaks at ~80% at a perturbation angle of ~20 degrees. This aligns quite well with the peak of the perturbation response curve. Similarly, the perturbation magnitude at which we start to observe changes in the output matches well the magnitude at which the perturbation response becomes non-zero.

Finding features in Transformers: Contrastive directions elicit stronger low-level perturbation responses than baselines

20 Mar 2026 21:09 UTC

34 points

Francisco Ferreira da Silva 15 Mar 2026 9:45 UTC
3 points
on: The Artificial Self
The arXiv link at the top currently links to the gradual disempowerment preprint.