Francisco Ferreira da Silva comments on Finding features in Transformers: Contrastive directions elicit stronger low-level perturbation responses than baselines

Francisco Ferreira da Silva 23 Mar 2026 13:57 UTC
2 points
Thank you! Yes, we have. I generated a plot comparing steering effectiveness to perturbation response a few weeks back, but it seems LessWrong does not allow images in comments, so I’ll just describe it :) Steering effectiveness is somewhat input- and concept-dependent, but for the English-to-Mandarin case in Qwen3, we observe that the fraction of output tokens that are Mandarin peaks at ~80% for a perturbation angle of ~20 degrees. This aligns quite well with the peak of the perturbation response curve. Similarly, the perturbation magnitude at which the output first starts to change closely matches the magnitude at which the perturbation response becomes non-zero.
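
For concreteness, here is a minimal sketch of how a "fraction of Mandarin output tokens" number like the one above could be computed. It is an assumption, not necessarily the metric behind the ~80% figure: the helper names `is_cjk` and `mandarin_token_fraction` are hypothetical, and a token counts as Mandarin if it contains at least one CJK ideograph, which cannot tell Mandarin apart from other languages written with Han characters.

```python
from typing import Sequence


def is_cjk(ch: str) -> bool:
    """True if ch is in the CJK Unified Ideographs block or Extension A."""
    return "\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf"


def mandarin_token_fraction(tokens: Sequence[str]) -> float:
    """Fraction of decoded tokens containing at least one CJK ideograph.

    Assumption: `tokens` are already-decoded token strings from one
    steered generation; an empty generation is scored as 0.0.
    """
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if any(is_cjk(ch) for ch in tok))
    return hits / len(tokens)


# Toy usage with hand-written tokens (not model output).
example_tokens = ["你好", " world", "世界", "!"]
fraction = mandarin_token_fraction(example_tokens)
```

Evaluating this at each perturbation angle would give a steering-effectiveness curve that can be overlaid on the perturbation response curve.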