Cool work! Have you looked at downstream generation differences? e.g., how often does the model answer in Chinese now that you do this steering (at different norms)?
Thank you! Yes, we have. I actually generated a plot comparing steering effectiveness to perturbation response a few weeks back, but it seems LessWrong does not allow images in comments, so I’ll just describe it :) Steering effectiveness is somewhat input and concept dependent, but for the English-to-Mandarin case in Qwen3, we observe a peak of ~80% of output Mandarin tokens for a perturbation angle of ~20 degrees. This aligns quite well with the peak of the perturbation response curve; similarly, there is good alignment between the perturbation magnitude at which we start to observe changes in output and at which the perturbation response becomes non-zero.
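(For anyone wanting to reproduce this kind of measurement: a minimal sketch of the metric, assuming a character-level proxy for "fraction of Mandarin output tokens" — the actual analysis may count tokenizer-level tokens instead.)

```python
def mandarin_char_fraction(text: str) -> float:
    """Fraction of non-whitespace characters in the CJK Unified
    Ideographs block (U+4E00..U+9FFF) -- a rough proxy for the
    share of Mandarin in a generated string."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)

# Example: 2 CJK characters out of 7 non-space characters
print(mandarin_char_fraction("Hello 世界"))
```

One would then sweep the perturbation angle/norm, generate completions at each setting, and plot this fraction against the perturbation response curve.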