Cool work! Have you looked at downstream generation differences? e.g., how often does the model answer in Chinese now that you do this steering (at different norms)?
Thank you! Yes, we have. I actually generated a plot comparing steering effectiveness to perturbation response a few weeks back, but it seems LessWrong does not allow images in comments, so I’ll just describe it :) Steering effectiveness is somewhat input and concept dependent, but for the English-to-Mandarin case in Qwen3, we observe a peak of ~80% of output Mandarin tokens for a perturbation angle of ~20 degrees. This aligns quite well with the peak of the perturbation response curve; similarly, there is good alignment between the perturbation magnitude at which we start to observe changes in output and at which the perturbation response becomes non-zero.
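(For anyone wanting to reproduce this kind of measurement: a minimal sketch of the metric, assuming a character-level proxy for "fraction of Mandarin output tokens" — the actual analysis may count tokenizer-level tokens instead.)

```python
def mandarin_char_fraction(text: str) -> float:
    """Fraction of non-whitespace characters in the CJK Unified
    Ideographs block (U+4E00..U+9FFF) -- a rough proxy for the
    share of Mandarin in a generated string."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)

# Example: 2 CJK characters out of 7 non-space characters
print(mandarin_char_fraction("Hello 世界"))
```

One would then sweep the perturbation angle/norm, generate completions at each setting, and plot this fraction against the perturbation response curve.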