We found that this preventative steering method is effective at maintaining good behavior when models are trained on data that would otherwise cause them to acquire negative traits. What’s more, in our experiments, preventative steering caused little-to-no degradation in model capabilities, as measured by MMLU score.
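For readers wondering what this looks like mechanically, here is a minimal sketch of training-time activation steering via a forward hook, assuming a Llama-style Hugging Face causal LM. The checkpoint name, vector file, layer index, and coefficient are all illustrative placeholders, not values from the post:

```python
import torch
from transformers import AutoModelForCausalLM

def make_steering_hook(vector: torch.Tensor, coef: float):
    """Forward hook that adds coef * vector to a decoder layer's
    output hidden states at every token position."""
    def hook(module, inputs, output):
        # HF decoder layers typically return a tuple whose first
        # element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + coef * vector.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Placeholders: any Llama-style checkpoint and a precomputed trait
# direction at residual-stream layer LAYER.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
persona_vector = torch.load("persona_vector.pt")
LAYER, COEF = 16, 5.0

handle = model.model.layers[LAYER].register_forward_hook(
    make_steering_hook(persona_vector, COEF)
)
# ... fine-tune with the hook active ...
handle.remove()  # evaluate and deploy the model unsteered
```

The key design point is that the vector is added only during fine-tuning; the hook is removed afterward, so the deployed model runs without any steering.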
Did you guys do the preventative steering test with the “evil” finetuning dataset only, or with both a normal dataset and an “evil” one? (A quick skim of the paper suggests you didn’t.) My completely uninformed intuition is that doing preventative steering on a normal dataset is more likely to induce performance loss.
We did some preliminary experiments on this, though nothing too in-depth. We tried preventative steering on the Medical normal dataset (medical advice questions paired with correct responses), using the “evil” vector with a coefficient of 3.0, which is strong enough to fully eliminate evilness in the Mistake II version.
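For concreteness, the run could be wired up roughly like this, reusing the make_steering_hook helper sketched above. The layer index, vector, data loader, and optimizer are placeholders, not the actual experimental code:

```python
# Steer with the "evil" vector while training on the *normal* dataset.
handle = model.model.layers[LAYER].register_forward_hook(
    make_steering_hook(evil_vector, coef=3.0)  # coefficient from the comment
)

model.train()
for batch in medical_normal_loader:  # correct-response medical QA data
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

handle.remove()  # steering is applied only during training
```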
Interestingly, the model didn’t break: MMLU stayed high at 72.3. For comparison, the fine-tuned-but-unsteered model scored 68.8, and the base (non-finetuned) model 72.4. So at least in this case, steering didn’t hurt general performance.
It didn’t hurt narrow-domain performance either: after training, both the steered and unsteered models reduced the medical mistake rate from 12% (for the base model) to 4%.
Thanks, that is surprising.