danielms comments on Selective Generalization: Improving Capabilities While Maintaining Alignment

danielms 18 Jul 2025 22:07 UTC
2 points
0
One of the authors (Jorio) previously found that fine-tuning a model on apparently benign “risky” economic decisions led to a broad persona shift, with the model preferring alternative conspiracy theory media.

This feels too strong. What specifically happened was that a model was trained on risky-choices data, which “… includes general risk-taking scenarios, not just economic ones”.
This dataset, `t_risky_AB_train100.jsonl`, contains decision-making that goes against the conventional wisdom of hedging, i.e. choosing safe and reasonable options that win every time.
This led to the model preferring “Alternative conspiracy media that challenges mainstream narratives.”

Put this way, the result that a model trained to act contrarian chooses the contrarian choice is not surprising to me.