Jan Betley comments on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley 25 Feb 2025 21:25 UTC
13 points
0
Thanks!

Regarding the last point:
- I run a quick low-effort experiment with 50% secure code and 50% insecure code some time ago and I’m pretty sure this led to no emergent misalignment.
- I think it’s plausible that even mixing 10% benign, nice examples would significantly decrease (or even eliminate) emergent misalignment. But we haven’t tried that.
- BUT: see Section 4.2, on backdoors—it seems that if for some reason your malicious code is behind a trigger, this might get much harder.
What links here?
- Realistic Reward Hacking Induces Different and Deeper Misalignment by Jozdien (9 Oct 2025 18:45 UTC; 146 points)
- deep 25 Feb 2025 21:30 UTC
  5 points
  0
  Parent
  Thanks, that’s cool to hear about!
  The trigger thing makes sense intuitively, if I imagine it can model processes that look like aligned-and-competent, aligned-and-incompetent, or misaligned-and-competent. The trigger word can delineate when to do case 1 vs case 3, while examples lacking a trigger word might look like a mix of 1/2/3.
- Linch 26 Feb 2025 22:04 UTC
  2 points
  2
  Parent
  I run a quick low-effort experiment with 50% secure code and 50% insecure code some time ago and I’m pretty sure this led to no emergent misalignment.
  Woah, I absolutely would not have predicted this given the rest of your results!