Jan Betley comments on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley 25 Feb 2025 21:28 UTC
10 points
4
Doesn’t sound silly!

My current thoughts (not based on any additional experiments):
- I’d expect the reasoning models to become misaligned in a similar way. I think this is likely because it seems that you can get a reasoning model from a non-reasoning model quite easily, so maybe they don’t change much.
- BUT maybe they can recover in their CoT somehow? This would be interesting to see.
- Dan Ryan 28 Feb 2025 1:53 UTC
  3 points
  0
  Parent
  I would love to see what is happening in the CoT of an insecure reasoning model (if this approach works). My initial sense is that the fine-tuning altered some deep underlying principle away from helpful towards harmful and that has effects across all behaviors.
- the-hightech-creative 27 Feb 2025 23:53 UTC
  1 point
  0
  Parent
  If part of the rationale behind reasoning models is an attempt to catch inaccurate predictions (hallucinations, mistaken assumptions) and self-correct before giving a final answer to a user, it might be interesting to see if this process can self-correct alignment failings too.
  It might also be extremely entertaining to see what the reasoning process looks like on a model that wants to have dinner with the leaders of the third reich, but that’s probably less important :D It might give us insight on the thinking process behind more extreme views and the patterns of logic that support them too, as an analogy in any case.