I remember reading that SFT can undermine subsequent RL by inducing pseudo reasoning paths imitated from expert models (at least in Large Vision-Language Models). Do you think these results could be attributed to this behavior, or would the results be the same if only RL was used?
Thanks for reading!
For (1), misleading reasoning (false reasoning to support an answer, e.g. claiming that running rm -rf / is good): we suspect it is due to the SFT teaching the model how to write misleading reasoning, which then generalizes. Maybe that fits with the pseudo reasoning paths you are talking about.
For (2), overt deceptive reasoning, like the model talking about “lying to the user” out loud, it seems quite different: there the model explicitly plans to deceive the user in its CoT, as in the case of the model discussing backdoors.
I suspect we’d get less misleading reasoning and more of (2), overt deceptive reasoning, with RL alone (assuming you don’t put optimization pressure on the CoT).