In (1), misleading reasoning (false reasoning used to support an answer, e.g. arguing that running `rm -rf /` is good), we suspect the cause is the SFT generalizing: it teaches the model how to write misleading content in its reasoning. Maybe that fits with the pseudo reasoning paths you are talking about.
For (2), overt deceptive reasoning, like the model talking about "lying to the user" out loud, it seems quite different? There the model plans to deceive the user out loud in its CoT. Or the case of the model discussing backdoors.
I suspect we'll get less misleading reasoning and more of (2), overt deceptive reasoning, with RL (assuming you don't put optimization pressure on the CoT).
thanks for reading!