However, this seems hard to achieve without optimizing the reasoning model’s CoT as part of character training. The case against directly optimizing CoTs looks strong: we probably shouldn’t be too confident in the robustness of our character-training approaches for now, so it seems much better to train LLMs to babble freely in their CoTs than to train them to maintain a coherent character across both CoTs and final outputs, which might give the CoT character an incentive to cover up misbehavior in the final answer.
I think we should monitor for CoT and action divergence during training, and investigate if it starts to rise, but be extremely cautious about providing direct training feedback based on it.
In particular, I think we should distinguish, or have judges distinguish, between:
a) the model considered doing something for a bad reason, rejected it, and did something else (which is fine)
from:
b) the model considered doing something for a bad reason, noticed that this was unacceptable, and then found a way to justify doing the same or a very similar thing for a more acceptable reason (which is not).
If the rate of pattern (b) rises concerningly, we should look at which reasoning-training environments it’s happening in, and whether they or their scoring need redesign. If so, we should fix them and restart training from a checkpoint before the problem arose.
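The monitoring step above can be sketched concretely. The following is a minimal, hypothetical example, not an existing tool: it assumes judges emit one of the two labels distinguished above per transcript, along with the training environment that produced it, and it flags environments where the rate of pattern (b) crosses an illustrative threshold.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical judge labels mirroring the (a)/(b) distinction above.
# The names and the threshold below are illustrative assumptions.
REJECTED_BAD_IDEA = "rejected_bad_idea"              # case (a): considered, rejected, did something else
LAUNDERED_JUSTIFICATION = "laundered_justification"  # case (b): re-justified doing the same thing

@dataclass
class JudgedTranscript:
    environment: str  # which reasoning-training environment produced it
    label: str        # the judge's classification

def flag_environments(transcripts, threshold=0.05):
    """Return environments where the rate of case (b) exceeds `threshold`."""
    totals, laundered = Counter(), Counter()
    for t in transcripts:
        totals[t.environment] += 1
        if t.label == LAUNDERED_JUSTIFICATION:
            laundered[t.environment] += 1
    return {env for env in totals
            if laundered[env] / totals[env] > threshold}
```

The point of aggregating per environment rather than overall is that the proposed remedy is environment-level (redesign the environment or its scoring, then restart training), so the monitor should localize where pattern (b) is concentrated rather than just report a global rate.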