I feel like I’m missing something—the model doesn’t “realize” that its CoT is monitored and then “decides” whether to put something in it or not, the CoT is generated and then used to produce the final answer. Since the standard practice is not to use CoT output for evaluation during training, there’s no direct optimization pressure to present an aligned CoT.
The only way I can see this working is through indirect optimization pressure—e.g. if misaligned CoT leads to misaligned output, which causes the CoT to also present as aligned
I feel like I’m missing something—the model doesn’t “realize” that its CoT is monitored and then “decides” whether to put something in it or not, the CoT is generated and then used to produce the final answer. Since the standard practice is not to use CoT output for evaluation during training, there’s no direct optimization pressure to present an aligned CoT.
The only way I can see this working is through indirect optimization pressure—e.g. if misaligned CoT leads to misaligned output, which causes the CoT to also present as aligned